In a previous blog article, we took a theoretical look at the main data science and machine learning libraries that data scientists use to process data efficiently and extract its full potential.
In this post, we will put that theory into practice through simple examples that show how to work with these libraries, with special emphasis on what they offer to anyone interested in the world of machine learning.
Machine Learning with Python
Python is one of the most widely used programming languages today, with an upward trend in its use over the last few years. This is clearly reflected in the TIOBE index, the annual ranking commonly used as a reference for the popularity of programming languages.
This index measures the popularity of programming languages based on factors such as the number of engineers worldwide, available courses, and third-party vendors. All the information used to calculate it comes from the most popular search engines (Google, Bing, Yahoo, Baidu…), which means the index does not indicate which language is the best or which has the most lines of code written, but simply how popular each one is.
Python’s popularity is due to the many advantages it has over its competitors, starting with the fact that it is an open source programming language: there are practically no limitations on its use, and anyone can help improve it by becoming part of its large community.
In addition, Python has a simple, readable and easy-to-understand syntax, which makes it suitable both for users who are just starting to program and for those looking to save time and resources by developing software efficiently. All this, added to the large number of libraries, frameworks and paradigms that Python offers, makes it one of the most versatile programming languages in existence: a general-purpose language suitable for the vast majority of users regardless of their specialities or interests.
Finally, being open source and so popular means that Python has a large, solid community willing to help other users with their questions, and that the Python ecosystem continues to grow and improve for everyone.
These advantages, combined with the existence of a large number of libraries, tools and frameworks suitable for data processing, make Python the ideal programming language for carrying out tasks related to the field of machine learning and, in general, the field of artificial intelligence and data science.
Most popular libraries for Machine Learning in Python
Pandas is the reference library in Python for data manipulation and analysis. It is an open source library committed to flexible, efficient analysis and management, available to everyone, of one of a company's most valuable resources: its data.
To carry out this task, Pandas offers two key data structures: the Series, a one-dimensional structure, and the DataFrame, a two-dimensional structure represented as a table. Together, they allow most datasets to be managed in an intuitive and simple way.
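As a minimal sketch of these two structures (the names and values below are invented for illustration):

```python
import pandas as pd

# A Series is a one-dimensional labelled array
ages = pd.Series([22, 38, 26], name="Age")

# A DataFrame is a two-dimensional table; each of its columns behaves like a Series
people = pd.DataFrame({
    "Name": ["Braund", "Cumings", "Heikkinen"],
    "Age": [22, 38, 26],
})
print(people)
```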
Some of the most common and relevant utilities of this library are the following:
- Thanks to the DataFrame as its main data structure, it supports fast operations such as grouping, merging, iterating over and indexing data.
- It provides tools for reading and writing data, handling different data structures and formats.
- It supports the handling of missing values.
- It allows chains of operations on data to be executed in a vectorized way.
- It contains tools for processing and transforming time series.
- It offers a unified representation of the data that allows simple integration with other Python libraries.
Below is a code example showing simple use of the Pandas library. We first read a dataset in CSV format in order to visualize it, then obtain and display the subset of Titanic passengers who survived and meet certain conditions.
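Since the original notebook cells and the Titanic CSV file are not reproduced here, the sketch below builds a tiny, made-up version of the dataset inline (the rows and column names are illustrative, not real Titanic data) and applies the same kind of filtering:

```python
from io import StringIO

import pandas as pd

# Stand-in for reading a CSV file from disk; the rows are invented for illustration
csv_data = StringIO(
    """PassengerId,Name,Sex,Age,Pclass,Survived
1,Braund,male,22,3,0
2,Cumings,female,38,1,1
3,Heikkinen,female,26,3,1
4,Futrelle,female,35,1,1
5,Allen,male,35,3,0
"""
)

df = pd.read_csv(csv_data)
print(df.head())

# Subset: passengers who survived, travelled in first class, and were over 30
survivors = df[(df["Survived"] == 1) & (df["Pclass"] == 1) & (df["Age"] > 30)]
print(survivors)
```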
NumPy (Numerical Python) is another library that cannot be missing from your arsenal if you intend to tackle data science problems, as it offers essential tools for scientific computing and array-oriented numerical computation.
This library introduces a new kind of object, the array (ndarray): an N-dimensional data structure designed for efficient handling. The need for this representation arises because manipulating large native Python lists is very inefficient, largely because of the flexibility they provide as data structures: they can be resized dynamically and can hold data of different types in the same structure.
NumPy arrays, on the other hand, allow compact storage and efficient handling thanks to an internal binary representation in which the number of bytes associated with each element is fixed (all the data in an array are of the same type) and only the values themselves are stored, with no added information or metadata.
Another of NumPy's contributions with respect to Python is its support for invoking functions in a vectorized way, which enables optimized calculations by eliminating loops and avoiding sequential access to each element of the array.
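As an illustration of this point, the two computations below produce the same result; the vectorized form replaces the explicit Python loop with a single universal-function (ufunc) call. The array and the operation are arbitrary examples:

```python
import numpy as np

x = np.arange(10_000, dtype=np.float64)

# Vectorized: a single ufunc call processes the whole array at once
y_vec = np.sqrt(x) * 2.0

# Equivalent explicit Python loop: same result, but far slower on large
# arrays, because every element is accessed and converted one at a time
y_loop = np.empty_like(x)
for i in range(x.size):
    y_loop[i] = (x[i] ** 0.5) * 2.0
```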
In addition to all this, NumPy offers tools and utilities such as the following:
- Simple and efficient indexing and filtering of arrays, making it possible to obtain new arrays containing the elements of the original that meet a certain condition.
- Simple, intuitive and vectorized mathematical operations between arrays, using classic operators such as +, -, *, /, …
- Algebraic operations with vectors and matrices, such as the scalar product of two vectors, the product of two matrices, the transpose of a matrix, the calculation of its determinant, its eigenvectors…
- Built-in methods for generating random data matrices according to different probability distributions.
As with the Pandas library, here is a code example that shows some of the functionality provided by NumPy, from defining arrays to using them to perform different mathematical and algebraic operations.
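A minimal sketch of those operations (the values are chosen arbitrarily for illustration) might look like this:

```python
import numpy as np

# Array definition
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# Vectorized element-wise arithmetic with the classic operators
s = a + b                     # [5. 7. 9.]
p = a * b                     # [ 4. 10. 18.]

# Algebraic operations
dot = np.dot(a, b)            # scalar product: 32.0
M = np.array([[1.0, 2.0],
              [3.0, 4.0]])
Mt = M.T                      # transpose of M
det = np.linalg.det(M)        # determinant: -2.0
eigvals, eigvecs = np.linalg.eig(M)   # eigenvalues and eigenvectors

# Boolean indexing: elements of `a` greater than 1
filtered = a[a > 1]           # [2. 3.]

# Random 3x3 matrix drawn from a standard normal distribution
rng = np.random.default_rng(seed=0)
R = rng.normal(size=(3, 3))
```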
The third library we will review in this post is Scikit-learn, one of the most widely used free software libraries for machine learning tasks. This is largely due to the number of classification, regression and clustering algorithms it supports in a unified way, allowing them to be trained and evaluated easily, with a very simple syntax that abstracts away the implementation details of these algorithms.
In addition to this, it also provides methods for pre-processing and transforming the data prior to training the algorithms, which, together with the variety of models, allows for fast and efficient experimentation.
Some of the algorithms and methods offered by Scikit-learn are the following:
- Supervised learning: decision trees, support vector machines (SVM), neural networks, K-nearest neighbours, naive Bayes, linear regression, logistic regression, random forest, gradient boosting…
- Unsupervised learning: K-means, DBSCAN, BIRCH, agglomerative clustering, OPTICS…
- Preprocessing and transformation: feature selection, missing value imputation, label encoder, one hot encoder, normalization, dimensionality reduction…
In addition to all this, this library supports the definition of pipelines, which allow sets of processing steps and models to be chained and managed as a group, keeping the development process coherent and consistent from start to finish. Finally, regarding model evaluation, Scikit-learn implements the most commonly used metrics for most machine learning tasks, and these integrate simply with model selection via hyperparameter search.
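As an illustrative sketch of these two features, here a pipeline combining a scaler and a classifier is tuned with a grid search. The dataset, classifier and parameter grid are arbitrary choices for the example, not taken from the original post:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# A pipeline keeps preprocessing and the model together as a single estimator,
# so the scaler is re-fit on each training fold during cross-validation
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hyperparameter search over the whole pipeline; step parameters are
# addressed as "<step name>__<parameter>"
grid = GridSearchCV(pipe, param_grid={"clf__C": [0.1, 1.0, 10.0]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```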
An example of how to instantiate, train and evaluate a machine learning model using only Scikit-learn can be visualized in the following code cell. In it, an example dataset is imported and split into training and evaluation, the data is normalized using standard scaling, a random forest is trained using only the training set, and finally the model is evaluated using the mean squared error over the evaluation set as a metric.
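Since the original cell is not reproduced here, the following sketch follows the same steps using the diabetes dataset bundled with scikit-learn (an assumption; the post does not name its exact dataset):

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load an example dataset and split it into training and evaluation sets
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Normalize the features with standard scaling (fit only on the training set)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train a random forest using only the training set
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate with mean squared error over the evaluation set
mse = mean_squared_error(y_test, model.predict(X_test))
print(f"MSE: {mse:.2f}")
```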
Matplotlib, unlike the libraries mentioned above, specializes in creating two-dimensional graphs for data visualization; it is built on top of NumPy arrays and integrates easily with a large number of Python libraries such as Pandas.
Some of the chart types supported by Matplotlib are the following:
- Bar chart
- Pie chart
- Box-and-whisker plot
- Violin plot
- Scatter plot
- Area chart
- Line chart
- Heat map
The gallery on the official Matplotlib website shows many examples of all these chart types and more, along with the Python code that generates them. In addition to those examples, this post includes a code example that creates a simple scatter plot with Matplotlib, using the same example dataset as in the previous section.
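A sketch of such a scatter plot, assuming the scikit-learn diabetes dataset stands in for the post's example data (the Agg backend is used so the script runs without a display):

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend: render to a file, no window needed
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes

# Plot one feature (BMI, already normalized in this dataset) against the target
data = load_diabetes()
bmi = data.data[:, 2]

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(bmi, data.target, alpha=0.5)
ax.set_xlabel("BMI (normalized)")
ax.set_ylabel("Disease progression")
ax.set_title("Diabetes dataset: BMI vs target")
fig.savefig("scatter.png")
```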
Finally, it should be noted that there are other plotting libraries built on top of Matplotlib, such as Seaborn, which simplify its use and extend it to create good-quality visualisations with as little intervention as possible.
TensorFlow / Keras
We can define TensorFlow as a high-performance, open-source library for numerical computation that uses directed acyclic graphs (DAGs) to represent operations and the order in which they follow one another in complex processes.
Tensors are mathematical objects that act as a generalization of scalar numbers, vectors and matrices of different dimensions and are the building block for representing the data in this library, a concept from which TensorFlow derives part of its name. This library was developed by Google and subsequently released as open source in 2015, from which point it was widely adopted by industry and the research community due to the wide range of advantages it offers when tackling machine learning problems.
Using graphs to represent computations makes those operations completely independent of the underlying code, and therefore portable between devices and environments. For example, you can develop a model in Python, store it in an intermediate format, and then load it into a C++ program on another device to run it in a more optimized way. Likewise, virtually the same code can run a set of operations on the CPU alone, on a graphics card (GPU) to accelerate the computations, or even on multiple GPUs together.
All these advantages make TensorFlow a perfect platform for developing machine learning models, especially for building and training neural networks. This is where Keras comes in: an API designed to work with TensorFlow that offers intuitive high-level abstractions aimed at greatly simplifying the development of this kind of algorithm, minimizing the number of user actions required for the most common use cases. Keras is not only able to run on top of TensorFlow; it can also use other low-level libraries such as Theano or the Microsoft Cognitive Toolkit (CNTK).
TensorFlow's use is not limited to developing machine learning models; it also supports all kinds of tasks related to that development, such as the following:
- Data preparation, from data loading to pre-processing prior to model input.
- Evaluation and monitoring of model output in a precise and simple way.
- Running models on mobile devices and embedded systems with TensorFlow Lite.
- Deployment of models in production environments, helping to implement best practices for applying an MLOps methodology.
- Exploration and use of models previously trained by the community using TensorFlow Hub.
The following example shows a use case in which Keras is used with TensorFlow as a backend to define a neural network that solves the same problem previously addressed with Scikit-learn. First, a simple multilayer neural network is defined; it is then trained for 10 epochs on the dataset already used, and finally evaluated using Keras's own evaluation method.
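A sketch of that workflow, again assuming the scikit-learn diabetes dataset stands in for the one used in the post (the layer sizes are arbitrary choices):

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow import keras

# Same regression problem as in the Scikit-learn example
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# A simple multilayer network defined with the Keras Sequential API
model = keras.Sequential([
    keras.Input(shape=(X_train.shape[1],)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Train for 10 epochs, then evaluate with Keras's own evaluate() method
model.fit(X_train, y_train, epochs=10, batch_size=32, verbose=0)
mse = model.evaluate(X_test, y_test, verbose=0)
print(f"Test MSE: {mse:.2f}")
```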
The last library we will describe in this post is SciPy, an open source library that offers a large number of mathematical tools and algorithms for solving problems in optimization, interpolation, algebraic equations, differential equations and statistics, among others. It is built to work with NumPy arrays and offers a simple interface for tackling mathematical problems that is free, powerful, easy to install, and available on the most common operating systems.
One of SciPy's great advantages is its speed, as it relies on highly optimized implementations written in low-level languages such as Fortran, C or C++. This lets you benefit from the flexibility Python provides as a programming language without giving up the execution speed of compiled code needed for this type of task. On top of this, SciPy offers a high-level syntax that is easy to use and interpret, making it accessible and productive for novice programmers and professionals alike.
The following code cell represents a simple SciPy use case in which the roots of a second-degree equation are to be found. It uses the root function provided by scipy.optimize, which can obtain roots of both linear and non-linear equations.
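A sketch of that use case with an arbitrary quadratic, x^2 - 5x + 6 = 0, whose roots are 2 and 3 (the equation and the initial guesses are chosen for illustration):

```python
from scipy.optimize import root

# Quadratic whose roots we want: x^2 - 5x + 6 = 0 (roots at x = 2 and x = 3)
def f(x):
    return x**2 - 5 * x + 6

# root() iterates from each initial guess until it converges to a solution
sol1 = root(f, x0=0.0)
sol2 = root(f, x0=10.0)
print(sol1.x, sol2.x)
```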
Python is one of the programming languages of the moment, standing out among other reasons for the large number of libraries, frameworks and paradigms it offers. In this post we have reviewed in detail the functionalities provided by some of the most widely used Python libraries in the field of data processing, highlighting their practical applicability and showing some simple code examples to familiarize the reader with these tools.