Data preprocessing is one of the most important steps in Machine Learning, and it cannot be avoided, especially when the data is in an unstructured form. In this post, I’ll discuss the different steps using Scikit-Learn and Pandas.

“I’m assuming that you have some basic knowledge of NumPy and Pandas. If you don’t, learn those topics first.”

What is Data Preprocessing?
At the end of the day, Machine Learning algorithms are just mathematical equations. That means they require numerical data. But the datasets we use contain different types of data in different formats.
For example, a dataset may contain genders or dates. These are not in numerical form, so we need to convert them into numbers. Or sometimes different features lie in completely different ranges, which can cause problems during training, so we need to bring them into a fixed range. The process of converting data into a numerical form that a Machine Learning algorithm can consume is known as Data Preprocessing.
Data Preprocessing is one of the most important steps in Machine Learning, and it is the first step in a Machine Learning pipeline.
Data Preprocessing consists of different parts. In each part, we apply some modification to our data so that the data becomes usable.

Why is it so important?
Now that we know what Data Preprocessing is, we can ask another question: can we skip Data Preprocessing? Is it really important?
The answer is a big no. Different datasets contain different data, in different formats and in different ranges. We cannot feed these types of data directly into a Machine Learning algorithm. We need some type of conversion mechanism, and that is Data Preprocessing.
That means Data Preprocessing is one of the most important steps in the Machine Learning pipeline, and we cannot avoid it.

Why Scikit-Learn and Pandas?
In my opinion, Data Preprocessing is very tedious work. I personally don’t like it. And honestly, it is hard too. So using a library for Data Preprocessing is actually a very good idea.

Scikit-Learn is one of the most popular Machine Learning libraries, developed and maintained by a large open-source community. Scikit-Learn supports different types of Machine Learning algorithms, including SVMs, Random Forests, Neural Networks, etc. And of course, it supports different types of Data Preprocessing methods.
Scikit-Learn is a very simple and easy-to-learn library. It provides very good documentation, and it is very stable as well.
On the other hand, Pandas is also a very popular and stable tool. Pandas is not only used for Data Preprocessing; it is also a very popular tool for Data Analysis.
Handling Missing Data
Missing data is very common in Machine Learning. It means your dataset does not contain a value for a certain feature in a specific row. Almost all datasets come with some missing values.
We know that Machine Learning algorithms are just math equations. That means we cannot feed these empty (missing) values into those algorithms.
There are two very commonly used methods for Handling Missing Data.
1. Removing Data
2. Imputation

Removing Data
In many cases, the solution is simply removing the affected rows. If we use Pandas, this is very easy: we just need one Pandas method, called dropna().
Let’s say we have a Pandas DataFrame df that contains some missing values. Then we can delete those rows with a single line of code.

new_df = df.dropna(axis=0)
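
For instance, here is a minimal, self-contained sketch (the toy DataFrame below is made up for illustration):

import numpy as np
import pandas as pd

# A toy DataFrame with a missing value in the second row
df = pd.DataFrame({'age': [25, np.nan, 31],
                   'salary': [50000, 62000, 48000]})

# dropna(axis=0) removes every row that has at least one missing value
new_df = df.dropna(axis=0)
# new_df keeps only the first and third rows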

But removing rows throws away real data along with the missing values, so this approach is not always the best solution. A better solution to this problem is often Imputation.
Imputation
Imputation is another very popular method for handling missing values. In Imputation, instead of deleting the rows, we fill the missing entries with estimated values. An imputed value is not the true value, but it is usually close to it.
Scikit-Learn provides methods for Imputation. We can use the SimpleImputer class for Imputation. Here is the code for that:

import numpy as np
from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
new_data = imp.fit_transform(data)

SimpleImputer supports different strategies for Imputation. In the above example, we used the mean strategy, which replaces each missing value with the mean of its column. We can also use the constant strategy, which replaces all the missing values with a fixed value of our choice.
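
For example, here is a short sketch using the constant strategy (the fill value 0 is just an illustration):

import numpy as np
from sklearn.impute import SimpleImputer

# Replace every missing value with the constant 0
imp = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=0)
new_data = imp.fit_transform(data)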

Feature Scaling
Feature Scaling is another very important step in Machine Learning. Most of the time, datasets contain features that are in completely different ranges and units. This causes problems in many Machine Learning algorithms.

But why does it cause problems?
Many Machine Learning algorithms use the Euclidean distance between points in their computations. So if different features are in different ranges or units, the feature with the larger magnitude will dominate the distance and weigh much more than the others. For example, if salary is measured in tens of thousands and age in tens, the distance between two points will be determined almost entirely by salary.

To solve these issues, we need to bring all the features into the same range of magnitudes. This can be achieved by scaling.

There are different ways to scale features. In this post, I’ll discuss the two most common scaling methods.

1. Mean Normalization
Mean Normalization rescales a feature so that its values fall between -1 and 1 with a mean of 0. The formula is (x - mean) / (max - min), applied per feature.
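
As far as I know, Scikit-Learn does not ship a dedicated Mean Normalization method, so here is a minimal NumPy sketch (X is assumed to be a numeric NumPy array with one feature per column):

import numpy as np

# Mean Normalization per column: (x - mean) / (max - min)
X_normalized = (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))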

2. Standardization
Standardization is one of the most popular methods for scaling features. It basically replaces each value with its Z-score, (x - mean) / standard deviation.
This method redistributes the features so that they have mean = 0 and standard deviation = 1.
Scikit-Learn provides a preprocessing module that contains different preprocessing methods including standardization.
Here is a simple code to demonstrate that.
from sklearn.preprocessing import scale

# Let's assume that we have a NumPy array X with some values
# and we want to standardize the values of the array
X_scaled = scale(X)

For more information about the scale method, read the Scikit-Learn documentation.
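
One note: scale transforms the data in a single shot. In a real training pipeline, the StandardScaler class is typically used instead, because it learns the mean and standard deviation from the training set and reuses the same statistics on the test set. A minimal sketch:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data
X_test_scaled = scaler.transform(X_test)        # apply the same statistics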

Data Encoding
We know that Machine Learning algorithms require data in numerical form. But many times, datasets contain some features in some other form, so we need to convert these values into numerical form.
Scikit-Learn provides different encoding methods for Data Encoding.

Label Encoding
To understand Label Encoding, let’s assume a dataset contains three columns: age, salary, and gender. In this dataset, the gender column is not in numerical form, which means we need to convert it into some type of numerical form.
To achieve that, we can use Label Encoding. We know that the gender column contains two unique values, Male and Female. If we apply Label Encoding to this column, it replaces the values with 0 and 1 (Female = 0 and Male = 1, since LabelEncoder assigns codes in alphabetical order).

from sklearn.preprocessing import LabelEncoder

labelencoder = LabelEncoder()
# fit_transform already returns a NumPy array of integer codes
x = labelencoder.fit_transform(x)
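
For example, on a made-up gender column:

gender = ['Male', 'Female', 'Female', 'Male']
encoded = LabelEncoder().fit_transform(gender)
# encoded is array([1, 0, 0, 1]) because the classes are sorted alphabetically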

One Hot Encoding
Label Encoding sometimes causes problems. By using Label Encoding, the Machine Learning algorithm may get confused and assume that the data has some type of hierarchical order (for example, that one category is “greater than” another). To avoid this, we can use One Hot Encoding.
One Hot Encoding takes a categorical column and splits it into multiple columns, one per category. The values are replaced by 1s and 0s: each row gets a 1 in the column for its category and 0s everywhere else.

from sklearn.preprocessing import OneHotEncoder

onehotencoder = OneHotEncoder()
# Encode the categorical column (here, the first column of x);
# fit_transform returns a sparse matrix, so convert it to a dense array
x_encoded = onehotencoder.fit_transform(x[:, [0]]).toarray()
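
The snippet above returns only the encoded columns. To one-hot encode one column while keeping the remaining columns in place, ColumnTransformer is the usual tool. A minimal sketch (column 0 is assumed to be the categorical one):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One-hot encode column 0, pass the remaining columns through unchanged
ct = ColumnTransformer([('onehot', OneHotEncoder(), [0])], remainder='passthrough')
x_encoded = ct.fit_transform(x)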

Splitting the Dataset
The last step that I want to discuss is Dataset Splitting. In this step, we split the dataset into multiple parts: generally, one for training and one for testing.
Scikit-Learn provides a very simple method for achieving this: the train_test_split method. Here is a simple code snippet that demonstrates how to use it.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=20)

Here I’m splitting the dataset in an 80/20 ratio, so 80% of the data is for training and 20% is for testing.
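
One practical note: for classification problems, train_test_split also accepts a stratify argument, which preserves the class proportions of y in both splits. A minimal sketch:

# Keep the class distribution of y identical in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=20, stratify=y)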
