Machine Learning in Python

December 18, 2018

Pre-processing the data before applying a Machine Learning Algorithm

Rescale data
Standardize data
Normalize data
Binarize data

Rescale Data:

Rescaling puts all the attributes on the same scale. Attributes are most often rescaled into the range between 0 and 1, which helps optimization-based algorithms and algorithms that rely on distances between observations.

We can rescale the data in scikit-learn with the MinMaxScaler class.
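A minimal sketch of rescaling with MinMaxScaler follows. The small array X_raw is made-up illustration data, not taken from a real dataset, and the feature_range shown is simply the class default written out explicitly.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: 3 observations (rows), 2 attributes (columns)
# measured on very different scales.
X_raw = np.array([[1.0, 200.0],
                  [2.0, 400.0],
                  [3.0, 800.0]])

# Map every attribute (column) into the range [0, 1].
scaler = MinMaxScaler(feature_range=(0, 1))
X_rescaled = scaler.fit_transform(X_raw)

print(X_rescaled)
```

After the transform, each column has a minimum of 0 and a maximum of 1. In practice the scaler would usually be fit on the training data only and then applied to the test data with transform.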

Standardize Data:

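Standardization transforms each attribute so that it has mean 0 and standard deviation 1, which helps many optimization-based algorithms. It shifts and scales the values but does not change the shape of their distribution, so it is best suited to attributes that are already roughly Gaussian.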

We can standardize the data in scikit-learn with the StandardScaler class.
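As a rough illustration, standardizing with StandardScaler might look like the sketch below. The array X_input is made-up example data, and the scaler is used with its default settings.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 4 observations (rows), 2 attributes (columns).
X_input = np.array([[1.0, 10.0],
                    [2.0, 20.0],
                    [3.0, 30.0],
                    [4.0, 60.0]])

# Subtract each column's mean and divide by its standard deviation.
std_scaler = StandardScaler()
X_standardized = std_scaler.fit_transform(X_input)

print(X_standardized.mean(axis=0))  # approximately 0 for each column
print(X_standardized.std(axis=0))   # approximately 1 for each column
```

The fitted scaler keeps the per-column means in std_scaler.mean_, so the same shift and scale can be applied to new data with transform.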

Normalize Data:

Normalizing refers to rescaling each observation (row) to have a length of 1 (unit norm).
We can normalize data in scikit-learn with the Normalizer class. This pre-processing is often useful for sparse datasets (those in which many attribute values are zero).
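The sketch below shows one way to normalize rows with the Normalizer class. The array X_rows is made-up example data, and the choice of the L2 norm is an assumption for illustration (it is also the class default).

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical data: 3 observations (rows), 2 attributes (columns).
X_rows = np.array([[3.0, 4.0],
                   [1.0, 0.0],
                   [0.0, 5.0]])

# Scale each row independently so that its Euclidean (L2) length is 1.
normalizer = Normalizer(norm="l2")
X_normalized = normalizer.fit_transform(X_rows)

print(X_normalized)
# The first row [3, 4] has length 5, so it becomes [0.6, 0.8].
```

Unlike rescaling and standardizing, which work column by column, this transform works row by row, so fitting the normalizer does not learn anything from the data.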

Binarize Data:

We can transform the data using a binary threshold. All values above the threshold are marked as 1, and all values equal to or below it are marked as 0.
This is useful for producing crisp values when the feature values are probabilities.
We can binarize data in scikit-learn with the Binarizer class.
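A minimal sketch of thresholding with the Binarizer class follows. The array X_probs is made-up example data, and the threshold of 0.5 is an assumed value chosen only for illustration.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical data: feature values expressed as probabilities.
X_probs = np.array([[0.2, 0.5],
                    [0.7, 0.9]])

# Values strictly above the threshold become 1;
# values equal to or below it become 0.
binarizer = Binarizer(threshold=0.5)
X_binary = binarizer.fit_transform(X_probs)

print(X_binary)
# [[0. 0.]
#  [1. 1.]]
```

Note that the value 0.5 is exactly equal to the threshold, so it is mapped to 0.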
