To analyze your data and carry out any preprocessing or dimensionality reduction, it is important to properly normalize, standardize and rescale it. This tutorial walks through each technique.
Tutorial on normalizing, standardizing and rescaling your data
Before we dive into this topic, let's start with some definitions.
“Rescaling” a vector means adding or subtracting a constant, then multiplying or dividing by a constant, just as you would when changing data measurement units, for example, to convert a temperature from Celsius to Fahrenheit.
“Normalizing” a vector most often means dividing by a norm of the vector. It also often refers to rescaling by the minimum and range of the vector, so that all elements fall between 0 and 1, thus bringing all the values of the numeric columns of the dataset to a common scale.
“Standardizing” a vector most often means subtracting a location measure and dividing by a scale measure. For example, if the vector contains random values with a Gaussian distribution, you can subtract the mean and divide by the standard deviation, thus obtaining a "standard normal" random variable with a mean of 0 and a standard deviation of 1.
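To make these three definitions concrete, here is a minimal NumPy sketch; the vector and its values are made up purely for illustration.

import numpy as np

x = np.array([12.0, 18.5, 25.0, 31.5])  # made-up temperatures in degrees Celsius

# Rescaling: add/subtract and multiply/divide by constants (Celsius to Fahrenheit)
x_fahrenheit = x * 9 / 5 + 32

# Normalizing by a norm: divide by the Euclidean (L2) norm of the vector
x_unit = x / np.linalg.norm(x)

# Normalizing by minimum and range: all elements fall between 0 and 1
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardizing: subtract the mean, divide by the standard deviation (mean 0, std 1)
x_standard = (x - x.mean()) / x.std()

print(x_fahrenheit, x_unit, x_minmax, x_standard, sep='\n')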
Why do it?
Standardization:
Standardizing features so that they are centered around 0 with a standard deviation of 1 is important when comparing measurements that have different units. Variables measured at different scales do not contribute equally to the analysis and can end up creating a bias.
For example, a variable between 0 and 1000 will outweigh a variable between 0 and 1. Using these variables without standardization gives the variable with the 1,000-fold wider range a much larger weight in the analysis. Transforming the data to comparable scales avoids this problem. Typical data normalization procedures equalize the range and/or variability of the data.
Normalization:
Similarly, the purpose of normalization is to change the values of the numeric columns of the dataset to a common scale, without distorting the differences in the ranges of values. In machine learning, not every dataset requires normalization; it is needed only when features have different ranges.
For example, consider a dataset containing two features, age and income. Age ranges from 0 to 100, while income ranges from 0 to 100,000 and above, so income is about 1,000 times larger than age and the two features sit in very different ranges. When we run a deeper analysis, such as a multivariate linear regression, income will inherently influence the result more because of its larger values, but that does not necessarily make it a more important predictor. So we normalize the data to bring all the variables into the same range.
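To illustrate how the wider range takes over, here is a small sketch comparing two made-up people with a plain Euclidean distance, the kind of computation a distance-based method such as k-nearest neighbors relies on; the ages, incomes and the 0-100 / 0-100,000 divisors are assumptions for the example.

import numpy as np

# Two made-up observations: (age, income)
a = np.array([25.0, 50_000.0])
b = np.array([60.0, 51_000.0])

# Without scaling, the income difference (1,000) swamps the age difference (35)
print(np.linalg.norm(a - b))  # about 1000.6

# After dividing each feature by its assumed range (age / 100, income / 100,000),
# both features contribute on comparable terms
ranges = np.array([100.0, 100_000.0])
print(np.linalg.norm(a / ranges - b / ranges))  # about 0.35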
When to do it?
Normalization is a good technique to use when you don't know the distribution of your data or when you know the distribution is not Gaussian (a bell curve). Normalization is useful when your data has varying scales and the algorithm you are using does not make assumptions about the distribution of your data, such as k-nearest neighbors and artificial neural networks.
Standardization assumes that your data has a Gaussian distribution (bell curve). This does not strictly have to be true, but the technique is more effective if your attribute distribution is Gaussian. Standardization is useful when your data has varying scales and the algorithm you are using makes assumptions about your data having a Gaussian distribution, such as linear regression, logistic regression, and linear discriminant analysis.
Standardization
As we saw earlier, standardization (or Z-score normalization) means centering the variable at zero and setting its variance to 1. The procedure involves subtracting the mean from each observation and then dividing by the standard deviation.
The result of standardization is that the features are rescaled so that they have the properties of a standard normal distribution with
μ = 0 and σ = 1
where μ is the mean and σ is the standard deviation. Each value is therefore transformed as
z = (x - μ) / σ
StandardScaler from scikit-learn removes the mean and scales the data to unit variance. We can import StandardScaler from scikit-learn and apply it to our dataset.
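The snippets below assume a variable named data that holds the three numeric columns referenced in the print statements (Loan Amount, Int rate, Installment), typically loaded from a loan dataset. If you want to run them without that dataset, a small stand-in DataFrame with invented values works just as well:

import pandas as pd

# Hypothetical stand-in for the loan dataset; the column names match the print
# statements below, the values are invented for illustration only
data = pd.DataFrame({
    'Loan Amount': [5000, 12000, 20000, 35000, 8000],
    'Int rate': [7.5, 11.2, 13.9, 18.4, 9.9],
    'Installment': [156.0, 395.0, 680.0, 1270.0, 260.0],
})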
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)
Now let's check the mean and standard deviation values.
print(data_scaled.mean(axis=0))
print(data_scaled.std(axis=0))
As expected, the mean of each variable is now around zero and the standard deviation is set to 1. So all the values of the variables are in the same range.
print('Min values (Loan Amount, Int rate and Installment): ', data_scaled.min(axis=0))
print('Max values (Loan Amount, Int rate and Installment): ', data_scaled.max(axis=0))
However, the minimum and maximum values vary depending on the initial spread of the variable and are strongly influenced by the presence of outliers.
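A quick sketch makes this visible: standardize an invented feature that contains a single large outlier and the minimum and maximum land wherever the outlier pushes them.

import numpy as np
from sklearn.preprocessing import StandardScaler

# One invented feature with a single extreme value
x = np.array([[10.0], [12.0], [11.0], [13.0], [500.0]])
x_scaled = StandardScaler().fit_transform(x)

# The mean is 0 and the std is 1, but the bounds are dictated by the outlier
print(x_scaled.min(), x_scaled.max())  # roughly -0.51 and 2.0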
Normalization (min-max scaling)
In this approach, the data is scaled in a fixed range — typically 0 to 1.
Unlike standardization, the cost of having this bounded range is that we end up with smaller standard deviations, which can suppress the effect of outliers on the spread. At the same time, because the minimum and maximum are set by the most extreme observations, MinMaxScaler is sensitive to outliers.
Min-Max scaling is usually done via the following equation:
X_scaled = (X - X.min) / (X.max - X.min)
Let's import MinMaxScaler from scikit-learn and apply it to our dataset.
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
data_scaled = scaler.fit_transform(data)
Now let's check the mean and standard deviation values.
print('means (Loan Amount, Int rate and Installment): ', data_scaled.mean(axis=0))
print('std (Loan Amount, Int rate and Installment): ', data_scaled.std(axis=0))
After MinMaxScaling, the distributions are not centered on zero and the standard deviation is not 1.
print('Min (Loan Amount, Int rate and Installment): ', data_scaled.min(axis=0))
print('Max (Loan Amount, Int rate and Installment): ', data_scaled.max(axis=0))
But the minimum and maximum values are now the same across the variables (0 and 1), unlike what happens with standardization.
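As a sanity check, the min-max formula above can be applied by hand and compared with the MinMaxScaler output (this assumes data is a pandas DataFrame, as in the stand-in example earlier):

import numpy as np

# Manual min-max scaling, column by column: (X - X.min) / (X.max - X.min)
manual = (data - data.min()) / (data.max() - data.min())

# Should match the MinMaxScaler result computed above
print(np.allclose(manual.values, data_scaled))  # expected: True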
Robust scaling
Scaling using the median and quantiles consists of subtracting the median from all observations and then dividing by the interquartile range (IQR). It scales features using statistics that are robust to outliers.
The interquartile range is the difference between the 75th and the 25th quantile:
IQR = 75th quantile — 25th quantile
The equation to calculate the scaled values:
X_scaled = (X - X.median) / IQR
First, import RobustScaler from scikit-learn.
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
data_scaled = scaler.fit_transform(data)
Now check the mean and standard deviation values.
print('means (Loan Amount, Int rate and Installment): ', data_scaled.mean(axis=0))
print('std (Loan Amount, Int rate and Installment): ', data_scaled.std(axis=0))
As you can see, the distributions are not centered on zero and the standard deviation is not 1.
print('Min (Loan Amount, Int rate and Installment): ', data_scaled.min(axis=0))
print('Max (Loan Amount, Int rate and Installment): ', data_scaled.max(axis=0))
The min and max values are also not set to some upper and lower bounds like in MinMaxScaler.
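The same kind of sanity check applies here: subtracting the median and dividing by the IQR by hand should mirror the RobustScaler output (again assuming data is a pandas DataFrame; the default quantile interpolation should line up, though tiny numeric differences are possible).

import numpy as np

# Manual robust scaling, column by column: (X - X.median) / IQR
iqr = data.quantile(0.75) - data.quantile(0.25)
manual = (data - data.median()) / iqr

# Should match the RobustScaler result computed above
print(np.allclose(manual.values, data_scaled))  # expected: True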