Data cleaning

Feature selection, the process of finding and selecting the most useful features in a dataset, is a crucial step in the machine learning pipeline. Unnecessary features slow down training, reduce the interpretability of the model and, above all, hurt generalization performance on the test set. Removing them is therefore a key part of cleaning the data before modeling.


The FeatureSelector includes some of the most common feature selection methods:

  • Features with a high percentage of missing values
  • Collinear (highly correlated) features
  • Features with zero importance in a tree-based model
  • Features with low importance
  • Features with a single unique value

In this article, we'll walk through the use of FeatureSelector on an example machine learning dataset. We'll see how it lets us quickly apply these methods, making for a more efficient workflow.

The FeatureSelector offers five methods to find features to remove. We can access any identified features and drop them from the data manually, or use the built-in remove function.

Here we will go through each of the identification methods and also show how all 5 can be performed at the same time. The FeatureSelector additionally has several plotting capabilities, as visual inspection of data is a crucial part of machine learning.
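Before walking through the methods, here is a minimal setup sketch. The file name and label column are placeholders for whatever training data you are using; the FeatureSelector itself simply takes a dataframe of features and, optionally, the labels:

from feature_selector import FeatureSelector
import pandas as pd

# Hypothetical example data: any dataframe of features plus a label column
train = pd.read_csv('application_train.csv')   # placeholder file name
train_labels = train['TARGET']                 # placeholder label column
train = train.drop(columns = ['TARGET'])

fs = FeatureSelector(data = train, labels = train_labels)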

Missing values

The first way to find features to delete is simple: find features with a fraction of missing values above a specified threshold. The call below identifies features with more than 60% missing values (the output is shown below the call).

fs.identify_missing(missing_threshold = 0.6)
17 features with greater than 0.60 missing values.

We can see the fraction of missing values in each column of a dataframe:

fs.missing_stats.head()

To see the features flagged for removal, we access the ops attribute of the FeatureSelector, a Python dict with features listed in the values.

missing_features = fs.ops['missing']
missing_features[:5]
['OWN_CAR_AGE',
'YEARS_BUILD_AVG',
'COMMONAREA_AVG',
'FLOORSMIN_AVG',
'LIVINGAPARTMENTS_AVG']

Finally, we have a graph of the distribution of missing values across features:

fs.plot_missing()

Collinear features

Collinear features are features that are strongly correlated to each other. In machine learning, these lead to decreased generalization performance on the test set due to high variance and lower interpretability of the model.
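As a rough sketch of the underlying idea (this is not the library's exact implementation), highly correlated pairs can be found from the absolute correlation matrix of the example train dataframe introduced above:

import numpy as np

# Rough sketch: flag one feature from every pair whose absolute correlation
# exceeds the threshold
corr = train.select_dtypes('number').corr().abs()
# Keep only the upper triangle so each pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype = bool), k = 1))
to_drop = [col for col in upper.columns if (upper[col] > 0.98).any()]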

The identify_collinear method finds collinear features based on a specified correlation coefficient value. For each pair of correlated features, it flags one of the two for removal (since we only need to remove one):

fs.identify_collinear(correlation_threshold = 0.98)
21 features with a correlation magnitude greater than 0.98.

A neat visualization we can make with correlations is a heatmap. This shows all features that have at least one correlation above the threshold:

fs.plot_collinear()

As before, we can access the full list of correlated features that will be removed, or view the highly correlated feature pairs in a dataframe.

# list of collinear features to remove
collinear_features = fs.ops['collinear']
# dataframe of collinear features
fs.record_collinear.head()

If we want to investigate our dataset, we can also plot all the correlations in the data by passing plot_all = True to the call:
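# plot all correlations in the data, not just those above the threshold
fs.plot_collinear(plot_all = True)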


Zero importance features

The two previous methods can be applied to any set of structured data and are deterministic: the results will be the same each time for a given threshold. The following method is designed only for supervised machine learning problems where we have labels to train a model, and it is non-deterministic. The identify_zero_importance function finds features that have no importance according to a gradient boosting machine (GBM) model.

With tree-based machine learning models, such as a gradient boosting ensemble, we can compute feature importances. The absolute value of an importance is not as meaningful as the relative values, which we can use to determine the most relevant features for a task. We can also use feature importances for feature selection by removing features with zero importance. In a tree-based model, unimportant features are never used to split a node, so we can remove them without affecting model performance.

The FeatureSelector finds feature importances using the gradient boosting machine from the LightGBM library. Feature importances are averaged over 10 GBM training runs to reduce variance. The model is also trained with early stopping on a validation set (there is an option to turn this off) to avoid overfitting the training data.

The code below calls the method and extracts the features of zero importance:

# Pass in the appropriate parameters
fs.identify_zero_importance(task = 'classification',
                            eval_metric = 'auc',
                            n_iterations = 10,
                            early_stopping = True)
# list of zero importance features
zero_importance_features = fs.ops['zero_importance']
63 features with zero importance after one-hot encoding.

The parameters we pass are:

  • task: either 'classification' or 'regression', corresponding to our problem
  • eval_metric: metric to use for early stopping (not needed if early stopping is disabled)
  • n_iterations: number of training runs over which to average the feature importances
  • early_stopping: whether or not to use early stopping when training the model

This time we get two plots with plot_feature_importances:

# plot the feature importances
fs.plot_feature_importances(threshold=0.99, plot_n=12)
124 features required for 0.99 of cumulative importance

At the top we have the plot_n most important features (plotted in terms of normalized importance, where the total sums to 1).

At the bottom, we have the cumulative importance versus the number of features. The vertical line is drawn at the cumulative importance threshold, in this case 99%.
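The count printed above can be reproduced from the importances dataframe; this sketch assumes that fs.feature_importances contains a cumulative_importance column sorted from most to least important feature, which is how the plots are built:

import numpy as np

# Number of top features needed to reach 99% of the cumulative importance
fi = fs.feature_importances
n_needed = int(np.argmax(fi['cumulative_importance'].values >= 0.99)) + 1
print(f'{n_needed} features required for 0.99 of cumulative importance')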

Two notes are worth remembering for importance-based methods:

  • The training of the gradient boosting machine is stochastic, which means that the importance of the features will change each time the model is run

This shouldn't have a major impact (the most important features won't suddenly become the least important) but it will change the order of some features. It may also affect the number of unimportant features identified. Don't be surprised if the importance of features changes every time!

  • To train the machine learning model, the features are first one-hot encoded. This means that some of the features identified as having zero importance may be one-hot encoded features added during modeling.
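A quick, hypothetical check of this, relying only on the ops dict and the original train dataframe from the setup above:

# Which flagged features are one-hot encoded columns, i.e. not present
# in the original dataframe
one_hot_flagged = [f for f in fs.ops['zero_importance'] if f not in train.columns]
print(f'{len(one_hot_flagged)} of the zero-importance features come from one-hot encoding')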

Low importance features

The following method builds on the zero importance function, using the model's feature importances for further selection. The identify_low_importance function finds the lowest-importance features that are not needed to reach a specified fraction of the total importance.

For example, the call below finds the least important features that are not required to reach 99% of total importance:

fs.identify_low_importance(cumulative_importance = 0.99)
123 features required for cumulative importance of 0.99 after one hot encoding.
116 features do not contribute to cumulative importance of 0.99.

Based on the cumulative importance plot and this information, the gradient boosting machine considers many features to be irrelevant for learning. Again, the results of this method will change with each training run.

To display all feature importances in a dataframe:

fs.feature_importances.head(10)

The low importance method borrows from a common use of principal component analysis (PCA), where it is usual to keep only the principal components needed to retain a certain percentage of the variance (e.g., 95%). The percentage of total importance accounted for here is based on the same idea.
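For comparison, here is a minimal scikit-learn sketch of that PCA convention. The matrix X is just a toy stand-in for a purely numeric feature matrix and has nothing to do with the FeatureSelector itself:

import numpy as np
from sklearn.decomposition import PCA

# Toy numeric matrix standing in for a real feature matrix
X = np.random.randn(500, 50)

# Keep only enough principal components to explain 95% of the variance
pca = PCA(n_components = 0.95, svd_solver = 'full')
X_reduced = pca.fit_transform(X)
print(f'{pca.n_components_} components retain 95% of the variance')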

Feature importance based methods are only really applicable if we are going to use a tree-based model to make predictions. In addition to being stochastic, importance-based methods are a black-box approach in that we don't really know why the model considers certain features irrelevant. If you use these methods, run them multiple times to see how the results change, and perhaps create multiple datasets with different parameters to test!

Single unique value features

The last method is pretty basic: find all columns that have a single unique value. A feature with only one unique value cannot be useful for machine learning because this feature has zero variance. For example, a tree model can never split a feature with a single value (since there are no groups to split observations into).
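A roughly equivalent pandas check on the example train dataframe (this is just the idea, not the library's code):

# Columns where only one distinct value is observed (NaNs excluded by default)
single_value_cols = train.columns[train.nunique() == 1]
print(list(single_value_cols))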

There are no parameters to select here, unlike the other methods:

fs.identify_single_unique()
4 features with a single unique value.

We can plot a histogram of the number of unique values in each feature:

fs.plot_unique()

A point to remember is that, by default, pandas drops NaNs before counting unique values.
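A small illustration of that pandas default:

import pandas as pd

s = pd.Series([1.0, 1.0, None])
print(s.nunique())                # 1 -- the NaN is dropped before counting
print(s.nunique(dropna = False))  # 2 -- the NaN counts as a value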

Removing features

Once we identify the features to remove, we have two options for removing them. All features to be removed are stored in the ops dict of FeatureSelector, and we can use those lists to remove features manually. The other option is to use the built-in remove function.

For this method, we pass the methods to use to remove features. If we want to use all implemented methods, we just pass methods = 'all'.

# Remove the features from all methods (returns a df)
train_removed = fs.remove(methods = 'all')
['missing', 'single_unique', 'collinear', 'zero_importance', 'low_importance'] methods have been run

Removed 140 features.

This method returns a dataframe with the features removed. To also remove the one-hot encoded features created during modeling:

train_removed_all = fs.remove(methods = 'all', keep_one_hot=False)
Removed 187 features including one-hot features.

It may be a good idea to check which features will be removed before proceeding! The original dataset is stored in the FeatureSelector's data attribute as a backup!
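One way to preview everything flagged so far, relying only on the ops dict described earlier:

# Union of the features flagged by every method that has been run
all_to_remove = set()
for method, features in fs.ops.items():
    all_to_remove.update(features)
print(f'{len(all_to_remove)} total features flagged for removal')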

Data cleaning pipeline

Rather than using the methods individually, we can use them all with identify_all. This takes a dictionary of parameters for each method:

fs.identify_all(selection_params = {'missing_threshold': 0.6,
                                    'correlation_threshold': 0.98,
                                    'task': 'classification',
                                    'eval_metric': 'auc',
                                    'cumulative_importance': 0.99})
151 total features out of 255 identified for removal after one-hot encoding.

Note that the number of features identified will change because we have re-run the model. The remove function can then be called to remove these features.

The FeatureSelector class implements several common operations for removing features before training a machine learning model. It offers functions for identifying features to remove as well as visualizations. The methods can be run individually or all at once for an efficient workflow.

The missing, collinear, and single_unique methods are deterministic, while the feature importance-based methods change with each run. Feature selection, much like the field of machine learning, is largely empirical and requires testing multiple combinations to find the optimal answer. It is recommended to try multiple configurations in a pipeline, and the feature selector provides a way to quickly evaluate feature selection settings.