Building a model for a machine learning problem involves more than throwing every available feature from the dataset into the algorithm. It is good practice to check which features actually affect the target variable and to create a feature ranking, which is one element of the broader concept of feature selection. The ML.NET library supports this as well. Today I will show you how to use it.
How is feature selection defined?
Feature selection can be defined as checking how well the features describe the analyzed problem, in order to select the most important ones or to rank them. We want to find these essential features because the others may not contribute much to the analysis. Worse, they can disturb the data analysis process or prevent it completely.
As mentioned earlier, some selection methods are based on ranking features. The essence of ranking methods is to find a measure that lets you score each feature and thus helps to eliminate the unnecessary ones. The ranking is built by assessing each feature separately, and each method uses a different assessment criterion. The criterion is determined independently of the classifier used, which is one of the main advantages of these methods.
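To make the idea of a ranking method concrete, here is a minimal sketch in plain Python (not ML.NET). It assumes the absolute Pearson correlation with the target as the assessment criterion, which is only one of many possible choices and is not taken from the ML.NET library. The column names and values are made up for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / math.sqrt(var_x * var_y)

def rank_features(columns, target):
    """Score each feature on its own and sort from strongest to weakest."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical toy data, not the real wine dataset.
columns = {
    "alcohol": [9.0, 10.0, 11.0, 12.0, 13.0],
    "ph": [3.3, 3.1, 3.4, 3.2, 3.3],
    "constant": [1.0, 1.0, 1.0, 1.0, 1.0],
}
quality = [4.0, 5.0, 6.0, 7.0, 8.0]

ranking = rank_features(columns, quality)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Note that no classifier appears anywhere in this sketch: each feature is scored separately against the target, which is exactly why the criterion is independent of the model you train later.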
How does it look in ML.NET?
This topic is not covered very extensively in the ML.NET library at the moment. The library does not offer many options, but I will introduce one method you can use. It is called Permutation Feature Importance and is inspired by Leo Breiman’s Random Forests paper.
In short, this method takes an already trained model and randomly shuffles the values of one feature at a time across the evaluation data, then checks how much the model's results get worse. The rank of a given feature depends on how much the quality of the predictions drops when that feature's values are shuffled: the larger the drop, the more important the feature.
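The core idea of permutation feature importance can be sketched in plain Python, independent of ML.NET. This is an illustration under assumptions rather than the library's implementation: the "trained model" is a hypothetical fixed function `predict`, the data is made up, and importance is reported as the average drop in R-squared over a few shuffles.

```python
import random

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def permutation_importance(predict, rows, target, feature_names, repeats=5, seed=0):
    """Average drop in R-squared when one feature column is shuffled."""
    rng = random.Random(seed)
    baseline = r_squared(target, [predict(row) for row in rows])
    importance = {}
    for index, name in enumerate(feature_names):
        drops = []
        for _ in range(repeats):
            # Shuffle only this column; all other columns stay untouched.
            shuffled = [row[index] for row in rows]
            rng.shuffle(shuffled)
            permuted_rows = [
                row[:index] + [value] + row[index + 1:]
                for row, value in zip(rows, shuffled)
            ]
            score = r_squared(target, [predict(row) for row in permuted_rows])
            drops.append(baseline - score)
        importance[name] = sum(drops) / repeats
    return importance

def predict(row):
    """Hypothetical already-trained model: it relies on the first feature only."""
    return 2.0 * row[0] + 0.0 * row[1]

# Made-up data in which the target depends only on the first feature.
rows = [[float(i), float((i * 3) % 8)] for i in range(8)]
target = [2.0 * row[0] for row in rows]

importance = permutation_importance(predict, rows, target, ["useful", "ignored"])
for name, drop in sorted(importance.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: mean drop in R-squared = {drop:.3f}")
```

The model is never retrained here. Shuffling a column breaks the link between that feature and the target, so a feature the model depends on causes a visible drop, while a feature the model ignores causes none.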
Example of use and code
I used the wine quality dataset from the UCI Machine Learning Repository for the experiment. The dataset has 11 features and 11 classes. The classes describe the quality of the wine as a number in the range 0–10.
Therefore, after creating a new console application project, you should create model classes that correspond to the attributes in the dataset. The created classes are shown in the listing:
Then you should write the code that loads the dataset and adapts the model structure to the conventions of the ML.NET library. By default, ML.NET expects the property specifying the class to be called Label and the remaining attributes to be combined into a single column named Features. The estimator definition and the data preprocessing step belong here as well.
Next comes the code that creates a transformer from the data preprocessing estimator, preprocesses the training set, defines the SDCA estimator, and finally trains the model.
Now you can go to the code responsible for ranking the features and displaying the results in terms of the R-squared metric.
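For reference, R-squared (the coefficient of determination) is usually defined as below. The symbols are my own notation rather than anything taken from the ML.NET documentation: y_i is the true quality score of sample i, ŷ_i is the model's prediction, ȳ is the mean of the true scores, and n is the number of samples.

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```

A value of 1 means the predictions match the true scores exactly, while 0 corresponds to a model that does no better than always predicting the mean.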
Based on the ranking, you will know which features are relevant to the problem and which you can remove from the model. As part of your research, you can make predictions with a model that uses all features and with one that omits the “weakest” features according to the ranking, and then compare the results.
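Such a comparison could look roughly like the following plain Python sketch. It rests on several assumptions: a leave-one-out 1-nearest-neighbour predictor stands in for the trained model (the experiment in this post uses SDCA in ML.NET instead), the data is made up, and the feature named `weakest` is simply declared to be the one at the bottom of the ranking.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def loo_nearest_neighbour_r2(rows, target):
    """Leave-one-out R-squared of a 1-nearest-neighbour predictor."""
    predictions = []
    for i, row in enumerate(rows):
        others = [j for j in range(len(rows)) if j != i]
        # Squared Euclidean distance is enough to find the closest row.
        nearest = min(
            others,
            key=lambda j: sum((a - b) ** 2 for a, b in zip(row, rows[j])),
        )
        predictions.append(target[nearest])
    return r_squared(target, predictions)

def drop_features(rows, names, to_drop):
    """Return a copy of the rows without the named feature columns."""
    keep = [i for i, name in enumerate(names) if name not in to_drop]
    return [[row[i] for i in keep] for row in rows]

# Made-up data: the target follows "useful"; "weakest" is unrelated to it.
names = ["useful", "weakest"]
rows = [[float(i), 100.0 * (i % 2)] for i in range(8)]
target = [float(i) for i in range(8)]

score_all = loo_nearest_neighbour_r2(rows, target)
reduced_rows = drop_features(rows, names, {"weakest"})
score_reduced = loo_nearest_neighbour_r2(reduced_rows, target)

print(f"R-squared with all features:      {score_all:.3f}")
print(f"R-squared without weakest feature: {score_reduced:.3f}")
```

In this toy setup the irrelevant feature misleads the distance computation, so removing it improves the score. On real data the difference may be small or even go the other way, which is why the comparison is worth running.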