Recently, Deep Learning techniques have overshadowed traditional Machine Learning techniques for several reasons: they can solve a problem end to end in a single pipeline, they often deliver better performance, and they automate feature extraction. But in some situations, particularly in industry, Machine Learning is still the more reliable approach for data analysis. Consider the following trade-offs:

  • Deep Learning performs well when the data size is large, but with a small data size, traditional Machine Learning algorithms are the better option.
  • Deep Learning techniques need very robust, high-end infrastructure to train in a reasonable time.
  • When there is a lack of domain understanding for feature introspection, Deep Learning techniques outshine others, as we have to worry less about feature engineering.
  • Deep Learning really shines on complex problems such as image classification, natural language processing, and speech recognition.

In a few specific industries, such as clinical research and trials, we sometimes need to perform data analysis on a small dataset, and any mistake in model prediction can cost a human life. In such scenarios, Machine Learning is the mandatory choice.

In Machine Learning, feature selection is an important step toward better model performance. High-dimensional data, in terms of the number of features, is increasingly common in machine learning problems. To extract useful information from these high volumes of data, we have to use statistical techniques to reduce the noise and redundant data. So, in this post, we will learn the following:

  • What is a Feature in Machine Learning?
  • What is Feature Selection?
  • Why is it required?
  • What are Feature Selection Techniques and how to select a technique?
  • Notebook with Feature Selection Techniques
    • Visualization and ANOVA
    • RandomForest
    • Boruta Package

So, let’s start.

What is a Feature in Machine Learning or Data Mining?

Features are simply the variables present in a dataset, meaning they represent some observable phenomenon that can be quantified and recorded. Features can be independent or derived.

For example, take an ML application trying to determine the probability of heart disease in patients.  What are some possible features?

  • Gender
  • Age
  • Height
  • Weight
  • Blood pressure
  • Resting heart rate
  • Past medical history

Let’s look at some of the features in detail. One thing we sometimes want to do is take features with many values and place them into a smaller number of categories. Age, blood pressure, and resting heart rate all have a valid range of values from 0 to some upper bound. But do we want 120 different possible values for age? To make it easier, why not categorize the ages as 0 to 18, 19 to 29, 30 to 39, and so on? Likewise, place the vital signs into three to seven categories.

As for derived values, height and weight can be combined into a standard feature: BMI. So calculate BMI and break it up into a small number of categories.

Past medical history can be made into multiple binary features.  Has the patient suffered from a stroke?  Yes or no.  Has the patient been diagnosed with high blood pressure?

This is just a good basic example of identifying, classifying, and deriving useful features for machine learning applications.
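As a small sketch of the derivation above, here is how BMI could be computed and binned with pandas; the column names, rows, and bin edges are illustrative stand-ins, not real patient data:

```python
import pandas as pd

# Hypothetical patient data: height in metres, weight in kilograms
patients = pd.DataFrame({
    "height": [1.60, 1.75, 1.82, 1.68],
    "weight": [55.0, 80.0, 110.0, 62.0],
})

# Derive BMI = weight / height^2
patients["bmi"] = patients["weight"] / patients["height"] ** 2

# Break BMI into a small number of categories
patients["bmi_cat"] = pd.cut(
    patients["bmi"],
    bins=[0, 18.5, 25, 30, float("inf")],
    labels=["underweight", "normal", "overweight", "obese"],
)
print(patients[["bmi", "bmi_cat"]])
```

The same `pd.cut` call works for binning age or vital signs into a handful of categories.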

What is Feature Selection?

Feature selection is the process of selecting a subset of the most relevant features from the full set of features.

There are five main reasons to do so:

  • To simplify the model by reducing the number of parameters.
  • To increase the accuracy of the trained model, as a few relevant features are better for training than a huge number of irrelevant and redundant ones.
  • To avoid the curse of dimensionality. [This is the phenomenon where existing machine learning algorithms fail to work, or do not scale well, on a large number of features.]
  • To avoid overfitting. [The chance of overfitting increases with a large number of irrelevant features, i.e. noise in the data.]
  • To increase the accuracy and efficiency of the classifier.

How do we select a technique? That depends on the types of variables involved and on the statistical measures available:

  1. Correlation: Correlation measures how two numerical variables are related to each other. We should select features that are not related to each other; instead, each feature should be highly correlated with the dependent variable.

Cor(X, Y) = cov(X, Y) / (Sx . Sy),

where cov(X, Y) is the covariance between variables X and Y, and Sx and Sy represent the standard deviations of X and Y respectively.

  • Covariance: The covariance measures how two features change together.

Cov(X, Y) = (1/n) Sum[(Xi - Xbar)(Yi - Ybar)]
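A minimal numerical check of these two formulas with NumPy; the sample arrays are made up, and y is constructed to be perfectly linearly related to x:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])  # y = 2x, perfectly correlated

# Population covariance: Cov(X, Y) = (1/n) * sum((Xi - Xbar) * (Yi - Ybar))
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

# Correlation: Cor(X, Y) = Cov(X, Y) / (Sx * Sy)
cor_xy = cov_xy / (x.std() * y.std())

print(cov_xy, cor_xy)  # correlation is 1.0 for a perfect linear relationship
```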

  2. Dimensionality Reduction: The technique of reducing the dimension of the dataset, using methods such as the following.
  • PCA: Principal Component Analysis creates a smaller number of principal components by merging correlated variables together using their eigenvalues. These principal components are orthogonal to each other, meaning they are uncorrelated and largely independent of one another.


If A is a square matrix, ν a vector, and λ a scalar satisfying Aν = λν, then λ is called an eigenvalue associated with the eigenvector ν of A.
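The eigenvalue definition can be verified directly with NumPy, and PCA itself is available in scikit-learn; the matrix and the two-column dataset below are illustrative, with the second column constructed to be strongly correlated with the first:

```python
import numpy as np
from sklearn.decomposition import PCA

# Eigenvalues/eigenvectors of a square matrix A, satisfying A v = lambda v
A = np.array([[2.0, 0.0],
              [0.0, 3.0]])
eigvals, eigvecs = np.linalg.eig(A)
v, lam = eigvecs[:, 0], eigvals[0]
assert np.allclose(A @ v, lam * v)  # A v = lambda v holds

# PCA: two strongly correlated columns collapse onto one principal component
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
X = np.column_stack([x1, 2 * x1 + rng.normal(scale=0.01, size=100)])
pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)  # first component carries almost all variance
```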

  • SVD: Singular-Value Decomposition is a matrix decomposition method that reduces a matrix to its constituent parts in order to make certain subsequent matrix calculations simpler.

A = U . Sigma . V^T

where A is the m x n matrix that we wish to decompose, U is an m x m matrix, Sigma is an m x n diagonal matrix, and V^T is the transpose of an n x n matrix V.
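The decomposition can be checked in a few lines with NumPy; the 3x2 matrix here is arbitrary:

```python
import numpy as np

# Decompose a 3x2 matrix A into U, Sigma, V^T
A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
U, s, VT = np.linalg.svd(A)  # s holds the singular values

# Rebuild the m x n diagonal Sigma from the singular values
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)

# Check that A = U . Sigma . V^T
print(np.allclose(A, U @ Sigma @ VT))  # prints True
```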

  3. Importance: Techniques based on the importance of the features.

  • Filter: Filter methods are generally used as a preprocessing step. Following are some techniques:
  • Correlation: It is used as a measure for quantifying linear dependence between two continuous variables X and Y. Its value varies from -1 to +1.
  • LDA: Linear discriminant analysis is used to find a linear combination of features that characterizes or separates two or more classes (or levels) of a categorical variable.
  • ANOVA: ANOVA stands for Analysis of Variance. It operates on one or more categorical independent features and one continuous dependent feature. It is a statistical test of whether the means of several groups are equal.
  • Chi-square: A statistical test applied to groups of categorical features to evaluate the likelihood of correlation or association between them using their frequency distributions.
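As one way to apply such filter tests in practice, scikit-learn wraps the chi-square test (and the ANOVA F-test) in SelectKBest. This sketch uses the bundled iris dataset purely for illustration, since chi-square requires non-negative feature values:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)  # 4 non-negative numeric features

# Score every feature with the chi-square test and keep the best 2
selector = SelectKBest(score_func=chi2, k=2).fit(X, y)

print(selector.scores_)                    # chi-square score per feature
print(selector.get_support(indices=True))  # indices of the 2 best features

X_reduced = selector.transform(X)
print(X_reduced.shape)                     # (150, 2)
```

Swapping `chi2` for `f_classif` gives the ANOVA-based filter instead.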

What are Feature Selection Techniques?

There are many feature selection techniques, depending on the types of variables. The following describes a very detailed branching of all the techniques; however, each technique is a vast area in itself, so we can learn more about each one in our classroom training.


In wrapper methods, we try to use a subset of features and train a model using them. Based on the inferences that we draw from the previous model, we decide to add or remove features from our subset. The problem is essentially reduced to a search problem. These methods are usually computationally very expensive. Some wrapper techniques use model coefficients as feature importance.

Feature selection with wrapper methods by using Boruta package helps to find the importance of a feature by creating shadow features.

  • Forward Selection: Forward selection is an iterative method in which we start with having no feature in the model. In each iteration, we keep on adding the feature which best improves our model till an addition of a new variable does not improve the performance of the model.
  • Backward Elimination: In backward elimination, we start with all the features and remove the least significant feature at each iteration, which improves the performance of the model. We repeat this until no further improvement is observed.
  • Recursive Feature Elimination: It is a greedy optimization algorithm which aims to find the best performing feature subset. It creates models and keeps aside the best or the worst performing feature at each iteration. It constructs the next model with the left features until all the features are consumed. It then ranks the features based on the order of their elimination.
  • Linear Regression: The fitted coefficients can provide the basis for a crude feature importance score. Features assigned a zero coefficient are essentially removed from the model, while the largest coefficients point to the most important features.
  • Logistic Regression: The coefficients are both positive and negative. Positive scores indicate a feature that predicts class 1, whereas negative scores indicate a feature that predicts class 0.
  • CART Classification Feature Importance: After being fit, the model provides a feature_importances_ property that can be accessed to retrieve the relative importance score of each input feature.
  • Random Forest Feature Importance: A fitted random forest exposes the same feature_importances_ property, averaged across its trees.
  • XGBoost Feature Importance: A fitted XGBoost model (via its scikit-learn wrapper) also exposes a feature_importances_ property.
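One of the wrapper methods above, Recursive Feature Elimination, can be sketched with scikit-learn on synthetic data; the dataset parameters are arbitrary:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# 10 features, only 5 of which are informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=5,
                       noise=0.1, random_state=42)

# Recursively drop the weakest feature until 5 remain
rfe = RFE(estimator=LinearRegression(), n_features_to_select=5).fit(X, y)

print(rfe.support_)  # True for the features that were kept
print(rfe.ranking_)  # 1 = kept; larger numbers were eliminated earlier
```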

Embedded methods combine the qualities of filter and wrapper methods. They are implemented by algorithms that have their own built-in feature selection mechanisms, such as LASSO and Ridge regression, which have built-in penalization functions to reduce overfitting.

  • Lasso: Lasso regression performs L1 regularization, which adds a penalty equivalent to the absolute value of the magnitude of the coefficients.
  • Ridge: Ridge regression performs L2 regularization, which adds a penalty equivalent to the square of the magnitude of the coefficients.
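A sketch of embedded selection with Lasso via scikit-learn's SelectFromModel; the synthetic dataset and the alpha value are arbitrary choices for illustration:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# 10 features, only 4 of which are informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=4,
                       noise=0.1, random_state=0)

# The L1 penalty drives the coefficients of weak features to exactly zero;
# SelectFromModel then keeps only the features with non-zero coefficients
selector = SelectFromModel(Lasso(alpha=1.0)).fit(X, y)

print(selector.estimator_.coef_)  # zeros for the dropped features
print(selector.get_support())     # mask of the surviving features
```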

Use Case:

This data set consists of three types of entities:
(a) The specification of an auto in terms of various characteristics,
(b) Its assigned insurance risk rating,
(c) Its normalized losses in use as compared to other cars.

Based on all the features, we have to build a model to predict the price of an auto. However, in the context of our discussion, we will cover the techniques for selecting the features that contribute most toward deciding the price of the auto.

Dataset is collected from here

Download code:

Feature Selection using Correlation and ANOVA

Step 1: Import all the necessary packages


Step 2: Upload the dataset


Step 3: Get the details of the dataset


Step 4: Correlation is a way of finding the interdependence or relationship between two numeric variables. Let’s get the correlation between the numeric variables.
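Since the notebook cells are not reproduced here, steps 1 to 4 can be sketched roughly as follows; the miniature DataFrame is a made-up stand-in for the real auto dataset, which the notebook loads from the downloaded CSV instead:

```python
import pandas as pd

# Steps 1-2: import packages and load the data (stand-in rows shown here)
df = pd.DataFrame({
    "engine-size": [130, 152, 109, 136, 164],
    "horsepower":  [111, 154, 102, 115, 121],
    "price":       [13495, 16500, 13950, 17450, 15250],
})

# Step 3: details of the dataset
print(df.shape)
print(df.dtypes)

# Step 4: pairwise Pearson correlation between the numeric variables
corr = df.corr()
print(corr["price"])  # how strongly each variable correlates with price
```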


The resulting Pearson Correlation coefficient is a value between -1 and 1 inclusive, where:

  • 1: Total positive linear correlation.
  • 0: No linear correlation, the two variables most likely do not affect each other.
  • -1: Total negative linear correlation.

Step 5: To understand the (linear) relationship between an individual variable and the price, we can use “regplot”, which plots the scatterplot plus the fitted regression line for the data. We can plot this for all the numerical variables.


Step 6: Draw scatter plot for all the variables


Step 7: The categorical variables can have the type “object” or “int64”. A good way to visualize categorical variables is by using boxplots.
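Steps 5 to 7 can be sketched with seaborn, assuming it is installed; again, the tiny DataFrame below is a made-up stand-in for the auto dataset:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, no display needed
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Stand-in rows for the auto dataset
df = pd.DataFrame({
    "engine-size":     [109, 130, 136, 152, 164, 90],
    "price":           [13950, 13495, 17450, 16500, 15250, 7603],
    "engine-location": ["front", "front", "front", "rear", "rear", "front"],
})

# Steps 5-6: scatterplot plus fitted regression line for a numeric variable
ax1 = sns.regplot(x="engine-size", y="price", data=df)

# Step 7: boxplot of price for each category of a categorical variable
plt.figure()
ax2 = sns.boxplot(x="engine-location", y="price", data=df)
plt.show()
```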


We see that the distribution of price between these two engine-location categories, front and rear, is distinct enough to take engine-location as a potentially good predictor of price.


We see that the distributions of price between the different body-style categories have a significant overlap, and so body-style would not be a good predictor of price.

Step 8: To know the significance of the correlation coefficient, we will use the p-value:

  • p-value < 0.001: strong evidence that the correlation is significant.
  • p-value < 0.05: moderate evidence that the correlation is significant.
  • p-value < 0.1: weak evidence that the correlation is significant.
  • p-value > 0.1: no evidence that the correlation is significant.
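The coefficient and its p-value can be obtained together from scipy; in the notebook the two inputs are dataset columns, while the numbers below are made up:

```python
from scipy import stats

# Hypothetical paired samples, e.g. engine-size and price columns
x = [109, 130, 136, 152, 164, 90, 98, 120]
y = [13950, 13495, 17450, 16500, 15250, 7603, 9095, 12170]

# Pearson correlation coefficient and its p-value in one call
pearson_coef, p_value = stats.pearsonr(x, y)
print(pearson_coef, p_value)
```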

Step 9:

The Analysis of Variance (ANOVA) is a statistical method used to test whether there are significant differences between the means of two or more groups. ANOVA returns two parameters:

F-test score: ANOVA assumes the means of all groups are the same, calculates how much the actual means deviate from the assumption, and reports it as the F-test score. A larger score means there is a larger difference between the means.

P-value: P-value tells how statistically significant is our calculated score value.

If our price variable is strongly correlated with the variable we are analyzing, expect ANOVA to return a sizeable F-test score and a small p-value.

Since ANOVA analyzes the difference between different groups of the same variable, the groupby function will come in handy. Because the ANOVA algorithm averages the data automatically, we do not need to take the average beforehand.


We can use the function ‘f_oneway’ in the module ‘stats’ to obtain the F-test score and p-value.
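The groupby-plus-f_oneway pattern looks like this; the drive-wheels groups and prices below are made-up stand-ins for the real auto data, so the F and P values will differ from the notebook's results:

```python
import pandas as pd
from scipy import stats

# Stand-in data: price grouped by drive-wheels category
df = pd.DataFrame({
    "drive-wheels": ["fwd", "fwd", "fwd", "rwd", "rwd", "rwd", "4wd", "4wd"],
    "price":        [9500, 11000, 10200, 21000, 24500, 23000, 12000, 11500],
})

grouped = df[["drive-wheels", "price"]].groupby("drive-wheels")

# One-way ANOVA across the three drive-wheels groups
f_val, p_val = stats.f_oneway(
    grouped.get_group("fwd")["price"],
    grouped.get_group("rwd")["price"],
    grouped.get_group("4wd")["price"],
)
print("ANOVA results: F=", f_val, ", P =", p_val)
```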


ANOVA results: F= 67.95406500780399 , P = 3.3945443577151245e-23

This is a great result, with a large F test score showing a strong correlation and a P value of almost 0 implying almost certain statistical significance. But does this mean all three tested groups are all this highly correlated?


ANOVA results: F= 130.5533160959111 , P = 2.2355306355677845e-23


ANOVA results: F= 8.580681368924756 , P = 0.004411492211225333


ANOVA results: F= 0.665465750252303 , P = 0.41620116697845666

We have narrowed it down to the following variables as the best features:

Continuous numerical variables:

  • Length
  • Width
  • Curb-weight
  • Engine-size
  • Horsepower
  • City-mpg
  • Highway-mpg
  • Wheel-base
  • Bore

Categorical variables:

  • Drive-wheels

Feature Selection using Random Forest

Step 1: Import the required libraries


Step 2: The dataset has a few variables of string datatype, which should be converted to numeric so that RandomForest can consider them for model building. We will do this by using get_dummies()


Step 3: Check the size of the dataset now and convert all the NAs to numbers


Step 4: Create the X and Y datasets


Step 5: Make a list of all the features


Step 6: Train and test data split


Step 7: Create a random forest classifier


Step 8: Train the classifier


Step 9: Print the name and Gini importance of each feature


Step 10: Create a selector object that will use the random forest classifier to identify features that have an importance of more than 0.01


Step 11: Print the names of the most important features
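The steps above can be sketched end to end as follows. The miniature DataFrame and the binary "expensive" target are made-up stand-ins for the real auto data, used only so the pipeline runs self-contained:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

# Stand-in for the auto data; the notebook loads the full CSV instead
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "curb-weight": rng.integers(1500, 4000, size=120),
    "horsepower":  rng.integers(50, 250, size=120),
    "body-style":  rng.choice(["sedan", "hatchback", "hardtop"], size=120),
})
# Hypothetical binary target: "expensive" auto or not
df["expensive"] = (df["curb-weight"] + 10 * df["horsepower"] > 3500).astype(int)

# Steps 2-5: one-hot encode string columns, fill NAs, build X, Y, feature list
X = pd.get_dummies(df.drop(columns="expensive")).fillna(0)
y = df["expensive"]
features = list(X.columns)

# Step 6: train and test data split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Steps 7-9: fit the forest and print each feature's Gini importance
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
for name, importance in zip(features, clf.feature_importances_):
    print(name, importance)

# Steps 10-11: keep only features with importance above 0.01
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0),
    threshold=0.01,
).fit(X_train, y_train)
selected = [f for f, keep in zip(features, selector.get_support()) if keep]
print(selected)
```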


Observation or Result: The following are the important features returned by the RandomForest model:

  • symboling
  • normalized-losses
  • wheel-base
  • length
  • width
  • height
  • curb-weight
  • engine-size
  • bore
  • stroke
  • compression-ratio
  • horsepower
  • peak-rpm
  • city-mpg
  • highway-mpg
  • make_subaru
  • aspiration_turbo
  • num-of-doors_four
  • body-style_hardtop
  • body-style_hatchback
  • body-style_sedan

Feature Selection using XGBoost Regression


  • Preparing a test regression problem using the make_regression() function.
  • define the model using XGBRegressor()
  • fit the model with X and y
  • get importance with feature_importances_
  • Get the list of all the important features
  • plot feature importance

Feature Selection using XGBoost Classifier


  • Preparing a test classification problem using the make_classification() function.
  • define the model using XGBClassifier()
  • fit the model with X and y
  • get importance with feature_importances_
  • Get the list of all the important features
  • plot feature importance

Feature Selection using the Boruta Package

Boruta follows an all-relevant feature selection method where it captures all features which are in some circumstances relevant to the outcome variable. In contrast, most of the traditional feature selection algorithms follow a minimal optimal method where they rely on a small subset of features which yields a minimal error on a chosen classifier.

While fitting a random forest model on a data set, you can recursively get rid of features in each iteration which didn’t perform well in the process. This will eventually lead to a minimal optimal subset of features as the method minimizes the error of random forest model. This happens by selecting an over-pruned version of the input data set, which in turn, throws away some relevant features.

On the other hand, Boruta finds all features which are either strongly or weakly relevant to the decision variable. This makes it well suited for biomedical applications, where one might be interested in determining which human genes (features) are connected in some way to a particular medical condition (target variable).

Boruta package in Python.

BorutaPy is the original R package recoded in Python, with a few extra features added. Some improvements include:

  • Faster run times, thanks to scikit-learn
  • Scikit-learn like interface
  • Compatible with any ensemble method from scikit-learn
  • Automatic n_estimator selection
  • Ranking of features


  1. First, the model adds randomness to the features by creating duplicated copies of each column and shuffling their values; these are called Shadow Features.
  2. Then it trains a Random Forest classifier on the dataset and calculates importance using Mean Decrease Accuracy.
  3. Then, the algorithm checks, for each real feature, whether it has higher importance than its shadows, i.e. a higher Z-score than the maximum Z-score of its shadow features.
  4. At every iteration, the algorithm compares the Z-scores of the real features against those of the shuffled copies and keeps the features that score best.

Importing the required libraries


Importing the dataset


Getting dummy variables for Categorical variables to run in a RandomForest model.


Creating X and Y datasets


Converting to a numpy array, as BorutaPy accepts only arrays.



We learnt many techniques for feature selection. A model built with the selected features should now behave with good accuracy.

Happy Learning…! Stay tuned…