There are many debates on how to decide the best classifier. Measuring performance metrics and computing the area under the ROC curve are a few of the approaches, but there is also a lot of useful information to be gleaned from visualizing a decision boundary, information that gives us an intuitive grasp of how learning models behave.
In this article, we will cover the following:
- What is Decision Boundary
- Importance of Decision Boundary
- Types of Decision Boundary
- Decision Boundary for different classifiers
- A Use Case with Python code
- Decision Boundary for Higher Dimension Data
So, let’s start.
What is Decision Boundary?
When a classifier is trained on a dataset with a specific classification algorithm, the model defines a surface (a hyperplane in the linear case), called the decision boundary, that separates the data points into classes. It is where the algorithm switches from one class to another. On one side of the decision boundary, a data point is more likely to be labelled class A; on the other side, it is more likely to be labelled class B.
Let’s take an example of a Logistic Regression.
The goal of logistic regression is to find a way to split the data points so that the class of a given observation can be predicted accurately from the information in its features.
Suppose we define a line that describes the decision boundary. All of the points on one side of the boundary are then assigned to class A, and all of the points on the other side are assigned to class B.
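Logistic regression produces its prediction with the sigmoid function, S(z) = 1 / (1 + e^(-z)), where:
- S(z) = Output between 0 and 1 (probability estimate)
- z = Input to the function (z = mx + b)
- e = Base of the natural logarithm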
Our prediction function returns a probability score between 0 and 1. To map this to a discrete class (A/B), we select a threshold value, or tipping point: scores above it are classified as class A and scores below it as class B.
If our threshold was 0.5 and our prediction function returned 0.7, we would classify the observation as class A. If the prediction was 0.2, we would classify it as class B.
So, the line along which the predicted probability equals 0.5 is the decision boundary.
In order to map predicted values to probabilities, we use the Sigmoid function.
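The mapping from score to probability to class can be sketched in a few lines of Python. This is a minimal illustration, not the article's original code: the slope `m` and intercept `b` are hypothetical values chosen so that the boundary falls at x = 0.5, and treating a probability of exactly 0.5 as class A is an assumption.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(probability: float, threshold: float = 0.5) -> str:
    """Return "A" at or above the threshold, otherwise "B".

    Sending a probability exactly equal to the threshold to class A
    is an assumption made for this sketch.
    """
    return "A" if probability >= threshold else "B"

# Hypothetical slope and intercept, chosen only for illustration.
m, b = 2.0, -1.0

for x in (0.0, 0.5, 1.0):
    z = m * x + b          # linear score
    p = sigmoid(z)         # probability estimate
    print(f"x={x:.1f}  z={z:+.1f}  S(z)={p:.3f}  class={classify(p)}")
```

With these values the score z is zero at x = 0.5, so S(z) is exactly 0.5 there: that point is the decision boundary of this one-dimensional example.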
Importance of Decision Boundary
A decision boundary is a surface that separates data points belonging to different class labels. Decision boundaries are not confined to the data points we have provided; they span the entire feature space we trained on. The model can predict a value for any possible combination of inputs in our feature space. If the data we train on is not ‘diverse’, the overall topology of the model will generalize poorly to new instances. So, it is important to analyse which models are best suited to a ‘diverse’ dataset before putting a model into production.
Examining decision boundaries is a great way to learn how the training data we select affects performance and the ability of our model to generalize. Visualizing decision boundaries can illustrate how sensitive models are to each dataset, which is a great way to understand how specific algorithms work and what their limitations are for specific datasets.
Objective: To build the decision boundary for various classification algorithms and decide which algorithm is best for the dataset.
The dataset is available here.
Dataset Description: The Dataset contains users’ information, based on which the best model should be built to predict whether the user will buy a car or not.
The Independent variables:
- Age: Age of the user
- Estimated Salary: Salary of the user.
The dependent variable: ‘Purchased’, which is 1 if the user purchases the car and 0 otherwise.
Step 1: Import all the required libraries
Step 2: Import the dataset
Step 3: Apply StandardScaler to the dataset. The variables ‘Salary’ and ‘Age’ are not on the same scale, so they should be scaled; otherwise, the model may not produce good predictions. Standard scaling also helps to speed up the calculations in an algorithm.
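The scaling step might look like the sketch below. The article's dataset is not reproduced here, so the two columns are synthetic stand-ins: the value ranges and the variable names `age`, `salary` and `X_scaled` are assumptions made only for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the two features (NOT the article's dataset):
# age in years and estimated salary in currency units.
rng = np.random.default_rng(0)
age = rng.integers(18, 61, size=200)
salary = rng.integers(15_000, 150_001, size=200)
X = np.column_stack([age, salary]).astype(float)

# StandardScaler subtracts each column's mean and divides by its
# standard deviation, so both features end up on a comparable scale.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print("mean before:", X.mean(axis=0).round(1))
print("std before: ", X.std(axis=0).round(1))
print("mean after: ", X_scaled.mean(axis=0).round(3))
print("std after:  ", X_scaled.std(axis=0).round(3))
```

After scaling, each column has a mean of approximately 0 and a standard deviation of approximately 1, so salary no longer dominates age purely because of its units.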
Step 4: Import sklearn libraries for classifiers
Step 5: Get the dimension of the dataset.
Step 6: Build a Logistic Regression model and display its decision boundary. A decision boundary can be visualized by dense sampling via meshgrid. However, if the grid resolution is too low, the boundary will appear inaccurate. The purpose of meshgrid is to create a rectangular grid out of an array of x values and an array of y values. A complete explanation of how to plot a meshgrid is available here.
With meshgrid, we make an image in which each pixel represents a grid cell in the 2D feature space, so the image defines a grid over that space. The pixels of the image are then classified using the classifier, which assigns a class label to each grid cell. The classified image is then used as the background for a scatter plot that shows the data points of each class.
Advantage: Every grid point in the 2D feature space is classified, so the plot shows the model’s prediction across the whole region rather than only at the training points.
Disadvantage: Very fine decision boundary maps are computationally costly, as the grid has to be made finer and finer.
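The meshgrid approach described above can be sketched as follows. This is an illustration under assumptions, not the article's original code: the data comes from `make_classification` as a stand-in for the scaled Age and Salary columns, and the helper `predict_on_grid` with its `step` and `pad` parameters is a hypothetical name introduced here.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature data standing in for the scaled Age/Salary columns.
X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X = StandardScaler().fit_transform(X)

clf = LogisticRegression().fit(X, y)

def predict_on_grid(model, X, step=0.02, pad=1.0):
    """Classify every cell of a rectangular grid that covers the data."""
    x_min, x_max = X[:, 0].min() - pad, X[:, 0].max() + pad
    y_min, y_max = X[:, 1].min() - pad, X[:, 1].max() + pad
    xx, yy = np.meshgrid(np.arange(x_min, x_max, step),
                         np.arange(y_min, y_max, step))
    # Flatten the grid to (n_cells, 2), predict, then restore the grid shape.
    zz = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    return xx, yy, zz

xx, yy, zz = predict_on_grid(clf, X)

fig, ax = plt.subplots()
ax.contourf(xx, yy, zz, alpha=0.3)                        # classified grid
ax.scatter(X[:, 0], X[:, 1], c=y, edgecolors="k", s=20)   # data points on top
ax.set_xlabel("feature 1 (scaled)")
ax.set_ylabel("feature 2 (scaled)")
ax.set_title("Logistic Regression decision boundary")
# Call plt.show() or fig.savefig("boundary.png") to view the figure.
```

A smaller `step` gives a sharper boundary at the price of more grid cells to classify, which is the trade-off noted above. The same helper should work with any fitted classifier that exposes `predict`.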
In Logistic Regression, the decision boundary is a straight line that separates class A and class B. Some of the points from class A fall in the region of class B, because with a linear model it is difficult to find a boundary line that separates the two classes exactly.
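One way to check that this boundary really is a straight line is to read the fitted weights and solve w1*x1 + w2*x2 + b = 0 directly. The sketch below uses synthetic data again (an assumption, not the article's dataset) and assumes the second weight `w2` is not zero, since it divides by it.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature data (a stand-in for the article's dataset).
X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
clf = LogisticRegression().fit(X, y)

# Fitted weights and intercept of the linear score z = w1*x1 + w2*x2 + b.
(w1, w2), b = clf.coef_[0], clf.intercept_[0]

# Points on the line z = 0, solved for x2 (assumes w2 != 0).
x1 = np.linspace(X[:, 0].min(), X[:, 0].max(), 5)
x2 = -(w1 * x1 + b) / w2
on_line = np.column_stack([x1, x2])

# Probability of class 1 at each point on the line.
proba_on_line = clf.predict_proba(on_line)[:, 1]
print(proba_on_line)
```

Every printed probability is 0.5 up to floating-point error, because the sigmoid of a zero score is 0.5: the line z = 0 is exactly the 0.5-threshold boundary.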
Step 7: Build a Random Forest model and plot the decision boundary. Being an ensemble of many decision trees, it is a nonlinear classifier, and its boundary bends to capture the regions belonging to each class.
The decision surfaces for the Decision Tree and Random Forest are very complex. The Decision Tree is by far the most sensitive, showing only extreme classification probabilities that are heavily influenced by single points. The Random Forest shows lower sensitivity, with isolated points having much less extreme classification probabilities. The SVM is the least sensitive, since it has a very smooth decision boundary.
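The difference in sensitivity can be made visible by comparing the class probabilities that each model assigns over a grid. The sketch below is an illustration on synthetic `make_moons` data, not the article's dataset, and uses scikit-learn's default settings, which is an assumption about how the original models were configured.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data with overlapping classes.
X, y = make_moons(n_samples=300, noise=0.3, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Evaluate both models on a coarse grid covering the data.
xs = np.linspace(X[:, 0].min(), X[:, 0].max(), 50)
ys = np.linspace(X[:, 1].min(), X[:, 1].max(), 50)
xx, yy = np.meshgrid(xs, ys)
grid = np.c_[xx.ravel(), yy.ravel()]

tree_p = tree.predict_proba(grid)[:, 1]
forest_p = forest.predict_proba(grid)[:, 1]

print("distinct tree probabilities:  ", np.unique(tree_p))
print("distinct forest probabilities:", np.unique(forest_p).size)
```

A fully grown tree ends in pure leaves here, so it only ever outputs 0 or 1, and a single training point can flip a whole region. The forest averages many trees and therefore produces intermediate probabilities near the boundary.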
Step 8: Build Support Vector Machine model and Plot the decision boundary
A Support Vector Machine finds a hyperplane that separates the feature space into two classes with the maximum margin. If the problem is not linearly separable in the original space, the kernel trick is used to make it linearly separable by mapping the data into a space with more dimensions. Thus a general hypersurface in a low-dimensional space becomes a hyperplane in a space with many more dimensions.
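The idea of gaining separability by adding dimensions can be illustrated with an explicit feature map. Note that this is a simplification: a kernel SVM performs such a mapping implicitly rather than building new columns. The concentric-circles data and the squared-radius feature below are assumptions chosen for the illustration.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the original 2D space.
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

linear_2d = SVC(kernel="linear").fit(X, y)
acc_2d = linear_2d.score(X, y)

# Explicit lift to 3D: append the squared distance from the origin.
X_lifted = np.column_stack([X, (X ** 2).sum(axis=1)])
linear_3d = SVC(kernel="linear").fit(X_lifted, y)
acc_3d = linear_3d.score(X_lifted, y)

# An RBF-kernel SVM works on the original 2D data without an explicit lift.
rbf_2d = SVC(kernel="rbf").fit(X, y)
acc_rbf = rbf_2d.score(X, y)

print(f"linear SVM, original 2D features: {acc_2d:.2f}")
print(f"linear SVM, lifted 3D features:   {acc_3d:.2f}")
print(f"RBF-kernel SVM, original 2D:      {acc_rbf:.2f}")
```

In the lifted space a flat plane at a fixed squared radius separates the inner ring from the outer one. Projected back to 2D, that plane is a circle, which is the curved boundary an SVM plot would show.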
Step 9: Build Decision Tree model and Plot the decision boundary
Step 10: Build Gaussian NaiveBayes model and Plot the decision boundary
Gaussian Naive Bayes has also performed well, producing a smooth, curved boundary.
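Alongside the plots, a numerical comparison can support the choice of model. The sketch below cross-validates the five classifiers from Steps 6 to 10 on synthetic data. The dataset, the default hyperparameters, the RBF kernel for the SVM and the use of 5-fold cross-validation are all assumptions, not the article's setup.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-feature data (a stand-in for the article's dataset).
X, y = make_moons(n_samples=400, noise=0.3, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "SVM (RBF kernel)": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Gaussian Naive Bayes": GaussianNB(),
}

scores = {}
for name, model in models.items():
    # Scaling sits inside the pipeline so each fold is scaled on its own
    # training portion only.
    pipeline = make_pipeline(StandardScaler(), model)
    scores[name] = cross_val_score(pipeline, X, y, cv=5).mean()
    print(f"{name:22s} mean accuracy = {scores[name]:.3f}")
```

The ranking depends on the data, so the numbers from this synthetic example say nothing about which model is best for the car purchase dataset.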
Decision Boundary for Higher Dimension Data
Decision boundaries can easily be visualized for 2D and 3D datasets. Generalizing beyond 3D is a challenge, because a boundary that lives in many dimensions has to be transformed to a lower dimension that can be displayed and understood by experts.
However, a decision boundary can still be plotted using t-SNE, with the dimensions of the data reduced in several steps. For example, if the data has 150 dimensions, it is first reduced to 50 dimensions and then to 2.
The classes TSNE from sklearn.manifold and TruncatedSVD from sklearn.decomposition are used for this.
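The two-stage reduction from 150 to 50 to 2 dimensions might be sketched as below. The data is synthetic and the parameters are assumptions. The sketch only produces the 2D embedding; how the boundary itself is drawn on top of it is not specified in this article.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

# Synthetic 150-dimensional data (a stand-in; the article names no dataset).
X, y = make_classification(n_samples=300, n_features=150, n_informative=20,
                           random_state=0)

# Stage 1: linear reduction from 150 to 50 dimensions.
X_50 = TruncatedSVD(n_components=50, random_state=0).fit_transform(X)

# Stage 2: nonlinear embedding from 50 to 2 dimensions.
X_2 = TSNE(n_components=2, random_state=0).fit_transform(X_50)

print(X.shape, "->", X_50.shape, "->", X_2.shape)
# (300, 150) -> (300, 50) -> (300, 2)
```

The resulting `X_2` can be passed to a scatter plot coloured by `y` or by a classifier's predictions.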
A very nice research paper, published here, describes how to plot decision boundaries for higher-dimensional data.
In this article, we learnt the role of the decision boundary in evaluating a classifier, built several classifier models, and plotted their respective decision boundaries to select the best model. We also saw that plotting a decision boundary for higher-dimensional data is a complex task, which can be approached using t-SNE, with the dimensions of the data reduced in several steps.
Now, what is next for you…? Please come up with a few points about the proposed approach to plotting decision boundaries for higher-dimensional data, as described in the research paper published here.
See you then in our next article… Till then Stay Tuned and Happy Learning!