Yes, you read it right! The Decision Tree Algorithm is a multiuse model: it can perform both classification and regression tasks, and even multioutput tasks. It is a powerful and versatile model, capable of fitting complex datasets.

A Decision Tree basically works by breaking the data down into smaller and smaller subsets, making each decision by asking a series of questions about the features.

Here is an example in which we use a decision tree to decide on an activity for a particular day:

Figure-1: Decision Tree Architecture with example
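To make the idea concrete, a tree like the one in Figure-1 boils down to nested if/else questions. The features used below (weather and energy level) are hypothetical stand-ins, not necessarily the exact questions in the figure:

```python
# A hand-written decision tree for "what to do today".
# The features (weather, energy) are hypothetical stand-ins for
# whatever questions Figure-1 actually asks.
def choose_activity(weather: str, energy: str) -> str:
    if weather == "sunny":            # root node question
        if energy == "high":          # internal node question
            return "go for a run"     # leaf node
        return "go for a walk"        # leaf node
    return "stay in and read"         # leaf node

print(choose_activity("sunny", "high"))  # go for a run
```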

In the above example, we used some terms which are explained below:

1. Root Node: It contains the feature that provides the maximum information about the class labels of the dataset.

2. Internal Node: It contains a feature that uses the information remaining in the dataset after the split made by its preceding node.

3. Branch: A branch connects a node to one of the sub-nodes produced when that node is split into two or more sub-nodes.

4. Leaf Node: The nodes that do not split further are called Leaf or Terminal nodes. (A minimal node structure capturing these roles is sketched below.)
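These terms map naturally onto a simple recursive data structure. The sketch below is a minimal, assumed representation, not Scikit-Learn's internal one:

```python
from dataclasses import dataclass
from typing import Optional

# A minimal binary tree node. The root node is simply the topmost Node;
# internal nodes have children; leaf nodes carry a prediction instead.
@dataclass
class Node:
    feature: Optional[int] = None      # index of the feature this node splits on
    threshold: Optional[float] = None  # split threshold for that feature
    left: Optional["Node"] = None      # branch taken when feature value <= threshold
    right: Optional["Node"] = None     # branch taken when feature value > threshold
    prediction: Optional[int] = None   # class label stored at a leaf node

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None
```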

Advantages of Decision Tree 

1. Complexity: It is capable of capturing complex nonlinear relationships in the data. It is therefore a better choice than Linear Regression when a linear model underfits, since Linear Regression works well only when the features have a roughly linear relationship with the target variable (a small sketch illustrating this appears after this list).

2. Easy Interpretation: Decision Tree output is very easy to understand, even for people from a non-analytical background. Its graphical representation is intuitive, and users can easily relate it to their hypotheses.

3. Multitasking: It can handle both numerical and categorical data, and even multi-output tasks.

4. Non-Parametric: A Decision Tree is a non-parametric model, not because it has no parameters (it often has a lot) but because the number of parameters is not determined prior to training, so the model structure is free to fit closely around the data.

5. White-Box Model: Decision Trees are fairly intuitive and their decisions are easy to interpret. Hence, the Decision Tree model is also called a white-box model.
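As a quick illustration of point 1, the sketch below fits both a LinearRegression and a DecisionTreeRegressor to a noisy sine-shaped target: the linear model underfits, while the tree captures the curve. The data is synthetic and purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic nonlinear data: y = sin(x) + noise (illustrative only).
rng = np.random.RandomState(42)
X = np.sort(5 * rng.rand(200, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(200)

lin = LinearRegression().fit(X, y)
tree = DecisionTreeRegressor(max_depth=4, random_state=42).fit(X, y)

# R^2 on the training data: the linear model cannot follow the sine curve,
# while the tree tracks it much more closely.
print("LinearRegression R^2:", lin.score(X, y))
print("DecisionTreeRegressor R^2:", tree.score(X, y))
```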

Decision Tree- Algorithms 

1. ID3 (Iterative Dichotomiser 3): Developed by Ross Quinlan, it creates a multiway tree, choosing the split at each node with a greedy algorithm. Trees are grown to their maximum size before pruning.

2. C4.5: It succeeded ID3 and removed the restriction that features must be categorical by dynamically defining discrete attributes that partition numerical attributes. It converts the trained trees into sets of if-then rules, and the accuracy of each rule is evaluated to determine the order in which they should be applied.

3. C5.0: It is Quinlan's latest version; it uses less memory and builds smaller rule sets than C4.5 while being more accurate.

4. CART (Classification & Regression Trees): It is similar to C4.5, but it supports numerical target variables (regression) and does not compute rule sets. It builds binary trees. Scikit-Learn uses the CART algorithm to build its Decision Tree models.

We will talk only about the CART Algorithm in this blog. 

The CART Training Algorithm 

Scikit-Learn uses the Classification And Regression Tree (CART) algorithm to train Decision Trees. The idea is really quite simple: the algorithm splits the training set into two subsets using a single feature k and a threshold tk, searching for the pair (k, tk) that yields the maximum information gain; the first such split forms the root node. It then splits each subset recursively in the same way, stopping once it reaches the maximum depth or when it cannot find a split that reduces impurity. The main objective at each split is therefore to maximize the information gain.
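To make this concrete, here is a rough, unoptimized sketch of a single split step: it brute-forces every feature k and every candidate threshold tk and keeps the pair with the largest information gain (CART then recurses on the two resulting subsets). It uses Gini impurity, which is explained below; this is an illustration of the idea, not Scikit-Learn's actual implementation.

```python
import numpy as np

def gini(y):
    # Gini impurity of a set of class labels.
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Find the (feature k, threshold tk) pair with the highest information gain."""
    n_samples, n_features = X.shape
    parent_impurity = gini(y)
    best = (None, None, 0.0)                      # (k, tk, information gain)
    for k in range(n_features):
        for tk in np.unique(X[:, k]):
            left, right = y[X[:, k] <= tk], y[X[:, k] > tk]
            if len(left) == 0 or len(right) == 0:
                continue
            # Weighted child impurity, as in the IG formula below.
            child = (len(left) * gini(left) + len(right) * gini(right)) / n_samples
            gain = parent_impurity - child
            if gain > best[2]:
                best = (k, tk, gain)
    return best
```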

Information Gain (IG) is defined as follows: 

It is the difference between the impurity of the parent node and the weighted sum of the child node impurities: the lower the child node impurities, the larger the information gain. Mathematically, it is:

IG(Dp, f) = I(Dp) − Σj (Nj / Np) · I(Dj)

where IG = information gain, I = impurity measure, f = the feature on which the split is performed, Dp = dataset of the parent node, Dj = dataset of the jth child node, Np = total number of samples in the parent node, and Nj = number of samples in the jth child node.
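As a sketch, the same formula can be written as a small helper that takes the impurity measure as a function:

```python
def information_gain(parent, children, impurity):
    """IG = I(Dp) - sum_j (Nj / Np) * I(Dj)."""
    n_parent = len(parent)
    weighted_child_impurity = sum(
        (len(child) / n_parent) * impurity(child) for child in children
    )
    return impurity(parent) - weighted_child_impurity
```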

There are two impurity measures with which the CART algorithm finds the split that yields the maximum information gain:

1. Information Gain using Entropy 

It measures the average information content of a message: entropy is zero when all messages are identical. In Machine Learning, it is frequently used as an impurity measure: a set's entropy is zero when it contains instances of only one class. Mathematically, the entropy of node i is given by:

Hi = − Σk pi,k · log2(pi,k)

where pi,k is the ratio of class-k instances among the training instances in node i (summing only over the classes actually present in the node).
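A direct translation of this entropy formula into a small sketch:

```python
import numpy as np

def entropy(y):
    # H = -sum_k p_k * log2(p_k), summed over the classes present in y.
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))
```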

2. Information Gain using Gini Impurity 

It tells us the probability of misclassifying a randomly chosen observation. A node is "pure" (Gini = 0) if all the training instances it applies to belong to the same class. Mathematically, the Gini impurity of node i is given by:

Gi = 1 − Σk (pi,k)²

where pi,k is the ratio of class-k instances among the training instances in node i.

The Gini impurity is maximum if the classes are perfectly mixed. 
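And the corresponding sketch for Gini impurity:

```python
import numpy as np

def gini_impurity(y):
    # G = 1 - sum_k p_k^2; equals 0 for a pure node and is largest
    # when the classes are perfectly mixed (e.g., 0.5 for two balanced classes).
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)
```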

Training and Visualizing a Decision Tree

Now, we will look at an example of how a Decision Tree works on a dataset. For this, we will use the Iris dataset. Scikit-Learn ships with many popular datasets, and the Iris dataset is one of them. It consists of three classes of flowers, namely Virginica, Versicolour, and Setosa (as shown in Figure-2). The dataset has four features: sepal length, sepal width, petal length, and petal width. On the basis of these four features, we have to classify which class of flower each sample belongs to.

Figure-2: Iris Dataset

We have trained a Decision Tree on the Iris dataset (for the coding part, refer to the Jupyter Notebook Decision Tree- A Multiuse Algorithm attached at the end of the article), and it gives a tree like this:

Figure-3: Visualization of Decision Tree 
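The full code is in the attached notebook, but a minimal sketch of how a tree like the one in Figure-3 can be trained and drawn with Scikit-Learn looks roughly like this:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

iris = load_iris()
X, y = iris.data, iris.target

# No restrictions on depth or leaf size: the tree grows until every leaf is pure.
tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X, y)

plt.figure(figsize=(12, 8))
plot_tree(tree_clf, feature_names=iris.feature_names,
          class_names=iris.target_names, filled=True)
plt.show()
```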

In Figure-3, a node's samples attribute counts how many training instances it applies to. A node's value attribute tells you how many training instances of each class the node applies to. A node's gini attribute measures its impurity. This tree was grown without any restrictions. The accuracy score of this Decision Tree on the training set is 1.0, but after cross-validation (cv=10) the accuracy drops to 0.9533, which clearly shows that our model is overfitting the training dataset. So, we have to constrain the model to make it simpler and reduce the risk of overfitting by regularizing the hyperparameters of the Decision Tree model.
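A sketch of how those two scores can be computed (the exact cross-validation number depends on the notebook's setup):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree_clf = DecisionTreeClassifier(random_state=42).fit(X, y)

# Training accuracy of the unrestricted tree (1.0: it memorizes the training set).
print("Train accuracy:", tree_clf.score(X, y))

# 10-fold cross-validation gives a more honest estimate of generalization.
scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=10)
print("CV accuracy:", scores.mean())
```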

Regularizing Hyperparameters

To avoid overfitting the training data, you need to restrict the Decision Tree's freedom during training. The regularization hyperparameters depend on the algorithm used, but generally you can at least restrict the maximum depth of the Decision Tree. In Scikit-Learn, this is controlled by the max_depth hyperparameter (the default value is None, which means unlimited). Reducing max_depth will regularize the model and thus reduce the risk of overfitting. The DecisionTreeClassifier class has a few other parameters that similarly restrict the shape of the Decision Tree: min_samples_split (the minimum number of samples a node must have before it can be split), min_samples_leaf (the minimum number of samples a leaf node must have), min_weight_fraction_leaf (the same as min_samples_leaf, but expressed as a fraction of the total number of weighted instances), max_leaf_nodes (the maximum number of leaf nodes), and max_features (the maximum number of features that are evaluated for splitting at each node). Increasing the min_* hyperparameters or reducing the max_* hyperparameters will regularize the model. For better regularization, we should use GridSearchCV, so that we can get the best hyperparameters for our model. After regularizing the Decision Tree, we get the following visualization:

Figure-4: Visualization of Regularized Decision Tree 

For the coding part, refer to the Jupyter Notebook Decision Tree- A Multiuse Algorithm at the end of this article.
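Again, the complete code is in the notebook; a sketch of what such a grid search could look like is shown below. The parameter grid here is an assumption for illustration, not necessarily the one used in the notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hypothetical grid over a few of the regularization hyperparameters listed above.
param_grid = {
    "max_depth": [2, 3, 4, 5, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

grid = GridSearchCV(DecisionTreeClassifier(random_state=42),
                    param_grid, cv=10, scoring="accuracy")
grid.fit(X, y)

print("Best hyperparameters:", grid.best_params_)
print("Best CV accuracy:", grid.best_score_)
```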

Here is a summary from which we can easily see the difference in accuracy scores between the non-regularized and regularized models.

Figure-5: Summary table of Regularized and Non-Regularized models

From the above table, we can observe that the Decision Tree with regularized hyperparameters performs better than the non-regularized one.

With this, I would like to conclude my blog. But keep in mind that the Decision Tree Algorithm is a very vast concept, and I have given only a glimpse of it. So, if you are really interested in this concept, you will have to dive deeper into it. I would also welcome any questions you may have. Till then, Happy Learning!

Download Code: