Decision Tree in ML
The Decision Tree algorithm belongs to the family of supervised machine learning algorithms. It is simple and popular, and, as its name suggests, it is based on choices. The algorithm works with both continuous and categorical variables, i.e. it can be used for both classification and regression problems. The article unfolds in three sections, commencing with a general introduction of the algorithm, followed by the mathematical concepts that build the blocks for the same, and concluding with a code walkthrough on building a Decision Tree model in Python using a loan approval data set.
1. Introduction
A Decision Tree learns decision rules inferred from the prior data that we use to train it. The motive is to predict the target variable, or the class, based on these decision rules. As the name suggests, the algorithm solves problems by using a tree representation: each internal node of the tree corresponds to an attribute, and each leaf node corresponds to a class label.
1.1. Terminology
Root Node: It represents the entire population or sample and this further gets divided into two or more homogeneous sets.
Splitting: It is a process of dividing a node into two or more sub-nodes.
Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.
Leaf / Terminal Node: Nodes that do not split are called leaf or terminal nodes.
Pruning: When we remove sub-nodes of a decision node, the process is called pruning; it can be seen as the opposite of splitting.
Branch / Sub-Tree: A subsection of the entire tree is called a branch or sub-tree.
Parent and Child Node: A node which is divided into sub-nodes is called the parent node of those sub-nodes, whereas the sub-nodes are the children of the parent node.
Figure 1. A sample Decision Tree
2. Pseudocode and Assumptions
The best attribute should be set as the root of the tree.
Split the data set into subsets in such a way that each subset contains data with the same value for an attribute.
Repeat the above two steps until the leaf nodes of all the branches of the tree are found.
For predicting a class label for a record, we start from the root of the tree and compare the value of the root attribute with the corresponding attribute of the record. On the basis of this comparison, we follow the branch corresponding to that value and jump to the next node. The process continues until a leaf node is reached.
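To make the traversal concrete, below is a minimal sketch in Python; the nested-dictionary tree representation is a hypothetical one chosen purely for illustration, not the format used by any particular library:

# Minimal prediction sketch; the tree format is an assumption made for illustration:
# {"attribute": <name>, "branches": {<value>: subtree-or-class-label, ...}}
def predict_record(tree, record):
    # Keep descending while the current node is an internal (dict) node.
    while isinstance(tree, dict):
        attribute = tree["attribute"]
        value = record[attribute]
        # Follow the branch matching the record's value for this attribute.
        tree = tree["branches"][value]
    # A non-dict node is a leaf: it holds the class label.
    return tree

# Tiny hand-built example on a hypothetical attribute "Credit_History".
toy_tree = {"attribute": "Credit_History", "branches": {1.0: "Y", 0.0: "N"}}
print(predict_record(toy_tree, {"Credit_History": 1.0}))  # -> "Y"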
We first assume that the whole training set is taken as the root, and that the feature values are categorical; in case the values are continuous, they are discretized prior to model training. Records are then distributed recursively over the branches, so that records of a similar type end up grouped together. Decision Trees follow a Sum of Products (SOP) representation, also known as Disjunctive Normal Form: for a class, every branch from the root of the tree to a leaf node carrying that class is a conjunction (product) of attribute values, and the different branches ending in that class form a disjunction (sum).
3. Attribute Selection and Mathematical Elucidations
Let us suppose that there are "k" attributes in the data set; our first step, as discussed above, is to select the attribute that will be set as the root of the tree. Random selection of the root and subsequent nodes would not be a good idea, since the decision of making strategic splits heavily affects a tree's accuracy. For this purpose, researchers have come up with several criteria for attribute selection, to name a few:
- Entropy
- Information Gain
- Gini Index
3.1. Entropy
The degree of randomness in the information given by an attribute is measured by entropy. The higher the entropy, the more ambiguous it becomes to draw any concrete conclusion from the information. Generally, a branch with zero entropy is placed at a terminal node, and branches with more than zero entropy need to be split further. Mathematically,

E(S) = -Σ p_i log2(p_i)

where S is the current state and p_i is the probability of an event (class) i in the state S. For multiple attributes, the entropy after splitting on an attribute X is the weighted sum

E(T, X) = Σ_c P(c) E(c),

where T is the current state and the sum runs over the values c of the attribute X.
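As an illustration, a small sketch of these calculations in Python, using pandas and NumPy, could look as follows (this is illustrative code added here, not part of the original walkthrough):

import numpy as np
import pandas as pd

def entropy(labels):
    # Entropy of a collection of class labels: -sum(p_i * log2(p_i)).
    p = pd.Series(labels).value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

def entropy_after_split(df, attribute, target):
    # Weighted entropy of the target after splitting on an attribute:
    # sum over each attribute value c of P(c) * E(subset where attribute == c).
    total = len(df)
    return sum(len(part) / total * entropy(part[target])
               for _, part in df.groupby(attribute))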
3.2. Information Gain
Information gain or IG is a statistical property that measures how well a given attribute separates the training examples according to their target classification. The building of a Decision Tree is based on finding the attribute that has the highest information gain. It computes the difference between entropy before split and average entropy after split of the dataset based on given attribute values.
Mathematically,

Gain(T, X) = Entropy(T) - Entropy(T, X),

where T is the state before the split and Entropy(T, X) is the weighted entropy of the state after splitting on the attribute X.
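Using the helper functions sketched above, information gain can be computed as the drop in entropy caused by the split (again an illustrative sketch, not the article's original code):

def information_gain(df, attribute, target):
    # Entropy before the split minus weighted entropy after splitting on `attribute`.
    return entropy(df[target]) - entropy_after_split(df, attribute, target)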
3.3. Gini Index
The Gini Index evaluates the splits of the dataset by using a cost function. It is calculated by subtracting the sum of the squared probabilities of each class from 1. Unlike information gain, it favours larger partitions, is simpler to implement, and it performs binary splits only. With this definition, a lower Gini index indicates higher homogeneity (a value of 0 means the node is pure). Mathematically,

Gini = 1 - Σ p_i²,

where p_i is the probability of class i in the node.
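A corresponding sketch of the Gini index calculation for a set of class labels, written in the same style as the entropy helpers above:

def gini_index(labels):
    # Gini impurity: 1 - sum of squared class probabilities.
    p = pd.Series(labels).value_counts(normalize=True)
    return float(1.0 - (p ** 2).sum())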
4. Decision Tree using Python
As discussed previously, we shall build a simple Decision Tree model in this section. For this we use a loan approval dataset. The dataset could be used for classification as well as regression; however, we shall use a Decision Tree classifier to predict whether a loan should be approved or not.
Read the data:
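A minimal sketch of loading the data with pandas is given below; the file name "loan_approval.csv" is an assumption, since the actual file name is not given in the text:

import pandas as pd

# Read the loan approval data (file name assumed).
df = pd.read_csv("loan_approval.csv")
print(df.head())  # first few rows of the data
df.info()         # column names, types and non-null counts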
The output shows the attributes of the data, which are fairly self-explanatory; with their help we will be predicting Loan_Status, i.e. whether the loan will be approved or not.
As we see, there is missing data in the set. One option is to simply drop such variables, but that comes with the cost of losing a lot of degrees of freedom. Here, we will use simple methods to impute the missing values. It is not necessary to follow exactly the preprocessing shown here; for simplicity, let us do the following steps to modify the data.
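A sketch of the imputation steps described below is given here; the column names (Gender, Married, Credit_History, LoanAmount, Loan_Amount_Term, Self_Employed, Dependents) follow the usual loan-prediction naming and are an assumption on my part:

# Fill missing values; column names are assumed and may differ in the actual file.
df["Gender"] = df["Gender"].fillna("unknown")
df["Married"] = df["Married"].fillna("unknown")
df["Credit_History"] = df["Credit_History"].fillna(df["Credit_History"].mean())
df["LoanAmount"] = df["LoanAmount"].fillna(df["LoanAmount"].mean())
df["Loan_Amount_Term"] = df["Loan_Amount_Term"].fillna(df["Loan_Amount_Term"].mean())
df["Self_Employed"] = df["Self_Employed"].fillna("No")
df["Dependents"] = df["Dependents"].fillna(0)
df.info()  # confirm that all fields are now non-null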
As shown above, missing gender and marital status are filled with "unknown"; credit history, the loan amount and the loan amount term are filled in with the mean of the respective variables; whether the person is self-employed defaults to "No"; and the number of dependents is set to 0 where missing. After the preprocessing, all the fields are non-null, so we will go ahead with some more feature extraction.
Let us get the summary statistics of the data.
A simple describe() does that for you in Python; it is given as follows:
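Continuing the illustrative code from above:

# Summary statistics of the numeric columns.
print(df.describe())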
Loan status as discussed before is the target variable and is labelled in 0s and 1s.
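One possible way to obtain that encoding, assuming the raw labels are "Y"/"N" (the usual convention for this dataset), together with one-hot encoding of the categorical features so that scikit-learn can consume them:

# Encode the target: map the assumed raw labels "Y"/"N" to 1/0.
df["Loan_Status"] = df["Loan_Status"].map({"Y": 1, "N": 0})

# One-hot encode the feature columns; "Loan_ID" (assumed identifier column)
# is dropped if present.
features = df.drop(columns=["Loan_Status", "Loan_ID"], errors="ignore")
X = pd.get_dummies(features)
y = df["Loan_Status"]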
For training and validating the model let’s split the data into training and testing set.
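A sketch of the split using scikit-learn; the random_state value is an arbitrary choice added for reproducibility:

from sklearn.model_selection import train_test_split

# Hold out one third of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=42)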
We are keeping the test size as one third of the whole data set. As a next step we will create the model, first with the Gini criterion and then with the entropy criterion, fit it to the training data and predict for the test data. The code is simple and is as follows:
Model using Gini Criterion
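A sketch of the Gini-criterion model with scikit-learn's DecisionTreeClassifier; hyperparameters other than the criterion are left at their defaults, and random_state is an assumed setting:

from sklearn.tree import DecisionTreeClassifier

# Train a Decision Tree with the Gini criterion and predict on the test set.
clf_gini = DecisionTreeClassifier(criterion="gini", random_state=42)
clf_gini.fit(X_train, y_train)
y_pred = clf_gini.predict(X_test)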
y_pred in the above code is the predicted values of the target.
Let’s evaluate the model.
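A sketch of the evaluation using scikit-learn's standard metrics:

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Accuracy, confusion matrix and per-class precision/recall report.
print("Accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))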
Running this gives the evaluation output: the accuracy with the Gini model is 75%. Let's check what happens if we use the entropy criterion.
Model with Entropy Criterion
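The same sketch with the criterion switched to entropy:

# Train a Decision Tree with the entropy criterion and predict on the test set.
clf_entropy = DecisionTreeClassifier(criterion="entropy", random_state=42)
clf_entropy.fit(X_train, y_train)
y_pred = clf_entropy.predict(X_test)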
y_pred is defined as before, and the evaluation report is obtained in the same way as for the Gini model.
As we can see, using "Entropy" gives a better accuracy than the "Gini" model.