
Thursday, July 13, 2023

Decision Tree Algorithm in Machine Learning

 

Decision Tree in ML


The Decision Tree algorithm belongs to the family of supervised machine learning algorithms. It is simple and popular and, as its name suggests, is based on a series of choices. The algorithm works with both continuous and categorical variables, i.e. it can be used for classification as well as regression problems. The article unfolds in three parts, commencing with a general introduction to the algorithm, followed by the mathematical concepts that form its building blocks, and concluding with a code walk-through of building a Decision Tree model in Python using a loan approval data set.

1. Introduction

A Decision Tree learns decision rules inferred from the prior data we use to train it. The goal is to predict the target variable, or the class, of new records based on these decision rules. As the name suggests, the algorithm solves problems using a tree representation: each internal node of the tree corresponds to an attribute, and each leaf node corresponds to a class label.

1.1. Terminology

Root Node: It represents the entire population or sample, which further gets divided into two or more homogeneous sets.
Splitting: The process of dividing a node into two or more sub-nodes.
Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.
Leaf / Terminal Node: Nodes that do not split are called leaf or terminal nodes.
Pruning: Removing sub-nodes of a decision node is called pruning. It is the opposite of splitting.
Branch / Sub-Tree: A subsection of the entire tree is called a branch or sub-tree.
Parent and Child Node: A node that is divided into sub-nodes is called the parent node of those sub-nodes, and the sub-nodes are its children.




Figure 1. A sample Decision Tree




2. Pseudocode and Assumptions
Place the best attribute at the root of the tree.
Split the data set into subsets such that each subset contains records with the same value for that attribute.
Repeat the above two steps on each subset until the leaf nodes of all branches of the tree are found.
To predict a class label for a record, we start from the root of the tree and compare the value of the root attribute with the record's attribute. Based on this comparison, we follow the branch corresponding to that value and jump to the next node. The process continues until a leaf node is reached.
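
As a toy illustration of this traversal (a hypothetical hand-built tree, not part of the original post), the tree can be stored as nested dictionaries and a record classified by repeatedly following the branch that matches its attribute value:

# Hypothetical tree: internal nodes test an attribute, leaves are class labels.
tree = {
    "attribute": "Credit_History",
    "branches": {
        1.0: {"attribute": "Education",
              "branches": {"Graduate": "Y", "Not Graduate": "N"}},
        0.0: "N",
    },
}

def predict(node, record):
    # Walk from the root, following the branch that matches the record's value,
    # until a leaf (a plain class label) is reached.
    while isinstance(node, dict):
        node = node["branches"][record[node["attribute"]]]
    return node

print(predict(tree, {"Credit_History": 1.0, "Education": "Graduate"}))  # -> Y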

We first assume that the whole training set is taken as the root. Feature values are preferred to be categorical; if the values are continuous, they are discretized prior to building the model. Records are then distributed recursively over the sub-nodes on the basis of attribute values. Decision Trees follow a Sum of Products (SOP) representation, also known as Disjunctive Normal Form: for a given class, every branch from the root to a leaf node with that class label is a conjunction (product) of attribute values, and the different branches ending in that class form a disjunction (sum).

3. Attribute Selection and Mathematical Elucidations

Let us suppose there are “k” attributes in the data set. Our first step, as discussed above, is to select the attribute that will become the root of the tree. Picking the root and subsequent nodes at random is not a good idea, because the choice of splits heavily affects a tree’s accuracy. For this purpose, researchers have come up with several attribute selection criteria, to name a few:
  • Entropy
  • Information Gain
  • Gini Index
Each of these criteria assigns a value to every attribute. The values are sorted, and attributes are placed in the tree in that order, i.e. the attribute with the best value (the highest, in the case of information gain) is placed at the root.

3.1. Entropy
Entropy measures the degree of randomness (impurity) in the information given by an attribute. The higher the entropy, the harder it is to draw a concrete conclusion from that information. In the tree, a branch with zero entropy ends in a terminal node, while branches with entropy greater than zero need further splitting. Mathematically,
E(S) = \sum_{i} -p_i \log_2 p_i

where S is the current state and p_i is the probability of class i in state S. For a split on an attribute X, the entropies of the resulting subsets are combined as a weighted average:

E(T, X) = \sum_{c \in X} P(c) \, E(c)

where P(c) is the fraction of the records of T that fall into subset c.
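
As a quick sketch (toy labels only, not the loan data), the entropy of a set of class labels can be computed directly from the class proportions:

import numpy as np

def entropy(labels):
    # E(S) = sum over classes of -p_i * log2(p_i)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

print(entropy(["Y"] * 9 + ["N"] * 5))   # ~0.94 : mixed set, high uncertainty
print(entropy(["Y"] * 13 + ["N"] * 1))  # ~0.37 : nearly pure set, low uncertainty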

3.2. Information Gain

Information gain (IG) is a statistical property that measures how well a given attribute separates the training examples according to their target classification. Building a Decision Tree revolves around finding the attribute with the highest information gain. It is computed as the difference between the entropy before the split and the weighted average entropy after splitting the dataset on the given attribute's values.

\text{Information Gain}(T, X) = \text{Entropy}(T) - \text{Entropy}(T, X)
           

Where T is the state before split and X is the state after split.
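
A minimal, self-contained sketch of this difference (toy data and hypothetical column names, not the loan data set):

import numpy as np
import pandas as pd

def entropy(series):
    # E(S) = sum over classes of -p_i * log2(p_i)
    p = series.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

def information_gain(df, attribute, target):
    # Entropy(T) minus the weighted average entropy of the subsets
    # produced by splitting on `attribute`.
    weighted = sum(len(subset) / len(df) * entropy(subset[target])
                   for _, subset in df.groupby(attribute))
    return entropy(df[target]) - weighted

# Toy example: Credit_History separates the classes fairly well here (gain ~0.46).
toy = pd.DataFrame({"Credit_History": [1, 1, 1, 1, 0, 0],
                    "Loan_Status":    ["Y", "Y", "Y", "N", "N", "N"]})
print(information_gain(toy, "Credit_History", "Loan_Status"))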

3.3. Gini Index

The Gini index evaluates splits of the dataset by means of a cost (impurity) function: it is calculated by subtracting the sum of the squared class probabilities from 1. Unlike information gain, it favours larger partitions, is simpler to compute, and performs binary splits only (as in CART). The lower the Gini index, the higher the homogeneity of a node; a value of 0 means all records in the node belong to a single class. Mathematically,
Gini = 1 - \sum_{i} (p_i)^2
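
And a corresponding sketch for the Gini index (same toy labels as in the entropy snippet above):

import numpy as np

def gini(labels):
    # Gini = 1 - sum over classes of p_i^2
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

print(gini(["Y"] * 9 + ["N"] * 5))   # ~0.46 : impure node
print(gini(["Y"] * 14))              # 0.0   : perfectly homogeneous node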



4. Decision Tree using Python

In this section we build a simple Decision Tree model, using a loan approval dataset. The dataset could be framed as either a classification or a regression problem; here we use a Decision Tree classifier to predict whether a loan should be approved or not.

# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split 

Read the data:

#import the data
df = pd.read_csv("train.csv")
df.info()


Python gives output as follows:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
Loan_ID              614 non-null object
Gender               601 non-null object
Married              611 non-null object
Dependents           599 non-null object
Education            614 non-null object
Self_Employed        582 non-null object
ApplicantIncome      614 non-null int64
CoapplicantIncome    614 non-null float64
LoanAmount           592 non-null float64
Loan_Amount_Term     600 non-null float64
Credit_History       564 non-null float64
Property_Area        614 non-null object
Loan_Status          614 non-null object
dtypes: float64(4), int64(1), object(8)
memory usage: 62.4+ KB

The attributes shown above are largely self-explanatory; using them, we will predict Loan_Status, i.e. whether a loan will be approved or not.
As we can see, there is missing data in the set. One option is to drop the affected records or variables, but that comes at the cost of losing information. Here, we will use simple methods to impute the missing values. These are by no means the only reasonable preprocessing choices; for simplicity, let us do the following:


#Data Preprocessing

df["Gender"].fillna("unknown",inplace = True)
df["Married"].fillna("unknown",inplace = True)
df["Credit_History"].fillna(df.Credit_History.mean(),inplace = True)
df["Dependents"].fillna(0,inplace = True)
df["LoanAmount"].fillna(df.LoanAmount.mean(),inplace = True)
df["Self_Employed"].fillna("No",inplace = True)
df["Loan_Amount_Term"].fillna(round(df.Loan_Amount_Term.mean()),inplace=True)

As shown above, missing gender and marital status are filled with “unknown”; credit history, loan amount, and loan amount term are filled with the mean of each variable; self-employment status is set to “No”; and the number of dependents is set to 0 where missing. After the preprocessing, we have:


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
Loan_ID              614 non-null object
Gender               614 non-null object
Married              614 non-null object
Dependents           614 non-null object
Education            614 non-null object
Self_Employed        614 non-null object
ApplicantIncome      614 non-null int64
CoapplicantIncome    614 non-null float64
LoanAmount           614 non-null float64
Loan_Amount_Term     614 non-null float64
Credit_History       614 non-null float64
Property_Area        614 non-null object
Loan_Status          614 non-null object
dtypes: float64(4), int64(1), object(8)
memory usage: 62.4+ KB

All the fields are now non-null, so we will go ahead with a little more feature preparation.


In[]: df["Dependents"]= df["Dependents"].map({"0":int(0),"1":int(1),"2":int(2),"3+":int(3)})
Dependents is the number of people dependent on the loan applicant. Since they are ordinal in nature we shall just map it to integer numbers.    

Let us get the summary statistics of the data.


df.describe()

A simple call to df.describe() does that for you in Python; the output (values shown rounded to whole numbers) is as follows:




       Dependents  ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  Credit_History  Loan_Status
count         599              614                614         614               614             614          614
mean            1             5403               1621         146               342               1            0
std             1             6109               2926          84                64               0            0
min             0              150                  0           9                12               0            0
25%             0             2878                  0         100               360               1            0
50%             0             3813               1189         129               360               1            0
75%             2             5795               2297         165               360               1            1
max             3            81000              41667         700               480               1            1


Loan status as discussed before is the target variable and is labelled in 0s and 1s.
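
The original post does not show how the feature matrix X and target vector y used below are built. A minimal sketch, assuming Loan_Status is mapped from Y/N to 1/0 (if it has not been already) and the remaining categorical columns are one-hot encoded, could look like this:

# Hypothetical encoding step (not shown in the original post).
if df["Loan_Status"].dtype == object:
    df["Loan_Status"] = df["Loan_Status"].map({"Y": 1, "N": 0})

# The Dependents mapping above left a few NaN values; patch them for this sketch.
df["Dependents"] = df["Dependents"].fillna(0)

# Drop the identifier column and one-hot encode the remaining categorical features.
X = pd.get_dummies(df.drop(columns=["Loan_ID", "Loan_Status"]))
y = df["Loan_Status"]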


For training and validating the model, let’s split the data into a training and a testing set.


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

We keep the test size at one third of the whole data set. Next, we create the model, first with the “gini” criterion and then with the “entropy” criterion, fit it to the training data, and predict on the test data. The code is simple and is as follows:

Model using Gini Criterion

clf_gini = DecisionTreeClassifier(criterion="gini", random_state=100, max_depth=3, min_samples_leaf=5)
In[] : clf_gini.fit(X_train, y_train)
Out[]: DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=5, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=100,
            splitter='best')
y_pred = clf_gini.predict(X_test)

y_pred in the above code holds the predicted values of the target.

Let’s evaluate the model.

print("Confusion Matrix: ", confusion_matrix(y_test, y_pred))
print ("Accuracy : ",accuracy_score(y_test,y_pred)*100)
print("Report : ", classification_report(y_test, y_pred))
 


This gives the following output:


Confusion Matrix:  [[129   2]
 [ 48  24]]

Accuracy :  75.36945812807882
Report :               precision    recall  f1-score   support

          0       0.73      0.98      0.84       131
          1       0.92      0.33      0.49        72

avg / total       0.80      0.75      0.71       203





The accuracy of the Gini model is about 75% ((129 + 24) correct predictions out of 203 test records, i.e. 75.4%). Let’s check what happens if we use the entropy criterion.

Model with Entropy Criterion

clf_ent = DecisionTreeClassifier(criterion="entropy", random_state=42,
                                 max_depth=3, min_samples_leaf=10)

clf_ent.fit(X_train, y_train)

Out[]: DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=10, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=42,
            splitter='best')


y_pred = clf_ent.predict(X_test)

y_pred is defined as before. The evaluation report is generated the same way:


print("Confusion Matrix: ", confusion_matrix(y_test, y_pred))
print ("Accuracy : ",accuracy_score(y_test,y_pred)*100)
print("Report : ", classification_report(y_test, y_pred))

The output is as follows:

Confusion Matrix:  [[129   2]
 [ 39  33]]
Accuracy :  79.80295566502463
Report :               precision    recall  f1-score   support

          0       0.77      0.98      0.86       131
          1       0.94      0.46      0.62        72

avg / total       0.83      0.80      0.78       203

As we can see, the “entropy” criterion gives better accuracy than the “gini” model on this test set ((129 + 33) correct predictions out of 203, i.e. about 79.8%).
