
Thursday, July 13, 2023

Machine Learning Model Evaluation Methods

 

Introduction

In both business and research, as we become more and more dependent on machine learning to solve problems of strategic importance, the predictions that a machine learning model makes must add real value to organizations. We may have enough data to train a model, yet the model may still fail to generalize to the new data on which we will want to make predictions in the future.

Evaluating the model on the same data that we used for training tells us little about whether the model is a proper fit. The model has already learned from the training data, so when we ask it to predict on that same data it effectively remembers the answers and appears to perform well. This is called overfitting.

Our intention must be to check whether the model generalizes well: whether it can predict correctly for new data, whether it actually works, and how trustworthy its predictions are. For this purpose, we use a separate dataset, called the test set, to evaluate the model that we have trained.

Types of Evaluation Methods

The two major ways in which a model is evaluated are 1. Holding Out and 2. Cross Validation.
  1. Holding Out - Consider a full dataset. In the hold-out method of validation, we use a large part of the data, say 80%, to train the model and test on the remaining 20%. This is called a train-test split. A further subset, called the validation set, can also be taken from the data; on it we try the model and draw conclusions about how to improve its performance by changing the model's specifications. Not all modelling exercises need a validation set.
  2. Cross Validation - In this method we split the data into k different folds, dividing it into k equally sized sets. The choice of k depends on the length of the dataset and may vary according to the analyst's judgement. The model is validated by keeping the i-th fold as the test set and using the remaining folds for training. This is repeated k times, each time changing which fold is held out. All metrics that indicate the performance of the model are calculated k times, and their average is taken as the final indication of model performance. Besides the k-fold cross validation discussed above, there is also Leave-One-Out Cross Validation (LOOCV), in which the data is divided into "n" parts, "n" being the length of the data. Testing is done on one point while the other "n-1" points are used for training, and a process similar to k-fold validation is followed to assess the performance of the model. A brief code sketch of both the hold-out and cross-validation approaches is shown below.
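As a minimal sketch, assuming scikit-learn is available and using simulated data (the 80/20 split, k = 5 and the linear model are arbitrary illustrative choices):

# Sketch: hold-out split and k-fold cross-validation with scikit-learn.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression

# Simulated data standing in for a real dataset.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
model = LinearRegression()

# 1. Hold-out: train on 80% of the data, test on the remaining 20%.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model.fit(X_train, y_train)
print("Hold-out R^2 on the test set:", model.score(X_test, y_test))

# 2. k-fold cross-validation: the metric is computed k times and averaged.
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print("Mean R^2 across 5 folds:", scores.mean())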

Methods of Model Evaluation 

We can list numerous methods of evaluating a model's performance. The main goal is to check how low the error in the model's predictions is, and there are different techniques for assessing this error; the aim of building an optimal model is to reduce the error as much as possible. Regression models and classification models are evaluated through different metrics, and the choice of metric differs from model to model. This section explains some of the important methods used for evaluating machine learning models.

Methods for Regression:

a. Goodness of Fit, RMSE & MAPE:
 In a regression model, whether the variation in the target variable is well explained by the inputs we give to the model is judged by R², the goodness of fit, which is given by the following:

                                R² = 1 - Σ(y - y_hat)² / Σ(y - y_bar)²
     
Where y is the target variable, y_hat is the predicted value and y_bar is the mean of the target. The higher the score, the better the input data fit the target. A value of 0.9 indicates that the inputs explain 90% of the variation in the target variable.
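For a quick illustration, R² can be computed with scikit-learn's r2_score on a handful of made-up values:

# Sketch: R^2 for a few made-up actual and predicted values.
from sklearn.metrics import r2_score

y_true = [3.0, 5.0, 7.5, 10.0, 12.0]
y_pred = [2.8, 5.4, 7.0, 9.5, 12.6]
print("R^2:", r2_score(y_true, y_pred))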

RMSE, or Root Mean Squared Error, is defined as the square root of the mean of the squared errors (the same squared errors that are summed in the numerator of the equation above). It is defined as:

                                           RMSE = sqrt( (1/n) * Σ(y - y_hat)² )

The lower the RMSE, the better the performance of the model.
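A minimal NumPy sketch of the same calculation, with made-up values:

# Sketch: RMSE computed directly from its definition.
import numpy as np

y_true = np.array([3.0, 5.0, 7.5, 10.0, 12.0])
y_pred = np.array([2.8, 5.4, 7.0, 9.5, 12.6])
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print("RMSE:", rmse)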

The mean absolute percentage error (MAPE) is a statistical measure of how accurate a forecast system is. It expresses accuracy as a percentage: for each time period, the absolute difference between the actual value and the forecast is divided by the actual value, and these percentages are then averaged. Where At is the actual value and Ft is the forecast value over n periods, this is given by:

                                           MAPE = (1/n) * Σ |At - Ft| / |At| * 100%

MAPE is one of the most common measures of forecast error, and it works best when the data contain no extreme values (and no actual values equal to zero).
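A short sketch of the MAPE calculation with NumPy, using made-up actual and forecast values (MAPE divides by the actual values, so none of them may be zero):

# Sketch: MAPE as the average absolute percentage deviation of the forecast from the actual.
import numpy as np

actual = np.array([100.0, 120.0, 90.0, 110.0])    # A_t
forecast = np.array([110.0, 115.0, 80.0, 108.0])  # F_t
mape = np.mean(np.abs((actual - forecast) / actual)) * 100
print("MAPE (%):", mape)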

b. Information Criteria:
Mainly two information criteria are used to evaluate the performance of a model. Note that these criteria apply chiefly to models estimated by Maximum Likelihood Estimation, and they are also used to select the best model among the many models we may have tried. The two information criteria are as follows:

I. Akaike Information Criterion (AIC): This is a method of scoring and selecting models. The score statistic is given by:

                                                      AIC = -2/N * LL + 2 * k/N
 
Where N is the number of examples in the training dataset, LL is the log-likelihood of the model on the training dataset, and k is the number of parameters in the model. The model with the lowest AIC is selected. Compared to the BIC method (below), the AIC statistic penalizes complex models less, meaning that it may put more emphasis on model performance on the training dataset, and, in turn, select more complex models.

II. Bayesian Information Criterion (BIC): Similar to AIC, this is also a method of scoring and selecting the models, and is based on Bayesian probability and inference.
                                            BIC = -2 * LL + log(N) * k

Where log() is the natural logarithm (base e), LL is the log-likelihood of the model, N is the number of examples in the training dataset, and k is the number of parameters in the model. The model with the lowest BIC is selected.
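As a rough sketch, both criteria can be computed by hand for an ordinary least-squares model if we assume Gaussian errors, so that the log-likelihood has a closed form. The data below are simulated, and whether the error variance is counted in k is a matter of convention:

# Sketch: AIC and BIC for a least-squares fit, assuming Gaussian errors.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
model = LinearRegression().fit(X, y)

n = len(y)
k = X.shape[1] + 1                                  # coefficients + intercept
resid = y - model.predict(X)
sigma2 = np.mean(resid ** 2)                        # ML estimate of the error variance
ll = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)    # Gaussian log-likelihood

aic = -2 / n * ll + 2 * k / n                       # normalized AIC, as in the formula above
bic = -2 * ll + np.log(n) * k
print("AIC:", aic, "BIC:", bic)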

Methods for Classification

Unlike regression, classification problems deal with correctly identifying the classes that the algorithm is supposed to predict, so evaluation based on numeric error makes little sense. The methods through which we evaluate these models are as follows:

Classification Report 
A classification report is used to judge the predictions made by the algorithm: how many predictions are true and how many are false. More specifically, True Positives, False Positives, True Negatives and False Negatives are used to compute the metrics of a classification report. Take, for example, a classifier that predicts the classes Iris-setosa, Iris-versicolor and Iris-virginica; a sketch of generating such a report is shown below.
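A minimal sketch of producing such a report with scikit-learn on the Iris dataset (the logistic regression model and the 70/30 split are illustrative choices):

# Sketch: a classification report for the three Iris classes.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Prints precision, recall, F1-score and support for each class.
print(classification_report(y_test, y_pred, target_names=iris.target_names))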


There are four ways to check if the predictions are right or wrong:
  1. TN / True Negative: when a case was negative and predicted negative
  2. TP / True Positive: when a case was positive and predicted positive
  3. FN / False Negative: when a case was positive but predicted negative
  4. FP / False Positive: when a case was negative but predicted positive
I. Precision
Measures the ability of a classifier not to label an instance positive that is actually negative. For each class it is defined as the ratio of true positives to the sum of true and false positives.
                              Precision = TP/(TP + FP)


II. Recall
Measures the ability of a classifier to find all positive instances. For each class it is defined as the ratio of true positives to the sum of true positives and false negatives.
                              Recall = TP/(TP+FN)


III. F1-Score
The F1 score is a weighted harmonic mean of precision and recall such that the best score is 1.0 and the worst is 0.0.
                 F1 Score = 2*(Recall * Precision) / (Recall + Precision)
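To make the three formulas concrete, here is a small sketch that computes precision, recall and F1 directly from the outcome counts of a made-up binary label set and cross-checks them with scikit-learn:

# Sketch: precision, recall and F1 from TP/FP/FN counts, checked against scikit-learn.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]   # TP = 3, FN = 1, FP = 1, TN = 5

tp, fp, fn = 3, 1, 1
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print("By hand:", precision, recall, f1)

# The library functions give the same values for the positive class.
print("scikit-learn:", precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))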

IV. Area Under ROC curve
The ROC curve is the locus of points obtained by plotting the true positive rate against the false positive rate at various threshold settings.

The larger the area under this curve (the AUC), the better the predictions; an AUC of 0.5 corresponds to random guessing, while an AUC of 1.0 indicates a perfect classifier.
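A brief sketch of computing the points of the ROC curve and the area under it for a binary classifier (the simulated dataset and logistic regression model are illustrative):

# Sketch: ROC curve points and AUC for a binary classifier.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]          # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores)  # one (FPR, TPR) point per threshold
print("AUC:", roc_auc_score(y_test, scores))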
