Introduction
In both business and research, as we become more dependent on machine learning to solve problems of strategic importance, the predictions a machine learning model makes must add real value to organizations. We may have enough data to train a model, yet the model may still fail to generalize to the new data on which we later want to make predictions.
Evaluating the model on the same data that was used for training says little about whether the model is a good fit. The model has already learned from the training data, so when we predict on those same examples it effectively remembers them and appears to perform well; a model that memorizes its training data in this way is said to overfit.
Our real intention is to check whether the model generalizes, that is, whether it predicts correctly on new data and how trustworthy those predictions are. For this purpose we hold back a separate data set, called the test set, and use it to evaluate the model we have trained.
Types of Evaluation Methods
The two major ways in which a model is evaluated are 1. Holding Out and 2. Cross Validation.
- Holding Out - Consider a full dataset. In the hold-out method of validation, we use a large part of the data, say 80%, to train the model and test on the remaining 20%. This is called a train-test split. A further set, called the validation set, can also be taken from the data; on it we try the model and draw conclusions about how to improve its performance by changing the model's specification. Not every modelling exercise needs a separate validation set.
- Cross Validation - In this method we split the data into k folds of equal size. The choice of k depends on the length of the dataset and may vary with the analyst's judgement. The model is validated by keeping the i-th fold as the test set and using the remaining folds for training. This is repeated k times, each time changing which fold is held out. Every metric that indicates the performance of the model is calculated k times, and the average is taken as the final indication of the model's performance. Besides the k-fold cross validation discussed above, there is also Leave-One-Out Cross Validation (LOOCV), in which the data is divided into n parts, n being the length of the data. Testing is done on one point while the other n-1 points are used for training, and a process similar to k-fold validation is followed to assess the performance of the model. Both schemes are sketched in the code below.
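The following is a minimal sketch of the hold-out, k-fold and leave-one-out schemes. It assumes scikit-learn and the Iris dataset purely for illustration; the post itself does not prescribe a library or a model.

```python
# Sketch: hold-out split, 5-fold CV and LOOCV with scikit-learn (assumed).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (KFold, LeaveOneOut, cross_val_score,
                                     train_test_split)

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 1. Hold-out: train on 80% of the data, test on the remaining 20%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
model.fit(X_train, y_train)
print("Hold-out accuracy:", model.score(X_test, y_test))

# 2. k-fold cross validation: k = 5 folds, average the five scores.
kfold_scores = cross_val_score(
    model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=42))
print("5-fold CV accuracy:", kfold_scores.mean())

# 3. Leave-One-Out CV: n folds of size one, n being the length of the data.
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print("LOOCV accuracy:", loo_scores.mean())
```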
Methods of Model Evaluation
We can list numerous methods of evaluating the performance of a model. The main point of evaluation is to check how low the error in the model's predictions is, and there are different techniques for assessing that error; the aim when building an optimal model is to reduce the error as much as possible. Regression models and classification models are evaluated with different metrics, and the choice of metric differs from model to model. This section explains some of the important methods used for machine learning models.
Methods for Regression:
a. Goodness of Fit, RMSE & MAPE:
In a regression model, whether the variation in the target variable is well explained by the inputs we give to the model is judged by the goodness of fit, R^2, which is given by the following:
R^2 = 1 - Σ(y - y_hat)^2 / Σ(y - y_bar)^2
Where y is the target variable, y_hat is the estimated value and y_bar is the mean of the target. The higher the score, the better the fit to the input data. A value of 0.9 indicates that the inputs are able to explain 90% of the variation in the target variable.
RMSE, or Root Mean Squared Error, is the square root of the mean of the squared errors; the sum of squared errors is the numerator of the equation above. It is defined as:
RMSE = sqrt( (1/n) * Σ(y - y_hat)^2 )
The lower the RMSE, the better is the performance of the model.
The mean absolute percentage error (MAPE) is a statistical measure of how accurate a forecast system is. It expresses accuracy as a percentage: the average, over all periods, of the absolute difference between the actual and forecast values divided by the actual value. Where A_t is the actual value and F_t is the forecast value, it is given by:
MAPE = (100/n) * Σ |A_t - F_t| / |A_t|
MAPE is one of the most common measures of forecast error, and it works best when the data contain no extreme values and no actual values at or near zero.
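The sketch below computes all three regression metrics on a small synthetic linear-regression fit. The data, the model and the manual MAPE formula are illustrative assumptions, not part of the post.

```python
# Sketch: R^2, RMSE and MAPE on a toy linear model (synthetic data).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)
X = rng.uniform(1, 10, size=(100, 1))
y = 3.0 * X.ravel() + rng.normal(0, 1.5, size=100)   # y = 3x + noise

y_pred = LinearRegression().fit(X, y).predict(X)

r2 = r2_score(y, y_pred)                        # goodness of fit, R^2
rmse = np.sqrt(mean_squared_error(y, y_pred))   # root mean squared error
mape = np.mean(np.abs((y - y_pred) / y)) * 100  # MAPE; assumes y is never zero

print(f"R^2: {r2:.3f}  RMSE: {rmse:.3f}  MAPE: {mape:.2f}%")
```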
b. Information Criteria:
Two information criteria are mainly used to evaluate the performance of a model. Note that these criteria apply chiefly to models estimated by Maximum Likelihood Estimation, and they are also used to select the best model from the many models we may have tried. The two information criteria are as follows:
I. Akaike Information Criterion (AIC) : This is a method of scoring and selecting models. The score statistic is given by:
AIC = -2/N * LL + 2 * k/N
Where N is the number of examples in the training dataset, LL is the log-likelihood of the model on the training dataset, and k is the number of parameters in the model. The model with the lowest AIC is selected. Compared to the BIC method (below), the AIC statistic penalizes complex models less, meaning that it may put more emphasis on model performance on the training dataset, and, in turn, select more complex models.
II. Bayesian Information Criterion (BIC) : Similar to AIC, this is also a method of scoring and selecting the models, and is based on Bayesian probability and inference.
BIC = -2 * LL + log(N) * k
Where log() is the natural logarithm (base e), LL is the log-likelihood of the model, N is the number of examples in the training dataset, and k is the number of parameters in the model. As with AIC, the model with the lowest BIC is preferred.
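A hedged sketch of computing AIC and BIC for an ordinary least squares fit is shown below, using the two formulas quoted above. It assumes Gaussian residuals for the log-likelihood and counts k as coefficients plus intercept; both choices are illustrative simplifications, not requirements from the post.

```python
# Sketch: AIC and BIC for an OLS model, assuming a Gaussian likelihood.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.7]) + rng.normal(0, 1.0, size=200)

model = LinearRegression().fit(X, y)
resid = y - model.predict(X)

N = len(y)
k = X.shape[1] + 1                     # coefficients plus intercept (simplified)
sse = np.sum(resid ** 2)
# Maximized Gaussian log-likelihood of the OLS fit:
ll = -0.5 * N * (np.log(2 * np.pi) + np.log(sse / N) + 1)

aic = -2.0 / N * ll + 2.0 * k / N      # normalized AIC, as in the text
bic = -2.0 * ll + np.log(N) * k        # BIC, as in the text
print(f"AIC: {aic:.4f}  BIC: {bic:.2f}")
```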
Methods for Classification
Unlike regression, classification problems are about correctly identifying the classes that the algorithm is supposed to predict, so evaluation based on numeric error is no longer meaningful. The methods through which we evaluate these models are as follows:
Classification Report
A classification report is used to judge the predictions made by the algorithm: how many predictions are true and how many are false. More specifically, the counts of True Positives, False Positives, True Negatives and False Negatives are used to compute the metrics in a classification report. Take the following example, where the classes being predicted are Iris-setosa, Iris-versicolor and Iris-virginica.
There are four ways to check if the predictions are right or wrong:
- TN / True Negative: when a case was negative and predicted negative
- TP / True Positive: when a case was positive and predicted positive
- FN / False Negative: when a case was positive but predicted negative
- FP / False Positive: when a case was negative but predicted positive
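As a quick illustration of these four counts, the snippet below pulls them out of a confusion matrix. The toy labels and the use of scikit-learn are assumptions made for the sketch.

```python
# Sketch: extracting TN, FP, FN, TP for a binary problem (toy labels).
from sklearn.metrics import confusion_matrix

y_true = [0, 1, 1, 0, 1, 0, 1, 1, 0, 0]
y_pred = [0, 1, 0, 0, 1, 1, 1, 1, 0, 0]

# confusion_matrix returns [[TN, FP], [FN, TP]] for labels (0, 1).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```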
I. Precision
Measures the ability of a classifier not to label as positive an instance that is actually negative. For each class it is defined as the ratio of true positives to the sum of true and false positives.
Precision = TP/(TP + FP)
II. Recall
Measures the ability of a classifier to find all positive instances. For each class it is defined as the ratio of true positives to the sum of true positives and false negatives.
Recall = TP/(TP+FN)
III. F1-Score
The F1 score is a weighted harmonic mean of precision and recall such that the best score is 1.0 and the worst is 0.0.
F1 Score = 2*(Recall * Precision) / (Recall + Precision)
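A minimal sketch of producing these per-class scores on the Iris data is shown below; the classifier, the split and the use of scikit-learn's classification_report are assumptions for illustration.

```python
# Sketch: precision, recall and F1 per class on Iris via classification_report.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test),
                            target_names=load_iris().target_names))
```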
IV. Area Under ROC curve
The ROC curve is the locus of points obtained by plotting the true positive rate against the false positive rate at various threshold settings. Below is an example of an ROC curve.
The ROC curve is shown by the orange line. The greater the area under this curve (the AUC), the better the predictions.
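The following sketch computes the ROC curve points and the AUC for a binary classifier; the synthetic dataset and scikit-learn calls are assumptions used only for illustration.

```python
# Sketch: ROC curve points and AUC for a binary classifier (synthetic data).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=7)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]     # probability of the positive class

# (fpr, tpr) are the coordinates one would plot to draw the ROC curve.
fpr, tpr, thresholds = roc_curve(y_test, scores)
print("AUC:", roc_auc_score(y_test, scores))
```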