Evaluating a Machine Learning Model: Regression and Classification Metrics
The world today constantly demands new technology and better applications. Developers, in turn, are eager to introduce new software and applications that give their users a better overall experience. Machine Learning and Deep Learning play a vital role in all of this.
The most important part of anything done in the field of Machine Learning and Deep Learning is its application. Applications are driven by performance, and performance is achieved through better and improved results. There are now many automated tools, libraries and scripts that let you run an ML model without knowing much about it. This results in fast development, and beginners in particular benefit from it. Many people do not know how a model works, but they know what it does and how to use it. The math behind machine learning is quite complex and not everybody wants to learn it; to be honest, they don’t need to either. But everybody has to look at the results their model achieves on their dataset, because the performance of a model is what ultimately matters. That is why improvements at every stage are necessary: developers constantly need to deliver better results than before to keep up with their users. Hence, it becomes important to know whether the results have actually improved, and whether the new results improve the overall performance in any way.
One of the most common ways to evaluate performance is to compute accuracy, which is a great measure for comparing models, but there are times when other metrics (measures) give much more insight into performance than accuracy alone. There are also measures that quantify error and loss. This article explains some of the most commonly used metrics for evaluating Regression and Classification models, along with their Python implementation. Before we jump into that, there are a few conventions that you need to know,
y = y_true = Ground Truth = the actual values of target variables
y^ = y_pred = the predicted values of target variables
sample = row in data
Regression Metrics,
A Regression model predicts a real number, like house price prediction.
Error metrics measure dissimilarity, i.e. how much difference there is between y_true and y_pred. For error measures, lower is better, so always try to minimize your error.
1. Mean Absolute Error (MAE),
A risk metric corresponding to the expected value of the absolute error loss, or l1-norm loss, is computed; the mean is taken to know the average error we make in predicting a single sample. For a sample i, if y^₍ᵢ₎ is the predicted value and y₍ᵢ₎ is the corresponding true value, then the Mean Absolute Error estimated over nₛₐₘₚₗₑₛ is defined as,
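MAE(y, y^) = (1/nₛₐₘₚₗₑₛ) × Σᵢ |y₍ᵢ₎ − y^₍ᵢ₎|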
#PYTHON CODE
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_true, y_pred)
This measure can be used to directly interpret the range of predictions. So, if the MAE |y − y^| is, say, 10, then for a new sample for which the model predicts 100, we can say that the true value likely lies between 90 and 110, i.e. 100 ± 10. In other words, the typical error or variation of the model is ±10. MAE is not suitable when the deviation of the predicted output varies a lot across samples. Let’s say for one sample the absolute error is 1, and for another the absolute error is 100. MAE treats both of these errors equally; it just aggregates (sums) them and takes their mean. But if we think sensibly, these errors should be treated relatively, because an error of 100 is very large in comparison to 1 and hence should be penalized more than 1. It can be said that MAE does a poor job when the scale of the errors is large. To overcome this difficulty, we use MSE.
2. Mean Squared Error (MSE),
This computes a risk metric corresponding to the expected value of the squared (quadratic) error or loss. Mean Squared Error over n samples is,
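MSE(y, y^) = (1/nₛₐₘₚₗₑₛ) × Σᵢ (y₍ᵢ₎ − y^₍ᵢ₎)²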
#PYTHON CODE
from sklearn.metrics import mean_squared_error
mean_squared_error(y_true, y_pred)
→ Squaring always gives a non-negative value, so positive and negative errors do not cancel out in the sum.
→ Squaring emphasizes larger differences — a feature that turns out to be both good and bad (think of the effect outliers have)
This measure is often more useful than MAE because it effectively penalizes the outliers. If the error is 0.1 (<1), its contribution after squaring is 0.01, which we can call negligible, and if the error is 10, its contribution becomes 100; this is what is meant by penalizing the outliers.
Hence, the model is barely penalized for making an error <1; a smaller error affects the score far less than a larger error. In this way MSE fulfils a desirable property of regression metrics, i.e. relative penalty.
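To see the relative penalty in action, here is a minimal sketch with made-up numbers in which a single outlying prediction inflates MSE far more than MAE,
#PYTHON CODE
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = [10, 10, 10, 10]
y_pred = [11, 9, 10, 110]  # the last prediction is off by 100

print(mean_absolute_error(y_true, y_pred))  # (1 + 1 + 0 + 100) / 4 = 25.5
print(mean_squared_error(y_true, y_pred))   # (1 + 1 + 0 + 10000) / 4 = 2500.5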
3. Median Absolute Error,
This error measure is particularly interesting because it is robust to outliers. If there are outliers in the data itself, taking the mean is not an appropriate solution, since the mean shifts towards the outliers and no longer represents the central tendency of the data correctly. To remove the effect of such outlying samples, we can take the median instead of the mean and do justice to all the other sample values. In this way Median Absolute Error prevents outliers from contributing disproportionately to the model evaluation. Over n samples it can be calculated as,
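MedAE(y, y^) = median(|y₍₁₎ − y^₍₁₎|, …, |y₍ₙ₎ − y^₍ₙ₎|)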
#PYTHON CODE
from sklearn.metrics import median_absolute_error
median_absolute_error(y_true, y_pred)
4. Mean Squared Logarithmic Error (MSLE),
A risk metric corresponding to the expected value of the squared logarithmic (quadratic) error or loss is computed. Over n samples MSLE is,
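MSLE(y, y^) = (1/nₛₐₘₚₗₑₛ) × Σᵢ (logₑ(1 + y₍ᵢ₎) − logₑ(1 + y^₍ᵢ₎))²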
#PYTHON CODE
from sklearn.metrics import mean_squared_log_error
mean_squared_log_error(y_true, y_pred)
Here, logₑ(x) means the natural logarithm of x. This metric is best to use when targets have exponential growth, such as population counts, average sales of a commodity over a span of years, etc. Note that this metric penalizes an under-predicted estimate more than an over-predicted one. MSLE is very useful when the scale of the predictions is very high; the log works as a scale reducer and scales the values down.
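A minimal sketch with made-up numbers showing this asymmetry: under-predicting a true value of 100 by 50 is penalized more than over-predicting it by 50,
#PYTHON CODE
from sklearn.metrics import mean_squared_log_error

y_true = [100]
print(mean_squared_log_error(y_true, [50]))   # under-prediction, ≈ 0.47
print(mean_squared_log_error(y_true, [150]))  # over-prediction, ≈ 0.16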
5. Max Error,
This metric captures the maximum residual error, i.e. the worst-case error between the predicted value and the true value. It is used in critical settings where every single sample matters, such as in the medical industry. In a perfectly fitted single-output regression model, Max Error would be 0 on the training set; although this is highly unlikely in the real world, the metric shows the extent of the worst error the model made when it was fitted. It is defined as,
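Max Error(y, y^) = maxᵢ (|y₍ᵢ₎ − y^₍ᵢ₎|)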
#PYTHON CODE
from sklearn.metrics import max_error
max_error(y_true, y_pred)
Since this measure finds the maximum error the model makes on any single sample, it helps in choosing a model whose worst-case behaviour is acceptable across all the samples.
6. Root Mean Squared Error (RMSE),
It is simply the square root of MSE, mostly used to bring the error back to the original scale of the target. RMSE is the standard deviation of the residuals (prediction errors). Residuals are a measure of how far the data points are from the regression line; RMSE is a measure of how spread out these residuals are. In other words, it tells you how concentrated the data is around the line of best fit.
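There is no separate snippet for RMSE in this article; a minimal sketch is simply to take the square root of the MSE computed above,
#PYTHON CODE
import numpy as np
from sklearn.metrics import mean_squared_error

rmse = np.sqrt(mean_squared_error(y_true, y_pred))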
A Regression model can also be evaluated based on the similarity of its predictions, i.e. using Similarity Metrics. A similarity measure tells how much similarity there is between y_true and y_pred, i.e. how close y_pred is to y_true.
For similarity, higher is better and lower is worse, so always try to maximize your similarity. The similarity measure for regression cannot be accuracy, since a regression model predicts continuous values rather than discrete values. So, if y_true = 20 and, let’s say, y_pred = 19.5, then the prediction is obviously not exactly accurate, but it is pretty similar. Hence, a regression model need not be exactly accurate but should make similar predictions. It can be concluded that, for a good model, the error measures should be low and the similarity measures should be high.
# Explained Variance Score,
This computes the explained variance regression score. If y^ is the estimated target output, y the corresponding (correct) target output, and Var is the variance (the square of the standard deviation), then the explained variance is estimated as follows,
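Explained Variance(y, y^) = 1 − Var(y − y^) / Var(y)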
#PYTHON CODE
from sklearn.metrics import explained_variance_score
explained_variance_score(y_true, y_pred)
This is a very good evaluation metric. Here, 1 is the best possible score for a model, and scores below 0 indicate a poorly trained model. This score in turn explains the variance of the whole model. Note that, while calculating the variance of the errors in the numerator, the individual errors are not squared or made positive beforehand; they are taken as they are.
# R² Score, the coefficient of determination,
A very popular evaluation score which computes the coefficient of determination usually denoted as R². It represents the proportion of variance (of y) that has been explained by the independent variables in the model. It provides an indication of goodness of fit and therefore a measure of how well unseen samples (test data) are likely to be predicted by the model, through the proportion of explained variance.
As variance hugely depends on the dataset, R² may not be meaningfully comparable across different datasets. The best possible score is 1.0, and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get an R² score of 0.0. Over n samples, it can be calculated as,
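R²(y, y^) = 1 − [Σᵢ (y₍ᵢ₎ − y^₍ᵢ₎)²] / [Σᵢ (y₍ᵢ₎ − ȳ)²], where ȳ is the mean of y_true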
#PYTHON CODE
from sklearn.metrics import r2_score
r2_score(y_true, y_pred)
Note that r2_score calculates unadjusted R², without correcting for bias in the sample variance of y.
Classification Metrics,
1. Confusion Matrix,
A confusion matrix is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known. The confusion matrix itself is relatively simple to understand, but the related terminology can be confusing.
Given an actual label and a predicted label, the first thing we can do is divide our samples into 4 buckets.
True positive (TP) — actual = 1; predicted = 1 (11 = 1)
False positive (FP) — actual = 0; predicted = 1 (01 = 0)
False negative (FN) — actual = 1; predicted = 0 (10 = 0)
True negative (TN) — actual = 0; predicted = 0 (00 = 1)
An “XNOR gate” produces this kind of output (1 when both inputs match, 0 otherwise). Our objective is to train the model so that our algorithm predicts the same as the true output. Hence, our algorithm should produce more outputs like TP and TN.
#PYTHON CODE
from sklearn.metrics import confusion_matrix
confusion_matrix(y_true, y_pred)
Suppose we are given 100 people, of whom 50 are healthy and 50 are unhealthy. We developed an algorithm which predicted 35 to be healthy and 65 to be unhealthy. So, can we draw the confusion matrix based on this data? Try!! We can’t, since we are unable to map (compare) those predicted values to the actual values. Those 35 predictions could correspond to any mix of the 50 healthy and the 50 unhealthy people. We need the actual and predicted label of each row to draw the confusion matrix.
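As a minimal sketch, here are made-up actual and predicted labels for 8 people (1 = unhealthy, 0 = healthy), which is exactly the row-level information that was missing above,
#PYTHON CODE
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(confusion_matrix(y_true, y_pred))
# [[3 1]
#  [1 3]]  -> rows are actual classes, columns are predicted classes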
2. Accuracy,
The most common metric for classification is accuracy, which is the fraction of samples predicted correctly as shown below.
#PYTHON CODE
from sklearn.metrics import accuracy_score
accuracy_score(y_true, y_pred)
This is simple enough to understand, so let’s take a scenario to further develop the concept. We have a binary classifier whose accuracy is 45%. Is this a good classifier? Think!!!
This is not at all a good binary classifier, since for balanced binary data, even without using a classifier, we can assign all the data to one class and easily achieve 50% accuracy. We do not need any logistic regression (for binary classification) to reach 50%; hence, the above model is very badly trained. Similarly, for a balanced three-class problem we can get about 33% accuracy without any classifier at all. This reasoning holds only when we have balanced data, meaning the number of rows for each class is roughly the same. One can contradict it with a binary classification problem of 100 samples where 90 belong to one class and the remaining 10 to the other. In that case the analogy fails: we can easily get 90% accuracy just by assigning all samples to the first class, which seems odd, but there are multiple methods to deal with such imbalance, such as weighted averaging, or reusing those 10 minority samples with each of the 9 subsets of the 90 majority samples and then training, among other techniques.
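A minimal sketch of this “no real classifier” baseline using scikit-learn’s DummyClassifier; the 90/10 labels below are made up for illustration,
#PYTHON CODE
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

y = [1] * 90 + [0] * 10   # 90 samples of one class, 10 of the other
X = [[0]] * 100           # features are irrelevant for this baseline

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
print(accuracy_score(y, baseline.predict(X)))  # 0.9 without learning anything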
3. Precision Score,
Precision is the fraction of predicted positive events that are actually positive as shown below.
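Precision = TP / (TP + FP)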
#PYTHON CODE
from sklearn.metrics import precision_score
precision_score(y_true, y_pred)
4. Recall,
Recall (also known as sensitivity or True Positive Rate) is the fraction of positive events that are predicted correctly.
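Recall = TP / (TP + FN)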
#PYTHON CODE
from sklearn.metrics import recall_score
recall_score(y_true, y_pred)
For example, let’s say we have data about the rainy days in a month: total days = 30, actual rainy days = 10. The trained model predicted that, out of 30, 9 days had rain; of these, rain on 5 days is predicted correctly and on 4 days incorrectly. Hence,
Precision = 5/9 (How many selected days are relevant?)
Recall = 5/10 (How many relevant days are selected?)
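A minimal sketch reproducing these numbers with scikit-learn; the day-by-day labels below are made up to match the counts above (1 = rain, 0 = no rain),
#PYTHON CODE
from sklearn.metrics import precision_score, recall_score

y_true = [1] * 10 + [0] * 20                      # 10 actual rainy days out of 30
y_pred = [1] * 5 + [0] * 5 + [1] * 4 + [0] * 16   # 9 days predicted rainy, 5 of them correct

print(precision_score(y_true, y_pred))  # 5/9 ≈ 0.56
print(recall_score(y_true, y_pred))     # 5/10 = 0.5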
To make the precision maximum, one only needs to flag those samples about which the confidence of being correct is 100% (make FP = 0). Even if this kind of certainty exists for only one sample, precision would still be (1/1) = 100%.
To make recall maximum, one only needs to flag every sample, so all 30 days should be selected. That is, our model should say that there will be rain on all 30 days; it would then surely cover all the rainy days as well (make FN = 0).
If we use these metrics individually, our model can fool us easily by following the ideas mentioned above. We require a metric that combines both precision and recall. That metric is called the “F1 Score”.
5. F1 Score,
The F1 Score is the harmonic mean of precision and recall; a higher score indicates a better model.
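F1 = 2 × (Precision × Recall) / (Precision + Recall)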
#PYTHON CODE
from sklearn.metrics import f1_score
f1_score(y_true, y_pred)
If precision is low, F1 is low. If recall is low, F1 is low.
If both are high, then only F1 will be high.
Here, this formula gives 50% focus to precision and 50% to recall. What if we want to focus more on precision than on recall? How can we modify the formula to allow an unequal split of focus between the two? For this purpose, we use the F-β Score.
6. F-β Score,
The general formula for non-negative real β is,
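Fβ = (1 + β²) × (Precision × Recall) / (β² × Precision + Recall)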
#PYTHON CODE
from sklearn.metrics import fbeta_score
fbeta_score(y_true, y_pred, beta=0.5)
→ F₀.₅ measure (β=0.5): more weight on precision, less weight on recall
→ F₁ measure (β=1): balanced weight on precision and recall
→ F₂ measure (β=2): less weight on precision, more weight on recall
Hence, we conclude that,
♦ Precision and Recall provide two ways to summarize the errors made for the positive class in a binary classification problem.
♦ F1-measure provides a single score that summarizes the precision and recall.
♦ Fβ-measure provides a configurable version of the F-measure to give more or less attention to the precision and recall measure when calculating a single score. (if β<1, focus is more towards precision; if β>1, focus is more towards recall; if β=1, both have balanced equal focus)
7. ROC curve (Receiver Operating Characteristic Curve) and ROC AUC Score (Area Under the ROC Curve),
- ROC curves are VERY helpful for understanding the balance between the true-positive rate and the false-positive rate. The curve is calculated using 3 lists,
- thresholds = all unique prediction probabilities in descending order
- fpr = the false positive rate (FP / (FP + TN)) for each threshold
- tpr = the true positive rate (TP / (TP + FN)) for each threshold
- It tells how much a model is capable of distinguishing between classes.
- Higher the AUC, better the model is at predicting 0s as 0s and 1s as 1s.
Variation in the threshold value (Th) changes the values of fpr and tpr. How?? Lowering the threshold makes the model predict the positive class more often, so both tpr and fpr increase; raising the threshold makes positive predictions rarer, so both decrease. This change in tpr and fpr can be plotted as the ROC curve and measured through the ROC AUC Score.
Here, the green line represents the perfect prediction, or the ground truth itself; red and blue are two possible prediction thresholds, and black is just a random prediction. The quality of the curve is quantified by the AUC Score: the larger the area under the curve, the better the algorithm is considered. So, here, we would choose the red predictor threshold, as it is nearest to the ground truth and has the maximum AUC Score.
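A minimal sketch of computing these quantities with scikit-learn; note that roc_curve and roc_auc_score take predicted probabilities (or scores) rather than hard class labels, and the y_scores below are made up for illustration,
#PYTHON CODE
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.6]  # predicted probability of class 1

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print(fpr, tpr, thresholds)             # one (fpr, tpr) point per threshold
print(roc_auc_score(y_true, y_scores))  # area under the ROC curve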
These metrics help research-oriented people compare and quantify their results in order to write a good research paper. Almost all research papers include a results section, where such studies can make your paper stand out from others. Comparative studies of different models and algorithms have their own benefits, and these metrics are key to them.