Evaluation Metrics I — Classification


If you are familiar with machine learning competitions and you take the time to read through the competition guidelines, you will come across the term Evaluation Metric. The evaluation metric is the basis on which model performance is judged and winning models are placed on the leaderboard. Understanding evaluation metrics will help you build better models and give you an edge over your peers in a competition. We will discuss the common model evaluation metrics, paying attention to when each is used, the range of values it can take and, most importantly, the values we want to see.

A prediction model is trained with historical data. To ensure that the model makes accurate predictions — that it is able to generalize the learned rules to new data — the model should be tested on data that it was not trained with.

This can be done by separating the dataset into training data and testing data. The model is trained using the training data and the model’s predictive accuracy is evaluated using the test set. The dataset can be split using sklearn.model_selection.train_test_split from the scikit-learn library.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

X is the input feature(s) or independent feature(s)

y is the target column or dependent feature

test_size denotes the fraction of the total dataset that will be used to test the model.

The names X_train, X_test, y_train, y_test are conventional names and can be changed to suit your taste.
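
For classification problems it often helps to preserve the class proportions in both splits. A minimal sketch using the stratify argument (the random_state value is an arbitrary choice added here for reproducibility):

# stratified split: train and test sets keep the same class ratio as y
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=42)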

The model prediction accuracy is then determined using Evaluation Metrics. We will discuss popular evaluation metrics for classification models.

Confusion Matrix

The confusion matrix is a common method used to summarize and visualize the performance of classification models. It is an NxN matrix, where N is the number of target classes or values.

The rows in a confusion matrix represent actual values while the columns represent predicted values.

Terms to note in Confusion Matrix

  • True positives: cases where the model correctly predicts a True (positive) instance.
  • True negatives: cases where the model correctly predicts a False (negative) instance.
  • False positives: cases where the model predicts a True value when the actual value is False.
  • False negatives: cases where the model predicts a False value when the actual value is True.

To paint a clearer picture, I will use the Titanic dataset as an example. The Titanic dataset is a popular machine learning dataset and is common amongst beginners. It is a binary classification problem and the goal is to accurately predict which passengers survived the Titanic shipwreck. The passengers who survived are denoted by 1 while the passengers who did not survive are denoted by 0 in the target column SURVIVED.

Now if our model classifies a passenger as having survived (1) and the passenger actually survived (according to our dataset) then that classification is a True Positive — the model accurately predicted a 1.

If the model predicts that a passenger did not survive and the passenger did not survive, that is a True Negative classification.

But if the model predicts that a passenger survived when the passenger in fact did not survive, that is a case of False Positive.

As you may have guessed, when the model predicts a passenger died when the passenger actually survived, then it is a False Negative.

Hope this illustration puts things in perspective for you.

An ideal confusion matrix will have its largest values along its main diagonal, from top-left to bottom-right.

The Titanic dataset confusion matrix for a binary classification problem with only two possible outputs — survived or did not survive — is shown below. We have only two target values, therefore we have a 2x2 matrix, with the actual values on the vertical axis and the predicted values on the horizontal axis.

Titanic 2x2 Confusion Matrix
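
As a minimal sketch of how such a matrix can be computed with scikit-learn; the survival labels and predictions below are made up for illustration and are not the actual Titanic results:

from sklearn.metrics import confusion_matrix
# made-up actual and predicted survival labels (1 = survived, 0 = did not survive)
y_actual = [1, 0, 0, 1, 0, 1, 0, 0]
y_predicted = [1, 0, 1, 0, 0, 1, 0, 0]
# rows are actual values, columns are predicted values
print(confusion_matrix(y_actual, y_predicted))
# [[4 1]
#  [1 2]]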

A confusion matrix can also visualize multi-class classification problems. This is not different from the binary classification with the exception of an increase in the dimension of the matrix. A three-class classification problem will have a 3x3 confusion matrix and so on.

NxN Confusion Matrix

Specific metrics that can be derived from the confusion matrix include:

Accuracy

The accuracy metric measures the proportion of predictions that the model gets right — the true positives and true negatives — out of all predictions made.

Accuracy = (TP + TN) / (TP + TN + FP + FN)
# calculating accuracy mathematically
Accuracy = sum(clf.predict(X_test) == y_test) / (1.0 * len(y_test))
# calculating accuracy using sklearn
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, clf.predict(X_test)))

The values for accuracy range from 0 to 1; the closer the accuracy is to 1, the better the model's predictions. The accuracy metric is well suited to classification problems where the two classes are balanced or nearly balanced.
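
To see why accuracy can mislead when the classes are imbalanced, consider a made-up target where nine out of ten passengers did not survive; a model that always predicts the majority class scores 90% accuracy while never identifying a single survivor:

from sklearn.metrics import accuracy_score
# made-up imbalanced labels: 9 negatives, 1 positive
y_actual = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
y_majority = [0] * 10   # a "model" that always predicts the majority class
print(accuracy_score(y_actual, y_majority))   # 0.9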

Remember, it is advisable to evaluate the model on new data. If the model is evaluated using the same data it was trained on, a high accuracy value is not surprising, since the model has effectively memorized the actual values and returns them as predictions. This is related to overfitting: a situation whereby the model fits the training data but is not able to accurately predict target values when introduced to new data. To get an honest estimate of accuracy, make sure to evaluate on a held-out set of data.

Precision and Recall

The precision and recall metrics work hand in hand. Precision measures the proportion of the values the model predicted as positive that are actually positive, while recall measures the proportion of the actual positive values that the model correctly predicted.

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)

A model that produces few false positives has high precision, while a model with few false negatives has a high recall value.

Confusion Matrix

We will again use the Titanic dataset as before. If the confusion matrix for our model is as above, using our equations above, we get precision and recall values (in 2 dp) as follows:

Precision = 0.77

Recall = 0.86

If the number of false negatives is decreased, the recall value increases. Likewise, if the number of false positives is reduced, the precision value increases. From the confusion matrix above, the model is able to predict those who died in the shipwreck more accurately than those who survived.

Another way to understand the relationship between precision and recall is through thresholds. Predicted values above the threshold are assigned the positive class (survived) and values below the threshold are assigned the negative class (did not survive).

If the threshold is 0.5, passengers whose predicted value falls above 0.5 are classified as having survived the Titanic. If the threshold is higher, say 0.8, the number of passengers who actually survived but are classified as not having survived (the false negatives) will increase and the recall value will decrease. At the same time, the number of false positives will fall, because the model now classifies fewer passengers as positive due to its higher threshold, so the precision value will increase.

In this way, precision and recall can be seen to have a see-saw relationship. If the threshold is lowered, the number of positive predictions increases, so the false positives increase and the false negatives decrease, raising recall and lowering precision. In his book Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow, Aurélien Géron describes this tradeoff splendidly.
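
A minimal sketch of this tradeoff, assuming clf is an already fitted classifier that exposes predict_proba (as logistic regression does):

from sklearn.metrics import precision_score, recall_score
# predicted probability of the positive class (survived)
y_prob = clf.predict_proba(X_test)[:, 1]
for threshold in [0.3, 0.5, 0.8]:
    y_pred_t = (y_prob >= threshold).astype(int)
    print(threshold, precision_score(y_test, y_pred_t), recall_score(y_test, y_pred_t))
# raising the threshold typically raises precision and lowers recall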

We may choose to focus on precision or recall for a particular problem type. For example, if our model is to classify cancers as malignant or benign, we would want to minimize the chances of a malignant cancer being classified as benign — we want to minimize the false negatives — and therefore we would focus on increasing our recall value rather than precision. In this situation, we are keen on correctly flagging as many malignant cancers as possible, even if some of them turn out to be benign, rather than miss a malignant cancer.
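
If recall is the priority, scikit-learn also offers fbeta_score, which blends precision and recall but lets you weight one more heavily than the other; a beta greater than 1 favours recall (beta=2 below is just an illustrative choice, with y_test and y_pred standing for the actual and predicted labels):

from sklearn.metrics import fbeta_score
# beta > 1 favours recall, beta < 1 favours precision
fbeta_score(y_test, y_pred, beta=2)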

Precision and recall values range from 0 to 1 and, in both cases, the closer the value is to 1, the better. They are also good metrics to use when the classes are imbalanced, and they can be averaged for multiclass/multilabel classifications.

You can read more on precision and recall in the scikit-learn documentation.

from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
y_test = [0, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 0, 1]
recall_score(y_test, y_pred)      # 0.33: one of the three actual positives was found
precision_score(y_test, y_pred)   # 0.5: one of the two predicted positives is correct
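
For multiclass targets, the average argument controls how the per-class scores are combined; the three-class labels below are made up purely for illustration:

from sklearn.metrics import precision_score, recall_score
# made-up three-class example
y_true_multi = [0, 1, 2, 2, 1, 0]
y_pred_multi = [0, 2, 2, 2, 1, 1]
precision_score(y_true_multi, y_pred_multi, average='macro')      # unweighted mean over classes
recall_score(y_true_multi, y_pred_multi, average='weighted')      # weighted by class support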

A metric that takes both precision and recall into consideration is the F1 score.

F1 score

The F1 score takes into account both the precision and recall of the model. It computes the harmonic mean of precision and recall, F1 = 2 × (Precision × Recall) / (Precision + Recall), which gives more weight to the lower of the two values. Therefore, if either precision or recall is low, the F1 score will be pulled towards the lower metric; for example, a model with precision 1.0 but recall 0.1 has an F1 score of about 0.18, even though the arithmetic mean would be 0.55. This makes it a better model evaluation than an arithmetic mean of precision and recall. A model with high recall and precision values will also have a high F1 score. The F1 score ranges from 0 to 1, and the closer it is to 1, the better the model.

F1 score also works well with imbalanced classes and for multiclass/multilabel classification targets.

from sklearn.metrics import f1_score
f1_score(y_test, y_pred)   # 0.4 for the y_test and y_pred lists above

To display a summary of the results of the confusion matrix:

from sklearn.metrics import classification_report
y_true = [0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1]
target_names = ['Zeros', 'Ones']
print(classification_report(y_true, y_pred, target_names=target_names))
              precision    recall  f1-score   support

       Zeros       0.50      0.50      0.50         2
        Ones       0.67      0.67      0.67         3

    accuracy                           0.60         5
   macro avg       0.58      0.58      0.58         5
weighted avg       0.60      0.60      0.60         5

Log Loss

The Logarithmic Loss metric, or log loss as it is commonly known, measures how far the predicted probabilities are from the actual values. Log loss works by penalizing wrong predictions, and it penalizes confident wrong predictions most heavily.

Log loss does not have negative values. Its output is a float data type with values ranging from 0 to infinity. A model that accurately predicts the target class has a log loss value close to 0. This indicates that the model has made minimal error in prediction. Log loss is used when the output of the classification model is a probability such as in logistic regression or some neural networks.

from sklearn.metrics import log_loss
# y_pred here should contain predicted probabilities of the positive class
# (for example clf.predict_proba(X_test)[:, 1]), not hard 0/1 labels
log_loss(y_test, y_pred)
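
For a binary target, log loss averages -[y·log(p) + (1 - y)·log(1 - p)] over all samples, where p is the predicted probability of the positive class. A small made-up example:

from sklearn.metrics import log_loss
# actual labels and made-up predicted probabilities of the positive class
y_actual = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.1]
print(log_loss(y_actual, y_prob))   # ≈ 0.236
# the confident correct predictions (0.9 and 0.1) add little to the loss;
# the less confident ones (0.6 and 0.2) account for most of it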

In this post, we have discussed some of the most popular evaluation metrics for a classification model such as the confusion matrix, accuracy, precision, recall, F1 score and log loss. We have seen instances where these metrics are useful and their possible values. We have also outlined the metric scores for each evaluation metric that indicate our model is doing a great job at predicting the actual values. Next time we will look at the charts and curves that can also be used to evaluate the performance of a classification model.

Written by Anita Igbine
