Evaluation Metrics II: Classification Evaluation Curves


In our previous post, we discussed the Evaluation Metrics for Classification models with emphasis on the Confusion matrix, accuracy, precision, recall and log loss metrics. You can read the post here to understand the basic evaluation metrics.

By evaluating our models, we assess their performance using key parameters and functions. Model evaluation is an integral step in the machine learning pipeline and we evaluate our models for a number of reasons. Most importantly, we evaluate models to ascertain that they are able to generalize on unseen data and therefore produce accurate predictions. As models are trained using past data, how they perform on future data to predict target outcomes can only be gauged through the process of model evaluation.

Model evaluation is also carried out to ensure that the model does not overfit the training data. Overfitting occurs when a model has memorized the target values for our data set and performs excellently on training data but perform poorly when fitted to new data.

We also evaluate models to determine the model that gives the highest accuracy for a dataset. It is common practice to fit various models to a dataset in order to determine which model makes the most accurate predictions.

It is advisable to evaluate model performance on data different from the training data.

Evaluation techniques include:

  • Holdout: In this technique, the dataset is randomly divided into the Training, Validation and Test set. The training set is used to fit the model. During training, the model learns the relationships between the input features and target output. The Validation set is used to assess model learning ability and adjust model parameters if need be. Weak learners are models that do not accurately define the relationship between the input and output variables while strong learners are capable of identifying the relationship between the input features and target output. The test or holdout set is then used to assess the model’s performance after training and validation have been carried out.
  • Cross validation: In this technique, model training and testing is carried out on two separate datasets. There are different cross validation methods such as the train_test_split and the k-folds validation. The methods of Cross validation are not within our scope but you can read about them here.
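The two techniques above can be sketched with scikit-learn. The synthetic dataset and the LogisticRegression model below are illustrative stand-ins, not a recommendation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

# Synthetic binary-classification data for illustration.
X, y = make_classification(n_samples=500, random_state=0)

# Holdout: reserve a test set the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))

# k-fold cross validation: average the score over 5 train/test splits.
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print("5-fold mean accuracy:", scores.mean())
```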

In this post, we will discuss two curves commonly used to evaluate the performance of Classification models: the ROC curve and Precision-Recall curve.

Receiver Operating Characteristic (ROC) Curve

The ROC Curve is a plot of the True Positive Rate (recall) vs the False Positive Rate. Recall, as we learnt in our previous post, is the ratio of positive cases that the model accurately classifies.

Recall (TPR) = TP / (TP + FN)

The False Positive Rate is the proportion of negative classes that are classified as positive classes by the model. Let us continue with our Cancer Classification problem: if we assign a negative instance to benign cancers (0) and a positive instance to malignant cancers (1), with the goal of identifying cases of malignant cancer, the False Positives are the cases of benign cancer that are misclassified as malignant cancer. We can therefore call these False Alerts.

FPR, also known as Fall-out, is defined by the formula:

FPR = FP / (FP + TN)

Specificity is the True Negative Rate which is the proportion of negative classes which the model classifies as negative.

Specificity (TNR) = TN / (TN + FP)
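Putting the three rates together, here is a minimal sketch that computes them from a confusion matrix; the labels and predictions below are made up for illustration:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up ground truth and predictions (0 = benign, 1 = malignant).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 1, 0, 1, 1, 0, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
tpr = tp / (tp + fn)          # recall / sensitivity
fpr = fp / (fp + tn)          # fall-out
specificity = tn / (tn + fp)  # true negative rate
print(tpr, fpr, specificity)
```

Note that `fpr` equals `1 - specificity`, which is exactly the x-axis of the ROC curve.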

The ROC curve can therefore be defined as the plot of recall versus (1 — specificity). It is built from the model's predicted probabilities: each classification threshold produces one (FPR, TPR) point, and sweeping the threshold traces out the curve.

The ideal ROC curve is close to the top left corner of the plot, representing a classifier with high recall and a low false positive rate. The ROC curve is a good way to compare the performance of different Classification models.

[Figure: ROC curves for four classifiers]

The diagonal line from the origin to the point (1,1) represents a random model.

Models with high True Positive Rates and low False Positive Rates are more accurate. From our ROC curves above, we can see that the green classifier is the most accurate, followed by the purple, blue and red classifiers in that order.

The scikit-learn roc_curve function computes the points of an ROC curve. The example below is adapted from the sklearn roc_curve documentation.

import numpy as np
from sklearn import metrics
y = np.array([1, 1, 2, 2])                # example true labels
scores = np.array([0.1, 0.4, 0.35, 0.8])  # example predicted scores
fpr, tpr, thresholds = metrics.roc_curve(y, scores, pos_label=2)

For probability scores, the threshold values range from 0 to 1. The threshold determines the True Positive and False Positive Rates, because scores above and below the threshold fall into different classes.

In our Cancer example, if we set a threshold of 0.5, then scores above 0.5 indicate a malignant cancer while scores below indicate benign cancers. If we raise the threshold to 0.7, the range of scores classified as malignant is reduced: model recall decreases and specificity increases.

Likewise, if we reduce the threshold, more cases are classified as positive: recall increases while specificity reduces.
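This trade-off can be demonstrated numerically. The scores below are hypothetical, chosen only to show recall falling and specificity rising as the threshold moves from 0.5 to 0.7:

```python
import numpy as np

# Hypothetical scores for six cases (three benign = 0, three malignant = 1).
y_true = np.array([0, 0, 0, 1, 1, 1])
scores = np.array([0.2, 0.4, 0.6, 0.55, 0.7, 0.9])

results = {}
for threshold in (0.5, 0.7):
    y_pred = (scores >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    results[threshold] = (recall, specificity)
    print(f"threshold={threshold}: recall={recall:.2f}, specificity={specificity:.2f}")
```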

The ROC can be plotted for varying threshold values to determine the optimum threshold value or plotted for different classifiers to determine the best model. The ROC is best suited for binary classification problems but can be extended for multiclass classification problems.
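One way to sketch that comparison is below; the synthetic dataset and the two models are arbitrary illustrative choices, and matplotlib is assumed to be available:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Plot one ROC curve per classifier on the same axes.
for model in (LogisticRegression(), DecisionTreeClassifier(max_depth=3)):
    probs = model.fit(X_train, y_train).predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, probs)
    plt.plot(fpr, tpr, label=type(model).__name__)

plt.plot([0, 1], [0, 1], "k--", label="random")  # diagonal baseline
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate (Recall)")
plt.legend()
plt.show()
```

The curve that bows furthest toward the top left corner belongs to the better classifier on this data.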

Area Under Curve (AUC)- ROC

The Area Under Curve is a summary value of the ROC curve. As the name suggests, it is the value of the area that lies beneath the ROC curve. The AUC ranges from 0 for a classifier that gets every case wrong to 1 for a perfect classifier. Therefore, an AUC value closer to 1 indicates an accurate model that is able to differentiate between the classes in a classification problem. The AUC can also be used to compare the performance of various classifiers.

An AUC value of 0.5 represents a random classifier.

from sklearn.metrics import roc_auc_score
y_true = [0, 0, 1, 1]             # example true labels
y_scores = [0.1, 0.4, 0.35, 0.8]  # example predicted scores
roc_auc_score(y_true, y_scores)

Here is a link to the sklearn roc_auc_score documentation.

Precision-Recall Curve

The precision-recall curve is a plot of the recall on the x-axis and precision on the y-axis. Precision measures the proportion of values predicted positive by the model that are actually positive, and Recall measures the proportion of actual positives that were accurately predicted. The precision-recall curve allows us to view the precision-recall relationship at a glance and is especially useful in imbalanced class classification.

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)

The Precision-Recall curve of a good classifier is closer to the top right corner of the plot.

from sklearn.metrics import precision_recall_curve
y_true = [0, 0, 1, 1]             # example true labels
y_scores = [0.1, 0.4, 0.35, 0.8]  # example predicted scores
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

While the ROC curve compares the True Positive Rate and False Positive Rates, the Precision-Recall curve compares the Precision and Recall of models.

The Area Under Curve of the Precision-Recall Curve (AUCPRC) can also be determined using the Trapezoidal rule. A high AUCPRC value, close to 1, indicates that the model has both high precision and high recall.

A horizontal line in the Precision-Recall plot indicates a no-skill classifier: its precision equals the proportion of positive cases in the dataset regardless of recall, and its AUCPRC equals that proportion.
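Under the Trapezoidal rule, the AUCPRC can be estimated with scikit-learn's auc helper; average_precision_score is a related step-wise summary of the same curve. The labels and scores below are illustrative values:

```python
import numpy as np
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

# Example labels and predicted scores (illustrative values).
y_true = np.array([0, 0, 1, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.9])

precision, recall, _ = precision_recall_curve(y_true, y_scores)
aucprc = auc(recall, precision)                 # trapezoidal rule over the curve
ap = average_precision_score(y_true, y_scores)  # step-wise summary
print("trapezoidal AUCPRC:", aucprc)
print("average precision:", ap)
```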


When working with a problem, you should outline your target output and its interpretation. Knowing this will help you determine what evaluation metric to use.

In a breast cancer classification problem, we are concerned with a high Recall (few False Negatives) so as to capture all cases of malignant cancer, even if some benign cancers are classified as malignant. In an attrition model, Sensitivity is our main focus, and Accuracy is best used in balanced classification problems. Need I go on?

Here, we have learnt that Precision-Recall curves are better suited to imbalanced classification than ROC curves. We have also looked at the importance of evaluating models and briefly outlined the model evaluation techniques.

Thanks to libraries such as scikit-learn, a deep understanding of these metrics may not be required but it is always good to know when to apply each metric to obtain required results.

  • Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow by Aurélien Géron
  • Udacity Microsoft Azure Foundation Course Lesson 3.25: Evaluation Metrics for Classification
  • 11 Important Model Evaluation Error Metrics, Analyticsvidhya
  • Scikit-learn documentation
  • Training Sets, Validation Sets, and Holdout Sets; DataRobot
  • Understanding AUC-ROC Curve; Towards Data Science Publication

Data Science Nigeria PH community. Data Science and Artificial Intelligence
