Confusion Matrix

This tutorial presents the confusion matrix, its use in the context of multiclass classification, and the biases that can go unnoticed when interpreting the results.


2x2 confusion matrix

A confusion matrix is used to represent the results of a binary classification: it counts the True Positives, False Positives, True Negatives and False Negatives.


The elements on the diagonal are those predicted correctly (the prediction ŷ matches the actual value), and the off-diagonal elements are those predicted incorrectly.

Let's see how to construct a confusion matrix and understand its terminology. Suppose we need to build a classifier that distinguishes 2 kinds of fruit – apples and grapes – and we want our machine learning model to classify a given fruit as an apple or a grape.

We therefore take 15 samples of the 2 fruits, of which 8 belong to the Apple class and 7 belong to the Grape class. The class is simply the output; in this example we have 2 output classes: Apple and Grape. We will represent Apple as class 1 and Grape as class 0.

The actual class for 8 apples and 7 grapes can be represented as follows:

Actual = [1,1,1,1,1,1,1,1,0,0,0,0,0,0,0]

The classifier model predicts 1 for Apple and 0 for Grape.

Suppose the classifier takes the 15 inputs and makes the following predictions:

  • Out of the 8 apples, it correctly classifies 5 as apples and incorrectly predicts 3 as grapes.
  • Out of the 7 grapes, it correctly classifies 5 as grapes and wrongly predicts 2 as apples.

The prediction of the classifier can be as follows:

Prediction = [1,0,0,0,1,1,1,1,0,0,0,0,0,1,1]

# Creation of the confusion matrix using sklearn
from sklearn.metrics import confusion_matrix
# Let the actual value be 1 for apple and 0 for grape, as in our example
ACTUAL = [1,1,1,1,1,1,1,1,0,0,0,0,0,0,0]
# Let the predicted values be
PREDICTION = [1,0,0,0,1,1,1,1,0,0,0,0,0,1,1]
# Confusion matrix for the actual and predicted values
matrix = confusion_matrix(ACTUAL, PREDICTION, labels=[1,0])
print('Confusion matrix: \n', matrix)
# With labels=[1,0], the flattened matrix comes out in the order TP, FN, FP, TN
TP, FN, FP, TN = confusion_matrix(ACTUAL, PREDICTION, labels=[1,0]).reshape(-1)
print('Outcome values: \n', TP, FN, FP, TN)

For our example, the positive value is Apple and the negative value is Grape.

True positive: This means that the actual value and the predicted value are the same. In our case, the actual value is an apple and the model also predicted an apple. If you look at the TP cell, both the actual and the predicted values are positive.

False negative: This means that the true value is positive. In our case it is apple, but the model predicted it as negative, i.e. grape. The model therefore gave a bad prediction. This was supposed to give a positive result (apple), but it gave a negative result (grape). So whatever negative result we get is false; hence the name false negative.

False positive: This means that the true value is negative. In our case it is grape, but the model predicted it as positive, i.e. apple. The model therefore gave a bad prediction. It was supposed to give a negative result (grape), but it gave a positive result (apple), so whatever positive result we get is false, hence the name false positive.

True negative: This means that the actual value and the predicted value are the same. In our case, the actual value is a grape and the model also predicted a grape. The values for the example above are: TP = 5, FN = 3, FP = 2, TN = 5.

You know the theory, and the code above already puts it into practice with the Scikit-learn (sklearn) library in Python.

Sklearn has two interesting functions: confusion_matrix() and classification_report().

Sklearn's confusion_matrix() returns the values of the confusion matrix. Its layout, however, is slightly different from what we have studied so far: the rows are the actual values and the columns are the predicted values, with the labels in sorted order by default (which is why we passed labels=[1,0] above to get the TP, FN, FP, TN layout). The rest of the concept remains the same.

Sklearn's classification_report() generates the precision, recall and F1-score for each target class. In addition, it provides aggregate values: the micro average, macro average and weighted average.

The micro average is the precision/recall/F1-score computed globally, by counting the total true positives, false negatives and false positives over all classes.

The macro average is the unweighted mean of the per-class precision/recall/F1-scores.

The weighted average is the mean of the per-class precision/recall/F1-scores, weighted by each class's support (the number of true instances of each class).
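As a quick illustration, here is what classification_report() produces for the apple/grape lists used above (depending on the sklearn version, the report shows either a micro-average row or an overall accuracy row):

from sklearn.metrics import classification_report

ACTUAL = [1,1,1,1,1,1,1,1,0,0,0,0,0,0,0]
PREDICTION = [1,0,0,0,1,1,1,1,0,0,0,0,0,1,1]

# Per-class precision, recall and F1-score, plus the macro and weighted averages
print(classification_report(ACTUAL, PREDICTION, target_names=['Grape (0)', 'Apple (1)']))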

Multiclass confusion matrix, 3x3 example

Let’s try to understand the confusion matrix for 3 classes and the confusion matrix for multiple classes with a popular dataset – the IRIS DATASET.

The dataset contains 3 flowers as outputs or classes: Versicolor, Virginica and Setosa.

Using the petal length, petal width, sepal length, and sepal width, the model should classify a given instance as a Versicolor, Virginica or Setosa flower.

Let's apply a classifier model here. We could use logistic regression or a decision tree classifier, but in the code below a support vector classifier (SVC) is applied to the dataset. The dataset has 3 classes; we therefore obtain a 3 x 3 confusion matrix.

But how do you know the TP, TN, FP and FN values?

In a multiclass classification problem, we do not directly get TP, TN, FP and FN values as in a binary classification problem; we need to calculate them for each class separately.

# Importing packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Importing the dataset into a dataframe
df = pd.read_csv("../input/iris-flower-dataset/IRIS.csv")

# To see the first 5 rows of the dataset
df.head()

# To know the data types of the variables
df.dtypes

# Species is the output class; to know the count of each class we use value_counts()
df['Species'].value_counts()

# Separating the independent variables and the dependent variable ("Species")
X = df.drop(['Species'], axis=1)
y = df['Species']

# print(X.head())
print(X.shape)
# print(y.head())
print(y.shape)

# Splitting the dataset into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# To know the shape of the train and test datasets
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

# We use a Support Vector Classifier as the classifier
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

# Training the classifier using X_train and y_train
clf = SVC(kernel='linear').fit(X_train, y_train)
clf.predict(X_train)

# Testing the model using X_test and storing the output in y_pred
y_pred = clf.predict(X_test)

# Creating a confusion matrix, which compares y_test and y_pred
cm = confusion_matrix(y_test, y_pred)

# Creating a dataframe from the array-formatted confusion matrix, so it is easy to plot
cm_df = pd.DataFrame(cm,
                     index=['SETOSA', 'VERSICOLOR', 'VIRGINICA'],
                     columns=['SETOSA', 'VERSICOLOR', 'VIRGINICA'])

# Plotting the confusion matrix
plt.figure(figsize=(5, 4))
sns.heatmap(cm_df, annot=True)
plt.title('Confusion Matrix')
plt.ylabel('Actual Values')
plt.xlabel('Predicted Values')
plt.show()

Here is how to compute these four values for a given class in a multiclass matrix:

  • TP: the true-positive value is the cell where the actual value and the predicted value are both that class (the diagonal cell).
  • FN: the false-negative value of a class is the sum of the values in the corresponding row, except the TP value.
  • FP: the false-positive value of a class is the sum of the values in the corresponding column, except the TP value.
  • TN: the true-negative value of a class is the sum of all the remaining values, i.e. everything outside that class's row and column.
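These rules translate directly into code. Here is a minimal sketch (the helper name per_class_counts is my own), assuming a square confusion matrix whose rows are actual values and whose columns are predicted values:

import numpy as np

def per_class_counts(cm, class_index):
    # cm: square confusion matrix, rows = actual values, columns = predicted values
    cm = np.asarray(cm)
    TP = cm[class_index, class_index]      # diagonal cell of the class
    FN = cm[class_index, :].sum() - TP     # rest of that row
    FP = cm[:, class_index].sum() - TP     # rest of that column
    TN = cm.sum() - TP - FN - FP           # everything outside that row and column
    return TP, FN, FP, TN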

The confusion matrix for the IRIS dataset is as follows:

(rows = actual class, columns = predicted class; cells are numbered 1 to 9, left to right, top to bottom)

                SETOSA   VERSICOLOR   VIRGINICA
SETOSA              16            0           0
VERSICOLOR           0           17           1
VIRGINICA            0            0          11

Let's calculate the TP, TN, FP and FN values for the Setosa class using the tips above:

TP: The actual value and the predicted value must be the same. For the Setosa class, the value of cell 1 is the TP value, i.e. TP = 16.

FN: the sum of the values of the corresponding row, except the TP value.

FN = (cell 2 + cell 3) = (0 + 0) = 0

FP: The sum of the values of the corresponding column except the TP value.

FP = (cell 4 + cell 7) = (0 + 0) = 0

TN: The sum of the values of all columns and rows, except the values of the class for which we are calculating the values.

TN = (cell 5 + cell 6 + cell 8 + cell 9) = 17 + 1 + 0 + 11 = 29

Similarly, for the Versicolor class, the values/metrics are calculated as below:

TP: 17 (cell 5)

FN: 0 + 1 = 1 (cell 4 + cell 6)

FP: 0 + 0 = 0 (cell 2 + cell 8)

TN: 16 + 0 + 0 + 11 = 27 (cell 1 + cell 3 + cell 7 + cell 9).

You can try the Virginica class yourself.
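Using the per_class_counts helper sketched earlier on the matrix above, you can check all three classes, Virginica included:

iris_cm = [[16, 0, 0],   # actual SETOSA
           [0, 17, 1],   # actual VERSICOLOR
           [0, 0, 11]]   # actual VIRGINICA

for i, name in enumerate(['SETOSA', 'VERSICOLOR', 'VIRGINICA']):
    TP, FN, FP, TN = per_class_counts(iris_cm, i)
    print(f"{name}: TP={TP}, FN={FN}, FP={FP}, TN={TN}")
# SETOSA: TP=16, FN=0, FP=0, TN=29
# VERSICOLOR: TP=17, FN=1, FP=0, TN=27
# VIRGINICA: TP=11, FN=0, FP=1, TN=33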

Interpretation metrics

The confusion matrix allows us to measure recall and precision, which, along with accuracy and the AUC-ROC curve, are the metrics used to measure the performance of ML models.

Recall answers the question: of all the actual positive cases, how many did we predict correctly? Recall = TP / (TP + FN), and it should be as high as possible.

Precision answers the question: of all the cases we predicted as positive, how many are actually positive? Precision = TP / (TP + FP), and it should be as high as possible.

Accuracy measures, over all classes (positive and negative), how many predictions we got right: Accuracy = (TP + TN) / (TP + TN + FP + FN). For the apple and grape example this is (5 + 5) / 15 ≈ 0.667. Accuracy should be as high as possible.
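Using the counts from the apple/grape example (TP = 5, FN = 3, FP = 2, TN = 5), these three metrics work out as follows:

# Recall, precision and accuracy from the apple/grape counts
TP, FN, FP, TN = 5, 3, 2, 5
print('Recall   :', TP / (TP + FN))                    # 5/8  = 0.625
print('Precision:', TP / (TP + FP))                    # 5/7  ~ 0.714
print('Accuracy :', (TP + TN) / (TP + TN + FP + FN))   # 10/15 ~ 0.667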


It is difficult to compare two models when one has low precision and high recall, or vice versa. To make them comparable, we use the F-score, which measures recall and precision simultaneously. It uses the harmonic mean instead of the arithmetic mean, so it punishes extreme values more.

F1-score = 2 × (Precision × Recall) / (Precision + Recall)
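For the apple/grape example, the difference between the harmonic and the arithmetic mean looks like this (sklearn's f1_score returns the same harmonic-mean value):

# Harmonic vs arithmetic mean of the precision (5/7) and recall (5/8) computed above
precision, recall = 5/7, 5/8
print(2 * precision * recall / (precision + recall))   # ~0.667 (the F1-score)
print((precision + recall) / 2)                        # ~0.670, slightly higher:
                                                       # the harmonic mean penalises
                                                       # the lower value more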

There are many other measurements derived from the confusion matrix, each with its own specific uses:


a- P (condition positive): the number of real positive cases in the data

b- TP (true positive): a test result that correctly indicates the presence of a condition or characteristic

c- FN (false negative), Type II error: a test result that wrongly indicates that a particular condition or attribute is absent

d- N (condition negative): the number of real negative cases in the data

e- FP (false positive), Type I error: a test result that wrongly indicates that a particular condition or attribute is present

f- TN (true negative): a test result that correctly indicates the absence of a condition or characteristic

Evaluating biases

When evaluating a model, metrics collected against a full test or validation set do not always provide a clear picture of how fair that model is.

Let's take the example of a new model developed to predict the presence of tumors, evaluated on a validation set of 1,000 patients: 500 records correspond to women and 500 to men. The following confusion matrix summarizes the results obtained over the 1,000 examples:

True positives (TP): 16    False positives (FP): 4
False negatives (FN): 6    True negatives (TN): 974

Precision = 0.800, Recall = 0.727

These results seem promising: a precision of 80% and a recall of 72.7%. But what happens if we calculate the metrics separately for each group of patients? Let's split the results into two separate confusion matrices: one for women and one for men. (To examine such distributions, a decision tree can also help.)

Women (500 records):
True positives (TP): 10    False positives (FP): 1
False negatives (FN): 1    True negatives (TN): 488

Men (500 records):
True positives (TP): 6     False positives (FP): 3
False negatives (FN): 5    True negatives (TN): 486
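As a minimal sketch, the per-group metrics can be recomputed directly from these counts (the dictionary layout below is just one way to organise them):

# Precision and recall per subgroup, from the split confusion matrices above
groups = {
    'women': {'TP': 10, 'FP': 1, 'FN': 1, 'TN': 488},
    'men':   {'TP': 6,  'FP': 3, 'FN': 5, 'TN': 486},
}

for name, c in groups.items():
    precision = c['TP'] / (c['TP'] + c['FP'])
    recall = c['TP'] / (c['TP'] + c['FN'])
    print(f"{name}: precision = {precision:.3f}, recall = {recall:.3f}")
# women: precision = 0.909, recall = 0.909
# men:   precision = 0.667, recall = 0.545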

When we calculate metrics for female and male patients separately, we observe marked differences in model performance for each group.

Women:

  • Of the 11 women who actually had tumors, the model correctly predicted a positive result for 10 (recall: 90.9%). In other words, the model misses a tumor in 9.1% of the tumor cases among women.

  • Likewise, when the model returns a positive tumor result for a woman, it is correct in 10 out of 11 cases (precision: 90.9%). In other words, 9.1% of the model's positive predictions for women are wrong.

Men:

  • However, of the 11 male patients who actually had tumors, the model correctly predicted a positive result for only 6 (recall: 54.5%). This means the model misses a tumor in 45.5% of the tumor cases among men.

  • When the model gives a positive tumor result for a man, it is correct in only 6 out of 9 cases (precision: 66.7%). In other words, 33.3% of the model's positive predictions for men are wrong.

We now have a much better understanding of the biases inherent in the model's predictions, as well as the risks associated with each subgroup if it were to be used for medical purposes by the general population.