Bangda Sun

Common Metrics in Binary Classification Problems

The confusion matrix and the evaluation metrics derived from it (recall, precision, F1 score) are used to evaluate the performance of classification models. Each metric has its own characteristics in terms of evaluation. Here I summarize these basic metrics and some tricks for memorizing the calculations.

In classification problems, we usually start analyzing the performance of the model / algorithm from the confusion matrix:

> confusion_matrix
         pred 0 pred 1
actual 0     55      4
actual 1      3     43

Here I denote "0" as the negative class and "1" as the positive class; they are the two classes in our classification problem. The confusion matrix here says:

  • 55 observations are correctly classified as "0": they are true negatives (TN);
  • 43 observations are correctly classified as "1": they are true positives (TP);
  • 4 observations are incorrectly classified as "1": they are false positives (FP);
  • 3 observations are incorrectly classified as "0": they are false negatives (FN).

> confusion_matrix
         pred 0 pred 1
actual 0   "TN"   "FP"
actual 1   "FN"   "TP"
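In R (where the console output above appears to come from), a confusion matrix in this layout can be built with table(). A minimal sketch; the actual and pred vectors below are made-up labels just for illustration:

# Made-up actual labels and predictions, for illustration only.
actual <- c(0, 0, 0, 1, 1, 1)
pred   <- c(0, 0, 1, 1, 1, 0)

# table() puts the first argument (actual classes) in rows and the
# second (predictions) in columns, matching the layout above.
confusion_matrix <- table(actual = actual, pred = pred)
confusion_matrix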

We always define the class that we are interested in as "Positive". For instance, when we want to detect spam, the spam emails would be "Positive"; when we want to predict whether a user will click a link, the click action would be "Positive".

The first measure we almost always use is simply Accuracy, the proportion of correctly classified observations; here we have

\[
\text{Accuracy} = \frac{55+43}{55+43+4+3} = 93.33\%,
\]

very easy and straightforward.
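As a quick check in code: accuracy is the sum of the diagonal of the confusion matrix over the grand total. A small base-R sketch using the counts from the matrix above:

# Confusion matrix from above: rows are actual classes, columns are predictions.
confusion_matrix <- matrix(c(55, 3, 4, 43), nrow = 2,
                           dimnames = list(actual = c("0", "1"),
                                           pred   = c("0", "1")))

# Accuracy = (TN + TP) / total.
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
accuracy  # 0.9333333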

Next we can use the true positive rate (TPR) and false positive rate (FPR), which are defined as:

\[
\text{TPR} = \frac{\text{true positive}}{\text{positive}} = \frac{\text{true positive}}{\text{true positive} + \text{false negative}} = \frac{43}{43+3} = 93.48\%,
\]

\[
\text{FPR} = \frac{\text{false positive}}{\text{negative}} = \frac{\text{false positive}}{\text{false positive} + \text{true negative}} = \frac{4}{4+55} = 6.78\%,
\]

where FPR can be viewed as a gauge of "Type I error" (rejecting a true null, i.e. raising a false alarm), while 1 - TPR, the false negative rate, corresponds to "Type II error" (failing to reject a false null, i.e. a miss). TPR has another name, Recall, and is also known as Sensitivity, while 1 - FPR is known as Specificity.
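The same quantities in code, reusing confusion_matrix from the accuracy sketch above (rows are actual classes, columns are predictions):

# TPR (recall / sensitivity): true positives over all actual positives (second row).
tpr <- confusion_matrix["1", "1"] / sum(confusion_matrix["1", ])

# FPR: false positives over all actual negatives (first row).
fpr <- confusion_matrix["0", "1"] / sum(confusion_matrix["0", ])

c(TPR = tpr, FPR = fpr)  # 0.9347826 0.0677966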

Personally, the trick I use to remember how these metrics are calculated is:

  • when it is a True XX Rate, the denominator is the actual XX class: Positive for TPR and Negative for TNR;
  • when it is a False XX Rate, the denominator is the opposite class: Negative for FPR and Positive for FNR.

From the perspective of the confusion matrix, the denominator of FPR is the sum of the first row (all actual negatives), and the denominator of TPR is the sum of the second row (all actual positives).

Based on TPR and FPR, we can draw a plot called the ROC curve (receiver operating characteristic), where the x-axis is FPR and the y-axis is TPR. The curve is traced out by computing (FPR, TPR) at various threshold settings. The ideal point is (0, 1): for good performance, FPR should be as small as possible and TPR as large as possible. We also have AUC (area under the curve), which is the area under the ROC curve. ROC and AUC are widely used in industry.
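To make the threshold sweep concrete, here is a minimal base-R sketch that computes ROC points and a trapezoidal AUC from predicted scores. The scores and labels below are made up for illustration; in practice a package such as pROC is typically used instead:

# Made-up predicted scores and true labels, for illustration only.
scores <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)
labels <- c(1, 1, 0, 1, 1, 0, 0, 0)

# Sweep thresholds from high to low: predict "1" when score >= t,
# then record (FPR, TPR) at each threshold.
thresholds <- sort(unique(c(scores, Inf)), decreasing = TRUE)
roc <- t(sapply(thresholds, function(t) {
  pred <- as.numeric(scores >= t)
  c(FPR = sum(pred == 1 & labels == 0) / sum(labels == 0),
    TPR = sum(pred == 1 & labels == 1) / sum(labels == 1))
}))
roc  # each row is one point on the ROC curve, from (0, 0) to (1, 1)

# AUC by the trapezoidal rule over the ROC points.
auc <- sum(diff(roc[, "FPR"]) *
           (head(roc[, "TPR"], -1) + tail(roc[, "TPR"], -1)) / 2)
auc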

Next, we can define Precision, which is the proportion of predicted positives that are actually positive. It is especially useful when the two classes are not "equally weighted" in our problem, i.e. we care more about one class than the other.

\[
\text{Precision} = \frac{\text{true positive}}{\text{true positive} + \text{false positive}} = \frac{43}{43+4} = 91.49\%,
\]

where the denominator is the sum of the second column of the confusion matrix. We can also plot Precision versus Recall, which is known as the precision-recall curve (PRC).
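In code, again reusing confusion_matrix from the sketches above:

# Precision: true positives over all predicted positives (second column).
precision <- confusion_matrix["1", "1"] / sum(confusion_matrix[, "1"])
precision  # 0.9148936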

Finally we have one more important metric called the F1-score, which combines Recall and Precision: it is their harmonic mean:

\[
\text{F1-score} = \frac{2\times\text{Precision}\times\text{Recall}}{\text{Precision}+\text{Recall}}=\frac{2\times43/47\times43/46}{43/47+43/46} = 0.9247.
\]
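And in code, reusing precision and tpr (i.e. Recall) from the sketches above:

# F1-score: harmonic mean of precision and recall.
recall <- tpr
f1 <- 2 * precision * recall / (precision + recall)
f1  # 0.9247312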
