Choice of evaluation metrics in classification problems
1. Different Types of Error
This time we will discuss the application of several metrics in more detail: TPR (Recall), FPR, Precision, the PR curve, the ROC curve/AUC, and the F1-score.
Different problems impose different requirements, which ultimately depend on our tolerance for different types of error. Broadly, there are two scenarios:
- We care less about TPR (Recall)
By definition, TPR = TP / (TP + FN), i.e. the proportion of actual positives that are correctly classified as positive. Since the number of actual positives is fixed, caring less about TPR means we can tolerate some positives being missed (false negatives).
For instance, in spam classification we want to predict whether a message is spam. Define "spam" as positive and "normal email" as negative. Although we want to filter out spam, we do not want legitimate emails to be filtered out. We can tolerate some spam being classified as normal email (false negatives), but if legitimate emails are classified as spam (false positives), we might miss important events. In this case we want high Precision.
- We need high TPR (Recall)
In this case we want high Recall and care less about Precision.
For instance, in flood prediction we raise an alarm whenever we expect a flood. Define "flood" as positive. We want every flood to be detected, so we have a high tolerance for false positives, since the losses caused by a flood typically far exceed the cost of extra precaution. In general, if missing a positive (a false negative) is very costly, we want high Recall.
As a result, Accuracy is not always the goal we want to optimize; sometimes we care more about reducing a particular type of error than about reducing the total error, as the sketch below illustrates.
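To make this concrete, here is a minimal sketch (assuming scikit-learn is available; the labels and predicted probabilities are hypothetical) showing how moving the decision threshold trades false positives against false negatives, and how that shows up in Precision and Recall:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical ground truth (1 = positive, e.g. "spam") and predicted probabilities.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.05, 0.10, 0.30, 0.45, 0.55, 0.70, 0.40, 0.60, 0.80, 0.95])

for threshold in (0.3, 0.5, 0.7):
    y_pred = (y_prob >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(
        f"threshold={threshold:.1f}  TP={tp} FP={fp} FN={fn} TN={tn}  "
        f"precision={precision_score(y_true, y_pred):.2f}  "
        f"recall={recall_score(y_true, y_pred):.2f}"
    )

# A higher threshold reduces false positives (better Precision, good for spam filtering);
# a lower threshold reduces false negatives (better Recall, good for flood alarms).
```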
2. ROC AUC and PR AUC
In general, ROC AUC is the more commonly used of the two. The ROC curve is drawn from pairs of FPR (x-axis) and TPR (y-axis) obtained at different thresholds (most classifiers can output a probability or score for each class), and the PR curve is drawn from pairs of Recall (i.e. TPR, x-axis) and Precision (y-axis) at different thresholds.
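A minimal sketch of how both curves and their AUCs can be computed with scikit-learn, assuming we already have true labels and predicted scores (the arrays below are illustrative):

```python
import numpy as np
from sklearn.metrics import roc_curve, precision_recall_curve, auc

# Hypothetical true labels and classifier scores (e.g. predicted probabilities).
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.9])

# ROC curve: FPR on the x-axis, TPR on the y-axis, one point per threshold.
fpr, tpr, roc_thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)

# PR curve: Recall (TPR) on the x-axis, Precision on the y-axis.
precision, recall, pr_thresholds = precision_recall_curve(y_true, y_score)
pr_auc = auc(recall, precision)

print(f"ROC AUC = {roc_auc:.3f}, PR AUC = {pr_auc:.3f}")
```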
In addition, many real-world classification problems are imbalanced, which makes the situation more complex. For instance, an algorithm can perform terribly on PR AUC yet look nearly perfect on ROC AUC; one reason is that ROC AUC is relatively insensitive to class imbalance. Therefore, for imbalanced data sets, check both ROC AUC and PR AUC.
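As a toy illustration of this point (a sketch assuming scikit-learn; the data set and model are synthetic and only meant to show how the two summaries can diverge), one can compare ROC AUC with PR AUC (average precision) on a heavily imbalanced problem:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, average_precision_score

# Synthetic, heavily imbalanced binary problem (roughly 1% positives).
X, y = make_classification(
    n_samples=20000, n_features=20, weights=[0.99, 0.01],
    class_sep=0.5, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]

# On imbalanced data, ROC AUC can look respectable while PR AUC stays low,
# because FPR divides by the (large) number of negatives.
print(f"ROC AUC: {roc_auc_score(y_test, scores):.3f}")
print(f"PR AUC (average precision): {average_precision_score(y_test, scores):.3f}")
```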
A more detailed analysis can be found in the paper The Relationship Between Precision-Recall and ROC Curves.
3. Summary
There are still more issues to consider in real-world applications, since additional factors come into play. Thus it is a good habit not to rely on a single metric.
4. References
- Jesse Davis and Mark Goadrich, The Relationship Between Precision-Recall and ROC Curves, ICML 2006