Probabilistic interpretation of AUC

March 18, 2020 less than 1 minute read

The area under the ROC (receiver operating characteristic) curve, or AUC, is a popular and robust metric for machine learning classification. However, one issue with its use is that it is not as easy to explain to a client what it is and why you are using it. Over at 0-fold Cross-Validation, Alexej Gossmann has a useful post on how to think of the AUC probabilistically:

In other words, if the classification algorithm distinguishes “positive” and “negative” examples (e.g., disease status), then AUC is the probability of correct ranking of a random “positive”-“negative” pair.

From this, you can derive a simple formula to compute AUC on a finite sample as an alternative to the trapezoidal rule:

Among all “positive”-“negative” pairs in the dataset compute the proportion of those which are ranked correctly by the evaluated classification algorithm.

And as a reminder of why the AUC has many advantages compared to other ‘single number” performance measures:

It has a really nice probabilistic meaning!

Independence of the decision threshold.

Invariance to prior class probabilities or class prevalence in the data.

Can choose/change a decision threshold based on cost-benefit analysis after model training.

Extensively used in machine learning, and in medical research.

Share on

Twitter Facebook LinkedIn

Francis T. O'Donovan

Probabilistic interpretation of AUC

Share on

Leave a comment

You may also enjoy

Why You’re Not Getting Value from Your Data Science

List only untracked files

Pandas: Display DataFrames side by side

Pandas: Transforming two DataFrame columns into a dictionary