
Monday, October 22, 2007
P2-12

RECEIVER OPERATING CHARACTERISTIC (ROC) CURVES VERSUS PRECISION-RECALL (PR) CURVES IN MODELS EVALUATED WITH UNBALANCED DATA

Jagpreet Chhatwal, MS, Elizabeth S. Burnside, MD, MPH, MS, and Oguzhan Alagoz, PhD. University of Wisconsin, Madison, WI

Purpose: Receiver operating characteristic (ROC) curves are commonly used to measure the performance of a test (or model) with a binary outcome. However, the area under the ROC curve may give over-optimistic results when the test data are unbalanced (i.e., a disproportionate number of negative versus positive cases), which is often the case for screening tests such as mammography. Our objective was to compare the discriminative ability of Precision-Recall (PR) curves with that of ROC curves in measuring the performance of tests applied to unbalanced data.
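
To make the contrast concrete, the sketch below (not part of the study; the scores and the roughly 1% prevalence are assumed for illustration) scores a synthetic unbalanced test set with scikit-learn: the area under the ROC curve stays high while the area under the PR curve is far lower.

```python
# Illustrative sketch only: synthetic, unbalanced data (~1% positives), not the study's data.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
n_neg, n_pos = 10_000, 100                      # assumed ~1% prevalence, as in screening
y_true = np.concatenate([np.zeros(n_neg), np.ones(n_pos)])

# Assumed scores: positives only modestly separated from negatives.
scores = np.concatenate([rng.normal(0.0, 1.0, n_neg),
                         rng.normal(1.5, 1.0, n_pos)])

print("AROC:", roc_auc_score(y_true, scores))            # high despite the imbalance
print("APR :", average_precision_score(y_true, scores))  # substantially lower
```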

Methods: We constructed a logistic regression model to assess breast cancer risk and evaluated its performance using the area under the nonparametric ROC curve (AROC) and under the PR curve (APR). Recall and precision correspond to the sensitivity and the positive predictive value (PPV) of a test, respectively. We plotted sensitivity (X-axis) versus PPV (Y-axis) at different cut-off points to obtain a PR curve. Our data set consisted of 62,219 mammography abnormalities (510 malignant and 61,709 benign) observed by radiologists. We simulated the outcome of the model by adding bias to the predicted probability of cancer for the malignant cases. First, we added negative bias (underestimating the probability of cancer) to the malignant cases, thus reducing the performance of the model, and measured AROC and APR for various bias values. Second, we compared one of the biased risk assessment models (model-1) with the radiologists' prediction of breast cancer, as measured by Breast Imaging Reporting and Data System (BI-RADS) assessment codes, using both ROC and PR curves.
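
A minimal sketch of this evaluation, under assumed stand-in probabilities (the study's actual model outputs are not reproduced here): a negative bias is subtracted from the malignant cases' predicted probabilities, AROC and APR are recomputed at each bias value, and the PR curve itself plots recall (sensitivity) on the X-axis against precision (PPV) on the Y-axis.

```python
# Sketch of the bias experiment with assumed stand-in probabilities (Beta draws),
# matching the abstract's class sizes (510 malignant, 61,709 benign).
import numpy as np
from sklearn.metrics import (roc_auc_score, average_precision_score,
                             precision_recall_curve)

def evaluate(y_true, p_hat, bias=0.0):
    """AROC and APR after subtracting `bias` from the malignant cases' probabilities."""
    p = p_hat.copy()
    p[y_true == 1] = np.clip(p[y_true == 1] - bias, 0.0, 1.0)
    return roc_auc_score(y_true, p), average_precision_score(y_true, p)

rng = np.random.default_rng(1)
y_true = np.concatenate([np.ones(510), np.zeros(61_709)])
p_hat = np.concatenate([rng.beta(5, 2, 510),       # assumed malignant-case probabilities
                        rng.beta(1, 20, 61_709)])  # assumed benign-case probabilities

for bias in (0.0, 0.2, 0.4, 0.6):                  # larger negative bias degrades the model
    aroc, apr = evaluate(y_true, p_hat, bias)
    print(f"bias={bias:.1f}  AROC={aroc:.3f}  APR={apr:.3f}")

# The PR curve: precision and recall at every cut-off, plotted recall (X) vs precision (Y).
precision, recall, _ = precision_recall_curve(y_true, p_hat)
```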

Results: As we increased the magnitude of the bias, the best and worst performance as measured by AROC were 0.965 and 0.815, respectively, whereas the corresponding APR decreased from 0.550 to 0.035 for the same bias values. The AROC values for the radiologists and model-1 were 0.939 and 0.934, respectively, showing no statistically significant difference (p-value = 0.599); in contrast, the PR curves showed that the radiologists (APR = 0.496) performed significantly better (p-value < 0.001) than model-1 (APR = 0.448).
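
The abstract does not state which test produced these p-values; a paired bootstrap over cases is one common way to compare two models' APR on the same data, sketched below under that assumption.

```python
# Assumed approach (not stated in the abstract): paired bootstrap over cases to
# compare the APR of two models scored on the same mammographic abnormalities.
import numpy as np
from sklearn.metrics import average_precision_score

def paired_bootstrap_apr(y_true, scores_a, scores_b, n_boot=2000, seed=0):
    """Observed APR difference (A - B) and a two-sided bootstrap p-value."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    observed = (average_precision_score(y_true, scores_a)
                - average_precision_score(y_true, scores_b))
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)              # resample cases with replacement
        if y_true[idx].sum() == 0:               # skip resamples with no positives
            continue
        diffs.append(average_precision_score(y_true[idx], scores_a[idx])
                     - average_precision_score(y_true[idx], scores_b[idx]))
    diffs = np.asarray(diffs)
    p = 2 * min((diffs <= 0).mean(), (diffs >= 0).mean())  # two-sided, percentile-style
    return observed, min(p, 1.0)
```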

Conclusions: ROC curves over-estimate the performance of a model when the test data are unbalanced. In contrast, APR avoids this over-estimation and can demonstrate a statistically significant difference between tests, which is not detected by AROC.