PS2-22 CLINICAL UTILITY OF PREDICTION MODELS FOR OVARIAN TUMOR DIAGNOSIS: A DECISION CURVE ANALYSIS

Monday, June 13, 2016
Exhibition Space (30 Euston Square)
Poster Board # PS2-22

Laure Wynants (1), Jan Verbakel (2), Sabine Van Huffel (1), Dirk Timmerman (2) and Ben Van Calster (2); (1) KU Leuven, Department of Electrical Engineering (ESAT), STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics, Leuven, Belgium; (2) KU Leuven, Department of Development and Regeneration, Leuven, Belgium
Purpose: To evaluate the clinical utility of prediction models to diagnose ovarian tumors as benign vs malignant using decision curves.

Method(s): We evaluated the widely used RMI scoring system using a cut-off of 200, and the following risk models: ROMA and three models from the International Ovarian Tumor Analysis (IOTA) consortium (LR2, SRrisks, and ADNEX). We used a multicenter dataset of 2403 patients collected by IOTA between 2009 and 2012 to compare RMI, LR2, SRrisks, and ADNEX. Additionally, we used a dataset of 360 patients collected between 2005 and 2009 at KU Leuven to compare RMI, ROMA, and LR2. Clinical utility was examined in all patients, as well as in several relevant subgroups (pre- versus postmenopausal, oncology versus non-oncology centers).

We quantified clinical utility through the Net Benefit (NB). NB corrects the number of true positives for the number of false positives using a harm-to-benefit ratio. This ratio is the odds corresponding to the risk-of-malignancy threshold at which one would suggest treatment for ovarian cancer (e.g., surgery by an experienced gynecological oncologist). A threshold of 20% (odds 1:4) implies that up to 4 false positives are accepted per true positive. Using NB, a model can be compared to competing models or to the default strategies of treating all or treating none. We expressed the difference between models as the gain in ‘net sensitivity’ (i.e., the gain in sensitivity at constant specificity, ΔNB/prevalence). 95% confidence intervals were obtained by bootstrapping.
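
As an illustration of the quantities defined above, the following is a minimal sketch of the standard Net Benefit calculation and of the ΔNB/prevalence conversion. The function names and all patient counts are hypothetical and chosen only for the example; they are not taken from the study data, and this is not the authors' analysis code.

```python
def threshold_odds(threshold):
    """Harm-to-benefit ratio: the odds of the risk threshold, t / (1 - t)."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    return threshold / (1.0 - threshold)


def net_benefit(tp, fp, n, threshold):
    """NB = TP/n - (FP/n) * odds(threshold)."""
    return tp / n - (fp / n) * threshold_odds(threshold)


def treat_all_net_benefit(prevalence, threshold):
    """NB of treating everyone: all malignant are TP, all benign are FP."""
    return prevalence - (1.0 - prevalence) * threshold_odds(threshold)


def net_sensitivity_gain(nb_a, nb_b, prevalence):
    """Difference in NB expressed as a gain in sensitivity at constant specificity."""
    return (nb_a - nb_b) / prevalence


# Hypothetical example: 1000 patients, 300 malignant, threshold 20% (odds 1:4).
n, prevalence, threshold = 1000, 0.30, 0.20
nb_a = net_benefit(tp=270, fp=140, n=n, threshold=threshold)  # hypothetical model A
nb_b = net_benefit(tp=210, fp=70, n=n, threshold=threshold)   # hypothetical model B
gain = net_sensitivity_gain(nb_a, nb_b, prevalence)
nb_all = treat_all_net_benefit(prevalence, threshold)
print(round(nb_a, 4), round(nb_b, 4), round(gain, 4), round(nb_all, 4))
```

In this made-up example, model A's NB exceeds model B's by 0.0425, which corresponds to a net sensitivity gain of about 14 percentage points at a prevalence of 30%. Treating none has an NB of 0 by definition.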

Result(s): Thresholds between 5% (odds 1:19) and 30% (odds 1:2.3) were considered reasonable. ADNEX and SRrisks consistently showed the best performance (see figure). RMI performed worst and was harmful, i.e., worse than treat all, at thresholds <20%. At the 10% threshold, ADNEX’s net sensitivity was 24% (95% CI 21% to 27%) higher than that of RMI. The gain was identical for SRrisks. LR2’s performance fell in between. Subgroup results showed similar patterns. On the second dataset, results for RMI were similar. In addition, LR2’s net sensitivity was 7% (95% CI 1% to 14%) higher than that of ROMA.
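
To make the notion of a "harmful" strategy concrete, the sketch below sweeps the 5% to 30% threshold range for a fixed cut-off rule and flags the thresholds at which its Net Benefit falls below that of treating all. The counts are hypothetical and do not reproduce the reported results; the function names are likewise assumptions made for the example.

```python
def odds(threshold):
    """Odds of the risk threshold, t / (1 - t)."""
    return threshold / (1.0 - threshold)


def nb_fixed_rule(tp, fp, n, threshold):
    """NB of a rule with a fixed cut-off: TP and FP do not change with the threshold."""
    return tp / n - (fp / n) * odds(threshold)


def nb_treat_all(n_malignant, n, threshold):
    """NB of treating every patient."""
    return n_malignant / n - ((n - n_malignant) / n) * odds(threshold)


def harmful_thresholds(tp, fp, n_malignant, n, thresholds):
    """Thresholds at which the rule is worse than treat all."""
    return [t for t in thresholds
            if nb_fixed_rule(tp, fp, n, t) < nb_treat_all(n_malignant, n, t)]


# Hypothetical counts: 1000 patients, 300 malignant; the rule finds 240 of them
# and also flags 300 of the 700 benign tumors.
thresholds = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30]
harmful = harmful_thresholds(tp=240, fp=300, n_malignant=300, n=1000,
                             thresholds=thresholds)
print(harmful)  # [0.05, 0.1]

# False positives accepted per true positive at each end of the range.
print(round(1 / odds(0.05), 1), round(1 / odds(0.30), 1))  # 19.0 2.3
```

With these illustrative numbers the fixed rule misses 60 cancers while sparing 400 benign cases, so it only beats treat all once the threshold odds exceed 60/400, i.e., at thresholds above roughly 13%.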

Conclusion(s): NB supersedes discrimination and calibration to quantify the clinical utility of prediction models. Our data suggest superior utility of IOTA models compared to RMI and ROMA.