EVALUATING THE CLINICAL UTILITY OF PREDICTION MODELS IN A HETEROGENEOUS MULTICENTER POPULATION USING DECISION-ANALYTIC MEASURES: THE RANDOM EFFECTS-WEIGHTED NET BENEFIT
Candidate for the Lee B. Lusted Student Prize Competition
Purpose: To investigate methods to evaluate the clinical utility of a prediction model in heterogeneous multicenter datasets based on decision-analytic measures.
Method: We focus on the Net Benefit (NB) statistic from Vickers and Elkin (Med Decis Making 2006). NB is defined as (TP - w*FP)/N, with TP the number of true positives, FP the number of false positives, N the total number of patients, and w the 'harm-to-benefit ratio' of treating a false positive versus a true positive. This ratio equals the odds of the risk threshold t used to classify patients as positive or negative, i.e. w = t/(1 - t). A model's NB can be compared to the default strategies of classifying all patients as positive (treat all) or as negative (treat none). We averaged center-specific NBs at specific values of t using random effects weights 1/(se² + τ²), with se² the within-center variance and τ² the between-center variance of NB. To compare utility across individual centers with different event rates, we calculated center-specific differences between the model's NB and the NB of the best default strategy. These differences were also averaged using random effects weights. We present a case study in which a prediction model (LR2) for malignancy of ovarian tumors is evaluated in a dataset of 5913 women recruited at 13 oncology referral centers and 11 non-oncology centers. We computed separate weighted averages of NB for oncology and non-oncology centers, re-estimating τ² in each subpopulation.
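The quantities described in the Method can be illustrated with a minimal Python sketch. It is not the authors' implementation: all function names are hypothetical, and the between-center variance is passed in as an argument because the abstract does not state which estimator of it was used. The treat-none strategy is taken to have an NB of zero, which follows from the NB definition when nobody is classified as positive.

```python
def net_benefit(tp, fp, n, t):
    """NB = (TP - w*FP)/N, with w the odds of the risk threshold t."""
    w = t / (1.0 - t)
    return (tp - w * fp) / n


def treat_all_net_benefit(prevalence, t):
    """NB when every patient is classified as positive (treat all)."""
    w = t / (1.0 - t)
    return prevalence - w * (1.0 - prevalence)


def nb_minus_best_default(nb_model, prevalence, t):
    """Model NB minus the better of treat all and treat none (NB = 0)."""
    best_default = max(treat_all_net_benefit(prevalence, t), 0.0)
    return nb_model - best_default


def random_effects_average(estimates, within_vars, tau2):
    """Average center-specific estimates with weights 1/(se^2 + tau^2).

    tau2 is assumed to be estimated elsewhere (estimator not specified
    in the abstract).
    """
    weights = [1.0 / (v + tau2) for v in within_vars]
    weighted_sum = sum(w * e for w, e in zip(weights, estimates))
    return weighted_sum / sum(weights)
```

With made-up numbers, `net_benefit(30, 20, 100, 0.1)` applies w = 0.1/0.9 to the 20 false positives. A large `tau2` relative to the within-center variances pushes the weights toward equality, so small centers count almost as much as large ones.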
Result: There was considerable heterogeneity in NB: the between-center variance was up to ten times as large as the average within-center variance. The NB corresponding to t = 0.1 was 0.333 (95% prediction interval 0.043-0.623) for oncology centers and 0.117 (0.034-0.200) for non-oncology centers (see figure). LR2 was always better than the best default strategy in the average non-oncology center. However, in the average oncology center, the NB of LR2 was lower than the NB of treating all patients when t ≤ 0.1. Risks of malignancy were underestimated in a number of oncology centers. Updating LR2 to resolve calibration issues improved the clinical utility.
Conclusion: NB can be highly heterogeneous in multicenter studies. NB may increase because of a higher prevalence of malignancy, and decrease because of insufficient calibration or reduced classification performance of the model in specific centers. This heterogeneity should be recognized and explored using appropriate techniques.