PS4-59
A CALIBRATION HIERARCHY FOR RISK MODELS: ‘MODERATE’ CALIBRATION GUARANTEES NON-HARMFUL DECISION MAKING
Method: We draw on methodological considerations, research results, data simulations, and a mathematical proof to build our arguments. We focus on risk models for binary outcomes, but the ideas generalize to other outcome types.
Result: We distinguish four increasingly stringent levels of calibration. Mean calibration (calibration-in-the-large) is achieved when the observed event rate equals the average estimated risk. Weak calibration (logistic calibration) means that there is no general overfitting or underfitting, and no general over- or underestimation of risks. Weak calibration can be investigated through Cox's logistic recalibration analysis, which yields the calibration intercept and slope. Moderate calibration (flexible calibration) means that estimated risks correspond to observed event rates per level of estimated risk. Strong calibration means that estimated risks correspond to observed event rates per covariate pattern. We prove that moderate calibration leads to non-harmful decision making in terms of decision curve analysis: the clinical Net Benefit of the model at any risk threshold is at least as high as the Net Benefit of treating all patients or treating no patients by default. Finally, we argue that strong calibration, although desirable for patient communication and shared decision making, is utopian for statistical reasons. In essence, strong calibration requires that the correct model is obtained conditional on the included predictors, which is unrealistic.
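The non-harm property can be illustrated with a small simulation. The sketch below is not the authors' simulation or proof; it assumes the standard decision curve definition of Net Benefit at threshold t, namely TP/n minus FP/n times t/(1 - t), and uses hypothetical names (`net_benefit`, `calibrated`, `overestimated`) and an arbitrary data-generating model. The calibrated "model" simply returns the true risks, so it is moderately calibrated by construction; the second model shifts every risk upward on the logit scale as one assumed example of miscalibration.

```python
import numpy as np

def net_benefit(y, risk, threshold):
    """Net Benefit of treating everyone whose estimated risk >= threshold."""
    y = np.asarray(y)
    treat = np.asarray(risk) >= threshold
    n = len(y)
    tp = np.sum(treat & (y == 1)) / n   # true positives per patient
    fp = np.sum(treat & (y == 0)) / n   # false positives per patient
    return tp - fp * threshold / (1 - threshold)

def expit(z):
    return 1 / (1 + np.exp(-z))

# Assumed data-generating model: logit(risk) ~ Normal(-1, 1).
rng = np.random.default_rng(2015)
n = 100_000
true_risk = expit(-1 + rng.normal(size=n))
y = rng.binomial(1, true_risk)

# Calibrated by construction: estimated risk equals the true risk.
calibrated = true_risk
# Assumed miscalibration: all risks overestimated by +2 on the logit scale.
overestimated = expit(np.log(true_risk / (1 - true_risk)) + 2)

# Mean calibration: observed event rate vs. average estimated risk.
print(f"event rate {y.mean():.3f}, mean calibrated risk {calibrated.mean():.3f}, "
      f"mean overestimated risk {overestimated.mean():.3f}")

thresholds = (0.1, 0.2, 0.3, 0.5, 0.7, 0.9)
results = []
for t in thresholds:
    nb_all = net_benefit(y, np.ones(n), t)   # treat all patients
    nb_cal = net_benefit(y, calibrated, t)
    nb_over = net_benefit(y, overestimated, t)
    results.append((t, nb_cal, nb_over, nb_all))
    # Treating no patients has Net Benefit 0 at every threshold.
    print(f"t={t:.1f}  calibrated={nb_cal:+.3f}  overestimated={nb_over:+.3f}  "
          f"treat all={nb_all:+.3f}  treat none=+0.000")
```

In this setup the calibrated model should match or exceed both default strategies at every threshold, up to sampling noise, while the overestimating model falls below treat-none at a threshold of 0.5, because it treats many patients whose true risk lies below the threshold.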
Conclusion: Our results indicate that researchers should aim for simple models that are moderately calibrated, because this guarantees non-harmful decision making. Furthermore, this is likely to facilitate acceptance and uptake of risk models by clinicians.
See more of: 37th Annual Meeting of the Society for Medical Decision Making