Purpose: Risk prediction models that contain more than seven predictor variables may be less likely to be used in clinical practice. The purpose of this project was to compare five commonly used variable selection methods in their ability to create small, accurate risk prediction models of comparable size (≤7 variables).
Method: The five methods were forward stepwise regression, backward stepwise regression, forward stepwise selection based on the c statistic, Harrell’s stepdown, and random survival forest (RSF). They were compared head-to-head in four large cohorts using 100 random cross-validations. Each method was used to select variables for a Cox regression in which continuous variables were fit using restricted cubic splines. RSF was also used to select variables for an RSF-generated prediction. The cohorts ranged in size from 3,969 to 191,011 patients. The variables and interactions included in the “full” statistical models had been determined by clinical experts for previous studies that used these same datasets.
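As a rough illustration of forward selection driven by the c statistic with a seven-variable cap, the following Python sketch may help. It is not the study's implementation: the names `harrell_c` and `forward_select`, the `score` callback, the handling of tied times, and the rule of stopping early when the score no longer improves are all assumptions made for illustration. In practice, `score` would refit a Cox model on the candidate variable set and return its c statistic.

```python
from itertools import combinations
from typing import Callable, List, Sequence


def harrell_c(times: Sequence[float], events: Sequence[int],
              risk: Sequence[float]) -> float:
    """Harrell's c statistic for right-censored data (simplified sketch).

    A pair is usable only if the subject with the shorter follow-up time
    had an event. Pairs with tied times are skipped (a simplification).
    """
    concordant = 0.0
    usable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` has the shorter follow-up time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if times[a] == times[b] or not events[a]:
            continue
        usable += 1
        if risk[a] > risk[b]:
            concordant += 1.0   # higher predicted risk failed first
        elif risk[a] == risk[b]:
            concordant += 0.5   # tied predictions count as half
    return concordant / usable if usable else float("nan")


def forward_select(candidates: Sequence[str],
                   score: Callable[[List[str]], float],
                   max_vars: int = 7) -> List[str]:
    """Greedy forward selection: add the variable that most improves `score`.

    Stops at `max_vars` variables, or earlier if no remaining candidate
    improves the score (the early stop is an assumption of this sketch).
    """
    selected: List[str] = []
    best = float("-inf")
    while len(selected) < max_vars:
        remaining = [v for v in candidates if v not in selected]
        if not remaining:
            break
        top_score, top_var = max((score(selected + [v]), v) for v in remaining)
        if top_score <= best:
            break
        selected.append(top_var)
        best = top_score
    return selected
```

In a cross-validated comparison such as the one described here, `forward_select` would be run on each training split and `harrell_c` evaluated on the corresponding held-out split.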
Result: Forward stepwise regression was at least as good as the other methods in three of the four datasets (as determined by the median cross-validated c statistic), but there was little difference in the discrimination of the models produced by forward stepwise, backward stepwise, forward stepwise based on the c statistic, and Harrell’s stepdown. RSF was the least stable method across the datasets: although it produced the most accurate Cox model for one dataset, the RSF-generated prediction was the least accurate in three of the datasets and had to be abandoned in the largest dataset because of excessive computation time. Histograms of the variable selection frequencies show that all of the methods were inconsistent in the variables they selected from one cross-validation to the next.
Conclusion: Forward stepwise regression appears to be a reasonable approach for creating prediction models when the number of variables in the model is limited to seven in an effort to increase the use of the model in clinical care. These results may not pertain to larger statistical models with lower numbers of events per variable. RSF could become a more attractive approach as computational capabilities improve, and if dataset characteristics can be identified that indicate when RSF is likely to produce a more accurate result.