D-4 INCORPORATION OF VARIABLE COST INTO VARIABLE SELECTION FOR LOGISTIC REGRESSION USING INFORMATION CRITERIA AND THE AREA UNDER THE ROC CURVE

Tuesday, October 20, 2009: 1:45 PM
Grand Ballroom, Salon 4 (Renaissance Hollywood Hotel)
Ben Van Calster, PhD1, Sabine Van Huffel, PhD1 and Dirk Timmerman, MD, PhD2, (1)Katholieke Universiteit Leuven, Leuven, Belgium, (2)University Hospitals Leuven, Leuven, Belgium

Purpose: Simple, easy-to-use, yet well-performing diagnostic models are preferable for successful implementation in daily clinical practice. We therefore present several approaches to cost-sensitive variable selection for logistic regression, in the context of ovarian tumor diagnosis.

Method: We performed variable selection on data from 1938 women with an ovarian tumor (542 malignancies) using 31 candidate predictors. Variable cost was scored from 1 to 5, based on time-related and financial constraints, subjectivity, and patient impact. Stepwise selection based on the Akaike information criterion (AIC), Schwarz' Bayesian information criterion (BIC), or the area under the ROC curve (AUC) was considered. To account for variable cost, the penalty term k*p for AIC (k=2) and BIC (k=log(n)) was replaced by k*(Σc + 1), with p the number of coefficients and Σc the total cost of the variables in the model. The original cost values, i.e. 1 to 5, were also linearly rescaled to 1 to C to vary the impact of variable cost. Cost was accounted for in the AUC criterion by subtracting m*c from the training AUC (rounded to three decimals), with m representing the impact of variable cost. If C=1 (AIC/BIC) or m=0 (AUC), no penalization for variable cost is applied. One thousand random train-validation splits of the data set were created (70% vs 30%). After variable selection, the training AUC and validation AUC were recorded, as well as the number of selected variables, the total cost Σc, and the average cost per selected variable. We combined results across the 1000 train-validation splits using box plots and averages.
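The cost-penalized AIC criterion above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses forward selection only (rather than full stepwise), synthetic data with hypothetical costs, and scikit-learn's LogisticRegression with a large C to approximate an unpenalized maximum-likelihood fit. The penalty k*p is replaced by k*(Σc + 1), with the +1 covering the intercept.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 5 candidate predictors, of which x0 and x1
# truly drive the outcome. Costs are illustrative scores on the 1-5 scale.
n = 500
X = rng.normal(size=(n, 5))
logits = 1.5 * X[:, 0] - 1.0 * X[:, 1]
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)
costs = np.array([1, 2, 5, 5, 3])

def cost_aic(X, y, subset, costs, k=2.0):
    """AIC with the penalty k*p replaced by k*(sum of variable costs + 1)."""
    if subset:
        model = LogisticRegression(C=1e6, max_iter=1000).fit(X[:, subset], y)
        p_hat = model.predict_proba(X[:, subset])[:, 1]
    else:
        # Intercept-only model: predicted probability is the prevalence.
        p_hat = np.full(len(y), y.mean())
    eps = 1e-12
    loglik = np.sum(y * np.log(p_hat + eps) + (1 - y) * np.log(1 - p_hat + eps))
    return -2 * loglik + k * (sum(costs[j] for j in subset) + 1)

def forward_select(X, y, costs, k=2.0):
    """Greedy forward selection minimizing the cost-penalized AIC."""
    selected, remaining = [], list(range(X.shape[1]))
    best = cost_aic(X, y, selected, costs, k)
    improved = True
    while improved and remaining:
        improved = False
        scores = {j: cost_aic(X, y, selected + [j], costs, k) for j in remaining}
        j_best = min(scores, key=scores.get)
        if scores[j_best] < best:
            best = scores[j_best]
            selected.append(j_best)
            remaining.remove(j_best)
            improved = True
    return selected, best

subset, score = forward_select(X, y, costs)
print("selected predictors:", sorted(subset))
```

Rescaling the costs from 1-5 to 1-C before calling `forward_select` varies the impact of cost, as in the abstract; with all costs equal to 1, the criterion reduces to the ordinary AIC penalty.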

Result: All three criteria gave similar results. Compared with no penalization, increasing the impact of variable cost by varying C or m strongly reduced the total cost with only limited reductions in training or validation AUC (e.g. -60% versus -5%). The reduction in total cost was driven mainly by the selection of fewer variables. The average cost per selected variable also decreased, as the selection of high-cost predictors was increasingly discouraged.

Conclusion: The straightforward incorporation of variable cost into variable selection for logistic regression can yield markedly cheaper and simpler models with limited loss of discriminatory performance. Further work will focus on other applications, sample size, cross-validated AUC, polytomous diagnosis, and methods other than stepwise selection.

Candidate for the Lee B. Lusted Student Prize Competition