IMPUTATION OF MISSING COVARIATE VALUES IS ALWAYS BETTER THAN DROPPING THE COVARIATE, EVEN WITH MISSING DATA PERCENTAGES OF UP TO 90%
Kristel J.M. Janssen, MSc1, A.R.T. Donders, PhD2, Y. Vergouwe, PhD1, F.E. Harrell, PhD3, D.E. Grobbee1, and K.G.M. Moons1. (1) University Medical Center Utrecht, Utrecht, Netherlands, (2) Copernicus Institute, Utrecht University, Utrecht, Netherlands, (3) Vanderbilt University School of Medicine, Nashville, TN
Purpose Most standard statistical packages exclude subjects with a missing value on any of the variables in the analysis. In addition, many researchers tend to drop a variable from the analysis when it has many missing values. Both strategies often lead to biased results. We used empirical data and simulations to compare multiple imputation of variables with different percentages of missing values against complete case analysis and against dropping the predictor, when developing a multivariable prediction model. Methods The empirical dataset (without any missing values) comprised 805 patients suspected of having deep venous thrombosis (DVT), of whom 160 (20%) had DVT. For our aim, we selected the three strongest predictors of the presence or absence of DVT. We simulated 500 datasets in which missing values were introduced for two predictors (not in both predictors simultaneously). Missing values, varying from 10% to 90%, were generated using a 'missing at random' strategy. Three methods were used for the analyses: complete case analysis, dropping the variable with missing values, and multiple imputation. A multivariable logistic regression model with the presence of DVT as the outcome was fitted in all datasets to estimate the regression coefficients and standard errors of the predictors. The estimates from the model fitted on the original complete dataset were taken as the true values. The three methods were compared on the bias of the regression coefficients, coverage of the confidence intervals (the percentage of 90% confidence intervals that included the true value), power (the percentage of 90% confidence intervals that excluded 0) and discriminative value (concordance (c-) statistic). Results Multiple imputation resulted in less biased regression coefficients, better discriminative value, better coverage and higher power than dropping the variable from the analysis.
Dropping the variable with missing values resulted in severely biased estimates of the regression coefficients of the remaining (non-dropped) predictors. Even with up to 90% missing values, multiple imputation provided less biased results than dropping the variable. In line with earlier findings by others, complete case analysis provided the most severely biased results. Conclusion Multiple imputation should be the method of first choice when dealing with missing values in (medical) research, even when the percentage of missing values is over 50%.
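Illustration The abstract does not report the missingness model or the software used, so the following Python sketch is only one possible reading of two steps in the Methods: generating 'missing at random' values at a target percentage, and summarising bias, coverage and power over simulated datasets using 90% confidence intervals. The logistic missingness model, its slope of 1.0, and all function names (mar_probabilities, apply_missing, summarize) are assumptions made for illustration; the regression fitting and the imputation itself are deliberately left out.

```python
import math
import random
from statistics import NormalDist, mean

# Two-sided 90% confidence interval multiplier (about 1.645).
Z90 = NormalDist().inv_cdf(0.95)


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def mar_probabilities(x_observed, rate, slope=1.0):
    """Per-subject probability of a missing value that depends only on a
    fully observed covariate (missing at random). The intercept is found
    by bisection so that the mean probability equals `rate`.
    The logistic form and the slope are illustrative assumptions."""
    mu = mean(x_observed)
    lo, hi = -30.0, 30.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if mean(sigmoid(mid + slope * (v - mu)) for v in x_observed) < rate:
            lo = mid
        else:
            hi = mid
    intercept = (lo + hi) / 2
    return [sigmoid(intercept + slope * (v - mu)) for v in x_observed]


def apply_missing(x_target, probabilities, rng):
    """Replace values of x_target by None according to the probabilities."""
    return [None if rng.random() < p else x
            for x, p in zip(x_target, probabilities)]


def summarize(estimates, std_errors, true_value):
    """Bias, coverage and power of one coefficient over simulated datasets,
    following the definitions in the Methods (90% confidence intervals)."""
    pairs = list(zip(estimates, std_errors))
    covered = sum(abs(b - true_value) <= Z90 * se for b, se in pairs)
    excludes_zero = sum(abs(b) > Z90 * se for b, se in pairs)
    return {
        "bias": mean(estimates) - true_value,
        "coverage": covered / len(pairs),
        "power": excludes_zero / len(pairs),
    }


if __name__ == "__main__":
    rng = random.Random(1)
    # Synthetic stand-ins for two predictors in 805 patients (not the DVT data).
    x1 = [rng.gauss(0, 1) for _ in range(805)]
    x2 = [rng.gauss(0, 1) for _ in range(805)]
    x2_missing = apply_missing(x2, mar_probabilities(x1, 0.5), rng)
    print(sum(v is None for v in x2_missing) / len(x2_missing))
```

In a full simulation, the estimates and standard errors passed to summarize would come from the logistic regression fitted in each of the 500 datasets under each of the three methods, with the true value taken from the model fitted on the complete data.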