18CSG COMPARATIVE EFFECTIVENESS OF DECISION TREE VERSUS LOGISTIC REGRESSION MODELS IN PREDICTING HIGH HEALTHCARE COSTS

Sunday, October 19, 2008
Columbus A-C (Hyatt Regency Penns Landing)
Michael B. Nichol, PhD (1), Joanne Wu, MS (1), Tara K. Knight, PhD (1), Jack Mahoney, MD (2), and Christine Berman, PhD (2). (1) University of Southern California, Los Angeles, CA; (2) Pitney Bowes, Stamford, CT
Purpose: To compare the accuracy of two data mining methods, decision tree models (DTM) and logistic regression models (LRM), in predicting individuals at risk for high healthcare costs in a large U.S. employer general population.

Method: Employees and dependents with at least two years of health insurance coverage during 2004 to 2006 were included in the analysis. Two types of data mining models were developed using previous years' demographic, clinical, and healthcare utilization data to predict high healthcare cost (≥$10,000) in 2005 and 2006. Approximately 6% of individuals were identified as high cost; an oversampling method was used to obtain a sample in which 50% of individuals were high cost. The sample was partitioned into 70% for training and 30% for validation. The model using 2004 data to predict 2005 costs (2004-2005) was further validated using independent data for 2005-2006. Model effectiveness (misclassification rate, ROC index, sensitivity, specificity, and positive predictive value [PPV]) was compared across model years and types for the validation samples.

Results: Prior years’ healthcare costs and number of prescription fills were significant independent predictors of high cost in the DTM; number of generic drugs taken was the most significant independent predictor in the LRM across model years. ROC indexes were the same between model types for 2005-2006 (ROC = 0.83) and 2004-2006 (ROC = 0.82). DTM (ROC = 0.79) had a slightly lower ROC index than LRM (ROC = 0.81) for 2004-2005. DTM (23%) had a lower misclassification rate than LRM (26%) for 2005-2006, but the misclassification rates were the same between model types for 2004-2005 (27%) and 2004-2006 (25%). Sensitivity ranged from 65% to 73% and was similar between DTM and LRM across model years. Specificity was the same for both DTM and LRM for 2005-2006 (80%), and DTM had 2% and 5% higher specificity than LRM for 2004-2006 and 2004-2005, respectively. PPV ranged from 75% to 81% across model years and types; however, DTM had PPV values 1% to 3% higher than LRM across model years. When the 2004-2005 DTM and LRM were applied to the 2005-2006 data, performance was the same for both models: each had a 23% misclassification rate, 70% sensitivity, 78% specificity, and 17% positive predictive value.

Conclusion: Independent predictors of high prospective healthcare costs varied across model types and model years. However, the effectiveness of DTM and LRM was similar across model years. Both data mining models are useful tools in the identification of high cost individuals.
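
The gap between the PPV in the oversampled validation samples (75% to 81%) and the PPV on the independent 2005-2006 data (17%) can be illustrated with Bayes' rule. The sketch below is not the authors' code. It assumes the independent data had roughly the 6% high cost prevalence stated in the Method, and it uses the rounded sensitivity (70%) and specificity (78%) from the Results; the function names are hypothetical.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value from Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


def misclassification_rate(sensitivity, specificity, prevalence):
    """Share of all individuals who are classified incorrectly."""
    false_neg = (1.0 - sensitivity) * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return false_neg + false_pos


# Rounded values reported for the 2004-2005 models applied to 2005-2006 data.
SENSITIVITY = 0.70
SPECIFICITY = 0.78

# Assumed prevalences: 50% after oversampling, about 6% in the full population.
for label, prevalence in [("oversampled (50%)", 0.50), ("population (6%)", 0.06)]:
    p = ppv(SENSITIVITY, SPECIFICITY, prevalence)
    m = misclassification_rate(SENSITIVITY, SPECIFICITY, prevalence)
    print(f"{label}: PPV = {p:.1%}, misclassification = {m:.1%}")
```

Under these assumptions, the same sensitivity and specificity give a PPV of about 76% at 50% prevalence and about 17% at 6% prevalence, in line with the reported figures. The computed misclassification rate at 6% prevalence is about 22%, close to the reported 23%; the small difference is expected because the inputs are rounded and the prevalence is approximate.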