DO PROPENSITY SCORE METHODS OVERCOME BIAS IN ESTIMATING AVERAGE TREATMENT EFFECTS IN OBSERVATIONAL STUDIES?

Sunday, 23 October 2005 - 11:15 AM

ZHEHUI LUO, PhD¹, JOSEPH C. GARDINER, PhD¹, Cathy J. Bradley, PhD², and CHARLES W. GIVEN, PhD¹. (1) Michigan State University, East Lansing, MI, (2) Virginia Commonwealth University, Richmond, VA

PURPOSE: To compare the properties of different propensity score estimators of the average treatment effect in observational studies with the effects estimated in randomized controlled trials (RCTs).

BACKGROUND: Randomization of patients to treatment or control groups is assumed to balance the groups on both observed and unobserved patient characteristics. In non-randomized studies, patients self-select into treatments, so naïve estimators of the treatment effect are subject to selection bias. Propensity score methods have been used to mitigate the effect of selection bias.

METHODS: We use data from two RCTs (n=237 and n=124) of a cognitive behavioral intervention designed to reduce the severity of symptoms among cancer patients. The benchmark treatment effects are estimated for physical function and mental health function, as assessed by the SF-36. We also have access to a prospective longitudinal study of a cohort of cancer patients who were undergoing chemotherapy but received no cognitive behavioral intervention (n=888). We construct composite samples that combine patients from the two RCTs with comparable non-treated, non-randomized patients from the prospective study. The treatment effect is estimated using stratification, nearest neighbor matching, k-nearest neighbor matching, and bias-corrected k-nearest neighbor matching. Heteroscedasticity-robust standard errors are obtained for each estimator. Our propensity score estimates are then compared to the benchmarks from the RCTs.
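
As a rough illustration of two of the estimators named above (stratification and k-nearest neighbor matching on the propensity score), the following Python sketch may help. It is not the authors' implementation: the data are synthetic, the function names (fit_propensity, knn_match_att, stratified_att) are hypothetical, the propensity model is assumed to be a plain logistic regression, the quantity estimated is assumed to be the effect on the treated, and neither the bias correction nor the robust standard errors are reproduced.

```python
import numpy as np

def fit_propensity(X, t, iters=25):
    """Assumed propensity model: logistic regression fitted by Newton-Raphson.
    Returns the estimated P(treated | X) for every unit."""
    Xd = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(Xd.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Xd @ beta))
        grad = Xd.T @ (t - p)
        hess = (Xd * (p * (1.0 - p))[:, None]).T @ Xd + 1e-8 * np.eye(Xd.shape[1])
        beta += np.linalg.solve(hess, grad)
    return 1.0 / (1.0 + np.exp(-Xd @ beta))

def knn_match_att(y, t, ps, k=1):
    """Match each treated unit to its k nearest controls on the propensity
    score (with replacement) and average the outcome differences."""
    treated = np.flatnonzero(t == 1)
    controls = np.flatnonzero(t == 0)
    dist = np.abs(ps[treated][:, None] - ps[controls][None, :])
    nearest = np.argsort(dist, axis=1)[:, :k]
    matched_outcome = y[controls][nearest].mean(axis=1)
    return float(np.mean(y[treated] - matched_outcome))

def stratified_att(y, t, ps, n_strata=5):
    """Split units into propensity score quantile strata and average the
    within-stratum differences, weighted by the number of treated units."""
    edges = np.quantile(ps, np.linspace(0, 1, n_strata + 1))
    strata = np.clip(np.searchsorted(edges, ps, side="right") - 1, 0, n_strata - 1)
    total, weight = 0.0, 0
    for s in range(n_strata):
        is_t = (strata == s) & (t == 1)
        is_c = (strata == s) & (t == 0)
        if is_t.sum() == 0 or is_c.sum() == 0:
            continue  # stratum lacks treated or control units; skip it
        total += is_t.sum() * (y[is_t].mean() - y[is_c].mean())
        weight += is_t.sum()
    return float(total / weight)

# Synthetic data (not the study data): one confounder x drives both
# treatment uptake and the outcome; the true treatment effect is 2.0.
rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
t = (rng.random(n) < 1.0 / (1.0 + np.exp(-x))).astype(int)
y = 2.0 * t + 1.5 * x + rng.normal(size=n)

ps = fit_propensity(x[:, None], t)
naive = float(y[t == 1].mean() - y[t == 0].mean())
att_match = knn_match_att(y, t, ps, k=5)
att_strat = stratified_att(y, t, ps)
print(naive, att_match, att_strat)
```

In this toy setting the confounder is fully observed, so both estimators should land much closer to the true effect than the naïve difference in means. The study's comparison against RCT benchmarks addresses the harder case in which the comparison sample may differ from the treated group in ways the propensity model does not capture.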

RESULTS: In comparison with the benchmark effects from the RCTs, the propensity score methods produced estimates that varied widely with the choice of comparison sample and outcome. Agreement was closer for physical function than for mental health function. Bias was greater when the comparison sample differed from the treatment group on several patient characteristics. No single propensity score estimator dominated the others in estimating the benchmark effects. However, the bias-corrected k-nearest neighbor matching method yielded standard errors close to those from the RCTs.

CONCLUSIONS: The choice of propensity score matching technique affected the accuracy of average treatment effect estimation. Strikingly, the choice of composite sample had a pronounced effect on the estimates: agreement with the benchmark effect was closer when the comparison samples were similar to the treated group. Our study suggests that propensity score methods do not always yield accurate estimates of average treatment effects and must be used with caution in observational studies.

Oral Concurrent Session H - Methodological Advances
The 27th Annual Meeting of the Society for Medical Decision Making (October 21-24, 2005)