Thursday, October 18, 2012: 1:30 PM-3:00 PM
Regency Ballroom C (Hyatt Regency)

Session Chairs:
Andrew H. Briggs, DPhil and Dominick Frosch, PhD
1:30 PM
Vidit Munshi, MA, Michael E. Gilmore, MBA, Alexander Goehler, MD, MSc, MPH, G. Scott Gazelle, MD, MPH, PhD and Pamela McMahon, PhD, Massachusetts General Hospital, Boston, MA

Purpose: Repeated follow-up imaging examinations for indeterminate pulmonary nodules can have a large impact on patient outcomes, radiation risk, and healthcare costs through resource utilization and physician burden. A pre-existing lung cancer model was used to assess the comparative effectiveness and cost-effectiveness of an older follow-up program versus standard Fleischner Society guidelines for the management of pulmonary nodules, both in the presence and absence of screening.

Method: The Lung Cancer Policy Model (LCPM) is a microsimulation model that simulates individuals' lung cancer development, progression, detection, follow-up, and survival, while accumulating healthcare-related costs. Benign pulmonary nodules and the risk of radiation-induced lung cancer from imaging exams are also simulated. Patients with CT- or CXR-detected nodules (4-8 mm diameter) undergo follow-up CTs at 1, 3, 6, 9, 12, and 24 months. Using the LCPM, trial runs of 500,000 individuals born in 1930 (with US-representative smoking histories) were conducted under the old follow-up program and under a newly designed program based on the Fleischner Society's recommendations. The baseline risk-factor threshold (5 pack-years) in the Fleischner guidelines was varied to include 10, 20, and 30 pack-years. All programs were simulated with no screening, as well as with 1-, 3-, and 10-CT screening programs at yearly intervals beginning at age 65. We compared the outcomes of the various follow-up protocols on the basis of life-years saved and healthcare-related costs.

Result: In the absence of screening, the older follow-up program was strictly dominated by the Fleischner Society guidelines (at all thresholds), which yielded 93,187 additional life-years and reduced costs by over $996 million (baseline threshold, cohort size of 500,000). The total number of CTs for the cohort was reduced by 5.7% (422,763 to 398,684) by switching to the Fleischner follow-up. The Fleischner guidelines also strictly dominated the old follow-up in the presence of screening, with gains in life-years and greater cost savings (2.4%, 2.8%, and 3.5% decreases in total costs with 1.5%, 1.4%, and 1.3% increases in life-years for the 1-, 3-, and 10-year screening programs, respectively).
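The reported CT reduction and the dominance claim can be checked with a few lines of arithmetic (an illustrative check of the abstract's figures, not code from the LCPM itself):

```python
# Sanity-check the reported reduction in CT scans and the strict-dominance
# criterion used to compare follow-up strategies.

def percent_reduction(before: float, after: float) -> float:
    """Percent reduction from `before` to `after`."""
    return 100.0 * (before - after) / before

# Total CTs for the 500,000-person cohort under each follow-up program
cts_old, cts_fleischner = 422_763, 398_684
reduction = percent_reduction(cts_old, cts_fleischner)  # ~5.7%

def strictly_dominates(ly_a, cost_a, ly_b, cost_b):
    """Strategy A strictly dominates B if it yields more life-years at lower cost."""
    return ly_a > ly_b and cost_a < cost_b
```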

Conclusion: Follow-up strategies involving targeted management of pulmonary nodules dominate more aggressive strategies with numerous follow-up CTs, particularly in the presence of screening. While compliance with guidelines varies across institutions, models are an effective tool for comparing current and hypothetical guidelines for clinical and cost-effectiveness and for developing efficient protocols for the management of pulmonary nodules.

1:45 PM
Lauren A. Shluzas, PhD1, Mary K. Goldstein, MD, MS1, Douglas K. Owens, MD, MS1 and John P.A. Ioannidis, MD, PhD2, (1)Veterans Affairs Palo Alto Health Care System and Stanford School of Medicine, Stanford, CA, (2)Stanford School of Medicine, Stanford, CA

Purpose: This research examines cost-effectiveness analyses (CEAs) with comparable target populations, interventions, and comparators, yet disparate incremental cost-effectiveness ratios (ICERs). The goal is to identify the assumptions and parameters used to determine cost-effectiveness, in order to understand the underlying differences in CEA outcomes.
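For readers less familiar with the metric, an ICER is simply the incremental cost divided by the incremental health effect of one intervention over another; a minimal sketch with hypothetical numbers (not drawn from the reviewed CEAs):

```python
# ICER = (cost_new - cost_old) / (effect_new - effect_old), here in $/QALY.

def icer(cost_new, cost_old, qaly_new, qaly_old):
    """Incremental cost-effectiveness ratio in dollars per QALY gained."""
    return (cost_new - cost_old) / (qaly_new - qaly_old)

# e.g. a $50,000 intervention gaining 0.5 QALYs over a $30,000 comparator
example = icer(50_000, 30_000, 10.5, 10.0)  # $40,000 per QALY gained
```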

Methods: From the CEA Registry, we identified three comparative health interventions for which 11 to 24 CEAs had been conducted per comparison. These comprised carotid artery stenting (CAS) vs. carotid endarterectomy (CAE); drug-eluting stents (DES) vs. bare-metal stents (BMS); and varenicline (VAR) vs. bupropion (BUP) for smoking cessation therapy. Of the 46 CEAs identified, we reviewed the 20 CEAs that used quality-adjusted life-years (QALYs) to represent health effects. For each study, we documented eight parameters to identify potential sources of variability among groups: clinical trial setting, patient randomization, trial duration, time horizon, the inclusion of direct vs. indirect costs, the inclusion of post-intervention costs, study perspective, and sponsorship. For each group, we computed the median ICER and interquartile range, and the percent of CEAs reporting cost-effective outcomes. We used Fisher's exact test to examine the strength of associations between variability parameters and cost-effectiveness.

Results: Table 1 presents the median ICER per group (measured by cost per QALY and standardized to US$ 2012), and the percent of studies reporting cost-effective outcomes. The strongest association between study parameters and cost-effectiveness was seen with respect to industry sponsorship: 10 of 12 industry-sponsored studies reported cost-effective outcomes, in comparison to 1 of 7 studies without industry sponsorship (p = 0.003). Outcome variability was also associated with the inclusion vs. exclusion of post-intervention cost data: 11 of 17 analyses that included post-intervention costs reported cost-effective outcomes, in comparison to 0 of 3 studies that included short-term intervention costs only (p = 0.074).
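The sponsorship association can be reconstructed from the counts in the abstract with a standard-library Fisher's exact test (a sketch; the table below is implied by the reported fractions, and the resulting p-value is in the same range as, but not identical to, the reported p = 0.003, which may reflect a slightly different table or test convention):

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]:
    the sum of probabilities of all tables with the same margins that are
    no more likely than the observed one."""
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, row1)
    def p(k):  # P(first cell = k) under the hypergeometric distribution
        return comb(col1, k) * comb(n - col1, row1 - k) / denom
    p_obs = p(a)
    lo, hi = max(0, row1 - (n - col1)), min(row1, col1)
    return sum(p(k) for k in range(lo, hi + 1) if p(k) <= p_obs + 1e-12)

# Table implied by the abstract: 10 of 12 industry-sponsored vs. 1 of 7
# non-sponsored studies reported cost-effective outcomes.
p_value = fisher_exact_two_sided(10, 2, 1, 6)
```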


Conclusions: This research highlights sources of variability in CEAs for three comparative health interventions, and the relationships between variability parameters and cost-effectiveness. The data indicate that industry sponsorship significantly influenced ICERs for the interventions examined. The findings provide investigators with insight into the interpretation of CEAs with mixed outcomes, despite the use of standard methods for assessing cost-effectiveness. Views expressed in this abstract are those of the authors and not necessarily those of the Department of Veterans Affairs.

2:00 PM
David GT Whitehurst, PhD1, Richard Norman, MSc2, John Brazier, PhD3 and Rosalie C. Viney, PhD2, (1)University of British Columbia, Vancouver, BC, Canada, (2)University of Technology, Sydney, Sydney, Australia, (3)School of Health and Related Research, Sheffield, United Kingdom

Purpose: To explore the extent to which the application of a common scoring procedure improves the comparability of EQ-5D and SF-6D responses. Poor agreement between preference-based health-related quality of life instruments has been widely reported across patient and community-based samples. Between-measure discrepancies can be attributed to the descriptive systems of the respective instruments, the valuation techniques used to derive preference weights, or a combination of the two. Research comparing different valuation techniques (e.g. time trade-off (TTO) versus standard gamble (SG)) has demonstrated systematic differences in the resulting index scores. Owing to considerable methodological challenges, little research has attempted to isolate the effect of different descriptive systems on the comparability of index scores.

Method: Scoring algorithms for the EQ-5D and SF-6D were generated using the same discrete choice experiment (DCE) approach, in an online sample representative of the Australian population. Empirical analysis of the relationship between index scores comprised descriptive statistics, assessment of agreement (Bland-Altman plots, intraclass correlation coefficient (ICC)), and exploratory ordinary least squares regressions. The comparative assessment uses the same dataset that compared TTO-derived EQ-5D scores and SG-derived SF-6D scores across 7 patient/population groups, reported by Brazier and colleagues in 2004 (n=2112). This analytic framework enables direct comparison of scenarios in which both the descriptive and valuation systems differ (the 2004 study) and in which only the descriptive systems differ (the current study).
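The Bland-Altman portion of the agreement analysis reduces to the mean paired difference and its 95% limits of agreement; a minimal sketch with hypothetical paired index scores (not the study data):

```python
from statistics import mean, stdev

def bland_altman_limits(x, y):
    """Mean difference (bias) and 95% limits of agreement (mean +/- 1.96 SD)
    for paired scores from two instruments."""
    diffs = [a - b for a, b in zip(x, y)]
    d, sd = mean(diffs), stdev(diffs)
    return d, (d - 1.96 * sd, d + 1.96 * sd)

# Hypothetical paired EQ-5D / SF-6D index scores for four respondents
eq5d = [0.80, 0.70, 0.90, 0.60]
sf6d = [0.60, 0.65, 0.70, 0.55]
bias, (lower, upper) = bland_altman_limits(eq5d, sf6d)
```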

Result: DCE-derived EQ-5D scores were consistently higher than DCE-derived SF-6D scores, with mean differences exceeding 0.17 in each patient/population sample. The ICC for the whole sample was 0.557, indicating ‘fair’ agreement, and ranged from 0.373 to 0.638 within the subsamples. The comparable TTO/SG results: mean scores were within 0.10 in all 7 subsamples (with mean SF-6D scores greater than mean EQ-5D scores in 6 of 7 subgroups); whole-sample ICC = 0.522 (ranging from 0.352 to 0.547).

Conclusion: A common scoring procedure did not reduce the level of disagreement between EQ-5D and SF-6D responses, indicating that the instruments provide substantially different ways for respondents to describe their health state. Accordingly, poor agreement between the instruments is inevitable. Normative unknowns relating to the descriptive components of preference-based measures (e.g. conceptual framing of questions and response options, length of recall etc.) require further attention. Reference:   Brazier J, et al. Health Econ. 2004; 13(9): 873-84

2:15 PM
Joseph A. Ladapo, MD, PhD1, Saul Blecker1, Michael R. Elashoff2, Jerome J. Federspiel3, Mark Monane2, Steven Rosenberg2, Charles E. Phelps4 and Pamela S. Douglas3, (1)NYU School of Medicine, New York, NY, (2)CardioDx, Inc., Palo Alto, CA, (3)Duke University, Durham, NC, (4)University of Rochester, Gualala, CA

Purpose: Exercise testing with myocardial perfusion imaging (MPI) or echocardiography (ECHO) is widely used to risk-stratify patients with suspected coronary artery disease (CAD). However, reports of diagnostic performance do not routinely adjust for referral bias, which results from the preferential referral of higher-risk patients to cardiac catheterization, the gold standard. To understand how this practice may impact test characteristics and clinical decision-making, we systematically reviewed the literature on catheterization referral rates and estimated adjusted measures of diagnostic performance.

Method: We searched PubMed and EMBASE for studies reporting catheterization referral rates after normal or abnormal exercise MPI and ECHO. Findings were pooled with the Mantel-Haenszel fixed-effects model, and we used Bayesian methods developed by Begg and Greenes (Biometrics, 1983) to adjust the exercise test diagnostic performance reported in a widely cited meta-analysis (Fleischmann et al, JAMA 1998). To evaluate the impact of referral bias on overall diagnostic performance, we constructed summary receiver operating characteristic (SROC) curves and calculated positive and negative predictive values over a range of pretest probabilities.
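The core of the Begg-Greenes correction is inverse-probability weighting of the verified 2x2 counts by the referral rates; a sketch with illustrative counts (chosen to give the observed sensitivity/specificity of 85%/77%, not the study's data, and a simplification of the full Bayesian approach used by the authors):

```python
def begg_greenes(a, b, c, d, v_pos, v_neg):
    """Verification-bias-corrected sensitivity and specificity.
    a, b, c, d: verified counts (TP, FP, FN, TN); v_pos / v_neg: the
    probabilities that test-positive / test-negative patients are referred
    to the gold standard (catheterization)."""
    tp, fp = a / v_pos, b / v_pos   # inverse-probability-weighted counts
    fn, tn = c / v_neg, d / v_neg
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return sens, spec

# Illustrative verified counts (observed sens 85%, spec 77%), weighted by
# the pooled referral rates reported in the abstract (30.2% and 2.3%)
sens_adj, spec_adj = begg_greenes(85, 23, 15, 77, 0.302, 0.023)
```

The correction reproduces the direction of the reported adjustment: sensitivity falls sharply while specificity rises toward 1.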

Result: Our literature search yielded 253 citations, of which 10 reported referral patterns in 16,799 patients. Mean age was 60.5 years, 40.3% were women, and 8% had a prior history of myocardial infarction. Catheterization referral rates after normal and abnormal exercise tests were 2.3% (95% CI, 2.0%-2.6%) and 30.2% (95% CI, 29.1%-31.3%), respectively, with an odds ratio for referral after an abnormal test of 10.5 (p<0.001) (Figure). After adjusting for referral, exercise ECHO sensitivity fell from 85% to 33% and specificity rose from 77% to 99%. Similarly, exercise MPI sensitivity fell from 87% to 36% and specificity rose from 64% to 97%. SROC curve analysis demonstrated that the adjustment for referral reduced overall discriminatory power and diagnostic yield. While positive predictive value generally increased, the negative predictive value of a normal exercise test for intermediate-risk patients (CAD pretest probability = 25%) fell from approximately 93% to 81% for both imaging tests.
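The predictive values quoted above follow directly from Bayes' rule applied to the reported sensitivities and specificities; a short check for exercise ECHO at the 25% pretest probability:

```python
def predictive_values(sens, spec, pretest):
    """Positive and negative predictive values via Bayes' rule."""
    ppv = sens * pretest / (sens * pretest + (1 - spec) * (1 - pretest))
    npv = spec * (1 - pretest) / (spec * (1 - pretest) + (1 - sens) * pretest)
    return ppv, npv

# NPV of a normal exercise ECHO at a 25% pretest probability of CAD
_, npv_unadjusted = predictive_values(0.85, 0.77, 0.25)  # ~0.94 before adjustment
_, npv_adjusted = predictive_values(0.33, 0.99, 0.25)    # ~0.82 after adjustment
```

The computed values (about 94% falling to 82%) closely match the approximately 93% to 81% drop reported in the Result.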

Conclusion: Exercise ECHO and MPI have lower diagnostic yield after adjusting for the referral process, and patients with normal test results are at risk for misclassification. Incorporating such adjustments into assessments of exercise test performance not only provides a more accurate evaluation of current and emerging diagnostic technologies, but may also significantly influence clinical decision-making and patient care.

2:30 PM
Alan Schwartz, PhD1, Shoshana Butler1, Sam Lee2, Adam Rosman, BA2 and Maggie Garcia, BA2, (1)University of Illinois at Chicago, Chicago, IL, (2)University of Illinois at Chicago, Chicago, IL

Purpose: To evaluate the operation of the medical risk subscale of the Domain-Specific Risk Taking Scale (DOSPERT) proposed by Schwartz et al. (2012), and to test the hypothesis that medical risk attitudes are distinct from those measured by the DOSPERT health/safety subscale.

Method: Risk taking (RT), risk perception (RP), and benefit perception (BP) were measured using the 36-item DOSPERT scale with the new medical risk subscale (DOSPERT+M), administered to a US-representative online panel. Medical activities included donating blood, donating a kidney, participating in a clinical trial, taking daily allergy medication, knee replacement surgery, and general anesthesia in dentistry. To reduce respondent burden, each of 344 respondents was randomly assigned to two of the three tasks, with task order counterbalanced (RT+RP n=108, RT+BP n=126, RP+BP n=110). We created composite scores for each task for each of the six DOSPERT+M domains (financial, social, ethical, health/safety, recreational, and medical), examined subscale reliability and the correlations between the medical composites and the other domain composites in each task, and fitted multiple linear regression models to assess the impact of demographic differences (gender, ethnicity, age, income, education, marital status) on the medical composites.

Result: The medical subscale evinced moderate interitem consistency (Cronbach's alpha: RT=0.56, RP=0.66, BP=0.74). As hypothesized, correlations between the medical and health/safety domains were small for risk taking (r=.12, p=0.07), risk perception (r=.25, p<.001), and benefit perception (r<.01, p=0.99). In fact, the medical subscale was most strongly associated with attitudes and perceptions of social risks (RT r=0.41, RP r=0.46, BP r=0.53). We found no demographic differences in willingness to take medical risks. Hispanic respondents gave slightly higher average ratings of riskiness for medical activities than Caucasian respondents (standardized regression coefficient Beta=0.15, p=.04), and separated respondents gave higher ratings than married respondents (Beta=0.15, p=.04). Women gave higher average ratings of benefit for medical activities than men (Beta=.15, p=.023), as did respondents with higher household incomes (Beta=.17, p=0.29). These patterns differ substantially from the demographic associations with mean responses to the social risk scale.
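The reliability figures above use Cronbach's alpha, which can be computed directly from a respondents-by-items score matrix; a sketch on toy data (not the panel responses):

```python
from statistics import variance

def cronbach_alpha(items):
    """Cronbach's alpha for a respondents x items score matrix:
    alpha = k/(k-1) * (1 - sum(item variances) / variance(total score))."""
    k = len(items[0])
    item_vars = [variance([row[j] for row in items]) for j in range(k)]
    total_var = variance([sum(row) for row in items])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Toy ratings (4 respondents x 3 items) whose items covary perfectly,
# so alpha comes out at its maximum of 1.0
ratings = [[1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]
alpha = cronbach_alpha(ratings)
```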

Conclusion: The DOSPERT health/safety subscale does not appear to measure the attitudes and perceptions associated with typical medical activities faced by patients. Instead, attitudes toward medical activities appear to be associated with attitudes toward social risks, which may reflect the interpersonal impact of many medical decisions, but they demonstrate different patterns of individual difference.

2:45 PM
Olga Kostopoulou, PhD, Andrea Rosen, MSc, Thomas Round, MBChB, Ellen Wright, MBChB and Brendan C. Delaney, MD, King's College London, London, United Kingdom

Purpose: To assess the effectiveness of two modes of diagnostic support in family medicine: 1) suggestion of relevant diagnoses to consider at the beginning of the clinical encounter (“suggesting”) and 2) alert about diagnoses to exclude at the end of the encounter (“alerting”).

Method: We designed 9 detailed patient scenarios presenting one of 3 commonly misdiagnosed complaints, in a 3x3x3 factorial design: experimental condition (control, suggesting, alerting) x complaint (chest pain, abdominal pain, dyspnea) x case difficulty (easy, moderate, difficult). The study was powered to detect a 10% increase in diagnostic accuracy over control (N=297). The scenarios were presented to family physicians on a computer over the Internet, while they were on the phone with a researcher. After reading some initial patient information on their screen, physicians could request further information in order to diagnose. The researcher selected the answer from a list, and this was displayed to the physician. The suggesting list was presented after the patient's main complaint and then disappeared (it could be recalled at will). The alerting list was presented only after physicians gave a diagnosis (which they could change following the alert).

Result: Current analyses, based on 256 participants (86% of the final sample), find a 5% overall increase in mean diagnostic accuracy with “suggesting” but no increase with “alerting” over control. In a logistic regression model that accounted for physician clustering and adjusted for case difficulty, the odds ratio of diagnosing correctly with “suggesting” was 1.3 (95% CI: 1.07–1.60, P=0.020). There was a significant correlation between the amount of information elicited and mean accuracy (Pearson r=0.40, P<0.0001). There was no difference in the amount of information elicited between experimental conditions (P=0.67).
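As a sanity check on how an odds ratio maps onto an absolute accuracy change, a small helper (the 50% baseline accuracy is an assumption for illustration; the abstract's 5% figure is the adjusted estimate from the clustered model, not this crude conversion):

```python
def apply_odds_ratio(p_baseline, odds_ratio):
    """Probability after multiplying the baseline odds by an odds ratio."""
    odds = p_baseline / (1 - p_baseline) * odds_ratio
    return odds / (1 + odds)

# With an assumed 50% control-arm accuracy, an OR of 1.3 corresponds to
# roughly a 6.5-point absolute increase in the probability of a correct diagnosis
p_suggesting = apply_odds_ratio(0.50, 1.3)  # ~0.565
```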

Conclusion: We found a modest effect of early suggestions of diagnoses to consider on family physicians’ accuracy, without an increase in the amount of information gathered. An appropriately developed computerized diagnostic support system, integrated with the patient record, that would activate automatically once the reason for encounter is entered, has the potential to improve diagnostic accuracy. In contrast, a system that monitors the information that the physician elicits during the encounter and alerts about further diagnoses to exclude is not likely to improve accuracy. It seems difficult to make physicians question their diagnosis once they have settled on it.