
Wednesday, 20 October 2004

This presentation is part of: Poster Session - Utility Theory; Health Economics; Patient & Physician Preferences; Simulation; Technology Assessment

CREATING A SYNTHETIC POPULATION OF INDIVIDUALS FROM PUBLIC AND DE-IDENTIFIED DATA USING A MODIFIED ITERATIVE PROPORTIONAL FITTING ALGORITHM

Douglas B. Fridsma, MD, PhD, University of Pittsburgh, Center for Biomedical Informatics, Pittsburgh, PA and Mark S. Roberts, MD, University of Pittsburgh, Section of Decision Sciences and Clinical Systems Modeling, Pittsburgh, PA.

Purpose: The Health Insurance Portability and Accountability Act of 1996 (HIPAA) extended important privacy protections over an individual’s health information, but made it difficult to use that information for research purposes. In this report, we describe the application of an iterative proportional fitting (IPF) algorithm to census and hospital discharge data to create a synthetic population of individuals. Using only public and de-identified medical data, we created a synthetic population that was statistically equivalent to the real population of patients in Allegheny County and suitable for agent-based and micro-simulation studies.

Methods: We used three sources of data to create our synthetic population: the 1990 Census Public Use Microdata Sample (PUMS), the Census Summary Tape File 3A (STF-3A), and de-identified hospital discharge data from the MARS database at the University of Pittsburgh. For the datasets (PUMS and STF-3A) in which we had only summary statistical data and in which the cross-tabulations shared data elements with the synthetic population, we used a modified IPF algorithm to integrate this information and derive individual-level data. For the datasets in which we had individual-level but de-identified data (MARS), we used a probabilistic matching algorithm to integrate this information into the synthetic population. Age, gender, nationality, and zip code were used as identifiers to match records in our synthetic population with records obtained from MARS. We made no assumptions as to the underlying joint distributions of the data fields. This approach simplified the task of creating the synthetic population and generated one solution (of the many that are possible) that fit the data used for the synthetic population.
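To illustrate the general idea behind IPF, the following is a minimal sketch of the classical two-way procedure, not the modified algorithm used in this work, whose details are not given here. It rescales a seed cross-tabulation (such as one drawn from a microdata sample) until its row and column totals agree with known summary totals. The function name `ipf`, the tolerance, and all numbers are hypothetical and chosen only for illustration.

```python
def ipf(seed, row_targets, col_targets, tol=1e-9, max_iter=1000):
    """Classical two-way iterative proportional fitting (illustrative only).

    seed        -- 2-D list of positive cell counts (e.g. from a sample)
    row_targets -- known row totals (e.g. from a summary table)
    col_targets -- known column totals; must sum to the same grand total
    """
    table = [row[:] for row in seed]  # work on a copy of the seed
    for _ in range(max_iter):
        # Step 1: scale each row so it sums to its target total.
        for i, target in enumerate(row_targets):
            row_sum = sum(table[i])
            if row_sum > 0:
                factor = target / row_sum
                table[i] = [v * factor for v in table[i]]
        # Step 2: scale each column so it sums to its target total.
        for j, target in enumerate(col_targets):
            col_sum = sum(row[j] for row in table)
            if col_sum > 0:
                factor = target / col_sum
                for row in table:
                    row[j] *= factor
        # Columns are exact after step 2, so only the rows need checking.
        error = max(abs(sum(table[i]) - t) for i, t in enumerate(row_targets))
        if error < tol:
            break
    return table


# Hypothetical example: a 2x2 seed table (age group by sex) fitted to
# made-up marginal totals for a population of 1,000.
seed = [[10.0, 20.0], [30.0, 40.0]]
row_targets = [400.0, 600.0]
col_targets = [450.0, 550.0]
fitted = ipf(seed, row_targets, col_targets)
```

The fitted table keeps the association pattern of the seed while reproducing the target marginals. Many tables share the same marginals, so the result is one admissible solution rather than a unique one.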

Results: Using both IPF and a probabilistic matching algorithm, we created 1.3 million synthetic individual- and household-level records representative of Allegheny County. The dataset was constructed using only public census and de-identified data, yet contained detailed individual- and household-level data. We maintained the marginal statistics for the data, but filled in the cells of the tables with anonymized but representative data.
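The matching step described in the Methods can be pictured with a short sketch. It assumes that a de-identified record is attached to a synthetic individual chosen uniformly at random among those sharing the same identifier values; the actual matching probabilities used in this work are not specified here. The function `match_records`, the field names, and the sample records are all hypothetical.

```python
import random
from collections import defaultdict


def match_records(synthetic, discharges, keys, rng):
    """Assign each de-identified record to a random synthetic individual
    with identical values on the shared identifier fields (assumed rule)."""
    # Index synthetic individuals by their tuple of identifier values.
    index = defaultdict(list)
    for person_id, person in enumerate(synthetic):
        index[tuple(person[k] for k in keys)].append(person_id)

    assignments = {}
    for record_id, record in enumerate(discharges):
        candidates = index.get(tuple(record[k] for k in keys), [])
        if candidates:  # records with no candidate stay unmatched
            assignments[record_id] = rng.choice(candidates)
    return assignments


# Made-up records for illustration only.
keys = ("age", "gender", "zip")
synthetic = [
    {"age": 34, "gender": "F", "zip": "15213"},
    {"age": 34, "gender": "F", "zip": "15213"},
    {"age": 70, "gender": "M", "zip": "15217"},
]
discharges = [
    {"age": 34, "gender": "F", "zip": "15213", "diagnosis": "asthma"},
    {"age": 70, "gender": "M", "zip": "15217", "diagnosis": "pneumonia"},
    {"age": 50, "gender": "M", "zip": "15101", "diagnosis": "fracture"},
]
assignments = match_records(synthetic, discharges, keys, random.Random(0))
```

Because the choice among candidates is random, the link between a discharge record and a synthetic individual carries no information about any real person beyond the shared identifier values.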

Conclusions: This work showed that IPF is a suitable technique for generating household- and individual-level data sets of patients from publicly available and de-identified data. It applied to medical datasets a number of well-tested mathematical processes that have previously been used for census data.


See more of The 26th Annual Meeting of the Society for Medical Decision Making (October 17-20, 2004)