NATURAL LANGUAGE PROCESSING FOR AUTOMATIC COHORT SELECTION

Gonzalez, Graciela

All medical research starts with the selection of a cohort of patients with the disease or finding of interest. Currently, this is done by selecting appropriate ICD-9 and CPT codes, which are discrete data fields in the electronic medical record. However, sole reliance on diagnosis-based and procedure-related codes for cohort identification misses cases of interest because of the inherent inadequacies and limited scope of these coding tools. Codes do not exist for every concept a clinical researcher might investigate, and because the codes are applied in the clinical context for billing purposes, they may be applied incompletely. We present the framework of a natural language processing (NLP) module that extracts relevant patient cohorts from the narrative text of pediatric emergency room encounters, and will test the accuracy and expressiveness of this approach against traditional ICD code-based queries. As a proof of concept, we chose to study concepts that can be coded as well as concepts that have no existing code. Defining cohorts beyond the limitations of coding could have a profound impact on the development of prevention strategies and health policy initiatives. We will examine the extent to which reliance on ICD-9 and CPT codes for cohort selection underestimates or overestimates cohort size in clinical research.
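
The comparison described above, a code-based cohort versus a cohort drawn from narrative text, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Encounter` record, the helper names, the four sample encounters, and the use of a plain keyword match as a stand-in for the NLP module, whose actual design is not described here. A real system would also need to handle negation, family history, and spelling variants.

```python
import re
from dataclasses import dataclass


@dataclass
class Encounter:
    """Hypothetical record: one emergency room encounter."""
    encounter_id: str
    codes: set[str]   # billing codes attached to the encounter
    note: str         # free-text narrative


def cohort_by_codes(encounters, target_codes):
    """Encounters carrying at least one of the target codes."""
    return {e.encounter_id for e in encounters if e.codes & target_codes}


def cohort_by_text(encounters, pattern):
    """Encounters whose narrative matches the pattern (keyword stand-in for NLP)."""
    rx = re.compile(pattern, re.IGNORECASE)
    return {e.encounter_id for e in encounters if rx.search(e.note)}


def compare(code_cohort, text_cohort):
    """Split the two cohorts into overlap and the cases each method alone finds."""
    return {
        "both": code_cohort & text_cohort,
        "text_only": text_cohort - code_cohort,   # candidates missed by codes
        "code_only": code_cohort - text_cohort,   # candidates missed by text
    }


# Invented sample data, not drawn from any real record.
ENCOUNTERS = [
    Encounter("e1", {"493.90"}, "Known asthma, presents with wheezing."),
    Encounter("e2", set(), "Wheezing and cough; history of asthma per mother."),
    Encounter("e3", {"493.90"}, "Follow-up visit, breathing comfortably."),
    Encounter("e4", set(), "Fell from bicycle, laceration to left knee."),
]

code_cohort = cohort_by_codes(ENCOUNTERS, {"493.90"})
text_cohort = cohort_by_text(ENCOUNTERS, r"\basthma\b")
result = compare(code_cohort, text_cohort)
```

In this toy data the `text_only` group holds an encounter that a code query would miss, and the `code_only` group holds one the keyword match would miss; the relative sizes of those two groups are one simple way to express how far a code-only query underestimates or overestimates a cohort.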

SYM2-3 NATURAL LANGUAGE PROCESSING FOR AUTOMATIC COHORT SELECTION