Purpose: Patients' country of origin can have a major impact on their health outcomes in the US. This information, however, is difficult to capture from the electronic medical record (EMR). Our objective was to develop and validate a computer-based algorithm to identify foreign-born status from the EMR to facilitate comparative effectiveness research in HIV-infected populations.
Methods: We queried the EMR registry of a large US health care system to identify HIV-infected patients who were linked to HIV care (≥1 outpatient encounter) between January 1, 2001 and March 31, 2012 (N=2,813). We developed a three-stage algorithm for identifying foreign-born patients in this cohort (Figure). In stage 1, we classified patients clearly coded as non-English speaking in the EMR-coded demographics as foreign-born (N=242). In stage 2, we ran an automated search of the free-text EMR notes of the remaining 2,571 patients for specific keywords potentially associated with place of birth and language spoken. Those with no keyword occurrences were classified as US-born (N=766). In stage 3, for the remaining patients (N=1,805), we retrieved and reviewed a 100-character window of text (or “token”) around each keyword to determine place of birth and language spoken. To validate the algorithm, we asked all outpatient HIV providers (N=37) to classify their HIV-infected patients (limited to 50 patients each, total N=957) as foreign- or US-born and to rate their confidence in each assessment (confident vs. not confident). We compared the algorithm results with these primary physician classifications and calculated sensitivity and specificity using physician classification as the gold standard.
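The three stages described in the Methods can be sketched roughly as follows. This is a minimal illustration, not the study's implementation: the keyword list, the `Patient` fields, and the choice to center the 100-character window on the keyword are all assumptions, since the abstract does not specify them. Stage 3 in the study involved review of the retrieved text, so the sketch only collects the windows and flags the patient for review.

```python
import re
from dataclasses import dataclass, field

# Hypothetical keywords; the abstract does not list the ones actually used.
KEYWORDS = ["born in", "immigrated", "emigrated", "interpreter", "native language"]
KEYWORD_RE = re.compile("|".join(re.escape(k) for k in KEYWORDS), re.IGNORECASE)

WINDOW = 100  # total characters retrieved around a keyword (assumed centered)


@dataclass
class Patient:
    patient_id: str
    coded_language: str  # hypothetical EMR demographic field
    notes: list = field(default_factory=list)  # free-text note strings


def extract_windows(text, window=WINDOW):
    """Return a text window of up to `window` characters around each keyword hit."""
    half = window // 2
    windows = []
    for match in KEYWORD_RE.finditer(text):
        center = (match.start() + match.end()) // 2
        start = max(0, center - half)
        windows.append(text[start:center + half])
    return windows


def classify(patient):
    """Return (label, windows) following the three-stage logic."""
    # Stage 1: clearly coded as non-English speaking -> foreign-born.
    # A blank or missing code is not treated as "clearly coded".
    language = patient.coded_language.strip().lower()
    if language and language != "english":
        return "foreign-born", []

    # Stage 2: keyword search of free-text notes; no hits -> US-born.
    windows = [w for note in patient.notes for w in extract_windows(note)]
    if not windows:
        return "US-born", []

    # Stage 3: keyword hits are handed off for review of the surrounding text.
    return "needs-review", windows
```

Under these assumptions, a patient coded as Spanish-speaking is labeled foreign-born at stage 1, a patient whose notes contain none of the keywords is labeled US-born at stage 2, and a note such as "Pt was born in Ohio" sends the patient to stage 3 review, where the window makes clear the keyword does not indicate foreign birth.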
Results: We excluded 160 of 957 patients because physicians indicated that the patient was not HIV-infected (N=54), was “not my patient” (N=103), or had an unknown place of birth (N=3), leaving 797 for analysis. The algorithm classified 31% (N=248) of these patients as foreign-born, and physicians classified 27% (N=218) as foreign-born. Sensitivity of the algorithm was 94% (95% CI ±1.7%) and specificity was 91% (95% CI ±2.0%), with 92% (95% CI ±1.9%) correctly classified. Physicians were not confident in 10% of their classifications; algorithm performance improved slightly when these patients were excluded (sensitivity 97% [95% CI ±1.3%], specificity 92% [95% CI ±2%]).
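For readers who want to see how such metrics are derived from a 2×2 table against a gold standard, a minimal sketch follows. The abstract reports neither the cell counts nor the confidence interval method, so the counts in the usage note are purely illustrative and the normal-approximation (Wald) interval is an assumption.

```python
import math


def wald_half_width(p, n, z=1.96):
    """Half-width of a normal-approximation (Wald) 95% CI for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)


def diagnostic_metrics(tp, fp, fn, tn):
    """Compute (estimate, CI half-width) for each metric.

    tp/fn: gold-standard foreign-born patients the algorithm labeled
           foreign-born / US-born.
    tn/fp: gold-standard US-born patients the algorithm labeled
           US-born / foreign-born.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    return {
        "sensitivity": (sensitivity, wald_half_width(sensitivity, tp + fn)),
        "specificity": (specificity, wald_half_width(specificity, tn + fp)),
        "accuracy": (accuracy, wald_half_width(accuracy, total)),
    }
```

Calling `diagnostic_metrics(tp=90, fp=20, fn=10, tn=180)` with these made-up counts returns an estimate and a half-width for each metric; the counts are not taken from the study.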
Conclusions: A computer-based algorithm to ascertain foreign-born status in a large patient data registry accurately identified foreign-born individuals. This approach can be used to increase opportunities for EMR-based outcomes research.