A predictive algorithm to identify ever smoking in medical claims-based epidemiologic studies

PURPOSE: To develop and validate an algorithm that estimates the probability of ever smoking from administrative claims data.

METHODS: Using population-based samples of Medicare-aged individuals (121,278 Behavioral Risk Factor Surveillance System survey respondents and 207,885 Medicare beneficiaries), we developed a logistic regression model to predict the probability of ever smoking from demographic and claims data. We applied the model to 1,657,266 additional Medicare beneficiaries and calculated the area under the receiver operating characteristic curve (AUC), using the presence or absence of a tobacco-specific diagnosis or procedure code as our "gold standard." In the full algorithm, these "gold standard" codes and lung/laryngeal cancer codes override the predicted probability, setting it to 100%. We calculated Spearman's rho between the probability from this full algorithm and smoking as assessed in prior Parkinson disease studies by substituting our observed and the prior ("true") smoking-Parkinson disease odds ratios into the attenuation equation.

RESULTS: The predictive model contained 23 variables, including basic demographics, high alcohol consumption, asthma, cardiovascular disease and associated risk factors, selected cancers, and indicators of routine medical usage. The AUC was 67.6% (95% confidence interval 67.5%-67.7%) when comparing predicted smoking probability with tobacco-specific diagnosis or procedure codes. Spearman's rho for the full algorithm was 0.82.

CONCLUSIONS: Ever smoking might be approximated in administrative data for use as a continuous, probabilistic variable in epidemiologic analyses.
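The workflow described in the abstract (a logistic model, an AUC check against tobacco-specific codes, then an override to 100%) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the feature names, and every coefficient value are hypothetical placeholders, and the published 23-variable model is not reproduced here.

```python
import math


def predicted_probability(intercept, coefficients, features):
    """Logistic model: p = 1 / (1 + exp(-(b0 + sum(b_i * x_i)))).

    `coefficients` and `features` are dicts keyed by variable name.
    All names and values passed in are illustrative assumptions.
    """
    linear = intercept + sum(
        beta * features.get(name, 0.0) for name, beta in coefficients.items()
    )
    return 1.0 / (1.0 + math.exp(-linear))


def auc(scores, labels):
    """AUC as the probability that a randomly chosen positive outranks a
    randomly chosen negative, with ties counted as one half.

    Simple O(n_pos * n_neg) version for illustration only.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def apply_override(probability, has_tobacco_code, has_lung_or_laryngeal_cancer):
    """Full algorithm step: either code type sets the probability to 100%."""
    if has_tobacco_code or has_lung_or_laryngeal_cancer:
        return 1.0
    return probability


# Hypothetical coefficients and beneficiaries, for illustration only.
coefs = {"male": 0.6, "asthma": 0.3}
people = [
    {"x": {"male": 1, "asthma": 1}, "tobacco_code": True, "cancer": False},
    {"x": {"male": 1, "asthma": 0}, "tobacco_code": False, "cancer": True},
    {"x": {"male": 0, "asthma": 0}, "tobacco_code": False, "cancer": False},
]

model_probs = [predicted_probability(-0.5, coefs, p["x"]) for p in people]
# The AUC uses the model probability *before* the override, with
# tobacco-specific codes as the reference standard.
model_auc = auc(model_probs, [p["tobacco_code"] for p in people])
final_probs = [
    apply_override(pr, p["tobacco_code"], p["cancer"])
    for pr, p in zip(model_probs, people)
]
```

The AUC is computed before the override because the tobacco-specific codes serve as the reference standard in that comparison; once the override is applied, those codes determine the score for everyone who has them.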

Publication Title

Annals of Epidemiology