A Pilot Project Leveraging Large Language Models for Automated Screening and Variable Extraction in Observational Studies

AI Summary

Developed modular LLM pipelines LitScreen and VarEx to automate screening and variable extraction for observational systematic reviews.
On the primary hypertension-to-ADRD reference set, VarEx achieved covariate-level precision 0.80, recall 0.79, and F1 0.76, with classification accuracy 0.97 and similar performance on the other validation datasets.
LitScreen maintained high recall while cutting screening and extraction time by roughly 80 to 90 percent relative to manual review, improving efficiency and reproducibility.

CONCLUSION: A retrieval-augmented LLM framework can automate major components of screening and variable extraction for observational systematic reviews, generating reusable structured covariate inventories that integrate with causal confounder assessment tools and substantially improve the efficiency and reproducibility of evidence synthesis, while remaining an assistant to, rather than a replacement for, human reviewers.

medRxiv [Preprint]. 2026 Jun 24:2026.06.13.26355589. doi: 10.64898/2026.06.13.26355589.

ABSTRACT

BACKGROUND: Systematic reviews of observational studies are central to causal inference in chronic disease epidemiology but are increasingly limited by the scale of the literature and heterogeneity in confounder control. There is a need for transparent, open methods that reduce screening burden and make reported exposures, outcomes, and covariates comparable across studies.

OBJECTIVE: To develop and evaluate modular LLM-based pipelines, LitScreen and VarEx, that automate study screening and variable extraction for observational systematic reviews across multiple use cases, including hypertension as a primary exposure with Alzheimer’s disease and related dementias (ADRD) as outcomes, and posttraumatic stress disorder (PTSD) as the exposure with self-harm, self-injury, and suicidality as outcomes.

METHODS AND MATERIALS: We built an end-to-end workflow in which reproducible MEDLINE via Ovid queries yield RIS corpora that are processed by LitScreen, a three-phase screening pipeline combining abstract-level evidence extraction, criterion-wise inclusion adjudication with high-recall gates, and full-text retrieval-augmented verification. Screened-in articles enter VarEx, a retrieval-augmented extraction pipeline that identifies role-specific passages and performs evidence-grounded extraction and semantic classification of exposures, outcomes, and covariates into predefined categories aligned with Metaconfoundr. Performance was evaluated on six labeled SYNERGY datasets and expert-annotated hypertension-to-ADRD and education-to-dementia corpora using precision, recall, F₁, a strict score requiring correct variable identity and category, and time-per-reference estimates.
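The evaluation metrics named above can be illustrated with a minimal sketch. It assumes set-based matching of normalized variable names per study, which the abstract does not specify; the `Variable` type, the `score_study` helper, and the category labels are hypothetical and are not taken from VarEx or Metaconfoundr.

```python
from typing import NamedTuple


class Variable(NamedTuple):
    name: str      # normalized variable name (assumed matching key)
    category: str  # hypothetical category label, not Metaconfoundr's actual scheme


def prf(predicted: set, gold: set) -> tuple[float, float, float]:
    """Set-based precision, recall, and F1; returns 0.0 where a ratio is undefined."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def score_study(predicted: list[Variable], gold: list[Variable]) -> dict:
    """Score one study on variable identity alone and on identity plus category."""
    identity = prf({v.name for v in predicted}, {v.name for v in gold})
    strict = prf(set(predicted), set(gold))  # name AND category must both match
    return {"identity": identity, "strict": strict}


# Illustrative data only; these are not values from the study.
gold = [
    Variable("age", "demographic"),
    Variable("smoking", "lifestyle"),
    Variable("diabetes", "comorbidity"),
    Variable("education", "socioeconomic"),
]
predicted = [
    Variable("age", "demographic"),
    Variable("smoking", "comorbidity"),   # right variable, wrong category
    Variable("diabetes", "comorbidity"),
    Variable("bmi", "anthropometric"),    # not in the gold annotation
]
scores = score_study(predicted, gold)
print(scores)
```

In this toy case the identity-level scores are 0.75 for precision, recall, and F₁, while the strict scores drop to 0.5 because one correctly identified variable carries the wrong category.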

RESULTS: In the primary hypertension-to-ADRD reference set, VarEx achieved covariate-level precision of 0.80, recall of 0.79, and F₁ of 0.76, with classification accuracy of 0.97 and similar performance for education-to-dementia and SYNERGY validation datasets. LitScreen preserved high recall while excluding most ineligible records and reduced total screening and extraction time by roughly 80-90 percent relative to manual review baselines by routing only uncertain or borderline citations to full-text verification.
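One way to picture the routing described in the results is a criterion-wise gate that excludes a record only when the abstract gives confident evidence against eligibility. The sketch below is an assumption-laden illustration, not LitScreen's implementation: the verdict labels, the confidence field, and both thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CriterionVerdict:
    criterion: str     # e.g. "observational design" (illustrative)
    verdict: str       # assumed labels: "met", "not_met", "unclear"
    confidence: float  # assumed score in [0, 1]


def route(verdicts: list[CriterionVerdict],
          exclude_threshold: float = 0.9,
          include_threshold: float = 0.9) -> str:
    """Return "exclude", "include", or "full_text" for one citation."""
    # High-recall gate: drop a record only on a confident failure of some criterion.
    if any(v.verdict == "not_met" and v.confidence >= exclude_threshold
           for v in verdicts):
        return "exclude"
    # Include directly only when every criterion is confidently met.
    if verdicts and all(v.verdict == "met" and v.confidence >= include_threshold
                        for v in verdicts):
        return "include"
    # Anything uncertain or borderline goes on to full-text verification.
    return "full_text"


clear_include = [CriterionVerdict("design", "met", 0.95),
                 CriterionVerdict("exposure", "met", 0.92)]
clear_exclude = [CriterionVerdict("design", "met", 0.95),
                 CriterionVerdict("exposure", "not_met", 0.97)]
borderline = [CriterionVerdict("design", "met", 0.95),
              CriterionVerdict("exposure", "not_met", 0.60)]
print(route(clear_include), route(clear_exclude), route(borderline))
```

Under these assumptions a low-confidence "not met" verdict never excludes a record by itself, which is what keeps recall high at the cost of sending more borderline citations to the full-text phase.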

PMID:42369518 | PMC:PMC13308080 | DOI:10.64898/2026.06.13.26355589
