Comparative Evaluation of Pretrained Large Language Models for Suicide Risk Prediction from Clinical Notes in U.S. Veterans

AI Summary

Pretrained LLMs outperformed bag-of-words in seven of nine risk tier and time window combinations, achieving a maximum AUROC of 0.644 using text alone.
Incorporating structured clinical variables with LLM text representations further improved discrimination to AUROC 0.748.
Model interpretation highlighted suicide-related language, particularly in notes within 30 days among patients classified as high risk.

CONCLUSIONS: Pretrained LLMs can extract clinically meaningful information from narrative documentation, providing a foundation for future work adapting to additional clinical contexts and nuanced temporal associations to improve suicide risk prediction.

medRxiv [Preprint]. 2026 Jun 18:2026.06.16.26355804. doi: 10.64898/2026.06.16.26355804.

ABSTRACT

BACKGROUND: Suicide remains a significant and potentially preventable cause of death among United States veterans. Predictive models based on structured electronic health record (EHR) data, including the U.S. Department of Veterans Affairs’ Recovery Engagement and Coordination for Health-Veterans Enhanced Treatment (REACH-VET) program, aim to identify individuals at elevated risk for enhanced monitoring and follow-up. Increasing evidence suggests that unstructured clinical narratives contain additional psychosocial information that may enhance risk prediction when analyzed using natural language processing (NLP). However, optimal approaches for representing clinical text remain uncertain. Recent advances in large language models (LLMs) enable contextual text representations that capture complex semantic relationships beyond traditional lexical methods.

METHODS: We compared the predictive performance of pretrained LLMs with classical bag-of-words (BoW) representations for suicide risk prediction using clinical notes from 27,241 veterans receiving care in the Veterans Health Administration. Patients were stratified by REACH-VET risk tier (low, moderate, high), and models were evaluated across prediction windows defined by note look-back periods (<30, <90, and <270 days).

RESULTS: LLM-based representations outperformed BoW approaches in seven of the nine combinations of risk tier and time window, achieving a maximum AUROC of 0.644 from text alone. Adding structured clinical variables improved performance further (AUROC=0.748). Model interpretation highlighted suicide-related language, especially in notes documented within 30 days of the outcome among patients classified as high risk.
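
The stratified comparison described above (three risk tiers by three look-back windows, each scored by AUROC) can be illustrated with a minimal sketch. This is not the authors' pipeline: the record layout, the function names `auroc`, `auroc_grid`, and `count_wins`, and the treatment of the windows as plain integers are all assumptions made for illustration, and any scores passed in would come from whatever text model is being evaluated.

```python
from collections import defaultdict
from itertools import product

# Strata named in the abstract; representing windows as day counts is an assumption.
TIERS = ("low", "moderate", "high")
WINDOWS = (30, 90, 270)


def auroc(labels, scores):
    """AUROC as the probability that a random positive case outscores
    a random negative case, with ties counted as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # undefined unless both classes are present
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auroc_grid(records):
    """Compute one AUROC per (tier, window) cell.

    `records` is a hypothetical layout: dicts with keys
    'tier', 'window', 'label' (0 or 1), and 'score' (model output).
    """
    cells = defaultdict(lambda: ([], []))
    for r in records:
        labels, scores = cells[(r["tier"], r["window"])]
        labels.append(r["label"])
        scores.append(r["score"])
    return {
        cell: auroc(*cells[cell]) if cell in cells else None
        for cell in product(TIERS, WINDOWS)
    }


def count_wins(grid_a, grid_b):
    """Count cells where representation A has a strictly higher AUROC than B."""
    return sum(
        1
        for cell, a in grid_a.items()
        if a is not None and grid_b.get(cell) is not None and a > grid_b[cell]
    )
```

Given one grid per text representation (for example, one built from LLM-derived scores and one from BoW-derived scores), `count_wins(llm_grid, bow_grid)` yields the kind of "k of nine combinations" tally reported in the results. The pairwise AUROC used here is quadratic in cohort size and is meant only to make the definition explicit; a rank-based implementation would be preferable at the scale of the study.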

PMID:42369523 | PMC:PMC13308335 | DOI:10.64898/2026.06.16.26355804
