Personalizing Suicide Risk Assessment: Machine Learning Extraction of Cross-Modal Interactions Between Psychosocial and Demographic Factors in Veterans

AI Summary

Balanced integration of BoW-derived NLP features with structured EHR data reveals cross-modal psychosocial-demographic interactions that improve suicide risk prediction, notably for low- and medium-risk veterans.
Bag-of-words representation outperformed prior semantic lexicon (SÉANCE) approaches in extracting informative clinical language from notes.
Interpretable machine learning and temporal cohort stratification reduced the performance gap between interpretable and more complex models and established a benchmark for future deep learning on clinical narratives.

DISCUSSION AND CONCLUSIONS: Together, these findings demonstrate the utility of interpretable NLP methods for uncovering clinically meaningful interactions between psychosocial and demographic factors in suicide risk and establish a strong benchmark for future deep learning approaches aimed at capturing richer contextual and temporal information from clinical narratives.

Basic summary

medRxiv [Preprint]. 2026 Jun 18:2026.06.16.26355796. doi: 10.64898/2026.06.16.26355796.

ABSTRACT

BACKGROUND: Veterans face an elevated risk of suicide compared to the general population, motivating national efforts to develop predictive models that can guide proactive care. Current models used by the U.S. Department of Veterans Affairs (VA) rely primarily on structured electronic health record (EHR) data, though clinical notes contain rich contextual information that can be quantified using natural language processing (NLP) to derive psychosocial variables that may improve risk detection. Machine learning methods, particularly classification and regression trees (CART), can also uncover interactions between clinical and psychosocial variables, enabling identification of patient characteristics that modify suicide risk factors. However, integrating structured and unstructured data presents challenges because NLP features often greatly outnumber traditional clinical variables, potentially biasing interaction discovery. In prior work, we addressed this imbalance by introducing a weighted CART framework that balances structured variables with NLP-derived psychosocial features from semantic lexicons (SÉANCE). While effective, semantic approaches summarize language into predefined constructs and may overlook important lexical variation present in clinical narratives.

METHODS: In this study, we extend that framework by replacing semantic features with a high-dimensional bag-of-words (BoW) representation of clinical notes and by evaluating models across cohorts defined by structured suicide risk stratification (low, medium, high) and varying temporal lookback windows. Using a cohort of 27,241 veterans, we analyzed clinical documentation collected up to 30, 90, or 270 days prior to death (or a matched index date for controls), enabling temporally flexible risk modeling. Weighted CART models were trained to balance structured and unstructured features and identify cross-modal interactions between textual and clinical variables.
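
As an illustration only, the two ideas in the methods (a bag-of-words representation of a note, and balancing a small set of structured variables against a much larger set of text features) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not specify the weighting scheme, so the equal-share-per-modality rule below is an assumption, and the function names, vocabulary, note text, and structured indicators are all hypothetical.

```python
import re
from collections import Counter


def bag_of_words(note, vocabulary):
    """Count how often each vocabulary term occurs in one clinical note."""
    tokens = re.findall(r"[a-z']+", note.lower())
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def modality_balanced_weights(n_structured, n_text):
    """Per-feature weights that give each modality half of the total weight.

    Assumption: an equal share per modality is one simple way to offset the
    imbalance in feature counts; the source does not state the actual scheme.
    """
    w_structured = 0.5 / n_structured
    w_text = 0.5 / n_text
    return [w_structured] * n_structured + [w_text] * n_text


# Hypothetical toy data, not drawn from the study.
vocabulary = ["homeless", "pain", "sleep"]
note = "Reports chronic pain; poor sleep. Pain worse at night."
structured = [1, 0]  # two hypothetical binary EHR indicators

text_features = bag_of_words(note, vocabulary)
features = structured + text_features
weights = modality_balanced_weights(len(structured), len(text_features))
```

With weights like these, a tree-growing procedure could sample candidate split variables in proportion to `weights` (for example with `random.choices`), so that the many text features do not crowd out the few structured ones when interactions are searched for.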

RESULTS: When incorporated into generalized linear models, these interactions improved predictive performance, particularly among low- and medium-risk patients, and substantially reduced the performance gap between interpretable and more complex models. Notably, the BoW representation outperformed our prior semantic index-based approach.
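
For orientation, incorporating one cross-modal interaction into a generalized linear model can be written as a logistic regression with a product term. The notation here is an assumption introduced for illustration (the abstract gives no formula): $s$ is a structured variable, $t$ is a text-derived feature, and $y = 1$ denotes the outcome.

```latex
\operatorname{logit} P(y = 1 \mid s, t)
  = \beta_0 + \beta_s\, s + \beta_t\, t + \beta_{st}\, s\, t
```

A nonzero $\beta_{st}$ means the association between the text feature and risk differs across levels of the structured variable, which is the sense in which a patient characteristic modifies a risk factor.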

PMID:42369484 | PMC:PMC13308074 | DOI:10.64898/2026.06.16.26355796
