BMJ Ment Health. 2025 May 11;28(1):e301654. doi: 10.1136/bmjment-2025-301654.
ABSTRACT
BACKGROUND: We previously demonstrated that a large language model could estimate suicide risk using hospital discharge notes.
OBJECTIVE: With the emergence of reasoning models that can be run on consumer-grade hardware, we investigated whether these models can approximate the performance of much larger and costlier models.
METHODS: From 458 053 adults hospitalised at one of two academic medical centres between 4 January 2005 and 2 January 2014, we identified 1995 who died by suicide or accident and matched each with 5 control individuals. We used a DeepSeek-R1-distilled Llama 8B model to generate predictions of risk. Beyond discrimination and calibration, we examined the aspects of model reasoning (that is, the topics in the chain of thought) associated with correct or incorrect predictions.
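The abstract does not include the authors' inference code; the following is a minimal, hypothetical sketch of how a locally run reasoning model of this class might be prompted for a risk estimate from a discharge note. It assumes a local Ollama server with a DeepSeek-R1 distilled 8B model pulled under the tag deepseek-r1:8b; the prompt wording and the helper name estimate_risk are illustrative assumptions, not the study's protocol.

# Illustrative sketch only (not the authors' pipeline): query a locally served
# DeepSeek-R1 distilled Llama 8B model for a risk estimate from a discharge note.
# Assumes an Ollama server is running and "deepseek-r1:8b" has been pulled.
import ollama

def estimate_risk(discharge_note: str) -> str:
    prompt = (
        "You are reviewing a hospital discharge note. "
        "Estimate the patient's risk of death by suicide or accident "
        "on a 0-100 scale and explain your reasoning step by step.\n\n"
        f"Discharge note:\n{discharge_note}"
    )
    response = ollama.chat(
        model="deepseek-r1:8b",  # assumed local model tag
        messages=[{"role": "user", "content": prompt}],
    )
    # The reply contains both the chain of thought and the final estimate.
    return response["message"]["content"]

print(estimate_risk("62-year-old admitted after an intentional overdose..."))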
FINDINGS: The cohort included 1995 individuals who died by suicide or accidental death and 9975 individuals matched 5:1, totalling 11 954 discharges and 58 933 person-years of follow-up. In Fine and Gray regression, hazard as estimated by the Llama3-distilled model was significantly associated with observed risk (unadjusted HR 4.65 (3.58-6.04)). The corresponding c-statistic was 0.64 (0.63-0.65), modestly poorer than that of the GPT-4o model (0.67 (0.66-0.68)). In chain-of-thought reasoning, topics including Substance Abuse, Surgical Procedure, and Age-related Comorbidities were associated with correct predictions, while Fall-related Injury was associated with incorrect predictions.
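The discrimination figures above are c-statistics. As a rough illustration only (the study's association estimates came from Fine and Gray competing-risks regression, which is not shown here), a c-statistic of this kind can be computed from per-discharge risk scores and observed follow-up using lifelines' concordance_index; the arrays below are placeholder values, not study data.

# Illustrative sketch only: computing a concordance (c-) statistic such as the
# 0.64 reported above from model risk scores and observed follow-up.
import numpy as np
from lifelines.utils import concordance_index

follow_up_years = np.array([1.2, 5.0, 3.4, 0.7, 8.1])    # time to event or censoring
died = np.array([1, 0, 0, 1, 0])                          # 1 = suicide/accidental death
model_risk = np.array([0.8, 0.1, 0.3, 0.9, 0.2])          # LLM-derived risk score

# concordance_index expects higher scores to indicate longer survival,
# so the risk score is negated.
c_stat = concordance_index(follow_up_years, -model_risk, event_observed=died)
print(f"c-statistic: {c_stat:.2f}")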
CONCLUSIONS: Application of a reasoning model using local, consumer-grade hardware only modestly diminished performance in stratifying suicide risk.
CLINICAL IMPLICATIONS: Smaller models can yield more secure, scalable and transparent risk prediction.
PMID:40350181 | DOI:10.1136/bmjment-2025-301654