Welcome to Psychiatryai.com: Latest Evidence - RAISR4D

Large Language Models and Text Embeddings for Detecting Depression and Suicide in Patient Narratives

JAMA Netw Open. 2025 May 1;8(5):e2511922. doi: 10.1001/jamanetworkopen.2025.11922.

ABSTRACT

IMPORTANCE: Large language models (LLMs) and text-embedding models have shown potential in assessing mental health risks based on narrative data from psychiatric patients.

OBJECTIVE: To assess whether LLMs and text-embedding models can identify depression and suicide risk based on sentence completion test (SCT) narratives of psychiatric patients.

DESIGN, SETTING, AND PARTICIPANTS: This cross-sectional study, conducted at Seoul Metropolitan Government-Seoul National University Boramae Medical Center, analyzed SCT data collected from April 1, 2016, to September 30, 2021. Participants included psychiatric patients aged 18 to 39 years who completed SCT and self-assessments for depression (Beck Depression Inventory-II or Zung Self-Rating Depression Scale) and/or suicide (Beck Scale for Suicidal Ideation). Patients confirmed to have an IQ below 70 were excluded, leaving 1064 eligible SCT datasets (52 627 completed responses). Data processing with LLMs (GPT-4o, May 13, 2024, version, OpenAI [hereafter, LLM1]; gemini-1.0-pro, February 2024 version, Google DeepMind [hereafter, LLM2]; and GPT-3.5-turbo-16k, January 25, 2024, version, OpenAI) and text-embedding models (text-embedding-3-large, OpenAI [hereafter, text-embedding 1]; text-embedding-3-small, OpenAI; and text-embedding-ada-002, OpenAI) was performed between July 4 and September 30, 2024.
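For readers unfamiliar with zero-shot LLM screening, the approach can be sketched as follows. This is an illustrative prompt-construction example only; the wording, the `build_zero_shot_prompt` helper, and the sample responses are assumptions, not the study's actual prompt or data.

```python
# Hedged sketch: assembling a zero-shot prompt that asks an LLM (e.g., GPT-4o)
# to screen SCT narratives for depression. The prompt text is hypothetical;
# the study's real prompt is not reproduced in the abstract.
def build_zero_shot_prompt(narrative: str) -> str:
    """Wrap concatenated sentence-completion responses in a screening query."""
    return (
        "Based on the following sentence completion test responses, "
        "answer 'yes' or 'no': does the writer show signs of depression?\n\n"
        f"Responses:\n{narrative}"
    )

# Toy narrative standing in for a patient's concatenated self-concept responses.
sample = "I feel... tired most days. My greatest fear is... being alone."
prompt = build_zero_shot_prompt(sample)
print(prompt)
```

In a few-shot variant, a handful of labeled example narratives would be prepended to the same prompt before the target narrative.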

MAIN OUTCOMES AND MEASURES: Outcomes included the performance of LLMs and text-embedding models in detecting depression and suicide, as measured by the area under the receiver operating characteristic curve (AUROC), balanced accuracy, and macro F1-score. Performance was evaluated across concatenated narratives of SCT, including self-concept, family, gender perception, and interpersonal relations narratives.
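The three reported metrics can be computed with standard tooling. Below is a minimal sketch using scikit-learn on toy labels and scores; the numbers are illustrative, not from the study.

```python
# Hedged sketch: AUROC, balanced accuracy, and macro F1-score on synthetic
# predictions, mirroring the outcome measures named in the abstract.
from sklearn.metrics import roc_auc_score, balanced_accuracy_score, f1_score

y_true = [0, 0, 0, 1, 1, 1, 0, 1]                     # toy depression labels
y_score = [0.2, 0.4, 0.6, 0.8, 0.7, 0.3, 0.1, 0.9]    # toy model risk scores
y_pred = [1 if s >= 0.5 else 0 for s in y_score]      # thresholded at 0.5

auroc = roc_auc_score(y_true, y_score)                # rank-based, threshold-free
bal_acc = balanced_accuracy_score(y_true, y_pred)     # mean of sensitivity/specificity
macro_f1 = f1_score(y_true, y_pred, average="macro")  # unweighted mean of per-class F1
print(f"AUROC={auroc:.3f} balanced_acc={bal_acc:.3f} macro_F1={macro_f1:.3f}")
```

AUROC uses the continuous scores directly, whereas balanced accuracy and macro F1 require a classification threshold; balanced accuracy and macro averaging both guard against inflated results on class-imbalanced data such as suicide-risk labels.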

RESULTS: Based on SCT narratives from 1064 patients (mean [SD] age, 25.4 [5.5] years; 673 men [63.3%]), LLM1 showed strong performance in zero-shot learning, with an AUROC of 0.720 (95% CI, 0.689-0.752) for depression and 0.731 (95% CI, 0.704-0.762) for suicide risk using self-concept narratives. Few-shot learning for depression further improved the performance of LLM1 (AUROC, 0.754 [95% CI, 0.721-0.784]) and LLM2 (AUROC, 0.736 [95% CI, 0.704-0.770]). The text-embedding 1 model paired with extreme gradient boosting outperformed other models, achieving an AUROC of 0.841 (95% CI, 0.783-0.897) for depression and 0.724 (95% CI, 0.650-0.795) for suicide risk. Overall, self-concept narratives yielded the most accurate detection across all models.
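The best-performing pipeline pairs text embeddings with extreme gradient boosting. A minimal sketch of that shape is below, with two stated assumptions: random vectors stand in for text-embedding-3-large output (the real model returns 3072-dimensional vectors via the OpenAI API), and scikit-learn's `GradientBoostingClassifier` stands in for XGBoost; the labels are synthetic.

```python
# Hedged sketch: narrative embeddings -> gradient-boosted classifier,
# mirroring the embedding-plus-XGBoost pipeline described in the results.
# All data here are synthetic stand-ins, not study data.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_patients, dim = 200, 64                  # toy 64-dim vectors, not the real 3072
X = rng.normal(size=(n_patients, dim))     # stand-in narrative embeddings
y = rng.integers(0, 2, size=n_patients)    # stand-in depression labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]      # continuous risk scores, usable for AUROC
print(proba.shape)
```

In the real pipeline, each row of `X` would be the embedding of one patient's concatenated SCT narrative (e.g., the self-concept section), and `proba` would be evaluated against clinical scale cutoffs via AUROC.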

CONCLUSIONS AND RELEVANCE: This cross-sectional study of SCT narratives from psychiatric patients suggests that LLMs and text-embedding models may effectively detect depression and suicide risk, particularly using self-concept narratives. Although these models demonstrated potential for detecting mental health risks, further improvements in performance and safety are essential before clinical application.

PMID:40408109 | DOI:10.1001/jamanetworkopen.2025.11922
