Welcome to Psychiatryai.com: Latest Evidence - RAISR4D

Leveraging simulation to provide a practical framework for estimating the novel scope of risk of large language models in healthcare

AI Summary
  • Simulation links specific LLM failure modes to structured pathways to harm, enabling quantification of context-dependent risks for LLM-SaMDs.
  • Estimated P1 and P2 probabilities spanned four orders of magnitude, reflecting variability in model safety across tasks and model sizes.
  • Simulation using synthetic, clinician-reviewed datasets provides practical, scalable risk estimation to support regulation, deployment and context-specific risk mitigation.

BMJ Ment Health. 2026 Jun 24;29(1):e302626. doi: 10.1136/bmjment-2026-302626.

ABSTRACT

BACKGROUND: Large language models (LLMs) are rapidly entering clinical and consumer use, yet their probabilistic outputs have included a variety of unsafe responses to users. Difficulties in quantifying and mitigating the risks posed by LLMs threaten to stall regulatory evaluation and clinical deployment of LLM-based software as a medical device (LLM-SaMD). Practical approaches are needed to extend existing medical-device regulations to LLM-SaMDs.

OBJECTIVE: To demonstrate how simulation can extend existing medical-device risk management frameworks to address LLM-SaMD-specific risks.

METHODS: We implemented a simulation-based methodology for estimating LLM-SaMD risk. Fourteen open-source models were evaluated on three safety-classification tasks: detection of suicidal ideation, therapy requests and therapy-like interactions. Synthetic datasets were generated by Gemini 2.5 Pro and evaluated by psychiatrists. Model false-negative rates informed estimates of P1, the likelihood that a hazard progresses to a hazardous situation, and P2, the likelihood that the hazardous situation results in harm.
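
The two quantities named in the methods can be illustrated with a minimal Python sketch. It assumes binary labels (1 = unsafe content present, 0 = absent) and assumes the common medical-device risk-management convention that the probability of harm is the product P1 × P2. The abstract does not state how false-negative rates were mapped onto P1 and P2, so that step is deliberately left out, and the function names and example data are hypothetical.

```python
def false_negative_rate(labels, predictions):
    """Fraction of truly positive cases (label 1) that the classifier missed."""
    positives = [(y, p) for y, p in zip(labels, predictions) if y == 1]
    if not positives:
        raise ValueError("need at least one positive case")
    missed = sum(1 for _, p in positives if p == 0)
    return missed / len(positives)


def probability_of_harm(p1, p2):
    """Assumed convention: harm requires the hazardous situation to arise (P1)
    and then to lead to harm (P2), so the two probabilities are multiplied."""
    return p1 * p2


# Hypothetical toy data, not taken from the study.
labels = [1, 1, 1, 1, 0, 0]
predictions = [1, 0, 1, 1, 0, 1]
fnr = false_negative_rate(labels, predictions)  # one of four positives missed
```

In this toy data the classifier misses one of four positive cases, giving a false-negative rate of 0.25. Any real estimate would also need exposure and context assumptions that the abstract does not spell out.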

FINDINGS: LLM success at generating synthetic datasets varied by task, with strong performance for neutral and non-therapeutic content but frequent errors in suicidal-ideation and therapy-like interactions. Performance generally improved with model size. Estimated P1 values ranged from 1.1×10⁻⁸ to 1.6×10⁻⁴ and P2 from 4.9×10⁻⁵ to 5.1×10⁻³, spanning four orders of magnitude.
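
As a quick arithmetic check on the reported ranges, the span of each interval in orders of magnitude is the base-10 logarithm of the ratio of its endpoints. The sketch below uses only the endpoint values quoted above; the helper name is hypothetical.

```python
import math


def orders_of_magnitude(low, high):
    """Width of the interval [low, high] in powers of ten."""
    return math.log10(high / low)


p1_span = orders_of_magnitude(1.1e-8, 1.6e-4)  # reported P1 endpoints
p2_span = orders_of_magnitude(4.9e-5, 5.1e-3)  # reported P2 endpoints
```

On these endpoints the P1 range covers a little over four orders of magnitude and the P2 range about two.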

CONCLUSIONS: By linking model failure modes to structured pathways to harm, simulation can extend existing medical-device risk frameworks to help address the probabilistic and context-dependent risks of LLM-SaMDs.

CLINICAL IMPLICATIONS: Simulation-based risk estimation offers a practical way to characterise the risk landscape for specific combinations of LLM-SaMD, patient population and clinical context.

PMID:42342371 | DOI:10.1136/bmjment-2026-302626
