Leveraging simulation to provide a practical framework for estimating the novel scope of risk of large language models in healthcare

AI Summary

Simulation links specific LLM failure modes to structured pathways to harm, enabling quantification of context-dependent risks for LLM-SaMDs.
Estimated P₁ and P₂ probabilities spanned four orders of magnitude, reflecting variability in model safety across tasks and model sizes.
Simulation using synthetic, clinician-reviewed datasets provides practical, scalable risk estimation to support regulation, deployment and context-specific risk mitigation.

BMJ Ment Health. 2026 Jun 24;29(1):e302626. doi: 10.1136/bmjment-2026-302626.

ABSTRACT

BACKGROUND: Large language models (LLMs) are rapidly entering clinical and consumer use, yet their probabilistic outputs have produced a variety of unsafe responses to users. Difficulties in quantifying and mitigating the risks posed by LLMs threaten to stall regulatory evaluation and clinical deployment of LLM-based software as a medical device (LLM-SaMD). Practical approaches are needed to extend existing medical-device regulations to LLM-SaMDs.

OBJECTIVE: To demonstrate how simulation can extend existing medical-device risk management frameworks to address LLM-SaMD-specific risks.

METHODS: We implement a simulation-based methodology for estimating LLM-SaMD risk. Fourteen open-source models were evaluated on three safety-classification tasks: suicidal-ideation, therapy-request and therapy-like interaction detection. Synthetic datasets were generated by Gemini 2.5 Pro and evaluated by psychiatrists. Model false-negative rates informed estimates of P₁, the likelihood that a hazard progresses to a hazardous situation, and P₂, the likelihood that the hazardous situation results in harm.
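
As a minimal sketch of the false-negative-rate step, assuming a binary safety classifier and hypothetical counts, the rate and a Wilson score interval could be computed as below. The abstract does not report confusion matrices or how the rates map onto P₁ and P₂, so the function name `false_negative_rate` and the counts are illustrative assumptions only.

```python
import math


def false_negative_rate(fn: int, tp: int, z: float = 1.96):
    """Return (FNR, lower, upper) using a Wilson score interval.

    fn: unsafe messages the model missed (hypothetical count).
    tp: unsafe messages the model flagged (hypothetical count).
    z:  normal quantile; 1.96 gives an approximate 95% interval.
    """
    n = fn + tp
    if n == 0:
        raise ValueError("need at least one positive example")
    p = fn / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the boundaries.
    return p, max(0.0, centre - half), min(1.0, centre + half)


# Illustrative counts, not taken from the study.
fnr, lo, hi = false_negative_rate(fn=3, tp=197)
```

The Wilson interval is used here because it stays informative when few or no misses are observed, which is the regime that matters for rare safety failures.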

FINDINGS: LLM success at generating synthetic datasets varied by task, with strong performance for neutral and non-therapeutic content but frequent errors in suicidal-ideation and therapy-like interactions. Performance generally improved with model size. Estimated P₁ values ranged from 1.1×10⁻⁸ to 1.6×10⁻⁴ and P₂ from 4.9×10⁻⁵ to 5.1×10⁻³, spanning four orders of magnitude.
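
To make the reported ranges concrete, the sketch below computes how many orders of magnitude each range covers and, as an assumption, combines the extremes using the convention common in ISO 14971-style risk analysis that the probability of harm is P₁ × P₂. The abstract does not state which P₁ values pair with which P₂ values, so the products are illustrative bounds rather than study results.

```python
import math

# Ranges reported in the abstract.
P1_MIN, P1_MAX = 1.1e-8, 1.6e-4
P2_MIN, P2_MAX = 4.9e-5, 5.1e-3


def orders_of_magnitude(low: float, high: float) -> float:
    """Width of the range [low, high] in powers of ten."""
    return math.log10(high / low)


p1_span = orders_of_magnitude(P1_MIN, P1_MAX)
p2_span = orders_of_magnitude(P2_MIN, P2_MAX)

# Assumption: probability of harm = P1 * P2. Pairing the minima and the
# maxima is illustrative only; the abstract gives no per-model pairing.
harm_low = P1_MIN * P2_MIN
harm_high = P1_MAX * P2_MAX

print(f"P1 span: {p1_span:.1f} orders, P2 span: {p2_span:.1f} orders")
print(f"Illustrative P(harm) bounds: {harm_low:.2e} to {harm_high:.2e}")
```

Under these assumptions the P₁ range covers roughly 4.2 orders of magnitude and the P₂ range roughly 2.0.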

CONCLUSIONS: By linking model failure modes to structured pathways to harm, simulation can extend existing medical-device risk frameworks to help address the probabilistic and context-dependent risks of LLM-SaMDs.

CLINICAL IMPLICATIONS: Simulation-based risk estimation offers a practical way to characterise the risk landscape for specific LLM-SaMD, patient population and clinical context combinations.

PMID:42342371 | DOI:10.1136/bmjment-2026-302626
