Differences in Safety Risks Across Languages in Health-Relevant Queries: Vulnerability Analysis of Large Language Model Responses

AI Summary

LLM jailbreak susceptibility differs by language, with Hindi exhibiting the highest success rates for emoji and permutation cipher attacks.
Attack success varied by harm category, with violence-related prompts more susceptible than drug-related or self-harm prompts.
Findings call for strengthened multilingual content moderation and language-aware safety mechanisms to ensure equitable protection in health AI.

JMIR Form Res. 2026 May 26;10:e87465. doi: 10.2196/87465.

ABSTRACT

BACKGROUND: Large language models (LLMs) such as ChatGPT are increasingly used to support health-related queries and decision-making. However, these models can be “jailbroken” through adversarial prompts that bypass safety filters and elicit harmful or medically inappropriate responses. In health care contexts, such vulnerabilities pose serious risks. Understanding how jailbreak susceptibility varies across languages is essential for developing robust safeguards and promoting equitable access to safe health information. This paper may contain examples that may be deemed harmful in terms of violence, self-harm, and drug abuse.

OBJECTIVE: This study aims to systematically compare the vulnerability of a health LLM to jailbreaking across 3 languages (English, Spanish, and Hindi transliterated into the Latin alphabet) under emoji and permutation cipher attacks.

METHODS: We analyzed 1000 input prompts per language, drawn from the BeaverTails dataset, across 3 harm categories: self-harm, violence, and drug abuse. Each prompt was modified using emoji and permutation cipher techniques, resulting in 6000 input-output pairs. Model responses were evaluated by human coders to determine the success rate of jailbreak attempts across languages and cipher types.
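The 6000 figure is consistent with a fully crossed design in which each of the 1000 prompts per language is encoded once with each of the 2 ciphers. The abstract does not spell this out, so the short Python sketch below treats it as an assumption; the variable names are illustrative and not taken from the paper.

```python
from itertools import product

languages = ["English", "Spanish", "Hindi"]
ciphers = ["emoji", "permutation"]
prompts_per_language = 1000  # as reported in METHODS

# Assumption (not stated in the abstract): every prompt is encoded
# once with each cipher, giving one input-output pair per combination.
cells = {
    (language, cipher): prompts_per_language
    for language, cipher in product(languages, ciphers)
}

total_pairs = sum(cells.values())
print(total_pairs)  # 6000
```

Under this reading, each language-by-cipher cell holds 1000 pairs, which is the denominator needed to turn the reported jailbreak counts into rates.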

RESULTS: Hindi prompts showed the highest vulnerability, with 787 successful jailbreaks using emoji ciphers and 873 using permutation ciphers. Spanish and English followed, with lower success rates across both cipher types. Differences in jailbreak success across languages and cipher strategies were statistically significant. Additionally, attacks targeting violence-related prompts were more successful overall than those targeting drug-related or self-harm content, indicating variation in vulnerability by harm type.
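To read the Hindi counts as rates, the sketch below converts them to proportions and attaches 95% Wilson score intervals. It assumes a denominator of 1000 prompts per cipher for Hindi, which is inferred from the study design rather than stated alongside the counts. The function `wilson_interval` is a hypothetical helper written for this illustration, and the abstract does not say which statistical test the authors used, so this is not a reproduction of their analysis.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# Successful jailbreak counts for Hindi, as reported in RESULTS.
hindi_successes = {"emoji": 787, "permutation": 873}
n_prompts = 1000  # assumption: 1000 Hindi prompts per cipher

intervals = {}
for cipher, count in hindi_successes.items():
    low, high = wilson_interval(count, n_prompts)
    intervals[cipher] = (count / n_prompts, low, high)
    print(f"{cipher}: {count / n_prompts:.1%} (95% CI {low:.1%} to {high:.1%})")
```

The abstract does not report counts for English or Spanish, so the same calculation cannot be repeated for those languages from this text alone.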

CONCLUSIONS: The findings of this formative study reveal that LLM safety performance varies substantially across languages and harm categories, raising concerns about equitable protection in multilingual health communication. Disparities in access to harmful content may contribute to downstream health risks. Strengthening multilingual content moderation and developing language-aware safety mechanisms are critical steps toward creating safer and more inclusive health AI systems.

PMID:42190643 | DOI:10.2196/87465
