Assessing large language model responses to pediatric depression FAQs: a cross-sectional study on readability, accuracy, and sentiment

AI Summary
  • Large language models vary significantly in readability, factual accuracy, and completeness when answering pediatric depression FAQs.
  • DeepSeek 3.1V had highest readability; Microsoft Copilot GPT-5 had highest accuracy; ChatGPT-5 provided the most comprehensive coverage.
  • AI chatbot outputs require human interpretation and oversight before use in pediatric mental health education or guidance.

Front Psychiatry. 2026 Apr 29;17:1782288. doi: 10.3389/fpsyt.2026.1782288. eCollection 2026.

ABSTRACT

BACKGROUND: Pediatric depression shows age-specific symptoms that hinder recognition and delay care, while parents and adolescents increasingly turn to online sources, including large language models, for mental health information and guidance. The quality of such information depends on readability, factual accuracy, completeness, and emotional tone. This study compared responses from three contemporary large language models (LLMs) to frequently asked questions about pediatric depression to assess their suitability as informational tools.

METHODS: A cross-sectional analytical design was used. Fifteen standardized frequently asked questions covering the definition, causes, clinical features, diagnosis, prevention, treatment, and prognosis of pediatric depression were submitted to ChatGPT-5, Microsoft Copilot GPT-5 in Smart Research mode, and DeepSeek 3.1V, and responses were collected verbatim. Readability was assessed using seven established indices. Accuracy and completeness were independently scored on a 0 to 6 scale using a predefined rubric, and emotional tone was quantified with sentiment scores. One-way analysis of variance (ANOVA) with Tukey post hoc tests was used to compare models.
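
The abstract does not report the software used for these analyses. As a rough sketch of how such a pipeline could be assembled, the Python fragment below scores a single response with seven commonly used readability indices and a VADER compound sentiment score, then runs a one-way ANOVA with Tukey post hoc comparisons on per-model Flesch Reading Ease values; the libraries, the particular seven indices, and all numbers are assumptions for illustration, not the authors' actual methods.

```python
# Illustrative sketch only: the abstract does not name its software stack, so the
# libraries, the specific seven indices, and the scores below are assumptions.
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import textstat

# Seven widely used readability indices (the abstract only says "seven established indices").
READABILITY_INDICES = [
    textstat.flesch_reading_ease,
    textstat.flesch_kincaid_grade,
    textstat.gunning_fog,
    textstat.smog_index,
    textstat.coleman_liau_index,
    textstat.automated_readability_index,
    textstat.dale_chall_readability_score,
]

analyzer = SentimentIntensityAnalyzer()

def score_response(text: str) -> dict:
    """Readability indices plus a VADER compound sentiment score for one chatbot response."""
    scores = {fn.__name__: fn(text) for fn in READABILITY_INDICES}
    scores["sentiment"] = analyzer.polarity_scores(text)["compound"]
    return scores

# Hypothetical Flesch Reading Ease scores per model across a subset of the 15 FAQs.
fre = {
    "ChatGPT-5":     [49.8, 50.2, 48.9, 49.5, 50.4],
    "Copilot-5":     [43.1, 44.0, 42.8, 43.6, 44.3],
    "DeepSeek 3.1V": [54.7, 55.1, 53.9, 54.4, 55.3],
}

# One-way ANOVA across the three models, then Tukey HSD for pairwise comparisons.
f_stat, p_value = f_oneway(*fre.values())
groups = [model for model, vals in fre.items() for _ in vals]
values = [v for vals in fre.values() for v in vals]
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.4f}")
print(pairwise_tukeyhsd(endog=values, groups=groups, alpha=0.05).summary())
```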

RESULTS: Readability differed across models. DeepSeek 3.1V achieved the highest Flesch Reading Ease score (54 to 55) and the lowest Flesch-Kincaid Grade Level (about 9.5), indicating the easiest comprehension. ChatGPT-5 showed intermediate readability, with Reading Ease scores of 49 to 50 and a grade level of about 10.5. Copilot-5 had the lowest Reading Ease score (43 to 44) and the highest grade level (near 10.8). Accuracy on the 0 to 6 scale was highest for Copilot-5. ChatGPT-5 showed the greatest completeness, whereas the other models had variable coverage of detailed clinical items.
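
For context on these scores, the standard Flesch formulas are shown below with made-up word, sentence, and syllable counts chosen only to land near the reported ranges. On the Flesch scale, 50 to 60 corresponds to fairly difficult (roughly 10th to 12th grade) text and 30 to 50 to difficult (college level) text.

```python
# Published Flesch formulas; the counts below are invented, chosen only so the
# resulting scores fall near the ranges reported in the abstract.
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# A hypothetical 180-word answer in 10 sentences with 295 syllables:
print(round(flesch_reading_ease(180, 10, 295), 1))   # 49.9  (difficult, near college level)
print(round(flesch_kincaid_grade(180, 10, 295), 1))  # 10.8  (about an 11th-grade reading level)
```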

CONCLUSION: Large language models (LLMs) provide information on pediatric depression but vary in readability, accuracy, and completeness. DeepSeek 3.1V offers the greatest linguistic accessibility, Microsoft Copilot GPT-5 shows the strongest factual consistency, and ChatGPT-5 provides the most comprehensive coverage. Outputs from these artificial intelligence (AI) chatbot systems require human interpretation and oversight before use in pediatric mental health education or guidance.

PMID:42137540 | PMC:PMC13168101 | DOI:10.3389/fpsyt.2026.1782288
