Validation of a Small Language Model for DSM-5 Substance Category Classification in Child Welfare Records – Psychiatry AI: Real-Time AI Scoping Review

AI Summary

A small, locally hosted 20-billion-parameter LLM reliably classifies DSM-5 substance types from child welfare narratives, extending binary detection to multi-label substance identification.
Five categories achieved almost perfect agreement (κ = 0.94 to 1.00) with precision 92% to 100%: alcohol, cannabis, opioid, stimulant, sedative/hypnotic/anxiolytic.
Two low-prevalence categories (hallucinogen, inhalant) performed poorly. The pipeline runs entirely on local hardware and requires no changes to existing data collection, supporting substance-specific surveillance and service alignment.

J Evid Based Soc Work (2019). 2026 May 14:1-14. doi: 10.1080/26408066.2026.2673377. Online ahead of print.

ABSTRACT

BACKGROUND: Recent studies have demonstrated that large language models (LLMs) can perform binary classification tasks on child welfare narratives, detecting the presence or absence of constructs such as substance-related problems, domestic violence, and firearms involvement. However, whether smaller locally deployable models can move beyond binary detection to classify specific substance types from these narratives remains untested.

OBJECTIVE: Validate a locally hosted LLM classifier for identifying specific substance types aligned with DSM-5 categories in child welfare investigation narratives.

METHODS: A locally hosted 20-billion-parameter LLM classified child maltreatment investigation narratives from a Midwestern U.S. state. Records previously identified as containing substance-related problems were passed to a second classification stage targeting seven DSM-5 substance categories. Expert human review of 900 stratified cases assessed classification precision, recall, and agreement with the criterion standard (Cohen’s kappa). Model reproducibility was evaluated using approximately 15,000 independently classified records.
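
For readers unfamiliar with the validation metrics named above, the following is a minimal sketch of how precision, recall, and Cohen's kappa could be computed for a single substance category from paired human and model labels. The function name `binary_metrics` and the 0/1 label encoding are assumptions made for illustration; the abstract does not describe the authors' actual analysis code or any weighting applied to the stratified sample.

```python
def binary_metrics(human, model):
    """Precision, recall, and Cohen's kappa for one substance category.

    human, model: equal-length sequences of 0/1 labels, where 1 means the
    category was judged present. The human label is treated as the
    criterion standard. (Hypothetical helper, not the study's code.)
    """
    if len(human) != len(model) or len(human) == 0:
        raise ValueError("label sequences must be non-empty and equal length")

    n = len(human)
    pairs = list(zip(human, model))
    tp = sum(1 for h, m in pairs if h == 1 and m == 1)  # both say present
    fp = sum(1 for h, m in pairs if h == 0 and m == 1)  # model only
    fn = sum(1 for h, m in pairs if h == 1 and m == 0)  # human only
    tn = sum(1 for h, m in pairs if h == 0 and m == 0)  # both say absent

    nan = float("nan")
    precision = tp / (tp + fp) if (tp + fp) else nan
    recall = tp / (tp + fn) if (tp + fn) else nan

    # Observed agreement and agreement expected by chance from the marginals.
    p_observed = (tp + tn) / n
    p_expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (p_observed - p_expected) / (1 - p_expected) if p_expected != 1 else nan

    return {"precision": precision, "recall": recall, "kappa": kappa}
```

In a multi-label setting such as this one, a function like this would be applied once per category, since each record can carry several substance labels at the same time.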

RESULTS: Five substance categories achieved almost perfect criterion-standard agreement (κ = 0.94-1.00): alcohol, cannabis, opioid, stimulant, and sedative/hypnotic/anxiolytic. Classification precision ranged from 92% to 100% for these categories. Two low-prevalence categories (hallucinogen, inhalant) performed poorly. Run-to-run agreement ranged from 92.1% to 99.1% across the seven categories.
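
The run-to-run agreement figures can be illustrated with a short sketch. It assumes agreement means simple per-category percent agreement between two independent classification runs over the same records; the abstract does not state the exact formula, and the names `CATEGORIES` and `run_to_run_agreement` and the dictionary layout are hypothetical.

```python
# The seven DSM-5 substance categories named in the abstract.
CATEGORIES = [
    "alcohol",
    "cannabis",
    "opioid",
    "stimulant",
    "sedative_hypnotic_anxiolytic",
    "hallucinogen",
    "inhalant",
]


def run_to_run_agreement(run_a, run_b, categories=CATEGORIES):
    """Percent of shared records on which two runs agree, per category.

    run_a, run_b: dicts mapping a record ID to the set of category names
    the model assigned in that run. Only records present in both runs are
    compared. (Assumed data layout, for illustration only.)
    """
    shared = run_a.keys() & run_b.keys()
    if not shared:
        raise ValueError("the two runs have no records in common")

    agreement = {}
    for category in categories:
        # A record agrees when both runs flag the category or both omit it.
        matches = sum(
            (category in run_a[rid]) == (category in run_b[rid]) for rid in shared
        )
        agreement[category] = 100.0 * matches / len(shared)
    return agreement
```

Percent agreement of this kind does not correct for chance, so it tends to look high for rare categories where both runs mostly output "absent".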

CONCLUSIONS: A small, locally hosted LLM can reliably classify substance types from child welfare administrative text, extending prior work on binary classification to multi-label substance identification. Operating entirely on local hardware and requiring no changes to existing data collection, the pipeline supports substance-specific surveillance, retrospective trend analysis, and improved alignment of services with the substance profiles of investigated families.

PMID:42133549 | DOI:10.1080/26408066.2026.2673377
