- Iterative multidisciplinary development and evaluation using synthetic case simulation to optimise a patient-facing SDoH chatbot and its evaluation rubric before clinical deployment.
- Chatbot showed strong interpretation, communication, cultural sensitivity, and adaptive questioning but weaker domain focus, data completeness, and safety, prompting targeted refinements.
- Formative approach recommends external raters, patient stakeholder involvement, repeated scenario testing, and prospective clinical evaluation for validation.
JMIR Form Res. 2026 Jun 30. doi: 10.2196/89837. Online ahead of print.
ABSTRACT
BACKGROUND: Systematic collection of social determinants of health (SDoH) data remains inconsistent across healthcare settings, despite its critical impact on patient outcomes. Large language model (LLM)-powered chatbots offer promise for scalable SDoH data collection, but rigorous, feasible evaluation methods for patient-facing applications are lacking.
OBJECTIVE: To describe an efficient, iterative, multidisciplinary approach for developing and evaluating a patient-facing SDoH chatbot using synthetic data and case simulation, with the goal of optimizing both chatbot performance and evaluation rubric prior to clinical deployment.
METHODS: A 10-criterion evaluation rubric was adapted from established healthcare artificial intelligence (AI) frameworks and applied to 27 synthetic clinical scenarios representing diverse SDoH profiles. Scenarios were role-played by a licensed clinical social worker (RJK), and chatbot-patient interactions were rated by three multidisciplinary experts from the research team: a social worker (RJK), a nurse practitioner (HC), and a physician (AMM). Quantitative analysis used percent agreement and Fleiss’ κ to characterize chatbot performance and rater consensus, with percent agreement selected because of the high prevalence of ceiling effects in several domains. Qualitative analysis synthesized rater feedback to guide iterative refinement of both chatbot prompts and rubric domains.
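The two quantitative measures named in the methods can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function names, the "yes"/"no" labels, and the toy ratings are all hypothetical, and "percent agreement" is taken here, as an assumption, to mean the share of rater judgements that were positive. The toy data also show why κ can be unhelpful under ceiling effects: when nearly every rating is positive, chance-expected agreement is already very high, so a single dissenting rating can push Fleiss' κ below zero.

```python
from collections import Counter


def proportion_positive(ratings, positive="yes"):
    """Share of all rater judgements equal to the positive label."""
    flat = [label for case in ratings for label in case]
    return sum(label == positive for label in flat) / len(flat)


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of cases, each a list of one label per rater.

    Assumes every case was rated by the same number of raters (at least 2).
    """
    n_cases = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for case in ratings for label in case})

    # counts[i][j] = number of raters who assigned category j to case i
    counts = []
    for case in ratings:
        tally = Counter(case)
        counts.append([tally[c] for c in categories])

    # Observed agreement: mean over cases of the per-case pairwise agreement
    per_case = [
        (sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_case) / n_cases

    # Chance agreement: sum of squared overall category proportions
    total = n_cases * n_raters
    p_expected = sum(
        (sum(row[j] for row in counts) / total) ** 2
        for j in range(len(categories))
    )

    if p_expected == 1.0:
        return float("nan")  # kappa is undefined when only one category is used
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical toy data: 4 cases, 3 raters, one dissenting rating
toy_ratings = [
    ["yes", "yes", "yes"],
    ["yes", "yes", "no"],
    ["yes", "yes", "yes"],
    ["yes", "yes", "yes"],
]
toy_share = proportion_positive(toy_ratings)
toy_kappa = fleiss_kappa(toy_ratings)
```

On the toy data, 11 of 12 ratings are positive, yet κ is slightly negative because observed agreement falls just below what chance alone would predict at that prevalence.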
RESULTS: Across 27 simulated cases, the chatbot received high proportions of positive ratings for accurate interpretation (% agreement = 0.98; 95% CI, 0.91-0.99), communication quality and cultural sensitivity (% agreement = 0.99; 95% CI, 0.93-1.00), and appropriately adaptive questioning (% agreement = 0.99; 95% CI, 0.93-1.00). Lower performance was observed in domain focus and completeness (% agreement = 0.51; 95% CI, 0.40-0.61), completeness of data capture (% agreement = 0.59; 95% CI, 0.48-0.69; Fleiss’ κ=0.18), and safety (% agreement = 0.69; 95% CI, 0.58-0.78; Fleiss’ κ=-0.04), prompting targeted adaptations. Qualitative feedback highlighted the importance of distinguishing screening from clinical interviewing capabilities and informed the refinement of the rubric, including clarifying the definition of safety to focus on recognition of physical and mental health emergencies.
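Each proportion above is reported with a 95% confidence interval. The abstract does not state which interval method was used; the sketch below uses the Wilson score interval, a common choice for proportions near 0 or 1, purely as an assumption. The function name and the example counts are hypothetical and are not taken from the study.

```python
import math


def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion (default z gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")

    p_hat = successes / n
    z2 = z * z
    denom = 1 + z2 / n
    center = (p_hat + z2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z2 / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Hypothetical illustration: 41 positive ratings out of 81
# (81 = 27 cases x 3 raters; the count 41 is assumed, not reported)
lower, upper = wilson_interval(41, 81)
```

Unlike the simple normal-approximation interval, the Wilson interval stays within [0, 1] even when nearly all ratings are positive, which matters for domains with ceiling effects.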
CONCLUSIONS: This study describes a formative feasibility approach for iterative refinement of a patient-facing SDoH chatbot and its evaluation rubric using synthetic case simulation. Future work will include independent external raters, patient stakeholders, repeated scenario testing, and prospective clinical evaluation.
PMID:42378322 | DOI:10.2196/89837