Explainable and Interpretable AI for Voice and Speech Analysis in Clinical Care: A Systematic Review

AI Summary
  • Recent studies applied diverse XAI methods but predominantly used qualitative explanations with limited quantitative validation or external dataset consistency.
  • No studies performed human-in-the-loop evaluations with clinical stakeholders, revealing a major gap in stakeholder alignment and real-world applicability.
  • Future work must develop validated, audio-aware, stakeholder-centred XAI tailored to clinical contexts to support trustworthy deployment and address bias concerns.

J Med Internet Res. 2026 Apr 20. doi: 10.2196/83790. Online ahead of print.

ABSTRACT

BACKGROUND: Driven by recent advances in artificial intelligence, particularly in medicine, audio-based voice and speech biomarkers are increasingly being investigated for a range of medical applications as a complementary, or even alternative, modality to traditional medical devices. The adoption of deep learning techniques in the recent literature is motivated by their superior performance compared with classical machine learning (ML) methods. However, ethical and regulatory concerns about the black-box nature of these models have limited their integration into clinical workflows. Consequently, Explainable AI (XAI) has recently been employed to address this issue by generating explanations for opaque model outputs. Ideally, medical XAI systems provide human-understandable, clinically grounded explanations, which are essential for building trust in AI and, thereby, for facilitating adoption in real-world clinical settings.

OBJECTIVE: We conduct a systematic literature review of XAI methods applied to explain deep learning models in audio-based voice and speech clinical applications. Our aim is to identify which XAI methods have been used to explain the decisions of deep learning voice and speech AI systems in healthcare, and what insights these explanations have yielded. Additionally, we aim to contextualize these findings with respect to clinical applicability and stakeholder relevance. Lastly, we identify opportunities and recommendations for the design of future clinical audio XAI.

METHODS: This review followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. Six electronic databases (IEEE Xplore, ACM Digital Library, Scopus, PubMed, Web of Science, and Nature) were systematically searched for articles published between January 2015 and February 2025. Eligible studies applied explainability or interpretability methods to deep learning models for voice or speech audio in healthcare contexts. Risk of bias was assessed using PROBAST+AI. Results were thematically synthesized across explainability categories, input representations, clinical domains, validation strategies, and stakeholder considerations.

RESULTS: Thirty studies met the inclusion criteria. These studies employed a range of explainability approaches, including gradient-based methods, perturbation-based techniques, surrogate model-based methods, model-internal representation analyses, concept-based detectors, and attention-based explanations. Applications spanned diverse clinical domains, including voice disorders, neurodegenerative diseases, psychiatric conditions, and traumatic brain injury. Overall, results indicate that most studies relied primarily on qualitative interpretation of explainability outputs, with limited quantitative validation of explanation consistency across external datasets. Furthermore, none of the included studies explicitly conducted human-in-the-loop evaluations with relevant stakeholders, highlighting a substantial gap in stakeholder alignment.
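
To make the first of these method families concrete, below is a minimal sketch of a gradient-based explanation (vanilla input saliency) for a spectrogram-based audio classifier, written in PyTorch. The AudioClassifier model, its input shape, and the saliency_map helper are illustrative assumptions for this sketch, not code from any of the reviewed studies.

```python
# Minimal sketch: gradient-based saliency for an audio (spectrogram) classifier.
# AudioClassifier is a hypothetical toy model, not one from the reviewed studies.
import torch
import torch.nn as nn


class AudioClassifier(nn.Module):
    """Toy stand-in for a spectrogram-based clinical voice/speech model."""

    def __init__(self, n_classes: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1),  # 1-channel log-mel input
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(8, n_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


def saliency_map(model: nn.Module, spec: torch.Tensor, target: int) -> torch.Tensor:
    """Gradient of the target-class logit with respect to the input spectrogram.

    Large absolute gradients mark the time-frequency regions whose perturbation
    would most change the prediction -- the simplest gradient-based XAI method.
    """
    model.eval()
    spec = spec.clone().requires_grad_(True)  # track gradients on the input
    logits = model(spec)
    logits[0, target].backward()              # backprop from one class score
    return spec.grad.abs().squeeze()          # (n_mels, n_frames) saliency


if __name__ == "__main__":
    model = AudioClassifier()
    spec = torch.randn(1, 1, 64, 128)         # batch of one log-mel spectrogram
    sal = saliency_map(model, spec, target=1)
    print(sal.shape)                          # torch.Size([64, 128])
```

Perturbation-based methods, by contrast, occlude or modify time-frequency regions of the input and measure the resulting change in the model output, rather than relying on gradients.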

CONCLUSIONS: Current XAI practices in clinical voice and speech analysis are limited by insufficient validation, a lack of domain-specific design, and misalignment with clinical stakeholders' needs. This review highlights opportunities for developing validated, audio-aware, and stakeholder-centered XAI approaches that support trustworthy clinical deployment. Interpretation of these findings should account for limitations related to single-reviewer study selection, a potentially high risk of bias in model development and evaluation, and the repeated use of benchmark datasets across the reviewed studies.

PMID:42084850 | DOI:10.2196/83790
