Psychosocial Stress in the Chinese Community: Speech Analytics Through Linguistic and Acoustic Fusion Using Machine Learning

AI Summary

Fusion of linguistic and acoustic speech features significantly outperformed models using either feature type alone in detecting psychosocial stress among Chinese family caregivers.
Orthogonalization, which decorrelates acoustic features from linguistic features before fusion, markedly improved classification accuracy compared with non-orthogonalized features.
A linear support vector machine achieved an ROC-AUC of 78.28%, an F1-score of 75.27%, and an accuracy of 73%, demonstrating the viability of speech analytics for early stress detection.

JMIR Biomed Eng. 2026 May 29;11:e91138. doi: 10.2196/91138.

ABSTRACT

BACKGROUND: Family caregivers experience significant stress due to intensive caregiving activities, making them highly susceptible to adverse psychosocial health conditions. Early detection of this stress is crucial for timely interventions to prevent disease progression and long-term disability.

OBJECTIVE: This study aimed to develop and validate the Linguistic and Acoustic Speech Analytics Program, a novel machine learning approach capable of providing a fusion analysis of linguistic and acoustic speech features to enhance the effectiveness of psychosocial stress assessment.

METHODS: This quantitative study analyzed speech data collected from 100 Chinese family caregivers. Participants responded to 12 open-ended questions, and their spoken responses were recorded for linguistic and acoustic feature extraction. Several machine learning classifiers, including support vector machines, were developed to process the speech data. A key methodological step was an orthogonalization procedure that decorrelated acoustic features from linguistic features before fusion analysis. The classifiers were then trained to evaluate psychosocial stress levels from the processed and fused linguistic and acoustic speech features. Model performance was measured using the area under the receiver operating characteristic curve (ROC-AUC), F1-score, and accuracy.
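
The abstract does not specify how the orthogonalization was carried out. One common way to decorrelate one feature block from another is residualization: project each acoustic feature onto the span of the linguistic features and keep only the remainder. The sketch below illustrates that idea under this assumption; the function names (`center`, `orthonormal_basis`, `residualize`) and the toy numbers are hypothetical and are not taken from the study.

```python
# Illustrative sketch only: residualization is ASSUMED here as the
# orthogonalization step; the study's actual procedure is not described.
# Features are stored column-wise: one list of values per feature.

def center(col):
    """Subtract the column mean so that dot products act like covariances."""
    mean = sum(col) / len(col)
    return [x - mean for x in col]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormal_basis(columns, tol=1e-12):
    """Gram-Schmidt on centered columns; near-collinear columns are dropped."""
    basis = []
    for col in columns:
        v = center(col)
        for q in basis:
            c = dot(v, q)
            v = [a - c * b for a, b in zip(v, q)]
        norm = dot(v, v) ** 0.5
        if norm > tol:
            basis.append([a / norm for a in v])
    return basis

def residualize(acoustic_cols, linguistic_cols):
    """Remove from each acoustic column its projection onto the linguistic span."""
    basis = orthonormal_basis(linguistic_cols)
    residuals = []
    for col in acoustic_cols:
        v = center(col)
        for q in basis:
            c = dot(v, q)
            v = [a - c * b for a, b in zip(v, q)]
        residuals.append(v)
    return residuals

# Toy data (hypothetical): 5 speakers, 2 linguistic and 1 acoustic feature.
linguistic = [[1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 1.0, 4.0, 3.0, 6.0]]
acoustic = [[1.1, 2.3, 2.9, 4.2, 5.1]]

residual = residualize(acoustic, linguistic)

# Fused feature set: centered linguistic columns plus decorrelated acoustic columns.
fused = [center(col) for col in linguistic] + residual
```

After this step, each residual acoustic column has zero sample covariance with every linguistic column, so a downstream classifier sees acoustic information that the linguistic features do not already explain linearly. In a real pipeline the projection would be fitted on the training folds only and then applied to held-out data.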

RESULTS: The linear support vector machine model performed best, achieving an area under the receiver operating characteristic curve (ROC-AUC) of 78.28%, an F1-score of 75.27%, and an accuracy of 73%. These results indicate that the model can identify stressed participants from their speech. Critically, the fusion of linguistic and acoustic features significantly outperformed models using either feature type alone. Furthermore, the orthogonalization procedure proved essential: decorrelating features before fusion markedly enhanced classification accuracy compared with using non-orthogonalized features.
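
For readers less familiar with the three reported metrics, the following minimal sketch shows how each is computed for a binary classifier. The labels and scores are invented toy values, not study data, and the 0.5 decision threshold is an assumption made for illustration; a support vector machine would typically supply its decision-function values as the scores.

```python
# Illustrative sketch with toy values; none of these numbers come from the study.

def roc_auc(y_true, scores):
    """AUC as the probability that a positive case outranks a negative one
    (ties count as half), which equals the area under the ROC curve."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def f1_and_accuracy(y_true, y_pred):
    """F1 = 2TP / (2TP + FP + FN); accuracy = (TP + TN) / N."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    accuracy = (tp + tn) / len(y_true)
    return f1, accuracy

# Toy example: 1 = stressed, 0 = not stressed (hypothetical labels and scores).
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

auc = roc_auc(y_true, scores)
f1, accuracy = f1_and_accuracy(y_true, y_pred)
```

AUC is computed from the continuous scores and is therefore threshold-free, whereas F1-score and accuracy depend on the chosen cutoff, which is one reason the three figures can differ for the same model.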

CONCLUSIONS: This study demonstrates that fusion analysis of linguistic and acoustic features effectively identifies psychosocial stress among family caregivers. It also emphasizes the importance of proper feature processing when combining multiple features extracted from the same audio sample. These findings provide valuable insights for developing machine learning models for psychosocial stress assessment and addressing various psychosocial conditions in different contexts, supporting population mental health management.

PMID:42214077 | DOI:10.2196/91138
