Real-World Generalizability of Alzheimer’s Volumetric MRI Machine-Learning Models: External Validation with British Data

AI Summary

In external validation on UK SLaM-BRC data, the 'CN versus AD' model performed similarly at 1.5T and 3.0T: 87.1% of 93 subjects received the same class assignment.
Normalizing volumes to the estimated total intracranial volume narrowed the gap between internal and external performance, but the normalized 'CN versus AD' model's balanced accuracy still fell from 87.7% to 81.5% in SLaM-BRC because non-AD subjects with hippocampal atrophy were misclassified.
The non-normalized 'CN versus MCI versus AD' model kept a similar balanced accuracy externally (55.3% internally versus 54.6% in SLaM-BRC), which the authors interpret as robustness despite dataset heterogeneity.

Basic summary

Clin Neuroradiol. 2026 Jun 24. doi: 10.1007/s00062-026-01688-8. Online ahead of print.

ABSTRACT

PURPOSE: Assessing generalizability and performance of machine learning models in clinical settings is crucial. In this study, we aimed to test our models’ performance on an external validation real-world clinical dataset and to evaluate the impact of magnetic field strength and brain volume normalization on the classification.

METHODS: We validated two previously published models trained on public datasets (Alzheimer’s disease [AD], mild cognitive impairment [MCI] and cognitively normal [CN] subjects) on a real-world clinical dataset from UK memory clinics (SLaM-BRC: 255 non-AD [subjects without cognitive complaints], 281 MCI and 711 AD).

RESULTS: Our ‘CN vs. AD’ model showed similar performance when tested at different magnetic fields (1.5T vs. 3.0T: 87.1% of 93 subjects had the same class assignment). The volume-normalized ‘CN vs. AD’ model (87.7% balanced accuracy [BAC]) led to decreased performance in the SLaM-BRC (81.5% BAC) due to misclassifications of non-AD subjects with evidence of hippocampal atrophy. The non-normalized ‘CN vs. MCI vs. AD’ model’s performance (initially 55.3% BAC) remained similar in the SLaM-BRC cohort (BAC: SLaM-BRC = 54.6%).
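The two metrics reported above can be illustrated with a minimal sketch. It assumes balanced accuracy is the unweighted mean of per-class recall (the usual definition; the abstract does not spell it out) and that agreement across field strengths is the fraction of subjects given the same label on both scans. The function names and the toy labels are hypothetical and are not study data.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall over the classes in y_true."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == pred:
            hits[truth] += 1
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def concordance(labels_a, labels_b):
    """Fraction of subjects assigned the same class on two scans."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both label lists must cover the same subjects")
    same = sum(a == b for a, b in zip(labels_a, labels_b))
    return same / len(labels_a)


# Toy labels (hypothetical): 4 CN and 2 AD subjects, one CN misclassified.
y_true = ["CN", "CN", "CN", "CN", "AD", "AD"]
y_pred = ["CN", "CN", "CN", "AD", "AD", "AD"]
bac = balanced_accuracy(y_true, y_pred)  # (3/4 + 2/2) / 2

# Toy paired predictions (hypothetical) for the same subjects at 1.5T and 3.0T.
pred_15t = ["CN", "AD", "AD", "CN"]
pred_30t = ["CN", "AD", "CN", "CN"]
agreement = concordance(pred_15t, pred_30t)  # 3 of 4 labels match
```

Because each class contributes equally regardless of its size, balanced accuracy is less sensitive than plain accuracy to the class imbalance of a cohort such as SLaM-BRC (255 non-AD, 281 MCI, 711 AD).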

CONCLUSION: Volumes normalized to the estimated total intracranial volume led to a smaller difference in performance between internal and external datasets than non-normalized volumes. Our ‘CN vs. MCI vs. AD’ model performance remained the same, denoting robustness. These findings suggest that dataset and disease/diagnostic heterogeneities, magnetic field, and brain volume normalization may affect models’ performance.
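As an illustration of volume normalization, the sketch below divides each regional volume by the estimated total intracranial volume (eTIV). This proportional method is an assumption: the abstract does not state which normalization scheme the authors used, and alternatives such as residual (regression-based) correction exist. The region names and volumes are hypothetical.

```python
def normalize_to_etiv(volumes_mm3, etiv_mm3):
    """Return each regional volume as a fraction of eTIV (proportional method)."""
    if etiv_mm3 <= 0:
        raise ValueError("eTIV must be positive")
    return {region: vol / etiv_mm3 for region, vol in volumes_mm3.items()}


# Hypothetical subject: regional volumes and eTIV in cubic millimetres.
subject_volumes = {"hippocampus_left": 3500.0, "hippocampus_right": 3600.0}
normalized = normalize_to_etiv(subject_volumes, etiv_mm3=1_500_000.0)
```

Dividing by eTIV removes much of the variation that reflects head size alone, so a classifier sees relative rather than absolute volumes.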

PMID:42340466 | DOI:10.1007/s00062-026-01688-8
