Diagnostic Signal Decay and the Signal Fidelity Index
Why AI Models Fail When They Leave Home—And How to Fix It
Created Dec 18, 2025 - Last updated: Dec 18, 2025
Your AI model performs beautifully in development. You deploy it to a new hospital system. Performance collapses.
This isn’t a bug in your algorithm. It’s a feature of how healthcare data works.
The diagnostic codes your model learned from don’t mean the same thing everywhere. A dementia diagnosis at a major academic medical center—backed by neuroimaging, specialist evaluation, and neuropsychological testing—carries different information than the same ICD-10 code assigned during a rushed emergency department visit at a rural hospital. Same code. Radically different signal.
We call this phenomenon diagnostic signal decay. And in two recent papers, we’ve shown not only how to measure it, but how to fix it.
The Hidden Crisis in Clinical AI
Machine learning models in healthcare increasingly rely on administrative and electronic health record data—particularly diagnostic codes—to define phenotypes, identify at-risk cohorts, and predict outcomes. The computational capacity to detect complex patterns has grown enormously. But the fidelity of the clinical signals embedded in these data has received far less scrutiny.
Nowhere is this gap more consequential than in dementia, where diagnosis is often missing, delayed, ambiguous, or inconsistently documented across healthcare settings.
Quantifying Signal Decay: A National Study
In our first study, published in Alzheimer’s & Dementia, we analyzed 2016–2018 Medicare Part A hospitalization claims across more than 3,000 U.S. counties. We examined 17 ICD-10 dementia codes grouped into five categories: Alzheimer’s disease, vascular dementia, Pick’s disease (behavioral variant frontotemporal dementia), other neurocognitive disorders, and non-specific dementia.
The findings were striking.
Non-Specific Codes Dominate
The single most common dementia diagnosis nationally was F03.90—“Unspecified dementia, unspecified severity.” Non-specific codes appeared in over 75% of dementia hospitalizations across all states.
This dominance of non-specific codes reflects reduced diagnostic resolution, an early indicator of diagnostic signal decay in administrative data.
Signal Decays Over Time
Using temporal sequence analysis with our transitive sequential pattern mining (tSPM+) algorithm, we found consistent transitions from specific to non-specific codes over time. A patient might initially receive a specific Alzheimer’s disease code (G30.9), but subsequent hospitalizations increasingly document non-specific dementia (F03.90).
This isn’t patients getting better or diagnoses becoming uncertain. It’s documentation entropy—the systematic degradation of diagnostic specificity as clinical information passes through the healthcare system.
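As a toy illustration of the kind of transition this analysis surfaces (this is not the tSPM+ algorithm itself, which mines transitive sequential patterns across full temporal records), the sketch below counts specific-to-non-specific code transitions across consecutive hospitalizations. The patient sequences are hypothetical:

```python
from collections import Counter

# Non-specific dementia codes (illustrative subset).
NON_SPECIFIC = {"F03.90", "F03.91"}

def count_transitions(sequences):
    """Count specific -> non-specific code transitions between consecutive encounters."""
    transitions = Counter()
    for seq in sequences:
        for earlier, later in zip(seq, seq[1:]):
            if earlier not in NON_SPECIFIC and later in NON_SPECIFIC:
                transitions[(earlier, later)] += 1
    return transitions

# Hypothetical per-patient code sequences, ordered by hospitalization date.
patients = [
    ["G30.9", "G30.9", "F03.90"],  # Alzheimer's code decays to non-specific
    ["F01.50", "F03.90"],          # vascular dementia decays to non-specific
    ["F03.90", "F03.90"],          # non-specific throughout
]
decays = count_transitions(patients)
```

In real claims data the same counting would run over dated encounter records per beneficiary, and the interesting signal is the asymmetry: transitions toward non-specific codes far outnumber the reverse.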
Geography and Demographics Matter
Counties with higher proportions of rural residents, Medicaid-eligible patients, and Black or Hispanic dementia patients demonstrated significantly lower similarity to national diagnostic coding patterns. Our model explained 38% of the variation in local-to-national diagnostic alignment.
This geographic and demographic variation means that AI models trained predominantly on data from well-resourced academic medical centers may systematically underperform—or actively mislead—when deployed in underserved communities.
The Generalization Problem
These findings have direct implications for clinical AI. When diagnostic codes carry systematically different meanings across settings, models trained in one context will fail in another. This isn’t a simple “domain shift” that can be addressed with standard transfer learning techniques.
The problem is more fundamental: the labels themselves have become unreliable.
Consider what happens when you deploy a dementia risk prediction model trained at Massachusetts General Hospital to a rural health system in Mississippi:
- The training data contained high-fidelity diagnoses backed by specialist evaluations
- The deployment data contains codes assigned under time pressure without specialist input
- The same ICD-10 code carries different predictive value in each setting
- Model performance degrades not because the algorithm is wrong, but because the signal has decayed
Traditional approaches—recalibration, fine-tuning, domain adaptation—assume you have labeled outcomes in the target domain. But in real-world deployment, you often don’t. You’re flying blind.
The Signal Fidelity Index
In our second paper, published in Scientific Reports, we introduced a solution: the Signal Fidelity Index (SFI).
The SFI quantifies the reliability of diagnostic data at the patient level using six components, all computed from structured EHR metadata alone:
| Component | What It Measures |
|---|---|
| Diagnostic Specificity | Proportion of specific vs. non-specific codes |
| Temporal Consistency | Stability of diagnoses across encounters |
| Entropy | Diversity and uncertainty in coding patterns |
| Contextual Concordance | Alignment with appropriate care settings |
| Medication Alignment | Correspondence between diagnoses and treatments |
| Trajectory Stability | Consistency of disease progression patterns |
Each component captures a different dimension of diagnostic reliability. A patient with specific dementia codes, consistent documentation across visits, appropriate specialist encounters, and matching antidementia medications has high signal fidelity. A patient with non-specific codes, inconsistent documentation, only emergency department visits, and no disease-specific medications has low signal fidelity.
The composite SFI score—the arithmetic mean of these six normalized components—provides a single metric for data quality at the individual patient level.
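Under that equal-weight definition, the composite is straightforward once each component is normalized to [0, 1]. A minimal sketch, with hypothetical component scores and the per-component scoring logic simplified away:

```python
def signal_fidelity_index(components):
    """Composite SFI: arithmetic mean of six components, each normalized to [0, 1].

    `components` maps component name -> normalized score. The equal-weight mean
    follows the paper's description; how each component is scored from EHR
    metadata is omitted here.
    """
    expected = {
        "diagnostic_specificity", "temporal_consistency", "entropy",
        "contextual_concordance", "medication_alignment", "trajectory_stability",
    }
    assert set(components) == expected, "all six components required"
    assert all(0.0 <= v <= 1.0 for v in components.values())
    return sum(components.values()) / len(components)

# Hypothetical high-fidelity patient: specific codes, stable documentation,
# appropriate care settings, matching antidementia medications.
high_fidelity = signal_fidelity_index({
    "diagnostic_specificity": 0.9, "temporal_consistency": 0.85, "entropy": 0.8,
    "contextual_concordance": 0.9, "medication_alignment": 1.0,
    "trajectory_stability": 0.95,
})
```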
SFI-Aware Calibration
The key insight is that prediction reliability should correlate with diagnostic fidelity. High-fidelity data supports confident predictions. Low-fidelity data warrants skepticism.
We implemented this through a simple calibration formula:
ŷ_calibrated = ŷ_raw × [1 + α × (SFI_i − SFI_ref) / SFI_ref]
When a patient’s SFI exceeds the reference mean, predictions are amplified; when it falls below, predictions are attenuated. The calibration strength parameter α is optimized once per phenotype using labeled validation data, then applied to unlabeled populations at any scale.
This hybrid framework is crucial: you need labels to determine optimal α, but once established, you can apply SFI-aware calibration anywhere without requiring outcomes in the target domain.
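A minimal sketch of that workflow, assuming predictions and SFI scores arrive as arrays. Two details here are our assumptions for illustration, not from the paper: calibrated scores are clipped back to [0, 1], and α is selected by F1 at a 0.5 threshold:

```python
import numpy as np

def sfi_calibrate(y_raw, sfi, alpha, sfi_ref):
    """Scale each raw prediction by its patient's fidelity relative to the reference."""
    y_raw, sfi = np.asarray(y_raw, float), np.asarray(sfi, float)
    adjusted = y_raw * (1.0 + alpha * (sfi - sfi_ref) / sfi_ref)
    return np.clip(adjusted, 0.0, 1.0)  # clipping to valid probabilities is our addition

def _f1(y_true, y_pred):
    """Plain F1 score to avoid external dependencies."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def fit_alpha(y_raw, sfi, y_true, grid=np.arange(0.0, 3.01, 0.25)):
    """One-time grid search for calibration strength on a labeled validation set."""
    y_true = np.asarray(y_true)
    sfi_ref = float(np.mean(sfi))
    def score(alpha):
        preds = (sfi_calibrate(y_raw, sfi, alpha, sfi_ref) >= 0.5).astype(int)
        return _f1(y_true, preds)
    return max(grid, key=score)
```

Once `fit_alpha` has been run on a labeled validation cohort, only `sfi_calibrate` is needed at deployment, and it requires no outcome labels in the target population.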
Results Across Six Phenotypes
We evaluated SFI-aware calibration across six clinically diverse conditions, generating 3,550 synthetic EHR datasets. The phenotypes ranged from complex, under-diagnosed conditions with high diagnostic variability to well-diagnosed conditions with standardized criteria.
| Phenotype | F1 Improvement | AUC Improvement | Optimal α |
|---|---|---|---|
| Dementia | +34.9% | +7.3% | 2.5 |
| Geriatric Bipolar | +28.4% | +14.7% | 2.5 |
| Adult ADHD | +31.6% | +4.7% | 2.5 |
| Fibromyalgia | +4.2% | +5.2% | 0.5 |
| Type 2 Diabetes | +13.3% | +25.0% | 2.25 |
| Hypertension | +8.0% | +40.1% | 1.0 |
Complex Phenotypes Benefit Most
Dementia, geriatric bipolar disorder, and adult ADHD—all conditions characterized by diagnostic ambiguity, specialist requirements, and high coding heterogeneity—showed the largest improvements. Optimal α values of 2.5 indicate that strong calibration adjustments are warranted when diagnostic signal decay is substantial.
Even Well-Diagnosed Conditions Benefit
Perhaps surprisingly, type 2 diabetes and hypertension—conditions with objective diagnostic criteria (HbA1c thresholds, blood pressure measurements)—also showed substantial gains. This suggests that documentation heterogeneity affects all EHR-based phenotyping, not just complex conditions.
The variation in optimal α across phenotypes (0.5 to 2.5) emphasizes the need for phenotype-specific calibration strength. But all phenotypes showed positive improvements across wide α ranges, indicating robustness to moderate misspecification.
Why Traditional Calibration Fails
We compared SFI-aware calibration against five established methods: Platt scaling, isotonic regression, beta calibration, temperature scaling, and histogram binning.
The results were sobering.
Isotonic regression collapsed to near-chance performance across all phenotypes, with AUC values degrading to 0.51–0.56. F1 scores dropped by up to 91%.
Platt scaling showed inconsistent effects—improving some phenotypes, harming others, with no predictable pattern.
The fundamental problem: traditional calibration methods apply uniform adjustments to all predictions, regardless of underlying data quality. A prediction derived from a patient with specific diagnostic codes, specialist encounters, and appropriate medications receives identical calibration as one based on non-specific codes from a single primary care visit.
This uniformity assumption breaks down when diagnostic signal quality varies within the target population.
SFI-aware calibration never degraded performance on any phenotype or metric—demonstrating substantially greater robustness than any traditional approach.
The Dual Mechanism
Brier score decomposition revealed that SFI-aware calibration improves performance through two simultaneous mechanisms:
- Enhanced Calibration: Reduced reliability error (11–29% reduction), indicating better alignment between predicted probabilities and observed frequencies
- Improved Discrimination: Increased resolution (+35% to +238%), demonstrating better separation of true positives from false positives
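Both quantities fall out of the Murphy decomposition of the Brier score, which can be sketched as follows. Equal-width binning is an assumption here; bin schemes vary across implementations:

```python
import numpy as np

def brier_decomposition(y_prob, y_true, n_bins=10):
    """Murphy decomposition: Brier = reliability - resolution + uncertainty.

    Lower reliability (better-aligned probabilities) and higher resolution
    (better separation from the base rate) both reduce the Brier score.
    """
    y_prob, y_true = np.asarray(y_prob, float), np.asarray(y_true, float)
    n = len(y_true)
    base_rate = y_true.mean()
    bins = np.clip((y_prob * n_bins).astype(int), 0, n_bins - 1)
    reliability = resolution = 0.0
    for b in range(n_bins):
        mask = bins == b
        n_k = mask.sum()
        if n_k == 0:
            continue
        o_k = y_true[mask].mean()   # observed event frequency in bin
        f_k = y_prob[mask].mean()   # mean forecast in bin
        reliability += n_k / n * (f_k - o_k) ** 2
        resolution += n_k / n * (o_k - base_rate) ** 2
    uncertainty = base_rate * (1 - base_rate)
    return reliability, resolution, uncertainty
```

For a perfect forecaster, reliability is zero and resolution equals the uncertainty term, driving the Brier score to zero; SFI-aware calibration moves both terms in the favorable direction at once.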
Traditional calibration methods typically face a tradeoff—improving calibration often degrades discrimination, and vice versa. SFI-aware calibration avoids this tradeoff because the SFI functions as a feature-derived confidence estimator. High SFI scores identify patients where diagnostic codes reliably reflect true disease status, enabling selective upweighting of trustworthy predictions.
Implications for Clinical AI
These findings have immediate practical implications:
For Model Developers
- Report performance stratified by data quality, not just aggregate metrics
- Measure signal fidelity in your training data
- Validate across settings with different documentation practices
- Incorporate SFI-aware calibration before deployment
For Health Systems
- Audit diagnostic coding practices for consistency
- Recognize that AI performance varies by documentation quality
- Invest in diagnostic specificity as a data quality measure
- Consider fairness implications when coding patterns correlate with demographics
For Policymakers
- Documentation quality affects AI equity
- Rural and underserved populations face compounded disadvantage
- Standardized diagnostic criteria benefit both clinical care and AI reliability
The Road Ahead
Diagnostic signal decay is not inevitable. It’s a measurable phenomenon with tractable solutions.
The Signal Fidelity Index provides a practical framework for quantifying data reliability at the patient level. SFI-aware calibration offers a deployment-ready method for improving model performance across heterogeneous healthcare contexts—without requiring outcome labels in every new setting.
Real-world validation remains essential. Our simulation framework, while incorporating realistic epidemiological parameters, cannot capture the full complexity of clinical documentation. Multi-site validation using national cohorts with robust outcome adjudication is the critical next step.
But the core insight is clear: data quality is not noise to be ignored—it’s signal to be leveraged.
When we acknowledge that diagnostic codes carry different meanings in different contexts, and when we build that acknowledgment into our calibration strategies, we can develop AI systems that generalize not just across datasets, but across the messy, heterogeneous reality of healthcare delivery.
The signal is decaying. Now we know how to measure it—and how to restore it.
Read the papers:
- Quantifying diagnostic signal decay in dementia: a national study of Medicare hospitalization data — Spoto, Tian, Hügel, et al. Alzheimer’s & Dementia (2025)
- Signal Fidelity Index-aware calibration for addressing distributional shift in predictive modeling across heterogeneous real-world data — Cheng, Tian, Spoto, et al. Scientific Reports (2026)
Code availability: github.com/clai-group/SFI