Diagnostic Signal Decay and the Signal Fidelity Index
Why AI Models Fail When They Leave Home—And How to Fix It
Created Dec 18, 2025 - Last updated: Dec 18, 2025
Your AI model performs beautifully in development. You deploy it to a new hospital system. Performance collapses.
This isn’t a bug in your algorithm. It’s a feature of how healthcare data works.
The diagnostic codes your model learned from don’t mean the same thing everywhere. A dementia diagnosis at a major academic medical center—backed by neuroimaging, specialist evaluation, and neuropsychological testing—carries different information than the same ICD-10 code assigned during a rushed emergency department visit at a rural hospital. Same code. Radically different signal.
We call this phenomenon diagnostic signal decay. And in two recent papers, we’ve shown not only how to measure it, but how to fix it.
The Hidden Crisis in Clinical AI
Machine learning models in healthcare increasingly rely on administrative and electronic health record data—particularly diagnostic codes—to define phenotypes, identify at-risk cohorts, and predict outcomes. The computational capacity to detect complex patterns has grown enormously. But the fidelity of the clinical signals embedded in these data has received far less scrutiny.
Nowhere is this gap more consequential than in dementia, where diagnosis is often missing, delayed, ambiguous, or inconsistently documented across healthcare settings.
Quantifying Signal Decay: A National Study
In our first study, published in Alzheimer’s & Dementia, we analyzed 2016–2018 Medicare Part A hospitalization claims across more than 3,000 U.S. counties. We examined 17 ICD-10 dementia codes grouped into five categories: Alzheimer’s disease, vascular dementia, Pick’s disease (behavioral variant frontotemporal dementia), other neurocognitive disorders, and non-specific dementia.
The findings were striking.
Non-Specific Codes Dominate
The single most common dementia diagnosis nationally was F03.90—“Unspecified dementia, unspecified severity.” Non-specific codes appeared in over 75% of dementia hospitalizations across all states.
This dominance of non-specific codes reflects reduced diagnostic resolution, an early indicator of diagnostic signal decay in administrative data.
Signal Decays Over Time
Using temporal sequence analysis with our transitive sequential pattern mining (tSPM+) algorithm, we found consistent transitions from specific to non-specific codes over time. A patient might initially receive a specific Alzheimer’s disease code (G30.9), but subsequent hospitalizations increasingly document non-specific dementia (F03.90).
This isn’t patients getting better or diagnoses becoming uncertain. It’s documentation entropy—the systematic degradation of diagnostic specificity as clinical information passes through the healthcare system.
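As a toy illustration of the kind of transition this analysis surfaces (this is not the tSPM+ algorithm itself, which mines transitive sequential patterns across full temporal records), the sketch below counts specific-to-non-specific code transitions across consecutive hospitalizations. The patient sequences are hypothetical:

```python
from collections import Counter

# Non-specific dementia codes (illustrative subset).
NON_SPECIFIC = {"F03.90", "F03.91"}

def count_transitions(sequences):
    """Count specific -> non-specific code transitions between consecutive encounters."""
    transitions = Counter()
    for seq in sequences:
        for earlier, later in zip(seq, seq[1:]):
            if earlier not in NON_SPECIFIC and later in NON_SPECIFIC:
                transitions[(earlier, later)] += 1
    return transitions

# Hypothetical per-patient code sequences, ordered by hospitalization date.
patients = [
    ["G30.9", "G30.9", "F03.90"],  # Alzheimer's code decays to non-specific
    ["F01.50", "F03.90"],          # vascular dementia decays to non-specific
    ["F03.90", "F03.90"],          # non-specific throughout
]
decays = count_transitions(patients)
```

In real claims data the same counting would run over dated encounter records per beneficiary, and the interesting signal is the asymmetry: transitions toward non-specific codes far outnumber the reverse.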
Geography and Demographics Matter
Counties with higher proportions of rural residents, Medicaid-eligible patients, and Black or Hispanic dementia patients demonstrated significantly lower similarity to national diagnostic coding patterns. Our model explained 38% of the variation in local-to-national diagnostic alignment.
This geographic and demographic variation means that AI models trained predominantly on data from well-resourced academic medical centers may systematically underperform—or actively mislead—when deployed in underserved communities.
The Generalization Problem
These findings have direct implications for clinical AI. When diagnostic codes carry systematically different meanings across settings, models trained in one context will fail in another. This isn’t a simple “domain shift” that can be addressed with standard transfer learning techniques.
The problem is more fundamental: the labels themselves have become unreliable.
Consider what happens when you deploy a dementia risk prediction model trained at Massachusetts General Hospital to a rural health system in Mississippi:
- The training data contained high-fidelity diagnoses backed by specialist evaluations
- The deployment data contains codes assigned under time pressure without specialist input
- The same ICD-10 code carries different predictive value in each setting
- Model performance degrades not because the algorithm is wrong, but because the signal has decayed
Traditional approaches—recalibration, fine-tuning, domain adaptation—assume you have labeled outcomes in the target domain. But in real-world deployment, you often don’t. You’re flying blind.
The Signal Fidelity Index
In our second paper, published in Scientific Reports, we introduced a solution: the Signal Fidelity Index (SFI).
The SFI quantifies the reliability of diagnostic data at the patient level using six components, all computed from structured EHR metadata alone:
| Component | What It Measures |
|---|---|
| Diagnostic Specificity | Proportion of specific vs. non-specific codes |
| Temporal Consistency | Stability of diagnoses across encounters |
| Entropy | Diversity and uncertainty in coding patterns |
| Contextual Concordance | Alignment with appropriate care settings |
| Medication Alignment | Correspondence between diagnoses and treatments |
| Trajectory Stability | Consistency of disease progression patterns |
Each component captures a different dimension of diagnostic reliability. A patient with specific dementia codes, consistent documentation across visits, appropriate specialist encounters, and matching antidementia medications has high signal fidelity. A patient with non-specific codes, inconsistent documentation, only emergency department visits, and no disease-specific medications has low signal fidelity.
The composite SFI score—the arithmetic mean of these six normalized components—provides a single metric for data quality at the individual patient level.
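Under that equal-weight definition, the composite is straightforward once each component is normalized to [0, 1]. A minimal sketch, with hypothetical component scores and the per-component scoring logic simplified away:

```python
def signal_fidelity_index(components):
    """Composite SFI: arithmetic mean of six components, each normalized to [0, 1].

    `components` maps component name -> normalized score. The equal-weight mean
    follows the paper's description; how each component is scored from EHR
    metadata is omitted here.
    """
    expected = {
        "diagnostic_specificity", "temporal_consistency", "entropy",
        "contextual_concordance", "medication_alignment", "trajectory_stability",
    }
    assert set(components) == expected, "all six components required"
    assert all(0.0 <= v <= 1.0 for v in components.values())
    return sum(components.values()) / len(components)

# Hypothetical high-fidelity patient: specific codes, stable documentation,
# appropriate care settings, matching antidementia medications.
high_fidelity = signal_fidelity_index({
    "diagnostic_specificity": 0.9, "temporal_consistency": 0.85, "entropy": 0.8,
    "contextual_concordance": 0.9, "medication_alignment": 1.0,
    "trajectory_stability": 0.95,
})
```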
SFI-Aware Calibration
The key insight is that prediction reliability should correlate with diagnostic fidelity. High-fidelity data supports confident predictions. Low-fidelity data warrants skepticism.
We implemented this through a simple calibration formula:
ŷ_calibrated = ŷ_raw × [1 + α × (SFI_i − SFI_ref) / SFI_ref]
When a patient’s SFI exceeds the reference mean, predictions are amplified; when it falls below, predictions are attenuated. The calibration strength parameter α is optimized once per phenotype using labeled validation data, then applied to unlabeled populations at any scale.
This hybrid framework is crucial: you need labels to determine optimal α, but once established, you can apply SFI-aware calibration anywhere without requiring outcomes in the target domain.
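A minimal sketch of that workflow, assuming predictions and SFI scores arrive as arrays. Two details here are our assumptions for illustration, not from the paper: calibrated scores are clipped back to [0, 1], and α is selected by F1 at a 0.5 threshold:

```python
import numpy as np

def sfi_calibrate(y_raw, sfi, alpha, sfi_ref):
    """Scale each raw prediction by its patient's fidelity relative to the reference."""
    y_raw, sfi = np.asarray(y_raw, float), np.asarray(sfi, float)
    adjusted = y_raw * (1.0 + alpha * (sfi - sfi_ref) / sfi_ref)
    return np.clip(adjusted, 0.0, 1.0)  # clipping to valid probabilities is our addition

def _f1(y_true, y_pred):
    """Plain F1 score to avoid external dependencies."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def fit_alpha(y_raw, sfi, y_true, grid=np.arange(0.0, 3.01, 0.25)):
    """One-time grid search for calibration strength on a labeled validation set."""
    y_true = np.asarray(y_true)
    sfi_ref = float(np.mean(sfi))
    def score(alpha):
        preds = (sfi_calibrate(y_raw, sfi, alpha, sfi_ref) >= 0.5).astype(int)
        return _f1(y_true, preds)
    return max(grid, key=score)
```

Once `fit_alpha` has been run on a labeled validation cohort, only `sfi_calibrate` is needed at deployment, and it requires no outcome labels in the target population.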
Results Across Six Phenotypes
We evaluated SFI-aware calibration across six clinically diverse conditions, generating 3,550 synthetic EHR datasets. The phenotypes ranged from complex, under-diagnosed conditions with high diagnostic variability to well-diagnosed conditions with standardized criteria.
| Phenotype | F1 Improvement | AUC Improvement | Optimal α |
|---|---|---|---|
| Dementia | +34.9% | +7.3% | 2.5 |
| Geriatric Bipolar | +28.4% | +14.7% | 2.5 |
| Adult ADHD | +31.6% | +4.7% | 2.5 |
| Fibromyalgia | +4.2% | +5.2% | 0.5 |
| Type 2 Diabetes | +13.3% | +25.0% | 2.25 |
| Hypertension | +8.0% | +40.1% | 1.0 |
Complex Phenotypes Benefit Most
Dementia, geriatric bipolar disorder, and adult ADHD—all conditions characterized by diagnostic ambiguity, specialist requirements, and high coding heterogeneity—showed the largest improvements. Optimal α values of 2.5 indicate that strong calibration adjustments are warranted when diagnostic signal decay is substantial.
Even Well-Diagnosed Conditions Benefit
Perhaps surprisingly, type 2 diabetes and hypertension—conditions with objective diagnostic criteria (HbA1c thresholds, blood pressure measurements)—also showed substantial gains. This suggests that documentation heterogeneity affects all EHR-based phenotyping, not just complex conditions.
The variation in optimal α across phenotypes (0.5 to 2.5) emphasizes the need for phenotype-specific calibration strength. But all phenotypes showed positive improvements across wide α ranges, indicating robustness to moderate misspecification.
Why Traditional Calibration Fails
We compared SFI-aware calibration against five established methods: Platt scaling, isotonic regression, beta calibration, temperature scaling, and histogram binning.
The results were sobering.
Isotonic regression collapsed to near-chance performance across all phenotypes, with AUC values degrading to 0.51–0.56. F1 scores dropped by up to 91%.
Platt scaling showed inconsistent effects—improving some phenotypes, harming others, with no predictable pattern.
The fundamental problem: traditional calibration methods apply uniform adjustments to all predictions, regardless of underlying data quality. A prediction derived from a patient with specific diagnostic codes, specialist encounters, and appropriate medications receives identical calibration as one based on non-specific codes from a single primary care visit.
This uniformity assumption breaks down when diagnostic signal quality varies within the target population.
SFI-aware calibration never degraded performance on any phenotype or metric—demonstrating substantially greater robustness than any traditional approach.
The Dual Mechanism
Brier score decomposition revealed that SFI-aware calibration improves performance through two simultaneous mechanisms:
- Enhanced Calibration: Reduced reliability error (11–29% reduction), indicating better alignment between predicted probabilities and observed frequencies
- Improved Discrimination: Increased resolution (+35% to +238%), demonstrating better separation of true positives from false positives
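Both quantities fall out of the Murphy decomposition of the Brier score, which can be sketched as follows. Equal-width binning is an assumption here; bin schemes vary across implementations:

```python
import numpy as np

def brier_decomposition(y_prob, y_true, n_bins=10):
    """Murphy decomposition: Brier = reliability - resolution + uncertainty.

    Lower reliability (better-aligned probabilities) and higher resolution
    (better separation from the base rate) both reduce the Brier score.
    """
    y_prob, y_true = np.asarray(y_prob, float), np.asarray(y_true, float)
    n = len(y_true)
    base_rate = y_true.mean()
    bins = np.clip((y_prob * n_bins).astype(int), 0, n_bins - 1)
    reliability = resolution = 0.0
    for b in range(n_bins):
        mask = bins == b
        n_k = mask.sum()
        if n_k == 0:
            continue
        o_k = y_true[mask].mean()   # observed event frequency in bin
        f_k = y_prob[mask].mean()   # mean forecast in bin
        reliability += n_k / n * (f_k - o_k) ** 2
        resolution += n_k / n * (o_k - base_rate) ** 2
    uncertainty = base_rate * (1 - base_rate)
    return reliability, resolution, uncertainty
```

For a perfect forecaster, reliability is zero and resolution equals the uncertainty term, driving the Brier score to zero; SFI-aware calibration moves both terms in the favorable direction at once.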
Traditional calibration methods typically face a tradeoff—improving calibration often degrades discrimination, and vice versa. SFI-aware calibration avoids this tradeoff because the SFI functions as a feature-derived confidence estimator. High SFI scores identify patients where diagnostic codes reliably reflect true disease status, enabling selective upweighting of trustworthy predictions.
Implications for Clinical AI
These findings have immediate practical implications:
For Model Developers
- Report performance stratified by data quality, not just aggregate metrics
- Measure signal fidelity in your training data
- Validate across settings with different documentation practices
- Incorporate SFI-aware calibration before deployment
For Health Systems
- Audit diagnostic coding practices for consistency
- Recognize that AI performance varies by documentation quality
- Invest in diagnostic specificity as a data quality measure
- Consider fairness implications when coding patterns correlate with demographics
For Policymakers
- Documentation quality affects AI equity
- Rural and underserved populations face compounded disadvantage
- Standardized diagnostic criteria benefit both clinical care and AI reliability
The Road Ahead
Diagnostic signal decay is not inevitable. It’s a measurable phenomenon with tractable solutions.
The Signal Fidelity Index provides a practical framework for quantifying data reliability at the patient level. SFI-aware calibration offers a deployment-ready method for improving model performance across heterogeneous healthcare contexts—without requiring outcome labels in every new setting.
Real-world validation remains essential. Our simulation framework, while incorporating realistic epidemiological parameters, cannot capture the full complexity of clinical documentation. Multi-site validation using national cohorts with robust outcome adjudication is the critical next step.
But the core insight is clear: data quality is not noise to be ignored—it’s signal to be leveraged.
When we acknowledge that diagnostic codes carry different meanings in different contexts, and when we build that acknowledgment into our calibration strategies, we can develop AI systems that generalize not just across datasets, but across the messy, heterogeneous reality of healthcare delivery.
The signal is decaying. Now we know how to measure it—and how to restore it.
Read the papers:
- Quantifying diagnostic signal decay in dementia: a national study of Medicare hospitalization data — Spoto, Tian, Hügel, et al. Alzheimer’s & Dementia (2025)
- Signal Fidelity Index-aware calibration for addressing distributional shift in predictive modeling across heterogeneous real-world data — Cheng, Tian, Spoto, et al. Scientific Reports (2026)
Code availability: github.com/clai-group/SFI