The biostatistician who killed a duck by shooting once to the left and once to the right of the bird has become the dark parable of modern medical AI. On average, he hit the target perfectly. The duck, unfortunately, was not average.

Medical AI has an averaging problem. The very mathematics that makes machine learning powerful—minimizing error across large datasets—creates a systematic bias against the patients who need precision most: those with rare conditions, atypical presentations, and minority demographics.

We call this the average patient fallacy.

The Mathematics of Marginalization

Supervised learning seeks parameters θ* that minimize an expected loss across all patients:

θ* = arg min_θ E[L(y, f(x))]

The expectation is, by construction, frequency-weighted. Common presentations dominate the gradient updates. The parameters are coerced into a configuration optimized for the majority.

This isn’t a bug—it’s how the math works. And for the average patient, it works brilliantly. The problem is that rare cases—the ones with the highest stakes—get mathematically suppressed. Their gradients are drowned out by prevalence.
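A toy sketch makes the suppression concrete. The numbers below are illustrative, not from any real model: a rare 1% subgroup can have far larger per-example errors yet contribute only a few percent of the total gradient mass, so it barely moves the parameters.

```python
import numpy as np

# Illustrative only: frequency-weighted empirical risk makes the gradient a
# prevalence-weighted sum, so a rare subgroup's pull on the parameters is
# roughly proportional to its prevalence.
rng = np.random.default_rng(0)

n_common, n_rare = 990, 10  # 99% vs 1% prevalence
grad_common = rng.normal(1.0, 0.1, size=n_common)   # small, consistent errors
grad_rare = rng.normal(-5.0, 0.1, size=n_rare)      # large errors, opposite sign

all_grads = np.concatenate([grad_common, grad_rare])
rare_share = np.abs(grad_rare).sum() / np.abs(all_grads).sum()
print(f"rare subgroup contributes {rare_share:.1%} of total gradient mass")
```

Despite per-example errors five times larger, the rare subgroup supplies only a small fraction of the update signal: its gradients are, literally, drowned out by prevalence.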


[Figure: The Average Patient Fallacy]

Where the Fallacy Kills

Consider these clinical scenarios where optimizing for averages fails the patients who need precision most:

The Rare Responder in Oncology

A patient with metastatic cancer harbors a rare fusion mutation that predicts exceptional response to a targeted therapy. The AI system, trained predominantly on common mutation profiles, assigns a low probability of benefit. The treatment isn’t offered. The patient who might have responded never gets the chance.

The Atypical Emergency

A young woman presents with fatigue and vague abdominal discomfort. An AI early-warning system trained largely on sepsis, pneumonia, and trauma—relying on inflammatory markers and stereotyped deterioration patterns—never issues an alert. She has an ectopic pregnancy and autoimmune hemolysis, neither matching the common signatures. By the time humans notice, precious hours have passed.

The Vision-Threatening Variant

An AI diabetic retinopathy screener excels at detecting the most common patterns but misclassifies a subretinal hemorrhage in a patient with proliferative disease. The rare morphology falls outside the training distribution. Weeks pass before human review catches the error.

These aren’t edge cases to be dismissed. They are precisely the cases where clinical stakes are highest and AI failures most harmful.

The Collision of Two Worldviews

Here we confront the collision of two disparate intellectual traditions.

Classical machine learning is unapologetically utilitarian: maximize overall utility, even at the cost of performance in the tail of the distribution. Accuracy on the presentation seen in 99% of patients matters more, mathematically speaking, than accuracy on the presentation seen in 1%.

Clinical medicine operates on a fundamentally different principle: each patient deserves individual consideration. The rare diagnosis isn’t an outlier to be discarded; it’s often where the stakes are highest. A missed rare disease isn’t a rounding error—it’s a preventable tragedy.

The average patient fallacy isn’t just technically problematic. It’s ethically corrosive. It privileges the common at the expense of the vulnerable.

From Population Averages to N-of-1

If the problem is optimizing for the average, the solution is designing for the individual.

We propose a different architecture: a multi-agent ecosystem for N-of-1 decision support. Instead of a single model optimized for population means, we envision specialized agents clustered by organ systems, patient populations, and analytic modalities—each contributing expertise to the assessment of a single patient.


[Figure: N-of-1 Ecosystem]

These agents draw on a shared library of models and evidence synthesis tools. Their results converge in a coordination layer that weighs reliability, uncertainty, and data density before presenting the clinician with a decision-support packet:

  • Risk estimates bounded by confidence ranges — not false certainty
  • Outlier flags — explicit acknowledgment when the patient doesn’t fit the training distribution
  • Linked evidence — traceable reasoning, not black-box pronouncements
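The packet above can be sketched as a simple data structure. The class and field names here are hypothetical illustrations of the three elements listed, not a published schema:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the decision-support packet described above;
# names and fields are illustrative, not an actual specification.
@dataclass
class DecisionSupportPacket:
    risk_estimate: float                 # point estimate of patient risk
    confidence_interval: tuple           # (low, high) bounds, not false certainty
    outlier_flag: bool                   # patient falls outside training distribution
    evidence_links: list = field(default_factory=list)  # traceable reasoning sources

packet = DecisionSupportPacket(
    risk_estimate=0.12,
    confidence_interval=(0.04, 0.31),
    outlier_flag=True,
    evidence_links=["model card", "cohort-overlap report"],
)
print(packet.outlier_flag, packet.confidence_interval)
```

The wide interval and the raised outlier flag travel together to the clinician, rather than being collapsed into a single overconfident number.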

The key insight: validation shifts from population averages to individual reliability. We measure error in low-density regions of the data, calibration within small subgroups, and risk-coverage trade-offs that explicitly account for rare cases.

Operational Fixes

Beyond architectural changes, we propose concrete metrics and methods:

Rare Case Performance Gap (RCPG)

Measure the difference between model performance on common vs. rare presentations. If your model achieves 95% accuracy on typical cases but only 60% on rare ones, that 35-point gap is your RCPG—and it should be reported.
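The RCPG described above reduces to a one-line computation. A minimal sketch, using the worked numbers from the text:

```python
def rare_case_performance_gap(acc_common: float, acc_rare: float) -> float:
    """Common-case minus rare-case performance, in percentage points.

    A gap of 0 means parity; larger values mean the model degrades
    on rare presentations and the gap should be reported alongside
    aggregate accuracy.
    """
    return 100.0 * (acc_common - acc_rare)

# The example from the text: 95% on typical cases, 60% on rare ones.
gap = rare_case_performance_gap(0.95, 0.60)
print(f"RCPG: {gap:.0f} points")
```

The point is not the arithmetic but the reporting obligation: the gap becomes a first-class metric rather than something hidden inside an aggregate.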

Rare-Case Calibration Error

Standard calibration metrics average across the population. We need calibration measures specifically for low-prevalence subgroups. A model can be well-calibrated overall while being catastrophically miscalibrated for rare conditions.
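A short sketch shows how a model can pass an aggregate calibration check while failing the rare subgroup. The expected-calibration-error implementation and the toy cohort below are illustrative assumptions (binary labels, equal-width probability bins):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Binned |mean predicted prob - observed rate|, weighted by bin mass."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    bins = np.clip((probs * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(probs[mask].mean() - labels[mask].mean())
    return ece

# Toy cohort: the model says 0.9 for everyone. The common group really is
# ~90% positive; the rare 10-patient subgroup is only 20% positive.
probs = np.array([0.9] * 100)
labels = np.array([1] * 81 + [0] * 9 + [1] * 2 + [0] * 8)
is_rare = np.array([False] * 90 + [True] * 10)

overall = expected_calibration_error(probs, labels)
rare_only = expected_calibration_error(probs[is_rare], labels[is_rare])
print(f"overall ECE {overall:.2f}, rare-subgroup ECE {rare_only:.2f}")
```

The population-level number looks respectable while the rare subgroup is off by a factor of ten, which is exactly the failure the metric above is meant to expose.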

Prevalence-Utility Definition of Rarity

“Rare” shouldn’t be defined by prevalence alone. A condition affecting 0.1% of patients but causing severe, preventable harm deserves more weight than a benign condition affecting 1%. Rarity × clinical utility = priority.
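One way to operationalize this weighting is expected preventable harm: prevalence times the severity of a missed diagnosis. The harm scores below are made-up placeholders, not clinical values:

```python
def priority(prevalence: float, harm_if_missed: float) -> float:
    """Illustrative priority score: expected harm of ignoring the condition.

    High severity raises priority even when prevalence is low.
    """
    return prevalence * harm_if_missed

severe_rare = priority(0.001, harm_if_missed=100.0)   # 0.1% prevalence, severe harm
benign_common = priority(0.01, harm_if_missed=1.0)    # 1% prevalence, benign
print(severe_rare > benign_common)
```

Under this scoring, the rare-but-severe condition outranks the common-but-benign one by an order of magnitude, matching the intuition in the text.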

Clinically Weighted Objectives

Instead of pure frequency weighting, incorporate clinical stakes into the loss function. Errors on rare-but-severe conditions should incur higher penalties. Weight selection should follow structured deliberation with clinicians and ethicists—not arbitrary hyperparameter tuning.
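A minimal sketch of such an objective, assuming per-example severity weights chosen through the structured deliberation described above (the weights here are invented for illustration):

```python
import numpy as np

def weighted_bce(probs, labels, severity_weights):
    """Binary cross-entropy with clinical severity weights.

    Errors on rare-but-severe conditions incur larger penalties than
    the same-sized errors on benign ones.
    """
    probs = np.clip(np.asarray(probs, dtype=float), 1e-7, 1 - 1e-7)
    labels = np.asarray(labels, dtype=float)
    w = np.asarray(severity_weights, dtype=float)
    losses = -(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))
    return float(np.sum(w * losses) / np.sum(w))

# Correct on the common case (p=0.9), badly wrong on the rare-severe case (p=0.1).
probs, labels = [0.9, 0.1], [1, 1]
flat = weighted_bce(probs, labels, severity_weights=[1.0, 1.0])
clinical = weighted_bce(probs, labels, severity_weights=[1.0, 10.0])
print(f"flat loss {flat:.2f} vs clinically weighted loss {clinical:.2f}")
```

With frequency-style flat weights the rare miss is diluted; with clinical weights it dominates the objective and therefore the gradient.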

Safeguards Against Failure Modes

The N-of-1 ecosystem acknowledges what medicine has always known: atypical cases demand extra scrutiny. We build in explicit safeguards:

  • Outlier detection: Flag when a patient falls outside the training distribution
  • Abstention: Allow the system to say “I don’t know” rather than force a prediction
  • Consensus checks: Multiple agents must agree before high-confidence recommendations
  • Human oversight: Final decisions remain with clinicians who can integrate context the system cannot see

These aren’t just technical features—they’re philosophical commitments. AI in medicine must detect exceptional cases because of their significance, not despite it.

The Road Ahead

The average patient fallacy is not inevitable. It’s a design choice—one we can unmake.

Building AI that serves exceptional patients requires:

  1. Acknowledging the problem: Report rare-case performance, not just aggregate metrics
  2. Changing the math: Incorporate clinical weights, not just prevalence weights
  3. Architecting for individuals: Multi-agent systems that converge on the single patient
  4. Building in humility: Uncertainty quantification, outlier detection, and explicit abstention

Medicine has always been the art of treating the individual, not the average. AI must learn the same lesson.

The duck was not average. Neither are your patients.

