Building a Digital Clinical Team
Five AI Agents That Critique Each Other — And Outperform Human Prompt Engineering
Created: Jan 20, 2026 · Last updated: Feb 7, 2026
What if we could build AI systems that optimize themselves—without human intervention—while remaining transparent enough to trust in healthcare?
Our new study, published in npj Digital Medicine, introduces the first fully autonomous agentic workflow for clinical AI. Instead of relying on a single model or endless human prompt engineering, we built a system where five specialized AI agents collaborate, critique each other, and refine their reasoning—much like clinicians would in a case conference.
The Challenge: Scaling Clinical AI
Traditional cognitive screening tools like the Mini-Mental State Examination (MMSE) require dedicated administration time, trained personnel, and in-person visits—resources that overstretched healthcare systems often cannot spare. Yet early detection has become increasingly critical: FDA-approved Alzheimer’s therapies like lecanemab are most effective when administered early in the disease course.
Clinical notes already contain the signals we need. Word-finding difficulties, family concerns noted in passing, subtle changes in how patients describe their symptoms—these linguistic whispers of cognitive decline are documented at every healthcare encounter but rarely systematically analyzed.
The technical challenge lies in prompt engineering. In high-stakes medical applications, subtle phrasing variations can yield inconsistent AI classifications. Manual refinement demands extensive clinical expertise and iterative testing—difficult to scale across healthcare settings.
Our Solution: A Society of Specialized Agents
Rather than building a single AI model, we built a digital clinical team. The system coordinates five specialized agents, each representing a distinct clinical reasoning domain:
The Specialist Agent functions as the primary clinical decision-maker—equivalent to an expert neurologist. It processes clinical notes and generates binary classifications with detailed clinical reasoning.
Sensitivity and Specificity Improver Agents analyze misclassified cases. The Sensitivity Improver identifies subtle clinical evidence the system may have overlooked, while the Specificity Improver examines false positives to reduce overdiagnosis.
Summarizer Agents synthesize improvements across cases into coherent prompt refinements, ensuring that changes maintain clinical validity.
These agents operate in an iterative loop, refining their detection capabilities through structured collaboration until performance targets are met—all without human intervention.
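The loop above can be sketched in a few lines of Python. Everything below is an illustrative stand-in, not the published implementation: the agent functions are keyword-matching stubs where a real system would issue LLM calls, and all function names are invented for this sketch.

```python
def specialist_classify(prompt: str, note: str) -> bool:
    """Specialist agent: flag a note if any prompt cue appears in it.
    (Stub: a real agent would be an LLM call returning a classification
    plus clinical reasoning.)"""
    return any(cue in note.lower() for cue in prompt.split("|"))

def sensitivity_improver(prompt: str, missed_notes: list) -> str:
    """Propose cues for evidence overlooked in false negatives (stub)."""
    return prompt + "|word-finding" if missed_notes else prompt

def specificity_improver(prompt: str, false_pos_notes: list) -> str:
    """Propose restrictions that reduce false positives (stub)."""
    return prompt

def summarizer(candidates: list) -> str:
    """Merge candidate refinements into one coherent prompt (stub)."""
    return max(candidates, key=len)

def f1(tp: int, fp: int, fn: int) -> float:
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def refine(prompt, notes, labels, target_f1=0.90, max_iters=5):
    """Iterate: classify, route errors to the improver agents, summarize
    their suggestions into a new prompt, and repeat until the F1 target
    is met -- no human in the loop."""
    for _ in range(max_iters):
        preds = [specialist_classify(prompt, n) for n in notes]
        misses = [n for n, p, y in zip(notes, preds, labels) if y and not p]
        false_pos = [n for n, p, y in zip(notes, preds, labels) if p and not y]
        tp = sum(1 for p, y in zip(preds, labels) if p and y)
        if f1(tp, len(false_pos), len(misses)) >= target_f1:
            break
        prompt = summarizer([
            sensitivity_improver(prompt, misses),
            specificity_improver(prompt, false_pos),
        ])
    return prompt
```

The design point is the division of labor: the Specialist only classifies, each Improver only sees the error type it owns, and the Summarizer is the sole agent allowed to rewrite the prompt.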
Key Results: Autonomous Optimization Works
On our refinement dataset, the autonomous agentic workflow achieved an F1 score of 0.93, outperforming expert-driven prompt optimization (F1 = 0.87). The system optimized itself better than we could optimize it manually.
On real-world validation data, the system achieved 98% specificity: it rarely raises false alarms, so clinicians can trust its flags. That reliability is essential in busy clinical settings, where alert fatigue undermines the adoption of decision-support tools.
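For readers who want to connect these metrics: F1, sensitivity, and specificity all derive from the same confusion matrix. A minimal sketch, with hypothetical counts chosen only to land near the reported F1 (they are not the study's data):

```python
def metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)  # a.k.a. recall
    specificity = tn / (tn + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"precision": precision, "sensitivity": sensitivity,
            "specificity": specificity, "f1": f1}

# Hypothetical counts for illustration only.
m = metrics(tp=90, fp=4, fn=10, tn=96)
```

Note that F1 ignores true negatives entirely, which is why a system can post a strong F1 on a balanced refinement set and still behave differently once negatives dominate, as the limitations section below discusses.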
When AI and Humans Disagreed
One of our most striking findings emerged from expert re-adjudication of cases where the AI system and human reviewers reached different conclusions.
In 44% of cases initially classified as AI errors (false negatives), independent expert review determined that the AI’s reasoning was actually clinically appropriate. The system was making defensible, conservative judgments based on the available evidence.
This challenges a fundamental assumption in medical AI: that human annotation is always the gold standard. Sometimes the AI saw patterns that initial human review missed.
Honest About Limitations
We’re publishing exactly where this system struggles. Sensitivity dropped from 91% under balanced conditions to 62% under real-world prevalence. This “prevalence shift” effect is a calibration challenge that every medical AI system faces when moving from development to deployment—but most papers hide it.
We quantify the problem and provide a roadmap: threshold optimization, prevalence-aware recalibration, and cost-sensitive learning. The field needs this transparency if clinical AI is to be trusted.
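A related calibration effect can be made concrete with Bayes' rule: even holding sensitivity and specificity fixed, the positive predictive value (the probability that a flagged patient truly has a cognitive concern) falls as prevalence drops. A sketch using the reported real-world figures (62% sensitivity, 98% specificity) as fixed operating characteristics; the two prevalence values are illustrative assumptions, not numbers from the study:

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value via Bayes' rule: P(condition | flagged)."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

# Same model, different populations:
balanced = ppv(0.62, 0.98, prevalence=0.50)   # balanced development set
realworld = ppv(0.62, 0.98, prevalence=0.05)  # screening a rarer condition
```

This is why the roadmap items matter: threshold optimization and prevalence-aware recalibration adjust the operating point so the deployed system's trade-offs match the population it actually sees, rather than the one it was refined on.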
Privacy-Preserving, Locally Deployable
The entire system runs on Llama 3.1 8B—an open-weight model with 8 billion parameters that can be deployed locally within hospital IT infrastructure. No patient data is transmitted to external servers. Everything runs behind hospital firewalls.
Expert-level clinical reasoning doesn’t require frontier-scale models or cloud APIs. A moderate-scale open model, running locally, was sufficient. This democratizes clinical AI beyond well-resourced institutions.
Open Source: Pythia Library
To accelerate adoption, we’re releasing Pythia, an open-source Python library for autonomous prompt optimization via multi-agent collaboration.
Pythia provides a framework for deploying multi-agent prompt refinement on any large language model task, not only healthcare applications. Any research group can use it to build autonomous clinical AI without dedicated AI engineering staff.
GitHub: https://github.com/clai-group/Pythia
What This Means
We’re in a unique window where AI is powerful enough to be clinically useful yet still interpretable enough to be trusted. Every routine clinical encounter could become a cognitive screening opportunity—without adding a single test or minute to the visit.
The goal isn’t just to publish a paper—it’s to change how clinical AI gets built. Autonomous systems that optimize themselves, explain their reasoning, and acknowledge their limitations represent a new paradigm for scalable, trustworthy medical AI.
Read the full paper: An autonomous agentic workflow for clinical detection of cognitive concerns using large language models, npj Digital Medicine, 2026.
Authors: Jiazi Tian, Pedram Fard, Cameron Cagan, Neguine Rezaii, Rebeka Bustamante Rocha, Liqin Wang, Valdery Moura Junior, Deborah Blacker, Jennifer S. Haas, Chirag J. Patel, Shawn N. Murphy, Lidia M.V.R. Moura, Hossein Estiri