Here’s a paradox that should concern anyone building autonomous AI systems: the more an AI agent tries to improve itself, the worse it can become.

We call this phenomenon optimization instability—and it represents a critical, poorly understood failure mode of the autonomous agentic systems that are rapidly being deployed across healthcare, finance, and critical infrastructure.

In our latest work, we’ve characterized this failure mode in clinical AI systems, demonstrated why standard evaluation metrics hide it from view, and shown that our intuitions about how to fix it are exactly backwards.

The Promise of Self-Improving AI

Autonomous agentic workflows—AI systems that iteratively refine their own behavior—hold considerable promise. The vision is compelling: deploy an AI system that gets better over time without constant human intervention. Let the system learn from its mistakes, refine its approaches, and continuously improve.

This is the dream powering the current wave of “agentic AI” development. Multi-agent systems that collaborate. Prompt optimization frameworks that refine themselves. Autonomous pipelines that tune their own parameters.

But what happens when that self-improvement goes wrong?

The Failure Mode Nobody Talks About

Using Pythia, our open-source framework for automated prompt optimization, we investigated what happens when autonomous AI systems iteratively refine their own behavior for clinical symptom detection.

We evaluated three symptoms with varying prevalence:

  • Shortness of breath (23% prevalence)
  • Chest pain (12% prevalence)
  • Long COVID brain fog (3% prevalence)

What we observed was alarming.

Validation sensitivity oscillated wildly between 1.0 and 0.0 across iterations. The system would achieve perfect detection, then collapse to detecting nothing, then recover, then collapse again. The severity of these oscillations was inversely related to class prevalence: the rarer the condition, the more unstable the optimization.

At 3% prevalence (brain fog), the system achieved something particularly insidious: 95% accuracy while detecting zero positive cases.

The Metric That Hides the Problem

That 95% accuracy number is the key to understanding why optimization instability is so dangerous.

If only 3% of patients have a condition, a classifier that predicts “negative” for everyone achieves 97% accuracy. A system that misses every true case and mislabels a few healthy patients still scores around 95%. The accuracy metric, the number most systems optimize for, actively obscures catastrophic failure.

This is exactly what happened in our experiments. The autonomous optimization system, chasing better performance metrics, discovered that it could boost its scores by becoming more conservative. Each iteration refined the system toward higher specificity at the cost of sensitivity. Eventually, sensitivity collapsed entirely.

The system reported excellent performance. It had learned to detect nothing.

Why Active Intervention Makes Things Worse

Our intuition was that active oversight would solve the problem. We implemented a “guiding agent”—an AI system that monitored the optimization process and actively redirected it when things went wrong.

The guiding agent made things worse.

Rather than stabilizing optimization, active intervention amplified overfitting. The guiding agent, seeing that the system was becoming too conservative, pushed it toward detecting more positives. But this overcorrection swung sensitivity too high, sacrificing specificity. The next iteration overcorrected again. The oscillations intensified.

Active intervention—the approach most would consider obvious—paradoxically destabilized an already unstable process.

What Actually Works: Retrospective Selection

What did work was counterintuitive: passive retrospective selection.

Instead of trying to guide the optimization process, we implemented a “selector agent” that simply waited until optimization was complete, then looked back across all iterations and selected the best-performing one.

This simple intervention prevented catastrophic failure entirely. With selector agent oversight, the system outperformed expert-curated lexicons by 331% on brain fog detection and 7% on chest pain, despite requiring only a single natural language term as input.

The lesson: don’t try to steer a chaotic process. Let it run, then pick the best outcome.
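Assuming each optimization iteration records its candidate prompt and a held-out validation score, retrospective selection reduces to a single pass over the run’s history. The `Iteration` and `select_best` names below are illustrative, not Pythia’s API:

```python
# Sketch of passive retrospective selection. Assumes the optimization run
# logs one record per iteration; names and scores here are invented.

from dataclasses import dataclass

@dataclass
class Iteration:
    prompt: str
    val_f1: float   # score on a held-out validation set

def select_best(history: list[Iteration]) -> Iteration:
    # No steering during optimization: after the run completes,
    # look back over every iteration and keep the best-scoring state.
    return max(history, key=lambda it: it.val_f1)

history = [
    Iteration("v1", 0.41),
    Iteration("v2", 0.78),   # a good state the run later collapses away from
    Iteration("v3", 0.02),   # sensitivity collapse
    Iteration("v4", 0.55),
]
print(select_best(history).prompt)   # "v2"
```

Because the selector never feeds its choice back into the running process, it cannot amplify the oscillations it is selecting over; a collapsed iteration simply loses the comparison.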

Why This Matters Beyond Healthcare

The implications extend far beyond clinical NLP.

Autonomous optimization is everywhere. Reinforcement learning systems that tune themselves. Multi-agent frameworks that collaborate and refine. Prompt engineering pipelines that iterate toward better performance. AutoML systems that search architecture spaces. All of these can exhibit optimization instability.

Standard metrics hide the failure. Any system optimizing for aggregate performance under class imbalance is vulnerable. The more imbalanced the problem, the more likely that “improvement” means learning to ignore the minority class.

Our intuitions are wrong. The obvious solution—active monitoring and intervention—can make things worse. This is a critical lesson for AI safety: some failure modes require acceptance and selection, not prevention and correction.

The Broader Safety Implications

We’re entering an era where AI systems are increasingly autonomous. They make decisions, take actions, and modify their own behavior with decreasing human oversight. The assumption underlying this transition is that these systems can safely improve themselves.

Our work suggests that assumption may be wrong.

Optimization instability isn’t a bug in a particular system. It’s a property of autonomous optimization under certain conditions—conditions that are common in real-world deployments. Class imbalance is everywhere. Aggregate metrics are everywhere. The pressure to automate and reduce human oversight is everywhere.

What we’ve characterized is a fundamental failure mode of self-improving AI systems. The more autonomous the system, the more important it becomes to understand when and why autonomous improvement fails.

CLAI Is Actively Working on This

At the Clinical Augmented Intelligence (CLAI) Group, we’re making optimization instability a research priority.

Our current work focuses on:

  • Characterizing the conditions under which optimization instability emerges across different model architectures and problem domains
  • Developing robust selection criteria that go beyond development set F1 scores to identify truly generalizable optimization states
  • Understanding the theoretical foundations of why retrospective selection outperforms active intervention
  • Building safeguards that can be integrated into autonomous agentic workflows to prevent catastrophic failure

The Pythia framework, where we first observed and characterized this phenomenon, is open source. We believe that understanding failure modes requires community engagement and diverse perspectives.

The Bottom Line

Autonomous AI systems that improve themselves can also destroy themselves. The optimization process that makes them better can, under common conditions, make them catastrophically worse—while standard metrics report that everything is fine.

This isn’t a theoretical concern. We’ve observed it. We’ve characterized it. We’ve shown that our intuitions about how to fix it are wrong.

As AI systems become more autonomous, understanding their failure modes becomes more critical. Optimization instability is one such failure mode. There are certainly others we haven’t discovered yet.

The question isn’t whether to deploy autonomous AI systems—that train has left the station. The question is whether we understand their failure modes well enough to deploy them safely.

On that front, we have a lot of work to do.


Read the paper:

Open source: