The Failures That Led to 2x Agent Accuracy Improvement

06/26/2026

I have been at Resolve AI since the start, so I have lived through every iteration of how we have built our agents, including those that did not behave as we expected. In internal evaluations, the new architecture, announced in May, delivers more than a 2x improvement in root cause quality over the prior model, the same model that helped customers like DoorDash reduce time to root cause by up to 87%. Getting there required going through a sequence of architectural failures and understanding what each one was telling us.

In this post, I share the journey that led us to our current architecture and the lessons we learned along the way. It follows up on our "Behind the Build: How we 2x'd agent accuracy" AMA to explain our approaches, our learnings, and why the agent architecture matters specifically for production systems.

Why this problem is harder than it looks

What makes it hard to get to the root cause of a production incident? The telemetry is, for all practical purposes, infinite. The scale of data you have to process and understand is so far beyond what our brains, or a context window, can hold that it might as well be infinite. So the real problem becomes: how do you query and summarize that data in a way that brings it down to something you can reason over, and then how do you reason over it to come up with theories that are actually useful? Get the exploration wrong in either direction, and you pay for it: over-explore and you burn the latency window that matters during a live incident; under-explore and you miss the actual cause entirely.

On top of that, production incidents span many different teams and knowledge bases. No one person at the company has the full mental model of how everything fits together, which is why incidents end up pulling in 10 or more engineers, each holding a partial picture. I like to say runbooks are always about 50 percent accurate. They are always somewhat right and somewhat out of date, and you need to understand which parts are which, making even the organizational knowledge layer a moving target.

And the cost of being wrong is very different from other domains. In a coding agent, an incorrect answer is most commonly caught by reviews or tests. In production, a confident wrong answer can send a junior engineer down a path that wastes an hour during a live outage, or it can lead to a remediation action that makes things worse. You can run a single command to drop an entire database. That asymmetry shapes every decision you make when you are building this kind of system.

The workflow architecture: rigid by design, rigid by consequence

In early 2024, when we started on our journey, models were not reliably calling tools in sequence. You could not count on getting through twenty tool calls in a row without something going wrong, whether that was a dropped call, malformed JSON that needed repair before the next stage could run, or context loss partway through. So our initial architecture was a tightly scripted workflow DAG to work around these limitations, where each stage was a narrow prompt with one job: classify the alert, gather evidence from logs and metrics, generate a relevant query, extract insights, and rank hypotheses. Between stages, we ran JSON repair and forwarded the sanitized results.
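As a rough illustration of that shape (not the actual Resolve AI implementation), the sketch below chains narrow stages and repairs each stage's JSON before forwarding it. The stage functions, their canned outputs, and the `repair_json` heuristics are all hypothetical stand-ins for model calls.

```python
import json
import re
from typing import Callable

# A stage takes the accumulated context and returns raw model text.
Stage = Callable[[dict], str]

def repair_json(raw: str) -> dict:
    """Best-effort repair of two common malformed-output patterns."""
    text = raw.strip()
    # Keep only the outermost {...} span, discarding chatter around it.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in stage output")
    text = text[start:end + 1]
    # Remove trailing commas before a closing brace or bracket.
    # Simplification: this would also touch ",}" inside a string value.
    text = re.sub(r",\s*([}\]])", r"\1", text)
    return json.loads(text)

# Hypothetical stages; each returns canned text in place of a model call.
def classify_alert(ctx: dict) -> str:
    return 'Sure! {"alert_type": "latency", }'

def gather_evidence(ctx: dict) -> str:
    return '{"evidence": ["p99 latency up on checkout"]}'

def rank_hypotheses(ctx: dict) -> str:
    return '{"hypotheses": ["recent deploy", "db saturation"],}'

def run_workflow(alert: dict, stages: list[Stage]) -> dict:
    """Run stages in a fixed order, forwarding sanitized results."""
    ctx = dict(alert)
    for stage in stages:
        ctx.update(repair_json(stage(ctx)))
    return ctx

result = run_workflow(
    {"alert": "checkout p99 > 2s"},
    [classify_alert, gather_evidence, rank_hypotheses],
)
print(result["alert_type"], result["hypotheses"])
```

The fixed stage list is the point of the sketch: nothing in `run_workflow` lets a stage decide to look somewhere the script did not anticipate.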

For the common cases, this worked reasonably well. The problem was the tail incidents, which are the ones that matter most in practice. Rare incidents are disproportionately severe. When an incident deviated from the workflow’s expectations, there was no mechanism to go off-script. The system would follow its predetermined evidence-gathering path, produce a report that appeared complete, and miss the actual cause entirely because it was outside the workflow’s field of view. As a result, our investigation model was locked into specific investigation flows that did not scale with complexity. The approach worked for predictable patterns that covered a substantial portion of the production issues we were dealing with, but failed in exactly the scenarios where on-call engineers were looking for a helping hand, because even they lacked the necessary context for those novel issues.

The single-agent architecture: three problems we encountered

In mid-2025, frontier models evolved, and tool calling became reliable enough for us to move to a genuinely agentic approach. Rather than scripting the investigation, a single primary agent would receive the alert, consult the knowledge graph we had built for the production environment, and dynamically decide which data to gather and in what order. The knowledge graph is a structured map of services, dependencies, deployments, and runbooks that provides agents with live context on how the system fits together. We still used task-specific subagents for bounded tasks to manage context, but the investigation, reasoning, and orchestration logic lived in a single primary agent. It could pursue unexpected leads, adjust based on what it found, and handle novel incident types that the workflow system could not. With this approach, we saw drastic improvements in accuracy and output quality, which was reflected in a surge in customer adoption: at the time, it was the leading approach, better than other systems or internal processes.

However, three failure modes began to emerge from this that we had not fully anticipated at the architectural level.

Context saturation. In a multi-service incident, the evidence trail gets very long. The agent is accumulating log summaries, metrics, deployment histories, dependency graphs, and intermediate conclusions throughout an ongoing investigation. At some point, it starts degrading: contradicting conclusions it reached earlier, asking for data it already retrieved, and missing something it surfaced a few steps back. This was not the classic hallucination of inventing things from nothing. It was the degradation you get when you load a system beyond what it can coherently hold, and it got worse as incident complexity increased. In hindsight, it is no surprise that machines face limitations similar to those humans do.
Single reasoning process shortfalls. A single agent generates a hypothesis and then evaluates it using the same reasoning process that produced it. There is no independent pressure on the conclusion. In practice, this looked like the agent latching onto a plausible root cause early in an investigation, then, because it was already looking through that theory’s lens, finding confirming evidence and stopping. The cases where this broke down were exactly the ones where the early theory was wrong but defensible, and the real root cause was in a thread the agent had deprioritized. We would get reports that were internally coherent and completely wrong, or in other words, the confident wrong answer.
Organizational learning gaps. Many incidents are not something you can investigate effectively without knowing what has been changing over time, what incidents have occurred in the past, and what caused them. The single-agent architecture had a knowledge graph that provided environmental context about the production system, but that context was static relative to the investigation itself. What it did not have was a way to accumulate learnings across investigations and make them accessible as a shared artifact for other agents or engineers picking up where a previous session left off. Production is a multiplayer problem that plays out across teams and time. When an incident spans a handoff between engineers, or when a recurring issue surfaces in a slightly different form, the investigation needs to carry forward what was learned before, not restart from the same baseline every time.

Multi-agent architecture: building around the actual constraints

The multi-agent architecture, a coordinated set of purpose-built agents each with a distinct role, came directly out of those three failure modes. We distributed the investigation across multiple agents, each carrying a focused slice of the investigation rather than the full accumulation. We built in structural independence so that conclusions could be evaluated by something that had not already committed to them. And we made the investigation itself a team artifact, with a shared evidence layer that any agent or engineer could access, that outlived any single session, and that accumulated organizational knowledge across investigations over time.

The system is structured around three agent types. Investigators pursue specific hypotheses across specific parts of the system in parallel, each carrying only the evidence relevant to their thread. Verifiers apply adversarial pressure to the conclusions investigators surface, checking whether the evidence actually supports what is being claimed rather than just whether the claim sounds plausible. Communicators manage stakeholder updates and keep the rest of the organization informed while the investigation runs. The verifier being structurally independent of the investigation is what catches the class of errors the single-agent system propagated most regularly: conclusions that were internally coherent but did not hold up when examined by something that had not already committed to the theory. That is the primary driver of the improvement in accuracy.
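To make the investigator/verifier split concrete, here is a minimal sketch under heavy assumptions: the `Finding` type, the substring matching that stands in for model reasoning, and the evidence-count threshold in `verify` are all hypothetical and much simpler than a real verifier. It shows only the structure: investigators run in parallel on their own evidence slices, and the verifier sees just the claim and the cited evidence.

```python
import asyncio
from dataclasses import dataclass

@dataclass
class Finding:
    hypothesis: str
    claim: str
    evidence: list[str]

async def investigate(hypothesis: str, evidence_slice: list[str]) -> Finding:
    """One investigator: sees only the evidence slice for its thread."""
    await asyncio.sleep(0)  # stand-in for model and tool calls
    # Toy relevance check in place of actual reasoning.
    supporting = [e for e in evidence_slice if hypothesis in e]
    return Finding(hypothesis, f"root cause: {hypothesis}", supporting)

def verify(finding: Finding, min_evidence: int = 2) -> bool:
    """Independent check: judges the claim only by the evidence cited.

    It never sees the investigator's intermediate reasoning, so it has
    not already committed to the theory. The threshold is an assumption.
    """
    return len(finding.evidence) >= min_evidence

async def run_investigation(threads: dict[str, list[str]]) -> list[Finding]:
    # Investigators pursue their hypotheses concurrently.
    findings = await asyncio.gather(
        *(investigate(h, ev) for h, ev in threads.items())
    )
    # Only findings that survive verification reach the report.
    return [f for f in findings if verify(f)]

threads = {
    "deploy": ["deploy 14:02 to checkout", "errors begin after deploy", "cpu normal"],
    "database": ["database connections normal"],
}
verified = asyncio.run(run_investigation(threads))
print([f.hypothesis for f in verified])
```

In this toy run the `database` thread is dropped because it cites a single piece of evidence, while the `deploy` thread cites two and passes.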

How we measure and how evals govern the decisions

The 2x figure is measured against the prior single-agent architecture on an internal benchmark of production-representative incidents. Accuracy here means whether the agents correctly identified the root cause, not whether they produced a coherent narrative. Those are genuinely different things, and conflating them was part of why the single-agent system looked better in informal review.
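The distinction between the two measurements can be sketched as follows. The `EvalResult` fields, the exact-match grading, and the four sample incidents are illustrative assumptions, not our benchmark or its data; grading open-ended root causes in practice needs something richer than string equality.

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    incident_id: str
    expected_root_cause: str
    predicted_root_cause: str
    narrative_coherent: bool  # e.g. a reviewer's judgment of the report

def root_cause_accuracy(results: list[EvalResult]) -> float:
    """Fraction of incidents where the predicted root cause is correct."""
    correct = sum(
        r.predicted_root_cause == r.expected_root_cause for r in results
    )
    return correct / len(results)

def coherence_rate(results: list[EvalResult]) -> float:
    """Fraction of reports that read as a coherent narrative."""
    return sum(r.narrative_coherent for r in results) / len(results)

# Made-up toy cases: every report reads well, but only half are right.
results = [
    EvalResult("inc-1", "bad deploy", "bad deploy", True),
    EvalResult("inc-2", "db saturation", "bad deploy", True),
    EvalResult("inc-3", "expired cert", "expired cert", True),
    EvalResult("inc-4", "memory leak", "network partition", True),
]

accuracy = root_cause_accuracy(results)
coherence = coherence_rate(results)
print(f"coherence={coherence:.2f} accuracy={accuracy:.2f}")
```

A system scored on coherence alone would look perfect on these toy cases, which is the kind of gap informal review can hide.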

We do not move to a new architecture because it seems more principled or better reflects how humans work. The only way to make that decision responsibly is with data and evals. So we spend a lot of time building out simulated environments with production-like issues we have seen in real life, and we are always testing the most up-to-date approaches against the current system to see where we actually are. Each architectural transition happened when the eval data showed the new approach outperforming the old one on the incident types that matter, including latency. We ran the workflow system longer than we felt comfortable because the single-agent approach was not yet demonstrably better, and made the move when it was.

Latency has been a constant constraint throughout. An answer within 30 minutes is better than no answer, but during a live incident, it is often insufficient. Getting initial insights within a couple of minutes and deeper root cause within a few minutes, depending on complexity, has been the product target across all three architectures. The multi-agent approach maintains that because the parallelism offsets the coordination overhead.
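The trade-off in that last sentence can be written as simple arithmetic. This is a simplified model with made-up numbers: it assumes investigation threads are fully independent and that coordination adds a fixed cost, neither of which holds exactly in a real system.

```python
def sequential_latency(thread_seconds: list[float]) -> float:
    """One agent working through every thread in turn."""
    return sum(thread_seconds)

def parallel_latency(thread_seconds: list[float], coordination_overhead: float) -> float:
    """Threads run concurrently; wall-clock is the slowest thread plus overhead."""
    return max(thread_seconds) + coordination_overhead

# Hypothetical per-thread durations and overhead, in seconds.
threads = [40.0, 55.0, 30.0]
sequential = sequential_latency(threads)
parallel = parallel_latency(threads, coordination_overhead=15.0)
print(sequential, parallel)
```

Under this model, parallelism pays off whenever the coordination overhead is smaller than the combined duration of all threads except the slowest one.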

The model layer and what Resolve AI Labs is working on

A question that comes up a lot is how much of this improvement is agent architecture versus models. Both matter, and they are not independent of each other. Architectural changes become possible as models improve, and model improvements yield larger gains when the architecture is built to leverage them effectively. Running a heavier frontier model does not solve the context saturation problem. The model improvements that actually help are the ones that make each individual agent’s reasoning more accurate within its focused scope.

There is also a specific place where frontier models fall short for production RCA, regardless of architecture. Causal reasoning in production systems requires understanding why a change in one service caused latency to increase in a dependent service three hops away, or why a query that performed fine under normal load starts causing cascading read IOPS failures at a specific traffic level. Frontier models are trained to be general-purpose, which makes them highly capable across a wide range of tasks but also sets a ceiling for this particular one. Cause-and-effect reasoning within complex distributed systems is a distinct capability, not just general intelligence applied to a new domain.

Resolve AI Labs launched a couple of months ago and is focused on exactly this. The work is on domain-specific post-trained models for causal reasoning in production systems, verifier models for evaluating open-ended RCA conclusions, and the simulation and replay infrastructure needed to generate training data for incident types that are, by definition, rare in the wild. Production incidents do not happen frequently enough to generate useful training signal in real time, and the telemetry from an incident six weeks ago has largely rotated out. Building synthetic environments and replay infrastructure to train and evaluate durably, independent of what telemetry still exists, is one of the core research problems the team is working on.

The multi-agent approach we ship today runs on a combination of post-trained and frontier models, with the orchestration, context management, and evaluation infrastructure that make them effective for this problem. As the Resolve AI Labs models continue to mature, they slot into the architecture where domain-specific causal reasoning matters most, without requiring the overall architecture to change.

What this looks like in practice

The agents Resolve AI ships today run investigations the way an experienced incident team would: parallel threads pursuing independent hypotheses, a verifier applying explicit pressure to conclusions before they reach the report, and an output the on-call engineer can interrogate directly, following up on specific data points or asking for the IDs behind an aggregated finding. The evidence used in the investigation is embedded in the report, so nothing is a black box.

The 2x accuracy improvement over the single-agent architecture means fewer false leads, fewer incorrect root causes that consume engineering time during a live outage, and more incidents in which the on-call engineer arrives at the Slack thread to find the investigation already completed. And the architecture will keep evolving. Every new production scenario we encounter and every meaningful model improvement feeds back into how the system is built. With more than 80 engineers focused exclusively on AI for prod, that iteration cycle is one of the things that keeps the system at the frontier. The failures that got us here were not fun to go through, but they were the only way to understand what production RCA actually requires.

Steven Karis

Chief Architect & Founding Engineer

Steven is a founding engineer at Resolve AI. He is focused on building the agentic AI systems that power Resolve's AI Production Engineer. He has previously held engineering roles at Splunk and Uber.

