At Resolve AI, we're building frontier long-horizon AI agents that reason across code, infrastructure, and large volumes of telemetry data. We tested Anthropic's Opus 4.6 on our most difficult production investigation scenarios to understand its capabilities and constraints.
Here's what we found: even when used as a drop-in replacement for Opus 4.5, we saw a 5-10% lift across all of our benchmarks. Opus 4.6 handled async coordination with higher coherence, investigated deeply without explicit prompting, and maintained better focus across large contexts.
Async tools are integral to keeping long-running agents interactive. But a common failure mode we've seen in prior models is that they quickly lose coherence with async tools: they lose track of what was started and why.
Opus 4.6 understands async tools really well. In our evals, it systematically executed and tracked parallel work, maintaining clear awareness of dependencies between async tools. This extends naturally to subagent orchestration as well. It knows what it delegated, why, and how to synthesize results back into the main thread.
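To make that bookkeeping concrete, here is a minimal Python sketch (illustrative only, not our production architecture) of the state an orchestrator has to keep for parallel tool calls to stay coherent: what was launched, why, and which calls depend on which results. The tool itself is just a stand-in.
```python
import asyncio
from dataclasses import dataclass, field

# Illustrative sketch only, not Resolve AI's production architecture.
# It shows the bookkeeping an orchestrator needs so async tool calls stay
# coherent: what was launched, why, and which calls depend on which results.

@dataclass
class AsyncToolCall:
    name: str
    reason: str                              # why the agent launched this call
    depends_on: list[str] = field(default_factory=list)
    task: asyncio.Task | None = None

async def fake_tool(name: str) -> str:
    await asyncio.sleep(0.1)                 # stand-in for a real query (logs, metrics, code search)
    return f"result from {name}"

async def run_investigation(calls: list[AsyncToolCall]) -> dict[str, str]:
    results: dict[str, str] = {}
    pending = {c.name: c for c in calls}
    while pending:
        # Launch every call whose dependencies are already satisfied.
        for c in pending.values():
            if c.task is None and all(d in results for d in c.depends_on):
                c.task = asyncio.create_task(fake_tool(c.name))
        running = [c for c in pending.values() if c.task is not None]
        done, _ = await asyncio.wait({c.task for c in running},
                                     return_when=asyncio.FIRST_COMPLETED)
        for c in running:
            if c.task in done:
                results[c.name] = c.task.result()
                del pending[c.name]
    return results

calls = [
    AsyncToolCall("query_metrics", reason="check the error-rate spike"),
    AsyncToolCall("search_logs", reason="find matching stack traces"),
    AsyncToolCall("correlate", reason="join metrics and logs",
                  depends_on=["query_metrics", "search_logs"]),
]
print(asyncio.run(run_investigation(calls)))
```
The point of the sketch is the explicit `reason` and `depends_on` fields: that is the state a model has to keep straight implicitly when it coordinates async tools inside a long conversation.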
Prior Claude versions took frequent shortcuts, and we had to carefully design architectures that specifically counteracted this laziness.
Opus 4.6 is a very thorough model that can go deep in its investigations without explicit prompting to "be exhaustive" or "don't skip steps." If your prompts were tuned for a lazier model, you will now get extreme thoroughness, which increases end-to-end task times.
When used as a drop-in replacement for Opus 4.5, we observed a 40% increase in task completion times, and we had to iterate on our prompts to keep Opus 4.6 within our latency constraints. For agents embedded in mission-critical workflows, this thoroughness may be warranted.
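As a hypothetical before/after illustration (these are not our actual prompts), the adjustment is less about asking for rigor and more about giving the model an explicit scope and a stopping rule:
```python
# Hypothetical before/after illustration, not Resolve AI's actual prompts.
# For a model that is exhaustive by default, the useful adjustment is an
# explicit scope and stopping rule rather than a request for thoroughness.

PROMPT_TUNED_FOR_A_LAZIER_MODEL = """\
You are an on-call investigation agent.
Be exhaustive. Do not skip steps. Check every related service before answering.
"""

PROMPT_TUNED_FOR_OPUS_4_6 = """\
You are an on-call investigation agent.
Investigate until you can name a probable root cause with supporting evidence.
Stay within the services directly implicated by the alert; note, but do not
pursue, secondary leads. Stop and report once the evidence is sufficient.
"""
```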
Agents fill up their context quickly, and long-running agents can never have a large enough context window.
But even with very large context windows, the further back a detail sits in the context, the weaker the model's attention to it becomes. We at Resolve AI call this recency bias.
Opus 4.6 holds focus better and is much more resilient to recency bias. In our experiments, the model's outputs stayed aligned with prompt instructions and tool specifications even when those details were buried 200k+ tokens back.
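A minimal sketch of what such a check looks like, with a stubbed-out model call and synthetic telemetry filler rather than our actual benchmark harness:
```python
# Minimal sketch of a long-context adherence check, not our actual benchmark.
# `call_model` is a stub standing in for whatever model client you use.

INSTRUCTION = (
    "When you report a finding, always cite the trace ID it came from "
    "in the form [trace:<id>]."
)

def build_long_prompt(filler_tokens: int = 200_000) -> str:
    # Put the instruction first, then pad with synthetic telemetry so it sits
    # a couple of hundred thousand tokens behind the final question.
    filler_line = "2024-05-01T12:00:00Z service=checkout latency_ms=112 status=200\n"
    approx_lines = filler_tokens // 15       # rough tokens-per-line estimate
    question = "\nQuestion: which service shows elevated latency, and what is your evidence?"
    return INSTRUCTION + "\n\n" + filler_line * approx_lines + question

def call_model(prompt: str) -> str:
    # Stub: replace with a real model call. Canned answer so the sketch runs.
    return "checkout shows elevated latency [trace:abc123]"

def follows_instruction(answer: str) -> bool:
    return "[trace:" in answer               # crude adherence check

answer = call_model(build_long_prompt())
print("adhered to buried instruction:", follows_instruction(answer))
```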
Production systems are mission-critical and change fast. AI agents for production should handle ambiguity, know when they're stuck, and coordinate across production data at scale. As part of our ongoing research, we want to evaluate frontier models on the following directions:
Async subagent orchestration. The ability to use async subagents to orchestrate agent swarms that reason across code, infrastructure, and terabytes of telemetry data, while maintaining interactivity. This will require the ability to nudge and stop misguided subagents early; a minimal sketch of that early-stopping side appears after this list.
Human-agent collaboration. Knowing when to ask for help. When an agent gets confused by incorrect instructions (e.g., outdated runbooks), can it actively ask for clarification?
Adaptive thinking calibration. Can frontier models themselves scale computation based on task complexity, without extra nudges? Opus 4.6 is the first of its kind with adaptive thinking, and we are hopeful for further advances in this direction.
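For the first of these directions, here is a minimal sketch of the early-stopping side, assuming a simple asyncio orchestrator and a placeholder heuristic for "misguided"; in practice that judgment would come from the orchestrating agent itself, and this is not our production implementation.
```python
import asyncio

# Illustrative sketch of early stopping for misguided subagents, assuming a
# simple asyncio orchestrator; not Resolve AI's production implementation.

async def subagent(name: str, progress: asyncio.Queue) -> str:
    # A real subagent would stream findings; here it just reports fake steps.
    for step in range(5):
        await asyncio.sleep(0.1)
        await progress.put((name, f"step {step}: examining {name} data"))
    return f"{name}: final report"

def is_misguided(name: str, update: str) -> bool:
    # Placeholder heuristic. In practice this judgment comes from the
    # orchestrating agent: is the subagent still working toward its goal?
    return name == "network-subagent" and "step 2" in update

async def orchestrate() -> list[str]:
    progress: asyncio.Queue = asyncio.Queue()
    tasks = {
        name: asyncio.create_task(subagent(name, progress))
        for name in ("code-subagent", "infra-subagent", "network-subagent")
    }
    stopped: list[str] = []
    while any(not t.done() for t in tasks.values()):
        try:
            name, update = await asyncio.wait_for(progress.get(), timeout=0.05)
        except asyncio.TimeoutError:
            continue                         # no update yet; re-check task states
        if is_misguided(name, update):
            tasks[name].cancel()             # early stop: don't let it keep burning tokens
            stopped.append(f"{name}: stopped early (off track)")
    finished = [await t for t in tasks.values() if not t.cancelled()]
    return finished + stopped

print(asyncio.run(orchestrate()))
```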
