At Resolve AI, we're building frontier long-horizon AI agents that reason across code, infrastructure, and large volumes of telemetry data. We tested Anthropic's Opus 4.6 on our most difficult production investigation scenarios to understand its capabilities and constraints.
Here's what we found: even when used as a drop-in replacement for Opus 4.5, we saw a 5-10% lift across all of our benchmarks. Opus 4.6 handled async coordination with higher coherence, investigated deeply without explicit prompting, and maintained better focus across large contexts.
Async tools are integral to keeping long-running agents interactive. But a common failure mode we've seen in prior models is that they quickly lose coherence with async tools: they lose track of what was started and why.
Opus 4.6 understands async tools really well. In our evals, it systematically executed and tracked parallel work, maintaining clear awareness of dependencies between async tools. This extends naturally to subagent orchestration as well. It knows what it delegated, why, and how to synthesize results back into the main thread.
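To make that bookkeeping concrete, here is a minimal Python sketch (illustrative only, not our production architecture) of the state an orchestrator has to keep for parallel tool calls to stay coherent: what was launched, why, and which calls depend on which results. The tool itself is just a stand-in.
```python
import asyncio
from dataclasses import dataclass, field

# Illustrative sketch only, not Resolve AI's production architecture.
# It shows the bookkeeping an orchestrator needs so async tool calls stay
# coherent: what was launched, why, and which calls depend on which results.

@dataclass
class AsyncToolCall:
    name: str
    reason: str                              # why the agent launched this call
    depends_on: list[str] = field(default_factory=list)
    task: asyncio.Task | None = None

async def fake_tool(name: str) -> str:
    await asyncio.sleep(0.1)                 # stand-in for a real query (logs, metrics, code search)
    return f"result from {name}"

async def run_investigation(calls: list[AsyncToolCall]) -> dict[str, str]:
    results: dict[str, str] = {}
    pending = {c.name: c for c in calls}
    while pending:
        # Launch every call whose dependencies are already satisfied.
        for c in pending.values():
            if c.task is None and all(d in results for d in c.depends_on):
                c.task = asyncio.create_task(fake_tool(c.name))
        running = [c for c in pending.values() if c.task is not None]
        done, _ = await asyncio.wait({c.task for c in running},
                                     return_when=asyncio.FIRST_COMPLETED)
        for c in running:
            if c.task in done:
                results[c.name] = c.task.result()
                del pending[c.name]
    return results

calls = [
    AsyncToolCall("query_metrics", reason="check the error-rate spike"),
    AsyncToolCall("search_logs", reason="find matching stack traces"),
    AsyncToolCall("correlate", reason="join metrics and logs",
                  depends_on=["query_metrics", "search_logs"]),
]
print(asyncio.run(run_investigation(calls)))
```
The point of the sketch is the explicit `reason` and `depends_on` fields: that is the state a model has to keep straight implicitly when it coordinates async tools inside a long conversation.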
Prior Claude versions took frequent shortcuts, and we had to carefully design architectures that specifically counteracted this laziness.
Opus 4.6 is a very thorough model that can go deep in its investigations without explicit prompting to "be exhaustive" or "don't skip steps." If your prompts were tuned for a lazier model, you will now get extreme thoroughness, which increases end-to-end task times.
When used as a drop-in replacement for Opus 4.5, we observed a 40% increase in task completion times, and we had to iterate on our prompts to keep Opus 4.6 within our latency constraints. For agents embedded in mission-critical workflows, this thoroughness may be warranted.
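As a hypothetical before/after illustration (these are not our actual prompts), the adjustment is less about asking for rigor and more about giving the model an explicit scope and a stopping rule:
```python
# Hypothetical before/after illustration, not Resolve AI's actual prompts.
# For a model that is exhaustive by default, the useful adjustment is an
# explicit scope and stopping rule rather than a request for thoroughness.

PROMPT_TUNED_FOR_A_LAZIER_MODEL = """\
You are an on-call investigation agent.
Be exhaustive. Do not skip steps. Check every related service before answering.
"""

PROMPT_TUNED_FOR_OPUS_4_6 = """\
You are an on-call investigation agent.
Investigate until you can name a probable root cause with supporting evidence.
Stay within the services directly implicated by the alert; note, but do not
pursue, secondary leads. Stop and report once the evidence is sufficient.
"""
```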
Agents fill up their context quickly, and long-running agents can never have a large enough context window.
But even with very large context windows, the further back a detail sits in the context, the weaker the model's attention to it becomes. We at Resolve AI call this recency bias.
Opus 4.6 holds focus better and is much more resilient to recency bias. In our experiments, the model's outputs stayed aligned with prompt instructions and tool specifications even when those details were buried 200k+ tokens back.
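A minimal sketch of what such a check looks like, with a stubbed-out model call and synthetic telemetry filler rather than our actual benchmark harness:
```python
# Minimal sketch of a long-context adherence check, not our actual benchmark.
# `call_model` is a stub standing in for whatever model client you use.

INSTRUCTION = (
    "When you report a finding, always cite the trace ID it came from "
    "in the form [trace:<id>]."
)

def build_long_prompt(filler_tokens: int = 200_000) -> str:
    # Put the instruction first, then pad with synthetic telemetry so it sits
    # a couple of hundred thousand tokens behind the final question.
    filler_line = "2024-05-01T12:00:00Z service=checkout latency_ms=112 status=200\n"
    approx_lines = filler_tokens // 15       # rough tokens-per-line estimate
    question = "\nQuestion: which service shows elevated latency, and what is your evidence?"
    return INSTRUCTION + "\n\n" + filler_line * approx_lines + question

def call_model(prompt: str) -> str:
    # Stub: replace with a real model call. Canned answer so the sketch runs.
    return "checkout shows elevated latency [trace:abc123]"

def follows_instruction(answer: str) -> bool:
    return "[trace:" in answer               # crude adherence check

answer = call_model(build_long_prompt())
print("adhered to buried instruction:", follows_instruction(answer))
```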
Production systems are mission-critical and change fast. AI agents for production should handle ambiguity, know when they're stuck, and coordinate across production data at scale. As part of our ongoing research, we want to evaluate frontier models on the following directions:
Async subagent orchestration. The ability to use async subagents to orchestrate agent swarms that reason across code, infrastructure, and terabytes of telemetry data, while maintaining interactivity. This will require the ability to nudge and stop misguided subagents early; a minimal sketch of that early-stopping side appears after this list.
Human-agent collaboration. Knowing when to ask for help. When an agent gets confused by incorrect instructions (e.g., outdated runbooks), can it actively ask for clarification?
Adaptive thinking calibration. Can frontier models themselves scale computation based on task complexity, without extra nudges? Opus 4.6 is the first of its kind with adaptive thinking, and we are hopeful for further advances in this direction.
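For the first of these directions, here is a minimal sketch of the early-stopping side, assuming a simple asyncio orchestrator and a placeholder heuristic for "misguided"; in practice that judgment would come from the orchestrating agent itself, and this is not our production implementation.
```python
import asyncio

# Illustrative sketch of early stopping for misguided subagents, assuming a
# simple asyncio orchestrator; not Resolve AI's production implementation.

async def subagent(name: str, progress: asyncio.Queue) -> str:
    # A real subagent would stream findings; here it just reports fake steps.
    for step in range(5):
        await asyncio.sleep(0.1)
        await progress.put((name, f"step {step}: examining {name} data"))
    return f"{name}: final report"

def is_misguided(name: str, update: str) -> bool:
    # Placeholder heuristic. In practice this judgment comes from the
    # orchestrating agent: is the subagent still working toward its goal?
    return name == "network-subagent" and "step 2" in update

async def orchestrate() -> list[str]:
    progress: asyncio.Queue = asyncio.Queue()
    tasks = {
        name: asyncio.create_task(subagent(name, progress))
        for name in ("code-subagent", "infra-subagent", "network-subagent")
    }
    stopped: list[str] = []
    while any(not t.done() for t in tasks.values()):
        try:
            name, update = await asyncio.wait_for(progress.get(), timeout=0.05)
        except asyncio.TimeoutError:
            continue                         # no update yet; re-check task states
        if is_misguided(name, update):
            tasks[name].cancel()             # early stop: don't let it keep burning tokens
            stopped.append(f"{name}: stopped early (off track)")
    finished = [await t for t in tasks.values() if not t.cancelled()]
    return finished + stopped

print(asyncio.run(orchestrate()))
```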
