Our early impressions of Claude Opus 4.6

02/17/2026
3 min read

Evaluating Opus 4.6 for AI Production Agents

At Resolve AI, we're building frontier long-horizon AI agents that have to reason across code, infrastructure, and large volumes of telemetry data. We tested Anthropic's Claude Opus 4.6 on our most difficult production investigation scenarios to understand its capabilities and constraints.

Here's what we found: even when used as a drop-in replacement for Opus 4.5, we saw a 5-10% lift across all of our benchmarks. Opus 4.6 handled async coordination with higher coherence, investigated deeply without explicit prompting, and maintained better focus across large contexts.

Uses async tools and subagents effectively

Async tools are integral to keeping long-running agents interactive. But a common failure mode we've seen in prior models is that they quickly lose coherence with async tools: they lose track of what got started and why.

Opus 4.6 understands async tools really well. In our evals, it systematically executed and tracked parallel work, maintaining clear awareness of dependencies between async tools. This extends naturally to subagent orchestration as well. It knows what it delegated, why, and how to synthesize results back into the main thread.
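The bookkeeping this demands can be sketched in a few lines. The tool names and payloads below are hypothetical stand-ins, not Resolve AI's implementation; the point is that each async call is recorded along with its purpose, so results can be synthesized back into the main thread instead of being forgotten:

```python
import asyncio

async def query_logs(service: str) -> str:
    # Stand-in for a slow telemetry query.
    await asyncio.sleep(0.01)
    return f"logs for {service}"

async def query_metrics(service: str) -> str:
    await asyncio.sleep(0.01)
    return f"metrics for {service}"

async def investigate(service: str) -> dict:
    # Launch tools in parallel, but keep a ledger of what was started and why.
    ledger = {
        "logs": ("check error spikes", asyncio.create_task(query_logs(service))),
        "metrics": ("check latency", asyncio.create_task(query_metrics(service))),
    }
    # Synthesize: every awaited result is paired with the reason it was launched.
    return {
        name: {"reason": reason, "result": await task}
        for name, (reason, task) in ledger.items()
    }

print(asyncio.run(investigate("checkout")))
```

An agent that drops the `reason` field can still collect results, but it can no longer explain why each piece of evidence matters — which is exactly the coherence loss described above.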

Less lazy, even in deep investigations

Prior Claude versions took frequent shortcuts, and we had to carefully design architectures that specifically counteracted this laziness.

Opus 4.6 is a very thorough model that can go deep in its investigations without explicit prompting to "be exhaustive" or "don't skip steps." If your prompts were tuned for a lazier model, you will now get extreme thoroughness, which increases end-to-end task times.

When used as a drop-in replacement for Opus 4.5, we observed a 40% increase in task completion times and we had to iterate on our prompts to make them work with Opus 4.6 under our latency constraints. For agents embedded in mission-critical workflows, this thoroughness may be warranted.
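To illustrate the kind of prompt iteration involved (the fragments below are hypothetical, not our production prompts): instructions written to counteract a lazier model can be replaced with explicit scope limits and stopping conditions, trading forced exhaustiveness for bounded latency:

```python
# Tuned for a lazier model: pushes the model to dig, at the cost of latency.
PROMPT_TUNED_FOR_LAZY_MODEL = (
    "Investigate the alert. Be exhaustive: check every dependent service "
    "and do not skip steps."
)

# Tuned for a thorough model: bounds the search instead of forcing it.
PROMPT_TUNED_FOR_THOROUGH_MODEL = (
    "Investigate the alert. Check at most the three most likely dependent "
    "services, and stop once you have a supported root-cause hypothesis."
)
```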

Attention doesn't drift in long context

Agents fill up their context quickly, and long-running agents can never have a large enough context window.

But even with very large context windows, the deeper into the context you go, the weaker a model's attention to its earlier tokens becomes. We at Resolve AI call this recency bias.

Opus 4.6 holds focus better and is much more resilient to recency bias. In our experiments, the model's outputs stayed aligned with prompt instructions and tool specifications even when those details were buried 200k+ tokens back.
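A simple way to probe this kind of resilience, sketched below with hypothetical helpers: bury an instruction at the start of the context, pad it with a large volume of filler, and check whether the model's output still honors the buried instruction:

```python
def build_probe(instruction: str, pad_lines: int) -> str:
    # Bury the instruction under many lines of plausible filler,
    # simulating a long investigation transcript.
    filler = "telemetry: all systems nominal.\n" * pad_lines
    return instruction + "\n" + filler + "Now summarize the incident."

def follows_instruction(output: str, marker: str) -> bool:
    # Crude adherence check: did the buried requirement survive to the output?
    return marker in output

probe = build_probe("Always prefix your answer with [INC-42].", pad_lines=100_000)
```

Sweeping `pad_lines` and plotting adherence against context depth gives a rough curve of how quickly attention to early instructions decays.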

Looking ahead

Production systems are mission-critical and change fast. AI agents for production should handle ambiguity, know when they're stuck, and coordinate across production data at scale. As part of our ongoing research, we want to evaluate frontier models along the following directions:

Async subagent orchestration. Using async subagents to orchestrate agent swarms that reason across code, infrastructure, and terabytes of telemetry data, while maintaining interactivity. This will require advanced abilities in nudging and early-stopping misguided subagents.

Human-agent collaboration. Asking for help, and improved human-agent collaboration generally: when an agent gets confused by incorrect instructions (e.g., outdated runbooks), can it proactively ask for clarification?

Adaptive thinking calibration. Can frontier models themselves scale computation based on task complexity, without extra nudges? Opus 4.6 is the first of its kind with adaptive thinking, and we are hopeful for further advances in this direction.


Authors

Vamsi Bedapudi

Member of Technical Staff @ Resolve AI

Rushin Shah

VP of Engineering

Rushin Shah is VP of Engineering at Resolve AI, with over a decade of AI expertise across DeepMind, Google, Meta, and Apple. Rushin led teams that shipped frontier AI capabilities like Deep Research, Canvas, Gemini in Chrome, and many more.

