
In our previous post, we shared our early impressions of Claude Opus 4.6’s strengths in agent coordination, thoroughness, and long-context attention. In this post, we focus on Sonnet 4.6, the effort parameter, and what adaptive thinking means in practice for production agents.
Both Opus 4.6 and the newly released Sonnet 4.6 support adaptive thinking and the effort parameter. We benchmark these models against a curated set of real production incidents (from subtle misconfigurations to cascading failures), scoring on root-cause accuracy and investigation completeness. Sonnet 4.6 at medium effort with adaptive thinking came surprisingly close to Opus 4.6 on our hardest investigations, at a fraction of the cost. Adaptive thinking eliminated the need to manually calibrate reasoning depth, and the effort parameter gave us a reliable lever for the quality-latency tradeoff.
Previous Claude models (Sonnet 4.5, Opus 4.5) offered binary extended thinking: off, or on with a fixed token budget. The 4.6 generation introduces adaptive thinking: the model decides for itself when and how deeply to reason.
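The difference shows up directly in request configuration. Here is a minimal sketch of the two shapes: the fixed-budget fragment follows the documented extended-thinking parameter in the Anthropic Messages API, while the `adaptive` value is our shorthand for the 4.6 behavior, not a confirmed field name.

```python
# Fixed-budget extended thinking (4.5-era): reasoning depth is chosen up front,
# the same budget for routine and hard steps alike.
fixed_thinking = {"type": "enabled", "budget_tokens": 10_000}

# Adaptive thinking (4.6-era): the model decides when and how deeply to reason.
# "adaptive" here is an assumed field value used to illustrate the contrast.
adaptive_thinking = {"type": "adaptive"}
```

The key structural change: there is no budget to tune in the adaptive case, which is exactly what removes the manual calibration step.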
This matters because production incidents are unpredictable. A cascading failure across three services might look routine at first, with an obvious spike in error rates. But under specific conditions, it could be a novel incident that requires deep investigation. Fixed thinking budgets can't handle this well. You either over-allocate reasoning on routine steps and burn latency, or under-allocate on the hard parts and miss the root cause. Adaptive thinking handles the transition naturally.
In practice, the model thinks less early in an investigation, while it is gathering evidence: pulling logs and querying metrics. It dispatches tool calls and moves through evidence collection without over-deliberating.
As the investigation deepens and the agent starts correlating evidence, the behavior shifts. When it needs to cross-reference timestamps across multiple signals, evaluate whether evidence supports or contradicts a hypothesis, or reason through causal chains between services, the model thinks significantly more. It self-reflects on evidence relevance and pays close attention to temporal ordering.
Adaptive thinking naturally allocates deeper reasoning to the hard parts (correlation vs causation, determining next investigation steps) and stays light on the routine parts.
Always set a high max output token limit, at least 16k. Thinking and output tokens share the same budget, so with lower limits the model hits the ceiling mid-reasoning and cuts off abruptly, with no graceful degradation. We default to 32k max_tokens and tune this down only for simpler subagent tasks.
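A minimal sketch of how to guard that budget rule when building request parameters. The helper name and constants are our own; the `thinking` and `effort` fields follow the shapes discussed in this post rather than verified API signatures.

```python
DEFAULT_MAX_TOKENS = 32_000  # our default for top-level investigation agents
MIN_MAX_TOKENS = 16_000      # below this, thinking can hit the ceiling mid-reasoning


def build_request_params(model: str, effort: str = "medium",
                         max_tokens: int = DEFAULT_MAX_TOKENS) -> dict:
    """Build request params with a safe shared thinking/output budget."""
    if max_tokens < MIN_MAX_TOKENS:
        # Fail loudly at config time instead of truncating reasoning at runtime.
        raise ValueError(
            f"max_tokens={max_tokens} risks cutting off mid-reasoning; "
            f"use at least {MIN_MAX_TOKENS}"
        )
    return {
        "model": model,
        "max_tokens": max_tokens,
        "thinking": {"type": "adaptive"},  # assumed field value, see above
        "effort": effort,
    }


params = build_request_params("claude-sonnet-4-6")
```

Raising at construction time keeps the failure mode visible; a silently truncated chain of thought is much harder to debug than a config error.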
Set effort explicitly. The effort parameter controls how much the model explores before committing. Sonnet 4.6 defaults to high effort. If you're migrating from Sonnet 4.5 and not setting effort, you'll see higher latency and may notice the model overthinking. Start with effort set to medium and adjust from there.
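One way to make that explicit is a per-role effort map, so no agent silently inherits the high default. The roles and tiers below are illustrative, not prescribed values.

```python
# Sketch: effort as the quality-latency lever, set explicitly per agent role.
EFFORT_BY_ROLE = {
    "root_cause_analysis": "high",    # hardest correlation and causal work
    "investigation_step": "medium",   # the recommended starting point
    "log_fetch_subagent": "low",      # routine evidence collection
}


def effort_for(role: str) -> str:
    # Unknown roles fall back to medium, the starting point suggested above,
    # rather than the model's high default.
    return EFFORT_BY_ROLE.get(role, "medium")
```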
Write precise tool descriptions. The 4.6 models select tools based on what they say they do, not just surrounding context. We found that precision in tool names and parameter descriptions directly impacts tool selection accuracy.
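As an illustration, compare a vague definition with one the model can actually select on. The tool names, fields, and services here are hypothetical, written in the Anthropic tool-use JSON shape (`name`, `description`, `input_schema`).

```python
# Too vague: the model has no basis for choosing this over any other tool.
vague_tool = {
    "name": "get_data",
    "description": "Gets data.",
    "input_schema": {
        "type": "object",
        "properties": {"q": {"type": "string"}},
    },
}

# Precise: says what it returns, when to use it, and when not to.
precise_tool = {
    "name": "query_error_rate_metrics",
    "description": (
        "Query per-service HTTP error-rate time series from the metrics store. "
        "Returns the 5xx rate in requests/sec over the given window. "
        "Use for correlating error spikes with deploys; not for fetching raw logs."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "service": {
                "type": "string",
                "description": "Service name, e.g. 'checkout'",
            },
            "window_minutes": {
                "type": "integer",
                "description": "Lookback window in minutes",
            },
        },
        "required": ["service", "window_minutes"],
    },
}
```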
The model is more proactive, so tune your prompts accordingly. Instructions like "be thorough" or "think carefully", which were common workarounds for Sonnet 4.5, amplify the model's already-proactive behavior on 4.6 and can cause overthinking loops. The effort parameter is a better lever for controlling depth.
With Sonnet 4.6 we observed ~10% improvement over Opus 4.5 with thinking disabled, and ~20% with high thinking on our investigation eval suite. Against Sonnet 4.5, the jump is even larger.
The tradeoff is latency: the extra reasoning that drives these gains costs wall-clock time, which is why effort and max_tokens are worth tuning per task rather than set once globally.
Every new model generation shifts what's possible for AI agents in production. Capabilities like adaptive thinking don't just improve results, they open up new architectural patterns we hadn't considered before. Evaluating these frontier models is a continuous effort and part of how we build. We're actively researching how these advances reshape agent design.
This work sits at the intersection of frontier AI research and real-world systems engineering. If these are the problems you want to work on, we're hiring. And if you're building agents for production or thinking about how frontier models fit into your engineering workflows, we'd love to talk.

