
We are used to the excitement when frontier models get better at generative tasks. We speak fluently about tokens, context windows, and benchmark scores at every release.
On the flip side, we are also used to disappointment. Point AI at a real enterprise task and the right answer does not magically appear. We blame the model or wait for the next one, but we don't ask what is missing. Instead, we mandate AI usage and measure progress in tokens.
A model is open-ended and can produce a thousand coherent answers, but most enterprise work has exactly one correct answer. Channeling open-ended capability toward that one right answer is the unsolved problem in enterprise AI. Closing that gap takes two things. One is the architecture that turns raw capability into the right answer. The other is the platform that delivers that architecture to a real enterprise, which is a full story on its own. This piece is about the architecture.
Production incidents are the hardest litmus test of whether that architecture works. To reach the root cause, AI has to navigate enormous and messy context, reason across many hops, stay inside strict guardrails, and converge on the one correct answer, all while revenue drains and the clock runs. It is the single-right-answer problem under the most demanding conditions for solving it.
A bigger model will not close it. A model has no access to your production state and no definition of what correct means in your environment. These gaps persist at any model size or quality, because they are properties of producing a real outcome inside a specific system. The right architecture, in conjunction with the right models, is the answer. It involves:
Model orchestration: No single model is best at everything. One model reasons best over causal chains, another is stronger at vision, and another writes the cleanest query against a specific data model. The art is decomposing a task, routing each piece to the model best suited to it, and then recomposing the results into one answer. Done well, this is also where cost is controlled. Running a frontier model on every step is how budgets balloon. Matching each step to the cheapest model that can do it well is what keeps quality and value aligned. And this mapping changes every few weeks. Every model release forces the orchestration to shift so that it keeps pushing the state of the art in accuracy.
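To make "the cheapest model that can do it well" concrete, here is a minimal sketch of a router. The model names, costs, and skill scores are all hypothetical placeholders, and a real system would derive them from its own evals rather than hard-code them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    name: str                  # hypothetical model name
    cost_per_call: float       # hypothetical relative cost
    skills: dict[str, float]   # hypothetical per-skill quality score in [0, 1]


def route(skill: str, min_quality: float, models: list[ModelProfile]) -> ModelProfile:
    """Return the cheapest model whose score for this skill clears the bar."""
    capable = [m for m in models if m.skills.get(skill, 0.0) >= min_quality]
    if not capable:
        raise LookupError(f"no model meets {min_quality} for {skill!r}")
    return min(capable, key=lambda m: m.cost_per_call)


# Illustrative numbers only.
MODELS = [
    ModelProfile("frontier-large", 10.0,
                 {"causal_reasoning": 0.95, "vision": 0.90, "query_writing": 0.90}),
    ModelProfile("mid-vision", 3.0, {"vision": 0.92, "query_writing": 0.60}),
    ModelProfile("small-sql", 1.0, {"query_writing": 0.93}),
]

# A decomposed task: each step names the skill it needs and the quality bar.
plan = [("causal_reasoning", 0.9), ("vision", 0.9), ("query_writing", 0.9)]
assignment = {skill: route(skill, bar, MODELS).name for skill, bar in plan}
print(assignment)
```

In this sketch only the causal-reasoning step lands on the most expensive model. Because the table is data, updating it after a model release changes the routing without touching the code.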
Context engineering: A model does not know that Service A calls Service B, that B shipped twenty minutes ago, or who owns the config that moved. The hard part is engineering what enters the model's context at the moment it reasons. Give it too little and it produces a confident answer about an environment it has never seen. Give it too much and the signal drowns and quality collapses. Deciding what to retrieve, what to compress, and what to leave out, against a production system that changes under you, is the work.
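One simple way to picture the too-little versus too-much tradeoff is a token budget. The sketch below uses a greedy relevance-per-token heuristic, which is only one of many possible selection strategies. The relevance scores, token counts, and item texts are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextItem:
    text: str
    relevance: float   # hypothetical score from a retriever, higher is better
    tokens: int        # hypothetical token count, assumed to be > 0


def select_context(items: list[ContextItem], budget_tokens: int) -> list[ContextItem]:
    """Greedily keep the items with the best relevance per token that fit the budget."""
    ranked = sorted(items, key=lambda i: i.relevance / i.tokens, reverse=True)
    chosen, used = [], 0
    for item in ranked:
        if used + item.tokens <= budget_tokens:
            chosen.append(item)
            used += item.tokens
    return chosen


# Illustrative candidates only.
candidates = [
    ContextItem("Service A calls Service B", 0.90, 10),
    ContextItem("Service B shipped twenty minutes ago", 0.95, 12),
    ContextItem("Owner of the changed config (hypothetical team)", 0.60, 8),
    ContextItem("A week of debug logs from an unrelated service", 0.10, 5000),
]

selected = [item.text for item in select_context(candidates, budget_tokens=40)]
print(selected)
```

The three short, relevant facts fit the budget and the bulk log dump is left out. Compression, the third option in the paragraph above, is not modeled here.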
Causal reasoning: A single model investigating an incident commits to one path early and gets worse as its context fills with detail from domains it was never strong in. We run a team instead. A lead frames the investigation, specialized agents pursue competing hypotheses in parallel, and a verifier checks each finding against production evidence rather than against how plausible it sounds. That last point is the whole game. A model is rewarded for answers that read well. Production only rewards the one that is true, and the only way to tell them apart is to test the hypothesis against the actual system. This is the layer where most of the intelligence lives, and most of it sits outside the model.
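The difference between "reads well" and "is true" can be sketched as a verifier that ignores plausibility and runs each hypothesis against evidence. This is an illustration under assumed data and not a description of any particular product's implementation. The evidence fields, scores, and claims are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Hypothesis:
    claim: str
    plausibility: float              # hypothetical "how well it reads" score
    check: Callable[[dict], bool]    # test against production evidence


def verify(hypotheses: list[Hypothesis], evidence: dict) -> list[Hypothesis]:
    """Keep only hypotheses whose check passes against the evidence."""
    return [h for h in hypotheses if h.check(evidence)]


# Hypothetical snapshot of production evidence.
evidence = {
    "last_deploy_minutes_ago": 20,
    "error_rate_rose_minutes_ago": 18,
    "db_cpu_percent": 35,
}

hypotheses = [
    Hypothesis("Database is saturated", 0.9,
               lambda e: e["db_cpu_percent"] > 90),
    Hypothesis("Recent deploy introduced the errors", 0.6,
               lambda e: e["error_rate_rose_minutes_ago"] <= e["last_deploy_minutes_ago"]),
]

most_plausible = max(hypotheses, key=lambda h: h.plausibility).claim
confirmed = [h.claim for h in verify(hypotheses, evidence)]
print(most_plausible, confirmed)
```

Here the most plausible-sounding claim fails its check while the less polished one survives, which is the distinction the verifier exists to draw.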
Governed actions: Knowing the fix and being allowed to apply it are different things. Reading from your systems is one risk class. Acting on them is another: silencing an alert, reverting a commit, opening a PR, or running a workflow each requires a different level of trust. The architecture decides what runs autonomously, what waits for a human, and what is never allowed, and it enforces that as a property of the system rather than as an instruction you hope the model honors.
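Enforcing this "as a property of the system" means the check lives in the execution path and not in the prompt. A minimal sketch, assuming a hypothetical policy table and action names, with unknown actions denied by default:

```python
from enum import Enum
from typing import Callable, Optional


class Decision(Enum):
    AUTO = "runs autonomously"
    APPROVAL = "waits for a human"
    DENY = "never allowed"


# Hypothetical policy table; a real one would be owned by the organization.
POLICY = {
    "read_logs": Decision.AUTO,
    "silence_alert": Decision.APPROVAL,
    "revert_commit": Decision.APPROVAL,
    "open_pr": Decision.APPROVAL,
    "delete_production_data": Decision.DENY,
}


class ActionBlocked(Exception):
    """Raised when policy forbids an action or approval is missing."""


def execute(action: str, run: Callable[[], object], approved_by: Optional[str] = None):
    """Run the action only if policy allows it; unknown actions are denied."""
    decision = POLICY.get(action, Decision.DENY)
    if decision is Decision.DENY:
        raise ActionBlocked(f"{action} is never allowed")
    if decision is Decision.APPROVAL and approved_by is None:
        raise ActionBlocked(f"{action} needs human approval")
    return run()


print(execute("read_logs", lambda: "log lines"))
print(execute("revert_commit", lambda: "reverted", approved_by="on-call engineer"))
```

Whatever the model proposes, the action callable never runs unless `execute` lets it through, so a persuasive prompt cannot widen the permissions.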
Continual learning: A raw model is frozen at training time. Your environment is not. The engineer who corrects an investigation today is teaching the system something the model never saw, and that correction has to become a skill any agent can retrieve next time. The hard problem is doing this without the system drifting, learning the wrong lesson from a one-off and applying it everywhere. Done right, the platform that handles your hundredth incident is materially better at your environment than the one that handled your first.
Domain evals: The entire architecture starts and ends at evals. Model benchmarks say nothing about whether you get the right root cause in your systems. The work is building an eval framework that mirrors how your own engineers investigate, so that every time you swap a model or change the architecture, you can prove it got better for you rather than worse.
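A domain eval can start as small as a set of past incidents with known root causes and a gate that compares a candidate pipeline to the current one. The sketch below uses exact-match scoring, which is a simplification. The cases and the two stub investigators are hypothetical stand-ins for real pipelines.

```python
from typing import Callable

# Hypothetical labeled incidents; real cases would come from your own postmortems.
CASES = [
    {"symptoms": "checkout timeouts after deploy", "root_cause": "bad_deploy"},
    {"symptoms": "disk full on db host", "root_cause": "disk_exhaustion"},
    {"symptoms": "expired TLS cert on gateway", "root_cause": "cert_expiry"},
    {"symptoms": "timeouts after config change", "root_cause": "config_change"},
]


def root_cause_accuracy(investigate: Callable[[str], str], cases: list[dict]) -> float:
    """Fraction of cases where the investigator names the labeled root cause."""
    hits = sum(1 for c in cases if investigate(c["symptoms"]) == c["root_cause"])
    return hits / len(cases)


def baseline(symptoms: str) -> str:
    # Stub for the current pipeline: blames the last deploy for any timeout.
    return "bad_deploy" if "timeouts" in symptoms else "unknown"


def candidate(symptoms: str) -> str:
    # Stub for a changed pipeline: keyword lookup, purely illustrative.
    for keyword, cause in [("deploy", "bad_deploy"), ("disk", "disk_exhaustion"),
                           ("cert", "cert_expiry"), ("config", "config_change")]:
        if keyword in symptoms:
            return cause
    return "unknown"


baseline_score = root_cause_accuracy(baseline, CASES)
candidate_score = root_cause_accuracy(candidate, CASES)
ship_it = candidate_score >= baseline_score   # gate: never ship a regression
print(baseline_score, candidate_score, ship_it)
```

The same gate applies whether the change is a model swap or an architectural one, which is what lets you prove a change helped in your environment.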
These six layers produce the right answer. Delivering it to a real enterprise is the second effort, and it runs as deep as the first: integrations across your stack, security on every action an agent takes, cost controls, audit, and deployment that clears a procurement review. The architecture is fundamentally a research problem. Delivering it completely is an engineering one.
The flywheel comes from how these layers work with each other. Better context makes the reasoning sharper. Better reasoning makes the actions safer. Evals make all of it improvable. A better model improves one input to this system and leaves the other five exactly where they were.
For three years the industry measured one thing: the potential of AI. Higher benchmark scores, more parameters, longer context windows, what the model could conceivably do. The demos sold possibilities, and we optimized the ceiling of what was achievable and called it progress.
That conversation is finally turning. The questions now are about outcomes, ROI, and workflows, three ways of asking what the model actually got done rather than what it could do. It is the right turn, and it exposes what is missing: a token is not an outcome.


Spiros Xanthos
Founder and CEO
Spiros is the Founder and CEO of Resolve AI. He loves learning from customers and building. He helped create OpenTelemetry and started Log Insight (acquired by VMware) and Omnition (acquired by Splunk). Most recently, he was an SVP and the GM of the Observability business at Splunk.
