Technology

You Can Try to Build an AI SRE. But Should You?

08/25/2025
11 min read

It’s 2:13 a.m. The alert fires, again. Your on-call engineer stumbles into Slack, paging through logs while dashboards reload. The root cause isn’t obvious. In this groggy state, engineers fall back on instinct, remembering a similar incident, jumping to a fix, hoping it works. This fight-or-flight response kicks in before real reasoning begins.

This is where agentic AI changes the game. Purpose-built to reason across signals, take action, and learn from every outcome, these systems promise to transform how teams respond to incidents, debug in production, and build software with context already embedded in the workflow. With more than 65% of developer time spent on non-coding work, incidents costing millions annually, and on-call engineers often needing 6–9 months to become productive, the case for an AI SRE is no longer theoretical. It is urgent.

In this blog, we’ll explore:

  • Why in-house AI SRE prototypes rarely move beyond limited “strawman” builds
  • The real requirements of a production-ready multi-agent system
  • The costs, expertise, and tradeoffs leaders must weigh in build-versus-buy decisions
  • What it looks like when agentic systems actually work in production

The 70% Prototype Illusion: Why Even 5% Is Hard to Reach

Most internal attempts at AI builds start as one of three patterns:

  1. Incident RAG bots: databases of past incidents connected to a basic retrieval system, able to summarize ongoing incidents or offer “last time this happened” tips (sketched in code below).
  2. Single-model MCP setups: thinking models like Claude or GPT hooked up to a handful of MCP servers, with a custom prompt to chain tool calls.
  3. Hybrid approaches: combinations that layer a retrieval database on top of a single-model orchestration setup, multiplying fragility without addressing deeper challenges of reasoning, scale, and reliability in production.
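
To make the first pattern concrete, here is a minimal sketch of an incident RAG bot. The `embed` and `complete` helpers are hypothetical stand-ins for whatever embedding and LLM APIs a team already uses; every name here is illustrative, not a reference implementation.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Incident:
    title: str
    postmortem: str
    vector: np.ndarray  # precomputed embedding of the postmortem


def embed(text: str) -> np.ndarray:
    """Hypothetical stand-in for your embedding API of choice."""
    raise NotImplementedError


def complete(prompt: str) -> str:
    """Hypothetical stand-in for an LLM completion call."""
    raise NotImplementedError


def last_time_this_happened(alert_text: str, history: list[Incident], k: int = 3) -> str:
    """Retrieve the k most similar past incidents and summarize them."""
    query = embed(alert_text)
    # Cosine similarity against every stored postmortem embedding.
    scored = [
        (float(np.dot(query, inc.vector)) /
         (np.linalg.norm(query) * np.linalg.norm(inc.vector)), inc)
        for inc in history
    ]
    top = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    context = "\n\n".join(f"{inc.title}:\n{inc.postmortem}" for _, inc in top)
    return complete(
        f"An alert just fired:\n{alert_text}\n\n"
        f"Similar past incidents:\n{context}\n\n"
        "Summarize what happened last time and suggest next steps."
    )
```

The brittleness is built in: if the incident is genuinely novel, the nearest neighbors are noise, and the summary is confidently wrong.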

These approaches can be helpful in small, controlled settings, but they fall apart in real production. Getting even 30% of the way to a real multi-agent AI SRE is harder than it looks. The reality is even more dire: according to MIT’s MLQ State of AI in Business 2025 report, only 5% of custom enterprise AI tools ever make it into production. ¹

The reason is simple: real failures are often novel, not repeats of the past, unlike the noisy alerts engineers quickly learn to ignore. General-purpose models can pattern match against what they have seen before, but when infrastructure-specific or proprietary failure modes emerge, pretraining provides no advantage. By contrast, a systematic multi-agent design is built to reason through new, unexpected scenarios, not just replay history.

Why? Production environments are not static. Systems change daily. An agent must understand telemetry, service dependencies, infrastructure layers, and code-level changes, and keep up as those evolve. It also needs deep reasoning across infrastructure issues, third-party API failures, internal bugs, and flaky deploys. Doing any of these well is hard. Doing all of them well, consistently, is extremely difficult.

And reality hits fast. A new frontier model version ships, and your agent behaves differently. Telemetry pipelines shift formats, causing hallucinations. A prototype trained on one non-prod cluster collapses under the heterogeneity of production data. Imagine your AI confidently declaring that a 99.9% uptime service is down because it misread a deployment marker as an error spike. Engineers lose trust, and feedback loops shut down.

As Gabor Angeli, AI Research Engineer at Resolve AI and former Google DeepMind researcher, put it: “You write something, run it a few times, it works. Then someone else tries it, and everything breaks.” ² The challenge is not just technical. It is systemic. Building agentic systems that learn continuously, interpret shifting signals, and make accurate tool calls across a sprawling live stack requires AI expertise that most organizations simply do not have. And if you do, is this really where you want that talent focused? That is the crux: most internal builds stall at brittle prototypes, while production-ready systems require a very different architecture.

What a production-ready multi-agent system actually requires

True AI SREs do more than triage incidents. They answer questions, provide historical and architectural context, and act within guardrails to help engineers build and operate better software.

After an incident, an engineer might ask: “What’s the health of our payments system right now?” The system should retrieve relevant telemetry, connect it to active alerts and recent pull requests, flag past regressions, and summarize it in context. A developer may notice a subtle latency shift and want to vibe debug, just like they vibe code, before it becomes an outage. A platform engineer may review a rollout across distributed services and assess failure impact without chasing logs for hours.

To deliver this, a multi-agent system must coordinate four capabilities:

  • Knowledge: Builds complete production context from distributed infra, telemetry, observability, code, and documentation
  • Reasoning: Formulates plans, tests hypotheses, surfaces root cause with evidence, and explains how it derived its outcomes
  • Action: Uses production tools the way humans do, both to fetch evidence and to propose or execute changes, such as generating GitHub pull requests or adjusting service configurations
  • Learning: Builds memories to continuously improve by observing interactions, decisions, outcomes, and direct feedback

These agents must also hand off structured outputs to one another consistently, which is far harder than it sounds. Without disciplined orchestration, reasoning collapses into noise.
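
As an illustration of what structured handoffs can look like, here is a minimal sketch using Python dataclasses. The `Finding`/`Hypothesis` split, the field names, and the threshold are assumptions for illustration, not a prescribed schema.

```python
from dataclasses import dataclass, field


@dataclass
class Finding:
    """One piece of evidence an agent hands to the next agent."""
    source: str        # e.g. "metrics", "deploy-log", "code-diff"
    summary: str       # what was observed
    confidence: float  # 0.0-1.0, so downstream agents can weigh evidence


@dataclass
class Hypothesis:
    """A candidate root cause plus the evidence supporting it."""
    statement: str
    supporting: list[Finding] = field(default_factory=list)

    def is_actionable(self, threshold: float = 0.7) -> bool:
        # An action agent should only act once evidence clears a bar;
        # otherwise it hands back to the reasoning agent to keep digging.
        return any(f.confidence >= threshold for f in self.supporting)
```

The point is the discipline, not the schema: when every agent consumes and produces typed artifacts, a reasoning step can be audited, replayed, and regression-tested instead of living as an opaque blob of prose inside a prompt.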

The true cost of building agentic AI in-house

Building agentic AI that operates reliably in live environments requires far more than APIs and prompts. It demands an integrated system that mirrors how seasoned engineers think, act, and learn, with robust guardrails.

Recent research underscores the difficulty:

  • McKinsey estimates trillions of dollars in AI potential, yet finds that only a fraction of enterprises are operationalizing it at scale ³ ⁴
  • MIT Technology Review reports that aligning LLMs often requires adversarial prompts during training to expose failure modes, reinforcing how fragile and unpredictable these systems can be in real-world environments ⁵
  • MLQ’s State of AI in Business 2025 report found that 95% of custom enterprise AI initiatives fail to reach production, underscoring how wide the gap is between prototypes and deployed systems ¹
  • Gartner predicts that by 2027, 40% of agentic AI initiatives will be abandoned or re-architected due to performance issues ⁶

Getting this right means strategically investing in three critical areas without compromises:

  • Foundation Layer:
    • Structured, systems-aware knowledge grounding that unifies documentation, telemetry, service dependencies, and runbooks into machine-readable substrates
    • Agent-centric CI/CD infrastructure with regression testing, golden datasets, and episodic evaluation (a minimal sketch follows this list)
  • Intelligence Layer:
    • Post-training tuning loops, including fine-tuning, reward model alignment, and implicit learning from human feedback
    • Production-quality orchestration with memory modules, multi-agent routing, and fallback policies for degraded contexts
  • Trust and Safety Layer:
    • Instrumentation for judgment, enabling agents to surface uncertainty, missing data, or unexplainable steps
    • Ongoing model supervision that adapts to changes in model APIs, workflows, and system topology
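
To ground the golden-dataset bullet above, a regression harness for agent behavior might look like the sketch below, assuming a hypothetical `run_agent` entry point into the agent under test. The substring-match scoring is deliberately naive; real harnesses would use judged or rubric-based comparison, since root causes rarely match verbatim.

```python
from dataclasses import dataclass


@dataclass
class GoldenEpisode:
    """A replayable incident with a known, human-verified root cause."""
    alert: str
    telemetry_snapshot: dict
    expected_root_cause: str


def run_agent(alert: str, telemetry: dict) -> str:
    """Hypothetical entry point into the agent under test."""
    raise NotImplementedError


def regression_suite(episodes: list[GoldenEpisode], min_pass_rate: float = 0.9) -> bool:
    """Gate any model, prompt, or orchestration change on past episodes."""
    passed = 0
    for ep in episodes:
        verdict = run_agent(ep.alert, ep.telemetry_snapshot)
        # Naive scoring for illustration only.
        if ep.expected_root_cause.lower() in verdict.lower():
            passed += 1
    rate = passed / len(episodes)
    print(f"pass rate: {rate:.0%} across {len(episodes)} episodes")
    return rate >= min_pass_rate
```

Run on every model upgrade, prompt change, or new tool integration, a gate like this turns “a new frontier model version ships, and your agent behaves differently” from a production surprise into a failed CI check.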

Even simple internal tools accrue hidden costs. Teams that replace vendor solutions with custom builds often see early wins but soon need new test infrastructure, CI for behavior, and constant instrumentation just to maintain trust. Multi-agent AI systems magnify this challenge, since they must reason over live telemetry and system state without breaking under change.

And the cost is not just technical. Even a modest efficiency gain of 2–3% in engineering productivity translates into millions of dollars in value for most large organizations. Every week your best engineers spend on firefighting incidents, debugging brittle AI prototypes, or maintaining infrastructure is a week not spent shipping business-critical innovation. That is why this decision is as much about financial leverage as technical feasibility.

When it actually works: The upside

When executed well, the impact is transformative. Consider an AI SRE that discovers a creeping performance regression before it becomes an outage, tracing the issue to a misconfigured rate limit. Or a deployment failure traced in minutes to a Dockerfile change through agentic reasoning that built a causal chain across logs and build metadata.

As Mayank Agarwal, CTO and co-creator of OpenTelemetry, puts it: “It’s not about saving ten engineer-hours a week. It’s about ensuring your best engineers aren’t the only ones who can untangle your system at 2 a.m.”

Developers use these systems to explore architecture, plan sprints, and assess deployment risks. SREs rely on them to correlate telemetry and suggest remediations grounded in historical outcomes. These are not chatbot demos. They are production-grade engineering workflows, reimagined.

The strategic decision framework

For engineering leaders navigating build-versus-buy, success starts with asking the right questions:

  1. Strategic Value Assessment
    Is this agentic capability core to your product or IP? Will building it generate strategic differentiation, or are you recreating capabilities others already offer? If it is not core, you are diverting scarce AI expertise toward work that will never set you apart.
  2. Capability & Resource Reality Check
    Do you have the AI and domain expertise to maintain and evolve it? Building is not a one-time investment. It requires ongoing tuning, testing, and learning systems that adapt with your stack. Without a deep bench of AI and domain talent, you risk ending up with brittle systems that collapse in production.
  3. Opportunity Cost Analysis
    Will this effort accelerate or delay your broader roadmap? Internal projects often redirect senior engineers from high-leverage initiatives. The opportunity cost is not just headcount; it is the innovation you delay by turning your best engineers into platform maintainers.
  4. Value Optimization Strategy
    Can you achieve 80% of the value faster through partnership? Buying foundational capabilities while customizing at the edge often delivers the best of both worlds, giving you speed without losing control of differentiation.

As Kent Wills of Yelp succinctly put it: “If you're building for parity, not strategic advantage, you've chosen a very expensive form of experimentation.”

Making the Call

Agentic AI is not just another feature. It is an architectural transformation in how software is designed, operated, and improved. Like any foundational shift, the question is not whether you can build it; it is whether you can build it well. Most teams would not build a code-generation model from scratch because such models are commoditized. In contrast, a fully functional multi-agent system for SRE and engineering is defined by scale, complexity, and domain depth.

The reality is simple: DIY attempts at AI SRE break under production change, they consume scarce AI and domain talent, and they impose massive opportunity costs by pulling your best engineers away from work that grows the business. The teams that succeed will match their approach to reality, building where it creates a durable advantage, buying where it accelerates time-to-value, and always keeping engineers focused on what matters most.

Resources

  1. MLQ: State of AI in Business 2025
  2. Building AI to Manage Production Systems in SWE
  3. McKinsey: The economic potential of generative AI
  4. McKinsey: State of AI in 2025
  5. MIT Technology Review: Forcing LLMs to be evil to make them safer
  6. Gartner: Agentic AI project failure predictions by 2027

Manveer Sahota

Product Marketing Manager

Manveer is a product marketer at Resolve AI who enjoys helping technology and business leaders make informed decisions through compelling and straightforward storytelling. Before joining Resolve AI, he led product marketing at Starburst and executive marketing at Databricks.

