This AI SRE buyer's guide compares the main AI approaches for SRE and the workflows where each one actually helps

Comparing different AI approaches for SRE workflows

SRE teams are adopting AI for alert triage, incident investigation, and postmortems. But not every approach works the same way. This guide compares general-purpose LLMs, tool-augmented models, AI-augmented SaaS tools, and multi-agent systems across the workflows that matter most.

Site reliability engineers are responsible for the availability, performance, and operability of production systems. In practice that covers a wide range of workflows: responding to on-call alerts, triaging and investigating incidents, debugging production issues, and producing operational reports like postmortems. Each of these involves different systems, different data, and often different teams. AI is increasingly being applied across all of them, but what it can actually do varies significantly depending on the approach. Understanding those differences is what this piece is about.

A typical day for an SRE involves a lot of context-switching. You might start the morning reviewing deployment pipelines, get pulled into a production incident by mid-morning, spend the afternoon chasing down a performance regression someone flagged in Slack, and end the day writing up a postmortem from an incident three days ago that still isn't fully documented. Somewhere in there, you're also responding to alerts, answering questions from application teams, and updating runbooks that haven't been touched in a year.

The challenge is that this work spans so many systems. An incident might involve checking Grafana for the latency spike, Loki for the corresponding errors, GitHub for what deployed recently, Kubernetes for whether any pods are restarting, and a six-month-old Slack thread where someone mentioned something similar happening before. Each of these is a separate tool, a separate context, a separate query language. The 45 minutes you spend on an incident isn't usually spent fixing the problem. It's spent reconstructing what's happening across all those systems before you can even form a hypothesis.

This is the core challenge AI is being applied to in SRE. Not all approaches work the same way, and understanding the differences matters a lot for figuring out where to invest. Below is a breakdown of the main approaches in play today, followed by a look at three workflows where SREs are actually using them.

The main AI approaches for SRE workflows

General-purpose LLMs

Tools like Claude, ChatGPT, and Gemini are where most SREs have started with AI, and for good reason. They're immediately useful for writing work: drafting postmortems, generating runbook templates, explaining what an obscure error message means, or thinking through a debugging strategy out loud. If you can describe your situation, they can help you work through it.

The limitation is obvious once you try to use them for anything involving live data. They know nothing about your systems unless you tell them. Pasting in a stack trace works. Asking them to check your production metrics doesn't.

LLMs with tool access

The next step is giving a model direct access to your systems. MCP (Model Context Protocol) has become the common pattern for this: you connect a model like Claude to tools like GitHub, PagerDuty, Datadog, Jira, or Slack, and it can query those systems, look up context, and take actions on your behalf within a conversation. An engineer can ask "what changed in the payment service in the last 24 hours" and get an actual answer from your real data.

This is genuinely more useful for SRE work. For an engineer who knows what questions to ask, it significantly reduces the tool-switching overhead of an investigation. The gap is that these setups don't drive an investigation themselves. They help you get answers faster. You still have to know what to look for, form the hypotheses, and decide when you have enough evidence to act.
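To make the tool-augmented pattern concrete, here is a deliberately simplified sketch of the loop it describes: the model picks a tool, the harness executes it, and the result comes back as the answer. This is not the real MCP SDK or any vendor's API; the tool names, stub data, and the keyword-matching "model" are all hypothetical stand-ins for illustration.

```python
# Toy sketch of a tool-augmented model loop (hypothetical tools and data).
# In a real MCP setup, a model like Claude selects and invokes these tools;
# here a naive keyword dispatcher stands in for that selection step.

def answer_with_tools(question, tools):
    """Stand-in for the model's tool-selection step."""
    if "changed" in question or "deploy" in question:
        return tools["github_recent_commits"]("payment-service", hours=24)
    if "alert" in question:
        return tools["pagerduty_open_incidents"]()
    return "No matching tool; answer from general knowledge."

# Stub tools standing in for connected systems (GitHub, PagerDuty, ...).
tools = {
    "github_recent_commits": lambda svc, hours: [
        {"sha": "ab12cd3", "msg": f"bump {svc} connection pool size"}
    ],
    "pagerduty_open_incidents": lambda: [
        {"id": "P-1", "title": "checkout p99 latency"}
    ],
}

result = answer_with_tools(
    "what changed in the payment service in the last 24 hours", tools
)
```

The point of the sketch is the division of labor: the tools fetch real data, but the engineer still supplies the question. Nothing here decides what to ask next.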

AI-augmented SaaS tools

Rather than starting from a general-purpose model, observability and incident management vendors have taken the other path: embedding AI directly into their products. Datadog's Bits AI can surface anomalies, explain probable root causes, and help you write queries in natural language. New Relic has similar capabilities within its own observability context. Microsoft has released an Azure SRE Agent that works within Azure Monitor and the broader Azure stack. AWS has announced analogous capabilities for its own ecosystem.

The appeal here is integration. These tools work with data they already have, they fit into workflows engineers already use, and they don't require additional setup. The flip side is the boundary of that data: Datadog's Bits AI SRE can be excellent when the relevant signals are already in Datadog, but it may not see what happened in your code, deployment pipeline, or infrastructure if that data never made it there. That's a reasonable tradeoff for many teams, and one worth understanding before you evaluate what the tool can do.

AI agents and multi-agent systems

The most capable approach today involves AI that can pursue an investigation autonomously across multiple systems rather than answering individual questions. Resolve AI, Incident.io, and Rootly all sit somewhere in this category, though they do different things.

One approach is to focus primarily on the coordination layer: automating status updates, keeping stakeholders informed, managing the incident timeline, and generating postmortem drafts from the incident record. Tools that do this well reduce the overhead of running an incident and free engineers from the logistics so they can focus on the technical work.

Resolve AI is built more specifically for the investigation itself. Specialized agents for code, infrastructure, metrics, logs, and change history run in parallel during an investigation, coordinated by an orchestrator that manages hypothesis formation and synthesizes findings across all of them. The multi-agent architecture is designed to handle what a single agent struggles with: covering multiple domains at depth simultaneously, rather than spreading thin across all of them. The tradeoff is that these systems are the most complex to set up and evaluate, and genuinely novel failure modes still benefit from experienced engineers working alongside them.

Where AI helps in SRE workflows

Alert triage during on-call

On-call shifts involve a constant stream of alerts, most of which don't require action. Evaluating each one, figuring out whether it's real, whether it's correlated with something else, and whether it warrants waking someone up is exhausting work even when nothing is actually wrong.

What good alert triage with AI looks like: every alert arrives with context already assembled. What's affected, how it compares to historical baselines, whether it correlates with other alerts firing at the same time or a recent deployment, and what similar alerts have typically meant in the past. The on-call engineer reads the context, makes a judgment call, and moves on.

AI-augmented tools like Datadog Bits AI and PagerDuty's AIOps already do meaningful work here. Alert correlation, noise suppression, and prioritization based on historical patterns are all areas where these tools have gotten genuinely good. For teams with mature observability setups, enabling these features is probably the highest-value, lowest-effort AI investment available today.

Where they fall short is in the cross-system context. An alert that looks routine might be significant in combination with a deployment from 90 minutes ago or an error spike in a service two hops away. Pulling that context together still requires someone to go looking for it. Agent-based systems can surface that cross-system context automatically at alert time, which changes the starting point for triage from a single signal to something closer to a situation report.
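The enrichment step described above can be sketched in a few lines: gather the deploys and co-firing alerts near an alert's timestamp and attach them before anyone gets paged. This is an illustrative sketch, not any vendor's implementation; the field names, time windows, and sample data are all assumptions chosen for the example.

```python
from datetime import datetime, timedelta

def enrich_alert(alert, deploys, co_firing,
                 deploy_window_min=90, alert_window_min=5):
    """Bundle cross-system context with an alert before paging anyone.

    Correlates the alert with deploys in the last ~90 minutes and with
    other alerts firing within a few minutes of it.
    """
    dwin = timedelta(minutes=deploy_window_min)
    awin = timedelta(minutes=alert_window_min)
    return {
        **alert,
        "recent_deploys": [
            d for d in deploys if abs(alert["at"] - d["at"]) <= dwin
        ],
        "correlated_alerts": [
            a for a in co_firing
            if a["id"] != alert["id"] and abs(alert["at"] - a["at"]) <= awin
        ],
    }

# Made-up sample data: a latency alert, one deploy 85 minutes earlier,
# and a second alert that fired two minutes before this one.
now = datetime(2025, 1, 15, 10, 30)
alert = {"id": "A-1", "name": "checkout p99 latency", "at": now}
deploys = [{"service": "payment-service", "at": now - timedelta(minutes=85)}]
co_firing = [{"id": "A-2", "name": "payment 5xx rate", "at": now - timedelta(minutes=2)}]

context = enrich_alert(alert, deploys, co_firing)
```

Even this toy version changes what the on-call engineer sees first: a deploy and a correlated alert arrive alongside the signal, instead of being something to go dig up.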

Incident investigation

Investigation is where most of the time in SRE work actually goes, and where the biggest opportunity for AI is. The work that consumes hours during a major incident (gathering evidence across systems, forming and revising hypotheses, ruling out alternative explanations) is exactly the kind of systematic multi-step reasoning that AI agents are built to do.

What good AI-assisted investigation looks like: by the time an engineer opens the incident, there's already a working theory with supporting evidence. Here's what happened, here's what probably caused it, here's why the other obvious explanations don't fit. The engineer's job is to review the reasoning, challenge what doesn't look right, and decide what action to take.

General-purpose LLMs can help engineers think through hypotheses or interpret data they've already gathered. MCP-connected setups let engineers pull cross-system context more efficiently. Both are useful, but neither investigates on its own. AI-augmented observability tools can surface likely root causes within their data, which is valuable up to the point where the root cause involves something outside that data.

Agent-based systems are closest to the ideal here. Resolve AI begins a parallel investigation automatically when an incident is opened, gathering evidence from code, infrastructure, and observability simultaneously and synthesizing it into ranked findings with explicit confidence levels. This doesn't replace the engineer's judgment, but it changes what the judgment is applied to: evaluating evidence rather than gathering it.
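The parallel pattern described above can be illustrated with a minimal asyncio sketch: specialist agents run concurrently, and an orchestrator ranks their combined findings by confidence. This is a toy model of the architecture, not Resolve AI's actual implementation; the agents, findings, and confidence scores are canned examples.

```python
import asyncio

# Each "agent" stands in for a specialist that knows one domain deeply.
# Real agents would query live systems; these return canned findings.

async def metrics_agent(incident):
    return [{"source": "metrics",
             "finding": "checkout p99 latency up 4x", "confidence": 0.9}]

async def changes_agent(incident):
    return [{"source": "changes",
             "finding": "payment-service deployed 85 min before alert",
             "confidence": 0.8}]

async def logs_agent(incident):
    return [{"source": "logs",
             "finding": "connection pool exhaustion errors", "confidence": 0.7}]

async def investigate(incident):
    """Orchestrator: run domain agents in parallel, then rank findings."""
    per_agent = await asyncio.gather(
        metrics_agent(incident), changes_agent(incident), logs_agent(incident)
    )
    findings = [f for agent_findings in per_agent for f in agent_findings]
    return sorted(findings, key=lambda f: f["confidence"], reverse=True)

report = asyncio.run(investigate({"id": "INC-42"}))
```

The structural point is the handoff: the engineer receives a ranked report with explicit confidence levels to challenge, rather than three empty query consoles.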

The honest caveat: investigations involving genuinely novel failure modes, unusual combinations of issues, or significant organizational context still benefit from an experienced engineer guiding the investigation rather than just reviewing it. The tooling is good and improving, but it isn't a replacement for people who know the systems.

Operational reporting and knowledge management

Postmortems are the part of SRE work that most often doesn't get done properly. By the time an incident is resolved, the team is tired and there's something else to deal with. The institutional knowledge that would make the next incident faster to resolve (the actual causal chain, the specific configuration state that mattered, the fix that worked) ends up either undocumented or buried in a Slack thread nobody will find.

General-purpose LLMs have already made a real dent here. Drafting a postmortem from a timeline you describe, structuring a runbook section, writing up an explanation for an unfamiliar failure mode: these are tasks where Claude, ChatGPT, and similar tools save meaningful time. This is probably where AI is most widely adopted in SRE workflows today, with good reason.

Tools with incident record integration go further by pulling from the actual timeline automatically, rather than depending on you to reconstruct it from memory. Postmortems generated this way are more complete and accurate than ones written after the fact.

The longer-term opportunity is that knowledge doesn't have to be captured in a separate step at all. Agent-based systems that sit in the investigation path build an evidence trail as the investigation runs. The causal chain, the hypotheses that were ruled out, the corrections engineers made along the way: all of it is captured as part of the investigation record rather than reconstructed afterward. Over time, that record feeds back into how the system investigates similar problems in the future. The knowledge that currently lives in senior engineers' heads, and leaves with them when they change teams, becomes part of the institutional record instead.

Why AI in SRE needs a full-stack approach

Looking at the approaches above, a pattern emerges. AI that works well in one part of SRE tends to fall short in another, and the reason is usually the same: it was built to solve one layer of the problem, not all of them together.

A useful way to think about this is in terms of what a complete AI system for SRE actually requires. At the model layer, general-purpose models trained on general data have limitations when applied to production systems. Production reasoning requires models that understand how infrastructure behaves, how observability data is structured, and what good investigation looks like. Post-trained or fine-tuned models purpose-built for production do better here.

Models alone aren't useful without agents that can apply them. Agents need to detect issues, form hypotheses, and pursue multi-step diagnosis across systems. They also need to work with each tool the way an expert would: knowing the right queries to run, interpreting results in context, and knowing where to look next based on what came back. And beyond analysis, agents need the ability to act: creating a ticket, running a remediation, executing a kubectl command, with appropriate guardrails given how costly mistakes in production can be.
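The guardrail idea in the last sentence can be made concrete with a small sketch: read-only diagnostics run automatically, while anything that mutates production requires explicit human approval. The command prefixes and policy here are hypothetical; a real system would use a proper policy engine and audit log.

```python
# Hypothetical guardrail wrapper for agent actions in production.
# Read-only kubectl commands run automatically; mutating commands are
# held until a human approves them.

READ_ONLY_PREFIXES = ("kubectl get", "kubectl describe", "kubectl logs")

def execute_action(command, approved=False):
    """Run a command only if it is read-only or explicitly approved."""
    if command.startswith(READ_ONLY_PREFIXES):
        return f"ran: {command}"
    if not approved:
        return f"blocked: '{command}' needs human approval"
    return f"ran (approved): {command}"

print(execute_action("kubectl get pods -n payments"))
print(execute_action("kubectl rollout restart deploy/payment-service"))
```

The asymmetry is the point: the agent can gather evidence freely, but the cost of a mistaken action in production justifies keeping a human in the loop for anything irreversible.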

Context is what makes all of the above useful in your specific environment. A system that reasons well about production in general but knows nothing about your services, your team's conventions, your past incidents, or your current architecture will still miss the things that matter most. Managing knowledge, learning from every investigation, and applying that learning in future incidents is what separates a system that gets better over time from one that stays at the same baseline.

Finally, the interface layer determines whether any of this actually reaches engineers during the workflows that matter. An agentic interface that connects to your code, infrastructure, telemetry, and knowledge and surfaces findings where engineers already work (Slack, a CLI, a web interface) is different from a standalone product that requires a separate context switch.

Most AI approaches in SRE today address one or two of these layers well. The cross-domain investigation problem that consumes the most engineering time requires all of these layers working together, which is why adding AI features to an existing tool, or connecting a model to a few data sources, often shows diminishing returns when applied to the hardest problems.

Putting it together

These approaches aren't alternatives to each other. Most SRE teams will use AI features in their existing observability tools, occasionally use general-purpose LLMs for writing work, and may adopt MCP-based setups to reduce tool-switching overhead. These are additive improvements that fit into existing workflows.

The question is what's still slow or painful despite having all of that. For most teams, the answer is cross-domain incident investigation and the institutional knowledge that never quite gets documented. Those problems require AI that participates in investigations rather than assisting alongside them, and that learns from your specific environment over time. That's a different category of tool with a different category of setup and evaluation burden. Understanding that difference is what makes it possible to make the right investment at the right time.

Get the “AI for prod” newsletter

Stay current on how the best engineering teams are using AI in production. Customer spotlights, product updates, how-tos, and more delivered monthly.