Software reliability in the age of AI-generated code
Software reliability is the probability a system runs without failure over time. Why AI-generated code strains it, and why keeping systems reliable now needs AI.
When software creation was deterministic, reliability was under control, or at least predictable. An engineer's mistake in X led to a reliability issue at Y. You shipped on a release cadence, watched a handful of services, and reasoned about failure with statistics that assumed the system would sit still long enough to study.
Most of that has stopped being true. Software is now created in such volume and at such speed that no single person can keep track of the whole system anymore. The change that takes down production might have shipped hours ago, from a service you don't own, written by an agent nobody on the team can fully account for.
The mistake in X still causes the failure at Y. But X and Y sit farther apart than they used to: the failure shows up in one service while the cause lives in another, so connecting them means hopping across services, infrastructure, and dependencies, often several layers deep. There are too many Xs now, and they change too fast, for anyone to trace cause to effect by hand.
When software gets generated faster than anyone can review it, reliability turns into a continuous, system-wide problem: catching failures as they happen and tracing symptoms back to a cause that might already be three deploys old. That's work AI is far better suited to than people are.
What software reliability measures
Software reliability is the probability that a system runs without failure for a given stretch of time, in a given environment. It's one of the quality characteristics defined in ISO/IEC 25010, the international standard for software quality, and it's been a measured discipline within software engineering for decades. The signals used to measure it are well established too: Google's SRE book distills them to four golden signals (latency, traffic, errors, and saturation), a set intended to catch most user-facing problems.
Reliability covers several distinct properties, and it's worth keeping them apart:
- Maturity. How often the system fails under normal conditions in the first place.
- Availability. Whether it's up and reachable the moment you need it.
- Fault tolerance. Whether it continues to work when a component beneath it fails.
- Recoverability. How quickly it gets back to a good state after something breaks.
Day-to-day, teams turn those properties into numbers they can track and set targets for. A handful do most of the work.
- Availability, usually called uptime. The fraction of time a service is usable, often measured as the share of well-formed requests that succeed. The industry tends to express it in "nines," where each additional nine means an order of magnitude less downtime. Google's SRE book lays out the common targets.
- MTBF and MTTF. Mean time between failures and mean time to failure, which both capture how long a system runs before it breaks (the first for things you repair, the second for things you replace).
- MTTR. Mean time to recovery is the average time to detect, diagnose, and restore service once an incident starts. Many teams split it further into time to detect, acknowledge, and resolve, since those are very different problems.
- Failure rate. How many failures occur per unit of time, or per million requests.
| Availability target | Roughly this much downtime a year |
|---|---|
| 99% (two nines) | About 3.65 days |
| 99.9% (three nines) | About 8.8 hours |
| 99.99% (four nines) | About 53 minutes |
| 99.999% (five nines) | About 5 minutes |
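The table's figures can be reproduced with a few lines of arithmetic. The sketch below assumes a 365-day year; the function name `allowed_downtime_seconds` is illustrative, not something defined by any standard or library.

```python
# Convert an availability target into the downtime it allows per year.
# Assumption: a 365-day year, which matches the rounded figures in the table.

SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def allowed_downtime_seconds(availability_percent: float) -> float:
    """Return the seconds of downtime per year permitted by the target."""
    unavailable_fraction = 1 - availability_percent / 100
    return SECONDS_PER_YEAR * unavailable_fraction


for target in (99.0, 99.9, 99.99, 99.999):
    seconds = allowed_downtime_seconds(target)
    print(f"{target}%: {seconds / 3600:.2f} hours ({seconds / 60:.1f} minutes)")
```

Each additional nine divides the allowance by ten, which is why the targets get expensive quickly.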
Those numbers connect. Availability works out to MTBF / (MTBF + MTTR), which is a useful reminder that reliability depends as much on how fast you recover as on how rarely you fail. Cutting recovery time moves the figure just as surely as preventing the failure does.
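To make the formula concrete, here is a minimal sketch that compares two ways of improving the same service. The MTBF and MTTR values are made-up inputs chosen for illustration, not measurements from any real system.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical service: fails every 500 hours, takes 1 hour to recover.
baseline = availability(mtbf_hours=500, mttr_hours=1)

# Option A: fail half as often. Option B: recover twice as fast.
fewer_failures = availability(mtbf_hours=1000, mttr_hours=1)
faster_recovery = availability(mtbf_hours=500, mttr_hours=0.5)

print(f"baseline:        {baseline:.5f}")
print(f"fewer failures:  {fewer_failures:.5f}")
print(f"faster recovery: {faster_recovery:.5f}")
```

Under these assumed inputs, halving recovery time yields the same availability as doubling the time between failures, because only the ratio of MTTR to MTBF enters the formula.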
Most teams now operationalize all of this through service level objectives. A Service Level Indicator (SLI) is the signal you measure, like success rate or request latency. A Service Level Objective (SLO) is the target you hold yourself to, say, 99.9% over a rolling 30-day period. The gap between that target and 100% is your error budget, the amount of unreliability you're allowed to spend before reliability work has to take priority over shipping. Site reliability engineering is, in large part, the practice of managing that budget.
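As a rough sketch of the error budget arithmetic, the snippet below expresses the budget both as time and as a share of failed requests. The function names and the request counts are assumptions chosen for illustration.

```python
from datetime import timedelta


def error_budget(slo_percent: float, window: timedelta) -> timedelta:
    """Time-based budget: the part of the window allowed to be unreliable."""
    return window * (1 - slo_percent / 100)


def budget_remaining(slo_percent: float, total_requests: int, failed_requests: int) -> float:
    """Request-based budget: fraction of allowed failures not yet spent."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = total_requests * (1 - slo_percent / 100)
    return 1 - failed_requests / allowed_failures


# A 99.9% SLO over a rolling 30-day window allows roughly 43 minutes.
budget = error_budget(99.9, timedelta(days=30))
print(f"time budget: {budget.total_seconds() / 60:.1f} minutes")

# Hypothetical traffic: 1,000,000 requests so far, 400 of them failed.
remaining = budget_remaining(99.9, total_requests=1_000_000, failed_requests=400)
print(f"budget remaining: {remaining:.0%}")
```

A negative result from `budget_remaining` would mean the budget is overspent, which is the point at which reliability work takes priority over shipping.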
Why AI-generated code breaks the traditional reliability model
Software stopped being that predictable for a few reasons. Services multiplied into a microservices architecture, deployments went continuous, and the scale of modern systems outran what classical software reliability models were built to handle. But the change straining reliability the most right now is the sheer volume of code AI writes, because it breaks the assumptions on which those models and the usual safeguards were built.
| Assumption traditional reliability leaned on | What AI code does to it |
|---|---|
| Someone on the team understands every code path | Code ships in patterns and dependencies nobody chose or reviewed closely |
| Coverage and review reflect real scrutiny | AI writes the code and its tests, so passing tests can confirm the same error |
| Failures resemble ones you've seen before | Unfamiliar code paths produce novel failure modes and fresh edge cases |
| Output is bounded by how fast people can write it | Code arrives faster than review, testing, or ops can keep up with |
The leading indicators teams trusted lose signal first. Static analysis still flags known bug classes and obvious security vulnerabilities, but it can't tell you whether unfamiliar code actually does the right thing. Test coverage stops meaning much when the same model writes both the code and the tests, since a passing suite can just confirm its own misreading. And a thorough code review doesn't scale to the amount of code an agent can produce in an afternoon.
The rest is throughput. Code shows up faster than anyone can absorb it, and every generated service brings its own dependencies and configuration, widening the surface where things can fail. Technical debt accumulates faster than it gets written down. Scalability used to be about handling more load; now it's also about scaling review and operations to match how much code AI puts out, which you can't do by hiring.
AI writes code faster than teams can keep it reliable
None of this is a fringe practice. In Google's 2025 DORA research, about 90% of technology professionals reported using AI at work, and over 80% credited it with making them more productive. The single most common use is writing new code. Generative AI has become the default way many software products are built.
The catch shows up downstream. That same DORA research found that higher AI adoption was linked to a rise in both software delivery throughput and instability. The 2024 report had already put a number on the stability side: a 25% jump in AI adoption tracked with an estimated 7.2% drop in delivery stability. More code is going out, and more of it is breaking.
DORA's own read is that AI works as an amplifier, magnifying whatever an organization is already good or bad at. With solid reliability practices, AI makes the team faster. With shaky ones, it mostly helps you ship the problems faster. Either way, the bottleneck has shifted from writing code to keeping it reliable once it's live.
The same speed-versus-quality pattern shows up at the project level. A study of Cursor adoption across hundreds of open-source GitHub projects found the speed-up was front-loaded and short-lived: lines of code jumped for a month or two, then velocity settled back to baseline. The costs stuck around. Static analysis warnings rose about 30% and code complexity about 41%, both persisted, and the added complexity then dragged future velocity back down, a self-reinforcing cycle the authors trace to quality assurance becoming the bottleneck.
This is the real reason software reliability now needs its own AI. When AI code breaks at 2 am, whoever's on call is usually reading logic that nobody on the team actually authored. Pointing AI at operations, and not just at authoring, is how teams keep up with the code their own generative AI tools are producing.
What production-focused agents add to software reliability
The same agent approach that made coding tools fast and capable can be pointed at production. An agent built for operations investigates an incident and acts on it about as quickly as a coding agent ships a feature, which is the capability reliability has been missing while code generation raced ahead.
Production isn't a tidy repository, though. It sprawls across dozens of tools, runs on tribal knowledge and half-stale runbooks, and punishes a confident wrong answer at 3 am more than it punishes no answer at all. To be useful there, an agent has to bring a few specific things to the work.
- It knows the system, not just the symptom. A production agent works from a live map of your stack: service dependencies, who owns what, on-call schedules, and how each piece has failed before. When something breaks, it already knows what lies downstream and who to pull in, rather than starting from nothing.
- It investigates every signal at once. A lead agent reasons through the problem while specialist agents pull from code, infrastructure, metrics, logs, and traces in parallel, and a reviewer checks the result. That's how root cause analysis drops from hours of dashboard-hopping to minutes, and it's exactly the correlation across mixed-format telemetry that LLMs are good at.
- It learns what normal looks like. Machine learning baselines pick up each service's usual rhythm and flag genuine deviations, catching both sudden latency spikes and the slow drifts that predictive analytics surfaces before they cross a line a person would notice. This kind of anomaly detection runs around the clock, with no one monitoring the dashboard.
- It stays disciplined at scale. A good agent queries your observability tools the way a senior SRE would, pulling exactly what it needs instead of dragging everything into context. That keeps the cost of each investigation predictable as volume climbs and avoids triggering rate limits across your monitoring stack.
- It compounds from being corrected. Engineers can steer an investigation while it runs, redirecting a hypothesis or confirming a finding, and those corrections feed back so the next one starts sharper. A failure caught once becomes a check that guards against it everywhere.
A lot of this runs without prompting, with always-on agents watching deployments for regressions and picking up routine operational work before anyone gets paged.
This is how Resolve AI agents work. They run on models post-trained for production reasoning and route each step to the right model, which is how investigation quality holds up while the cost per investigation stays flat, even at thousands a week.
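The idea of learning what normal looks like, from the list above, can be illustrated with a deliberately simple statistical baseline. This is a sketch under stated assumptions, not a description of how any particular product works: it uses a trailing-window z-score instead of a trained model, and the function name, window size, threshold, and latency values are all hypothetical.

```python
from collections import deque
from statistics import mean, stdev


def find_anomalies(values, window=20, threshold=3.0):
    """Return (index, value) pairs that deviate from the trailing baseline.

    A point is flagged when it lies more than `threshold` sample standard
    deviations from the mean of the previous `window` points.
    """
    history = deque(maxlen=window)
    anomalies = []
    for index, value in enumerate(values):
        if len(history) == window:
            baseline = mean(history)
            spread = stdev(history)
            if spread > 0 and abs(value - baseline) > threshold * spread:
                anomalies.append((index, value))
        history.append(value)
    return anomalies


# Made-up latency series in milliseconds with one injected spike.
latencies = [100 + (i % 5) for i in range(40)]
latencies[30] = 250

print(find_anomalies(latencies))
```

A fixed-threshold z-score like this catches sudden spikes but not slow drift, since the baseline drifts along with the data. Detecting gradual degradation would need a longer reference window or a seasonal model, which is beyond this sketch.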
Where AI still needs human engineers
AI shifts the reliability math, but it doesn't finish the job. What changes is your role: less of the manual investigation, more of supplying what the agent can't get on its own, and judging what it shouldn't decide alone. A few places make that split clear.
- The genuinely new. A model reasons from what it's already seen. Something with no precedent (an attack out of nowhere, a bug in code that deployed an hour ago) can register as "something's off" without the agent pinning down why. The edge cases that have never happened are the ones it handles worst, and recognizing them is on you.
- The context it doesn't have. An agent only sees what's instrumented and written down. The most important context about a system is often neither: the service with a memory leak everyone tolerates, the alert that's noisy by design, or the spot where logging was never added. Part of working with these agents is feeding them that context and closing those gaps, ideally across the software development lifecycle rather than in the middle of an incident.
- Finding the real root cause. The fastest investigations are collaborative. The agent works through the evidence in parallel while you steer it, ruling out a wrong hypothesis early or confirming a hunch it can't check on its own. Your sense of a system's known weak spots is often what turns a plausible answer into the actual root cause analysis.
- The judgment calls. Do you roll back a deploy that's throwing minor errors but also ships a fix for a known security vulnerability? There's revenue and risk on both sides. The agent can lay out all the evidence, but a human still makes the call.
- Explainability and trust. An answer you can't check is hard to act on at 2 am. DORA's 2024 data found 39% of respondents had little or no trust in AI-generated code, and that wariness carries straight over to what an AI concludes about production. Explainability is what lets you verify a conclusion before acting on it; an agent that hides its reasoning just gets worked around.
Worked this way, the SRE role scales rather than shrinks. One engineer steering a set of agents covers ground that used to take a whole on-call rotation, which is the scalability shift SRE teams are feeling first.