How can you reduce alert fatigue in your SRE team?

Alert fatigue slows real incident response, lets genuine outages slip through, and grinds down your experienced engineers. Reducing it is not about changing your alert thresholds, but about putting machines on call.

It usually starts small. A CPU alert threshold gets added during a launch, set to fire above 80 percent so nobody misses a problem on release day. The launch goes fine, but why turn off the alert? It’s helpful, right? Over the next few months, that alert fires under normal load but nothing bad ever happens. Every one of those was a false positive, a page with no real problem behind it, and each on-call engineer learns exactly one thing: it means nothing.

And then one day, it means something. This time the CPU is genuinely pegged. Requests back up, and latency climbs. But by then, the habit is set. The alert goes unanswered until timeouts lead to errors users actually see and the “alerts” start coming from users.

This is alert fatigue.

Key facts:

Alert fatigue is desensitization: engineers stop reacting to pages, including the real ones, because most turn out to be nothing.
Google's SRE book puts a sustainable load at no more than two incidents per twelve-hour on-call shift, since one incident takes roughly six hours to work end to end.
The fix is subtraction plus automation: alert on symptoms and SLO burn, then let something else triage what's left before it reaches a person.

What is alert fatigue?

Alert fatigue is when engineers get so many alerts that they stop reacting to them, including the ones that signal real problems.

Psychologically, it is the desensitization that sets in when engineers receive so many alerts, many of them noise, that they start ignoring them, silencing them, or acknowledging them ever more slowly. The alerts stop feeling “real,” so they stop seeming to require a response. In a noisy setup, false alarms like these make up the bulk of what the monitoring produces. It shows up anywhere people monitor systems, including security operations and healthcare, where the term was first coined.

Here, we’re focusing on on-call and production engineering alert fatigue, where the alerts come from monitoring, deploys, and incident tooling. It manifests as:

Pages acknowledged on reflex, sometimes before the alert body has been read.
Alerts muted, snoozed, or routed to a channel nobody watches.
Acknowledgment times slowly climbing as the pager loses its urgency.
Real incidents noticed by users or downstream teams before the on-call engineer.

Importantly, alert fatigue doesn't stay with the alert that caused it. One noisy CPU rule teaches the team to distrust the whole pager, including the alerts that are well built.

Where the term “alert fatigue” comes from

Alert fatigue has a pedigree: it started in medicine and clinical decision support rather than software. In a 2013 Sentinel Event Alert, the Joint Commission flagged alarm fatigue as a patient-safety problem, citing estimates that 85 to 99 percent of clinical alarm signals did not require intervention. Clinicians had grown so used to the noise that real alarms were missed, and the problem was serious enough to become a National Patient Safety Goal the following year.

The mechanism in a production system is the same. A person exposed to constant low-value automated alerts stops reacting to all of them, including the ones that matter. The stakes usually differ, but the failure pattern is identical, and so are most of the fixes.

Event, alert, and incident are not the same thing

A lot of alert fatigue comes from collapsing three things that should stay separate.

Layer	What it is	How often it should reach a human
Event	Any change of state in your system	Almost never on its own
Alert	An event you've decided is worth surfacing	Only when it needs attention
Incident	A confirmed problem that needs a response	Every time

Most events should never become alerts, and most alerts should resolve without ever becoming incidents. Keeping the layers distinct is what makes alerting sustainable. A good alert fires as close as possible to the incident layer, on signals that reliably track real problems, which is exactly why symptom and SLO-based alerting works. When teams alert at the event layer instead, paging on every state change, the gap between alerts and incidents widens until the pager is mostly noise.

What alert fatigue costs your team

The first cost is slower response to real incidents. When most pages are noise, engineers acknowledge them out of habit or stop acknowledging them at all. That drives up mean time to resolution (MTTR): engineers acknowledge later, investigate less, and occasionally let a page sit because the last fifty were nothing.

Low-priority alerts that interrupt an on-call engineer erode productivity, and the fatigue they cause means serious alerts get less attention than they need.

There's a hard ceiling underneath this. Google's SRE book puts a sustainable load at no more than two incidents per twelve-hour on-call shift, since a single incident takes about six hours on average to work through end to end. Constant noise pushes teams well past that line, which is where both response quality and morale start to slip.

The second cost is employee burnout. On-call is demanding on its own, and constant noise turns a workable rotation into a driver of burnout and attrition. Losing an experienced engineer who knows your systems costs far more than the noise that pushed them out, and it takes much longer to recover from.

What causes alert fatigue?

Most alert fatigue traces back to a small set of patterns, and nearly all of them are fixable. The usual causes:

Non-actionable alerts. The page fires but there's nothing for a human to do, or the issue clears on its own. A high false-positive rate is what drives the fatigue: once most pages turn out to be false positives, the next one gets less scrutiny than it deserves. Rob Ewaschuk's canonical alerting guide from Google is blunt about this: every page should demand a real action, not just record that something fired again.
Thresholds that never get tuned. Static limits set during a launch keep firing long after they stopped meaning anything, with no seasonality and no awareness of what normal load looks like.
Alert storms. One underlying problem fans out into dozens of pages. The SRE book specifically calls out controlling how many alerts a single incident is allowed to generate.
No severity ranking. When every alert carries the same urgency, engineers have no fast way to tell a page that can wait from one that can't.
Missing ownership and routing. Alerts that broadcast to a whole channel instead of the team that owns the service spread noise without assigning responsibility.
No context attached. A page with no runbook, no recent-deploy summary, and no links to the relevant dashboards forces the responder to rebuild the situation before they can start.
Alerting on causes instead of symptoms. Paging on every internal cause, a full disk or a slow query, generates far more noise than paging on user-visible symptoms. Ewaschuk's guidance is to alert on symptoms and keep cause-level detail inside the alert or on dashboards.

How to reduce alert fatigue

The manual way to reduce alert fatigue is to reduce alerts. Cutting alert fatigue is mostly subtraction: you raise the bar for what earns a page and remove everything that doesn't clear it. The practices that work, roughly mapping to the causes above:

Alert on symptoms and SLO burn, not causes. Page when users are affected or when you're burning your error budget too fast. The four golden signals from the SRE book, latency, traffic, errors, and saturation, are a good starting point. A multi-window, multi-burn-rate approach keeps a brief blip from paging anyone.
Make every page actionable. Before adding an alert, decide what a human will do when it fires. If the answer is nothing, or the response is fully scriptable, automate it or drop it. Ewaschuk's rule of thumb is to err toward removing noisy alerts, since over-monitoring is harder to fix than under-monitoring.
Deduplicate and group. Collapse the related pages from a single incident into one notification, so a cascading failure doesn't page the on-call engineer thirty times for one root cause.
Rank severity and route by owner. Give every alert a clear priority and send it to the team that owns the service rather than a shared channel.
Attach context to every alert. Include recent deploys, a runbook link, and the relevant dashboards, so the responder starts with the context already gathered instead of assembling it by hand.
Prune on a schedule. Audit your alerts regularly and delete the ones that never lead to action. An alert that has never once required a response is pure cost.
Protect the rotation. The SRE book caps on-call and operational load and reserves at least half of an engineer's time for project work, partly so the job of fixing noisy alerts actually gets done.
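
The multi-window, multi-burn-rate idea from the first practice above can be sketched in a few lines. This is a minimal illustration, not a production rule set: the 99.9 percent SLO target, the window labels, and the 14.4 and 6 thresholds are assumptions borrowed from commonly published examples for a 30-day SLO, and `BurnRule`, `burn_rate`, and `should_page` are hypothetical names. The input is assumed to be an error ratio per lookback window that your metrics system has already computed.

```python
from dataclasses import dataclass
from typing import Mapping

SLO_TARGET = 0.999              # assumed availability objective (illustrative)
ERROR_BUDGET = 1 - SLO_TARGET   # fraction of requests allowed to fail


@dataclass(frozen=True)
class BurnRule:
    long_window: str    # label of the long lookback window
    short_window: str   # label of the short lookback window
    threshold: float    # burn-rate multiple both windows must exceed


# Illustrative paging rules; tune windows and thresholds to your own SLO.
PAGE_RULES = (
    BurnRule("1h", "5m", 14.4),
    BurnRule("6h", "30m", 6.0),
)


def burn_rate(error_ratio: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    return error_ratio / ERROR_BUDGET


def should_page(error_ratios: Mapping[str, float]) -> bool:
    """Page only if BOTH windows of at least one rule exceed its threshold."""
    return any(
        burn_rate(error_ratios[rule.long_window]) > rule.threshold
        and burn_rate(error_ratios[rule.short_window]) > rule.threshold
        for rule in PAGE_RULES
    )


# A brief blip: the 5m window is hot, but the 1h window is still calm.
blip = {"5m": 0.05, "30m": 0.004, "1h": 0.002, "6h": 0.001}
# A sustained burn: the 5m and 1h windows are both well over threshold.
sustained = {"5m": 0.05, "30m": 0.03, "1h": 0.02, "6h": 0.004}

print(should_page(blip))       # False
print(should_page(sustained))  # True
```

Requiring the short window to agree with the long one is what keeps a momentary spike from paging anyone, and it also lets the alert clear quickly once the burn stops.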

Taken together, these are the core of good alert management: deciding what fires, who it reaches, and when.
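
As a rough sketch of the "deduplicate and group" practice, the snippet below collapses alerts for the same service that fire close together into one notification. Everything here is an assumption for illustration: the `Alert` and `Notification` shapes, the `group_alerts` helper, and the ten-minute window are hypothetical, and real grouping usually keys on more than the service name (for example, dependency topology or a shared deploy).

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Alert:
    service: str         # owning service, used here as the grouping key
    name: str
    fired_at: datetime


@dataclass
class Notification:
    service: str
    first_fired: datetime
    last_fired: datetime
    alerts: list = field(default_factory=list)


def group_alerts(alerts, window=timedelta(minutes=10)):
    """Collapse alerts for one service into a single notification as long as
    each fires within `window` of the previous alert in that group."""
    open_groups = {}     # service -> most recent Notification for it
    notifications = []
    for alert in sorted(alerts, key=lambda a: a.fired_at):
        group = open_groups.get(alert.service)
        if group is None or alert.fired_at - group.last_fired > window:
            # No open group, or the last one has gone quiet: start a new one.
            group = Notification(alert.service, alert.fired_at, alert.fired_at)
            open_groups[alert.service] = group
            notifications.append(group)
        group.alerts.append(alert)
        group.last_fired = alert.fired_at
    return notifications


# Thirty alerts from one cascading failure, plus one unrelated alert.
start = datetime(2024, 1, 1, 3, 0)
storm = [
    Alert("checkout", f"rule-{i}", start + timedelta(seconds=10 * i))
    for i in range(30)
]
storm.append(Alert("search", "latency", start + timedelta(minutes=1)))

pages = group_alerts(storm)
print(len(storm), "alerts ->", len(pages), "notifications")
# 31 alerts -> 2 notifications
```

Because each notification carries its service, the same structure can drive routing: send the grouped notification to the owning team's rotation rather than a shared channel.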

Metrics that show alert fatigue is improving

You can't manage alert fatigue without measuring it. A few signals tell you whether your changes are working:

Pages per on-call shift. The SRE book's sustainable ceiling is two incidents per twelve-hour shift. Trending toward or below that is the clearest sign of progress.
Percent of alerts that are actionable. Track how many pages lead to a real action. A low or falling ratio means you're training the team to ignore the pager. Its inverse, the false-positive rate, is the same signal read the other way: when it rises, the pager is filling with noise.
Mean time to acknowledge and resolve. Rising acknowledgment time usually means engineers have stopped trusting their alerts.
Alerts per incident. Falling counts here show your grouping and deduplication are doing their job.
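
The signals above can be computed from a plain export of paging history. The sketch below is one way to do it under stated assumptions: the `Page` record and its fields are hypothetical (your paging tool's export will look different), `fatigue_metrics` is an illustrative helper, and it covers acknowledgment time only, since resolution time needs incident close timestamps that are not modeled here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Optional


@dataclass(frozen=True)
class Page:
    fired_at: datetime
    acked_at: datetime
    actionable: bool             # did a human have to do something?
    incident_id: Optional[str]   # None when the page mapped to no incident


def fatigue_metrics(pages, shifts):
    """Summarize pager health for `pages` collected over `shifts` shifts."""
    if not pages or shifts <= 0:
        raise ValueError("need at least one page and one shift")
    incident_pages = [p for p in pages if p.incident_id is not None]
    incidents = {p.incident_id for p in incident_pages}
    return {
        "pages_per_shift": len(pages) / shifts,
        "actionable_ratio": sum(p.actionable for p in pages) / len(pages),
        "mtta_seconds": mean(
            (p.acked_at - p.fired_at).total_seconds() for p in pages
        ),
        "alerts_per_incident": (
            len(incident_pages) / len(incidents) if incidents else 0.0
        ),
    }


# Four pages over two shifts: two belong to one incident, two were noise.
t0 = datetime(2024, 1, 1, 3, 0)
history = [
    Page(t0, t0 + timedelta(seconds=60), True, "INC-1"),
    Page(t0, t0 + timedelta(seconds=120), True, "INC-1"),
    Page(t0, t0 + timedelta(seconds=300), False, None),
    Page(t0, t0 + timedelta(seconds=720), False, None),
]
print(fatigue_metrics(history, shifts=2))
```

For this sample the result is 2.0 pages per shift, an actionable ratio of 0.5, a mean time to acknowledge of 300 seconds, and 2.0 alerts per incident. Tracking the same four numbers week over week matters more than any single reading.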

How Resolve investigates alerts before they page you

Resolve AI is an agentic AI that reduces alert fatigue by investigating every alert the moment it fires, grouping the ones that belong to a single incident, and attaching the context an engineer would otherwise gather by hand, so fewer pages reach a human, and the ones that do are already triaged.

Earlier AIOps tools leaned on machine learning to flag anomalies, which surfaces the unexpected but rarely tells you whether it actually matters.

Unlike rule-based or machine-learning monitoring, agentic AI can investigate an alert end to end: pull the data, form a hypothesis, and confirm or rule it out. When a page comes in, a Resolve agent picks it up, pulls the relevant logs, metrics, traces, and recent changes, and works out whether it represents real user impact or noise before anyone is pulled in.

That directly addresses the patterns that cause fatigue. It correlates and groups the flood of pages from a single incident into one investigation, so alert storms stop reaching the pager as thirty separate notifications and needing manual triage. It attaches the context that low-fidelity alerts lack, including recent deploys and affected services. And it works out the difference between a symptom and its root cause, which is the triage that usually lands on a tired on-call engineer at the worst possible time.

The investigation runs automatically, and what happens next is up to you. Resolve starts advisory, handing the engineer an evidence-backed summary of what's happening and why, and takes on more of the execution for low-risk, well-understood problems as your team gets comfortable with it. Everything else stays a human decision. That model, AI-led investigation with graduated autonomy on the fix, is how Resolve approaches on-call.

Frequently asked questions

What causes alert fatigue?

Alert fatigue is caused by too many low-value alerts: non-actionable pages, untuned static thresholds, duplicate alerts from a single incident, missing severity and ownership, and alerting on internal causes rather than user-visible symptoms. The common thread is a low ratio of signal to noise, which trains people to stop reacting.

How do you reduce alert fatigue?

You reduce alert fatigue by raising the bar for what earns a page: alert on symptoms and SLO burn rather than every internal cause. An AI SRE can then investigate each alert as it fires, group the pages from a single incident, and attach context automatically, so fewer reach a human and the ones that do arrive already triaged, with a person still deciding what to act on.

What is the difference between alert fatigue and alarm fatigue?

Alarm fatigue and alert fatigue describe the same effect in different fields. Alarm fatigue is the original clinical term, formalized by the Joint Commission for medical-device alarms. Alert fatigue is the equivalent in software and security operations. Notification fatigue is sometimes used for the broader case that adds chat, email, and push on top of monitoring alerts.

How many alerts per on-call shift is sustainable?

Google's SRE book puts the ceiling at two incidents per twelve-hour on-call shift, based on the roughly six hours an incident takes to work through end to end. Most teams that feel constant alert fatigue are running well above that.
