I needed to investigate a critical alert: the frontend service was logging errors it normally shouldn't. The goal: identify the root cause, understand the blast radius, assess user impact, and determine whether downstream service failures were bubbling up through the entrypoint.
What makes this hard?
The alert fires when the frontend logs errors, but it doesn't tell you why. Logs show connection failures, but not which service is down or what caused it. Traces reveal failed requests, but don't explain the timeline or the correlation between service failures. Metrics show traffic spikes, but don't connect them to service crashes. Manual incident investigation means racing through disconnected tools under pressure:
- Query the alert system to understand what triggered it and when it fired
- Search frontend logs to find error patterns and affected endpoints
- Check traces for failed requests to identify downstream dependencies
- Query metrics separately for each suspected service to find anomalies
- Investigate infrastructure for deployments or configuration changes
- Manually connect the dots: alert timeline → error logs → trace failures → service metrics → root cause (a rough sketch of this stitching follows below)
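To make that last step concrete, here is a minimal, hypothetical sketch of the manual stitching, assuming you have already exported error-log timestamps from a log search and a per-minute request-rate series from a metrics dashboard. The data, bucket sizes, and helper names below are illustrative assumptions, not artifacts from this incident or from Resolve AI:

```python
from collections import Counter
from datetime import datetime

# Illustrative exports only: timestamps copied out of a log search and a
# per-minute request-rate series copied out of a metrics dashboard.
error_log_timestamps = [
    "2024-01-01T07:21:22Z", "2024-01-01T07:21:25Z", "2024-01-01T07:22:34Z",
]
request_rate_per_minute = {
    "07:03": 8500,   # hypothetical spike minute
    "07:04": 120,
    "07:21": 3,
}

def bucket_errors_by_minute(timestamps):
    """Count error logs per HH:MM bucket so they can be lined up with metrics."""
    buckets = Counter()
    for ts in timestamps:
        minute = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").strftime("%H:%M")
        buckets[minute] += 1
    return buckets

error_buckets = bucket_errors_by_minute(error_log_timestamps)

# Manual correlation: print both series side by side and eyeball the lag
# between the traffic spike and the first error-heavy minute.
for minute in sorted(set(request_rate_per_minute) | set(error_buckets)):
    print(minute, "requests:", request_rate_per_minute.get(minute, "-"),
          "errors:", error_buckets.get(minute, 0))
```

Doing this by hand for every suspected service, while the incident is still open, is exactly the part that doesn't scale.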
How did Resolve AI help?
With one query, Resolve AI automatically investigated the alert across logs, traces, metrics, and infrastructure to build a complete incident timeline:
- Identified the primary root cause with metrics correlation: the cart service was hit by an 8,500-request traffic spike at 07:03:56, crashed completely at 07:04:35, and P95 latency spiked to 15,000ms starting at 07:12:07
- Found a contributing factor through trace analysis: 452 error spans showing a 100% error rate from the flagd service for the 'authSigningKey' feature flag, with continuous pod restarts beginning at 07:07:32
- Quantified user impact from logs: 363 error logs with a 100% failure rate and zero successful transactions; checkout operations completely failed, with 327 connection-refused errors (90% of failures) to the cart service at 172.20.117.2:8080
- Mapped the blast radius across services: frontend returning HTTP 500 on /api/checkout (3ms traces indicating requests failed immediately), shipping quote service failures, and 31 PostgreSQL connection resets from 07:21:49 to 07:40:47
- Built precise causal timeline: Cart outage began 07:04:35, frontend errors started 07:21:22 (17 minutes later), alert fired 07:21:30, cart recovery 07:22:45, alert cleared 07:40:30
- Identified an investigation gap requiring follow-up: the source of the initial 8,500-request traffic spike is unknown and needs analysis to determine whether it was a load test, bot traffic, or a legitimate surge
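The causal-timeline step above boils down to ordering the key events and computing the offsets between them. A minimal sketch using the timestamps from the report; the date is assumed purely for illustration, and this is not Resolve AI's internal representation:

```python
from datetime import datetime, timedelta

# Key events from the incident report; only the times are given, so the
# date is assumed and omitted from the output.
events = {
    "cart traffic spike":    "07:03:56",
    "cart crash":            "07:04:35",
    "flagd pod restarts":    "07:07:32",
    "frontend errors start": "07:21:22",
    "alert fired":           "07:21:30",
    "alert cleared":         "07:40:30",
}

parsed = {name: datetime.strptime(t, "%H:%M:%S") for name, t in events.items()}
ordered = sorted(parsed.items(), key=lambda kv: kv[1])

# Print each event with its offset from the previous one to expose gaps
# such as the ~17-minute lag between the cart crash and the frontend errors.
previous = None
for name, ts in ordered:
    delta = ts - previous if previous else timedelta(0)
    print(f"{ts:%H:%M:%S}  +{delta}  {name}")
    previous = ts
```

Laying the events out this way is what surfaces the gap between the cart crash and the first frontend errors, which is the kind of lag that hides a root cause when each tool is consulted in isolation.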
What did Resolve AI deliver?
Resolve AI produced a structured incident report with four critical components:
- Root Cause Analysis: The cart service was overwhelmed by the traffic spike, causing a complete crash and unavailability. Frontend connection failures occurred because the cart service was down and not responding to requests. Flagd service instability contributed: a missing feature flag configuration caused authentication issues.
- Causal Timeline: 14-event sequence showing the service failure progression: cart traffic spike (07:03:56) → cart crash (07:04:35) → flagd pod shutdown (07:07:32) → frontend errors appear (07:21:22) → alert fires (07:21:30) → peak error volume 1.27MB/min (07:22:34) → recovery begins (07:37:30) → normal state restored (07:40:30).
- Impact Assessment: 363 error logs, 100% failure rate, zero successful transactions, peak of 392 errors at 07:21. Resources impacted: frontend (7 pods), cart (primary), flagd (contributing). User impact: checkout completely failed, product browsing disrupted, shipping quotes unavailable; all users received HTTP 500 errors during the 36-minute outage.
- Notable Evidence: Organized by category, spanning user impact (100% failure rate, 1.27MB/min peak), frontend patterns (HTTP 500 on /api/checkout with 3ms traces), cart outage (8,500-request spike, near-zero volume after the crash, 0% error rate only because the service was unavailable), flagd instability (continuous restarts, 'authSigningKey' not found), and alert correlation (no cart/flagd alerts fired despite the failures, indicating monitoring gaps).
Resolve AI connected the 17-minute delay between cart crash and frontend errors (connection timeout period), correlated flagd restarts with authentication failures, and identified that cart showed 0% error rate because it was completely unavailable rather than returning errors—nuances that would take hours to piece together manually across multiple monitoring systems.
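That last nuance, a 0% error rate masking a dead service, can be captured as a rule of thumb: only trust an error rate when there is enough request volume behind it. Here is a hedged sketch of such a check; the function name, thresholds, and sample numbers are assumptions for illustration, not Resolve AI's actual logic:

```python
def service_health(requests_per_min: float, error_rate: float,
                   baseline_rpm: float, min_traffic_fraction: float = 0.05):
    """Classify a service window; thresholds here are illustrative assumptions."""
    if requests_per_min < baseline_rpm * min_traffic_fraction:
        # A 0% error rate is meaningless when almost no requests arrive:
        # the service may be down and simply not answering at all.
        return "unavailable (error rate not meaningful)"
    if error_rate > 0.05:
        return "degraded"
    return "healthy"

# Cart-like window during the outage: near-zero volume, 0% errors ->
# flagged as unavailable rather than healthy, matching the report's nuance.
print(service_health(requests_per_min=2, error_rate=0.0, baseline_rpm=500))
# Normal window: healthy traffic volume, 0% errors -> healthy.
print(service_health(requests_per_min=480, error_rate=0.0, baseline_rpm=500))
```

The point is not the specific thresholds but the pairing of error rate with request volume, which is what keeps a silent outage from being mistaken for a healthy service.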