Averted a Production Incident by Investigating a Memory Spike


Summary

  • A seemingly benign feature flag rollout introduced a significant memory increase in a critical service, triggering alerts and threatening production stability.
  • This single issue, if left unchecked, could have led to a cascading failure and a full-blown outage, costing hours of engineering time to debug across multiple systems.
  • By simultaneously investigating code changes, feature flag configs, and historical monitoring data, Resolve AI instantly correlated the memory spike with the exact rollout event.
  • Beyond diagnosis, it synthesized operational knowledge into a safe, step-by-step procedure for disabling the problematic flags, turning a potential crisis into a controlled, 10-minute resolution.

"This would have taken us half a day to piece together manually. The alert fired, I had a vague suspicion and asked a question in Slack. Resolve AI didn't just give me charts; it connected the dots between a code change, a feature flag, and the memory spike, and then told me exactly how to fix it safely."

What was the incident?

An alert for KubeDeploymentReplicasMismatch fired for the svc-entity-graph deployment, indicating instability. Iain, suspecting a resource issue, asked a broad question in Slack: "were there any changes to svc-entity-graph yesterday afternoon that might have increased memory usage or introduced a leak?"
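For readers unfamiliar with the alert: KubeDeploymentReplicasMismatch fires when a deployment's available replicas stay below its desired count, which is what a memory-starved, repeatedly killed pod looks like from the outside. A minimal sketch of the same check done by hand, using the Kubernetes Python client and an assumed namespace, would look roughly like this:

```python
# Illustrative sketch only: the namespace ("default") and kubeconfig access
# are assumptions, not details from the incident.
from kubernetes import client, config

config.load_kube_config()          # or load_incluster_config() inside a pod
apps = client.AppsV1Api()

dep = apps.read_namespaced_deployment("svc-entity-graph", "default")
desired = dep.spec.replicas or 0
available = dep.status.available_replicas or 0

if available < desired:
    # This is the mismatch the alert reports: pods keep failing to become
    # (or stay) Ready, which is what OOM-killed, memory-hungry pods look like.
    print(f"replica mismatch: {available}/{desired} available")
```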

How was the resolution?


The question in Slack triggered Resolve AI, which began a parallel investigation across completely different systems.


It simultaneously queried:

  • Deployment systems for changes to svc-entity-graph.
  • Configuration repositories to interpret those changes and identify the new feature flags they introduced.
  • Grafana for historical memory metrics of specific pods.
  • Kubernetes APIs to confirm that pod resource definitions had not changed.

Resolve AI collapsed what would have been hours of manual, sequential work into a single, synthesized action. It didn't just return a list of deployments and a chart; it connected the dots into a single, actionable narrative: a major code change enabling new ingestion pipelines via feature flags (ingestRawObjectsToEG) was rolled out at 4:54 PM, and memory usage for the new pods began spiking dramatically around the same time.
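For context, the manual version of that correlation would look something like the sketch below: pull pod memory from Prometheus and the deployment's last rollout time from Kubernetes in parallel, then line up the two timestamps. The Prometheus URL, namespace, and metric selector are illustrative assumptions, not details from the incident:

```python
# Hand-rolled version of the correlation: fetch memory history and the last
# rollout time concurrently, then compare timestamps by eye.
import time
from concurrent.futures import ThreadPoolExecutor

import requests
from kubernetes import client, config

PROM_URL = "http://prometheus:9090"  # assumption

def memory_series():
    """Last 6 hours of working-set memory for svc-entity-graph pods."""
    end = time.time()
    resp = requests.get(
        f"{PROM_URL}/api/v1/query_range",
        params={
            "query": 'container_memory_working_set_bytes{pod=~"svc-entity-graph-.*"}',
            "start": end - 6 * 3600,
            "end": end,
            "step": "60",
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["data"]["result"]

def last_rollout_time():
    """Timestamp of the deployment's most recent rollout (Progressing condition)."""
    config.load_kube_config()
    dep = client.AppsV1Api().read_namespaced_deployment("svc-entity-graph", "default")
    progressing = next(c for c in dep.status.conditions if c.type == "Progressing")
    return progressing.last_update_time

with ThreadPoolExecutor() as pool:
    series_f = pool.submit(memory_series)
    rollout_f = pool.submit(last_rollout_time)

print(f"last rollout: {rollout_f.result()}")
for s in series_f.result():
    latest_ts, latest_val = s["values"][-1]
    print(s["metric"].get("pod"), f"{float(latest_val) / 2**20:.0f} MiB at {latest_ts}")
```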

With the cause identified, Iain's next question was about the solution.


Resolve AI then synthesized operational knowledge, providing a detailed, safe procedure for disabling the flags, complete with best practices ("coordinate with your team," "use a low-traffic period") and a rollback plan.
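The specifics of such a procedure depend on where the flags live. As a rough illustration only, if the flag were held in a Kubernetes ConfigMap, the disable-and-rollback steps could be scripted along these lines (the ConfigMap name, flag storage, and namespace are all assumptions):

```python
# Illustrative sketch of a flag disable with a built-in rollback record.
# Everything named here (ConfigMap, namespace) is an assumption for the example.
from datetime import datetime, timezone
from kubernetes import client, config

NAMESPACE = "default"                  # assumption
CONFIGMAP = "svc-entity-graph-flags"   # assumption
FLAG = "ingestRawObjectsToEG"

config.load_kube_config()
core = client.CoreV1Api()
apps = client.AppsV1Api()

# 1. Capture the current value so the change can be reverted quickly.
cm = core.read_namespaced_config_map(CONFIGMAP, NAMESPACE)
previous = (cm.data or {}).get(FLAG)
print(f"rollback value for {FLAG}: {previous!r}")

# 2. Disable the flag.
core.patch_namespaced_config_map(CONFIGMAP, NAMESPACE, {"data": {FLAG: "false"}})

# 3. Restart the deployment so pods pick up the new value (the same annotation
#    `kubectl rollout restart` sets under the hood).
stamp = datetime.now(timezone.utc).isoformat()
apps.patch_namespaced_deployment(
    "svc-entity-graph",
    NAMESPACE,
    {"spec": {"template": {"metadata": {
        "annotations": {"kubectl.kubernetes.io/restartedAt": stamp}}}}},
)
```

Rolling back is the same patch with the recorded previous value, which is why capturing it first is part of the procedure.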

What was the impact?

By following Resolve AI's guidance, the team was able to implement the fix and successfully deploy the service.

  • Averted a potential production outage by identifying and providing a fix for the memory issue before it caused cascading failures.
  • Reduced investigation time from hours to minutes, freeing senior engineers from tedious, manual correlation work.
  • Provided a safe, vetted procedure for disabling the problematic feature flags, reducing the risk of making the situation worse.
  • Pinpointed the precise impact of a feature flag, providing critical feedback to the development team for future releases.

Hand off your headaches to Resolve AI

Get back to driving innovation and delivering customer value.

