Averted a Production Incident by Investigating a Memory Spike


Summary

  • A seemingly benign feature flag rollout introduced a significant memory increase in a critical service, triggering alerts and threatening production stability.
  • This single issue, if left unchecked, could have led to a cascading failure and a full-blown outage, costing hours of engineering time to debug across multiple systems.
  • By simultaneously investigating code changes, feature flag configs, and historical monitoring data, Resolve AI instantly correlated the memory spike with the exact rollout event.
  • Beyond diagnosis, it synthesized operational knowledge into a safe, step-by-step procedure for disabling the problematic flags, turning a potential crisis into a controlled, 10-minute resolution.

"This would have taken us half a day to piece together manually. The alert fired, I had a vague suspicion and asked a question in Slack. Resolve AI didn't just give me charts; it connected the dots between a code change, a feature flag, and the memory spike, and then told me exactly how to fix it safely."

What was the incident?

An alert for KubeDeploymentReplicasMismatch fired for the svc-entity-graph deployment, indicating instability. Iain, suspecting a resource issue, asked a broad question in Slack: "were there any changes to svc-entity-graph yesterday afternoon that might have increased memory usage or introduced a leak?"
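For readers unfamiliar with the alert: KubeDeploymentReplicasMismatch fires when a deployment's available replicas stay below its desired count, which is what a memory-starved, repeatedly killed pod looks like from the outside. A minimal sketch of the same check done by hand, using the Kubernetes Python client and an assumed namespace, would look roughly like this:

```python
# Illustrative sketch only: the namespace ("default") and kubeconfig access
# are assumptions, not details from the incident.
from kubernetes import client, config

config.load_kube_config()          # or load_incluster_config() inside a pod
apps = client.AppsV1Api()

dep = apps.read_namespaced_deployment("svc-entity-graph", "default")
desired = dep.spec.replicas or 0
available = dep.status.available_replicas or 0

if available < desired:
    # This is the mismatch the alert reports: pods keep failing to become
    # (or stay) Ready, which is what OOM-killed, memory-hungry pods look like.
    print(f"replica mismatch: {available}/{desired} available")
```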

How was the resolution?


The question in Slack triggered Resolve AI, which began a parallel investigation across completely different systems.


It simultaneously queried:

  • Deployment systems for changes to svc-entity-graph.
  • Configuration repositories to interpret those changes and identify the new feature flags they introduced.
  • Grafana for historical memory metrics of specific pods.
  • Kubernetes APIs to confirm that pod resource definitions had not changed.

Resolve AI collapsed what would have been hours of manual, sequential work into a single, synthesized action. It didn't just return a list of deployments and a chart; it connected the dots into a single, actionable narrative: a major code change enabling new ingestion pipelines via feature flags (ingestRawObjectsToEG) was rolled out at 4:54 PM, and memory usage for the new pods began spiking dramatically around the same time.
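For context, the manual version of that correlation would look something like the sketch below: pull pod memory from Prometheus and the deployment's last rollout time from Kubernetes in parallel, then line up the two timestamps. The Prometheus URL, namespace, and metric selector are illustrative assumptions, not details from the incident:

```python
# Hand-rolled version of the correlation: fetch memory history and the last
# rollout time concurrently, then compare timestamps by eye.
import time
from concurrent.futures import ThreadPoolExecutor

import requests
from kubernetes import client, config

PROM_URL = "http://prometheus:9090"  # assumption

def memory_series():
    """Last 6 hours of working-set memory for svc-entity-graph pods."""
    end = time.time()
    resp = requests.get(
        f"{PROM_URL}/api/v1/query_range",
        params={
            "query": 'container_memory_working_set_bytes{pod=~"svc-entity-graph-.*"}',
            "start": end - 6 * 3600,
            "end": end,
            "step": "60",
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["data"]["result"]

def last_rollout_time():
    """Timestamp of the deployment's most recent rollout (Progressing condition)."""
    config.load_kube_config()
    dep = client.AppsV1Api().read_namespaced_deployment("svc-entity-graph", "default")
    progressing = next(c for c in dep.status.conditions if c.type == "Progressing")
    return progressing.last_update_time

with ThreadPoolExecutor() as pool:
    series_f = pool.submit(memory_series)
    rollout_f = pool.submit(last_rollout_time)

print(f"last rollout: {rollout_f.result()}")
for s in series_f.result():
    latest_ts, latest_val = s["values"][-1]
    print(s["metric"].get("pod"), f"{float(latest_val) / 2**20:.0f} MiB at {latest_ts}")
```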

With the cause identified, Iain's next question was about the solution.


Resolve AI then synthesized operational knowledge, providing a detailed, safe procedure for disabling the flags, complete with best practices ("coordinate with your team," "use a low-traffic period") and a rollback plan.
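The specifics of such a procedure depend on where the flags live. As a rough illustration only, if the flag were held in a Kubernetes ConfigMap, the disable-and-rollback steps could be scripted along these lines (the ConfigMap name, flag storage, and namespace are all assumptions):

```python
# Illustrative sketch of a flag disable with a built-in rollback record.
# Everything named here (ConfigMap, namespace) is an assumption for the example.
from datetime import datetime, timezone
from kubernetes import client, config

NAMESPACE = "default"                  # assumption
CONFIGMAP = "svc-entity-graph-flags"   # assumption
FLAG = "ingestRawObjectsToEG"

config.load_kube_config()
core = client.CoreV1Api()
apps = client.AppsV1Api()

# 1. Capture the current value so the change can be reverted quickly.
cm = core.read_namespaced_config_map(CONFIGMAP, NAMESPACE)
previous = (cm.data or {}).get(FLAG)
print(f"rollback value for {FLAG}: {previous!r}")

# 2. Disable the flag.
core.patch_namespaced_config_map(CONFIGMAP, NAMESPACE, {"data": {FLAG: "false"}})

# 3. Restart the deployment so pods pick up the new value (the same annotation
#    `kubectl rollout restart` sets under the hood).
stamp = datetime.now(timezone.utc).isoformat()
apps.patch_namespaced_deployment(
    "svc-entity-graph",
    NAMESPACE,
    {"spec": {"template": {"metadata": {
        "annotations": {"kubectl.kubernetes.io/restartedAt": stamp}}}}},
)
```

Rolling back is the same patch with the recorded previous value, which is why capturing it first is part of the procedure.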

What was the impact?

By following Resolve AI's guidance, the team was able to implement the fix and successfully deploy the service.

  • Averted a potential production outage by identifying and providing a fix for the memory issue before it caused cascading failures.
  • Reduced investigation time from hours to minutes, freeing senior engineers from tedious, manual correlation work.
  • Provided a safe, vetted procedure for disabling the problematic feature flags, reducing the risk of making the situation worse.
  • Pinpointed the precise impact of a feature flag, providing critical feedback to the development team for future releases.

Hand off your headaches to Resolve AI

Get back to driving innovation and delivering customer value.

