When a Kafka consumer lag alert fired, this investigation determined whether the issue stemmed from infrastructure problems, message queue health, or application code, and traced the complete causal chain from symptom to root cause.
A Kafka consumer lag alert fired for the fraud-detection service processing orders. The goal: identify why lag spiked to 7,000+ records during business hours, determine whether this was an isolated incident or a recurring pattern, and find the root cause before it could impact fraud-detection accuracy.
What makes this hard?
Consumer lag could stem from infrastructure problems (pod restarts, network issues), message queue problems (partition rebalancing, broker issues), or application problems (slow processing, memory leaks, deadlocks). Each possibility requires checking a different system. Grafana shows the lag spike, but doesn't explain why processing slowed. Kubernetes shows pod health, but doesn't connect it to consumer behavior. Kafka metrics show timeouts, but don't reveal what's blocking poll() calls. Application logs show errors, but don't tie them to code changes. Manual investigation requires disconnected work:
- Query Grafana to establish lag timeline and severity patterns
- Check Kubernetes for pod restarts, deployments, or resource exhaustion
- Analyze Kafka metrics for partition status and consumer group health (a sample lag check is sketched after this list)
- Search application logs for timeout errors and processing failures
- Review git history to find consumer code changes
- Manually connect: lag spike → poll timeouts → rebalancing cycles → code changes → processing delays
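To make the Kafka step concrete, here is a minimal sketch of the kind of lag check an engineer would run by hand using Kafka's Java AdminClient. The broker address is an assumption; the fraud-detection group name comes from the incident.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

public class ConsumerLagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption: broker address
        try (AdminClient admin = AdminClient.create(props)) {
            // Committed offsets for the consumer group that alerted
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets("fraud-detection")
                         .partitionsToOffsetAndMetadata().get();

            // Log-end offsets for the same partitions
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
                    admin.listOffsets(committed.keySet().stream()
                            .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest())))
                         .all().get();

            // Lag per partition = log-end offset minus committed offset
            committed.forEach((tp, offset) -> {
                if (offset == null) return; // no committed offset yet for this partition
                System.out.printf("%s lag=%d%n", tp, latest.get(tp).offset() - offset.offset());
            });
        }
    }
}
```

The same numbers are what a lag exporter typically feeds into Grafana; the difficulty is not any single check, but stitching its output together with the Kubernetes, log, and git evidence above.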
How did Resolve AI help?
Resolve AI delivered a complete root cause analysis in minutes, eliminating hours of manual correlation across Grafana, Kubernetes, Kafka metrics, application logs, and git history:
Pattern Recognition:
- Established that this was a recurring daily pattern (spikes of 5,000-7,000 records every 24 hours), not an isolated incident
- Identified poll timeouts occurring at exact 5-minute intervals (10:15, 10:20, 10:25)
Root Cause:
- Traced the causal chain: processing delays → missed poll() calls within the 45-second session timeout → consumer group ejections → rebalancing cascades (generations 7, 8, 9); this failure mode is sketched after this list
- Found the specific code commit from April 30, 2024 that introduced intentional processing delays, with the stated purpose to "introduce a consumer side delay leading to a lag spike"
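For readers less familiar with this failure mode, here is a minimal sketch of how a per-record processing delay in a standard Java consumer loop starves poll() and gets the consumer ejected from the group. The broker address, topic name, batch size, and the exact timeout used (max.poll.interval.ms here; the investigation cites a 45-second session timeout) are illustrative assumptions, not the fraud-detection service's actual code.

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class SlowFraudConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumption: broker address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "fraud-detection");           // group from the incident
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // If the gap between poll() calls exceeds this bound, the client leaves the
        // group and the coordinator starts a rebalance, bumping the group generation.
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "45000");         // illustrative 45 s bound
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "500");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));                              // topic name is an assumption
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // An injected delay like this is enough to break the loop's timing:
                    // 500 records x 100 ms = 50 s between polls, past the 45 s bound above.
                    Thread.sleep(100);
                    score(record);
                }
                consumer.commitSync();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private static void score(ConsumerRecord<String, String> record) {
        // fraud-scoring logic would live here
    }
}
```

Each ejection pauses consumption while the group rebalances, so lag compounds until the consumer rejoins and processing catches up, consistent with the generation 7/8/9 cascade and the natural recovery noted below.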
Impact Assessment:
- Ruled out infrastructure failure: both fraud-detection pods remained stable with zero restarts and no deployments during the incident
- Confirmed natural recovery: lag cleared completely by 10:34am once processing resumed
- Determined this was a demo scenario, not a production bug requiring immediate mitigation