When a Kafka consumer lag alert fired, this investigation determined whether the issue stemmed from infrastructure problems, message queue health, or application code, and traced the complete causal chain from symptom to root cause.
A Kafka consumer lag alert fired for the fraud-detection service processing orders. The goal: identify why lag spiked to 7,000+ records during business hours, determine whether this was an isolated incident or a recurring pattern, and find the root cause before it could impact fraud-detection accuracy.
What makes this hard?
Consumer lag could stem from infrastructure problems (pod restarts, network issues), message queue problems (partition rebalancing, broker issues), or application problems (slow processing, memory leaks, deadlocks). Each possibility requires checking a different system. Grafana shows the lag spike, but doesn't explain why processing slowed. Kubernetes shows pod health, but doesn't connect it to consumer behavior. Kafka metrics show timeouts, but don't reveal what's blocking poll() calls. Application logs show errors, but don't tie them to code changes. Manual investigation requires disconnected work:
- Query Grafana to establish lag timeline and severity patterns
- Check Kubernetes for pod restarts, deployments, or resource exhaustion
- Analyze Kafka metrics for partition status and consumer group health (a sample lag check is sketched after this list)
- Search application logs for timeout errors and processing failures
- Review git history to find consumer code changes
- Manually connect: lag spike → poll timeouts → rebalancing cycles → code changes → processing delays
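To make the Kafka step concrete, here is a minimal sketch of the kind of lag check an engineer would run by hand using Kafka's Java AdminClient. The broker address is an assumption; the fraud-detection group name comes from the incident.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

public class ConsumerLagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption: broker address
        try (AdminClient admin = AdminClient.create(props)) {
            // Committed offsets for the consumer group that alerted
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets("fraud-detection")
                         .partitionsToOffsetAndMetadata().get();

            // Log-end offsets for the same partitions
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
                    admin.listOffsets(committed.keySet().stream()
                            .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest())))
                         .all().get();

            // Lag per partition = log-end offset minus committed offset
            committed.forEach((tp, offset) -> {
                if (offset == null) return; // no committed offset yet for this partition
                System.out.printf("%s lag=%d%n", tp, latest.get(tp).offset() - offset.offset());
            });
        }
    }
}
```

The same numbers are what a lag exporter typically feeds into Grafana; the difficulty is not any single check, but stitching its output together with the Kubernetes, log, and git evidence above.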
How did Resolve AI help?
Resolve AI delivered a complete root cause analysis in minutes, eliminating hours of manual correlation across Grafana, Kubernetes, Kafka metrics, application logs, and git history:
Pattern Recognition:
- Established that this was a recurring daily pattern (spikes of 5,000-7,000 records every 24 hours), not an isolated incident
- Identified poll timeouts occurring at exact 5-minute intervals (10:15, 10:20, 10:25)
Root Cause:
- Traced the causal chain: processing delays → missed poll() calls within the 45-second session timeout → consumer group ejections → rebalancing cascades (generations 7, 8, 9); this failure mode is sketched after this list
- Found the specific code commit from April 30, 2024 that introduced intentional processing delays, with the stated purpose to "introduce a consumer side delay leading to a lag spike"
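For readers less familiar with this failure mode, here is a minimal sketch of how a per-record processing delay in a standard Java consumer loop starves poll() and gets the consumer ejected from the group. The broker address, topic name, batch size, and the exact timeout used (max.poll.interval.ms here; the investigation cites a 45-second session timeout) are illustrative assumptions, not the fraud-detection service's actual code.

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class SlowFraudConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumption: broker address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "fraud-detection");           // group from the incident
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // If the gap between poll() calls exceeds this bound, the client leaves the
        // group and the coordinator starts a rebalance, bumping the group generation.
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "45000");         // illustrative 45 s bound
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "500");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));                              // topic name is an assumption
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // An injected delay like this is enough to break the loop's timing:
                    // 500 records x 100 ms = 50 s between polls, past the 45 s bound above.
                    Thread.sleep(100);
                    score(record);
                }
                consumer.commitSync();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private static void score(ConsumerRecord<String, String> record) {
        // fraud-scoring logic would live here
    }
}
```

Each ejection pauses consumption while the group rebalances, so lag compounds until the consumer rejoins and processing catches up, consistent with the generation 7/8/9 cascade and the natural recovery noted below.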
Impact Assessment:
- Ruled out infrastructure failure: both fraud-detection pods remained stable with zero restarts and no deployments during the incident
- Confirmed natural recovery: lag cleared completely by 10:34am once processing resumed
- Determined this was a demo scenario, not a production bug requiring immediate mitigation