Understanding Kafka clusters, their infrastructure, and health

I needed to understand my Kafka cluster's complete architecture: producers, consumers, message flows, deployment configurations, and current health status. I wanted a comprehensive view that connects code implementation to production behavior so I could identify operational risks.

What makes this hard?

Your Kubernetes dashboard shows pod status and resource limits, but doesn't explain which services actually produce or consume from Kafka. GitHub shows that three services import Kafka libraries, but not which topics they use or how they're configured. Grafana shows latency spikes, but doesn't connect them to specific services or code patterns.

Understanding Kafka architecture requires disconnected work across multiple tools:

  • Read deployment YAMLs to find Kafka pods, memory limits, and service configurations
  • Search codebase for Kafka client usage across multiple languages (Go, C#, Kotlin)
  • Manually trace topic names, consumer groups, and producer configurations in code (see the sketch after this list for the kind of configuration involved)
  • Check dashboards for service health metrics and error rates over time
  • Query infrastructure metrics separately for CPU, memory, and network utilization
  • Manually connect: deployment specs → code implementations → topic flows → production health → security configurations
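To make that tracing step concrete, here is a minimal, hypothetical Go/Sarama consumer-group sketch of the kind of configuration that ends up buried in service code: broker address, topic name, consumer group ID, and offset policy. The real consumers in this system are C# and Kotlin, so this is only an illustration of the pattern, not any actual implementation; the constant values reuse names that appear in the findings later in this post purely for illustration.

```go
// Hypothetical illustration only: the real consumers are C# (Confluent.Kafka)
// and Kotlin, but the configuration you have to hunt for looks much the same.
package main

import (
	"context"
	"log"

	"github.com/IBM/sarama" // import path varies by Sarama version
)

// The details an engineer has to trace by hand, scattered across each service:
const (
	brokers       = "kafka:9092"      // ClusterIP service from the deployment YAML
	ordersTopic   = "orders"          // topic name, often a constant or env var
	consumerGroup = "fraud-detection" // consumer group ID
)

type handler struct{}

func (handler) Setup(sarama.ConsumerGroupSession) error   { return nil }
func (handler) Cleanup(sarama.ConsumerGroupSession) error { return nil }
func (handler) ConsumeClaim(sess sarama.ConsumerGroupSession, claim sarama.ConsumerGroupClaim) error {
	for msg := range claim.Messages() {
		log.Printf("order event: partition=%d offset=%d", msg.Partition, msg.Offset)
		sess.MarkMessage(msg, "")
	}
	return nil
}

func main() {
	cfg := sarama.NewConfig()
	cfg.Version = sarama.V2_1_0_0                      // consumer groups need protocol >= 0.10.2
	cfg.Consumer.Offsets.Initial = sarama.OffsetOldest // another detail hidden in code

	group, err := sarama.NewConsumerGroup([]string{brokers}, consumerGroup, cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer group.Close()

	for {
		if err := group.Consume(context.Background(), []string{ordersTopic}, handler{}); err != nil {
			log.Fatal(err)
		}
	}
}
```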

How did Resolve AI help?

With one query, Resolve AI simultaneously analyzed Kubernetes infrastructure and source code across three languages to map the complete Kafka ecosystem:

  • Identified the producer with code evidence: Checkout service (Go/Sarama AsyncProducer) publishes OrderResult protobuf to the orders topic from src/checkout/kafka/producer.go (the pattern is reconstructed in the sketch that follows this list)
  • Found both consumers with configurations: Accounting service (C#/Confluent.Kafka, group: accounting) and Fraud Detection (Kotlin/Apache Kafka clients v3.9.0, group: fraud-detection) both consuming from orders topic, with init containers waiting for Kafka health
  • Mapped cluster infrastructure: Single Kafka broker with 1500Mi memory, 400M heap, 16 partitions, running on ClusterIP at kafka:9092 with plaintext protocol
  • Analyzed production health from dashboards: discovered critical issues, including checkout-service P99 latency spikes of 15,000ms with a recurring daily pattern, a 100% error-rate spike at a specific timestamp, and a frontend outage starting at 08:15 UTC
  • Identified security vulnerabilities across code and infrastructure: no SASL authentication on any service, insecure gRPC via insecure.NewCredentials(), a hardcoded JWT key "signing-key-abc123" at line 695, and missing security contexts on all three deployments
  • Detected monitoring gaps: no Kafka-specific infrastructure metrics available and all CPU/memory/network queries timing out, preventing a full assessment of resource utilization
  • Generated a prioritized remediation plan: 19 recommendations ordered by severity, with effort estimates. Priority 1: enable Kafka SASL/TLS (2-3 days), fix hardcoded credentials (1 day), and add security contexts (1 day)
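Here is a minimal reconstruction of the flagged producer pattern, not the actual contents of src/checkout/kafka/producer.go: an AsyncProducer publishing to the orders topic with RequiredAcks set to NoResponse, which means the broker never confirms a write. The payload and helper names are placeholders.

```go
// Reconstruction for illustration; not the actual checkout-service code.
package main

import (
	"log"

	"github.com/IBM/sarama" // import path varies by Sarama version
)

func newOrdersProducer(brokers []string) (sarama.AsyncProducer, error) {
	cfg := sarama.NewConfig()

	// The flagged setting: the broker sends no acknowledgement at all, so the
	// producer cannot tell whether an order event was ever written to the topic.
	cfg.Producer.RequiredAcks = sarama.NoResponse

	// Local errors (e.g. broker unreachable) are still reported, but only if
	// something drains the Errors() channel; otherwise failures are silent.
	cfg.Producer.Return.Errors = true

	return sarama.NewAsyncProducer(brokers, cfg)
}

func main() {
	producer, err := newOrdersProducer([]string{"kafka:9092"})
	if err != nil {
		log.Fatal(err)
	}
	defer producer.Close()

	// Drain producer errors so failed sends at least show up in logs.
	go func() {
		for perr := range producer.Errors() {
			log.Printf("failed to publish order event: %v", perr.Err)
		}
	}()

	// Fire-and-forget publish of a serialized OrderResult payload (placeholder bytes here).
	producer.Input() <- &sarama.ProducerMessage{
		Topic: "orders",
		Value: sarama.ByteEncoder([]byte("serialized OrderResult protobuf")),
	}
}
```

Switching RequiredAcks to sarama.WaitForAll and monitoring the Errors() channel would trade a small amount of latency for end-to-end delivery confirmation.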

Resolve AI connected code-level implementation details (checkout uses a Sarama AsyncProducer with RequiredAcks: NoResponse) to infrastructure configuration (ClusterIP service, no network policies), to production symptoms (15s latency spikes, error-rate incidents), and to security posture (plaintext protocol, no authentication). Every finding included file citations and specific line numbers, creating a complete operational picture from a single question.
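On the client side, the Priority 1 recommendation might look like the following minimal sketch, assuming the broker is given a TLS listener with SASL/PLAIN enabled. The listener address (kafka:9093), the mechanism choice, and the environment-variable credential names are assumptions for illustration, not details from the analysis.

```go
// Sketch only: assumes the broker exposes a TLS listener with SASL/PLAIN enabled.
package main

import (
	"crypto/tls"
	"log"
	"os"

	"github.com/IBM/sarama" // import path varies by Sarama version
)

func newSecureConfig() *sarama.Config {
	cfg := sarama.NewConfig()

	// Encrypt traffic to the broker instead of using the current plaintext listener.
	cfg.Net.TLS.Enable = true
	cfg.Net.TLS.Config = &tls.Config{MinVersion: tls.VersionTLS12}

	// Authenticate with SASL/PLAIN; credentials come from the environment (or a
	// secret store) rather than being hardcoded in source.
	cfg.Net.SASL.Enable = true
	cfg.Net.SASL.Mechanism = sarama.SASLTypePlaintext
	cfg.Net.SASL.User = os.Getenv("KAFKA_USERNAME")
	cfg.Net.SASL.Password = os.Getenv("KAFKA_PASSWORD")

	cfg.Producer.RequiredAcks = sarama.WaitForAll // while here, require acknowledgements
	return cfg
}

func main() {
	// Hypothetical TLS listener address; the current service is plaintext kafka:9092.
	producer, err := sarama.NewAsyncProducer([]string{"kafka:9093"}, newSecureConfig())
	if err != nil {
		log.Fatal(err)
	}
	defer producer.Close()
}
```

The C# and Kotlin consumers would need equivalent SASL/TLS settings in their Confluent.Kafka and Apache Kafka client configurations for the cluster-wide change to hold.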
