Understanding Kafka clusters, their infrastructure, and health

I needed to understand my Kafka cluster's complete architecture: producers, consumers, message flows, deployment configurations, and current health status. I wanted a comprehensive view that connects code implementation to production behavior so I could identify operational risks.

What makes this hard?

Your Kubernetes dashboard shows pod status and resource limits, but doesn't explain which services actually produce or consume from Kafka. GitHub shows that three services import Kafka libraries, but not which topics they use or how they're configured. Grafana shows latency spikes, but doesn't connect them to specific services or code patterns.

Understanding Kafka architecture requires disconnected work across multiple tools:

  • Read deployment YAMLs to find Kafka pods, memory limits, and service configurations
  • Search codebase for Kafka client usage across multiple languages (Go, C#, Kotlin)
  • Manually trace topic names, consumer groups, and producer configurations in code (see the sketch after this list for the kind of configuration involved)
  • Check dashboards for service health metrics and error rates over time
  • Query infrastructure metrics separately for CPU, memory, and network utilization
  • Manually connect: deployment specs → code implementations → topic flows → production health → security configurations
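To make that tracing step concrete, here is a minimal, hypothetical Go/Sarama consumer-group sketch of the kind of configuration that ends up buried in service code: broker address, topic name, consumer group ID, and offset policy. The real consumers in this system are C# and Kotlin, so this is only an illustration of the pattern, not any actual implementation; the constant values reuse names that appear in the findings later in this post purely for illustration.

```go
// Hypothetical illustration only: the real consumers are C# (Confluent.Kafka)
// and Kotlin, but the configuration you have to hunt for looks much the same.
package main

import (
	"context"
	"log"

	"github.com/IBM/sarama" // import path varies by Sarama version
)

// The details an engineer has to trace by hand, scattered across each service:
const (
	brokers       = "kafka:9092"      // ClusterIP service from the deployment YAML
	ordersTopic   = "orders"          // topic name, often a constant or env var
	consumerGroup = "fraud-detection" // consumer group ID
)

type handler struct{}

func (handler) Setup(sarama.ConsumerGroupSession) error   { return nil }
func (handler) Cleanup(sarama.ConsumerGroupSession) error { return nil }
func (handler) ConsumeClaim(sess sarama.ConsumerGroupSession, claim sarama.ConsumerGroupClaim) error {
	for msg := range claim.Messages() {
		log.Printf("order event: partition=%d offset=%d", msg.Partition, msg.Offset)
		sess.MarkMessage(msg, "")
	}
	return nil
}

func main() {
	cfg := sarama.NewConfig()
	cfg.Version = sarama.V2_1_0_0                      // consumer groups need protocol >= 0.10.2
	cfg.Consumer.Offsets.Initial = sarama.OffsetOldest // another detail hidden in code

	group, err := sarama.NewConsumerGroup([]string{brokers}, consumerGroup, cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer group.Close()

	for {
		if err := group.Consume(context.Background(), []string{ordersTopic}, handler{}); err != nil {
			log.Fatal(err)
		}
	}
}
```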

How did Resolve AI help?

With one query, Resolve AI simultaneously analyzed Kubernetes infrastructure and source code across three languages to map the complete Kafka ecosystem:

  • Identified the producer with code evidence: Checkout service (Go/Sarama AsyncProducer) publishes OrderResult protobuf to the orders topic from src/checkout/kafka/producer.go (the pattern is reconstructed in the sketch that follows this list)
  • Found both consumers with configurations: Accounting service (C#/Confluent.Kafka, group: accounting) and Fraud Detection (Kotlin/Apache Kafka clients v3.9.0, group: fraud-detection) both consuming from orders topic, with init containers waiting for Kafka health
  • Mapped cluster infrastructure: Single Kafka broker with 1500Mi memory, 400M heap, 16 partitions, running on ClusterIP at kafka:9092 with plaintext protocol
  • Analyzed production health from dashboards: discovered critical issues, including checkout-service P99 latency spikes of 15,000ms with a recurring daily pattern, a 100% error-rate spike at a specific timestamp, and a frontend outage starting at 08:15 UTC
  • Identified security vulnerabilities across code and infrastructure: no SASL authentication on any service, insecure gRPC via insecure.NewCredentials(), a hardcoded JWT key "signing-key-abc123" at line 695, and missing security contexts on all three deployments
  • Detected monitoring gaps: no Kafka-specific infrastructure metrics available and all CPU/memory/network queries timing out, preventing a full assessment of resource utilization
  • Generated a prioritized remediation plan: 19 recommendations ordered by severity, with effort estimates. Priority 1: enable Kafka SASL/TLS (2-3 days), fix hardcoded credentials (1 day), and add security contexts (1 day)
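Here is a minimal reconstruction of the flagged producer pattern, not the actual contents of src/checkout/kafka/producer.go: an AsyncProducer publishing to the orders topic with RequiredAcks set to NoResponse, which means the broker never confirms a write. The payload and helper names are placeholders.

```go
// Reconstruction for illustration; not the actual checkout-service code.
package main

import (
	"log"

	"github.com/IBM/sarama" // import path varies by Sarama version
)

func newOrdersProducer(brokers []string) (sarama.AsyncProducer, error) {
	cfg := sarama.NewConfig()

	// The flagged setting: the broker sends no acknowledgement at all, so the
	// producer cannot tell whether an order event was ever written to the topic.
	cfg.Producer.RequiredAcks = sarama.NoResponse

	// Local errors (e.g. broker unreachable) are still reported, but only if
	// something drains the Errors() channel; otherwise failures are silent.
	cfg.Producer.Return.Errors = true

	return sarama.NewAsyncProducer(brokers, cfg)
}

func main() {
	producer, err := newOrdersProducer([]string{"kafka:9092"})
	if err != nil {
		log.Fatal(err)
	}
	defer producer.Close()

	// Drain producer errors so failed sends at least show up in logs.
	go func() {
		for perr := range producer.Errors() {
			log.Printf("failed to publish order event: %v", perr.Err)
		}
	}()

	// Fire-and-forget publish of a serialized OrderResult payload (placeholder bytes here).
	producer.Input() <- &sarama.ProducerMessage{
		Topic: "orders",
		Value: sarama.ByteEncoder([]byte("serialized OrderResult protobuf")),
	}
}
```

Switching RequiredAcks to sarama.WaitForAll and monitoring the Errors() channel would trade a small amount of latency for end-to-end delivery confirmation.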

Resolve AI connected code-level implementation details (checkout uses a Sarama AsyncProducer with RequiredAcks: NoResponse) to infrastructure configuration (ClusterIP service, no network policies), to production symptoms (15s latency spikes, error-rate incidents), and to security posture (plaintext protocol, no authentication). Every finding included file citations and specific line numbers, creating a complete operational picture from a single question.
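On the client side, the Priority 1 recommendation might look like the following minimal sketch, assuming the broker is given a TLS listener with SASL/PLAIN enabled. The listener address (kafka:9093), the mechanism choice, and the environment-variable credential names are assumptions for illustration, not details from the analysis.

```go
// Sketch only: assumes the broker exposes a TLS listener with SASL/PLAIN enabled.
package main

import (
	"crypto/tls"
	"log"
	"os"

	"github.com/IBM/sarama" // import path varies by Sarama version
)

func newSecureConfig() *sarama.Config {
	cfg := sarama.NewConfig()

	// Encrypt traffic to the broker instead of using the current plaintext listener.
	cfg.Net.TLS.Enable = true
	cfg.Net.TLS.Config = &tls.Config{MinVersion: tls.VersionTLS12}

	// Authenticate with SASL/PLAIN; credentials come from the environment (or a
	// secret store) rather than being hardcoded in source.
	cfg.Net.SASL.Enable = true
	cfg.Net.SASL.Mechanism = sarama.SASLTypePlaintext
	cfg.Net.SASL.User = os.Getenv("KAFKA_USERNAME")
	cfg.Net.SASL.Password = os.Getenv("KAFKA_PASSWORD")

	cfg.Producer.RequiredAcks = sarama.WaitForAll // while here, require acknowledgements
	return cfg
}

func main() {
	// Hypothetical TLS listener address; the current service is plaintext kafka:9092.
	producer, err := sarama.NewAsyncProducer([]string{"kafka:9093"}, newSecureConfig())
	if err != nil {
		log.Fatal(err)
	}
	defer producer.Close()
}
```

The C# and Kotlin consumers would need equivalent SASL/TLS settings in their Confluent.Kafka and Apache Kafka client configurations for the cluster-wide change to hold.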
