Understanding k8s cluster infrastructure and resource allocation

I needed to map our entire Kubernetes cluster—namespaces, deployments, pod distributions, resource allocations, node topology, and networking controllers. The goal: create a comprehensive infrastructure overview to understand capacity, identify resource patterns, and visualize the AWS deployment architecture.

What makes this hard?

kubectl shows you pods and deployments, but requires separate commands for each namespace and doesn't aggregate resource totals. The AWS console shows you EC2 nodes and availability zones, but doesn't connect them to Kubernetes workloads. Your monitoring dashboards show CPU metrics, but don't label them by node or map them to infrastructure topology. Building a complete infrastructure picture requires fragmented queries across multiple interfaces (the raw commands are sketched after this list):

  • Run kubectl get namespaces, then query each namespace individually for deployments
  • Execute kubectl describe for each deployment to find resource limits and replica counts
  • Check kubectl get nodes for node details, then parse labels for region and AZ placement
  • Query daemonsets separately to understand which system controllers run where
  • Search for metrics with node identifiers to find CPU/memory utilization
  • Manually connect: namespace resources → node capacity → AWS topology → system controllers → utilization metrics
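
For concreteness, here is a minimal sketch of that manual workflow using plain kubectl; the <namespace> and <deployment> placeholders have to be filled in and repeated for every namespace and deployment:

    # Each command answers one slice of the question; nothing aggregates the results.
    kubectl get namespaces
    kubectl get deployments -n <namespace>                    # repeat for every namespace
    kubectl describe deployment <deployment> -n <namespace>   # replicas, requests, limits
    kubectl get nodes -L topology.kubernetes.io/zone,kubernetes.io/arch
    kubectl get daemonsets -A                                  # which system controllers exist in each namespace
    # Utilization metrics and the AWS topology still have to be correlated by hand.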

How did Resolve AI help?

With one query, Resolve AI navigated the infrastructure graph to build a complete cluster overview with resource breakdowns:

  • Mapped cluster structure across 6 namespaces, 27 deployments, and 57 pods: ecommerce-app (21 deployments, 44 pods), kube-system (2 deployments, 7 pods), plus the satellite, karpenter, infra-system, and amazon-guardduty system namespaces
  • Analyzed resource allocation for the 21 ecommerce-app deployments: 9.2Gi of memory allocated in total, with no CPU requests or limits specified anywhere; the highest consumers are kafka and load-generator (1500Mi each), followed by the cart service (2 replicas, 1Gi per pod) and fraud-detection (2 replicas, 750Mi per pod), spot-checked with the commands sketched after this list
  • Identified node topology across AWS us-east-2: 7 nodes spanning 3 availability zones—us-east-2a (3 nodes: 2 Fargate + 1 EC2 r7g.xlarge ARM64), us-east-2b (1 Fargate), us-east-2c (3 Fargate)
  • Discovered networking and storage controllers: the AWS VPC CNI plugin (aws-node daemonset) configured with warm-ip-target=1, and the AWS EBS CSI driver (ebs-csi-controller with 2 replicas plus the ebs-csi-node daemonset) for dynamic volume provisioning
  • Found a critical architecture pattern: the system daemonsets (aws-node, ebs-csi-node, kube-proxy) run only on the single EC2 node, because the 6 Fargate nodes can't host traditional daemonsets given their serverless nature; the result is a hybrid model in which the EC2 node handles system-level operations
  • Identified a monitoring gap: node-level CPU metrics were unavailable despite repeated queries, because the metrics lack node hostname labels and the EC2 node is only 7 hours old, so two days of historical data simply don't exist
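
These findings can be spot-checked by hand. The sketch below assumes the ecommerce-app namespace named above and the standard EKS node labels (Fargate nodes carry eks.amazonaws.com/compute-type; EC2 nodes leave that column empty):

    # Per-pod memory and CPU requests in the application namespace (<none> means not set).
    kubectl get pods -n ecommerce-app \
      -o custom-columns='POD:.metadata.name,MEM_REQ:.spec.containers[*].resources.requests.memory,CPU_REQ:.spec.containers[*].resources.requests.cpu'

    # Node topology: compute type (Fargate vs. EC2), instance type, zone, and CPU architecture.
    kubectl get nodes -L eks.amazonaws.com/compute-type,node.kubernetes.io/instance-type,topology.kubernetes.io/zone,kubernetes.io/arch

    # Which nodes the system daemonset pods actually landed on.
    kubectl get pods -n kube-system -o wide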

Resolve AI also generated a Mermaid diagram showing the AWS region containing three availability zones, each with its nodes labeled by type (Fargate/EC2), architecture (amd64/arm64), and system daemonsets. The investigation revealed that no CPU limits are set cluster-wide (a potential resource contention risk) and that the single EC2 node is a single point of failure for system-level networking and storage operations.
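
The cluster-wide "no CPU limits" finding is straightforward to reproduce with a single query; this is a rough sketch rather than the exact query Resolve AI ran:

    # CPU limits for every deployment in every namespace; <none> means no limit is set.
    kubectl get deployments -A \
      -o custom-columns='NAMESPACE:.metadata.namespace,DEPLOYMENT:.metadata.name,CPU_LIMIT:.spec.template.spec.containers[*].resources.limits.cpu'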
