
Optimizing metrics cardinality across system components

Our metrics platform was storing 1.5M time series, driving up costs and slowing queries. We needed to identify the biggest cardinality contributors, understand why each metric generated excessive series through code analysis, and implement adaptive metrics rules to reduce storage without losing observability.

What makes this hard?

Your metrics platform shows total cardinality numbers, but doesn't break down which specific metrics or labels are the worst offenders. Code repositories show instrumentation, but don't explain the cardinality impact of each label. Existing cardinality management rules are scattered across configuration files with no view of what's protected versus exposed. Optimizing cardinality requires manual investigation across disconnected systems:

  • Query metrics database for cardinality by metric name and aggregate counts
  • Calculate label cardinality separately for each high-cardinality metric
  • Search codebase to find where metrics are instrumented and why labels were added
  • Review adaptive metrics rules to understand current cardinality management
  • Estimate reduction impact by manually calculating label combinations
  • Manually connect: metric series counts → label contributions → code usage → optimization rules → expected impact
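The first two steps above amount to grouping series by metric name and counting distinct values per label. A minimal Python sketch, assuming the series label sets have already been exported as dictionaries with the metric name stored under __name__ (the sample data and function names are hypothetical, not taken from the platform described here):

```python
from collections import Counter, defaultdict

# Hypothetical sample: one dict of labels per time series.
series = [
    {"__name__": "http_server_duration_milliseconds_bucket", "net_host_name": "pod-a", "instance": "i-1"},
    {"__name__": "http_server_duration_milliseconds_bucket", "net_host_name": "pod-b", "instance": "i-1"},
    {"__name__": "http_server_duration_milliseconds_bucket", "net_host_name": "pod-c", "instance": "i-2"},
    {"__name__": "grpc_server_latency_bucket", "orgId": "org-1", "method": "Get"},
]


def series_per_metric(series):
    """Count how many series each metric name contributes."""
    return Counter(s["__name__"] for s in series)


def label_cardinality(series, metric):
    """Count distinct values per label for a single metric."""
    values = defaultdict(set)
    for s in series:
        if s["__name__"] != metric:
            continue
        for label, value in s.items():
            if label != "__name__":
                values[label].add(value)
    return {label: len(seen) for label, seen in values.items()}


counts = series_per_metric(series)
labels = label_cardinality(series, "http_server_duration_milliseconds_bucket")

# Worst offenders first, then the per-label breakdown for one of them.
print(counts.most_common())
print(labels)
```

Even with this in hand, the remaining steps (finding the instrumentation code and the existing rules) still have to be done by hand in other systems.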

How did Resolve AI help?

With one query, Resolve AI analyzed metrics cardinality, examined instrumentation code, and cross-referenced existing rules to generate optimization recommendations:

  • Identified top 5 cardinality offenders from 1.5M total series: traces_spanmetrics_latency_bucket (238,845 series), grpc_server_latency_bucket (40,346 series), http_server_duration_milliseconds_bucket (37,665 series), graph_ingestion_replay_queue_latency_ms_bucket (36,585 series), http_client_duration_milliseconds_bucket (34,720 series)
  • Analyzed top contributing labels per metric with cardinality breakdown: traces_spanmetrics_latency has span_name (160 values), __metrics_gen_instance (96 values), job (23 values); http_server_duration has net_host_name (612 values), satelliteGroupId (404 values), instance (231 values); grpc_server_latency has orgId (164 values), method (86 values), method2 (85 values)
  • Traced metrics to source code and explained the high cardinality: __metrics_gen_instance comes from the OpenTelemetry Collector's spanmetrics connector in otelcol-config.yml, which generates a unique identifier per collector instance (96 instances running); net_host_name is added by Kubernetes per-pod instrumentation, creating 612 unique hostnames; method2 is a redundant Grafana dashboard workaround that duplicates the method label
  • Discovered a gap in existing adaptive metrics coverage: 54 rules are configured, but none covers the top 5 highest-cardinality metrics. Existing rules aggressively drop 11 labels from rpc_server_duration_milliseconds_bucket but miss the 388,161 series (26% of total cardinality) from the top offenders
  • Calculated reduction impact for proposed rules: dropping __metrics_gen_instance and instance from traces_spanmetrics_latency_bucket would reduce 238,845 → ~2,500 series (a 95x reduction); dropping net_host_name (612 values) from http_server_duration_milliseconds_bucket would eliminate the worst single-label contributor
  • Generated specific adaptive metrics rule configurations: five new rules with drop_labels arrays and sum:counter aggregations, following existing system patterns and prioritized by impact, with an expected total reduction from 1.5M → ~1.1M series (26% decrease)
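The impact estimate and rule generation above can be sketched roughly as follows. This is an illustration under stated assumptions, not the rules Resolve AI produced: the sample series are made up, the helper names are hypothetical, and the rule shape (metric, drop_labels, aggregations) is assumed from the drop_labels and sum:counter wording above rather than copied from a real configuration.

```python
import json


def series_after_drop(series, metric, drop_labels):
    """Estimate remaining series for a metric once the given labels are dropped.

    Series that differ only in dropped labels collapse into one.
    """
    remaining = set()
    for s in series:
        if s["__name__"] != metric:
            continue
        kept = tuple(sorted((k, v) for k, v in s.items() if k not in drop_labels))
        remaining.add(kept)
    return len(remaining)


def build_rule(metric, drop_labels):
    """Build one rule dict; the field names are an assumed schema."""
    return {
        "metric": metric,
        "drop_labels": sorted(drop_labels),
        "aggregations": ["sum:counter"],
    }


METRIC = "traces_spanmetrics_latency_bucket"

# Hypothetical sample: 2 span names reported by 2 collector instances.
series = [
    {"__name__": METRIC, "span_name": "GET /a", "__metrics_gen_instance": "c1", "instance": "i1"},
    {"__name__": METRIC, "span_name": "GET /a", "__metrics_gen_instance": "c2", "instance": "i2"},
    {"__name__": METRIC, "span_name": "GET /b", "__metrics_gen_instance": "c1", "instance": "i1"},
    {"__name__": METRIC, "span_name": "GET /b", "__metrics_gen_instance": "c2", "instance": "i2"},
]

drop = {"__metrics_gen_instance", "instance"}
before = series_after_drop(series, METRIC, set())
after = series_after_drop(series, METRIC, drop)
rule = build_rule(METRIC, drop)

print(f"{before} -> {after} series")
print(json.dumps(rule, indent=2))
```

Counting distinct label combinations after the drop, rather than dividing by each label's value count, avoids overstating the reduction when dropped labels are correlated with each other.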

Resolve AI connected cardinality data (__metrics_gen_instance: 96 values contributing 252,414 series) to code evidence (spanmetrics connector generates instance identifiers for 96 distributed collectors) to optimization strategy (drop internal tracking labels unused in dashboards). The double underscore prefix (__) indicated the label was meant for internal use, not observability, a nuance that justified aggressive removal without breaking monitoring.
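The prefix heuristic can be expressed in a few lines. This is a sketch only: the function name is hypothetical, the label counts are the traces_spanmetrics_latency figures quoted above, and it covers just the naming signal. Confirming that a label is unused in dashboards is a separate check that is not shown here.

```python
def internal_label_candidates(label_cardinality, reserved=("__name__",)):
    """Return double-underscore labels, highest cardinality first.

    Reserved names such as __name__ identify the metric itself and are
    never treated as drop candidates.
    """
    candidates = [
        name
        for name in label_cardinality
        if name.startswith("__") and name not in reserved
    ]
    return sorted(candidates, key=lambda name: label_cardinality[name], reverse=True)


# Distinct values per label, as reported for traces_spanmetrics_latency.
spanmetrics_labels = {
    "span_name": 160,
    "__metrics_gen_instance": 96,
    "job": 23,
}

candidates = internal_label_candidates(spanmetrics_labels)
print(candidates)
```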
