Designing a multi-tenant rate limiter with production context

What makes this hard?

You can read the code to understand service architecture, but that doesn't tell you how the system behaves under load. GitHub shows you that checkout calls 11 downstream services, but not that it takes 4-6 seconds in production. Your monitoring shows database connection exhaustion, but doesn't tie it to the API endpoints that cause it.

Designing rate limits in a vacuum means guessing at quotas. Set them too strict and you block legitimate users; set them too loose and you miss abuse. Manual investigation requires disconnected work across multiple tools:

Read code to map service dependencies and fan-out patterns
Check APM traces to measure actual latency and span counts
Analyze logs to find abuse signals and failure patterns
Query metrics to understand request distribution and load patterns
Calculate appropriate limits by correlating all this evidence
Manually connect: code structure → production behavior → abuse patterns → quota recommendations

How did Resolve AI help?

With one request, Resolve AI queried code, traces, logs, and metrics simultaneously to build an evidence-based design:

Analyzed codebase structure: 14 files showing checkout triggers 11 downstream services, Kafka message flows, JWT-based tenant identification
Measured production reality: Tempo traces revealed 48-56 spans per checkout, 4-6s latency, with Kafka introducing retry behavior
Identified abuse patterns: Loki logs showed 178 SSL failures from a single IP, database connection pool exhaustion, and a 1,700% traffic spike on the metrics API
Discovered load amplification: Product-catalog called by multiple services in parallel during checkout flow
Analyzed actual traffic distribution: 3,848 requests over 1 hour (37% products, 25% recommendations, 12% checkout)
Derived specific quotas from evidence: Free tier checkout 5/min (vs observed 478/hr), products 60/min (matching measured 1,432/hr), per-IP limits after seeing single-IP SSL failures
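
To make the derived quotas concrete, here is a minimal sketch of a per-tenant, per-endpoint limiter in Python. It is not the design Resolve AI produced. It is an in-memory fixed-window counter that only illustrates how the free tier figures above (checkout 5/min, products 60/min) could be enforced. The class name, the tenant IDs, and the choice to leave unlisted endpoints unlimited are assumptions made for illustration.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple

# Per-minute free tier quotas, taken from the figures quoted above.
FREE_TIER_QUOTAS: Dict[str, int] = {"checkout": 5, "products": 60}


@dataclass
class FixedWindowLimiter:
    """In-memory fixed-window counter keyed by (tenant, endpoint).

    Hypothetical helper: a real deployment would keep these counters in a
    store shared by all API instances rather than in process memory.
    """
    quotas: Dict[str, int]
    window_seconds: int = 60
    clock: Callable[[], float] = time.monotonic  # injectable for tests
    _counts: Dict[Tuple[str, str], Tuple[int, int]] = field(
        default_factory=dict, init=False, repr=False
    )

    def allow(self, tenant_id: str, endpoint: str) -> bool:
        limit = self.quotas.get(endpoint)
        if limit is None:
            return True  # assumption: endpoints without a quota are unlimited
        window = int(self.clock() // self.window_seconds)
        key = (tenant_id, endpoint)
        seen_window, used = self._counts.get(key, (window, 0))
        if seen_window != window:
            used = 0  # a new window has started, so reset the counter
        if used >= limit:
            return False
        self._counts[key] = (window, used + 1)
        return True
```

Each tenant gets its own counter per endpoint, so one tenant exhausting its checkout quota does not affect another. A fixed window allows short bursts at window boundaries; a sliding window or token bucket would smooth that at the cost of more state.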

Resolve AI connected code-level understanding (checkout has 11 downstream calls) to production evidence (4-6s Kafka delays cause retries) to generate specific protections (idempotency keys required with 300s deduplication window). Every rate limit was justified by measured behavior, not theoretical best practices.
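
The idempotency protection can be sketched as follows. This is an illustrative in-memory version, assuming a single process and a hypothetical `IdempotencyStore` class. Only the 300s deduplication window comes from the design described above. A retry that arrives inside the window gets the stored result back instead of triggering the checkout fan-out a second time.

```python
import time
from typing import Callable, Dict, Tuple

DEDUP_WINDOW_SECONDS = 300  # deduplication window quoted above


class IdempotencyStore:
    """Remembers results by (tenant, idempotency key) for a fixed TTL.

    Hypothetical sketch: not thread-safe, and expired entries are only
    overwritten, never purged.
    """

    def __init__(
        self,
        ttl_seconds: float = DEDUP_WINDOW_SECONDS,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._seen: Dict[Tuple[str, str], Tuple[float, object]] = {}

    def run_once(self, tenant_id: str, key: str, handler: Callable[[], object]) -> object:
        now = self._clock()
        entry = self._seen.get((tenant_id, key))
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # retry inside the window: replay stored result
        result = handler()  # first sighting, or the window has expired
        self._seen[(tenant_id, key)] = (now, result)
        return result
```

Scoping the key by tenant keeps two tenants that happen to send the same key from seeing each other's results.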

The design doc included implementation-ready Lua scripts, rollout phases with validation queries, and cost-based quotas derived from actual fan-out patterns. All grounded in how this specific system behaves in production.
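
One way to read "cost-based quotas derived from fan-out" is to charge each request a cost that reflects the downstream work it triggers. The sketch below is an assumption about how that could look, not the content of the design doc. Checkout's fan-out of 11 comes from the analysis above. The `1 + fan-out` formula, the weight for products, and the bucket capacity and refill rate are hypothetical placeholders.

```python
import time
from typing import Callable, Dict

# Cost = 1 for the request itself plus one unit per downstream call.
# Checkout's 11 downstream services come from the analysis above; the
# products weight is a hypothetical placeholder.
ENDPOINT_COST: Dict[str, int] = {"checkout": 1 + 11, "products": 1}


class CostBucket:
    """Token bucket for one tenant, where each request spends its cost."""

    def __init__(
        self,
        capacity: float,
        refill_per_second: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self.tokens = float(capacity)  # start with a full budget
        self.updated = clock()

    def try_spend(self, endpoint: str) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated = now
        cost = ENDPOINT_COST.get(endpoint, 1)  # assumption: unknown endpoints cost 1
        if cost > self.tokens:
            return False
        self.tokens -= cost
        return True
```

With a shared budget, an expensive endpoint drains a tenant's allowance faster than a cheap one, so the limit tracks load on downstream services rather than raw request counts.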
