
Designing a multi-tenant rate limiter with production context

To design a new multi-tenant rate limiter for our ecommerce platform, I wanted to analyze the codebase and create implementation-ready quotas that would block abuse without impacting legitimate users.

What makes this hard?

You can read the code to understand service architecture, but that doesn't tell you how the system behaves under load. GitHub shows you that checkout calls 11 downstream services, but not that it takes 4-6 seconds in production. Your monitoring shows database connection exhaustion, but doesn't connect it to which API endpoints cause the problem.

Designing rate limits in a vacuum means guessing at quotas: too strict and you block legitimate users, too loose and you miss abuse. Manual investigation requires disconnected work across multiple tools:

  • Read code to map service dependencies and fan-out patterns
  • Check APM traces to measure actual latency and span counts
  • Analyze logs to find abuse signals and failure patterns
  • Query metrics to understand request distribution and load patterns
  • Calculate appropriate limits by correlating all this evidence
  • Manually connect: code structure → production behavior → abuse patterns → quota recommendations

How did Resolve AI help?

With one request, Resolve AI queried code, traces, logs, and metrics simultaneously to build an evidence-based design:

  • Analyzed codebase structure: 14 files showing that checkout triggers 11 downstream services, along with Kafka message flows and JWT-based tenant identification
  • Measured production reality: Tempo traces revealed 48-56 spans per checkout, 4-6s latency, with Kafka introducing retry behavior
  • Identified abuse patterns: Loki logs showed 178 SSL failures from a single IP, database connection pool exhaustion, and a 1,700% traffic spike on the metrics API
  • Discovered load amplification: Product-catalog called by multiple services in parallel during checkout flow
  • Analyzed actual traffic distribution: 3,848 requests over 1 hour, split 37% products, 25% recommendations, 12% checkout
  • Derived specific quotas from evidence: Free tier checkout 5/min (vs observed 478/hr), products 60/min (matching measured 1,432/hr), per-IP limits after seeing single-IP SSL failures
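To make the traffic figures above easier to compare with per-minute quotas, here is a minimal Python sketch that converts the observed hourly shares into average per-minute rates. The request total and percentages come from the bullets above; the helper name is hypothetical, and the sketch assumes the hourly numbers are platform-wide totals rather than per-tenant counts.

```python
# Figures taken from the traffic distribution above (1-hour sample).
TOTAL_REQUESTS_PER_HOUR = 3848
ENDPOINT_SHARE = {
    "products": 0.37,
    "recommendations": 0.25,
    "checkout": 0.12,
}


def observed_per_minute(total_per_hour: int, share: float) -> float:
    """Average requests per minute for one endpoint (hypothetical helper)."""
    return total_per_hour * share / 60


# Shares are rounded percentages, so these are approximations of the
# measured hourly counts, not exact reproductions.
rates = {
    name: round(observed_per_minute(TOTAL_REQUESTS_PER_HOUR, share), 1)
    for name, share in ENDPOINT_SHARE.items()
}

for name, per_minute in rates.items():
    print(f"{name}: ~{per_minute} requests/min on average")
```

Averages like these hide bursts, so they are a starting point for choosing a quota rather than the quota itself.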

Resolve AI connected code-level understanding (checkout has 11 downstream calls) to production evidence (4-6s Kafka delays cause retries) to generate specific protections (idempotency keys required with 300s deduplication window). Every rate limit was justified by measured behavior, not theoretical best practices.
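The idempotency protection mentioned above can be sketched roughly as follows. This is an illustrative in-memory Python version, not the design doc's implementation: the class and method names are hypothetical, only the 300-second window comes from the text, and a real multi-instance deployment would need a shared store and concurrency control.

```python
import time
from typing import Any, Callable, Dict, Tuple


class IdempotencyStore:
    """Remembers results by idempotency key for a fixed window (hypothetical)."""

    def __init__(self, window_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._window = window_seconds
        self._clock = clock
        # key -> (expiry time, stored result)
        self._seen: Dict[str, Tuple[float, Any]] = {}

    def run_once(self, key: str, handler: Callable[[], Any]) -> Tuple[Any, bool]:
        """Run handler unless key was seen within the window.

        Returns (result, was_duplicate).
        """
        now = self._clock()
        # Drop entries whose deduplication window has passed.
        self._seen = {k: v for k, v in self._seen.items() if v[0] > now}
        if key in self._seen:
            return self._seen[key][1], True
        result = handler()
        self._seen[key] = (now + self._window, result)
        return result, False
```

With this shape, a client retry triggered by a slow checkout reuses the stored result instead of fanning out to the downstream services a second time.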

The design doc included implementation-ready Lua scripts, rollout phases with validation queries, and cost-based quotas derived from actual fan-out patterns, all grounded in how this specific system behaves in production.
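The Lua scripts themselves are not reproduced here. As a rough illustration of what a per-tenant, cost-aware limiter can look like, the following Python sketch uses a token bucket, which is one common technique and an assumption on our part; the source does not say which algorithm the scripts use. All names are hypothetical, and the `cost` parameter is a stand-in for weighting expensive endpoints by their fan-out.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Bucket:
    tokens: float
    updated_at: float


class TenantRateLimiter:
    """In-process token bucket keyed by (tenant, endpoint). Illustrative only."""

    def __init__(self, capacity: float, refill_per_second: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._capacity = capacity
        self._refill = refill_per_second
        self._clock = clock
        self._buckets: Dict[Tuple[str, str], Bucket] = {}

    def allow(self, tenant_id: str, endpoint: str, cost: float = 1.0) -> bool:
        """Spend `cost` tokens if available; return whether the request passes."""
        now = self._clock()
        key = (tenant_id, endpoint)
        bucket = self._buckets.get(key)
        if bucket is None:
            # New tenants start with a full bucket.
            bucket = Bucket(tokens=self._capacity, updated_at=now)
            self._buckets[key] = bucket
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = now - bucket.updated_at
        bucket.tokens = min(self._capacity, bucket.tokens + elapsed * self._refill)
        bucket.updated_at = now
        if bucket.tokens >= cost:
            bucket.tokens -= cost
            return True
        return False


# Example: the free-tier checkout quota of 5/min from the design above.
free_checkout = TenantRateLimiter(capacity=5, refill_per_second=5 / 60)
```

A shared deployment would keep the bucket state in a central store so every instance sees the same counts, which is presumably the role the design doc's scripts play.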
