This document describes the Redis-backed rate limiting system for controlling API usage.
The rate limiting system enforces global usage limits using a Redis-backed sliding window algorithm:
- Requests per minute
- Requests per hour
```
┌─────────────────┐      ┌─────────────────────┐      ┌──────────────────┐
│   API Request   │─────▶│ RateLimitMiddleware │─────▶│ RedisRateLimiter │
│                 │      │                     │      │  (Redis + Lua)   │
└─────────────────┘      └─────────────────────┘      └──────────────────┘
```
| Component | File | Description |
|---|---|---|
| `RateLimitMiddleware` | `ratelimit/middleware.py` | FastAPI middleware for enforcement |
| `RedisRateLimiter` | `ratelimit/middleware.py` | Redis sliding window limiter using Lua scripts |
```bash
# Redis backend (required)
RATE_LIMIT_REDIS_URL=redis://localhost:6379/0
RATE_LIMIT_REDIS_TIMEOUT_MS=200
RATE_LIMIT_KEY_PREFIX=lightspeed:ratelimit

# Global rate limits
RATE_LIMIT_REQUESTS_PER_MINUTE=60
RATE_LIMIT_REQUESTS_PER_HOUR=1000
```

Only specific paths are rate-limited:
| Path | Description |
|---|---|
| `/` | A2A JSON-RPC endpoint (supports both send and streaming) |
Rate limits are evaluated across multiple principal dimensions:
- `order_id` (tenant/subscription boundary)
- `user_id` (or `client_id` if `user_id` is unavailable)
- IP fallback only when no authenticated principal is available
If both `order_id` and `user_id` are present, the request must pass both checks. If either dimension exceeds its configured limit, the request is rejected with HTTP 429.
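The "must pass both" rule can be sketched as a small predicate. This is an illustrative condensation with hypothetical names, not the actual implementation; in the real system the equivalent check runs atomically inside the Lua script.

```python
# Hypothetical sketch of the multi-dimension decision: a request is
# allowed only if every present principal dimension is under its limit.

def allow_request(counts: dict, limits: dict) -> bool:
    """counts: current in-window requests per dimension;
    limits: configured maximum per dimension."""
    return all(counts[dim] < limits[dim] for dim in counts)

# Both order_id and user_id present: both checks must pass.
print(allow_request({"order_id": 10, "user_id": 59},
                    {"order_id": 60, "user_id": 60}))  # True
print(allow_request({"order_id": 10, "user_id": 60},
                    {"order_id": 60, "user_id": 60}))  # False
```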
These paths are never rate-limited:
- `/health`, `/healthz`, `/ready` - Health checks
- `/metrics` - Prometheus metrics
- `/.well-known/agent.json` - Agent card
- `/docs`, `/openapi.json`, `/redoc` - Documentation
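A minimal sketch of the path gate, assuming a simple set-membership check (the helper name and exact matching logic in `ratelimit/middleware.py` may differ):

```python
# Hypothetical mirror of the middleware's path exemption logic.
EXEMPT_PATHS = {
    "/health", "/healthz", "/ready",      # health checks
    "/metrics",                           # Prometheus metrics
    "/.well-known/agent.json",            # agent card
    "/docs", "/openapi.json", "/redoc",   # documentation
}
RATE_LIMITED_PATHS = {"/"}                # A2A JSON-RPC endpoint

def should_rate_limit(path: str) -> bool:
    return path not in EXEMPT_PATHS and path in RATE_LIMITED_PATHS

print(should_rate_limit("/"))         # True
print(should_rate_limit("/metrics"))  # False
```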
When a request is rate-limited (429 response):
| Header | Description |
|---|---|
| `Retry-After` | Seconds until the limit resets |
| `X-RateLimit-Limit` | The limit per minute |
| `X-RateLimit-Remaining` | Remaining requests |
Example response:
```http
HTTP/1.1 429 Too Many Requests
Retry-After: 60
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 0
Content-Type: application/json

{
  "error": "rate_limit_exceeded",
  "message": "Rate limit exceeded (per_minute)",
  "retry_after": 60
}
```

The rate limiter uses an atomic Redis + Lua sliding window algorithm:
- For each principal dimension (for example `order_id` and `user_id`), Redis keeps two sorted sets:
  - a minute window key (`:m`)
  - an hour window key (`:h`)
- Before checking limits, old entries are removed from each set with `ZREMRANGEBYSCORE` so only in-window requests remain.
- Redis counts current in-window requests with `ZCARD` and compares them to configured limits.
- If any dimension is already at/over the limit, the script returns `429` metadata (including `Retry-After`) and does not record the new request.
- If all dimensions are under limits, the script records the new request with `ZADD` and updates key expiry with `PEXPIRE`.
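The window logic above can be illustrated with a pure-Python, in-memory stand-in, where a plain list of timestamps plays the role of the Redis sorted set. This is a sketch of the algorithm only; the real check runs atomically as a Lua script inside Redis, so no two requests can interleave between the count and the record steps.

```python
import time
from collections import defaultdict

class SlidingWindowSim:
    """In-memory illustration of the Redis/Lua sliding window.

    Each (dimension, window) pair keeps a list of request timestamps,
    standing in for the sorted-set keys (:m and :h)."""

    def __init__(self, limits):
        # limits: {window_name: (window_seconds, max_requests)}
        self.limits = limits
        self.entries = defaultdict(list)

    def check(self, dimensions, now=None):
        now = time.time() if now is None else now
        # 1) Trim old entries (ZREMRANGEBYSCORE) and count (ZCARD).
        for dim in dimensions:
            for name, (seconds, limit) in self.limits.items():
                key = (dim, name)
                self.entries[key] = [t for t in self.entries[key] if t > now - seconds]
                if len(self.entries[key]) >= limit:
                    # 2) At/over the limit: reject WITHOUT recording (no ZADD).
                    retry_after = int(self.entries[key][0] + seconds - now) + 1
                    return False, retry_after
        # 3) Under all limits: record the request (ZADD + PEXPIRE).
        for dim in dimensions:
            for name in self.limits:
                self.entries[(dim, name)].append(now)
        return True, 0

limiter = SlidingWindowSim({"per_minute": (60, 3)})
for _ in range(3):
    assert limiter.check(["order:42"], now=100.0)[0]
allowed, retry_after = limiter.check(["order:42"], now=100.0)
print(allowed, retry_after)  # False 61
```

Note that rejected requests are not recorded, so a client hammering the endpoint does not push its own reset time further into the future.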
1. Request arrives
2. Middleware checks if path should be rate-limited
3. RedisRateLimiter executes an atomic Lua script in Redis
4. If within limits:
- Record timestamp
- Allow request
5. If exceeded:
- Return 429 Too Many Requests
- Include Retry-After header
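The five steps above can be condensed into one decision function. The names below are hypothetical (the real code lives in `ratelimit/middleware.py` and talks to Redis rather than a stub):

```python
# Hypothetical condensation of the middleware request flow.

def handle(path, limiter, dimensions):
    """Return (status_code, headers) for an incoming request."""
    if path != "/":                                 # step 2: only the A2A endpoint is limited
        return 200, {}
    allowed, retry_after = limiter(dimensions)      # step 3: atomic Lua check in Redis
    if allowed:                                     # step 4: timestamp already recorded
        return 200, {}
    return 429, {"Retry-After": str(retry_after)}   # step 5: reject with retry hint

# Stub limiter that always rejects with a 60 s retry hint.
status, headers = handle("/", lambda dims: (False, 60), ["user:alice"])
print(status, headers)  # 429 {'Retry-After': '60'}
```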
```bash
# Make 70 requests quickly (default limit is 60/min)
for i in {1..70}; do
  echo -n "Request $i: "
  curl -s -o /dev/null -w "%{http_code}\n" \
    -X POST http://localhost:8000/ \
    -H "Content-Type: application/json" \
    -d '{"jsonrpc":"2.0","method":"message/send","id":'$i',"params":{"message":{"role":"user","parts":[{"type":"text","text":"test"}]}}}'
done
```

You should see 429 responses after the first 60 requests.
When the agent is deployed on Cloud Run, use your service URL and include a Bearer token (production typically requires authentication):
```bash
SERVICE_URL="https://your-service-xxxx-uc.a.run.app"  # Your Cloud Run URL
TOKEN="your-oauth-token"                              # From DCR client_credentials or SSO

for i in {1..70}; do
  echo -n "Request $i: "
  curl -s -o /dev/null -w "%{http_code}\n" \
    -X POST "$SERVICE_URL/" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $TOKEN" \
    -d '{"jsonrpc":"2.0","method":"message/send","id":'$i',"params":{"message":{"role":"user","parts":[{"type":"text","text":"test"}]}}}'
done
```

With authentication, rate limits apply per `order_id` and `user_id` (from the token) instead of per IP. Redis (Cloud Memorystore) is internal to the VPC and cannot be inspected with `redis-cli` from outside.
The agent combines HTTP-level throttling, usage metering, and an optional per-run tool budget. They are separate layers:
| Layer | What it limits | When it runs | Shared across replicas? |
|---|---|---|---|
| HTTP rate limiting | Incoming A2A POSTs per principal (minute/hour windows) | FastAPI middleware before the ADK runner | Yes, when all instances use the same Redis |
| Usage tracking (DB) | Requests, tokens, completed tool calls for billing/analytics | ADK plugin (`UsageTrackingPlugin`) | Yes, all instances write to the same database |
| Per-invocation tool budget | How many tools may start in one agent run | ADK `before_tool_callback` | No (today): in-memory per process; see Per-invocation tool budget and proposed shared counters |
Comparison in plain terms: Redis rate limits stop a client from opening too many HTTP conversations. The tool budget stops a single conversation from hammering MCP with an unbounded tool-model loop. Metering records what actually ran for reporting, counting only tools that completed (blocked tools never reach `after_tool_callback`).
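A sketch of the current in-memory, per-process tool budget, assuming a simple counter keyed by invocation (class and method names are hypothetical; the real hook is ADK's `before_tool_callback`):

```python
from collections import defaultdict

class ToolBudget:
    """Hypothetical per-process tool budget, counted per invocation."""

    def __init__(self, max_tool_calls: int):
        self.max_tool_calls = max_tool_calls
        self._started = defaultdict(int)   # invocation_id -> tools started

    def before_tool_call(self, invocation_id: str) -> bool:
        """Return True if the tool may start; only started tools count."""
        if self._started[invocation_id] >= self.max_tool_calls:
            return False   # blocked tool never runs, so it is never metered
        self._started[invocation_id] += 1
        return True

budget = ToolBudget(max_tool_calls=2)
print([budget.before_tool_call("inv-1") for _ in range(3)])  # [True, True, False]
```

Because the counter lives in process memory, two replicas serving the same invocation would each apply the budget independently.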
- Rate limits are enforced across replicas as long as they share the same Redis instance.
- The service verifies Redis connectivity at startup and fails fast when Redis is unavailable.
- Tool budgets are not distributed across replicas until a shared store (for example, Redis with TTL keyed by `invocation_id`) is implemented; see metering.md.
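The proposed shared counter could look roughly like the sketch below. This is not implemented yet (see metering.md); a tiny stand-in class mimics the two Redis commands the design would use, `INCR` and `PEXPIRE`, so the sketch is runnable without a Redis server.

```python
import time

class MiniRedis:
    """Stand-in for a Redis client exposing incr() and pexpire() only."""

    def __init__(self):
        self._data = {}   # key -> (value, expires_at)

    def incr(self, key):
        value, expires_at = self._data.get(key, (0, None))
        if expires_at is not None and time.time() >= expires_at:
            value = 0     # expired key restarts from zero, like Redis
        value += 1
        self._data[key] = (value, expires_at)
        return value

    def pexpire(self, key, ms):
        value, _ = self._data.get(key, (0, None))
        self._data[key] = (value, time.time() + ms / 1000)

def try_start_tool(r, invocation_id, max_tools, ttl_ms=300_000):
    """Atomically count a tool start; the TTL garbage-collects finished runs."""
    key = f"toolbudget:{invocation_id}"   # hypothetical key scheme
    count = r.incr(key)
    if count == 1:
        r.pexpire(key, ttl_ms)            # first tool call sets the expiry
    return count <= max_tools

r = MiniRedis()
print([try_start_tool(r, "inv-1", max_tools=2) for _ in range(3)])  # [True, True, False]
```

With a real Redis client the same two commands would make the budget consistent across replicas, since `INCR` is atomic on the server.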