feat(rate_limiter): add SlidingWindowRateLimiter for strict per-minute caps #20799
dive2tech wants to merge 8 commits into run-llama:main
…e caps Add SlidingWindowRateLimiter as an alternative to TokenBucketRateLimiter. It enforces a strict sliding 60-second window for RPM/TPM, with no burst at window boundaries. Includes full sync/async support and tests. Co-authored-by: Cursor <cursoragent@cursor.com>
Extend SlidingWindowRateLimiter with optional request_burst and token_burst parameters so callers can configure limited burst headroom while keeping a sliding 60s window model. Includes tests covering request and token bursts. Made-with: Cursor
Hi @rootInfluence, thanks for your feedback.
…window Replace the SlidingWindowRateLimiter __init__ override with a model_validator that enforces the RPM/TPM requirement and remove a redundant instance-type test. This keeps the Pydantic API ergonomic while maintaining invariants. Made-with: Cursor
Hi @AstraBert,
Linting is failing. You can check it locally by running the following from the root folder of the llama_index repo:
uv pip install pre-commit
pre-commit install
pre-commit run -a
Apply ruff/black formatting to rate_limiter module so it passes the standard pre-commit hooks used in this repository. Made-with: Cursor
Fixed the lint error; ready to merge.
Add a small compatibility shim for NoneType so llama_index.core.base.llms.types imports cleanly on Python 3.9 where types.NoneType is not available. Made-with: Cursor
Made-with: Cursor
I've fixed the CI bot test error. Please try running it again.
# NOTE:
# Python 3.9 does not expose `types.NoneType`, so we define a local alias that
# works across all supported versions instead of importing it from `types`.
NoneType = type(None)
It is not necessary to stabilize integrations across versions: integrations will most probably fail every time a modification is made to core, mostly because of flakiness in their own tests. As for 3.9 specifically, we are preparing to drop support, so there is no need to adapt.
I will merge this one once the last commit is reverted and CI is done. If linting, type-checking, and the core tests on 3.12 and 3.14 pass, that is all I need to merge; it is not a problem if other CI checks fail.
This reverts commit ba84bd5.
Hi @AstraBert, thanks for your feedback.
Summary
Adds a new rate limiter implementation, SlidingWindowRateLimiter, as an alternative to the existing TokenBucketRateLimiter.
Motivation
Token-bucket limiters allow bursts at the start of each window. Some APIs enforce strict limits over a rolling 60-second window with no burst allowance. This implementation enforces a strict sliding window: only requests (or tokens) within the last 60 seconds count toward the limit.
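To make the sliding-window idea concrete, here is a minimal, self-contained sketch of a request-count limiter that only counts timestamps from the last 60 seconds. It is an illustration under stated assumptions, not the code in this PR: the class name `SlidingWindowSketch`, the `try_acquire()` and `wait_time()` methods, and the injectable `clock` parameter are all hypothetical, and the sketch covers neither token (TPM) accounting, the burst parameters, async use, nor thread safety.

```python
import time
from collections import deque
from typing import Callable, Deque


class SlidingWindowSketch:
    """Hypothetical illustration of a sliding-window request limiter.

    Not the SlidingWindowRateLimiter from this PR; names are assumptions.
    """

    def __init__(
        self,
        max_requests: int,
        window_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        if max_requests <= 0:
            raise ValueError("max_requests must be positive")
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock  # injectable so tests do not need to sleep
        self._timestamps: Deque[float] = deque()  # oldest request first

    def _prune(self, now: float) -> None:
        # Drop requests that are at least one full window old.
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()

    def try_acquire(self) -> bool:
        """Record a request and return True if the window has room."""
        now = self._clock()
        self._prune(now)
        if len(self._timestamps) < self.max_requests:
            self._timestamps.append(now)
            return True
        return False

    def wait_time(self) -> float:
        """Seconds until the oldest counted request leaves the window."""
        now = self._clock()
        self._prune(now)
        if len(self._timestamps) < self.max_requests:
            return 0.0
        return self._timestamps[0] + self.window_seconds - now


class FakeClock:
    """Manually advanced clock used to demonstrate the behavior."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now


clock = FakeClock()
limiter = SlidingWindowSketch(max_requests=2, clock=clock)

first = limiter.try_acquire()        # t=0: admitted
clock.now = 30.0
second = limiter.try_acquire()       # t=30: admitted, window is now full
clock.now = 59.0
third = limiter.try_acquire()        # t=59: rejected, both requests still count
wait_at_59 = limiter.wait_time()     # the t=0 request expires at t=60
clock.now = 60.0
fourth = limiter.try_acquire()       # t=60: admitted, the t=0 request aged out
```

A blocking `acquire()` could be built on top of this by sleeping for `wait_time()` and retrying; that loop is omitted here to keep the sketch short.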
Changes
- llama_index/core/rate_limiter.py: New SlidingWindowRateLimiter class implementing BaseRateLimiter with: acquire() and async async_acquire()
- tests/test_rate_limiter.py: Tests for creation, validation, blocking behavior, pruning, TPM limiting, async/concurrent behavior, and LLM/embedding integration.
Usage
All existing rate limiter tests pass; 13 new tests added for the sliding-window implementation.