feat: add per-user rate limiting middleware for inference endpoints. And remove catch all.#67
Conversation
…And remove catch all. - Introduced `rate_limit_middleware` to enforce maximum 2 concurrent requests and 1 request per second per user. - Updated API routes to integrate rate limiting. - Removed unused `ALLOWED_PROXY_PATHS` constant and related tests. - Added tests for rate limiting functionality to ensure correct behavior across different users and request scenarios.
There was a problem hiding this comment.
Pull request overview
This PR introduces per-user rate limiting middleware for inference endpoints and refactors the API routing from a catch-all pattern to explicit route handlers. The rate limiting enforces 2 concurrent requests and 1 request per second per user. However, there are critical bugs that must be addressed before merging.
Key Changes:
- Added
RateLimitStatemiddleware with per-user concurrency and time-based rate limiting - Replaced catch-all proxy handler with explicit route handlers for
/v1/responses,/v1/model/list, and/v1/signature/{chat_id} - Removed
ALLOWED_PROXY_PATHSconstant and related whitelist validation logic
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
crates/api/src/middleware/rate_limit.rs |
New rate limiting middleware implementation with per-user state tracking (contains critical semaphore leak bug) |
crates/api/src/middleware/mod.rs |
Exports rate limiting middleware components |
crates/api/src/routes/api.rs |
Replaced catch-all handler with explicit proxy routes; missing some previously allowed paths |
crates/api/src/routes/mod.rs |
Integrated RateLimitState into router creation |
crates/api/src/consts.rs |
Removed ALLOWED_PROXY_PATHS constant |
crates/api/src/lib.rs |
Removed create_router export, keeping only create_router_with_cors |
crates/api/tests/common.rs |
Updated test setup to use create_router_with_cors directly |
crates/api/tests/rate_limit_tests.rs |
Added integration tests for rate limiting behavior |
crates/api/tests/proxy_path_tests.rs |
Removed tests for catch-all proxy path validation |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| #[tokio::test] | ||
| async fn test_rate_limit_blocks_second_request_within_window() { | ||
| let state = RateLimitState::new(); | ||
| let user = test_user_id(1); | ||
|
|
||
| // First request should succeed | ||
| let _guard1 = state.try_acquire(user.clone()).await.unwrap(); | ||
|
|
||
| // Second request within the same second should fail (rate limit) | ||
| let result = state.try_acquire(user).await; | ||
| assert!(matches!( | ||
| result, | ||
| Err(RateLimitError::RateLimitExceeded { .. }) | ||
| )); | ||
| } |
There was a problem hiding this comment.
Missing test coverage for the semaphore permit leak bug. The existing test test_rate_limit_blocks_second_request_within_window doesn't drop the first guard before the second request, so it won't catch the case where a permit is leaked when the rate limit (not concurrency limit) is exceeded.
A test should verify that after a rate-limited request is rejected, the concurrency limit is not affected. For example:
- Make a request and drop the guard
- Make a second request immediately (should fail with RateLimitExceeded)
- Wait for the window to expire
- Make max_concurrent requests simultaneously (should all succeed, proving no permits were leaked)
- Improved user rate limit state management by adding idle user cleanup and refining concurrency checks. - Updated rate limit error handling to differentiate between concurrency and rate limit exceeded scenarios. - Removed redundant comments and streamlined the rate limit middleware for clarity. - Added tests to ensure proper behavior of rate limiting under various conditions.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Introduced a new test to validate the behavior of the rate limiting middleware under concurrent requests. - The test ensures that at least two out of three concurrent requests are rate limited, adhering to the defined rate limit configuration.
rate_limit_middlewareto enforce maximum 2 concurrent requests and 1 request per second per user.ALLOWED_PROXY_PATHSconstant and related tests.