
feat: add per-user rate limiting middleware for inference endpoints. And remove catch all. (#67)

Merged
think-in-universe merged 4 commits into main from rate-limit-catch-all
Dec 4, 2025
Conversation

@PierreLeGuen
Contributor

  • Introduced rate_limit_middleware to enforce maximum 2 concurrent requests and 1 request per second per user.
  • Updated API routes to integrate rate limiting.
  • Removed unused ALLOWED_PROXY_PATHS constant and related tests.
  • Added tests for rate limiting functionality to ensure correct behavior across different users and request scenarios.
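
The limits described above can be sketched without tokio. This is a minimal, hypothetical stand-in, not the PR's implementation: the names `RateLimitState`, `try_acquire`, and `RateLimitError` mirror identifiers that appear in the PR's tests, but the fields and the synchronous signature are assumptions (the real middleware is async and, per the review, uses a semaphore).

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq)]
pub enum RateLimitError {
    ConcurrencyLimitExceeded,
    RateLimitExceeded,
}

#[derive(Default)]
struct UserState {
    in_flight: usize,              // requests currently being served
    last_request: Option<Instant>, // start of the most recent accepted request
}

#[derive(Clone)]
pub struct RateLimitState {
    users: Arc<Mutex<HashMap<String, UserState>>>,
    max_concurrent: usize,
    min_interval: Duration,
}

/// Releases the user's concurrency slot when the request finishes.
pub struct RequestGuard {
    users: Arc<Mutex<HashMap<String, UserState>>>,
    user: String,
}

impl Drop for RequestGuard {
    fn drop(&mut self) {
        if let Some(s) = self.users.lock().unwrap().get_mut(&self.user) {
            s.in_flight -= 1;
        }
    }
}

impl RateLimitState {
    pub fn new() -> Self {
        Self {
            users: Arc::new(Mutex::new(HashMap::new())),
            max_concurrent: 2,                    // at most 2 concurrent requests per user
            min_interval: Duration::from_secs(1), // at most 1 request per second per user
        }
    }

    pub fn try_acquire(&self, user: &str) -> Result<RequestGuard, RateLimitError> {
        let mut users = self.users.lock().unwrap();
        let state = users.entry(user.to_string()).or_default();

        if state.in_flight >= self.max_concurrent {
            return Err(RateLimitError::ConcurrencyLimitExceeded);
        }
        if let Some(last) = state.last_request {
            if last.elapsed() < self.min_interval {
                // Rejecting here must not consume a concurrency slot:
                // in_flight has not been incremented yet.
                return Err(RateLimitError::RateLimitExceeded);
            }
        }
        state.in_flight += 1;
        state.last_request = Some(Instant::now());
        drop(users);
        Ok(RequestGuard {
            users: Arc::clone(&self.users),
            user: user.to_string(),
        })
    }
}
```

Checking the concurrency cap and the rate window before incrementing `in_flight` means a rejected request never consumes a slot, which is the class of bug the review flags in the semaphore-based version.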


Copilot AI left a comment


Pull request overview

This PR introduces per-user rate limiting middleware for inference endpoints and refactors the API routing from a catch-all pattern to explicit route handlers. The rate limiting enforces 2 concurrent requests and 1 request per second per user. However, there are critical bugs that must be addressed before merging.

Key Changes:

  • Added `RateLimitState` middleware with per-user concurrency and time-based rate limiting
  • Replaced the catch-all proxy handler with explicit route handlers for `/v1/responses`, `/v1/model/list`, and `/v1/signature/{chat_id}`
  • Removed the `ALLOWED_PROXY_PATHS` constant and related whitelist validation logic

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.

Summary per file:

  • `crates/api/src/middleware/rate_limit.rs`: New rate limiting middleware implementation with per-user state tracking (contains critical semaphore leak bug)
  • `crates/api/src/middleware/mod.rs`: Exports rate limiting middleware components
  • `crates/api/src/routes/api.rs`: Replaced catch-all handler with explicit proxy routes; missing some previously allowed paths
  • `crates/api/src/routes/mod.rs`: Integrated `RateLimitState` into router creation
  • `crates/api/src/consts.rs`: Removed `ALLOWED_PROXY_PATHS` constant
  • `crates/api/src/lib.rs`: Removed `create_router` export, keeping only `create_router_with_cors`
  • `crates/api/tests/common.rs`: Updated test setup to use `create_router_with_cors` directly
  • `crates/api/tests/rate_limit_tests.rs`: Added integration tests for rate limiting behavior
  • `crates/api/tests/proxy_path_tests.rs`: Removed tests for catch-all proxy path validation


Comment on lines +233 to +247
```rust
#[tokio::test]
async fn test_rate_limit_blocks_second_request_within_window() {
    let state = RateLimitState::new();
    let user = test_user_id(1);

    // First request should succeed
    let _guard1 = state.try_acquire(user.clone()).await.unwrap();

    // Second request within the same second should fail (rate limit)
    let result = state.try_acquire(user).await;
    assert!(matches!(
        result,
        Err(RateLimitError::RateLimitExceeded { .. })
    ));
}
```

Copilot AI Dec 4, 2025


Missing test coverage for the semaphore permit leak bug. The existing test `test_rate_limit_blocks_second_request_within_window` doesn't drop the first guard before the second request, so it won't catch the case where a permit is leaked when the rate limit (not the concurrency limit) is exceeded.

A test should verify that after a rate-limited request is rejected, the concurrency limit is not affected. For example:

  1. Make a request and drop the guard
  2. Make a second request immediately (should fail with RateLimitExceeded)
  3. Wait for the window to expire
  4. Make max_concurrent requests simultaneously (should all succeed, proving no permits were leaked)
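
The four steps above can be turned into a regression test. Below is a std-only sketch; `Limiter` and `Guard` are hypothetical stand-ins for the PR's middleware (which wraps a semaphore), and step 4 is adapted slightly: because the one-request-per-window rule would itself reject truly simultaneous requests, the sketch asserts that a follow-up request fails with the rate-limit error rather than the concurrency error a leaked permit would produce.

```rust
use std::sync::Mutex;
use std::time::{Duration, Instant};

// Hand-rolled permit counter standing in for a semaphore.
struct Limiter {
    permits: Mutex<usize>,
    last: Mutex<Option<Instant>>,
    min_interval: Duration,
}

struct Guard<'a> {
    limiter: &'a Limiter,
}

impl Drop for Guard<'_> {
    fn drop(&mut self) {
        // Finished request: return the concurrency permit.
        *self.limiter.permits.lock().unwrap() += 1;
    }
}

impl Limiter {
    fn new(max_concurrent: usize, min_interval: Duration) -> Self {
        Self {
            permits: Mutex::new(max_concurrent),
            last: Mutex::new(None),
            min_interval,
        }
    }

    fn try_acquire(&self) -> Result<Guard<'_>, &'static str> {
        let mut permits = self.permits.lock().unwrap();
        if *permits == 0 {
            return Err("concurrency limit exceeded");
        }
        *permits -= 1; // take a permit first, like acquiring a semaphore permit

        let mut last = self.last.lock().unwrap();
        if let Some(t) = *last {
            if t.elapsed() < self.min_interval {
                *permits += 1; // the fix: return the permit on the rejection path
                return Err("rate limit exceeded");
            }
        }
        *last = Some(Instant::now());
        Ok(Guard { limiter: self })
    }
}
```

If the `*permits += 1` line on the rejection path were missing, the final request in the regression sequence would fail with the concurrency error instead of the rate-limit error, exposing the leak.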

- Improved user rate limit state management by adding idle user cleanup and refining concurrency checks.
- Updated rate limit error handling to differentiate between concurrency and rate limit exceeded scenarios.
- Removed redundant comments and streamlined the rate limit middleware for clarity.
- Added tests to ensure proper behavior of rate limiting under various conditions.
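
The idle-user cleanup mentioned above could, under the assumption that per-user state lives in a map keyed by user id (the `UserEntry` shape here is hypothetical, not the PR's actual type), be a single `retain` that keeps users who are either mid-request or recently active:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Hypothetical per-user entry; the PR's real fields may differ.
struct UserEntry {
    in_flight: usize,      // requests currently being served
    last_request: Instant, // time of the most recent accepted request
}

/// Drop state for users with no in-flight requests whose last request is
/// older than `idle_after`, so the map cannot grow without bound.
fn cleanup_idle_users(users: &mut HashMap<String, UserEntry>, idle_after: Duration) {
    users.retain(|_, e| e.in_flight > 0 || e.last_request.elapsed() < idle_after);
}
```

Running this periodically (or opportunistically on each acquire) bounds memory at the cost of occasionally resetting a long-idle user's rate window, which is harmless for this use case.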

Copilot AI left a comment


Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.



- Introduced a new test to validate the behavior of the rate limiting middleware under concurrent requests.
- The test ensures that at least two out of three concurrent requests are rate limited, adhering to the defined rate limit configuration.
think-in-universe merged commit ed61228 into main on Dec 4, 2025
1 check passed
think-in-universe deleted the rate-limit-catch-all branch on December 4, 2025 at 03:52
3 participants