server: improve slot fairness by limiting tokens per slot #14343
Summary
This PR addresses issue #6607 by implementing a minimal, safe solution to prevent large prompts from blocking other slots in the server's prompt processing queue.
Problem
The current FIFO slot processing approach lets a large prompt consume the entire batch (n_batch tokens), effectively blocking other slots until the large prompt has been fully processed.
This creates an unfair situation where users with smaller prompts experience significant delays.
Solution
I've implemented a simple token-per-slot limitation that ensures fair processing across multiple concurrent requests:
each slot may process at most n_batch / 4 tokens per iteration.
Implementation Details
The change is minimal and focused:
Only tools/server/server.cpp is modified
A max_tokens_per_slot limit and counter are added to the existing while loop
Example with n_batch=512: each slot is batched at most 512 / 4 = 128 prompt tokens per iteration, so a single large prompt can no longer occupy the whole batch.
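To make the mechanism concrete, here is a minimal, self-contained sketch of the capping logic, not the actual server.cpp diff: the Slot struct, the scheduling loop, and the example prompt sizes are illustrative assumptions, and only the names max_tokens_per_slot and n_batch and the n_batch / 4 ratio come from this PR.

```cpp
// Sketch of the per-slot token cap, assuming a simplified slot/batch
// model; the real change lives in the batching while loop in
// tools/server/server.cpp.
#include <cstdio>
#include <vector>

struct Slot {
    int id;
    int n_remaining; // prompt tokens still waiting to be batched
};

int main() {
    const int n_batch             = 512;
    const int max_tokens_per_slot = n_batch / 4; // 128: the 25% cap

    std::vector<Slot> slots = {
        {0, 2000}, // large prompt that previously monopolized the batch
        {1, 64},
        {2, 96},
    };

    // One scheduling iteration: fill the batch, but stop taking tokens
    // from a slot once it reaches max_tokens_per_slot. Without the cap,
    // slot 0 alone would consume all 512 batch tokens.
    int n_batch_used = 0;
    for (auto & slot : slots) {
        int n_slot_used = 0;
        while (slot.n_remaining > 0 &&
               n_batch_used < n_batch &&
               n_slot_used  < max_tokens_per_slot) {
            slot.n_remaining--;
            n_slot_used++;
            n_batch_used++;
        }
        std::printf("slot %d: batched %d tokens, %d remaining\n",
                    slot.id, n_slot_used, slot.n_remaining);
    }
    std::printf("batch usage: %d / %d tokens\n", n_batch_used, n_batch);
    return 0;
}
```

With these example numbers, the large prompt is fed to the batch in 128-token installments, so the two small prompts are fully batched in the first iteration instead of waiting behind roughly four full batches of the large prompt.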
Testing
I have thoroughly tested this implementation to ensure it doesn't break existing functionality:
Build Verification: Successfully compiled with no errors or warnings using make llama-server -j4
Basic Functionality Tests: All basic server tests pass, confirming core functionality remains intact
Completion Tests: Comprehensive completion tests, including parallel slot processing tests, all pass (26 passed, 1 skipped). The successful
test_completion_parallel_slots tests specifically validate that multiple concurrent slots work correctly with our changes
Edge Case Testing: Verified behavior with single slots, empty slot lists, and various slot configurations
Benefits
Fair processing across concurrent slots: smaller prompts are no longer blocked behind a single large prompt
Minimal, focused change that preserves the existing slot processing order and batch efficiency
Notes for Reviewers
This is an intentionally conservative approach that prioritizes safety and simplicity. The 25% (n_batch / 4) limit was chosen as a reasonable balance between fairness and efficiency, but it could be made configurable in future iterations if needed.
The solution maintains all existing guarantees about slot processing order and batch efficiency while addressing the core blocking issue identified in #6607.
Thank you so much for taking the time to review this contribution! I truly appreciate the careful consideration that the llama.cpp maintainers give to every change, and I'm grateful for the
opportunity to contribute to this fantastic project. Your expertise and feedback are invaluable in making llama.cpp better for everyone.
Fixes #6607