feat: add max-num-batched-tokens configuration and implement request handling constraints (#83) #97
mohitpalsingh wants to merge 2 commits into llm-d:main from
Conversation
…handling constraints
c = createDefaultConfig(qwenModelName)
c.Port = 8001
c.ServedModelNames = []string{"model1", "model2"}
c.MaxNumBatchedTokens = 2048
Please move this (and all other occurrences) to createDefaultConfig().
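For illustration, a minimal sketch of what the suggestion amounts to, using a stand-in config type; the real createDefaultConfig builds the simulator's full test configuration, so the fields and constructor shown here are assumptions:

```go
// simConfig is a minimal stand-in for the simulator's configuration type;
// the real struct has many more fields.
type simConfig struct {
	Model               string
	Port                int
	ServedModelNames    []string
	MaxNumBatchedTokens int
}

// createDefaultConfig is sketched with only the fields relevant to the comment:
// setting MaxNumBatchedTokens once here means individual tests no longer need
// to assign it inline.
func createDefaultConfig(model string) *simConfig {
	return &simConfig{
		Model:               model,
		MaxNumBatchedTokens: 2048, // default shared by every test unless overridden
	}
}
```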
	if outputTokens < 0 {
		outputTokens = 0
	}
}
If maxCompletionTokens is nil, shouldn't this function just return s.config.MaxModelLen?
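For context, one way to read this suggestion as code; this is a sketch only, and the names and signature below are illustrative rather than the simulator's actual API:

```go
// processingTokens is an illustrative helper: prompt tokens plus the maximum
// number of output tokens the request may generate. Per the comment above, when
// the request sets no completion-token cap, the function could simply fall back
// to the model's context length (s.config.MaxModelLen in the simulator).
func processingTokens(promptTokens int, maxCompletionTokens *int64, maxModelLen int) int {
	if maxCompletionTokens == nil {
		return maxModelLen
	}
	outputTokens := int(*maxCompletionTokens)
	if outputTokens < 0 {
		outputTokens = 0
	}
	return promptTokens + outputTokens
}
```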
pkg/llm-d-inference-sim/simulator.go
	promptTokens: req.getNumberOfPromptTokens(),
	maxTokens:    processingTokens,
	totalTokens:  processingTokens,
}
I could only find where 'totalTokens' is used (to update processingTokensCount), so why are 'promptTokens' and 'maxTokens' needed? And if they are not used, I guess we don't need runningRequestsMap at all? (And requestID)
You're right, that was unnecessary. I was biased toward a general, one-size-fits-all approach in case we want finer control over parallel requests in the near future. But removing it for now is better and leaner.
Removed unused fields and structures:
- runningRequest struct - Completely removed since promptTokens and maxTokens were never used
- runningRequestsMap sync.Map - Removed since we don't need to map request IDs to token counts
- requestID field - Removed from completionReqCtx since we no longer need unique request tracking
Simplified token tracking:
- Before: Store a complex runningRequest struct with 3 fields in a map, indexed by requestID
- After: Store just the processingTokens count directly in the completionReqCtx
Updated method signatures:
- addRunningRequest() now takes *completionReqCtx instead of (reqID, req)
- removeRunningRequest() now takes *completionReqCtx instead of reqID
- Both methods are simpler and more direct (see the sketch below)
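To make the described shape concrete, here is a minimal sketch of the simplified tracking; aside from completionReqCtx, addRunningRequest, and removeRunningRequest, the names (and the atomic counter) are assumptions rather than the exact code in simulator.go:

```go
package sim

import "sync/atomic"

// completionReqCtx keeps the per-request state needed while a request runs;
// after the simplification it stores the token count directly.
type completionReqCtx struct {
	processingTokens int64 // prompt tokens + max output tokens for this request
}

type simulator struct {
	processingTokensCount atomic.Int64 // tokens across all currently running requests
}

// addRunningRequest takes the request context itself instead of a request ID,
// so no runningRequestsMap lookup is needed.
func (s *simulator) addRunningRequest(ctx *completionReqCtx) {
	s.processingTokensCount.Add(ctx.processingTokens)
}

// removeRunningRequest mirrors addRunningRequest and releases the tokens.
func (s *simulator) removeRunningRequest(ctx *completionReqCtx) {
	s.processingTokensCount.Add(-ctx.processingTokens)
}
```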
@mohitpalsingh Thank you very much for this PR. After reviewing it along with issue #83 with a colleague of mine, we need to think more carefully about what exactly needs to be simulated here. Therefore, there will be a delay in the review of your PR.
Yeah, no issues @irar2. I've updated the PR as per your comments, and it should be good to go for the current scope and expected behavior. Let me know if you decide to change something, and I can help with that.
✨ New Feature: max-num-batched-tokens Support

This PR implements the max-num-batched-tokens parameter, which limits the total number of tokens (prompt + max output tokens) that can be processed simultaneously across all running requests.

🔧 Technical Implementation

- calculateProcessingTokens() function to compute total token requirements per request.
- canAcceptRequest() function to check constraint satisfaction.
- addRunningRequest() and removeRunningRequest() for proper token tracking (sketched below).
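A condensed, purely illustrative sketch of how these pieces fit together; the real functions in simulator.go operate on the simulator's request and config types, so the signatures here are simplified:

```go
// calculateProcessingTokens: a request "occupies" its prompt tokens plus the
// maximum number of output tokens it may still generate.
func calculateProcessingTokens(promptTokens, maxOutputTokens int) int {
	return promptTokens + maxOutputTokens
}

// canAcceptRequest admits a request only if both the running-sequence limit and
// the batched-token limit would still hold after adding it.
func canAcceptRequest(runningSeqs, maxNumSeqs, processingTokens, newTokens, maxNumBatchedTokens int) bool {
	if runningSeqs+1 > maxNumSeqs {
		return false
	}
	// A limit of 0 disables the token constraint (existing behavior).
	if maxNumBatchedTokens > 0 && processingTokens+newTokens > maxNumBatchedTokens {
		return false
	}
	return true
}
```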
📝 Configuration & Documentation

- --max-num-batched-tokens with proper help text.
- config.yaml (2048) and basic-config.yaml (1024).

🔄 Code Quality Improvements

- When set to 0, only the max-num-seqs constraint is enforced.

🧪 Testing
📊 Behavior

When max-num-batched-tokens is configured:
- Both the max-num-seqs and max-num-batched-tokens constraints are enforced.

When max-num-batched-tokens is 0 or not set:
- Only the max-num-seqs constraint is enforced (existing behavior, illustrated below).
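As a usage example building on the canAcceptRequest sketch above (the numbers are made up; this assumes package main with "fmt" imported):

```go
func main() {
	// max-num-batched-tokens = 2048: both constraints apply.
	fmt.Println(canAcceptRequest(1, 5, 1500, 600, 2048)) // false: 1500+600 exceeds 2048
	fmt.Println(canAcceptRequest(1, 5, 1500, 500, 2048)) // true: both limits hold

	// max-num-batched-tokens = 0 or unset: only max-num-seqs is enforced.
	fmt.Println(canAcceptRequest(5, 5, 100, 600, 0))  // false: sequence limit already reached
	fmt.Println(canAcceptRequest(4, 5, 9999, 600, 0)) // true: token limit disabled
}
```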
🏗️ Files Modified

- config.go - Added parameter and validation
- simulator.go - Core implementation and renamed field
- pkg/llm-d-inference-sim/*_test.go - Updated tests and expectations
- config.yaml & basic-config.yaml - Added parameter
- README.md - Documentation updates

✅ Validation

- --help output

🎯 Addresses
This implementation follows the vLLM specification for the max-num-batched-tokens parameter, ensuring requests only proceed when both sequence and token constraints are satisfied. This enables better resource management and throughput control.