feat: enable HTTP completion endpoint to accept arrays of prompts and generate multiple completions per prompt #3953
base: main
Conversation
Enable the completion endpoint to accept arrays of prompts and generate n completions per prompt, matching vLLM behavior.

- Add utility functions to handle prompt arrays (get_prompt_batch_size, extract_single_prompt)
- Implement batch processing in the HTTP handler with proper choice index remapping
- Add validation for total choices (batch_size × n ≤ 128)
- Generate a unique request_id for each prompt to avoid conflicts
- Add comprehensive tests for batch prompts and n parameter combinations
- Maintain backward compatibility with single-prompt requests

The choice index formula matches vLLM: final_index = prompt_idx * n + choice_idx. Example: 3 prompts with n=2 yields indices 0,1 (prompt 0), 2,3 (prompt 1), 4,5 (prompt 2).
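For illustration, a minimal sketch of that remapping (the helper name is hypothetical; the real logic lives in the HTTP handler):

```rust
/// Remap a per-prompt choice index into the flattened response,
/// matching vLLM: final_index = prompt_idx * n + choice_idx.
fn remap_choice_index(prompt_idx: u32, n: u32, choice_idx: u32) -> u32 {
    prompt_idx * n + choice_idx
}

fn main() {
    // 3 prompts with n = 2 -> indices 0,1 (prompt 0), 2,3 (prompt 1), 4,5 (prompt 2)
    for prompt_idx in 0..3 {
        for choice_idx in 0..2 {
            println!(
                "prompt {prompt_idx}, choice {choice_idx} -> index {}",
                remap_choice_index(prompt_idx, 2, choice_idx)
            );
        }
    }
}
```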
Walkthrough

The pull request implements batch-aware handling for LLM completions by introducing detection logic that routes single-prompt and multi-prompt requests through dedicated code paths. Batch utilities extract and validate prompts, enforce a total-choices limit, and support per-prompt choice remapping with streaming and annotation handling.
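A sketch of what that detection could look like, assuming the Prompt enum mirrors async-openai's variants (illustrative only, not the merged code):

```rust
use dynamo_async_openai::types::Prompt;

/// Number of prompts in the request; a result of 1 keeps the existing
/// single-prompt path, anything larger takes the batch path.
fn get_prompt_batch_size(prompt: &Prompt) -> usize {
    match prompt {
        Prompt::String(_) => 1,
        Prompt::StringArray(prompts) => prompts.len(),
        // Token-id prompts (integer arrays) would be counted the same way.
        _ => 1,
    }
}
```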
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks: ✅ 3 passed
Updated the examples/test plan in the description.
Syncing with main to pull in this change to hopefully fix all the failing deploy tests: https://github.com/ai-dynamo/dynamo/pull/4089/files
May need this one for the deploy test failures: #4130
Would like to see a test for an empty prompt array to make sure it's properly rejected. Otherwise LGTM!
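A sketch of the kind of test being asked for, reusing the hypothetical get_prompt_batch_size helper from above:

```rust
#[test]
fn rejects_empty_prompt_array() {
    let prompt = Prompt::StringArray(vec![]);
    // A batch size of 0 should be rejected by request validation
    // (e.g. with a 400) rather than silently producing no choices.
    assert_eq!(get_prompt_batch_size(&prompt), 0);
}
```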
LGTM
// Fallback to empty string if index out of bounds
dynamo_async_openai::types::Prompt::String(String::new())
When would the index exceed the bounds? Should this be an error case? If so, we can return Result&lt;Prompt&gt; and return Err for these out-of-bounds cases, and then where we call extract_single_prompt we should also error out instead of proceeding with an empty string or empty array, right?
It actually won’t exceed the bounds. When I implemented this, I was treating it as a standalone module and added the bounds check for robustness. We can just let it error out if the index goes out of range.
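A minimal sketch of the Result-returning variant the thread converges on (names and error type are illustrative; the actual error handling in dynamo may differ):

```rust
use dynamo_async_openai::types::Prompt;

/// Extract the prompt at `index` from a possibly batched request,
/// erroring out on out-of-bounds indices instead of falling back
/// to an empty prompt.
fn extract_single_prompt(prompt: &Prompt, index: usize) -> anyhow::Result<Prompt> {
    match prompt {
        Prompt::String(s) if index == 0 => Ok(Prompt::String(s.clone())),
        Prompt::StringArray(prompts) => prompts
            .get(index)
            .cloned()
            .map(Prompt::String)
            .ok_or_else(|| anyhow::anyhow!("prompt index {index} out of bounds")),
        _ => anyhow::bail!("prompt index {index} out of bounds for this prompt type"),
    }
}
```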
Overview:
This PR enables the completion endpoint to accept arrays of prompts and generate multiple completions per prompt.
Details:
Where should the reviewer start?
- lib/llm/src/protocols/openai/completions.rs: the new validation logic and utility functions (a sketch of the validation follows this list)
- lib/llm/src/http/service/openai.rs: the batch processing implementation with choice index remapping
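For orientation, a sketch of the shape of that validation (the constant and names are illustrative; see completions.rs for the real code):

```rust
const MAX_TOTAL_CHOICES: usize = 128;

/// Reject requests whose total choice count (batch_size × n) exceeds
/// the server limit before any generation work is scheduled.
fn validate_total_choices(batch_size: usize, n: usize) -> Result<(), String> {
    let total = batch_size.saturating_mul(n);
    if total > MAX_TOTAL_CHOICES {
        return Err(format!(
            "batch_size ({batch_size}) x n ({n}) = {total} exceeds the limit of {MAX_TOTAL_CHOICES}"
        ));
    }
    Ok(())
}
```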
Test Plan
```bash
curl localhost:8000/v1/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen3-0.6B",
  "prompt": ["Say test 1", "Say test 2"],
  "max_tokens": 50,
  "temperature": 0.7,
  "n": 1
}' | jq
```

```json
{
  "id": "cmpl-342716fb-9fbe-42bf-b874-48dd150c6bba-1",
  "choices": [
    {
      "text": "234, 3214, 4321, 4123, 1234, 2413, 1324, 4312, 413",
      "index": 0,
      "finish_reason": "length"
    },
    {
      "text": "015\nLet $T_{n}$ be the set of all possible expressions of the form $\\frac{a_n}{b_n} + \\frac{c_n}{d_n}$, where $a_n, b_n, c_n",
      "index": 1,
      "finish_reason": "length"
    }
  ],
  "created": 1762282615,
  "model": "Qwen/Qwen3-0.6B",
  "system_fingerprint": null,
  "object": "text_completion",
  "usage": {
    "prompt_tokens": 4,
    "completion_tokens": 50,
    "total_tokens": 54
  }
}
```

Summary by CodeRabbit
- New Features
- Bug Fixes