
Conversation

@zhongdaor-nv
Contributor

@zhongdaor-nv zhongdaor-nv commented Oct 29, 2025

Overview:

This PR enables the completion endpoint to accept arrays of prompts and generate multiple completions per prompt.

Details:

  • Add utility functions to handle prompt arrays (get_prompt_batch_size, extract_single_prompt)
  • Implement batch processing in HTTP handler with proper choice index remapping
  • Add validation for total choices (batch_size × n ≤ 128)
  • Generate unique request_id for each prompt to avoid conflicts
  • Add comprehensive tests for batch prompts and n parameter combinations
  • Maintain backward compatibility with single prompt requests
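The two utilities named above can be sketched roughly as follows. This is a simplified, hypothetical stand-in: the real `Prompt` type comes from `dynamo_async_openai` and has additional variants (token arrays), and the real functions live in lib/llm/src/protocols/openai/completions.rs.

```rust
// Simplified stand-in for dynamo_async_openai::types::Prompt.
#[derive(Clone, Debug, PartialEq)]
enum Prompt {
    String(String),
    StringArray(Vec<String>),
}

// Number of prompts carried by the request: 1 for a plain string,
// the array length for an array of prompts.
fn get_prompt_batch_size(prompt: &Prompt) -> usize {
    match prompt {
        Prompt::String(_) => 1,
        Prompt::StringArray(arr) => arr.len(),
    }
}

// Pull out the prompt at `index` as a single-prompt value.
fn extract_single_prompt(prompt: &Prompt, index: usize) -> Prompt {
    match prompt {
        Prompt::String(s) => Prompt::String(s.clone()),
        Prompt::StringArray(arr) => arr
            .get(index)
            .map(|s| Prompt::String(s.clone()))
            // Fallback to an empty string if the index is out of bounds
            // (mirrors the fallback discussed in review below).
            .unwrap_or_else(|| Prompt::String(String::new())),
    }
}

fn main() {
    let batch = Prompt::StringArray(vec!["Say test 1".into(), "Say test 2".into()]);
    assert_eq!(get_prompt_batch_size(&batch), 2);
    assert_eq!(
        extract_single_prompt(&batch, 1),
        Prompt::String("Say test 2".into())
    );
}
```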

Where should the reviewer start?

lib/llm/src/protocols/openai/completions.rs - contains the new validation logic and utility functions
lib/llm/src/http/service/openai.rs - contains the batch processing implementation with choice index remapping

Related Issues:

  • Closes GitHub issue: #xxx

Test Plan

  • Request Example:
    curl localhost:8000/v1/completions -H "Content-Type: application/json" -d '{ "model": "Qwen/Qwen3-0.6B", "prompt": ["Say test 1", "Say test 2"], "max_tokens": 50, "temperature": 0.7, "n": 1 }' | jq
  • Response:

    {
      "id": "cmpl-342716fb-9fbe-42bf-b874-48dd150c6bba-1",
      "choices": [
        {
          "text": "234, 3214, 4321, 4123, 1234, 2413, 1324, 4312, 413",
          "index": 0,
          "finish_reason": "length"
        },
        {
          "text": "015\nLet $T_{n}$ be the set of all possible expressions of the form $\\frac{a_n}{b_n} + \\frac{c_n}{d_n}$, where $a_n, b_n, c_n",
          "index": 1,
          "finish_reason": "length"
        }
      ],
      "created": 1762282615,
      "model": "Qwen/Qwen3-0.6B",
      "system_fingerprint": null,
      "object": "text_completion",
      "usage": {
        "prompt_tokens": 4,
        "completion_tokens": 50,
        "total_tokens": 54
      }
    }

Summary by CodeRabbit

New Features

  • Added support for batch processing of multiple prompts in a single completion request.
  • Introduced validation to enforce a maximum limit of 128 total choices per batch operation.

Bug Fixes

  • Fixed handling of multi-element prompt arrays: such requests previously returned server errors and now complete successfully with valid responses.

Enable completion endpoint to accept arrays of prompts and generate
n completions per prompt, matching vLLM behavior.

- Add utility functions to handle prompt arrays (get_prompt_batch_size, extract_single_prompt)
- Implement batch processing in HTTP handler with proper choice index remapping
- Add validation for total choices (batch_size × n ≤ 128)
- Generate unique request_id for each prompt to avoid conflicts
- Add comprehensive tests for batch prompts and n parameter combinations
- Maintain backward compatibility with single prompt requests

Choice index formula matches vLLM: final_index = prompt_idx * n + choice_idx
Example: 3 prompts with n=2 yields indices 0,1 (prompt0), 2,3 (prompt1), 4,5 (prompt2)
Signed-off-by: zhongdaor <[email protected]>
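The index formula above is simple enough to verify in isolation. A minimal sketch (the function name here is illustrative, not the actual code):

```rust
// vLLM-compatible remapping: a choice produced for prompt `prompt_idx`
// at local position `choice_idx` gets global index prompt_idx * n + choice_idx.
fn remap_choice_index(prompt_idx: u32, n: u32, choice_idx: u32) -> u32 {
    prompt_idx * n + choice_idx
}

fn main() {
    // 3 prompts with n = 2 yield indices 0,1 (prompt 0), 2,3 (prompt 1), 4,5 (prompt 2).
    let indices: Vec<u32> = (0u32..3)
        .flat_map(|p| (0u32..2).map(move |c| remap_choice_index(p, 2, c)))
        .collect();
    assert_eq!(indices, vec![0, 1, 2, 3, 4, 5]);
}
```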
@zhongdaor-nv zhongdaor-nv marked this pull request as ready for review October 30, 2025 06:01
@zhongdaor-nv zhongdaor-nv requested review from a team as code owners October 30, 2025 06:01
@zhongdaor-nv zhongdaor-nv changed the title Zhongdaor/dis 871 5581615 p0llm nim dynamo feature parity testingllama 33 feat: enable HTTP completion endpoint to accept arrays of prompts and generate multiple completions per prompt Oct 30, 2025
@github-actions github-actions bot added the feat label Oct 30, 2025
@coderabbitai
Contributor

coderabbitai bot commented Oct 30, 2025

Walkthrough

The pull request implements batch-aware handling for LLM completions by introducing detection logic that routes single-prompt and multi-prompt requests through dedicated code paths. Batch utilities extract and validate prompts, enforce a total choices limit, and support per-prompt choice remapping with streaming and annotation handling.

Changes

  • HTTP Service Batch Routing (lib/llm/src/http/service/openai.rs):
    Splits completions handling into completions_batch (multi-prompt) and completions_single (single-prompt) functions, with the main completions function delegating based on batch size detection. Handles per-prompt request setup, choice index remapping, stream merging, and metrics collection.
  • Protocol Layer Batch Utilities (lib/llm/src/protocols/openai/completions.rs):
    Adds get_prompt_batch_size and extract_single_prompt utilities for batch operations. Extends NvCreateCompletionRequest validation to check total choices via batch size. Implements raw_prompt() on the NvExtProvider trait to expose prompt data when enabled.
  • Validation Enforcement (lib/llm/src/protocols/openai/validate.rs):
    Introduces the MAX_TOTAL_CHOICES constant (128) and a validate_total_choices function to enforce an upper bound on the batch × n product.
  • Test Coverage (lib/llm/tests/openai_completions.rs, tests/frontend/test_completion_mocker_engine.py):
    Adds batch prompt utility tests covering batch sizing, per-prompt extraction, validation limits, and multi-prompt array handling. Updates the Python test to expect success (200) instead of error (500) for multi-prompt arrays.
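The validation described for lib/llm/src/protocols/openai/validate.rs can be sketched roughly as follows; the real function returns the crate's own error type rather than a String, so treat this as an illustration of the bound, not the actual signature.

```rust
// Upper bound on batch_size * n per request (matches the limit in the PR).
const MAX_TOTAL_CHOICES: usize = 128;

// Reject requests whose total number of choices would exceed the limit.
fn validate_total_choices(batch_size: usize, n: usize) -> Result<(), String> {
    let total = batch_size * n;
    if total > MAX_TOTAL_CHOICES {
        return Err(format!(
            "batch_size ({batch_size}) * n ({n}) = {total} exceeds maximum of {MAX_TOTAL_CHOICES}"
        ));
    }
    Ok(())
}

fn main() {
    assert!(validate_total_choices(2, 1).is_ok());
    assert!(validate_total_choices(64, 2).is_ok()); // exactly 128 is allowed
    assert!(validate_total_choices(65, 2).is_err()); // 130 > 128
}
```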

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Batch routing logic in service layer with stream merging and choice remapping requires careful verification of index calculations and state handling
  • Cross-field validation integration in NvCreateCompletionRequest needs confirmation of interaction with existing validators
  • Test expectation shift from error to success for multi-prompt arrays should be verified against any related constraints or documentation
  • New trait method raw_prompt() on NvExtProvider should be reviewed for implementation consistency across all implementors

Poem

🐰 Hop, skip, and batch we go,
Multiple prompts in a row,
Choices remapped, streams all flowing,
Validations keeping totals glowing,
Completions batch—a hop-timal show! 🌟

Pre-merge checks

✅ Passed checks (3 passed)

  • Docstring Coverage: Passed. Docstring coverage is 100.00%, which meets the required threshold of 80.00%.
  • Title Check: Passed. The title "feat: enable HTTP completion endpoint to accept arrays of prompts and generate multiple completions per prompt" directly describes the changeset: the changes across the HTTP service, validation logic, utility functions, and tests all implement batch-aware prompt handling while preserving backward compatibility for single-prompt requests. The title is specific, concise, and avoids vague terminology.
  • Description Check: Passed. The PR description follows the required template structure with all key sections: Overview, Details, Where should the reviewer start, Related Issues, and Test Plan with examples.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

Signed-off-by: zhongdaor <[email protected]>
@zhongdaor-nv
Contributor Author

updated examples/test plan in description

@rmccorm4
Contributor

rmccorm4 commented Nov 5, 2025

Syncing with main to pull in this change to hopefully fix all the failing deploy tests: https://github.com/ai-dynamo/dynamo/pull/4089/files

@rmccorm4
Contributor

rmccorm4 commented Nov 5, 2025

May need this one for deploy test failures: #4130

Contributor

@ryan-lempka ryan-lempka left a comment


Would like to see a test for empty prompt array and make sure it's properly rejected. Otherwise LGTM!

Signed-off-by: zhongdaor <[email protected]>
Contributor

@ryan-lempka ryan-lempka left a comment


LGTM

Comment on lines 109 to 110
// Fallback to empty string if index out of bounds
dynamo_async_openai::types::Prompt::String(String::new())
Contributor


When would index exceed bounds? Should this be an error case? If so we can return Result<Prompt> and return Err for these out of bounds index cases, and then where we call extract_single_prompt we should also error out if we got an error instead of processing with an empty string or empty array right?
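The Result-based variant suggested here could look roughly like this; `Prompt` is again a simplified stand-in for `dynamo_async_openai::types::Prompt`, and the error type is illustrative only.

```rust
// Simplified stand-in for dynamo_async_openai::types::Prompt.
#[derive(Clone, Debug, PartialEq)]
enum Prompt {
    String(String),
    StringArray(Vec<String>),
}

// Reviewer's suggestion: surface out-of-bounds indices as errors
// instead of silently substituting an empty prompt.
fn extract_single_prompt(prompt: &Prompt, index: usize) -> Result<Prompt, String> {
    match prompt {
        Prompt::String(s) if index == 0 => Ok(Prompt::String(s.clone())),
        Prompt::StringArray(arr) => arr
            .get(index)
            .map(|s| Prompt::String(s.clone()))
            .ok_or_else(|| format!("prompt index {index} out of bounds")),
        _ => Err(format!("prompt index {index} out of bounds")),
    }
}

fn main() {
    let batch = Prompt::StringArray(vec!["a".into(), "b".into()]);
    assert!(extract_single_prompt(&batch, 1).is_ok());
    assert!(extract_single_prompt(&batch, 2).is_err());
}
```

Callers then propagate the error with `?` rather than continuing with an empty string.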

Contributor Author


It actually won’t exceed the bounds. When I implemented this, I was treating it as a standalone module and added the bounds check for robustness. We can just let it error out if the index goes out of range.

