
Conversation

@ayushag-nv
Contributor

@ayushag-nv ayushag-nv commented Nov 6, 2025

Overview:

Problem:

  • There are multiple finish reasons in the stream response, caused by emitting empty or leftover chunks even after a previous chunk has already carried a finish reason (an illustrative chunk sequence is shown below).
  • JailedStream does not respect any finish_reason in the stream other than the last one.
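
For illustration, the tail of an affected stream looks roughly like this (representative values, not taken from the screenshot):

data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{"tool_calls":[...]},"finish_reason":"tool_calls"}]}
data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

Both chunks carry a finish_reason, which is exactly what this PR prevents.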

How vLLM handles it:

  • vLLM overwrites the finish_reason of the last chunk with tool_calls when the original finish_reason is stop and tool calls were generated.

Proposed Changes:

  • Separate the post-processing logic in JailedStream: take the jailed stream output and overwrite the finish_reason of the last chunk with ToolCalls if any choice contains tool calls (a minimal usage sketch follows below).
  • Use the original stream finish_reason wherever possible.
  • Unit tests are updated to validate that exactly one finish_reason appears across the stream and to check its value.
  • Added tests for edge cases related to finish_reason = length.
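
In usage terms, the post-processor wraps the jailed stream before collection; a minimal sketch, mirroring the test snippets discussed later in this thread:

        let jail = JailedStream::builder()
            .tool_call_parser("nemotron_deci")
            .build();

        let jailed_stream = jail.apply(input_stream);
        let fixed_stream = JailedStream::fix_finish_reason(jailed_stream);
        let results: Vec<_> = fixed_stream.collect().await;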

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • Improved handling of finish_reason signals in streaming chat completions to prevent duplicate emissions and ensure accurate state tracking across multiple choice responses.

@ayushag-nv ayushag-nv requested a review from a team as a code owner November 6, 2025 06:58
@copy-pr-bot

copy-pr-bot bot commented Nov 6, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: ayushag <[email protected]>
@ayushag-nv ayushag-nv requested a review from grahamking November 6, 2025 07:04
@ayushag-nv ayushag-nv self-assigned this Nov 6, 2025
@coderabbitai
Contributor

coderabbitai bot commented Nov 6, 2025

Walkthrough

These changes implement per-choice tracking of finish_reason emission states in streaming chat completion processing, preventing duplicate emissions across chunks. Additionally, test coverage is enhanced with assertions validating finish_reason values across tool parsing scenarios.

Changes

Cohort / File(s) — Summary

  • Streaming finish reason state tracking — lib/llm/src/protocols/openai/chat_completions/jail.rs: Introduces a finish_reason_emitted map to track per-choice emission state. Adds skip logic to prevent reprocessing choices with emitted finish reasons. Propagates state through various emission paths (responses, tool content, pass-throughs) and final stream end emissions.
  • Test coverage enhancements — lib/llm/tests/test_streaming_tool_parsers.rs: Adds a FinishReason import and validates finish reason values (Stop or ToolCalls) on final chunks. Extends assertions to verify tool_calls data presence alongside finish reasons across no-tool and tool-enabled scenarios.
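
A minimal sketch of the per-choice tracking idea (the struct and the marking method are illustrative; only the finish_reason_emitted map and has_emitted_finish_reason appear verbatim in the diff):

use std::collections::HashMap;

// Illustrative holder for per-choice state; the real code keeps this inside the jail's choice state.
#[derive(Default)]
struct ChoiceStates {
    finish_reason_emitted: HashMap<u32, bool>,
}

impl ChoiceStates {
    // Has a finish_reason already been emitted for this choice index?
    fn has_emitted_finish_reason(&self, index: u32) -> bool {
        self.finish_reason_emitted.get(&index).copied().unwrap_or(false)
    }

    // Record that a finish_reason went out for this choice (hypothetical method name).
    fn mark_finish_reason_emitted(&mut self, index: u32) {
        self.finish_reason_emitted.insert(index, true);
    }
}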

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Jail logic state propagation: Verify finish_reason_emitted state is correctly maintained across all emission paths (responses, tool content, pass-throughs, final emissions).
  • Double-emission prevention: Ensure the skip check properly prevents reprocessing and no valid chunks are unexpectedly filtered.
  • Test assertion coverage: Confirm finish reason assertions are consistent with expected behavior across all tool parsing paths.

Poem

🐰 A reason to finish, now tracked with great care,
No duplicate reasons shall float in the air!
Through streams and through choices, state flows with precision,
Each chunk meets its end with a clear, clean decision. ✨

Pre-merge checks

❌ Failed checks (1 warning)

  • Description check — ⚠️ Warning: The PR description is incomplete and missing critical sections required by the template. Resolution: Complete the 'Details' and 'Where should the reviewer start?' sections with a comprehensive description of changes and specific files to review. Update or remove the 'Related Issues' placeholder (#xxx).

✅ Passed checks (2 passed)

  • Title check — ✅ Passed: The title clearly and concisely describes the main problem being fixed: preventing multiple finish reasons in stream responses.
  • Docstring Coverage — ✅ Passed: Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
lib/llm/src/protocols/openai/chat_completions/jail.rs (1)

536-552: Avoid cloning the emission Vec before dispatch

tool_content_emissions isn’t reused after this call, so cloning the entire Vec just to satisfy ownership creates extra allocations on the hot path. Move the Vec directly into emit_choice_emissions instead.

-                            let responses = self.emit_choice_emissions(tool_content_emissions.clone(), chat_response, preserved_metadata);
+                            let responses = self.emit_choice_emissions(tool_content_emissions, chat_response, preserved_metadata);
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a629b86 and 2a029f8.

📒 Files selected for processing (2)
  • lib/llm/src/protocols/openai/chat_completions/jail.rs (7 hunks)
  • lib/llm/tests/test_streaming_tool_parsers.rs (10 hunks)
🧰 Additional context used
🧠 Learnings (6)
📓 Common learnings
Learnt from: oandreeva-nv
Repo: ai-dynamo/dynamo PR: 2989
File: lib/llm/src/block_manager/distributed/transfer.rs:6-6
Timestamp: 2025-09-18T21:47:44.143Z
Learning: For PR ai-dynamo/dynamo#2989, the ConnectorTransferBatcher architectural issues will be addressed in a follow-up PR by removing the duplicate batching logic and integrating distributed transfers with the existing TransferBatcher + LocalTransferManager pipeline, rather than adding bounded concurrency primitives like Semaphore.
📚 Learning: 2025-09-10T15:27:42.511Z
Learnt from: ayushag-nv
Repo: ai-dynamo/dynamo PR: 2932
File: lib/llm/src/preprocessor.rs:768-844
Timestamp: 2025-09-10T15:27:42.511Z
Learning: In the tool calling jail implementation in lib/llm/src/preprocessor.rs, the design intentionally emits only the first accumulated choice that contains tool calls during unjailing, dropping other accumulated choices. This is a deliberate design decision, not a bug.

Applied to files:

  • lib/llm/src/protocols/openai/chat_completions/jail.rs
📚 Learning: 2025-08-22T19:55:41.608Z
Learnt from: nachiketb-nvidia
Repo: ai-dynamo/dynamo PR: 2656
File: lib/llm/src/protocols/openai/chat_completions/delta.rs:320-327
Timestamp: 2025-08-22T19:55:41.608Z
Learning: The create_choice method exists on multiple different objects in the codebase. The DeltaGenerator::create_choice in lib/llm/src/protocols/openai/chat_completions/delta.rs has its own signature that was updated to include reasoning_content, but other objects in lib/llm/src/engines.rs have their own separate create_choice methods with different signatures that are not related to chat completions.

Applied to files:

  • lib/llm/src/protocols/openai/chat_completions/jail.rs
  • lib/llm/tests/test_streaming_tool_parsers.rs
📚 Learning: 2025-08-22T19:55:41.608Z
Learnt from: nachiketb-nvidia
Repo: ai-dynamo/dynamo PR: 2656
File: lib/llm/src/protocols/openai/chat_completions/delta.rs:320-327
Timestamp: 2025-08-22T19:55:41.608Z
Learning: There are two separate DeltaGenerator classes in the codebase: one for chat completions (lib/llm/src/protocols/openai/chat_completions/delta.rs with object "chat.completion.chunk") and one for text completions (lib/llm/src/protocols/openai/completions/delta.rs with object "text_completion"). They have different create_choice method signatures and serve different OpenAI API endpoints. The reasoning parsing functionality is only relevant to the chat completions DeltaGenerator.

Applied to files:

  • lib/llm/src/protocols/openai/chat_completions/jail.rs
  • lib/llm/tests/test_streaming_tool_parsers.rs
📚 Learning: 2025-09-11T03:24:47.820Z
Learnt from: kthui
Repo: ai-dynamo/dynamo PR: 3004
File: lib/runtime/src/pipeline/network/ingress/push_handler.rs:271-277
Timestamp: 2025-09-11T03:24:47.820Z
Learning: In lib/runtime/src/pipeline/network/ingress/push_handler.rs, the maintainer prefers to keep the existing error comparison logic using format!("{:?}", err) == STREAM_ERR_MSG unchanged until proper error types are implemented, even though it has technical debt. Avoid suggesting changes to working legacy code that will be refactored later.

Applied to files:

  • lib/llm/tests/test_streaming_tool_parsers.rs
📚 Learning: 2025-09-10T05:04:58.417Z
Learnt from: ayushag-nv
Repo: ai-dynamo/dynamo PR: 2932
File: lib/llm/src/protocols/openai/chat_completions/aggregator.rs:66-86
Timestamp: 2025-09-10T05:04:58.417Z
Learning: In the dynamo codebase, tool call chunks from streaming responses always contain complete tool calls (one chunk = one tool call), unlike standard OpenAI streaming where tool calls can be fragmented across multiple chunks. The convert_tool_chunk_to_message_tool_call function correctly assumes complete tool call data in each chunk.

Applied to files:

  • lib/llm/tests/test_streaming_tool_parsers.rs
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: clippy (.)
  • GitHub Check: clippy (lib/runtime/examples)
  • GitHub Check: clippy (lib/bindings/python)
  • GitHub Check: clippy (launch/dynamo-run)
  • GitHub Check: Build and Test - dynamo

Comment on lines 477 to 486
// if we've already emitted a finish_reason for this choice,
// skip any subsequent chunks with finish_reason
if choice.finish_reason.is_some() && choice_states.has_emitted_finish_reason(choice.index) {
    tracing::debug!(
        "Skipping chunk with finish_reason {:?} for choice {} - already emitted finish_reason",
        choice.finish_reason,
        choice.index
    );
    continue;
}
Contributor


At a glance, the logic in vllm seems to be that they actually emit finish_reason=stop, but they replace it with finish_reason=tool_calls if it came from a tool calls response: https://github.com/vllm-project/vllm/blob/a404e2c0f1bf100d28453a5a2ab7bd2a29d9aa31/vllm/entrypoints/openai/serving_chat.py#L1512-L1529

Here we seem to generate an extra chunk even though we've already "finished" the response - are we doing something wrong? Are we hiding a bug by doing this marking/filtering? Why are we emitting 2 "finish" responses in general?

Contributor

@rmccorm4 rmccorm4 Nov 6, 2025


TLDR;

  • It seems like vllm emits 1 final response and modifies finish_reason contextually for tool calling
  • It seems like we are emitting 2 or more final responses, and just skipping/hiding it here. This feels off to me, unless I'm missing something and this is also what other solutions are doing. Why don't we (aggregator, postprocessor, engine, etc.) know that we're done after emitting the finish_reason=tool_calls response?

Contributor Author


Actually, yes. There is always one empty chunk emitted in the stream, and in some cases there are also trailing leftover markers that get emitted.

I haven't seen any stream continue after tool calls, so for now this is a safe fix to have.

Contributor


Normally, Sglang/vLLM returns a package with empty content containing a finish_reason=stop at the end of a streaming request. (In special cases, such as when speculative sampling is enabled or when the output reaches max_tokens, the content of the last package is not empty.)

When we enable tool-call and JailedStream, packages with non-empty content are processed by JailedStream and merged into a single tool-call chunk with finish_reason == "tool_calls". But the package with empty content (containing finish_reason == "stop") is returned directly. This is why we observe two packages with finish_reason.
@rmccorm4 @ayushag-nv

Contributor


Is there a simpler way to solve this problem by directly skipping the empty packet?


let jailed_stream = jail.apply(input_stream);
let fixed_stream = JailedStream::fix_finish_reason(jailed_stream);
let results: Vec<_> = fixed_stream.collect().await;
Contributor

@rmccorm4 rmccorm4 Nov 6, 2025


For the tool calling specific tests, can we use apply_tool_calling_jail or a similar shared helper used in both real code and tests that encapsulates these steps together so we know for sure that we're testing the same function that would be used in deployment?

Basically, if you forgot to add the JailedStream::fix_finish_reason(jailed_stream) line in one of the tests, or, vice versa, you put it in the test but it accidentally got removed from apply_tool_calling_jail, we may not catch it in tests, right?

// Current in all tests
        let jail = JailedStream::builder()
            .tool_call_parser("nemotron_deci")
            .build();

        let jailed_stream = jail.apply(input_stream);
        let fixed_stream = JailedStream::fix_finish_reason(jailed_stream);
        let results: Vec<_> = fixed_stream.collect().await;
// Proposed for all tool calling tests (something like this, not necessarily exact)
        let fixed_stream = apply_tool_calling_jail("nemotron_deci", input_stream);
        let results: Vec<_> = fixed_stream.collect().await;
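
One possible shape for that shared helper; a hypothetical sketch (the item type is written as Annotated<NvCreateChatCompletionStreamResponse>, an assumption based on the response.data access elsewhere in this diff):

        // Hypothetical shared helper; bundles the jail and its post-processing so tests
        // and production code cannot drift apart.
        fn apply_tool_calling_jail(
            parser: &str,
            input_stream: impl Stream<Item = Annotated<NvCreateChatCompletionStreamResponse>>,
        ) -> impl Stream<Item = Annotated<NvCreateChatCompletionStreamResponse>> {
            let jail = JailedStream::builder()
                .tool_call_parser(parser)
                .build();

            JailedStream::fix_finish_reason(jail.apply(input_stream))
        }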

Contributor Author


Yes this makes more sense.

Contributor Author


fixed


let jailed_stream = jail.apply(input_stream);
let fixed_stream = JailedStream::fix_finish_reason(jailed_stream);
let results: Vec<_> = fixed_stream.collect().await;
Contributor


For non-tool calling tests this fix_finish_reason isn't actually relevant currently, is it? Or are there other edge cases where we need to fix the finish reason?

Contributor Author


No other reason. But now I will wrap the whole process: jail.apply() becomes jail.apply_with_finish_reason(), which will handle both.

Comment on lines 773 to 782
// If this chunk has finish_reason and the choice had tool calls, override to ToolCalls
if let Some(ref mut data) = response.data {
    for choice in &mut data.choices {
        if choice.finish_reason.is_some()
            && has_tool_calls_per_choice.get(&choice.index).copied().unwrap_or(false)
        {
            choice.finish_reason = Some(FinishReason::ToolCalls);
        }
    }
}
Contributor


The vLLM logic appears stricter than this: https://github.com/vllm-project/vllm/blob/a404e2c0f1bf100d28453a5a2ab7bd2a29d9aa31/vllm/entrypoints/openai/serving_chat.py#L1515-L1519

            # In OpenAI's API, when a tool is called, the finish_reason is:
            # "tool_calls" for "auto" or "required" tool calls,
            # and "stop" for named tool calls.
            is_finish_reason_tool_calls = auto_tools_called or (
                request.tool_choice
                and request.tool_choice == "required"
                and output.finish_reason == "stop"
            )

For example, here we check whether any finish_reason is set, but vLLM explicitly checks whether the finish reason is stop. Is there an edge case where we may overwrite the finish_reason undesirably? What happens if it is length, for example, and is that the correct behavior?
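
For comparison, a rough Rust transliteration of vLLM's stricter condition (a sketch only; the boolean inputs are parameters here because this PR derives them differently):

        // Not the PR's code: mirrors vLLM's check — the finish becomes tool_calls if tools
        // were auto-called, or if tool_choice == "required" and the model stopped normally.
        fn is_finish_reason_tool_calls(
            auto_tools_called: bool,
            tool_choice_is_required: bool,
            finish_reason: Option<FinishReason>,
        ) -> bool {
            auto_tools_called
                || (tool_choice_is_required && finish_reason == Some(FinishReason::Stop))
        }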

Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For our case, it is correct. The jail stream ensures that we only get complete tool calls, so it is safe to overwrite the finish reason.

For cases like length, I also thought about it: in that case tool calls will not be populated, and hence we will not hit the overwrite condition.

As for the request.tool_choice condition, that is handled by the jail stream post-processor itself.

Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add a small test case for the length finish reason, covering: (1) the tool call is fully parsed before max length is reached, (2) max length is reached before the tool call finishes parsing, and (3) max length is reached right when the tool call finishes parsing. A hedged sketch of the second scenario follows.
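
Sketch (helper names such as create_mock_stream_with_finish_reason and apply_tool_calling_jail are illustrative, as is the tool-call fragment; the real test may differ):

        #[tokio::test]
        async fn finish_reason_length_with_incomplete_tool_call() {
            // The final chunk carries finish_reason = Length while the tool-call markers are still open.
            let input_stream = create_mock_stream_with_finish_reason(
                vec!["<TOOLCALL>[{\"name\": \"get_weather\", \"argu"],
                FinishReason::Length,
            );

            let fixed_stream = apply_tool_calling_jail("nemotron_deci", input_stream);
            let results: Vec<_> = fixed_stream.collect().await;

            // No complete tool call was parsed, so the overwrite condition must not trigger:
            // the final finish_reason stays Length instead of becoming ToolCalls.
            let last = results.last().expect("stream should not be empty");
            let choices = &last.data.as_ref().expect("final chunk should carry data").choices;
            assert_eq!(choices[0].finish_reason, Some(FinishReason::Length));
        }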

Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have those kinds of chunks right now. Maybe part of another PR.

Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added test @elyasmnvidian

Signed-off-by: ayushag <[email protected]>
Comment on lines 769 to 770
// Post-process to set finish reason to ToolCalls for the last chunk if any tool calls were emitted
JailedStream::fix_finish_reason(jailed_stream)
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if JailedStream fails, or no tool was parsed? Does this function get skipped?

Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it does not get skipped.

Contributor

@elyasmnvidian elyasmnvidian left a comment


Address comments please

Signed-off-by: ayushag <[email protected]>
Signed-off-by: ayushag <[email protected]>
