Conversation


@GuanLuo GuanLuo commented Jan 8, 2026

Overview:

Before this change, an error occurring during response streaming was not properly returned; it was silently ignored. As a result, a malformed response produced no error on the client side, only failure logging on the worker side, whereas the non-streaming path surfaced the error properly.

After this change, the client receives the error message (in the case of the Triton client, the exception is captured in the response callback).

See test_tensor_mocker_engine.py::test_model_stream_infer_failure, where an exception is expected. Before the change, user_data._completed_requests.get(timeout=5) would only raise queue.Empty on timeout, indicating that the worker error was not propagated back to the client.
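
The gist of the fix can be sketched in plain Python (the actual change is in Rust; the names StreamResponse, respond_before, and respond_after here are illustrative, not the real types):

```python
import logging

class StreamResponse:
    """Minimal stand-in for a streaming response message (hypothetical)."""
    def __init__(self, data=None, error_message=None):
        self.data = data
        self.error_message = error_message

def respond_before(deltas, send):
    # Before the fix: a failed stream item was only logged on the worker
    # side, so the client never saw any response for it.
    for delta in deltas:
        if isinstance(delta, Exception):
            logging.error("stream item failed: %s", delta)
            continue
        send(StreamResponse(data=delta))

def respond_after(deltas, send):
    # After the fix: failures are converted into responses carrying an
    # error message, so the client observes them in the stream.
    for delta in deltas:
        if isinstance(delta, Exception):
            send(StreamResponse(error_message=str(delta)))
        else:
            send(StreamResponse(data=delta))
```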

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • Bug Fixes

    • Enhanced error handling in gRPC streaming operations to properly capture and report errors instead of silent failures or crashes.
  • Tests

    • Added test coverage for error scenarios including malformed responses and exceptions in both single and streaming inference operations.


@GuanLuo GuanLuo marked this pull request as ready for review January 8, 2026 03:21
@GuanLuo GuanLuo requested review from a team as code owners January 8, 2026 03:21
@github-actions github-actions bot added the fix label Jan 8, 2026

GuanLuo commented Jan 8, 2026

@CodeRabbit review


coderabbitai bot commented Jan 8, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


coderabbitai bot commented Jan 8, 2026

Walkthrough

The pull request adds explicit error handling to gRPC streaming paths in the service layer, converting stream errors into observable responses. Test coverage is extended with failure scenario testing for both single-shot and streaming inference modes, including malformed responses and exception cases.

Changes

  • lib/llm/src/grpc/service/kserve.rs (gRPC streaming error handling): Replaced direct stream.next() patterns with delta-based handling using while let Some(delta) = stream.next().await. Errors from stream items are now captured via delta.ok() and converted into ModelStreamInferResponse objects carrying error messages instead of being silently dropped. Successful-data semantics are preserved while error visibility improves.

  • tests/frontend/grpc/echo_tensor_worker.py (worker failure scenarios): Added parameter-based control flow for malformed_response (modifies the tensor data structure and returns early) and raise_exception (raises an intentional ValueError), enabling tests of both the error-response and exception-handling paths.

  • tests/frontend/grpc/test_tensor_mocker_engine.py, tests/frontend/grpc/triton_echo_client.py (streaming test coverage): Introduced two new test functions covering failure scenarios for both single-infer and streaming-infer paths. Added a run_stream_infer() method with queue-based result aggregation and callback-driven error handling for asynchronous streaming. Imports extended to include queue, functools.partial, and numpy.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Streams now speak their errors clear,
No more silent fears to bear!
With queues and callbacks, tests align,
Error paths now brightly shine—
Robust responses, defined design! ✨

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

  • Description check ⚠️ Warning: The PR description contains only template placeholders with no actual content; all required sections (Overview, Details, Where to start, Related Issues) are empty. Resolution: fill in all template sections with actual implementation details, file-specific guidance for reviewers, and the related GitHub issue number from the branch name.
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 25.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (1 passed)

  • Title check ✅ Passed: The title clearly describes the main change, propagating errors to the client in KServe stream inference, which aligns with the core fix shown in the changeset.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In @tests/frontend/grpc/test_tensor_mocker_engine.py:
- Around lines 213-218: The else branch of the exception-assertion block uses
"assert False, ...", which is stripped under Python optimization (-O). Replace
it with an unconditional "raise AssertionError('Expected exception was not
raised')" so the failure is reported regardless of optimization. Keep the same
message and the same structure of the block that checks "if
'malformed_response' in request_params ... elif 'raise_exception' in
request_params ... else", still referring to the local variables
request_params and excinfo.

In @tests/frontend/grpc/triton_echo_client.py:
- Around lines 113-116: Change the blocking call
user_data._completed_requests.get() to use a timeout (e.g.,
user_data._completed_requests.get(timeout=10)), catch the queue.Empty
exception from that call, and fail the test with an explicit assertion or
error message indicating a timeout (e.g., "Stream inference timed out after
Xs"). Keep the existing check that the returned data_item is not an Exception,
and import the appropriate Empty exception class so the test fails fast
instead of hanging.
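
Both suggestions are small, mechanical changes; a combined sketch (function and variable names assumed from the review text, and the substring checks on the error message are illustrative):

```python
import queue

def check_result(user_data, timeout=10):
    # Suggestion 2: bound the blocking get() and turn a timeout into an
    # explicit failure rather than an indefinite hang.
    try:
        data_item = user_data._completed_requests.get(timeout=timeout)
    except queue.Empty:
        raise AssertionError(f"Stream inference timed out after {timeout}s")
    assert not isinstance(data_item, Exception), data_item
    return data_item

def assert_expected_error(request_params, error_text):
    # Suggestion 1: raise AssertionError explicitly instead of using
    # "assert False", which python -O strips.
    if "malformed_response" in request_params:
        assert "malformed" in error_text  # illustrative message check
    elif "raise_exception" in request_params:
        assert "intentional" in error_text  # illustrative message check
    else:
        raise AssertionError("Expected exception was not raised")
```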
🧹 Nitpick comments (3)
tests/frontend/grpc/triton_echo_client.py (2)

87-101: Consider extracting UserData and callback to a shared test utility.

The UserData class and callback function are duplicated between this file and test_tensor_mocker_engine.py. Consider extracting them to a shared module to reduce duplication and ensure consistent behavior.


138-143: Consider adding run_stream_infer() to the main block.

The main block runs check_health, run_infer, and get_config for manual testing, but omits the new run_stream_infer() method. Consider adding it for complete coverage when running the script directly.

     client.check_health()
     client.run_infer()
+    client.run_stream_infer()
     client.get_config()
     print("Triton echo client ran successfully.")
tests/frontend/grpc/test_tensor_mocker_engine.py (1)

163-168: Update docstrings to describe failure testing.

The docstrings for both test_model_infer_failure and test_model_stream_infer_failure describe echo/identity behavior, but the tests actually verify error propagation for malformed responses and exceptions. Consider updating them to reflect the actual test intent.

"""Test that gRPC streaming inference properly propagates errors to the client.

Verifies that both malformed_response and raise_exception scenarios result in
appropriate error messages being received by the client through the stream.
"""
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8b5c803 and d5fd6e1.

📒 Files selected for processing (4)
  • lib/llm/src/grpc/service/kserve.rs
  • tests/frontend/grpc/echo_tensor_worker.py
  • tests/frontend/grpc/test_tensor_mocker_engine.py
  • tests/frontend/grpc/triton_echo_client.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-11T03:24:47.820Z
Learnt from: kthui
Repo: ai-dynamo/dynamo PR: 3004
File: lib/runtime/src/pipeline/network/ingress/push_handler.rs:271-277
Timestamp: 2025-09-11T03:24:47.820Z
Learning: In lib/runtime/src/pipeline/network/ingress/push_handler.rs, the maintainer prefers to keep the existing error comparison logic using format!("{:?}", err) == STREAM_ERR_MSG unchanged until proper error types are implemented, even though it has technical debt. Avoid suggesting changes to working legacy code that will be refactored later.

Applied to files:

  • lib/llm/src/grpc/service/kserve.rs
🧬 Code graph analysis (2)
tests/frontend/grpc/test_tensor_mocker_engine.py (1)
tests/frontend/grpc/triton_echo_client.py (3)
  • run_stream_infer (69-128)
  • UserData (87-89)
  • callback (96-101)
tests/frontend/grpc/triton_echo_client.py (1)
tests/frontend/grpc/test_tensor_mocker_engine.py (2)
  • UserData (177-181)
  • callback (188-193)
🪛 Ruff (0.14.10)
tests/frontend/grpc/test_tensor_mocker_engine.py

218-218: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)

tests/frontend/grpc/echo_tensor_worker.py

87-87: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
  • GitHub Check: sglang (arm64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: sglang (amd64)
  • GitHub Check: vllm (amd64)
  • GitHub Check: Build and Test - dynamo
  • GitHub Check: clippy (launch/dynamo-run)
  • GitHub Check: tests (launch/dynamo-run)
  • GitHub Check: clippy (lib/runtime/examples)
  • GitHub Check: clippy (lib/bindings/python)
  • GitHub Check: tests (lib/bindings/python)
  • GitHub Check: tests (.)
  • GitHub Check: clippy (.)
  • GitHub Check: tests (lib/runtime/examples)
🔇 Additional comments (5)
lib/llm/src/grpc/service/kserve.rs (2)

425-435: Same pattern as above - verify delta.ok() semantics.

This block mirrors the error handling pattern at lines 364-374. The same concern about the delta.ok() method applies here. Ensure consistency in how errors are handled across both streaming paths.


364-374: No changes needed. The code is correct.

The pattern match delta.ok() is intentional. The Annotated::ok(self) -> Result<Self, String> method is a custom implementation that extracts error information from annotated responses: if the annotation has an event="error" field, it returns Err(message), otherwise Ok(self). This design properly propagates streaming errors as ModelStreamInferResponse with error_message set, allowing the stream to continue processing subsequent items after an error.

tests/frontend/grpc/echo_tensor_worker.py (1)

78-87: LGTM - Test helper for error scenarios.

The implementation correctly adds two test paths:

  • malformed_response: Corrupts tensor data to trigger downstream parsing errors
  • raise_exception: Raises an intentional exception to test exception propagation

These align with the PR's goal of testing error propagation in streaming inference. The static analysis hint about TRY003 (long exception message) is acceptable in test code.
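
A minimal sketch of how a worker can branch on these request parameters (the real worker's structure differs; make_response and the response shape here are illustrative):

```python
def make_response(request_params, tensor_data):
    """Echo-worker core with injectable failure modes (illustrative)."""
    if request_params.get("raise_exception"):
        # Exercise the exception-propagation path.
        raise ValueError("intentional failure for testing")
    response = {"outputs": [{"name": "echo", "data": tensor_data}]}
    if request_params.get("malformed_response"):
        # Corrupt the tensor payload so client-side parsing fails,
        # exercising the error-response path.
        response["outputs"][0]["data"] = None
        return response
    return response
```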

tests/frontend/grpc/test_tensor_mocker_engine.py (2)

112-113: LGTM - Extended test_echo to cover streaming.

Good addition to ensure the streaming path is exercised in the happy-path test.


116-148: LGTM - Good parameterized test for single-shot inference failures.

The test correctly verifies that both malformed_response and raise_exception scenarios propagate meaningful error messages to the client.

Signed-off-by: Guan Luo <[email protected]>
@rmccorm4 rmccorm4 left a comment


LGTM other than:

  1. Address the logging comment
  2. Please add a before/after example of an error response to the PR description for better context

Signed-off-by: Guan Luo <[email protected]>
@GuanLuo GuanLuo merged commit bb8eaa2 into main Jan 9, 2026
37 of 38 checks passed
@GuanLuo GuanLuo deleted the gluo/dis-1133-bug-kserve-infer-not-returning-error-message-on-malformed branch January 9, 2026 06:31
GuanLuo added a commit that referenced this pull request Jan 9, 2026
nv-anants pushed a commit that referenced this pull request Jan 9, 2026
