Conversation


@GuanLuo GuanLuo commented Jan 8, 2026

Overview:

Before this change, an error occurring during response streaming was not properly returned; it was silently ignored. As a result, a malformed response produced no error on the client side, only failure logging on the worker side, whereas the non-streaming path surfaced the error properly.

After this change, the client receives the error message (in the case of the Triton client, the exception is captured in the response callback).

See test_tensor_mocker_engine.py::test_model_stream_infer_failure, where an exception is expected. Before the change, user_data._completed_requests.get(timeout=5) would only raise queue.Empty on timeout, indicating that the worker error was not propagated back to the client.
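
The gist of the fix can be sketched in plain Python (the actual change is in Rust; the names StreamResponse, respond_before, and respond_after here are illustrative, not the real types):

```python
import logging

class StreamResponse:
    """Minimal stand-in for a streaming response message (hypothetical)."""
    def __init__(self, data=None, error_message=None):
        self.data = data
        self.error_message = error_message

def respond_before(deltas, send):
    # Before the fix: a failed stream item was only logged on the worker
    # side, so the client never saw any response for it.
    for delta in deltas:
        if isinstance(delta, Exception):
            logging.error("stream item failed: %s", delta)
            continue
        send(StreamResponse(data=delta))

def respond_after(deltas, send):
    # After the fix: failures are converted into responses carrying an
    # error message, so the client observes them in the stream.
    for delta in deltas:
        if isinstance(delta, Exception):
            send(StreamResponse(error_message=str(delta)))
        else:
            send(StreamResponse(data=delta))
```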

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • Bug Fixes

    • Enhanced error handling in gRPC streaming operations to properly capture and report errors instead of silent failures or crashes.
  • Tests

    • Added test coverage for error scenarios including malformed responses and exceptions in both single and streaming inference operations.


@GuanLuo GuanLuo marked this pull request as ready for review January 8, 2026 03:21
@GuanLuo GuanLuo requested review from a team as code owners January 8, 2026 03:21
@github-actions github-actions bot added the fix label Jan 8, 2026

GuanLuo commented Jan 8, 2026

@CodeRabbit review


coderabbitai bot commented Jan 8, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


coderabbitai bot commented Jan 8, 2026

Walkthrough

The pull request adds explicit error handling to gRPC streaming paths in the service layer, converting stream errors into observable responses. Test coverage is extended with failure scenario testing for both single-shot and streaming inference modes, including malformed responses and exception cases.

Changes

  • lib/llm/src/grpc/service/kserve.rs (gRPC streaming error handling): Replaced direct stream.next() patterns with delta-based handling using while let Some(delta) = stream.next().await. Errors from stream items are now captured via delta.ok() and converted into ModelStreamInferResponse objects carrying error messages instead of being silently dropped. Successful-data semantics are preserved while error visibility improves.

  • tests/frontend/grpc/echo_tensor_worker.py (worker failure scenarios): Added parameter-based control flow for malformed_response (modifies the tensor data structure and returns early) and raise_exception (raises an intentional ValueError), enabling tests of both the error-response and exception-handling paths.

  • tests/frontend/grpc/test_tensor_mocker_engine.py, tests/frontend/grpc/triton_echo_client.py (streaming test coverage): Introduced two new test functions covering failure scenarios for both single-infer and streaming-infer paths. Added a run_stream_infer() method with queue-based result aggregation and callback-driven error handling for asynchronous streaming. Imports extended to include queue, functools.partial, and numpy.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Streams now speak their errors clear,
No more silent fears to bear!
With queues and callbacks, tests align,
Error paths now brightly shine—
Robust responses, defined design! ✨

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

  • Description check ⚠️ Warning: The PR description contains only template placeholders with no actual content; all required sections (Overview, Details, Where to start, Related Issues) are empty. Resolution: fill in all template sections with actual implementation details, file-specific guidance for reviewers, and the related GitHub issue number from the branch name.
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 25.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (1 passed)

  • Title check ✅ Passed: The title clearly describes the main change, propagating errors to the client in KServe stream inference, which aligns with the core fix shown in the changeset.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In @tests/frontend/grpc/test_tensor_mocker_engine.py:
- Around lines 213-218: The else branch of the exception-assertion block uses
"assert False, ...", which is stripped under Python optimization (-O). Replace
it with an unconditional "raise AssertionError('Expected exception was not
raised')" so the failure is reported regardless of optimization. Keep the same
message and the same structure of the block that checks "if
'malformed_response' in request_params ... elif 'raise_exception' in
request_params ... else", still referring to the local variables
request_params and excinfo.

In @tests/frontend/grpc/triton_echo_client.py:
- Around lines 113-116: Change the blocking call
user_data._completed_requests.get() to use a timeout (e.g.,
user_data._completed_requests.get(timeout=10)), catch the queue.Empty
exception from that call, and fail the test with an explicit assertion or
error message indicating a timeout (e.g., "Stream inference timed out after
Xs"). Keep the existing check that the returned data_item is not an Exception,
and import the appropriate Empty exception class so the test fails fast
instead of hanging.
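
Both suggestions are small, mechanical changes; a combined sketch (function and variable names assumed from the review text, and the substring checks on the error message are illustrative):

```python
import queue

def check_result(user_data, timeout=10):
    # Suggestion 2: bound the blocking get() and turn a timeout into an
    # explicit failure rather than an indefinite hang.
    try:
        data_item = user_data._completed_requests.get(timeout=timeout)
    except queue.Empty:
        raise AssertionError(f"Stream inference timed out after {timeout}s")
    assert not isinstance(data_item, Exception), data_item
    return data_item

def assert_expected_error(request_params, error_text):
    # Suggestion 1: raise AssertionError explicitly instead of using
    # "assert False", which python -O strips.
    if "malformed_response" in request_params:
        assert "malformed" in error_text  # illustrative message check
    elif "raise_exception" in request_params:
        assert "intentional" in error_text  # illustrative message check
    else:
        raise AssertionError("Expected exception was not raised")
```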
🧹 Nitpick comments (3)
tests/frontend/grpc/triton_echo_client.py (2)

87-101: Consider extracting UserData and callback to a shared test utility.

The UserData class and callback function are duplicated between this file and test_tensor_mocker_engine.py. Consider extracting them to a shared module to reduce duplication and ensure consistent behavior.


138-143: Consider adding run_stream_infer() to the main block.

The main block runs check_health, run_infer, and get_config for manual testing, but omits the new run_stream_infer() method. Consider adding it for complete coverage when running the script directly.

     client.check_health()
     client.run_infer()
+    client.run_stream_infer()
     client.get_config()
     print("Triton echo client ran successfully.")
tests/frontend/grpc/test_tensor_mocker_engine.py (1)

163-168: Update docstrings to describe failure testing.

The docstrings for both test_model_infer_failure and test_model_stream_infer_failure describe echo/identity behavior, but the tests actually verify error propagation for malformed responses and exceptions. Consider updating them to reflect the actual test intent.

"""Test that gRPC streaming inference properly propagates errors to the client.

Verifies that both malformed_response and raise_exception scenarios result in
appropriate error messages being received by the client through the stream.
"""
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8b5c803 and d5fd6e1.

📒 Files selected for processing (4)
  • lib/llm/src/grpc/service/kserve.rs
  • tests/frontend/grpc/echo_tensor_worker.py
  • tests/frontend/grpc/test_tensor_mocker_engine.py
  • tests/frontend/grpc/triton_echo_client.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-11T03:24:47.820Z
Learnt from: kthui
Repo: ai-dynamo/dynamo PR: 3004
File: lib/runtime/src/pipeline/network/ingress/push_handler.rs:271-277
Timestamp: 2025-09-11T03:24:47.820Z
Learning: In lib/runtime/src/pipeline/network/ingress/push_handler.rs, the maintainer prefers to keep the existing error comparison logic using format!("{:?}", err) == STREAM_ERR_MSG unchanged until proper error types are implemented, even though it has technical debt. Avoid suggesting changes to working legacy code that will be refactored later.

Applied to files:

  • lib/llm/src/grpc/service/kserve.rs
🧬 Code graph analysis (2)
tests/frontend/grpc/test_tensor_mocker_engine.py (1)
tests/frontend/grpc/triton_echo_client.py (3)
  • run_stream_infer (69-128)
  • UserData (87-89)
  • callback (96-101)
tests/frontend/grpc/triton_echo_client.py (1)
tests/frontend/grpc/test_tensor_mocker_engine.py (2)
  • UserData (177-181)
  • callback (188-193)
🪛 Ruff (0.14.10)
tests/frontend/grpc/test_tensor_mocker_engine.py

218-218: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)

tests/frontend/grpc/echo_tensor_worker.py

87-87: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
  • GitHub Check: sglang (arm64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: sglang (amd64)
  • GitHub Check: vllm (amd64)
  • GitHub Check: Build and Test - dynamo
  • GitHub Check: clippy (launch/dynamo-run)
  • GitHub Check: tests (launch/dynamo-run)
  • GitHub Check: clippy (lib/runtime/examples)
  • GitHub Check: clippy (lib/bindings/python)
  • GitHub Check: tests (lib/bindings/python)
  • GitHub Check: tests (.)
  • GitHub Check: clippy (.)
  • GitHub Check: tests (lib/runtime/examples)
🔇 Additional comments (5)
lib/llm/src/grpc/service/kserve.rs (2)

425-435: Same pattern as above - verify delta.ok() semantics.

This block mirrors the error handling pattern at lines 364-374. The same concern about the delta.ok() method applies here. Ensure consistency in how errors are handled across both streaming paths.


364-374: No changes needed. The code is correct.

The pattern match delta.ok() is intentional. The Annotated::ok(self) -> Result<Self, String> method is a custom implementation that extracts error information from annotated responses: if the annotation has an event="error" field, it returns Err(message), otherwise Ok(self). This design properly propagates streaming errors as ModelStreamInferResponse with error_message set, allowing the stream to continue processing subsequent items after an error.

tests/frontend/grpc/echo_tensor_worker.py (1)

78-87: LGTM - Test helper for error scenarios.

The implementation correctly adds two test paths:

  • malformed_response: Corrupts tensor data to trigger downstream parsing errors
  • raise_exception: Raises an intentional exception to test exception propagation

These align with the PR's goal of testing error propagation in streaming inference. The static analysis hint about TRY003 (long exception message) is acceptable in test code.
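
A minimal sketch of how a worker can branch on these request parameters (the real worker's structure differs; make_response and the response shape here are illustrative):

```python
def make_response(request_params, tensor_data):
    """Echo-worker core with injectable failure modes (illustrative)."""
    if request_params.get("raise_exception"):
        # Exercise the exception-propagation path.
        raise ValueError("intentional failure for testing")
    response = {"outputs": [{"name": "echo", "data": tensor_data}]}
    if request_params.get("malformed_response"):
        # Corrupt the tensor payload so client-side parsing fails,
        # exercising the error-response path.
        response["outputs"][0]["data"] = None
        return response
    return response
```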

tests/frontend/grpc/test_tensor_mocker_engine.py (2)

112-113: LGTM - Extended test_echo to cover streaming.

Good addition to ensure the streaming path is exercised in the happy-path test.


116-148: LGTM - Good parameterized test for single-shot inference failures.

The test correctly verifies that both malformed_response and raise_exception scenarios propagate meaningful error messages to the client.

Signed-off-by: Guan Luo <[email protected]>
@rmccorm4 rmccorm4 left a comment


LGTM other than:

  1. Address the logging comment
  2. Please add a before/after example of an error response to the PR description for better context

Signed-off-by: Guan Luo <[email protected]>
@GuanLuo GuanLuo merged commit bb8eaa2 into main Jan 9, 2026
37 of 38 checks passed
@GuanLuo GuanLuo deleted the gluo/dis-1133-bug-kserve-infer-not-returning-error-message-on-malformed branch January 9, 2026 06:31
GuanLuo added a commit that referenced this pull request Jan 9, 2026
nv-anants pushed a commit that referenced this pull request Jan 9, 2026
