Conversation


@tzulingk tzulingk commented Dec 6, 2025

Background

Currently, validation errors in _check_arguments() raise ValueError. When the Dynamo handler catches this exception it shuts the engine down, instead of returning a proper HTTP 400 error response to the client.

Why This Fix is Needed

  • RequestError is caught by the executor's error handling system and converted to proper error responses
  • Prevents entire engine shutdown on invalid user input
  • Maintains consistency with other validation errors (e.g., default_max_tokens) which already use RequestError
  • Allows graceful error handling in production environments
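
In essence, the change is a one-line swap plus an import. A simplified sketch (the real _check_arguments takes more parameters and also covers the TRT backend path; names here are abbreviated):

from tensorrt_llm.executor.utils import RequestError  # newly imported

def _check_arguments(self, prompt_len: int, query_len: int, max_num_tokens: int) -> None:
    # PyTorch-backend length validation, surrounding logic omitted.
    if prompt_len + query_len > max_num_tokens:
        # Previously raised ValueError, which the Dynamo handler treated as fatal.
        raise RequestError(
            f"The sum of prompt length ({prompt_len}), query length ({query_len}) "
            f"should not exceed max_num_tokens ({max_num_tokens})")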

Testing

Build the image and run the container

./container/build.sh --framework trtllm
./container/run.sh -it --mount-workspace --gpus all --framework trtllm

agg.yaml

Update the agg.yaml file with the following settings:
tensor_parallel_size: 1
moe_expert_parallel_size: 1
enable_attention_dp: false
max_num_tokens: 23
max_batch_size: 16
trust_remote_code: true
backend: pytorch
enable_chunked_prefill: false

kv_cache_config:
  free_gpu_memory_fraction: 0.85

Inside the container

Set environment variables (with defaults):
export MODEL_PATH=${MODEL_PATH:-"Qwen/Qwen3-0.6B"}
export SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-"Qwen/Qwen3-0.6B"}
export AGG_ENGINE_ARGS=${AGG_ENGINE_ARGS:-"examples/backends/trtllm/engine_configs/qwen3/agg.yaml"}
Run the frontend:
python3 -m dynamo.frontend --router-mode kv --http-port 8080 >& tzuling_frontend.log &
Run the worker:
python3 -m dynamo.trtllm \
  --model-path "$MODEL_PATH" \
  --served-model-name "$SERVED_MODEL_NAME" \
  --extra-engine-args "$AGG_ENGINE_ARGS" \
  --max-seq-len 20 \
  --publish-events-and-metrics >& tzuling_backend.log &

Send a request with prompt length greater than max_seq_len:
curl -s localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen3-0.6B",
  "messages": [
    {
      "role": "developer",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "Hello! It’s a wonderful day. How are you today? Have you had breakfast?"
    }
  ],
  "stream": false,
  "max_tokens": 300
}'

The request now returns:

{"message":"The sum of prompt length (26.0), query length (0) should not exceed max_num_tokens (22)","type":"Internal Server Error","code":500}

without crashing the engine
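
The same check can be scripted. A rough sketch using requests against the frontend started above (assumes the endpoint and model name from the curl example; after the oversized request is rejected, a short follow-up request should still succeed, confirming the worker survived):

import requests

URL = "http://localhost:8080/v1/chat/completions"

# Oversized request: prompt plus max_tokens exceeds the worker's token budget,
# so the server should return an error body instead of shutting down.
bad = {
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Hello! It's a wonderful day. How are you today? Have you had breakfast?"}],
    "stream": False,
    "max_tokens": 300,
}
resp = requests.post(URL, json=bad)
print(resp.status_code, resp.text)  # error response mentioning max_num_tokens

# Short follow-up request: completes normally if the engine is still alive.
ok = {
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Hi"}],
    "stream": False,
    "max_tokens": 5,
}
resp = requests.post(URL, json=ok)
print(resp.status_code)  # expected: 200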

Fixes #9760

Summary by CodeRabbit

  • Refactor
    • Improved error handling consistency by updating exception types in input validation.


@tzulingk tzulingk requested a review from a team as a code owner December 6, 2025 06:53
@tzulingk tzulingk requested a review from hchings December 6, 2025 06:53
@coderabbitai
Contributor

coderabbitai bot commented Dec 6, 2025

📝 Walkthrough

Walkthrough

Import RequestError from tensorrt_llm.executor.utils and change the exception type raised in _check_arguments from ValueError to RequestError when prompt and query lengths exceed max_num_tokens under the PyTorch backend.

Changes

Error handling fix (tensorrt_llm/llmapi/llm.py): Added import of RequestError from tensorrt_llm.executor.utils; changed the exception type from ValueError to RequestError in the _check_arguments validation when the sum of prompt and query lengths exceeds max_num_tokens under the PyTorch backend.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

  • Verify that RequestError is the correct exception type for this validation failure
  • Confirm this is the only occurrence of this particular validation issue requiring a fix
  • Ensure downstream error handling in the executor pipeline properly catches and formats RequestError as an HTTP 400 response

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (4 passed)
  • Description check (✅ Passed): The description adequately explains the background, rationale, and testing steps, though the PR title format and some template sections are not fully followed.
  • Linked Issues check (✅ Passed): The PR directly addresses all objectives from issue #9760: changing ValueError to RequestError in _check_arguments() to prevent engine shutdown and enable proper HTTP error responses.
  • Out of Scope Changes check (✅ Passed): All changes are narrowly scoped to fixing the validation error handling in _check_arguments() as specified in issue #9760, with no extraneous modifications.
  • Title check (✅ Passed): The title clearly identifies the main change: using RequestError instead of ValueError for validation errors to prevent engine shutdown.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
tensorrt_llm/llmapi/llm.py (1)

655-657: Consider extracting error message if this pattern repeats.

The static analysis tool suggests avoiding long error messages inline (TRY003). While the current implementation is functional and the descriptive message is helpful, if similar validation errors are added in the future, consider extracting common message formatting into a helper function or exception class attribute.

This is a low-priority style suggestion and not blocking.
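
For illustration, one possible shape for that refactor (hypothetical helper, not part of this PR):

from tensorrt_llm.executor.utils import RequestError

def _token_budget_error(prompt_len: int, query_len: int, max_num_tokens: int) -> RequestError:
    # Hypothetical helper: keeps the long message out of the raise site (TRY003).
    return RequestError(
        f"The sum of prompt length ({prompt_len}), query length ({query_len}) "
        f"should not exceed max_num_tokens ({max_num_tokens})")

# The raise site in _check_arguments would then shrink to:
#     raise _token_budget_error(prompt_len, query_len, max_num_tokens)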

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8d2178d and ad73414.

📒 Files selected for processing (1)
  • tensorrt_llm/llmapi/llm.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Always maintain the namespace when importing in Python, even if only one class or function from a module is used (e.g., use from package.subpackage import foo and then foo.SomeClass() instead of from package.subpackage.foo import SomeClass)
Python filenames should use snake_case (e.g., some_file.py)
Python class names should use PascalCase (e.g., class SomeClass)
Python function and method names should use snake_case (e.g., def my_awesome_function():)
Python local variable names should use snake_case, with prefix k for variable names that start with a number (e.g., k_99th_percentile = ...)
Python global variables should use upper snake_case with prefix G (e.g., G_MY_GLOBAL = ...)
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...)
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Python comments should be reserved for code within a function, or interfaces that are local to a file
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with type and description (e.g., self.x = 5 followed by """<type>: Description of 'x'""" )
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except clause to the smallest set of specific errors possible instead of catching all exceptions
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible and use the else block to implement the logic

Files:

  • tensorrt_llm/llmapi/llm.py
**/*.{cpp,h,cu,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code files should contain an NVIDIA copyright header that includes the current year at the top

Files:

  • tensorrt_llm/llmapi/llm.py
🧬 Code graph analysis (1)
tensorrt_llm/llmapi/llm.py (1)
tensorrt_llm/executor/utils.py (2)
  • RequestError (76-77)
  • create_mpi_comm_session (48-65)
🪛 Ruff (0.14.7)
tensorrt_llm/llmapi/llm.py

655-657: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (2)
tensorrt_llm/llmapi/llm.py (2)

33-34: LGTM: Import correctly added for RequestError.

The import is properly added to the existing import statement from tensorrt_llm.executor.utils, maintaining the namespace as required by coding guidelines.


655-657: Exception type change correctly prevents engine shutdown for PyTorch backend.

The change from ValueError to RequestError aligns with the PR objectives and ensures validation failures return proper HTTP error responses instead of causing engine shutdown. The error message is descriptive and includes relevant values.

Verify whether the TRT backend validation at line 672 and lines 678-709 should also use RequestError for consistency with the PyTorch backend error handling strategy. If ValueError is intentionally retained for the TRT backend due to different error handling patterns, document this design decision.

@svc-trtllm-gh-bot svc-trtllm-gh-bot added the "Community want to contribute" label (PRs initiated from Community) Dec 6, 2025
@karljang karljang changed the title fix: Use RequestError for validation errors to prevent engine shutdown [#9760][fix] Use RequestError for validation errors to prevent engine shutdown Dec 6, 2025
@karljang
Collaborator

karljang commented Dec 6, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #27182 [ run ] triggered by Bot. Commit: ad73414

@tensorrt-cicd
Collaborator

PR_Github #27182 [ run ] completed with state SUCCESS. Commit: ad73414
/LLM/main/L0_MergeRequest_PR pipeline #20744 completed with status: 'FAILURE'

@karljang
Collaborator

karljang commented Dec 6, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #27189 [ run ] triggered by Bot. Commit: ad73414

@tensorrt-cicd
Collaborator

PR_Github #27189 [ run ] completed with state SUCCESS. Commit: ad73414
/LLM/main/L0_MergeRequest_PR pipeline #20751 completed with status: 'FAILURE'

@karljang
Collaborator

karljang commented Dec 7, 2025

@tzulingk ,
Thanks for the contribution!
Could you also update TestLlmError.test_max_num_token_check accordingly?
https://github.com/NVIDIA/TensorRT-LLM/blob/main/tests/unittest/llmapi/test_llm_pytorch.py#L800C1-L809C32

@karljang
Collaborator

karljang commented Dec 7, 2025

Additionally, to address the thread leak failure shown in the Jenkins log, could you add some extra guards like the following:

def test_max_num_token_check(self):
    """ LLM should raise error when got prompt length exceed the valid range. """
    with LLM(llama_model_path,
             kv_cache_config=global_kvcache_config,
             max_num_tokens=100) as llm:
        
        with pytest.raises(RequestError,
                           match="should not exceed max_num_tokens"):
            ids = [random.randint(10, 100) for _ in range(101)]
            llm.generate([ids])

or

def test_max_num_token_check(self):
    """ LLM should raise error when got prompt length exceed the valid range. """
    llm = LLM(llama_model_path,
              kv_cache_config=global_kvcache_config,
              max_num_tokens=100)
    
    try:
        with pytest.raises(RequestError,  # ← Changed from ValueError
                           match="should not exceed max_num_tokens"):
            ids = [random.randint(10, 100) for _ in range(101)]
            llm.generate([ids])
    finally:
        llm.shutdown()  # ← Added cleanup

@karljang
Collaborator

karljang commented Dec 9, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #27562 [ run ] triggered by Bot. Commit: e55afdd

@tensorrt-cicd
Collaborator

PR_Github #27562 [ run ] completed with state FAILURE. Commit: e55afdd
LLM/main/L0_MergeRequest_PR #21034 (Blue Ocean) completed with status: ABORTED

@karljang
Collaborator

karljang commented Dec 9, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #27573 [ run ] triggered by Bot. Commit: e55afdd

@karljang
Collaborator

/bot run

@tensorrt-cicd
Collaborator

PR_Github #27758 [ run ] triggered by Bot. Commit: e55afdd

@karljang
Collaborator

@hchings , Could you please review or assign this PR? CI is still failing, but the errors don’t seem to be related to this PR.

@tensorrt-cicd
Collaborator

PR_Github #27758 [ run ] completed with state SUCCESS. Commit: e55afdd
/LLM/main/L0_MergeRequest_PR pipeline #21185 completed with status: 'FAILURE'

@karljang
Collaborator

@tzulingk
Another failed case: https://prod.blsm.nvidia.com/sw-tensorrt-top-1/job/LLM/job/main/job/L0_MergeRequest_PR/21185/testReport/junit/DGX_H100-2_GPUs-PyTorch-Ray-1/test_unittests/test_unittests_v2_unittest_llmapi_test_llm_multi_gpu_pytorch_py__m__gpu2__/

unittest/DGX_H100-2_GPUs-PyTorch-Ray-1/unittest/llmapi/test_llm_multi_gpu_pytorch.py::test_llm_capture_request_error FAILED                                                                      [  9%]

I don't know the history, but it seems like the team intended to handle ValueError and RequestError differently. If that’s the case, it might be more effective to catch ‘ValueError’ from trtllm-serve or another source.
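
Something along these lines at the serving layer, for example (rough sketch with hypothetical handler and function names, not the actual trtllm-serve code):

from fastapi import FastAPI, HTTPException

app = FastAPI()

async def run_generation(request: dict) -> dict:
    # Stand-in for the actual call into the LLM API; it would raise ValueError
    # on invalid input such as prompt length > max_num_tokens.
    raise ValueError("prompt length should not exceed max_num_tokens")

@app.post("/v1/chat/completions")
async def chat_completions(request: dict) -> dict:
    try:
        return await run_generation(request)
    except ValueError as e:
        # Invalid user input becomes HTTP 400 without tearing down the engine.
        raise HTTPException(status_code=400, detail=str(e))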

@tzulingk , @hchings , Could you both share your thoughts on this?

@tzulingk
Author

/bot run

1 similar comment
@karljang
Collaborator

/bot run

@tensorrt-cicd
Collaborator

PR_Github #28336 [ run ] triggered by Bot. Commit: 657bc26

@tensorrt-cicd
Collaborator

PR_Github #28336 [ run ] completed with state SUCCESS. Commit: 657bc26
/LLM/main/L0_MergeRequest_PR pipeline #21677 completed with status: 'FAILURE'

@pcastonguay pcastonguay requested a review from LinPoly December 15, 2025 16:06
@karljang
Collaborator

/bot run

@tensorrt-cicd
Collaborator

PR_Github #28516 [ run ] triggered by Bot. Commit: 5e8af75

@tensorrt-cicd
Collaborator

PR_Github #28516 [ run ] completed with state SUCCESS. Commit: 5e8af75
/LLM/main/L0_MergeRequest_PR pipeline #21838 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Collaborator

@LinPoly LinPoly left a comment


I think it is reasonable to further distinguish illegal requests from other kinds of exceptions. Thanks for improving this.

@rmccorm4
Contributor

Hi @LinPoly @karljang, is anything else needed to get this PR merged? And which public PyPI wheel RC will this change end up in?

@pcastonguay
Collaborator

/bot run --disable-fail-fast

1 similar comment
@pcastonguay
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #28816 [ run ] triggered by Bot. Commit: 5e8af75

@pcastonguay
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #30763 [ run ] triggered by Bot. Commit: 5e8af75

@tensorrt-cicd
Collaborator

PR_Github #30763 [ run ] completed with state SUCCESS. Commit: 5e8af75
/LLM/main/L0_MergeRequest_PR pipeline #23746 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@LinPoly
Collaborator

LinPoly commented Jan 7, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #30878 [ run ] triggered by Bot. Commit: 5e8af75

@tensorrt-cicd
Collaborator

PR_Github #30878 [ run ] completed with state SUCCESS. Commit: 5e8af75
/LLM/main/L0_MergeRequest_PR pipeline #23842 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Successfully merging this pull request may close these issues.

[Bug]: trtllm crashes when prompt > max_num_tokens
