[None][chore] Weekly mass integration of release/1.1 #8918

mikeiovine · 2025-11-04T19:20:34Z

Description

Cherry pick the commits before the major dependency update.

Currently excluded due to a high number of conflicts

Excluded CUDA 13 runtime dependencies as well: #8858. These are related to the DLFW upgrade, which is being done separately for now.

Also excluded this WAR for a transient CI issue: #8616

Test Coverage

N/A

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option is always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only [pytorch, cpp, tensorrt, triton] are supported. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

…B-FP8 (NVIDIA#8429) Signed-off-by: Junyi Xu <[email protected]> Signed-off-by: Mike Iovine <[email protected]>

… info with CI failure. (NVIDIA#8440) Signed-off-by: Simeng Liu <[email protected]> Signed-off-by: Mike Iovine <[email protected]>

…e. (NVIDIA#8316) Signed-off-by: nv-guomingz <[email protected]> Signed-off-by: Mike Iovine <[email protected]>

NVIDIA#8494) Signed-off-by: Jin Li <[email protected]> Signed-off-by: Mike Iovine <[email protected]>

NVIDIA#8500) Signed-off-by: Bo Deng <[email protected]> Signed-off-by: Mike Iovine <[email protected]>

Signed-off-by: Pengyun Lin <[email protected]> Signed-off-by: Mike Iovine <[email protected]>

…DIA#8582) Signed-off-by: Yan Chunwei <[email protected]> Signed-off-by: Mike Iovine <[email protected]>

…x hang issue (NVIDIA#8519) Signed-off-by: Lizhi Zhou <[email protected]> Signed-off-by: Mike Iovine <[email protected]>

…acy test result (NVIDIA#8609) Signed-off-by: Lizhi Zhou <[email protected]> Signed-off-by: Mike Iovine <[email protected]>

…backend (NVIDIA#8611) Signed-off-by: Jie Li <[email protected]> Signed-off-by: Mike Iovine <[email protected]>

Signed-off-by: Ivy Zhang <[email protected]> Co-authored-by: Larry Xu <[email protected]> Signed-off-by: Mike Iovine <[email protected]>

…opper (NVIDIA#8612) Signed-off-by: Shiyu Li <[email protected]> Signed-off-by: Mike Iovine <[email protected]>

…VFP4 models. (NVIDIA#8679) Signed-off-by: Yukun He <[email protected]> Signed-off-by: Mike Iovine <[email protected]>

coderabbitai · 2025-11-04T19:22:58Z

Walkthrough

This PR contains coordinated changes across C++ KV cache management, CUDA kernel synchronization, PyTorch model implementations, distributed execution infrastructure, and test configurations. Key changes include safer sequence access patterns, stream management refactoring, EXAONE4 model enhancements, termination handler redesign for disaggregated processing, and test infrastructure updates.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| C++ KV Cache Manager Safety `cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp` | Added guard lookups with mutex locks to safely check sequence presence in mSequences before access; replaced direct map access with helper function to prevent undefined behavior. |
| CUDA Kernel Synchronization `cpp/tensorrt_llm/kernels/communicationKernels/mnnvlTwoShotAllreduceKernels.cu` | Expanded CUDA architecture gate from SM 900+ to SM 700+; switched inline assembly from `red.global.gpu.add.u32` to `red.release.global.gpu.add.u32` with atomicAdd fallback for older architectures. |
| PyTorch CUDA Graph Workspace `tensorrt_llm/_torch/attention_backend/trtllm.py` | Added `cuda_graph_workspace` field to TrtllmAttentionMetadata; initialize in `_post_init_with_buffers` and conditionally select workspace in forward based on CUDA graph usage. |
| PyTorch Backend Stream Management `tensorrt_llm/_torch/compilation/backend.py` | Replaced `aux_streams` list with single `num_streams` count attribute; updated multi-stream scheduling and event calculations to use stream-count-based approach. |
| Distributed Communication API `tensorrt_llm/_torch/distributed/communicator.py` | Added `root` parameter to `MPIDist.pp_gather` signature with default value 0; forwards to underlying `pp_comm.gather` call. |
| EXAONE4 Model Enhancements `tensorrt_llm/_torch/models/modeling_exaone4.py` | Added QuantAlgo import and `disable_deep_gemm` flag to Exaone4Attention and Exaone4DecoderLayer; propagates quantization-based optimization control to both attention and MLP paths. |
| LLAMA Quantization Fusion Logic `tensorrt_llm/_torch/models/modeling_llama.py` | Replaced next_attn-based NVFP4 guards with fusion operation type checks (`post_feed_forward_fusion_op` against `RESIDUAL_RMS_NORM_QUANT_NVFP4`); affects unpacking of allreduce output in post-MLP fusion path. |
| PyTorch Executor Infrastructure `tensorrt_llm/_torch/pyexecutor/model_engine.py` | Introduced `backend_num_streams` attribute to PyTorchModelEngine; replaces reference to `_torch_compile_backend.aux_streams` for model extra attributes. |
| Disaggregated PP Termination Refactoring `tensorrt_llm/_torch/pyexecutor/py_executor.py` | Redesigned DisaggPPTerminationHandler constructor (now accepts `dist` and `terminator_func` callback); replaced `sync` method with new `terminate_pending_requests` using ring-protocol coordination; updated all invocation sites in PyExecutor. |
| Executor Max Tokens Guard `tensorrt_llm/executor/base_worker.py` | Added `default_max_tokens > 0` condition to max tokens deduction logic; prevents clamping to non-positive defaults and preserves user-provided max_tokens in edge cases. |
| IPC Address Generation `tensorrt_llm/llmapi/trtllm-llmapi-launch` | Replaced free TCP port-based IPC with UUID-based `ipc://<tempdir>/rpc_test_<uuid>` address; moved behind MPI rank check to run only on rank 0. |
| Disaggregated Serving Test Infrastructure `tests/integration/defs/accuracy/test_disaggregated_serving.py` | Added per-server log redirection with `output_<server_name>_<index>.log` files; split multi_popen into three calls for ctx, gen, disagg servers; introduced health-check loop querying `/health` endpoint; added `enable_block_reuse: True` to kv_cache_config; removed `skip_pre_hopper` decorator. |
| LLM API PyTorch Test Rename `tests/integration/defs/accuracy/test_llm_api_pytorch.py` | Renamed `test_fp8_tp2pp2` to `test_fp4_tp2pp2`; updated model path to FP4 variant; adjusted quantization assertions to expect `QuantAlgo.NVFP4`. |
| CUDA Cache Cleanup `tests/integration/defs/conftest.py` | Added `gc.collect()` call before `torch.cuda.empty_cache()` in torch_empty_cache fixture. |
| Disaggregated Single GPU Tests `tests/integration/defs/disaggregated/test_disaggregated_single_gpu.py` | Added `free_gpu_memory_fraction=0.4` to KvCacheConfig in spec-dec batch tests. |
| Serve Test Backend Update `tests/integration/defs/examples/serve/test_serve.py` | Changed decorator from `@skip_pre_hopper` to `@skip_no_hopper` for `test_extra_llm_api_options`; added MOE backend FP8 blockscale documentation. |
| E2E Test Configuration `tests/integration/defs/test_e2e.py` | Added dynamic SM version checks via `get_sm_version()`; branched MoE backend selection and KV cache fraction on Blackwell detection; updated GPU count for Llama3.1-70B-BF16 from 2 to 8; adjusted performance mapping values. |
| Test Lists & Test DB `tests/integration/test_lists/qa/llm_function_core.txt`, `llm_function_core_sanity.txt`, `llm_function_nim.txt`, `l0_b200.yml`, `l0_dgx_b200.yml`, `l0_h100.yml`, `waives.txt` | Updated test selections: replaced FP8 TP2PP2 with FP4 TP2PP2; changed multi-GPU model references from 2-GPU to 8-GPU variants; added disaggregated and EXAONE4 test entries; removed waiver exemptions. |
| Unit Test Infrastructure & Skip Markers `tests/unittest/_torch/auto_deploy/_utils_test/_model_test_utils.py`, `_torch/modeling/test_modeling_exaone4.py`, `_torch/multi_gpu/test_mnnvl_allreduce.py`, `_torch/multi_gpu_modeling/test_deepseek.py`, `_torch/thop/parallel/test_fp8_rowwise_linear.py`, `_torch/thop/serial/test_moe.py`, `llmapi/test_llm.py` | Added EXAONE4 FP8 quantization config and `test_llm_load` test; removed blank lines; removed tp_size==4 skip in Deepseek; replaced `@skip_pre_hopper` with `@skip_blackwell`; added pytest.skip calls for MoE FP4 tests due to NV bugs; removed skip marker on workspace test. |

Sequence Diagram(s)

sequenceDiagram
    participant PyEx as PyExecutor
    participant DTermH as DisaggPPTerminationHandler
    participant Dist as Distributor
    participant App as Application
    Note over PyEx,App: Old Flow: Synchronous Per-Microbatch
    PyEx->>DTermH: sync(microbatch_idx)
    DTermH->>DTermH: local_termination check
    DTermH->>DTermH: cleanup() & await handles
    PyEx->>App: Process terminated requests
    
    Note over PyEx,App: New Flow: Ring Protocol Per-Iteration
    loop Each Executor Iteration
        PyEx->>DTermH: terminate_pending_requests()
        DTermH->>Dist: Send new_term_state (ready/term data)
        DTermH->>Dist: Recv new_term_state from neighbor
        DTermH->>DTermH: Decide terminations locally
        DTermH->>App: terminator_func() for each finalized request
        DTermH->>DTermH: Increment _terminating_iteration
    end

sequenceDiagram
    participant Test as Test Process
    participant MultiPopen as multi_popen
    participant Server as Server Process
    participant LogFile as Log File
    
    Note over Test,LogFile: Disaggregated Serving Startup
    Test->>MultiPopen: Start ctx_processes (server_name="ctx")
    activate MultiPopen
    MultiPopen->>Server: Launch with log redirect
    Server->>LogFile: stdout/stderr → output_ctx_0.log
    MultiPopen-->>Test: ctx process handle
    deactivate MultiPopen
    
    Test->>MultiPopen: Start gen_processes (server_name="gen")
    activate MultiPopen
    MultiPopen->>Server: Launch with log redirect
    Server->>LogFile: stdout/stderr → output_gen_0.log
    MultiPopen-->>Test: gen process handle
    deactivate MultiPopen
    
    Test->>MultiPopen: Start disagg_processes (server_name="disagg")
    activate MultiPopen
    MultiPopen->>Server: Launch with log redirect
    Server->>LogFile: stdout/stderr → output_disagg_0.log
    MultiPopen-->>Test: disagg process handle
    deactivate MultiPopen
    
    Test->>Test: Health check loop
    loop Until 200 or timeout
        Test->>Server: GET http://localhost:8000/health
        Server-->>Test: 200 OK (ready)
    end
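
The health-check loop at the end of the diagram above can be sketched roughly as follows. This is a minimal illustration, not the test's actual code: `http_status`, `wait_until_healthy`, their parameters, and the injectable `probe` callable are hypothetical names. Only the `/health` endpoint and the "until 200 or timeout" behavior come from the walkthrough.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Optional


def http_status(url: str, timeout_s: float = 2.0) -> Optional[int]:
    """Return the HTTP status code for a GET on `url`, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None  # the server is not accepting connections yet


def wait_until_healthy(
    url: str = "http://localhost:8000/health",
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
    probe: Callable[[str], Optional[int]] = http_status,
) -> bool:
    """Poll `url` until the probe reports 200 or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url) == 200:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Passing the probe as a parameter is a design choice for this sketch only: it lets the polling logic be exercised without a running server.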

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Areas requiring extra attention:

DisaggPPTerminationHandler refactoring (tensorrt_llm/_torch/pyexecutor/py_executor.py): Major API redesign shifting from synchronous per-microbatch to ring-protocol coordination; constructor signature changed significantly; internal state management restructured.
LLAMA post-feed-forward fusion logic (tensorrt_llm/_torch/models/modeling_llama.py): Behavioral change from next_attn-based guards to fusion-operation-type checks; impacts unpacking logic in critical forward path.
Disaggregated serving test restructuring (tests/integration/defs/accuracy/test_disaggregated_serving.py): Complex multi-process coordination with three separate server groups, log redirection, and health-check polling; significant control flow reorganization.
KV cache manager safety guards (cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp): Mutex-protected sequence lookups; verify locking correctness and thread-safety implications.
Stream management refactoring (tensorrt_llm/_torch/compilation/backend.py, tensorrt_llm/_torch/pyexecutor/model_engine.py): Transition from aux_streams list to num_streams count; verify all references updated consistently.

Possibly related PRs

feat: Add support for disaggregation with pp with pytorch backend #6369: Related KV cache management changes in kvCacheManager with sequence access pattern improvements.

Suggested reviewers

liji-nv
yuxianq

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | PR description is minimal and lacks required template sections. Missing detailed explanation of changes, proper ticket/issue reference format, and comprehensive test coverage details. | Add a proper PR title following [JIRA/NVBugs/GitHub issue][type] format, expand the Description section with detailed explanation of changes, and provide specific test coverage information beyond 'N/A'. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title '[None][chore] Weekly mass integration of release/1.1' is directly related to the changeset, which contains a mass integration of multiple commits from the release/1.1 branch into main. |

coderabbitai

Actionable comments posted: 8

🧹 Nitpick comments (12)

cpp/tensorrt_llm/kernels/communicationKernels/mnnvlTwoShotAllreduceKernels.cu (1)

113-126: Consider adding documentation for memory ordering semantics.

While the implementation is correct, the cta_arrive() function would benefit from brief comments explaining the memory ordering choices across different architectures, particularly:

The role of .release semantics in ensuring visibility of prior memory operations

Why red.async.release is used for sm_100+ vs. synchronous red.release for sm_70-99

The synchronization protocol assumptions (paired with the volatile load spin-wait in wait_and_update())

This would help future maintainers understand the correctness requirements and architectural tradeoffs.

tensorrt_llm/_torch/models/modeling_exaone4.py (1)

134-140: The workaround logic is correct, though consider simplifying quant_config access.

The disable_deep_gemm computation correctly identifies FP8_BLOCK_SCALES quantization to work around the documented memory access issue. The logic is sound.

For consistency, since Line 131 directly accesses model_config.quant_config, you could simplify lines 135-139 to use the same pattern instead of getattr:
 disable_deep_gemm = False
-quant_config = getattr(model_config, "quant_config", None)
-if quant_config is not None:
+if model_config.quant_config is not None:
     # EXAONE4 fp8 has an illegal memory access issue with deep_gemm.
-    disable_deep_gemm = getattr(quant_config, "quant_algo",
-                                None) == QuantAlgo.FP8_BLOCK_SCALES
+    disable_deep_gemm = model_config.quant_config.quant_algo == QuantAlgo.FP8_BLOCK_SCALES

tests/unittest/_torch/thop/serial/test_moe.py (1)

1152-1152: Consider using a class-level skip decorator for better maintainability.

Both test_autotune (line 1065) and test_no_autotune (line 1152) methods in TestMoeFp4 are skipped with the same bug reference. Consider applying @pytest.mark.skip("https://nvbugs/5575841") at the class level instead to avoid duplication and improve maintainability.

Apply this diff to use a class-level skip:
+@pytest.mark.skip("https://nvbugs/5575841")
 class TestMoeFp4:
     """
     Test the NVFP4 MoE. As autotune also covers the actual MoE, we can run the test
     with autotune by default. We add a separate test for no autotune to ensure that
     the default tactic selection works. This reduces unnecessary test runs for CI
     """
And remove the individual skips from lines 1065 and 1152.

tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)

655-655: Rename aligns with FP4 path; consider naming clarity.

Switch to FP4 is correct. Optional: rename to test_nvfp4_tp2pp2 for consistency with QuantAlgo.NVFP4 to avoid ambiguity between generic “FP4” and “NVFP4”.

tests/unittest/_torch/modeling/test_modeling_exaone4.py (1)

57-59: Runtime risk: increased layers from 4→32.

Intent (trigger deep_gemm path) is clear, but 32 layers at hidden_size=5120 can inflate test time/mem. Consider lowering to the minimal depth that still reproduces the issue, or guard with a tighter input size/timeout.

Please confirm CI runtime for this test remains within budget.

tests/integration/test_lists/test-db/l0_b200.yml (1)

81-84: Potential duplicate/contradictory singlegpu entries.

You add:

singlegpu with -k "not test_trtllm_bench_backend_comparison"

and singlegpu (unfiltered)

The unfiltered line will still run the excluded test, possibly duplicating the rest. Clarify intent; keep only one or split into mutually exclusive subsets.

tensorrt_llm/llmapi/trtllm-llmapi-launch (1)

27-29: Update function and variable names to reflect IPC usage.

The function name export_free_tcp_addr_for_spawn_proxy_process and variable name free_port suggest TCP usage, but the implementation now generates an IPC address using Unix domain sockets. This naming mismatch can mislead maintainers.

Consider renaming:
-function export_free_tcp_addr_for_spawn_proxy_process {
+function export_ipc_addr_for_spawn_proxy_process {
     # Generate unique IPC address without importing tensorrt_llm to avoid MPI initialization conflicts
-    local free_port=$(python3 -c "import uuid, tempfile, os; print(f'ipc://{os.path.join(tempfile.gettempdir(), \"rpc_test_\" + str(uuid.uuid4()))}')")
-    export TLLM_SPAWN_PROXY_PROCESS_IPC_ADDR=$free_port
+    local ipc_addr=$(python3 -c "import uuid, tempfile, os; print(f'ipc://{os.path.join(tempfile.gettempdir(), \"rpc_test_\" + str(uuid.uuid4()))}')")
+    export TLLM_SPAWN_PROXY_PROCESS_IPC_ADDR=$ipc_addr
     log_stderr "TLLM_SPAWN_PROXY_PROCESS_IPC_ADDR: $TLLM_SPAWN_PROXY_PROCESS_IPC_ADDR"
And update the call site at line 43:
-    export_free_tcp_addr_for_spawn_proxy_process
+    export_ipc_addr_for_spawn_proxy_process
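
For readability, the Python one-liner embedded in the shell function above can be unpacked into a small function. This is a sketch only: `make_ipc_addr` and its `prefix` parameter are hypothetical names, while the `ipc://` scheme, the temp-directory location, and the `rpc_test_` prefix are taken from the diff.

```python
import os
import tempfile
import uuid


def make_ipc_addr(prefix: str = "rpc_test_") -> str:
    """Build a unique ipc:// address under the system temp directory."""
    socket_path = os.path.join(tempfile.gettempdir(), prefix + str(uuid.uuid4()))
    return f"ipc://{socket_path}"


if __name__ == "__main__":
    # Prints something like ipc://<tempdir>/rpc_test_<uuid>
    print(make_ipc_addr())
```

Uniqueness here comes from the random UUID rather than from probing for a free port, which is why the `free_port` name no longer describes what the value holds.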

tests/integration/defs/conftest.py (1)

2677-2685: Early GC before CUDA cache clear is reasonable.

Looks good and may reduce OOMs by dropping Python refs before empty_cache().

If test time is a concern, consider gating the pre-empty_cache gc.collect() behind an env flag (e.g., TLLM_GC_BEFORE_EMPTY_CACHE=1).
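
One way the suggested gating could look is sketched below. The assumptions are loud ones: `TLLM_GC_BEFORE_EMPTY_CACHE` is only the example name proposed in this comment, not an existing variable, and `clear_cache` is a hypothetical helper. The cache-release call is passed in as a callable so the sketch runs without PyTorch; in the fixture it would presumably be `torch.cuda.empty_cache`.

```python
import gc
import os
from typing import Callable


def clear_cache(empty_cache: Callable[[], None],
                env_flag: str = "TLLM_GC_BEFORE_EMPTY_CACHE") -> bool:
    """Optionally run gc.collect() before releasing cached memory.

    Returns True if the garbage collector was run.
    """
    ran_gc = os.environ.get(env_flag, "0") == "1"
    if ran_gc:
        # Drop unreachable Python objects that may still hold GPU tensors.
        gc.collect()
    empty_cache()
    return ran_gc
```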

tests/integration/defs/test_e2e.py (1)

2208-2208: Make Blackwell check explicit.

Use get_sm_version() >= 100 instead of > 90 for clarity and future-proofing.
-    is_blackwell = get_sm_version() > 90
+    is_blackwell = get_sm_version() >= 100

cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1)

2145-2146: Thread-safety of getNumTokens read.

Using getSequence(...) improves access, but getNumTokens() is read outside mSequencesMtx. If GenerationRequest isn’t internally thread-safe for reads, prefer fetching the count while holding the lock or add a small accessor that returns the value under lock.

tensorrt_llm/_torch/distributed/communicator.py (1)

408-410: Unify pp_gather parameter naming with TorchDist; consider large-object safety.

MPIDist uses root while TorchDist uses dst. This can break kwargs at call sites. Either accept both or standardize.

Also, PP gather still uses plain MPI gather; if objects can be large, consider safe_gather() like tp_gather.

Example change outside this hunk (TorchDist) to accept both:
def pp_gather(self, obj, dst=0, root=None):
    if root is not None:
        dst = root
    # rest unchanged...

tests/integration/defs/accuracy/test_disaggregated_serving.py (1)

189-205: Ensure redirected log files are closed on failures

When enable_redirect_log is true and popen(...) raises before we reach the manual cleanup loop, the opened log file handle is leaked because it never gets registered with the ExitStack. Let the stack manage both the file and the subprocess context so the descriptor is closed even on startup errors.

Apply this diff to manage the resources via the stack:
-        processes = []
-        log_files = []
-        try:
-            for i, (env, args) in enumerate(server_configs):
-                if enable_redirect_log:
-                    f = open(f"output_{server_name}_{i}.log", "w+")
-                    env["TLLM_LOG_LEVEL"] = "INFO"
-                    proc = popen(args, env=env, stdout=f, stderr=f)
-                    log_files.append(f)
-                else:
-                    proc = popen(args, env=env)
-                processes.append(proc)
-
-            with contextlib.ExitStack() as stack:
-                opened_processes = [
-                    stack.enter_context(proc) for proc in processes
-                ]
-                yield opened_processes
-            for f in log_files:
-                f.close()
+        try:
+            with contextlib.ExitStack() as stack:
+                opened_processes = []
+                for i, (env, args) in enumerate(server_configs):
+                    if enable_redirect_log:
+                        log_file = stack.enter_context(
+                            open(f"output_{server_name}_{i}.log", "w+"))
+                        env["TLLM_LOG_LEVEL"] = "INFO"
+                        proc_ctx = popen(
+                            args, env=env, stdout=log_file, stderr=log_file)
+                    else:
+                        proc_ctx = popen(args, env=env)
+                    opened_processes.append(stack.enter_context(proc_ctx))
+                yield opened_processes
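
The resource-management point behind this diff can be shown in isolation. The following is a self-contained sketch, not the test's code: `launch_all`, `failing_start`, and the `start` callable are hypothetical stand-ins for the `popen(...)` call, used only to show that files registered with `contextlib.ExitStack` are closed when a later step raises.

```python
import contextlib
import os
import tempfile


def launch_all(paths, start):
    """Open one log file per path and call start(log_file) for each.

    Every file is registered with the ExitStack as soon as it is opened,
    so it is closed even if a later start() call raises.
    """
    opened = []
    try:
        with contextlib.ExitStack() as stack:
            for path in paths:
                log_file = stack.enter_context(open(path, "w+"))
                opened.append(log_file)
                start(log_file)
    except RuntimeError:
        pass  # swallowed only so the demo can inspect the files afterwards
    return opened


def failing_start(log_file):
    # Hypothetical stand-in for a process launch that fails on startup.
    raise RuntimeError("startup failed")


with tempfile.TemporaryDirectory() as tmp:
    files = launch_all([os.path.join(tmp, "output_ctx_0.log")], failing_start)
    all_closed = all(f.closed for f in files)
```

In the original loop, a file opened just before a failing `popen(...)` is never appended to `log_files`, so nothing closes it; registering it with the stack at open time removes that gap.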

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 70e4d72 and 6925f4b.

📒 Files selected for processing (32)

cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (2 hunks)
cpp/tensorrt_llm/kernels/communicationKernels/mnnvlTwoShotAllreduceKernels.cu (1 hunks)
jenkins/L0_Test.groovy (1 hunks)
tensorrt_llm/_torch/attention_backend/trtllm.py (3 hunks)
tensorrt_llm/_torch/compilation/backend.py (3 hunks)
tensorrt_llm/_torch/distributed/communicator.py (1 hunks)
tensorrt_llm/_torch/models/modeling_exaone4.py (5 hunks)
tensorrt_llm/_torch/models/modeling_llama.py (2 hunks)
tensorrt_llm/_torch/pyexecutor/model_engine.py (2 hunks)
tensorrt_llm/_torch/pyexecutor/py_executor.py (4 hunks)
tensorrt_llm/executor/base_worker.py (1 hunks)
tensorrt_llm/llmapi/trtllm-llmapi-launch (2 hunks)
tests/integration/defs/accuracy/test_disaggregated_serving.py (4 hunks)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (1 hunks)
tests/integration/defs/conftest.py (1 hunks)
tests/integration/defs/disaggregated/test_disaggregated_single_gpu.py (1 hunks)
tests/integration/defs/examples/serve/test_serve.py (2 hunks)
tests/integration/defs/test_e2e.py (6 hunks)
tests/integration/test_lists/qa/llm_function_core.txt (2 hunks)
tests/integration/test_lists/qa/llm_function_core_sanity.txt (2 hunks)
tests/integration/test_lists/qa/llm_function_nim.txt (1 hunks)
tests/integration/test_lists/test-db/l0_b200.yml (1 hunks)
tests/integration/test_lists/test-db/l0_dgx_b200.yml (1 hunks)
tests/integration/test_lists/test-db/l0_h100.yml (1 hunks)
tests/integration/test_lists/waives.txt (0 hunks)
tests/unittest/_torch/auto_deploy/_utils_test/_model_test_utils.py (1 hunks)
tests/unittest/_torch/modeling/test_modeling_exaone4.py (4 hunks)
tests/unittest/_torch/multi_gpu/test_mnnvl_allreduce.py (0 hunks)
tests/unittest/_torch/multi_gpu_modeling/test_deepseek.py (0 hunks)
tests/unittest/_torch/thop/parallel/test_fp8_rowwise_linear.py (1 hunks)
tests/unittest/_torch/thop/serial/test_moe.py (3 hunks)
tests/unittest/llmapi/test_llm.py (0 hunks)

💤 Files with no reviewable changes (4)

tests/unittest/llmapi/test_llm.py
tests/integration/test_lists/waives.txt
tests/unittest/_torch/multi_gpu/test_mnnvl_allreduce.py
tests/unittest/_torch/multi_gpu_modeling/test_deepseek.py

🧰 Additional context used

📓 Path-based instructions (6)

**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}