[None] [feat] Add test script and raster M for gather fc1 kernel #10429
Conversation
Signed-off-by: Zongfei Jing <[email protected]>
📝 Walkthrough

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
participant CustomOps as Custom Ops<br/>(cute_dsl_custom_ops)
participant KernelFactory as Kernel Factory<br/>(blockscaled_*_fusion)
participant Scheduler as Tile Scheduler<br/>(PersistentScheduler)
participant CUDAExec as CUDA Execution
rect rgb(240, 248, 255)
Note over CustomOps: Tactic Selection Phase
CustomOps->>CustomOps: get_valid_tactics()
Note over CustomOps: Returns 3-tuple:<br/>(mma_tiler_mn, cluster_shape_mn,<br/>raster_along_m)
CustomOps->>KernelFactory: forward(inputs, tactic=3tuple)
end
rect rgb(245, 250, 240)
Note over KernelFactory: Kernel Construction Phase
KernelFactory->>KernelFactory: Unpack tactic tuple
activate KernelFactory
Note over KernelFactory: Extract raster_along_m
KernelFactory->>KernelFactory: Build cache key<br/>(includes raster_along_m)
deactivate KernelFactory
end
rect rgb(250, 240, 245)
Note over Scheduler: Grid & Scheduler Init Phase
KernelFactory->>Scheduler: _compute_grid()<br/>(raster_along_m parameter)
activate Scheduler
Scheduler->>Scheduler: PersistentTileSchedulerParams<br/>.__init__(raster_along_m=True/False)
alt raster_along_m = True
Scheduler->>Scheduler: Column-major work decode<br/>(cluster_m, cluster_n first)
else raster_along_m = False
Scheduler->>Scheduler: Row-major work decode<br/>(cluster_n first)
end
Scheduler->>Scheduler: FastDivmod divisors<br/>(conditional on swizzle)
deactivate Scheduler
end
rect rgb(240, 245, 250)
Note over CUDAExec: Kernel Execution Phase
KernelFactory->>CUDAExec: Launch kernel<br/>(includes raster_along_m,<br/>PersistentScheduler config)
activate CUDAExec
CUDAExec->>CUDAExec: Work index decoding<br/>via _get_cluster_work_idx_with_fastdivmod
CUDAExec->>CUDAExec: Tile iteration &<br/>barrier sync<br/>(branches by raster_along_m)
deactivate CUDAExec
end
```
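To make the `alt` branch above concrete, here is a host-side Python analogy of the two decode orders; the real kernel performs this decode on device with FastDivmod, so the function and parameter names below are illustrative only:

```python
from typing import Tuple

# Host-side analogy only: the kernel decodes on device via FastDivmod; this
# sketch just models the traversal order the scheduler chooses.
def decode_cluster_idx(linear_idx: int, clusters_m: int, clusters_n: int,
                       raster_along_m: bool) -> Tuple[int, int]:
    if raster_along_m:
        # Column-major: consecutive linear indices advance along M first.
        cluster_n, cluster_m = divmod(linear_idx, clusters_m)
    else:
        # Row-major: consecutive linear indices advance along N first.
        cluster_m, cluster_n = divmod(linear_idx, clusters_n)
    return cluster_m, cluster_n
```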
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Pre-merge checks and finishing touches
❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✨ Finishing touches
Actionable comments posted: 2
In @tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py:
- Around line 158-293: The hooked_get_cluster_work_idx_with_fastdivmod assigns
an unused variable and assumes FastDivmod divisors always exist; rename
work_iteration to _work_iteration in hooked_get_cluster_work_idx_with_fastdivmod
to silence the lint warning, and add an explicit guard at the start of
hooked_get_cluster_work_idx_with_fastdivmod that checks if self.params.batch_fdd
is None and raises a RuntimeError (e.g. "_get_cluster_work_idx_with_fastdivmod
requires swizzle_size == 1") so callers with swizzle_size > 1 fail fast; no
other logic changes needed in hooked_PersistentTileSchedulerParams_init other
than keeping the existing None assignments for the swizzled path.
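A minimal sketch of those two changes, with names taken from the comment text rather than verified against the kernel source (the `_decode_work` helper is hypothetical):

```python
# Sketch only: the real hook's signature and params object live in the kernel
# file; self.params.batch_fdd and the error message follow the comment above.
def hooked_get_cluster_work_idx_with_fastdivmod(self, *args, loc=None, ip=None):
    if self.params.batch_fdd is None:
        raise RuntimeError(
            "_get_cluster_work_idx_with_fastdivmod requires swizzle_size == 1")
    # Underscore-prefix the unused unpack target to silence the lint warning.
    _work_iteration, cluster_work = self._decode_work(*args)  # hypothetical helper
    return cluster_work
```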
In @tests/scripts/cute_dsl_kernels/run_blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py:
- Around line 508-531: The run function signature includes an unused **kwargs
and a misleading type for permuted_m; remove the **kwargs parameter from the
run(...) signature and update permuted_m: int = None to permuted_m:
Optional[int] = None, adding an import for Optional from typing at the top of
the file; keep the rest of the parameters and default behavior unchanged and
ensure there are no references to **kwargs elsewhere in the file.
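The resulting signature could look like the sketch below, where the surrounding parameters are placeholders for the script's real ones:

```python
from typing import Optional

def run(
    mnkl: tuple,                       # placeholder for the real parameters
    permuted_m: Optional[int] = None,  # was `permuted_m: int = None`
):                                     # trailing **kwargs removed
    ...
```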
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py (1)
1868-1891: Include `raster_along_m` in the kernel cache key to avoid mixing raster modes

`Sm100BlockScaledContiguousGatherGroupedGemmSwigluFusionRunner.forward` now varies behavior by `raster_along_m`, but `cache_key` does not include this flag. If both `raster_along_m=True` and `False` tactics are exercised in the same process, the first compiled kernel will be reused for the other mode, so tuning over both orientations will not actually change the scheduler behavior. Recommend extending the key:

```diff
-        cache_key = (self.scaling_vector_size, self.tile_size, self.top_k,
-                     mma_tiler_mn, cluster_shape_mn)
+        cache_key = (
+            self.scaling_vector_size,
+            self.tile_size,
+            self.top_k,
+            mma_tiler_mn,
+            cluster_shape_mn,
+            raster_along_m,
+        )
```

This keeps the cache coherent with the compiled configuration.
Also applies to: 2019-2038, 2028-2038
🧹 Nitpick comments (2)
tests/scripts/cute_dsl_kernels/run_blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py (2)
328-331: Clean up unused unpacked return values to satisfy Ruff and simplify the API

Several values returned from helpers are never used downstream (e.g. `token_id_mapping_torch` from `create_token_id_mapping_tensor`, and many `*_torch_cpu`/`*_torch_gpu`/`aligned_group_m_list`/`valid_m`/`token_id_mapping_cpu` outputs from `create_tensors` as unpacked in `run()` and `generate_tensors()`). To avoid RUF059 noise and keep the API focused, either:

- Stop returning these values if they're not needed, or
- Prefix the unused unpack targets with `_` at the call sites, e.g.:

```diff
-token_id_mapping_cpu, token_id_mapping, token_id_mapping_torch = create_token_id_mapping_tensor(...)
+token_id_mapping_cpu, token_id_mapping, _token_id_mapping_torch = create_token_id_mapping_tensor(...)
```

and similarly in the large tuple unpack from `create_tensors` and in `generate_tensors()`.

Also applies to: 446-505, 981-1012
1086-1142: Consider narrowing exception handling in `read_benchmark_file`

`read_benchmark_file` currently catches `FileNotFoundError` and then a bare `Exception`, re-wrapping as `argparse.ArgumentTypeError`. For a CLI helper this works, but it makes debugging unexpected parse bugs harder and trips tools (BLE001/TRY003/B904). If you want to keep tooling quiet and improve debuggability:

- Catch only the expected parse errors (e.g. `ValueError`, `OSError`), and
- Chain them explicitly:

```diff
-    except FileNotFoundError:
-        raise argparse.ArgumentTypeError(f"Benchmark file not found: {filepath}")
-    except Exception as e:
-        raise argparse.ArgumentTypeError(f"Error reading benchmark file: {e}")
+    except FileNotFoundError as err:
+        raise argparse.ArgumentTypeError(f"Benchmark file not found: {filepath}") from err
+    except ValueError as err:
+        raise argparse.ArgumentTypeError(f"Error reading benchmark file: {err}") from err
```

Optional, but it will align better with Ruff's recommendations.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py
tests/scripts/cute_dsl_kernels/run_blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces. Do not use tabs
Always maintain the namespace when importing in Python, even if only one class or function from a module is used
Python files should use snake_case naming: `some_file.py`
Python classes should use PascalCase naming: `class SomeClass`
Python functions and methods should use snake_case naming: `def my_awesome_function():`
Python local variables should use snake_case naming: `my_variable = ...`
Python variable names that start with a number should be prefixed with 'k': `k_99th_percentile = ...`
Python global variables should use upper snake_case with prefix 'G': `G_MY_GLOBAL = ...`
Python constants should use upper snake_case naming: `MY_CONSTANT = ...`
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Python comments should be reserved for code within a function, or interfaces that are local to a file
Use Google style docstrings in Python for classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with type and description
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except to the smallest set of errors possible
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible, using the else block for logic
Files:
tests/scripts/cute_dsl_kernels/run_blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py
tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py
**/*.{cpp,h,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the year of its latest meaningful modification
Files:
tests/scripts/cute_dsl_kernels/run_blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py
tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py
🧠 Learnings (4)
📚 Learning: 2025-12-12T10:07:31.564Z
Learnt from: lirundong
Repo: NVIDIA/TensorRT-LLM PR: 9725
File: tensorrt_llm/_torch/custom_ops/cuda_tile_custom_ops.py:110-178
Timestamp: 2025-12-12T10:07:31.564Z
Learning: In PyTorch custom operators registered with torch.library.custom_op, mutable operators that return None and specify mutates_args do not require a register_fake decorator. Mutation tracking is handled automatically without needing a FakeTensor kernel. This applies to Python custom op definitions in tensorrt_llm/_torch/custom_ops that use mutates_args and return None; verify you are not relying on register_fake in these cases.
Applied to files:
tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py
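As a concrete (if minimal) instance of this pattern, assuming a hypothetical `demo` namespace:

```python
import torch

# Mutable custom op: returns None and declares mutates_args, so mutation
# tracking works without a register_fake/FakeTensor kernel.
@torch.library.custom_op("demo::scale_inplace", mutates_args=("x",))
def scale_inplace(x: torch.Tensor, factor: float) -> None:
    x.mul_(factor)

# Usage: x = torch.ones(4); torch.ops.demo.scale_inplace(x, 2.0)
```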
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.
Applied to files:
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py
📚 Learning: 2025-08-21T21:48:35.135Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:399-417
Timestamp: 2025-08-21T21:48:35.135Z
Learning: CUTLASS extensions in TensorRT-LLM (located under cpp/tensorrt_llm/cutlass_extensions/) are designed to integrate with and extend functionality in the external CUTLASS repository. When analyzing these extensions, their consumers and functionality wiring may exist in the CUTLASS codebase rather than within TensorRT-LLM itself.
Applied to files:
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py
📚 Learning: 2025-08-08T05:10:38.906Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:0-0
Timestamp: 2025-08-08T05:10:38.906Z
Learning: The ScaledAccPerRowBiasPerColScaleScatter fusion in CUTLASS extensions (cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp) is specifically designed for per-column scaling factors only, so it uses a fixed Stride<_0,_1,int64_t> rather than conditional stride logic.
Applied to files:
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py
🧬 Code graph analysis (1)
tests/scripts/cute_dsl_kernels/run_blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py (1)
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py (3)
cvt_sf_MKL_to_M32x4xrm_K4xrk_L (3398-3409), cvt_sf_M32x4xrm_K4xrk_L_to_MKL (3413-3424), get_dtype_rcp_limits (3025-3041)
🪛 Ruff (0.14.10)
tests/scripts/cute_dsl_kernels/run_blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py
112-115: Avoid specifying long messages outside the exception class
(TRY003)
446-446: Unpacked variable token_id_mapping_torch is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
527-527: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
530-530: Unused function argument: kwargs
(ARG001)
581-581: Avoid specifying long messages outside the exception class
(TRY003)
599-603: Avoid specifying long messages outside the exception class
(TRY003)
623-623: Unpacked variable sfc_torch_cpu is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
631-631: Unpacked variable sfc_torch_gpu is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
632-632: Unpacked variable norm_const_torch_gpu is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
994-994: Unpacked variable a_torch_cpu is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
995-995: Unpacked variable b_torch_cpu is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
996-996: Unpacked variable c_torch_cpu is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
997-997: Unpacked variable sfa_torch_cpu is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
998-998: Unpacked variable sfb_torch_cpu is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
999-999: Unpacked variable sfc_torch_cpu is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
1000-1000: Unpacked variable norm_const_torch_cpu is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
1001-1001: Unpacked variable alpha_torch_cpu is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
1002-1002: Unpacked variable a_torch_gpu is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
1003-1003: Unpacked variable b_torch_gpu is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
1004-1004: Unpacked variable c_torch_gpu is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
1005-1005: Unpacked variable sfa_torch_gpu is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
1006-1006: Unpacked variable sfb_torch_gpu is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
1007-1007: Unpacked variable sfc_torch_gpu is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
1008-1008: Unpacked variable norm_const_torch_gpu is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
1009-1009: Unpacked variable aligned_group_m_list is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
1010-1010: Unpacked variable valid_m is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
1011-1011: Unpacked variable token_id_mapping_cpu is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
1084-1084: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
1084-1084: Avoid specifying long messages outside the exception class
(TRY003)
1123-1123: Abstract raise to an inner function
(TRY301)
1123-1123: Avoid specifying long messages outside the exception class
(TRY003)
1126-1126: Unpacked variable m_first is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
1136-1136: Consider moving this statement to an else block
(TRY300)
1139-1139: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
1139-1139: Avoid specifying long messages outside the exception class
(TRY003)
1140-1140: Do not catch blind exception: Exception
(BLE001)
1141-1141: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
1141-1141: Avoid specifying long messages outside the exception class
(TRY003)
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py
170-170: Avoid specifying long messages outside the exception class
(TRY003)
172-172: Avoid specifying long messages outside the exception class
(TRY003)
268-268: Unused function argument: loc
(ARG001)
268-268: Unused function argument: ip
(ARG001)
270-270: Unpacked variable work_iteration is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (3)
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py (3)
2959-2994: `_compute_grid` raster flag wiring looks consistent

The updated `_compute_grid` now threads `raster_along_m` into `PersistentTileSchedulerParams`, and `__call__` passes `self.raster_along_m` through. Combined with the scheduler hook, this gives you a clean knob to flip traversal order at runtime without touching call sites. No action needed here; just noting that the propagation is sane and localized.
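A self-contained sketch of that propagation, using a stub in place of the real scheduler params class (argument lists are illustrative, not the kernel's actual signatures):

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class PersistentTileSchedulerParams:  # stub standing in for the DSL class
    num_tiles_mnl: Tuple[int, int, int]
    raster_along_m: bool = False

def compute_grid_sketch(num_tiles_mnl, raster_along_m: bool):
    # _compute_grid threads the flag into the scheduler params; __call__
    # passes self.raster_along_m through, so call sites stay unchanged.
    return PersistentTileSchedulerParams(num_tiles_mnl,
                                         raster_along_m=raster_along_m)
```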
3312-3394: `orig_m` vs `m` separation in `wrapper` matches gather semantics

The new `wrapper` parameters (`orig_m`, `m`) and layouts:

- Use `orig_m` for `a` and `a_sf` layouts.
- Use `m` (permuted size) for `c`, `c_sf`, `token_id_mapping`, and num-tiles computations.

This aligns with the Python runner (`orig_m = a.size(0)`, `m = permuted_idx_to_expanded_idx.size(0)`) and the gather design, so the host/test script wiring is coherent. No changes required here.
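The distinction boils down to two row counts, illustrated below with tensor names from the runner script (the helper function itself is hypothetical):

```python
import torch

def layout_row_counts(a: torch.Tensor,
                      permuted_idx_to_expanded_idx: torch.Tensor):
    orig_m = a.size(0)                        # drives the a / a_sf layouts
    m = permuted_idx_to_expanded_idx.size(0)  # drives c, c_sf, token_id_mapping
    return orig_m, m
```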
3397-3424: Exported sf layout converters are consistent with the test harness usage

The new jitted helpers `cvt_sf_MKL_to_M32x4xrm_K4xrk_L` and `cvt_sf_M32x4xrm_K4xrk_L_to_MKL` mirror the layouts assumed in the test script (MKL ↔ MMA swizzled forms), and the implementation matches the existing pattern (grouping modes then copying by hierarchical coord). Looks good as a shared utility for both the kernel and the standalone script.
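A shape-level sketch of the two layouts the converters map between; the (32, 4) x (4,) mode grouping is inferred from the helper names and should be treated as an assumption:

```python
# MKL (flat) vs. M32x4xrm_K4xrk_L (MMA-swizzled) scale-factor shapes.
M, K, L = 256, 64, 2
mkl_shape = (M, K, L)
swizzled_shape = ((32, 4, M // (32 * 4)), (4, K // 4), L)  # assumed grouping
```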
/bot run --disable-fail-fast
PR_Github #30683 [ run ] triggered by Bot. Commit:
PR_Github #30683 [ run ] completed with state
Signed-off-by: Zongfei Jing <[email protected]>
/bot run --disable-fail-fast
PR_Github #30747 [ run ] triggered by Bot. Commit:
/bot run --disable-fail-fast
PR_Github #30749 [ run ] triggered by Bot. Commit:
… gemm swiglu fusion kernel Signed-off-by: Zongfei Jing <[email protected]>
Force-pushed from 3b4e1b6 to e129576
/bot run --disable-fail-fast
PR_Github #30756 [ run ] triggered by Bot. Commit:
PR_Github #30756 [ run ] completed with state
syuoni left a comment:
LGTM, thanks
Summary by CodeRabbit
Release Notes
New Features
Tests
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`

Provide a user-friendly way for developers to interact with a Jenkins server.

Run `/bot [-h|--help]` to print this help message. See details below for each supported subcommand.

Details

run

`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug (experimental)]`

Launch build/test pipelines. All previously running jobs will be killed.

- `--reuse-test (optional)pipeline-id` (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL): Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL): Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL): Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL): Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill

`kill`: Kill all running builds associated with the pull request.

skip

`skip --comment COMMENT`: Skip testing for the latest commit on the pull request. `--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

`reuse-pipeline`: Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.