Conversation

@zongfeijing
Collaborator

@zongfeijing zongfeijing commented Jan 6, 2026

Summary by CodeRabbit

Release Notes

  • New Features

    • Added raster_along_m flag enabling configurable kernel output layout orientation and scheduling behavior
    • Extended tactic system to support enhanced configuration control with additional parameters
    • Introduced scale-factor tensor layout conversion utilities
  • Tests

    • Comprehensive new test harness with tensor construction and reference implementations for kernel validation


Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of care and validation can break top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of care and validation can break top of tree.

Signed-off-by: Zongfei Jing <[email protected]>
@zongfeijing zongfeijing marked this pull request as ready for review January 6, 2026 02:41
@zongfeijing zongfeijing requested review from a team as code owners January 6, 2026 02:41
@coderabbitai
Contributor

coderabbitai bot commented Jan 6, 2026

📝 Walkthrough


Introduce a raster_along_m parameter to CuteDSL-based NVFP4 grouped GEMM tactics and kernel scheduling. Tactics expand to 3-tuples including a raster orientation flag, and kernel construction, grid computation, and work-index decoding now support alternative data layout configurations determined at runtime.

Changes

Cohort / File(s) — Summary

  • Custom ops interface (tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py): Updated get_valid_tactics() methods in CuteDSLNVFP4BlackwellLinear, Sm100BlockScaledContiguousGroupedGemmFinalizeFusionRunner, and Sm100BlockScaledContiguousGatherGroupedGemmSwigluFusionRunner to return 3-tuples (mma_tiler_mn, cluster_shape_mn, raster_along_m) instead of 2-tuples. forward() methods updated to unpack the extended tuple; default raster_along_m is False. Cache keys and kernel constructor calls now include the raster_along_m parameter.
  • Kernel scheduling and grid logic (tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py): Added a raster_along_m parameter to BlockScaledContiguousGatherGroupedGemmKernel.__init__() and threaded it through grid computation and tile scheduling. New hook functions (hook_PersistentTileSchedulerParams_init, hook_Get_cluster_work_idx_with_fastdivmod) implement conditional work-index decoding and cluster scheduling based on raster orientation. Introduced cvt_sf_MKL_to_M32x4xrm_K4xrk_L and cvt_sf_M32x4xrm_K4xrk_L_to_MKL helper functions for scale-factor layout conversion. Updated the cutlass.utils.PersistentTileSchedulerParams.__init__() signature to accept raster_along_m.
  • Test harness (tests/scripts/cute_dsl_kernels/run_blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py): New test script with comprehensive tensor factories: create_mask(), create_scale_factor_tensor(), create_scale_factor_tensor_unswizzled(), create_sf_layout_tensor(), create_token_id_mapping_tensor(), and create_tensors(). The main run() orchestrator accepts raster_along_m, constructs kernels, executes via CUDA, and performs reference verification including SwiGLU fusion and quantization paths. The CLI entry point supports problem-size configuration, dtype selection, and benchmark integration.

Sequence Diagram(s)

sequenceDiagram
    participant CustomOps as Custom Ops<br/>(cute_dsl_custom_ops)
    participant KernelFactory as Kernel Factory<br/>(blockscaled_*_fusion)
    participant Scheduler as Tile Scheduler<br/>(PersistentScheduler)
    participant CUDAExec as CUDA Execution
    
    rect rgb(240, 248, 255)
    Note over CustomOps: Tactic Selection Phase
    CustomOps->>CustomOps: get_valid_tactics()
    Note over CustomOps: Returns 3-tuple:<br/>(mma_tiler_mn, cluster_shape_mn,<br/>raster_along_m)
    CustomOps->>KernelFactory: forward(inputs, tactic=3tuple)
    end
    
    rect rgb(245, 250, 240)
    Note over KernelFactory: Kernel Construction Phase
    KernelFactory->>KernelFactory: Unpack tactic tuple
    activate KernelFactory
    Note over KernelFactory: Extract raster_along_m
    KernelFactory->>KernelFactory: Build cache key<br/>(includes raster_along_m)
    deactivate KernelFactory
    end
    
    rect rgb(250, 240, 245)
    Note over Scheduler: Grid & Scheduler Init Phase
    KernelFactory->>Scheduler: _compute_grid()<br/>(raster_along_m parameter)
    activate Scheduler
    Scheduler->>Scheduler: PersistentTileSchedulerParams<br/>.__init__(raster_along_m=True/False)
    alt raster_along_m = True
        Scheduler->>Scheduler: Column-major work decode<br/>(cluster_m, cluster_n first)
    else raster_along_m = False
        Scheduler->>Scheduler: Row-major work decode<br/>(cluster_n first)
    end
    Scheduler->>Scheduler: FastDivmod divisors<br/>(conditional on swizzle)
    deactivate Scheduler
    end
    
    rect rgb(240, 245, 250)
    Note over CUDAExec: Kernel Execution Phase
    KernelFactory->>CUDAExec: Launch kernel<br/>(includes raster_along_m,<br/>PersistentScheduler config)
    activate CUDAExec
    CUDAExec->>CUDAExec: Work index decoding<br/>via _get_cluster_work_idx_with_fastdivmod
    CUDAExec->>CUDAExec: Tile iteration &<br/>barrier sync<br/>(branches by raster_along_m)
    deactivate CUDAExec
    end
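The alt branch in the diagram boils down to which cluster coordinate varies fastest when a linear work index is decoded. A minimal sketch of the two decode orders (function and argument names are illustrative, not the kernel's FastDivmod implementation):

```python
def decode_work_idx(work_idx, clusters_m, clusters_n, raster_along_m):
    # raster_along_m=True: column-major decode, cluster_m varies fastest.
    # raster_along_m=False: row-major decode, cluster_n varies fastest.
    if raster_along_m:
        cluster_m = work_idx % clusters_m
        cluster_n = work_idx // clusters_m
    else:
        cluster_n = work_idx % clusters_n
        cluster_m = work_idx // clusters_n
    return cluster_m, cluster_n
```

Which orientation wins depends on which operand's tiles you want to keep resident in L2 across consecutive clusters, which is why it is exposed as a tunable tactic rather than fixed.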

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

  • Description check — ⚠️ Warning: The PR description is incomplete; it only contains the template boilerplate without any actual description of changes, test coverage details, or explanation of the implementation. Resolution: fill in the Description and Test Coverage sections with details about what was added and how it was tested, and ensure the description explains the raster M implementation and test script purpose.
  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 53.33%, below the required 80.00% threshold. Run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (1 passed)

  • Title check — ✅ Passed: The title clearly and specifically describes the main changes (adding a test script and raster M support for the gather fc1 kernel), which aligns with the actual changes in the code.
✨ Finishing touches
  • 📝 Generate docstrings


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

In tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py:
- Around line 158-293: The hooked_get_cluster_work_idx_with_fastdivmod assigns
an unused variable and assumes FastDivmod divisors always exist; rename
work_iteration to _work_iteration in hooked_get_cluster_work_idx_with_fastdivmod
to silence the lint warning, and add an explicit guard at the start of
hooked_get_cluster_work_idx_with_fastdivmod that checks if self.params.batch_fdd
is None and raises a RuntimeError (e.g. "_get_cluster_work_idx_with_fastdivmod
requires swizzle_size == 1") so callers with swizzle_size > 1 fail fast; no
other logic changes needed in hooked_PersistentTileSchedulerParams_init other
than keeping the existing None assignments for the swizzled path.
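A hedged sketch of what that guard could look like, using plain Python in place of the DSL hook machinery (apart from `batch_fdd` and the error message quoted above, the class and method bodies here are illustrative):

```python
# Sketch of the suggested fix: fail fast when the FastDivmod divisors
# were never initialized (the swizzle_size > 1 path leaves them None),
# and underscore-prefix the decoded-but-unused iteration value.
class SchedulerSketch:
    def __init__(self, batch_fdd):
        class _Params:
            pass
        self.params = _Params()
        # None models the swizzled path where divisors are not set up.
        self.params.batch_fdd = batch_fdd

    def hooked_get_cluster_work_idx_with_fastdivmod(self, linear_idx):
        if self.params.batch_fdd is None:
            raise RuntimeError(
                "_get_cluster_work_idx_with_fastdivmod requires "
                "swizzle_size == 1")
        # `_work_iteration` is intentionally underscore-prefixed: the
        # quotient is decoded but unused, which silences RUF059.
        _work_iteration, remainder = divmod(linear_idx,
                                            self.params.batch_fdd)
        return remainder
```

The guard turns a silent wrong-answer path into an immediate, diagnosable failure for callers with swizzle_size > 1.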

In tests/scripts/cute_dsl_kernels/run_blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py:
- Around line 508-531: The run function signature includes an unused **kwargs
and a misleading type for permuted_m; remove the **kwargs parameter from the
run(...) signature and update permuted_m: int = None to permuted_m:
Optional[int] = None, adding an import for Optional from typing at the top of
the file; keep the rest of the parameters and default behavior unchanged and
ensure there are no references to **kwargs elsewhere in the file.
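The suggested signature change amounts to the following pattern (a minimal sketch; `num_groups` and the fallback logic are placeholders, not the test script's actual body):

```python
from typing import Optional

# Sketch of the cleanup: drop the unused **kwargs and make the
# implicit-Optional default explicit, per PEP 484.
def run(num_groups: int, permuted_m: Optional[int] = None) -> int:
    # When no permuted size is given, fall back to a trivial default.
    if permuted_m is None:
        permuted_m = num_groups
    return permuted_m
```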

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py (1)

1868-1891: Include raster_along_m in the kernel cache key to avoid mixing raster modes

Sm100BlockScaledContiguousGatherGroupedGemmSwigluFusionRunner.forward now varies behavior by raster_along_m, but cache_key does not include this flag. If both raster_along_m=True and False tactics are exercised in the same process, the first compiled kernel will be reused for the other mode, so tuning over both orientations will not actually change the scheduler behavior.

Recommend extending the key:

-            cache_key = (self.scaling_vector_size, self.tile_size, self.top_k,
-                         mma_tiler_mn, cluster_shape_mn)
+            cache_key = (
+                self.scaling_vector_size,
+                self.tile_size,
+                self.top_k,
+                mma_tiler_mn,
+                cluster_shape_mn,
+                raster_along_m,
+            )

This keeps the cache coherent with the compiled configuration.

Also applies to: 2019-2038, 2028-2038

🧹 Nitpick comments (2)
tests/scripts/cute_dsl_kernels/run_blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py (2)

328-331: Clean up unused unpacked return values to satisfy Ruff and simplify the API

Several values returned from helpers are never used downstream (e.g. token_id_mapping_torch from create_token_id_mapping_tensor, and many *_torch_cpu/*_torch_gpu/aligned_group_m_list/valid_m/token_id_mapping_cpu outputs from create_tensors as unpacked in run() and generate_tensors()).

To avoid RUF059 noise and keep the API focused, either:

  • Stop returning these values if they’re not needed, or
  • Prefix the unused unpack targets with _ at the call sites, e.g.:
-    token_id_mapping_cpu, token_id_mapping, token_id_mapping_torch = create_token_id_mapping_tensor(...)
+    token_id_mapping_cpu, token_id_mapping, _token_id_mapping_torch = create_token_id_mapping_tensor(...)

and similarly in the large tuple unpack from create_tensors and in generate_tensors().

Also applies to: 446-505, 981-1012


1086-1142: Consider narrowing exception handling in read_benchmark_file

read_benchmark_file currently catches FileNotFoundError and then a bare Exception, re-wrapping as argparse.ArgumentTypeError. For a CLI helper this works, but it makes debugging unexpected parse bugs harder and trips tools (BLE001/TRY003/B904).

If you want to keep tooling quiet and improve debuggability:

  • Catch only the expected parse errors (e.g. ValueError, OSError) and
  • Chain them explicitly:
-    except FileNotFoundError:
-        raise argparse.ArgumentTypeError(f"Benchmark file not found: {filepath}")
-    except Exception as e:
-        raise argparse.ArgumentTypeError(f"Error reading benchmark file: {e}")
+    except FileNotFoundError as err:
+        raise argparse.ArgumentTypeError(f"Benchmark file not found: {filepath}") from err
+    except ValueError as err:
+        raise argparse.ArgumentTypeError(f"Error reading benchmark file: {err}") from err

Optional, but it will align better with Ruff’s recommendations.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9cae727 and 40f4411.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py
  • tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py
  • tests/scripts/cute_dsl_kernels/run_blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces. Do not use tabs
Always maintain the namespace when importing in Python, even if only one class or function from a module is used
Python files should use snake_case naming: some_file.py
Python classes should use PascalCase naming: class SomeClass
Python functions and methods should use snake_case naming: def my_awesome_function():
Python local variables should use snake_case naming: my_variable = ...
Python variable names that start with a number should be prefixed with 'k': k_99th_percentile = ...
Python global variables should use upper snake_case with prefix 'G': G_MY_GLOBAL = ...
Python constants should use upper snake_case naming: MY_CONSTANT = ...
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Python comments should be reserved for code within a function, or interfaces that are local to a file
Use Google style docstrings in Python for classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with type and description
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except to the smallest set of errors possible
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible, using the else block for logic

Files:

  • tests/scripts/cute_dsl_kernels/run_blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py
  • tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py
  • tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py
**/*.{cpp,h,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the year of its latest meaningful modification

Files:

  • tests/scripts/cute_dsl_kernels/run_blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py
  • tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py
  • tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py
🧠 Learnings (4)
📚 Learning: 2025-12-12T10:07:31.564Z
Learnt from: lirundong
Repo: NVIDIA/TensorRT-LLM PR: 9725
File: tensorrt_llm/_torch/custom_ops/cuda_tile_custom_ops.py:110-178
Timestamp: 2025-12-12T10:07:31.564Z
Learning: In PyTorch custom operators registered with torch.library.custom_op, mutable operators that return None and specify mutates_args do not require a register_fake decorator. Mutation tracking is handled automatically without needing a FakeTensor kernel. This applies to Python custom op definitions in tensorrt_llm/_torch/custom_ops that use mutates_args and return None; verify you are not relying on register_fake in these cases.

Applied to files:

  • tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.

Applied to files:

  • tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py
📚 Learning: 2025-08-21T21:48:35.135Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:399-417
Timestamp: 2025-08-21T21:48:35.135Z
Learning: CUTLASS extensions in TensorRT-LLM (located under cpp/tensorrt_llm/cutlass_extensions/) are designed to integrate with and extend functionality in the external CUTLASS repository. When analyzing these extensions, their consumers and functionality wiring may exist in the CUTLASS codebase rather than within TensorRT-LLM itself.

Applied to files:

  • tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py
📚 Learning: 2025-08-08T05:10:38.906Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:0-0
Timestamp: 2025-08-08T05:10:38.906Z
Learning: The ScaledAccPerRowBiasPerColScaleScatter fusion in CUTLASS extensions (cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp) is specifically designed for per-column scaling factors only, so it uses a fixed Stride<_0,_1,int64_t> rather than conditional stride logic.

Applied to files:

  • tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py
🧬 Code graph analysis (1)
tests/scripts/cute_dsl_kernels/run_blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py (1)
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py (3)
  • cvt_sf_MKL_to_M32x4xrm_K4xrk_L (3398-3409)
  • cvt_sf_M32x4xrm_K4xrk_L_to_MKL (3413-3424)
  • get_dtype_rcp_limits (3025-3041)
🪛 Ruff (0.14.10)

tests/scripts/cute_dsl_kernels/run_blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py

  • 112-115, 581, 599-603 (TRY003): Avoid specifying long messages outside the exception class
  • 446 (RUF059): Unpacked variable token_id_mapping_torch is never used — prefix it with an underscore
  • 527 (RUF013): PEP 484 prohibits implicit Optional — convert to T | None
  • 530 (ARG001): Unused function argument: kwargs
  • 623, 631, 632 (RUF059): Unpacked variables sfc_torch_cpu, sfc_torch_gpu, and norm_const_torch_gpu are never used
  • 994-1011 (RUF059): Unpacked variables a_torch_cpu, b_torch_cpu, c_torch_cpu, sfa_torch_cpu, sfb_torch_cpu, sfc_torch_cpu, norm_const_torch_cpu, alpha_torch_cpu, a_torch_gpu, b_torch_gpu, c_torch_gpu, sfa_torch_gpu, sfb_torch_gpu, sfc_torch_gpu, norm_const_torch_gpu, aligned_group_m_list, valid_m, and token_id_mapping_cpu are never used
  • 1084, 1139, 1141 (B904, TRY003): Within an except clause, raise with `raise ... from err` (or `from None`); avoid long messages outside the exception class
  • 1123 (TRY301, TRY003): Abstract raise to an inner function; avoid long messages outside the exception class
  • 1126 (RUF059): Unpacked variable m_first is never used
  • 1136 (TRY300): Consider moving this statement to an else block
  • 1140 (BLE001): Do not catch blind exception: Exception

tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py

  • 170, 172 (TRY003): Avoid specifying long messages outside the exception class
  • 268 (ARG001): Unused function arguments: loc, ip
  • 270 (RUF059): Unpacked variable work_iteration is never used — prefix it with an underscore

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (3)
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py (3)

2959-2994: _compute_grid raster flag wiring looks consistent

The updated _compute_grid now threads raster_along_m into PersistentTileSchedulerParams, and __call__ passes self.raster_along_m through. Combined with the scheduler hook, this gives you a clean knob to flip traversal order at runtime without touching call sites.

No action needed here; just noting that the propagation is sane and localized.


3312-3394: orig_m vs m separation in wrapper matches gather semantics

The new wrapper parameters (orig_m, m) and layouts:

  • Use orig_m for a and a_sf layouts.
  • Use m (permuted size) for c, c_sf, token_id_mapping, and num-tiles computations.

This aligns with the Python runner (orig_m = a.size(0), m = permuted_idx_to_expanded_idx.size(0)) and the gather design, so the host/test script wiring is coherent.

No changes required here.


3397-3424: Exported sf layout converters are consistent with the test harness usage

The new jitted helpers cvt_sf_MKL_to_M32x4xrm_K4xrk_L and cvt_sf_M32x4xrm_K4xrk_L_to_MKL mirror the layouts assumed in the test script (MKL ↔ MMA swizzled forms), and the implementation matches the existing pattern (grouping modes then copying by hierarchical coord).

Looks good as a shared utility for both the kernel and the standalone script.
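As a rough intuition for what such MKL ↔ blocked conversions do, here is a pure-Python index-mapping sketch. The 32×4 / 4 block sizes echo the helper names, but this does not reproduce the exact atom layout of `cvt_sf_MKL_to_M32x4xrm_K4xrk_L`; it only demonstrates that a blocked reordering is a bijection on coordinates:

```python
def mkl_to_blocked_offset(m_idx, k_idx, num_k, bm=32, bk=4):
    # Map an (m, k) coordinate of a scale-factor slice to a linear
    # offset in a blocked layout: bm x bk blocks stored contiguously,
    # blocks ordered row-major over the (M/bm, K/bk) grid. Block sizes
    # are assumptions echoing the "M32x4 / K4" naming.
    bm_idx, m_in = divmod(m_idx, bm)
    bk_idx, k_in = divmod(k_idx, bk)
    blocks_k = num_k // bk
    return ((bm_idx * blocks_k + bk_idx) * bm + m_in) * bk + k_in
```

A converter pair like the ones added in the kernel file applies such a mapping in one direction and its inverse in the other, so roundtripping through both recovers the original MKL tensor.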

@zongfeijing
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #30683 [ run ] triggered by Bot. Commit: 40f4411

@liyuhannnnn liyuhannnnn self-requested a review January 6, 2026 05:41
@tensorrt-cicd
Collaborator

PR_Github #30683 [ run ] completed with state SUCCESS. Commit: 40f4411
/LLM/main/L0_MergeRequest_PR pipeline #23673 completed with status: 'SUCCESS'

@zongfeijing
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #30747 [ run ] triggered by Bot. Commit: 8c3aa4e

@zongfeijing
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #30749 [ run ] triggered by Bot. Commit: 3b4e1b6

… gemm swiglu fusion kernel

Signed-off-by: Zongfei Jing <[email protected]>
@zongfeijing zongfeijing force-pushed the user/zongfeij/scheduler branch from 3b4e1b6 to e129576 Compare January 6, 2026 15:51
@zongfeijing
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #30756 [ run ] triggered by Bot. Commit: e129576

@tensorrt-cicd
Collaborator

PR_Github #30756 [ run ] completed with state SUCCESS. Commit: e129576
/LLM/main/L0_MergeRequest_PR pipeline #23739 completed with status: 'SUCCESS'

Collaborator

@syuoni syuoni left a comment


LGTM, thanks

@zongfeijing zongfeijing merged commit bb2f883 into NVIDIA:main Jan 7, 2026
5 checks passed