Conversation

@zongfeijing
Collaborator

@zongfeijing zongfeijing commented Jan 6, 2026

Summary by CodeRabbit

Release Notes

  • New Features

    • Added raster_along_m flag enabling configurable kernel output layout orientation and scheduling behavior
    • Extended tactic system to support enhanced configuration control with additional parameters
    • Introduced scale-factor tensor layout conversion utilities
  • Tests

    • Comprehensive new test harness with tensor construction and reference implementations for kernel validation


Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of care and validation can break top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of care and validation can break top of tree.

Signed-off-by: Zongfei Jing <[email protected]>
@zongfeijing zongfeijing marked this pull request as ready for review January 6, 2026 02:41
@zongfeijing zongfeijing requested review from a team as code owners January 6, 2026 02:41
@coderabbitai
Contributor

coderabbitai bot commented Jan 6, 2026

📝 Walkthrough


Introduce a raster_along_m parameter to CuteDSL-based NVFP4 grouped GEMM tactics and kernel scheduling. Tactics expand to 3-tuples including a raster orientation flag, and kernel construction, grid computation, and work-index decoding now support alternative data layout configurations determined at runtime.

Changes

Cohort / File(s) — Summary

  • Custom ops interface (tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py): Updated get_valid_tactics() methods in CuteDSLNVFP4BlackwellLinear, Sm100BlockScaledContiguousGroupedGemmFinalizeFusionRunner, and Sm100BlockScaledContiguousGatherGroupedGemmSwigluFusionRunner to return 3-tuples (mma_tiler_mn, cluster_shape_mn, raster_along_m) instead of 2-tuples. forward() methods updated to unpack the extended tuple; default raster_along_m is False. Cache keys and kernel constructor calls now include the raster_along_m parameter.
  • Kernel scheduling and grid logic (tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py): Added a raster_along_m parameter to BlockScaledContiguousGatherGroupedGemmKernel.__init__() and threaded it through grid computation and tile scheduling. New hook functions (hook_PersistentTileSchedulerParams_init, hook_Get_cluster_work_idx_with_fastdivmod) implement conditional work-index decoding and cluster scheduling based on raster orientation. Introduced cvt_sf_MKL_to_M32x4xrm_K4xrk_L and cvt_sf_M32x4xrm_K4xrk_L_to_MKL helper functions for scale-factor layout conversion. Updated the cutlass.utils.PersistentTileSchedulerParams.__init__() signature to accept raster_along_m.
  • Test harness (tests/scripts/cute_dsl_kernels/run_blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py): New test script with comprehensive tensor factories: create_mask(), create_scale_factor_tensor(), create_scale_factor_tensor_unswizzled(), create_sf_layout_tensor(), create_token_id_mapping_tensor(), and create_tensors(). The main run() orchestrator accepts raster_along_m, constructs kernels, executes via CUDA, and performs reference verification including SwiGLU fusion and quantization paths. The CLI entry point supports problem-size configuration, dtype selection, and benchmark integration.

Sequence Diagram(s)

sequenceDiagram
    participant CustomOps as Custom Ops<br/>(cute_dsl_custom_ops)
    participant KernelFactory as Kernel Factory<br/>(blockscaled_*_fusion)
    participant Scheduler as Tile Scheduler<br/>(PersistentScheduler)
    participant CUDAExec as CUDA Execution
    
    rect rgb(240, 248, 255)
    Note over CustomOps: Tactic Selection Phase
    CustomOps->>CustomOps: get_valid_tactics()
    Note over CustomOps: Returns 3-tuple:<br/>(mma_tiler_mn, cluster_shape_mn,<br/>raster_along_m)
    CustomOps->>KernelFactory: forward(inputs, tactic=3tuple)
    end
    
    rect rgb(245, 250, 240)
    Note over KernelFactory: Kernel Construction Phase
    KernelFactory->>KernelFactory: Unpack tactic tuple
    activate KernelFactory
    Note over KernelFactory: Extract raster_along_m
    KernelFactory->>KernelFactory: Build cache key<br/>(includes raster_along_m)
    deactivate KernelFactory
    end
    
    rect rgb(250, 240, 245)
    Note over Scheduler: Grid & Scheduler Init Phase
    KernelFactory->>Scheduler: _compute_grid()<br/>(raster_along_m parameter)
    activate Scheduler
    Scheduler->>Scheduler: PersistentTileSchedulerParams<br/>.__init__(raster_along_m=True/False)
    alt raster_along_m = True
        Scheduler->>Scheduler: Column-major work decode<br/>(cluster_m, cluster_n first)
    else raster_along_m = False
        Scheduler->>Scheduler: Row-major work decode<br/>(cluster_n first)
    end
    Scheduler->>Scheduler: FastDivmod divisors<br/>(conditional on swizzle)
    deactivate Scheduler
    end
    
    rect rgb(240, 245, 250)
    Note over CUDAExec: Kernel Execution Phase
    KernelFactory->>CUDAExec: Launch kernel<br/>(includes raster_along_m,<br/>PersistentScheduler config)
    activate CUDAExec
    CUDAExec->>CUDAExec: Work index decoding<br/>via _get_cluster_work_idx_with_fastdivmod
    CUDAExec->>CUDAExec: Tile iteration &<br/>barrier sync<br/>(branches by raster_along_m)
    deactivate CUDAExec
    end
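The alt branch in the diagram boils down to which cluster coordinate varies fastest when a linear work index is decoded. A minimal sketch of the two decode orders (function and argument names are illustrative, not the kernel's FastDivmod implementation):

```python
def decode_work_idx(work_idx, clusters_m, clusters_n, raster_along_m):
    # raster_along_m=True: column-major decode, cluster_m varies fastest.
    # raster_along_m=False: row-major decode, cluster_n varies fastest.
    if raster_along_m:
        cluster_m = work_idx % clusters_m
        cluster_n = work_idx // clusters_m
    else:
        cluster_n = work_idx % clusters_n
        cluster_m = work_idx // clusters_n
    return cluster_m, cluster_n
```

Which orientation wins depends on which operand's tiles you want to keep resident in L2 across consecutive clusters, which is why it is exposed as a tunable tactic rather than fixed.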

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

  • Description check — ⚠️ Warning: The PR description is incomplete; it only contains the template boilerplate without any actual description of changes, test coverage details, or explanation of the implementation. Resolution: fill in the Description and Test Coverage sections with details about what was added and how it was tested, and ensure the description explains the raster M implementation and test script purpose.
  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 53.33%, below the required 80.00% threshold. Run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (1 passed)

  • Title check — ✅ Passed: The title clearly and specifically describes the main changes (adding a test script and raster M support for the gather fc1 kernel), which aligns with the actual changes in the code.
✨ Finishing touches
  • 📝 Generate docstrings


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

In tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py:
- Around line 158-293: The hooked_get_cluster_work_idx_with_fastdivmod assigns
an unused variable and assumes FastDivmod divisors always exist; rename
work_iteration to _work_iteration in hooked_get_cluster_work_idx_with_fastdivmod
to silence the lint warning, and add an explicit guard at the start of
hooked_get_cluster_work_idx_with_fastdivmod that checks if self.params.batch_fdd
is None and raises a RuntimeError (e.g. "_get_cluster_work_idx_with_fastdivmod
requires swizzle_size == 1") so callers with swizzle_size > 1 fail fast; no
other logic changes needed in hooked_PersistentTileSchedulerParams_init other
than keeping the existing None assignments for the swizzled path.
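A hedged sketch of what that guard could look like, using plain Python in place of the DSL hook machinery (apart from `batch_fdd` and the error message quoted above, the class and method bodies here are illustrative):

```python
# Sketch of the suggested fix: fail fast when the FastDivmod divisors
# were never initialized (the swizzle_size > 1 path leaves them None),
# and underscore-prefix the decoded-but-unused iteration value.
class SchedulerSketch:
    def __init__(self, batch_fdd):
        class _Params:
            pass
        self.params = _Params()
        # None models the swizzled path where divisors are not set up.
        self.params.batch_fdd = batch_fdd

    def hooked_get_cluster_work_idx_with_fastdivmod(self, linear_idx):
        if self.params.batch_fdd is None:
            raise RuntimeError(
                "_get_cluster_work_idx_with_fastdivmod requires "
                "swizzle_size == 1")
        # `_work_iteration` is intentionally underscore-prefixed: the
        # quotient is decoded but unused, which silences RUF059.
        _work_iteration, remainder = divmod(linear_idx,
                                            self.params.batch_fdd)
        return remainder
```

The guard turns a silent wrong-answer path into an immediate, diagnosable failure for callers with swizzle_size > 1.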

In tests/scripts/cute_dsl_kernels/run_blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py:
- Around line 508-531: The run function signature includes an unused **kwargs
and a misleading type for permuted_m; remove the **kwargs parameter from the
run(...) signature and update permuted_m: int = None to permuted_m:
Optional[int] = None, adding an import for Optional from typing at the top of
the file; keep the rest of the parameters and default behavior unchanged and
ensure there are no references to **kwargs elsewhere in the file.
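The suggested signature change amounts to the following pattern (a minimal sketch; `num_groups` and the fallback logic are placeholders, not the test script's actual body):

```python
from typing import Optional

# Sketch of the cleanup: drop the unused **kwargs and make the
# implicit-Optional default explicit, per PEP 484.
def run(num_groups: int, permuted_m: Optional[int] = None) -> int:
    # When no permuted size is given, fall back to a trivial default.
    if permuted_m is None:
        permuted_m = num_groups
    return permuted_m
```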

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py (1)

1868-1891: Include raster_along_m in the kernel cache key to avoid mixing raster modes

Sm100BlockScaledContiguousGatherGroupedGemmSwigluFusionRunner.forward now varies behavior by raster_along_m, but cache_key does not include this flag. If both raster_along_m=True and False tactics are exercised in the same process, the first compiled kernel will be reused for the other mode, so tuning over both orientations will not actually change the scheduler behavior.

Recommend extending the key:

-            cache_key = (self.scaling_vector_size, self.tile_size, self.top_k,
-                         mma_tiler_mn, cluster_shape_mn)
+            cache_key = (
+                self.scaling_vector_size,
+                self.tile_size,
+                self.top_k,
+                mma_tiler_mn,
+                cluster_shape_mn,
+                raster_along_m,
+            )

This keeps the cache coherent with the compiled configuration.

Also applies to: 2019-2038, 2028-2038

🧹 Nitpick comments (2)
tests/scripts/cute_dsl_kernels/run_blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py (2)

328-331: Clean up unused unpacked return values to satisfy Ruff and simplify the API

Several values returned from helpers are never used downstream (e.g. token_id_mapping_torch from create_token_id_mapping_tensor, and many *_torch_cpu/*_torch_gpu/aligned_group_m_list/valid_m/token_id_mapping_cpu outputs from create_tensors as unpacked in run() and generate_tensors()).

To avoid RUF059 noise and keep the API focused, either:

  • Stop returning these values if they’re not needed, or
  • Prefix the unused unpack targets with _ at the call sites, e.g.:
-    token_id_mapping_cpu, token_id_mapping, token_id_mapping_torch = create_token_id_mapping_tensor(...)
+    token_id_mapping_cpu, token_id_mapping, _token_id_mapping_torch = create_token_id_mapping_tensor(...)

and similarly in the large tuple unpack from create_tensors and in generate_tensors().

Also applies to: 446-505, 981-1012


1086-1142: Consider narrowing exception handling in read_benchmark_file

read_benchmark_file currently catches FileNotFoundError and then a bare Exception, re-wrapping as argparse.ArgumentTypeError. For a CLI helper this works, but it makes debugging unexpected parse bugs harder and trips tools (BLE001/TRY003/B904).

If you want to keep tooling quiet and improve debuggability:

  • Catch only the expected parse errors (e.g. ValueError, OSError) and
  • Chain them explicitly:
-    except FileNotFoundError:
-        raise argparse.ArgumentTypeError(f"Benchmark file not found: {filepath}")
-    except Exception as e:
-        raise argparse.ArgumentTypeError(f"Error reading benchmark file: {e}")
+    except FileNotFoundError as err:
+        raise argparse.ArgumentTypeError(f"Benchmark file not found: {filepath}") from err
+    except ValueError as err:
+        raise argparse.ArgumentTypeError(f"Error reading benchmark file: {err}") from err

Optional, but it will align better with Ruff’s recommendations.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9cae727 and 40f4411.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py
  • tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py
  • tests/scripts/cute_dsl_kernels/run_blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces. Do not use tabs
Always maintain the namespace when importing in Python, even if only one class or function from a module is used
Python files should use snake_case naming: some_file.py
Python classes should use PascalCase naming: class SomeClass
Python functions and methods should use snake_case naming: def my_awesome_function():
Python local variables should use snake_case naming: my_variable = ...
Python variable names that start with a number should be prefixed with 'k': k_99th_percentile = ...
Python global variables should use upper snake_case with prefix 'G': G_MY_GLOBAL = ...
Python constants should use upper snake_case naming: MY_CONSTANT = ...
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Python comments should be reserved for code within a function, or interfaces that are local to a file
Use Google style docstrings in Python for classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with type and description
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except to the smallest set of errors possible
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible, using the else block for logic

Files:

  • tests/scripts/cute_dsl_kernels/run_blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py
  • tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py
  • tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py
**/*.{cpp,h,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the year of its latest meaningful modification

Files:

  • tests/scripts/cute_dsl_kernels/run_blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py
  • tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py
  • tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py
🧠 Learnings (4)
📚 Learning: 2025-12-12T10:07:31.564Z
Learnt from: lirundong
Repo: NVIDIA/TensorRT-LLM PR: 9725
File: tensorrt_llm/_torch/custom_ops/cuda_tile_custom_ops.py:110-178
Timestamp: 2025-12-12T10:07:31.564Z
Learning: In PyTorch custom operators registered with torch.library.custom_op, mutable operators that return None and specify mutates_args do not require a register_fake decorator. Mutation tracking is handled automatically without needing a FakeTensor kernel. This applies to Python custom op definitions in tensorrt_llm/_torch/custom_ops that use mutates_args and return None; verify you are not relying on register_fake in these cases.

Applied to files:

  • tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.

Applied to files:

  • tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py
📚 Learning: 2025-08-21T21:48:35.135Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:399-417
Timestamp: 2025-08-21T21:48:35.135Z
Learning: CUTLASS extensions in TensorRT-LLM (located under cpp/tensorrt_llm/cutlass_extensions/) are designed to integrate with and extend functionality in the external CUTLASS repository. When analyzing these extensions, their consumers and functionality wiring may exist in the CUTLASS codebase rather than within TensorRT-LLM itself.

Applied to files:

  • tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py
📚 Learning: 2025-08-08T05:10:38.906Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:0-0
Timestamp: 2025-08-08T05:10:38.906Z
Learning: The ScaledAccPerRowBiasPerColScaleScatter fusion in CUTLASS extensions (cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp) is specifically designed for per-column scaling factors only, so it uses a fixed Stride<_0,_1,int64_t> rather than conditional stride logic.

Applied to files:

  • tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py
🧬 Code graph analysis (1)
tests/scripts/cute_dsl_kernels/run_blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py (1)
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py (3)
  • cvt_sf_MKL_to_M32x4xrm_K4xrk_L (3398-3409)
  • cvt_sf_M32x4xrm_K4xrk_L_to_MKL (3413-3424)
  • get_dtype_rcp_limits (3025-3041)
🪛 Ruff (0.14.10)

tests/scripts/cute_dsl_kernels/run_blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py

  • 112-115, 581, 599-603 (TRY003): Avoid specifying long messages outside the exception class
  • 446 (RUF059): Unpacked variable token_id_mapping_torch is never used — prefix it with an underscore
  • 527 (RUF013): PEP 484 prohibits implicit Optional — convert to T | None
  • 530 (ARG001): Unused function argument: kwargs
  • 623, 631, 632 (RUF059): Unpacked variables sfc_torch_cpu, sfc_torch_gpu, and norm_const_torch_gpu are never used
  • 994-1011 (RUF059): Unpacked variables a_torch_cpu, b_torch_cpu, c_torch_cpu, sfa_torch_cpu, sfb_torch_cpu, sfc_torch_cpu, norm_const_torch_cpu, alpha_torch_cpu, a_torch_gpu, b_torch_gpu, c_torch_gpu, sfa_torch_gpu, sfb_torch_gpu, sfc_torch_gpu, norm_const_torch_gpu, aligned_group_m_list, valid_m, and token_id_mapping_cpu are never used
  • 1084, 1139, 1141 (B904, TRY003): Within an except clause, raise with `raise ... from err` (or `from None`); avoid long messages outside the exception class
  • 1123 (TRY301, TRY003): Abstract raise to an inner function; avoid long messages outside the exception class
  • 1126 (RUF059): Unpacked variable m_first is never used
  • 1136 (TRY300): Consider moving this statement to an else block
  • 1140 (BLE001): Do not catch blind exception: Exception

tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py

  • 170, 172 (TRY003): Avoid specifying long messages outside the exception class
  • 268 (ARG001): Unused function arguments: loc, ip
  • 270 (RUF059): Unpacked variable work_iteration is never used — prefix it with an underscore

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (3)
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py (3)

2959-2994: _compute_grid raster flag wiring looks consistent

The updated _compute_grid now threads raster_along_m into PersistentTileSchedulerParams, and __call__ passes self.raster_along_m through. Combined with the scheduler hook, this gives you a clean knob to flip traversal order at runtime without touching call sites.

No action needed here; just noting that the propagation is sane and localized.


3312-3394: orig_m vs m separation in wrapper matches gather semantics

The new wrapper parameters (orig_m, m) and layouts:

  • Use orig_m for a and a_sf layouts.
  • Use m (permuted size) for c, c_sf, token_id_mapping, and num-tiles computations.

This aligns with the Python runner (orig_m = a.size(0), m = permuted_idx_to_expanded_idx.size(0)) and the gather design, so the host/test script wiring is coherent.

No changes required here.


3397-3424: Exported sf layout converters are consistent with the test harness usage

The new jitted helpers cvt_sf_MKL_to_M32x4xrm_K4xrk_L and cvt_sf_M32x4xrm_K4xrk_L_to_MKL mirror the layouts assumed in the test script (MKL ↔ MMA swizzled forms), and the implementation matches the existing pattern (grouping modes then copying by hierarchical coord).

Looks good as a shared utility for both the kernel and the standalone script.
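As a rough intuition for what such MKL ↔ blocked conversions do, here is a pure-Python index-mapping sketch. The 32×4 / 4 block sizes echo the helper names, but this does not reproduce the exact atom layout of `cvt_sf_MKL_to_M32x4xrm_K4xrk_L`; it only demonstrates that a blocked reordering is a bijection on coordinates:

```python
def mkl_to_blocked_offset(m_idx, k_idx, num_k, bm=32, bk=4):
    # Map an (m, k) coordinate of a scale-factor slice to a linear
    # offset in a blocked layout: bm x bk blocks stored contiguously,
    # blocks ordered row-major over the (M/bm, K/bk) grid. Block sizes
    # are assumptions echoing the "M32x4 / K4" naming.
    bm_idx, m_in = divmod(m_idx, bm)
    bk_idx, k_in = divmod(k_idx, bk)
    blocks_k = num_k // bk
    return ((bm_idx * blocks_k + bk_idx) * bm + m_in) * bk + k_in
```

A converter pair like the ones added in the kernel file applies such a mapping in one direction and its inverse in the other, so roundtripping through both recovers the original MKL tensor.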

@zongfeijing
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #30683 [ run ] triggered by Bot. Commit: 40f4411

@liyuhannnnn liyuhannnnn self-requested a review January 6, 2026 05:41
@tensorrt-cicd
Collaborator

PR_Github #30683 [ run ] completed with state SUCCESS. Commit: 40f4411
/LLM/main/L0_MergeRequest_PR pipeline #23673 completed with status: 'SUCCESS'

@zongfeijing
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #30747 [ run ] triggered by Bot. Commit: 8c3aa4e

@zongfeijing
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #30749 [ run ] triggered by Bot. Commit: 3b4e1b6

… gemm swiglu fusion kernel

Signed-off-by: Zongfei Jing <[email protected]>
@zongfeijing zongfeijing force-pushed the user/zongfeij/scheduler branch from 3b4e1b6 to e129576 Compare January 6, 2026 15:51
@zongfeijing
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #30756 [ run ] triggered by Bot. Commit: e129576

@tensorrt-cicd
Collaborator

PR_Github #30756 [ run ] completed with state SUCCESS. Commit: e129576
/LLM/main/L0_MergeRequest_PR pipeline #23739 completed with status: 'SUCCESS'

Collaborator

@syuoni syuoni left a comment


LGTM, thanks

@zongfeijing zongfeijing merged commit bb2f883 into NVIDIA:main Jan 7, 2026
5 checks passed