Support PermuteN for 2D block scale GEMM with block 128(N)#5040

Open
amd-khushbu wants to merge 29 commits into develop from ck/khuagarw/AICK-442

Conversation

@amd-khushbu (Contributor) commented Mar 3, 2026

Proposed changes

This PR enables preshuffleB with PermuteN for 2D block scale GEMM operations when the block quantization group size is 128 in the N dimension (BQuantGroupSize::kN == 128).

Motivation

  • Jira Ticket: AICK-442
  • PermuteN rearranges the matrix layout in memory so that loads are coalesced, improving performance
  • Previously, PermuteN was only supported for BQuantGroupSize::kN == 1 (per-element quantization)
  • This change extends support to BQuantGroupSize::kN == 128 (block-wise quantization)

Key Changes

1. Extended PermuteN Support in GEMM Pipeline (run_gemm_quant_example.inc)

  • Modified TiledPermuteN condition to enable PermuteN when BQuantGroupSize::kN == 1 || BQuantGroupSize::kN == 128
  • Updated shuffle_b_permuteN and bq_permuteN invocation conditions to support the new block size

2. Enhanced Tensor Shuffle Utilities (tensor_shuffle_utils.hpp)

  • Updated bq_permuteN function to handle both per-element (group_n == 1) and block-128 (group_n == 128) quantization
  • For group_n == 1: Uses full N-tile decomposition with NWarp, N_Warp_Tile, and NRepeat dimensions
  • For group_n == 128: Uses a simplified view where the entire block is treated as a single unit

3. Block GEMM Kernel Changes (block_universal_gemm_ar_flatbr_bquant_cr.hpp)

  • Added NPerBlock constant for proper dimension tracking
  • Modified scale register offset calculation in the BPreshuffleQuant path:
    • When BQuantGroupSize::kN > (NWarp * WG::kN) and NPerBlock == BQuantGroupSize::kN: Uses a single quant group per block (prefill scenario)
    • Otherwise: Uses nIter for decode or multiple groups per warp scenarios

Technical Details

The key insight is that when BQuantGroupSize::kN == 128 (matching the N block size), each thread block processes exactly one quantization group in the N dimension. This allows the same PermuteN optimization to be applied, as the scale values can be broadcast efficiently within the block.

Checklist

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally
  • I have added the test to the REGRESSION_TESTS list defined at the top of tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
  • I have added inline documentation that helps maintainers understand the motivation
  • I have removed the stale documentation which is no longer relevant after this pull request
  • (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
  • I have run clang-format on all changed files
  • Any dependent changes have been merged

Discussion

The implementation distinguishes between two scenarios:

  1. Per-element quantization (kN == 1): Each element has its own scale, full N-tile decomposition is used
  2. Block quantization (kN == 128): One scale per 128 elements in N, simplified view where scales are broadcast within the block

This approach maintains backward compatibility while enabling performance optimizations for the common 128-block quantization case used in modern quantized models.

@ThomasNing (Contributor) commented:

@amd-khushbu When transpose C is enabled, we need a new PermuteN algorithm in the C-shuffle epilogue that treats the N dimension, rather than the M dimension, as the outer loop.

@ThomasNing (Contributor) commented:

@amd-khushbu CI error.

@ThomasNing (Contributor) left a review comment:


Could we also add guards for, and document, the block scale sizes on the N dimension that do not support PermuteN, and explain why?

    using TypeConfig =
        decltype(GemmQuantTypeConfig<ck_tile::bf8_t, ck_tile::bf8_t, ck_tile::half_t, float>{});
-   return run_gemm_example_prec_type<GemmConfigPrefill<ck_tile::bf8_t>,
+   return run_gemm_example_prec_type<GemmConfigPrefill<ck_tile::bf8_t, false>,

Could we add a comment here explaining what the false boolean argument means?

@ThomasNing (Contributor) commented:

@amd-khushbu CI failed again. PTAL?
