Support PermuteN for 2D block scale GEMM with block 128(N) #5040
Open
amd-khushbu wants to merge 29 commits into develop
Contributor
@amd-khushbu When transpose C is enabled, the C-shuffle epilogue needs a new PermuteN algorithm that treats the N dimension, rather than the M dimension, as the outer loop.
Contributor
@amd-khushbu CI error.
ThomasNing (Contributor) requested changes on Mar 11, 2026 and left a comment:
Could we also add a guard for the block-scale sizes in the N dimension that do not support PermuteN, and document why they are unsupported?
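One way such a guard could look is a compile-time check. This is only a sketch under stated assumptions: `QuantGroupShapeSketch`, `permute_n_supported`, and `PermuteNGuard` are hypothetical names, not part of ck_tile, and the supported set (1 and 128) is taken from the condition quoted in the PR description.

```cpp
// Hypothetical stand-in for the quantization group shape; only kN is modeled here.
template <int N>
struct QuantGroupShapeSketch
{
    static constexpr int kN = N;
};

// True for the N group sizes the PR description lists as PermuteN-capable (1 and 128).
template <typename BQuantGroupSize>
constexpr bool permute_n_supported()
{
    return BQuantGroupSize::kN == 1 || BQuantGroupSize::kN == 128;
}

// Hypothetical guard: instantiating it with PermuteN requested and an
// unsupported N group size fails at compile time with a readable message.
template <typename BQuantGroupSize, bool TiledPermuteN>
struct PermuteNGuard
{
    static_assert(!TiledPermuteN || permute_n_supported<BQuantGroupSize>(),
                  "PermuteN requires BQuantGroupSize::kN == 1 or 128");
    static constexpr bool value = TiledPermuteN;
};
```

With this shape, `PermuteNGuard<QuantGroupShapeSketch<64>, true>` would not compile, while the same group size with PermuteN disabled would pass through unchanged.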
  using TypeConfig =
      decltype(GemmQuantTypeConfig<ck_tile::bf8_t, ck_tile::bf8_t, ck_tile::half_t, float>{});
- return run_gemm_example_prec_type<GemmConfigPrefill<ck_tile::bf8_t>,
+ return run_gemm_example_prec_type<GemmConfigPrefill<ck_tile::bf8_t, false>,
Contributor
Could we add a comment here explaining to the user what the false boolean means?
Contributor
@amd-khushbu CI failed again. PTAL?
Proposed changes
This PR enables preshuffleB with PermuteN for 2D block scale GEMM operations when the block quantization group size is 128 in the N dimension (BQuantGroupSize::kN == 128).
Motivation
Key Changes
1. Extended PermuteN Support in GEMM Pipeline (run_gemm_quant_example.inc)
- TiledPermuteN condition to enable PermuteN when BQuantGroupSize::kN == 1 || BQuantGroupSize::kN == 128
- shuffle_b_permuteN and bq_permuteN invocation conditions to support the new block size
2. Enhanced Tensor Shuffle Utilities (tensor_shuffle_utils.hpp)
- bq_permuteN function to handle both per-element (group_n == 1) and block-128 (group_n == 128) quantization
3. Block GEMM Kernel Changes (block_universal_gemm_ar_flatbr_bquant_cr.hpp)
- NPerBlock constant for proper dimension tracking
- BPreshuffleQuant path:
  - BQuantGroupSize::kN > (NWarp * WG::kN) and NPerBlock == BQuantGroupSize::kN: uses a single quant group per block (prefill scenario)
  - nIter for decode or multiple groups per warp scenarios
Technical Details
The key insight is that when BQuantGroupSize::kN == 128 (matching the N block size), each thread block processes exactly one quantization group in the N dimension. This allows the same PermuteN optimization to be applied, as the scale values can be broadcast efficiently within the block.
Checklist
- clang-format on all changed files
Discussion
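The implementation distinguishes between two scenarios: a single quantization group per block (the prefill case), and an nIter-based path for decode or for multiple groups per warp.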
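The branch condition listed for the BPreshuffleQuant path under Key Changes can be sketched as a small selector. This is an illustration only: `ScaleIndexing` and `select_scale_indexing` are hypothetical names, the parameters stand in for BQuantGroupSize::kN, NWarp, WG::kN, and NPerBlock, and the actual kernel may structure the decision differently.

```cpp
// Which scale-handling scheme applies (illustrative names, not from the kernel).
enum class ScaleIndexing
{
    SingleGroupPerBlock, // one quant group covers the whole N block (prefill)
    PerNIter             // scale chosen via nIter (decode, or several groups per warp)
};

// group_n     : stands in for BQuantGroupSize::kN
// n_warp      : stands in for NWarp
// wg_n        : stands in for WG::kN
// n_per_block : stands in for NPerBlock
constexpr ScaleIndexing
select_scale_indexing(int group_n, int n_warp, int wg_n, int n_per_block)
{
    // Condition as quoted in the PR description:
    // BQuantGroupSize::kN > (NWarp * WG::kN) and NPerBlock == BQuantGroupSize::kN
    return (group_n > n_warp * wg_n && n_per_block == group_n)
               ? ScaleIndexing::SingleGroupPerBlock
               : ScaleIndexing::PerNIter;
}
```

For example, with the assumed values NWarp = 4 and WG::kN = 16, a group size of 128 and NPerBlock of 128 selects the single-group scheme, while a group size of 1 falls back to the nIter-based one.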
This approach maintains backward compatibility while enabling performance optimizations for the common 128-block quantization case used in modern quantized models.
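The claim in Technical Details that each thread block sees exactly one quantization group when the group size matches the N block size can be checked with simple index arithmetic. The sketch below assumes groups are contiguous runs of group_n columns and that block b covers columns [b * n_per_block, (b + 1) * n_per_block); both function names are hypothetical.

```cpp
// Quantization group that column n belongs to, assuming groups of
// group_n consecutive columns starting at column 0.
constexpr int quant_group_of_column(int n, int group_n)
{
    return n / group_n;
}

// Number of distinct quantization groups touched by the block that covers
// columns [block_idx * n_per_block, (block_idx + 1) * n_per_block).
constexpr int groups_touched_by_block(int block_idx, int n_per_block, int group_n)
{
    return quant_group_of_column(block_idx * n_per_block + n_per_block - 1, group_n) -
           quant_group_of_column(block_idx * n_per_block, group_n) + 1;
}
```

With n_per_block = 128 and group_n = 128, every block touches a single group, so one scale value per block can be shared by all of its columns; with group_n = 1 each of the 128 columns has its own scale.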