[DRAFT] Implement DPP Reduction in wavefront #1796
base: develop
Conversation
- Implemented `row_share` as a new DPP instruction.
- Added verification logic for `row_share` with permissible range [0, 15].
- Updated test cases to include `row_share` examples and checks.
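A minimal sketch of what that range check could look like; `verifyRowShareArgument` is a hypothetical helper for illustration, while the actual check sits inside `DPPOp::verify()` in AMDGPUDialect.cpp:

```cpp
// row_share selects a source lane within each 16-lane row, so the
// immediate must fit in 4 bits, i.e. the range [0, 15].
static LogicalResult verifyRowShareArgument(Operation *op,
                                            Attribute permArgument) {
  auto intAttr = dyn_cast_or_null<IntegerAttr>(permArgument);
  if (!intAttr)
    return op->emitOpError("row_share expects an integer permutation argument");
  int64_t lane = intAttr.getInt();
  if (lane < 0 || lane > 15)
    return op->emitOpError("row_share argument must be in the range [0, 15]");
  return success();
}
```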
- Introduced `ROCDL_SetInactiveOp` to handle inactive lanes by mapping to the `llvm.amdgcn.set.inactive` intrinsic. This is crucial when fewer lanes are active than the wave size, ensuring inactive-lane values are managed appropriately.
- Added `ROCDL_StrictWWMOp` to implement strict whole wavefront mode using the `llvm.amdgcn.strict.wwm` intrinsic. This operation ensures calculations consider the entire wavefront, which is integral to DPP reduction scenarios.

These additions support the wavefront operations needed during DPP (Data Parallel Primitives) reduction, giving finer control over lane execution and ensuring correct computation regardless of which lanes are active.
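A rough sketch of how the two ops are meant to compose in the reduction; the builder argument order (result type, then operands) and the placeholder names `elemType`, `src`, and `reduced` are assumptions, not the PR's exact code:

```cpp
// Force inactive lanes to the reduction identity so they cannot corrupt
// partial results (llvm.amdgcn.set.inactive keeps `src` in active lanes).
Value identity = rewriter.create<arith::ConstantOp>(
    loc, elemType, rewriter.getFloatAttr(elemType, 0.0)); // identity for add
Value masked =
    rewriter.create<ROCDL::SetInactiveOp>(loc, elemType, src, identity);

Value reduced = masked; // ...the DPP reduction steps would run here...

// llvm.amdgcn.strict.wwm evaluates its operand with every lane of the
// wavefront enabled, so values computed in otherwise-inactive lanes are
// visible in the final result.
Value result = rewriter.create<ROCDL::StrictWWMOp>(loc, elemType, reduced);
```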
- Implemented `ROCDL_PermLaneX16Op` to use the `llvm.amdgcn.permlanex16` intrinsic, addressing limitations in DPP broadcasting on the RDNA architecture.
- This enables value propagation across lanes within a wave on RDNA systems, where certain DPP broadcast modes are unsupported.
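A hedged sketch of how the op might be emitted. The operand order mirrors the `llvm.amdgcn.permlanex16(old, src, sel0, sel1, fi, boundctrl)` intrinsic; the all-ones selects and the `createReducingOp` helper (seen later in this PR's diff) are assumptions of the sketch:

```cpp
// Swap values between the two 16-lane halves of each 32-lane group, then
// fold the swapped copy into the running partial result.
Value minusOne = rewriter.create<arith::ConstantIntOp>(loc, -1, /*width=*/32);
Value swapped = rewriter.create<ROCDL::PermLaneX16Op>(
    loc, elemType, /*old=*/partial, /*src=*/partial,
    /*sel0=*/minusOne, /*sel1=*/minusOne,
    /*fi=*/false, /*boundCtrl=*/false);
Value combined = createReducingOp(op, partial, swapped, rewriter);
```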
- Integrated DPP reduction functionality designed to improve performance on supported architectures by minimizing active-lane usage through set-inactive operations.
- Added scalar extraction from vectors with `vector::ExtractElementOp` for precise per-element manipulation and control over lane values.
- Introduced `ROCDL::SetInactiveOp` to replace values in inactive lanes, which matters when the active lane count is below the wave size.
- Enabled the full reduction through sequential DPP operations: `row_shr` shifts, broadcasts (`row_bcast_15`, `row_bcast_31`), and wave rotations (`wave_ror`).
- Used `ROCDL::StrictWWMOp` so the final computation runs in whole wavefront mode, guaranteeing consistent and correct results across all lanes even when fewer lanes are active.
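The shape of the reduction ladder, sketched from the `row_shifts = {1, 2, 3, 4, 8}` array visible in the diff below; the `amdgpu::DPPOp` builder order (old, src, kind, permArgument, masks, bound_ctrl) is an assumption of this sketch:

```cpp
// Each row_shr step pulls the value from `shift` lanes away within a
// 16-lane row and folds it into the accumulator; after the loop, lane 15
// of each row holds that row's partial reduction.
Value acc = setInactiveScalar;
for (int shift : {1, 2, 3, 4, 8}) {
  Value shifted = rewriter.create<amdgpu::DPPOp>(
      loc, elemType, /*old=*/acc, /*src=*/acc, amdgpu::DPPPerm::row_shr,
      rewriter.getI32IntegerAttr(shift),
      /*row_mask=*/0xF, /*bank_mask=*/0xF, /*bound_ctrl=*/false);
  acc = createReducingOp(op, acc, shifted, rewriter);
}
// row_bcast_15 then propagates lane 15 of each row into the next row, and
// row_bcast_31 propagates lane 31 across the upper half of a 64-lane wave.
```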
Replaced the previous DPP-based broadcasting logic with `ROCDL::ReadlaneOp`, allowing thread 0 to extract the final reduction result and share it with all threads efficiently. This simplifies the implementation and ensures correctness across different architectures.
- Removed the `amdgpu::DPPOp` operations (`row_share`, `row_bcast_15`, `row_bcast_31`) previously used to broadcast the reduced value.
- Directly used `ROCDL::ReadlaneOp` to read the result from thread 0 and distribute it to all threads.
- Ensures correct behavior across different AMD GPU architectures.
- Potentially improves performance by using a single lane-read operation instead of multiple DPP-based broadcasts.
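A minimal sketch of the replacement, assuming `rocdl.readlane` takes the source value followed by a lane index:

```cpp
// After the DPP ladder leaves the full reduction in lane 0, one readlane
// makes the value uniform across the whole wavefront.
Value lane0 = rewriter.create<arith::ConstantIntOp>(loc, 0, /*width=*/32);
Value broadcast =
    rewriter.create<ROCDL::ReadlaneOp>(loc, elemType, reduced, lane0);
```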
Happened to notice this PR and wanted to let y'all know that llvm/llvm-project#133204 is a thing that's under development and might be useful.
```cpp
        loc, reduced, workspaceLDSBuffer,
        reductionLoop.getLowerCoords(/*domain=*/2));

    Value BrodcastAll;
```
I think we don't need to have rocdl and amdgpu dialects here. Can we have some new op rock::wavereduction or something like that? Then, we can lower it to rocdl later.
also, that way it's easier to keep the current implementation if dpp is not supported.
```cpp
            rewriter.create<arith::ConstantIndexOp>(loc, i));
        Value scalarInactiveValue = rewriter.create<arith::ConstantOp>(
            loc, vecType.getElementType(),
            rewriter.getFloatAttr(vecType.getElementType(), 0.0));
```
for max reduction it should be -inf?
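A sketch of the fix this comment suggests: derive the filler from the reduction's identity instead of hard-coding 0.0. `reductionKind` and `ReduceKind` are illustrative placeholder names, not the PR's actual enum (requires `<limits>`):

```cpp
float identity = 0.0f;
switch (reductionKind) {
case ReduceKind::Sum:
  identity = 0.0f; // adding zero leaves the sum unchanged
  break;
case ReduceKind::Max:
  identity = -std::numeric_limits<float>::infinity(); // never the max
  break;
case ReduceKind::Min:
  identity = std::numeric_limits<float>::infinity(); // never the min
  break;
}
Value scalarInactiveValue = rewriter.create<arith::ConstantOp>(
    loc, vecType.getElementType(),
    rewriter.getFloatAttr(vecType.getElementType(), identity));
```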
```cpp
            loc, vecType.getElementType(), scalarVal,
            scalarInactiveValue);

        Value dppResult1 = rewriter.create<amdgpu::DPPOp>(
```
please add comments to understand what the constants are doing here: 0xF, ...
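Something like the following would address this; a sketch of the requested commenting, with the builder argument order assumed:

```cpp
Value dppResult1 = rewriter.create<amdgpu::DPPOp>(
    loc, elemType, /*old=*/acc, /*src=*/acc, amdgpu::DPPPerm::row_shr,
    rewriter.getI32IntegerAttr(1), // shift within each 16-lane row by 1
    /*row_mask=*/0xF,  // one bit per 16-lane row; 0xF enables all four rows
    /*bank_mask=*/0xF, // one bit per 4-lane bank; 0xF enables every bank
    /*bound_ctrl=*/false); // lanes with no valid source keep `old`
```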
```cpp
        Value dppResult = createReducingOp(op, setInactiveScalar,
                                           dppResult1, rewriter);

        Value dppResult2 = rewriter.create<amdgpu::DPPOp>(
```
please add comments to understand what the constants are doing here: 0xF, ...
- Introduced the `rock-wave-reduce-lowering` pass to lower `rock.wave_reduce` ops.
- Implemented `RockWaveReduceRewritePattern` using DPP `row_shr` shifts and broadcasts for intra-wavefront reductions, with a fallback to `permlanex16` for gfx10+.
- Extended `rocmlir-driver` to pass the detected arch to the kernel pipeline so the lowering can apply chipset-specific logic.
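A sketch of the chipset dispatch this describes; `emitRowBcastReduction` and `emitPermLaneX16Reduction` are hypothetical helper names:

```cpp
// gfx9 (CDNA) supports the row_bcast DPP modes; gfx10+ (RDNA) does not and
// falls back to permlanex16.
FailureOr<amdgpu::Chipset> maybeChipset = amdgpu::Chipset::parse(chipset);
if (failed(maybeChipset))
  return rewriter.notifyMatchFailure(op, "invalid chipset");
Value reduced = (maybeChipset->majorVersion >= 10)
                    ? emitPermLaneX16Reduction(rewriter, loc, value)
                    : emitRowBcastReduction(rewriter, loc, value);
```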
Pull Request Overview
This PR introduces a wavefront-level reduction operation (rock.wave_reduce) backed by AMDGPU DPP intrinsics and integrates it into the MLIR GPU pipeline.
- Add a `--chip` driver option and thread the pipeline flag to control the target GPU architecture.
- Define the `rock.wave_reduce` op, implement `RockWaveReduceLoweringPass`, and register it in the kernel pipeline.
- Update blockwise reduction logic to use the new wave-level reduction and extend the DPP/ROCDL dialects with new intrinsics.
Reviewed Changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| mlir/tools/rocmlir-driver/rocmlir-driver.cpp | Pass the new --chip option into the kernel pipeline |
| mlir/test/rocmlir-driver/pipelines.mlir | Add rock-wave-reduce-lowering pass invocation with chipset parameter |
| mlir/test/Dialect/Rock/lowering_blockwise_broadcast_reduce.mlir | Update CHECKs to expect rock.wave_reduce |
| mlir/lib/Dialect/Rock/Transforms/RockWaveReduce.cpp | Implement the lowering pass for rock.wave_reduce and DPP-based reduction |
| mlir/lib/Dialect/Rock/Transforms/CMakeLists.txt | Register the new RockWaveReduce.cpp in the build |
| mlir/lib/Dialect/Rock/Transforms/BlockwiseGemmToThreadwise.cpp | Use rock.wave_reduce in blockwise reductions |
| mlir/lib/Dialect/Rock/Pipelines/Pipelines.cpp | Insert the RockWaveReduceLoweringPass into the kernel pipeline |
| mlir/include/mlir/Dialect/Rock/Pipelines/Pipelines.h | Add chip option to KernelOptions |
| mlir/include/mlir/Dialect/Rock/Passes.td | Declare the rock-wave-reduce-lowering pass and its chipset option |
| mlir/include/mlir/Dialect/Rock/Passes.h | Enable GEN_PASS_DECL_ROCKWAVEREDUCELOWERINGPASS |
| mlir/include/mlir/Dialect/Rock/IR/RockOps.td | Define the new Rock_WaveReductionOp |
| external/llvm-project/mlir/test/Conversion/AMDGPUToROCDL/dpp.mlir | Add a test for the new row_share DPP permutation |
| external/llvm-project/mlir/lib/Dialect/AMDGPU/IR/AMDGPUDialect.cpp | Extend DPPOp::verify() for the row_share case |
| external/llvm-project/mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp | Lower row_share DPPPerm in the AMDGPU-to-ROCDL conversion |
| external/llvm-project/mlir/include/mlir/Dialect/LLVMIR/ROCDLOps.td | Add new ROCDL intrinsics (set.inactive, strict.wwm, permlanex16) |
| external/llvm-project/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPU.td | Add row_share to the DPPPerm enum |
Comments suppressed due to low confidence (2)
mlir/lib/Dialect/Rock/Transforms/RockWaveReduce.cpp:106
- The `defaultVal` passed to `SetInactiveOp` is a vector but `scalarVal` is an element type; `inactive_value` should match the `src` type. Extract the corresponding element from `defaultVal` (e.g., via `vector::ExtractElementOp`) to supply a scalar inactive value.
```cpp
Value setInactiveScalar = rewriter.create<ROCDL::SetInactiveOp>(
    loc, vecType.getElementType(), scalarVal, defaultVal);
```
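A sketch of the suggested repair, extracting the matching element so the inactive value is a scalar of the `src` type:

```cpp
Value idx = rewriter.create<arith::ConstantIndexOp>(loc, i);
Value scalarDefault =
    rewriter.create<vector::ExtractElementOp>(loc, defaultVal, idx);
Value setInactiveScalar = rewriter.create<ROCDL::SetInactiveOp>(
    loc, vecType.getElementType(), scalarVal, scalarDefault);
```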
mlir/tools/rocmlir-driver/rocmlir-driver.cpp:176
- [nitpick] The driver option is named `chip` but the pass option uses `--chipset`. For consistency, consider renaming `chip` to `chipset` in `KernelOptions`.
```cpp
opts.chip = devName.getChip().str();
```
```cpp
            loc, vecType.getElementType(), scalarVal, defaultVal);
        std::array<int, 5> row_shifts = {1, 2, 3, 4, 8};
        Value dppResult = setInactiveScalar;
        Value BrodcastAll;
```
Copilot AI (Jun 11, 2025):
[nitpick] Variable name BrodcastAll is misspelled; consider renaming to BroadcastAll for clarity.
```diff
- Value BrodcastAll;
+ Value BroadcastAll;
```
This pull request implements wavefront reduction using DPP (Data Parallel Primitives) instructions. Currently, the implementation covers the case where there aren't sufficient threads for a parallel reduction, allowing each individual thread to perform its own reduction.
This is the link to the ticket: https://github.com/ROCm/rocMLIR-internal/issues/1524
Current Architecture Support
The code currently supports the MI200 and MI300 architectures. For RDNA and MI100, the conditions still need to be adjusted per architecture to select the appropriate intrinsic instructions that those architectures support.
Considered Approaches
Kernel Pipeline Approach:
Began by implementing a chipset option in the `-rock-blockwise-gemm-to-threadwise` pass, which would be available for determining architecture-specific conditions. This involves adding the option to the kernel pipeline and modifying our `rocmlir-driver -c` tool to pass the architecture information so the MLIR passes generate the correct instructions. Feedback is needed to assess whether adding this option to the kernel pipeline pass is beneficial.
ROCDL Backend Approach:
Since the chipset is already integrated into the ROCDL dialect backend pipeline, where instructions are lowered, another approach would be to add custom intrinsics in AMDGPU.td. These can be used to create conditions within the ROCDL lowering that emit the required instructions based on the chipset in use.
Generic Solution with Readlane:
A third potential solution is a more generalized approach: use `readlane` and modify the logic to rely as much as possible on instructions common across architectures. This would help in achieving a correct solution that performs well everywhere.