[DRAFT] Implement DPP Reduction in wavefront #1796
base: develop
Conversation
- Implemented `row_share` as a new DPP instruction.
- Added verification logic for `row_share` with permissible range [0, 15].
- Updated test cases to include `row_share` examples and checks.
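A minimal sketch of what that range check could look like; `verifyRowShareArgument` is a hypothetical helper for illustration, while the actual check sits inside `DPPOp::verify()` in AMDGPUDialect.cpp:

```cpp
// row_share selects a source lane within each 16-lane row, so the
// immediate must fit in 4 bits, i.e. the range [0, 15].
static LogicalResult verifyRowShareArgument(Operation *op,
                                            Attribute permArgument) {
  auto intAttr = dyn_cast_or_null<IntegerAttr>(permArgument);
  if (!intAttr)
    return op->emitOpError("row_share expects an integer permutation argument");
  int64_t lane = intAttr.getInt();
  if (lane < 0 || lane > 15)
    return op->emitOpError("row_share argument must be in the range [0, 15]");
  return success();
}
```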
- Introduced `ROCDL_SetInactiveOp` to handle inactive lanes by mapping to the `llvm.amdgcn.set.inactive` intrinsic. This is crucial when fewer lanes are active than the wave size, ensuring inactive-lane values are managed appropriately.
- Added `ROCDL_StrictWWMOp` to implement strict whole wavefront mode using the `llvm.amdgcn.strict.wwm` intrinsic. This operation ensures calculations consider the entire wavefront, which is integral to DPP reduction scenarios.

These additions support the wavefront operations needed during DPP (Data Parallel Primitives) reduction, giving finer control over lane execution and ensuring correct computation regardless of which lanes are active.
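A rough sketch of how the two ops are meant to compose in the reduction; the builder argument order (result type, then operands) and the placeholder names `elemType`, `src`, and `reduced` are assumptions, not the PR's exact code:

```cpp
// Force inactive lanes to the reduction identity so they cannot corrupt
// partial results (llvm.amdgcn.set.inactive keeps `src` in active lanes).
Value identity = rewriter.create<arith::ConstantOp>(
    loc, elemType, rewriter.getFloatAttr(elemType, 0.0)); // identity for add
Value masked =
    rewriter.create<ROCDL::SetInactiveOp>(loc, elemType, src, identity);

Value reduced = masked; // ...the DPP reduction steps would run here...

// llvm.amdgcn.strict.wwm evaluates its operand with every lane of the
// wavefront enabled, so values computed in otherwise-inactive lanes are
// visible in the final result.
Value result = rewriter.create<ROCDL::StrictWWMOp>(loc, elemType, reduced);
```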
- Implemented `ROCDL_PermLaneX16Op` to use the `llvm.amdgcn.permlanex16` intrinsic, addressing limitations in DPP broadcasting on the RDNA architecture.
- This enables value propagation across lanes within a wave on RDNA systems, where certain DPP broadcast modes are unsupported.
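A hedged sketch of how the op might be emitted. The operand order mirrors the `llvm.amdgcn.permlanex16(old, src, sel0, sel1, fi, boundctrl)` intrinsic; the all-ones selects and the `createReducingOp` helper (seen later in this PR's diff) are assumptions of the sketch:

```cpp
// Swap values between the two 16-lane halves of each 32-lane group, then
// fold the swapped copy into the running partial result.
Value minusOne = rewriter.create<arith::ConstantIntOp>(loc, -1, /*width=*/32);
Value swapped = rewriter.create<ROCDL::PermLaneX16Op>(
    loc, elemType, /*old=*/partial, /*src=*/partial,
    /*sel0=*/minusOne, /*sel1=*/minusOne,
    /*fi=*/false, /*boundCtrl=*/false);
Value combined = createReducingOp(op, partial, swapped, rewriter);
```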
- Integrated DPP reduction functionality designed to improve performance on supported architectures by minimizing active-lane usage through set-inactive operations.
- Added scalar extraction from vectors with `vector::ExtractElementOp` for precise per-element manipulation and control over lane values.
- Introduced `ROCDL::SetInactiveOp` to replace values in inactive lanes, which matters when the active lane count is below the wave size.
- Enabled the full reduction through sequential DPP operations: `row_shr` shifts, broadcasts (`row_bcast_15`, `row_bcast_31`), and wave rotations (`wave_ror`).
- Used `ROCDL::StrictWWMOp` so the final computation runs in whole wavefront mode, guaranteeing consistent and correct results across all lanes even when fewer lanes are active.
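The shape of the reduction ladder, sketched from the `row_shifts = {1, 2, 3, 4, 8}` array visible in the diff below; the `amdgpu::DPPOp` builder order (old, src, kind, permArgument, masks, bound_ctrl) is an assumption of this sketch:

```cpp
// Each row_shr step pulls the value from `shift` lanes away within a
// 16-lane row and folds it into the accumulator; after the loop, lane 15
// of each row holds that row's partial reduction.
Value acc = setInactiveScalar;
for (int shift : {1, 2, 3, 4, 8}) {
  Value shifted = rewriter.create<amdgpu::DPPOp>(
      loc, elemType, /*old=*/acc, /*src=*/acc, amdgpu::DPPPerm::row_shr,
      rewriter.getI32IntegerAttr(shift),
      /*row_mask=*/0xF, /*bank_mask=*/0xF, /*bound_ctrl=*/false);
  acc = createReducingOp(op, acc, shifted, rewriter);
}
// row_bcast_15 then propagates lane 15 of each row into the next row, and
// row_bcast_31 propagates lane 31 across the upper half of a 64-lane wave.
```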
Replaced the previous DPP-based broadcasting logic with `ROCDL::ReadlaneOp`, allowing thread 0 to extract the final reduction result and share it with all threads efficiently. This simplifies the implementation and ensures correctness across different architectures.
- Removed the `amdgpu::DPPOp` operations (`row_share`, `row_bcast_15`, `row_bcast_31`) previously used to broadcast the reduced value.
- Directly used `ROCDL::ReadlaneOp` to read the result from thread 0 and distribute it to all threads.
- Ensures correct behavior across different AMD GPU architectures.
- Potentially improves performance by using a single lane-read operation instead of multiple DPP-based broadcasts.
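A minimal sketch of the replacement, assuming `rocdl.readlane` takes the source value followed by a lane index:

```cpp
// After the DPP ladder leaves the full reduction in lane 0, one readlane
// makes the value uniform across the whole wavefront.
Value lane0 = rewriter.create<arith::ConstantIntOp>(loc, 0, /*width=*/32);
Value broadcast =
    rewriter.create<ROCDL::ReadlaneOp>(loc, elemType, reduced, lane0);
```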
Happened to notice this PR and wanted to let y'all know that llvm/llvm-project#133204 is a thing that's under development and might be useful.
```cpp
        loc, reduced, workspaceLDSBuffer,
        reductionLoop.getLowerCoords(/*domain=*/2));

    Value BrodcastAll;
```
I think we don't need to have rocdl and amdgpu dialects here. Can we have some new op rock::wavereduction or something like that? Then, we can lower it to rocdl later.
also, that way it's easier to keep the current implementation if dpp is not supported.
```cpp
            rewriter.create<arith::ConstantIndexOp>(loc, i));
        Value scalarInactiveValue = rewriter.create<arith::ConstantOp>(
            loc, vecType.getElementType(),
            rewriter.getFloatAttr(vecType.getElementType(), 0.0));
```
for max reduction it should be -inf?
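A sketch of the fix this comment suggests: derive the filler from the reduction's identity instead of hard-coding 0.0. `reductionKind` and `ReduceKind` are illustrative placeholder names, not the PR's actual enum (requires `<limits>`):

```cpp
float identity = 0.0f;
switch (reductionKind) {
case ReduceKind::Sum:
  identity = 0.0f; // adding zero leaves the sum unchanged
  break;
case ReduceKind::Max:
  identity = -std::numeric_limits<float>::infinity(); // never the max
  break;
case ReduceKind::Min:
  identity = std::numeric_limits<float>::infinity(); // never the min
  break;
}
Value scalarInactiveValue = rewriter.create<arith::ConstantOp>(
    loc, vecType.getElementType(),
    rewriter.getFloatAttr(vecType.getElementType(), identity));
```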
```cpp
            loc, vecType.getElementType(), scalarVal,
            scalarInactiveValue);

        Value dppResult1 = rewriter.create<amdgpu::DPPOp>(
```
please add comments to understand what the constants are doing here: 0xF, ...
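Something like the following would address this; a sketch of the requested commenting, with the builder argument order assumed:

```cpp
Value dppResult1 = rewriter.create<amdgpu::DPPOp>(
    loc, elemType, /*old=*/acc, /*src=*/acc, amdgpu::DPPPerm::row_shr,
    rewriter.getI32IntegerAttr(1), // shift within each 16-lane row by 1
    /*row_mask=*/0xF,  // one bit per 16-lane row; 0xF enables all four rows
    /*bank_mask=*/0xF, // one bit per 4-lane bank; 0xF enables every bank
    /*bound_ctrl=*/false); // lanes with no valid source keep `old`
```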
```cpp
        Value dppResult = createReducingOp(op, setInactiveScalar,
                                           dppResult1, rewriter);

        Value dppResult2 = rewriter.create<amdgpu::DPPOp>(
```
please add comments to understand what the constants are doing here: 0xF, ...
- Introduced the `rock-wave-reduce-lowering` pass to lower `rock.wave_reduce` ops.
- Implemented `RockWaveReduceRewritePattern` using DPP `row_shr` shifts and broadcasts for intra-wavefront reductions, with a fallback to `permlanex16` for gfx10+.
- Extended `rocmlir-driver` to pass the detected arch to the kernel pipeline so the lowering can apply chipset-specific logic.
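A sketch of the chipset dispatch this describes; `emitRowBcastReduction` and `emitPermLaneX16Reduction` are hypothetical helper names:

```cpp
// gfx9 (CDNA) supports the row_bcast DPP modes; gfx10+ (RDNA) does not and
// falls back to permlanex16.
FailureOr<amdgpu::Chipset> maybeChipset = amdgpu::Chipset::parse(chipset);
if (failed(maybeChipset))
  return rewriter.notifyMatchFailure(op, "invalid chipset");
Value reduced = (maybeChipset->majorVersion >= 10)
                    ? emitPermLaneX16Reduction(rewriter, loc, value)
                    : emitRowBcastReduction(rewriter, loc, value);
```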
Pull Request Overview
This PR introduces a wavefront-level reduction operation (rock.wave_reduce) backed by AMDGPU DPP intrinsics and integrates it into the MLIR GPU pipeline.
- Add a `--chip` driver option and thread the pipeline flag to control the target GPU architecture.
- Define the `rock.wave_reduce` op, implement `RockWaveReduceLoweringPass`, and register it in the kernel pipeline.
- Update blockwise reduction logic to use the new wave-level reduction and extend the DPP/ROCDL dialects with new intrinsics.
Reviewed Changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| mlir/tools/rocmlir-driver/rocmlir-driver.cpp | Pass the new --chip option into the kernel pipeline |
| mlir/test/rocmlir-driver/pipelines.mlir | Add rock-wave-reduce-lowering pass invocation with chipset parameter |
| mlir/test/Dialect/Rock/lowering_blockwise_broadcast_reduce.mlir | Update CHECKs to expect rock.wave_reduce |
| mlir/lib/Dialect/Rock/Transforms/RockWaveReduce.cpp | Implement the lowering pass for rock.wave_reduce and DPP-based reduction |
| mlir/lib/Dialect/Rock/Transforms/CMakeLists.txt | Register the new RockWaveReduce.cpp in the build |
| mlir/lib/Dialect/Rock/Transforms/BlockwiseGemmToThreadwise.cpp | Use rock.wave_reduce in blockwise reductions |
| mlir/lib/Dialect/Rock/Pipelines/Pipelines.cpp | Insert the RockWaveReduceLoweringPass into the kernel pipeline |
| mlir/include/mlir/Dialect/Rock/Pipelines/Pipelines.h | Add chip option to KernelOptions |
| mlir/include/mlir/Dialect/Rock/Passes.td | Declare the rock-wave-reduce-lowering pass and its chipset option |
| mlir/include/mlir/Dialect/Rock/Passes.h | Enable GEN_PASS_DECL_ROCKWAVEREDUCELOWERINGPASS |
| mlir/include/mlir/Dialect/Rock/IR/RockOps.td | Define the new Rock_WaveReductionOp |
| external/llvm-project/mlir/test/Conversion/AMDGPUToROCDL/dpp.mlir | Add a test for the new row_share DPP permutation |
| external/llvm-project/mlir/lib/Dialect/AMDGPU/IR/AMDGPUDialect.cpp | Extend DPPOp::verify() for the row_share case |
| external/llvm-project/mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp | Lower row_share DPPPerm in the AMDGPU-to-ROCDL conversion |
| external/llvm-project/mlir/include/mlir/Dialect/LLVMIR/ROCDLOps.td | Add new ROCDL intrinsics (set.inactive, strict.wwm, permlanex16) |
| external/llvm-project/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPU.td | Add row_share to the DPPPerm enum |
Comments suppressed due to low confidence (2)
mlir/lib/Dialect/Rock/Transforms/RockWaveReduce.cpp:106
- The `defaultVal` passed to `SetInactiveOp` is a vector but `scalarVal` is an element type; `inactive_value` should match the `src` type. Extract the corresponding element from `defaultVal` (e.g., via `vector::ExtractElementOp`) to supply a scalar inactive value.
```cpp
Value setInactiveScalar = rewriter.create<ROCDL::SetInactiveOp>(
    loc, vecType.getElementType(), scalarVal, defaultVal);
```
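A sketch of the suggested repair, extracting the matching element so the inactive value is a scalar of the `src` type:

```cpp
Value idx = rewriter.create<arith::ConstantIndexOp>(loc, i);
Value scalarDefault =
    rewriter.create<vector::ExtractElementOp>(loc, defaultVal, idx);
Value setInactiveScalar = rewriter.create<ROCDL::SetInactiveOp>(
    loc, vecType.getElementType(), scalarVal, scalarDefault);
```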
mlir/tools/rocmlir-driver/rocmlir-driver.cpp:176
- [nitpick] The driver option is named `chip` but the pass option uses `--chipset`. For consistency, consider renaming `chip` to `chipset` in `KernelOptions`.
```cpp
opts.chip = devName.getChip().str();
```
```cpp
            loc, vecType.getElementType(), scalarVal, defaultVal);
        std::array<int, 5> row_shifts = {1, 2, 3, 4, 8};
        Value dppResult = setInactiveScalar;
        Value BrodcastAll;
```
Copilot AI (Jun 11, 2025):
[nitpick] Variable name BrodcastAll is misspelled; consider renaming to BroadcastAll for clarity.
```diff
- Value BrodcastAll;
+ Value BroadcastAll;
```
This pull request implements wavefront reduction using DPP (Data Parallel Primitives) instructions. Currently, the implementation covers the case where there aren't sufficient threads for a parallel reduction, allowing each individual thread to perform its own reduction.
This is the link to the ticket: https://github.com/ROCm/rocMLIR-internal/issues/1524
Current Architecture Support
The code currently supports the MI200 and MI300 architectures. For RDNA and MI100, the conditions still need to be adjusted per architecture to select the appropriate intrinsic instructions that those architectures support.
Considered Approaches
Kernel Pipeline Approach:
Began by implementing a chipset option in the `-rock-blockwise-gemm-to-threadwise` pass, which would be available for determining architecture-specific conditions. This involves adding the option to the kernel pipeline and modifying our `rocmlir-driver -c` tool to pass the architecture information so the MLIR passes generate the correct instructions. Feedback is needed to assess whether adding this option to the kernel pipeline pass is beneficial.
ROCDL Backend Approach:
Since the chipset is already integrated into the ROCDL dialect backend pipeline, where instructions are lowered, another approach would be to add custom intrinsics in AMDGPU.td. These can be used to create conditions within the ROCDL lowering that emit the required instructions based on the chipset in use.
Generic Solution with Readlane:
A third potential solution is a more generalized approach: use `readlane` and modify the logic to rely as much as possible on instructions common across architectures. This would help in achieving a correct solution that performs well everywhere.