
Conversation

@stefankoncarevic (Contributor) commented Mar 31, 2025

This pull request implements wavefront reduction using DPP (Data Parallel Primitives) instructions. Currently, the implementation covers the case where there aren't enough threads for a parallel reduction, allowing each individual thread to perform its own reduction.

Link to the ticket: https://github.com/ROCm/rocMLIR-internal/issues/1524

Current Architecture Support
The code currently supports the MI200 and MI300 architectures. For RDNA and MI100, the conditions need to be adjusted per architecture so that the intrinsic instructions supported by those specific architectures are selected.

Considered Approaches
Kernel Pipeline Approach:
Began by implementing a chipset option for the -rock-blockwise-gemm-to-threadwise pass. This option would be used to determine the conditions under which to emit the desired functionality. It involves adding the option to the kernel pipeline and modifying our rocmlir-driver -c tool to pass the architecture information so that the MLIR passes generate the correct instructions. Feedback is needed on whether adding this option to the kernel pipeline pass is worthwhile.

ROCDL Backend Approach:
Since the chipset is already integrated into the ROCDL dialect backend pipeline, and instructions are lowered in that pass, another approach is to add custom intrinsics in AMDGPU.td. These can be used to create conditions within the ROCDL lowering that emit the required instructions based on the chipset being used.

Generic Solution with Readlane:
A third potential solution is a more generalized approach using readlane, modifying the logic to rely on as many instructions as possible that are common across architectures. This would help in achieving a correct and optimal solution.

- Implemented 'row_share' as a new DPP instruction.
- Added verification logic for 'row_share' with permissible
  range [0-15].
- Updated test cases to include 'row_share' examples and checks.
- Introduced ROCDL_SetInactiveOp to handle inactive lanes by
  mapping to the `llvm.amdgcn.set.inactive` intrinsic. This is
  crucial when the number of active lanes is smaller than the
  wave size, ensuring inactive lanes carry a well-defined value.

- Added ROCDL_StrictWWMOp to implement strict whole wavefront mode
  using the `llvm.amdgcn.strict.wwm` intrinsic. This operation
  ensures calculations consider the entire wavefront, integral
  to DPP reduction scenarios.

These additions support the complex wavefront operations needed
during DPP (Data Parallel Primitives) reduction, enhancing control
over lane execution and ensuring correct computation when lane
activity varies; a conceptual model of the set.inactive/WWM
interplay follows below.
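
For intuition, here is a small, self-contained C++ model (not the MLIR lowering itself) of why `set.inactive` matters before a whole-wavefront (WWM) computation. The lane values, active-lane count, and the chosen inactive value are invented for illustration.

```cpp
// Conceptual model of set.inactive followed by a whole-wave (WWM) reduction.
#include <array>
#include <cstdio>

constexpr int kWaveSize = 64;

int main() {
  std::array<float, kWaveSize> vgpr{}; // one value per lane
  std::array<bool, kWaveSize> exec{};  // active-lane mask

  // Pretend only the first 10 lanes are active in this wavefront.
  for (int lane = 0; lane < kWaveSize; ++lane) {
    exec[lane] = lane < 10;
    vgpr[lane] = static_cast<float>(lane + 1); // inactive lanes hold garbage
  }

  // llvm.amdgcn.set.inactive: active lanes keep their value, inactive lanes
  // are overwritten with a caller-chosen value, here the sum's identity 0.0.
  std::array<float, kWaveSize> wwmInput = vgpr;
  for (int lane = 0; lane < kWaveSize; ++lane)
    if (!exec[lane])
      wwmInput[lane] = 0.0f;

  // llvm.amdgcn.strict.wwm: in whole wavefront mode the reduction reads all
  // 64 lanes, but the identity in the inactive lanes leaves the result intact.
  float wwmSum = 0.0f;
  for (int lane = 0; lane < kWaveSize; ++lane)
    wwmSum += wwmInput[lane];

  float activeOnlySum = 0.0f;
  for (int lane = 0; lane < kWaveSize; ++lane)
    if (exec[lane])
      activeOnlySum += vgpr[lane];

  std::printf("wwm sum = %f, active-only sum = %f\n", wwmSum, activeOnlySum);
}
```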
- Implemented ROCDL_PermLaneX16Op to utilize
  `llvm.amdgcn.permlanex16` intrinsic, addressing limitations in
  DPP broadcasting on RDNA architecture.

- This enhancement facilitates value propagation across lanes
  within a wave, providing flexibility where certain DPP
  instructions are unsupported for broadcasting on RDNA systems.
- Integrated DPP reduction functionality designed to improve
  performance on supported architectures by minimizing active
  lane usage through set inactive operations.
- Added operations to manage scalar extraction from vectors with
  `vector::ExtractElementOp` for precise per-element manipulation
  and enhanced control over lane values.
- Introduced `ROCDL::SetInactiveOp` to replace values in inactive
  lanes, promoting efficiency when active lane count is reduced
  in DPP scenarios.
- Enabled the full reduction through sequential DPP operations,
  including `row_shr` shifts, row broadcasts (`row_bcast_15`,
  `row_bcast_31`), and wave rotations (`wave_ror`) for wavefront
  manipulation (a simplified model follows this list).
- Utilized `ROCDL::StrictWWMOp` to ensure computations occur in
  Whole Wavefront Mode, particularly crucial in scenarios where
  the number of active lanes is fewer than the wave size. This
  guarantees consistent and correct results across all lanes
  within the wavefront, meeting the necessary computational
  requirements for comprehensive DPP reductions.
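
The sequential DPP steps above can be pictured with the following simplified, self-contained model. It is a sketch of the shift-and-combine idea only: the real lowering uses the shift sequence {1, 2, 3, 4, 8} together with row/bank masks and then row broadcasts (or `permlanex16` on RDNA), whereas this model uses plain power-of-two shifts within a 16-lane row and a sum reduction.

```cpp
// Simplified model of the row_shr shift-and-combine reduction: each step
// shifts every lane's partial result down the 16-lane row and combines it,
// so the last lane of each row ends up holding the row's total.
#include <array>
#include <cstdio>

constexpr int kWaveSize = 64;
constexpr int kRowSize = 16;

int main() {
  std::array<int, kWaveSize> acc{};
  for (int lane = 0; lane < kWaveSize; ++lane)
    acc[lane] = 1; // every lane contributes 1, so a full row sums to 16

  const int shifts[] = {1, 2, 4, 8}; // power-of-two stand-in for {1,2,3,4,8}
  for (int shift : shifts) {
    std::array<int, kWaveSize> next = acc;
    for (int lane = 0; lane < kWaveSize; ++lane) {
      int src = lane - shift;
      // row_shr only moves data within a 16-lane row; an out-of-row source
      // contributes the reduction identity (0 for a sum), as with bound_ctrl.
      int incoming =
          (src >= 0 && src / kRowSize == lane / kRowSize) ? acc[src] : 0;
      next[lane] = acc[lane] + incoming; // createReducingOp(acc, shifted)
    }
    acc = next;
  }

  // Lane 15 of row 0 now holds that row's reduction; the real lowering then
  // combines rows with row_bcast_15 / row_bcast_31 (or permlanex16 on RDNA).
  std::printf("row 0 total at lane 15: %d\n", acc[15]);
}
```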
Replaced the previous DPP-based broadcasting logic with
ROCDL::ReadlaneOp, allowing thread 0 to extract the final
reduction result and share it with all threads efficiently.
This simplifies the implementation and ensures correctness
across different architectures.

- Removed multiple amdgpu::DPPOp operations (row_share,
  row_bcast_15, row_bcast_31) that were used for broadcasting
  the reduced value.
- Directly used ROCDL::ReadlaneOp to read the result from
  thread 0 and distribute it to all threads (sketched below).
- Ensures correct behavior across different AMD GPU architectures.
- Potentially improves performance by using a single lane-read
  operation instead of multiple DPP-based broadcasts.
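
As a conceptual model of the ReadlaneOp-based broadcast (again a sketch with made-up lane values, not the actual lowering):

```cpp
// Model of the readlane broadcast: one lane's value is copied into what is,
// on the GPU, a scalar register, so every thread in the wave then sees it.
#include <array>
#include <cstdio>

constexpr int kWaveSize = 64;

// readlane(v, srcLane): return the value held by one specific lane.
float readlane(const std::array<float, kWaveSize> &v, int srcLane) {
  return v[srcLane];
}

int main() {
  std::array<float, kWaveSize> reduced{};
  reduced[0] = 42.0f; // pretend lane 0 finished the reduction with 42

  float broadcast = readlane(reduced, 0); // uniform across the wave
  std::array<float, kWaveSize> result{};
  result.fill(broadcast); // every lane now observes the same value

  std::printf("lane 37 sees %f\n", result[37]);
}
```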
@krzysz00 (Collaborator) commented Apr 1, 2025

Happened to notice this PR and wanted to let y'all know that llvm/llvm-project#133204 is a thing that's under development and might be useful.

loc, reduced, workspaceLDSBuffer,
reductionLoop.getLowerCoords(/*domain=*/2));

Value BrodcastAll;
A contributor commented:
I think we don't need to have rocdl and amdgpu dialects here. Can we have some new op rock::wavereduction or something like that? Then, we can lower it to rocdl later.

@dhernandez0 (Contributor) commented Apr 2, 2025:
Also, that way it's easier to keep the current implementation if DPP is not supported.

rewriter.create<arith::ConstantIndexOp>(loc, i));
Value scalarInactiveValue = rewriter.create<arith::ConstantOp>(
loc, vecType.getElementType(),
rewriter.getFloatAttr(vecType.getElementType(), 0.0));
A contributor commented:

For a max reduction, shouldn't it be -inf?
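
A minimal sketch of the fix this comment points at: the inactive-lane value should be the reduction's identity rather than a hard-coded 0.0. The `ReduceKind` enum below is a stand-in for whatever reduction-kind attribute the op carries, not the dialect's actual type.

```cpp
#include <limits>

// Stand-in for the op's reduction-kind attribute (hypothetical name).
enum class ReduceKind { Sum, Max, Min };

// Identity element for an f32 reduction: 0 for sum, -inf for max, +inf for min.
float reductionIdentity(ReduceKind kind) {
  switch (kind) {
  case ReduceKind::Sum:
    return 0.0f;
  case ReduceKind::Max:
    return -std::numeric_limits<float>::infinity();
  case ReduceKind::Min:
    return std::numeric_limits<float>::infinity();
  }
  return 0.0f; // unreachable, keeps compilers happy
}
```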

loc, vecType.getElementType(), scalarVal,
scalarInactiveValue);

Value dppResult1 = rewriter.create<amdgpu::DPPOp>(
A contributor commented:

Please add comments explaining what the constants (0xF, ...) are doing here.
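
For readers of this thread, a small self-contained model of what the DPP mask constants select (the helper is illustrative, not part of the patch): a 64-wide wave is divided into four rows of 16 lanes, and each row into four banks of 4 lanes; `row_mask` and `bank_mask` gate which of those groups the DPP op is allowed to write, so 0xF/0xF enables every lane.

```cpp
#include <cstdio>

// A 64-lane wave is split into 4 rows of 16 lanes; each row into 4 banks of 4.
bool dppLaneWritten(int lane, unsigned rowMask, unsigned bankMask) {
  int row = lane / 16;        // which 16-lane row this lane belongs to
  int bank = (lane % 16) / 4; // which 4-lane bank within the row
  return ((rowMask >> row) & 1u) && ((bankMask >> bank) & 1u);
}

int main() {
  // With row_mask = 0xF and bank_mask = 0xF every lane may be written.
  std::printf("lane 37 written: %d\n", dppLaneWritten(37, 0xF, 0xF));
  // With row_mask = 0xC only rows 2 and 3 (lanes 32-63) may be written.
  std::printf("lane 5 written:  %d\n", dppLaneWritten(5, 0xC, 0xF));
}
```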

Value dppResult = createReducingOp(op, setInactiveScalar,
dppResult1, rewriter);

Value dppResult2 = rewriter.create<amdgpu::DPPOp>(
A contributor commented:

Please add comments explaining what the constants (0xF, ...) are doing here.

- Introduced `rock-wave-reduce-lowering` pass to
  lower `rock.wave_reduce` ops.
- Implemented `RockWaveReduceRewritePattern` using DPP
  row_shr shifts and broadcasts for intra-wavefront reductions,
  with fallback to `permlanex16` for gfx10+.
- Extended `rocmlir-driver` to pass the detected arch to the kernel
  pipeline so the lowering can apply chipset-specific logic
  (sketched below).
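
A minimal sketch of the chipset-specific choice mentioned in the last bullet, assuming gfx-style target names; the helper name and the exact check are illustrative, not the pass's actual API.

```cpp
#include <string>

// gfx9xx (GCN/CDNA) targets keep the DPP row_bcast_15/row_bcast_31 path;
// gfx10+ (RDNA) targets lack those broadcasts and take the permlanex16
// fallback instead.
bool useRowBroadcastPath(const std::string &chipset) {
  // chipset is the arch string the driver detects, e.g. "gfx90a" or "gfx1100".
  return chipset.rfind("gfx9", 0) == 0;
}
```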
@umangyadav requested a review from Copilot on June 11, 2025 at 23:03
Copilot AI left a comment:

Pull Request Overview

This PR introduces a wavefront-level reduction operation (rock.wave_reduce) backed by AMDGPU DPP intrinsics and integrates it into the MLIR GPU pipeline.

  • Add a --chip driver option and thread it through the pipeline to control the target GPU architecture.
  • Define the rock.wave_reduce op, implement RockWaveReduceLoweringPass, and register it in the kernel pipeline.
  • Update blockwise reduction logic to use the new wave-level reduction and extend DPP/ROCDL dialects for new intrinsics.

Reviewed Changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| mlir/tools/rocmlir-driver/rocmlir-driver.cpp | Pass the new --chip option into the kernel pipeline |
| mlir/test/rocmlir-driver/pipelines.mlir | Add the rock-wave-reduce-lowering pass invocation with its chipset parameter |
| mlir/test/Dialect/Rock/lowering_blockwise_broadcast_reduce.mlir | Update CHECKs to expect rock.wave_reduce |
| mlir/lib/Dialect/Rock/Transforms/RockWaveReduce.cpp | Implement the lowering pass for rock.wave_reduce and the DPP-based reduction |
| mlir/lib/Dialect/Rock/Transforms/CMakeLists.txt | Register the new RockWaveReduce.cpp in the build |
| mlir/lib/Dialect/Rock/Transforms/BlockwiseGemmToThreadwise.cpp | Use rock.wave_reduce in blockwise reductions |
| mlir/lib/Dialect/Rock/Pipelines/Pipelines.cpp | Insert the RockWaveReduceLoweringPass into the kernel pipeline |
| mlir/include/mlir/Dialect/Rock/Pipelines/Pipelines.h | Add the chip option to KernelOptions |
| mlir/include/mlir/Dialect/Rock/Passes.td | Declare the rock-wave-reduce-lowering pass and its chipset option |
| mlir/include/mlir/Dialect/Rock/Passes.h | Enable GEN_PASS_DECL_ROCKWAVEREDUCELOWERINGPASS |
| mlir/include/mlir/Dialect/Rock/IR/RockOps.td | Define the new Rock_WaveReductionOp |
| external/llvm-project/mlir/test/Conversion/AMDGPUToROCDL/dpp.mlir | Add a test for the new row_share DPP permutation |
| external/llvm-project/mlir/lib/Dialect/AMDGPU/IR/AMDGPUDialect.cpp | Extend DPPOp::verify() for the row_share case |
| external/llvm-project/mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp | Lower the row_share DPPPerm in the AMDGPU-to-ROCDL conversion |
| external/llvm-project/mlir/include/mlir/Dialect/LLVMIR/ROCDLOps.td | Add new ROCDL intrinsics (set.inactive, strict.wwm, permlanex16) |
| external/llvm-project/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPU.td | Add row_share to the DPPPerm enum |
Comments suppressed due to low confidence (2)

mlir/lib/Dialect/Rock/Transforms/RockWaveReduce.cpp:106

  • The defaultVal passed to SetInactiveOp is a vector but scalarVal is an elementType; inactive_value should match src type. Extract the corresponding element from defaultVal (e.g., via vector::ExtractElementOp) to supply a scalar inactive value.
Value setInactiveScalar = rewriter.create<ROCDL::SetInactiveOp>(loc, vecType.getElementType(), scalarVal, defaultVal);

mlir/tools/rocmlir-driver/rocmlir-driver.cpp:176

  • [nitpick] The driver option is named chip but the pass option uses --chipset. For consistency, consider renaming chip to chipset in KernelOptions.
opts.chip = devName.getChip().str();

loc, vecType.getElementType(), scalarVal, defaultVal);
std::array<int, 5> row_shifts = {1, 2, 3, 4, 8};
Value dppResult = setInactiveScalar;
Value BrodcastAll;
Copilot AI commented Jun 11, 2025:

[nitpick] Variable name BrodcastAll is misspelled; consider renaming to BroadcastAll for clarity.

Suggested change
Value BrodcastAll;
Value BroadcastAll;
