[MLIR][AMDGPU] Add a wrapper for global LDS load intrinsics in AMDGPU #133498
Conversation
f271c01 to 92a1ef9 (compare)
✅ With the latest revision this PR passed the C/C++ code formatter.
krzysz00
left a comment
Looks good minus doc wording and formatting nits
kuhar
left a comment
LGTM % nits
Wanted to flag #133015 landing for a future PR.
LLVM Buildbot has detected a new failure on builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/203/builds/6999. Here is the relevant piece of the build log for the reference.

LLVM Buildbot has detected a new failure on builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/204/builds/5812. Here is the relevant piece of the build log for the reference.

LLVM Buildbot has detected a new failure on builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/205/builds/5790. Here is the relevant piece of the build log for the reference.

To me, this looks like some missed dependency.
} else {
  return transferType.getIntOrFloatBitWidth() / 8;
}
no else after return: https://llvm.org/docs/CodingStandards.html#don-t-use-else-after-a-return
#include "mlir/Dialect/Arith/IR/Arith.h"
#include "mlir/Dialect/GPU/IR/GPUDialect.h"
#include "mlir/Dialect/LLVMIR/ROCDLDialect.h"
#include "mlir/Dialect/MemRef/Utils/MemRefUtils.h"
Need to link the MLIRMemRefUtils library in CMake to fix the buildbot "undefined reference" failure for `mlir::memref::isStaticShapeAndContiguousRowMajor(mlir::MemRefType)`.
Fix up here: #134862
Issue introduced in llvm#133498
## Summary
This PR sets the foundation for using the `global_load_lds` instruction to
load values from global memory to LDS. The pipeline is as follows:
* Only convert `linalg.copy` ops emitted by `GPUPromoteMatmulOperands`. When
it sees fit, the pass attaches a dedicated attribute
(`#iree_gpu.use_global_load_dma`) to the `linalg.copy` to tag it through the
pipeline.
* A tagged `linalg.copy` is not decomposed or tiled until bufferization.
* After distribution to threads and bufferization, the tagged
`linalg.copy` is lowered to a sequence of code built around the
subgroup-coalesced loading op `iree_gpu.global_load_dma`.
* `iree_gpu.global_load_dma` is mapped to the `amdgpu.gather_to_lds`
op, which in turn is lowered to the corresponding ROCDL op.
* Disable the pad-to-reduce-bank-conflicts pass because the destination
workgroup memory has to be contiguous.
## Lowering `linalg.copy`
After bufferization and distribution to threads, the tagged `linalg.copy`
still exists in the IR:
```
linalg.copy {lowering_config = #iree_gpu.use_global_load_dma}
ins(%subview_12 : memref<64x128xi8, strided<[256, 1], offset: ?>, #amdgpu.address_space<fat_raw_buffer>>)
outs(%alloc_4 : memref<64x128xi8, #gpu.address_space<workgroup>>)
```
Note that this `linalg.copy` is kept in each thread's code. The op itself
is then converted into a `for` loop in which a subgroup of threads loads a
coalesced chunk of values. For example, assume there are N subgroups
loading from `tensor<a x b x c>`:
* The `i`-th subgroup will load a sub-tensor of size `[a/N, b, c]`, so
each slice is consecutive.
  * For now, assume row-major layout and only tile the outermost dim.
  * The reason we currently only handle `linalg.copy` ops emitted by
`GPUPromoteMatmulOperands` is that we know their destination is allocated
contiguously.
  * TODO: expand to arbitrary memref slices.
* Given `gpu.subgroup_id` and `gpu.lane_id`, each thread calculates the
consecutive data chunk its subgroup is responsible for loading:
  * the chunk's indices are the delinearized indices of the input tensor,
ranging from
    * `affine.delinearize_index[gpu.subgroup_id * (num_elems_of(tensor) /
num_subgroups)]` to
    * `affine.delinearize_index[(gpu.subgroup_id + 1) *
(num_elems_of(tensor) / num_subgroups) - 1]`.
* Assuming each subgroup loads `n` values from the linearized index range
`[N_f, N_b]`, the thread with lane id `i` loads, for `iter = 0 to n`:
`N_f + subgroup_size * iter + (i - 1)` (see the worked numbers below).
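To make the index arithmetic concrete, here are those formulas instantiated for the configuration used in the example below (no new assumptions, just arithmetic):

$$
\frac{256}{64} = 4 \text{ subgroups}, \qquad \frac{64 \cdot 128}{4} = 2048 \text{ elements per subgroup}, \qquad \frac{2048}{64} = 32 \text{ iterations per lane},
$$

which matches the `%c32` trip count and the `s1 * 2048 + s2 * 64` terms in the affine maps of the generated loop.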
Then it will be converted to something like the following (in the
example, assume `workgroup size = 256`, `subgroup_size = 64`, loading
`64x128xi8`):
```mlir
scf.for %indvar = %c0 to %c32 step %c1 {
;; thread-specific gathering address from global address
%17 = affine.apply affine_map<()[s0, s1, s2] -> (s0 + s1 * 2048 + s2 * 64)>()[%lane_id, %subgroup_id, %indvar]
%18:2 = affine.delinearize_index %17 into (128, 64) : index, index
;; this iteration's base storing index
%19 = affine.apply affine_map<()[s0, s1] -> (s0 * 2048 + s1 * 64)>()[%subgroup_id, %indvar]
%20:2 = affine.delinearize_index %19 into (128, 64) : index, index
iree_gpu.global_load_dma %subview_13[%18#0, %18#1] -> %alloc_5[%20#0, %20#1] : memref<128x64xi8, strided<[256, 1], offset: ?>, #amdgpu.address_space<fat_raw_buffer>> -> memref<128x64xi8, #gpu.address_space<workgroup>>
}
;; if there are residual elements (subgroup_copy_region_size % subgroup_size != 0), copy residual elements here
gpu.barrier
```
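Each `iree_gpu.global_load_dma` in the loop above is subsequently rewritten to the `amdgpu.gather_to_lds` op. The sketch below is only an approximation of that rewrite: it reuses the SSA values from the loop above, assumes a one-`i8`-per-lane transfer, and approximates the op's assembly format (the authoritative syntax is whatever the upstream PR defines):

```mlir
// Approximate rewrite of one iteration's load (illustrative only).
// Each lane contributes a single i8 element; the elements gathered by the
// subgroup are written directly into LDS (workgroup memory) rather than
// passing through thread registers.
amdgpu.gather_to_lds %subview_13[%18#0, %18#1], %alloc_5[%20#0, %20#1]
  : i8,
    memref<128x64xi8, strided<[256, 1], offset: ?>, #amdgpu.address_space<fat_raw_buffer>>,
    memref<128x64xi8, #gpu.address_space<workgroup>>
```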
## Dependent PRs:
* design doc: https://hackmd.io/N0RitxPzT9GPhM0jEPtOCg?view
* upstream changes required:
* llvm/llvm-project#133498
* llvm/llvm-project#136405
* llvm/llvm-project#137671
* llvm/llvm-project#137425
* #20800 (review)
---------
Signed-off-by: Alan Li <[email protected]>
Defining a new `amdgpu.global_load` op, which is a thin wrapper around the ROCDL `global_load_lds` intrinsic, along with its lowering logic to `rocdl.global.load.lds`.