[AMDGPU][LDS] Enable global load DMA by default on gfx950+#23230
[AMDGPU][LDS] Enable global load DMA by default on gfx950+#23230
Conversation
de197c4 to
761329c
Compare
1885d10 to
198bdd3
Compare
kuhar
left a comment
There was a problem hiding this comment.
There are some clang-tidy warnings
| if (*maybeChipset < kGfx950) { | ||
| LDBG() << "Target arch " << targetArch << " is not CDNA4+, skipping pass"; | ||
| return; |
There was a problem hiding this comment.
Can we make sure this is not accidentally enabled on rdna cards? Would it be possible to have a lit test for this?
There was a problem hiding this comment.
Indeed, so let's just restrict it to gfx950 only. I will add a test.
There was a problem hiding this comment.
Added some guard to limit to gfx950+ but also arch that has global load lds instructions.
There was a problem hiding this comment.
Is there something else that tells us if DMA is available? Maybe we could check for the dma_sizes target attribute?
There was a problem hiding this comment.
now only enabled when it is both:
- gfx 950+
- have DMA size >= 128bit.
There was a problem hiding this comment.
We could actually remove the gfx950 check - "DMA sizes >= 128 bits" might actually be a good condition to use here
It could make things a touch awkward if we want to not use this op on gfx1250 and need to do phase ordering about it, but for now that's a sufficient condition that doesn't involve mentioning an architecture by name
|
Just going to stick a call for benchmarks here |
compiler/src/iree/compiler/Codegen/Common/GPU/AMDGPULowerCoalescedDMAToGatherLDS.cpp
Outdated
Show resolved
Hide resolved
qedawkins
left a comment
There was a problem hiding this comment.
I'm closing my eyes to the amdgpu chipset hardcoding in the GPU dialect since the whole op needs to move anyway, but there's no way to do that right now.
I'll also echo Krzysztof's request. Providing some basic benchmarking results with a feature enablement acts as proof that the feature is working as intended. Easily reproducible results is best, but even a small hand picked sweep is enough.
compiler/src/iree/compiler/Codegen/Common/GPU/GPUConvertToCoalescedDMA.cpp
Outdated
Show resolved
Hide resolved
compiler/src/iree/compiler/Codegen/Common/GPU/GPUConvertToCoalescedDMA.cpp
Outdated
Show resolved
Hide resolved
21519bf to
49eafdf
Compare
|
Column Z and Column AA are baseline and the new results. Overall it is positive, but very much diminished by a lot of regressions. So I am investigating those regressions. |
Awesome, thanks for running the sweep. Getting a head start on the regressions sounds great, thanks! |
|
#23365 tries to enable DMA for unaligned cases, so we should see if we can merge that before we merge this one. |
Signed-off-by: Yu-Zhewen <zhewenyu@amd.com>
Signed-off-by: Yu-Zhewen <zhewenyu@amd.com>
344d4bd to
5a5b9af
Compare
tensor.pador when padding is required.ci-extra: test_amd_mi355