[AMDGPU][LDS] Enable global load DMA by default on gfx950+ #23230

Draft

lialan wants to merge 7 commits into main from users/lialan/avoid_dma_when_pad

Conversation

@lialan
Contributor

@lialan lialan commented Jan 21, 2026

  • Automatically use coalesced global load DMA for matmul/IGEMM on CDNA4+ architectures.
  • Fall back to standard promotion and avoid using LDS DMA when the source comes from a tensor.pad or when padding is required (see the sketch below).
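
A minimal sketch of that fallback condition, assuming the MLIR tensor dialect; the helper name and the `needsPadding` flag are illustrative stand-ins, not the PR's actual code:

#include "mlir/Dialect/Tensor/IR/Tensor.h"
#include "mlir/IR/Value.h"

/// Returns true when global load LDS DMA should be skipped for `source`,
/// i.e. when the operand is produced by a tensor.pad or would need padding.
static bool shouldFallBackToRegularPromotion(mlir::Value source,
                                             bool needsPadding) {
  // Per the description above: fall back to the standard shared-memory
  // promotion path when the source is produced by a tensor.pad or when
  // padding would be required.
  if (source.getDefiningOp<mlir::tensor::PadOp>())
    return true;
  return needsPadding;
}

In the actual pass this decision is made during promotion; the sketch only illustrates the condition described above.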

ci-extra: test_amd_mi355

@lialan lialan force-pushed the users/lialan/avoid_dma_when_pad branch 2 times, most recently from de197c4 to 761329c on January 22, 2026 00:02
@lialan lialan changed the title [AMDGPU][LDS] Do not use DMA in the presence of tensor.pad [AMDGPU][LDS] Turn on coalesced gather dma by default Jan 22, 2026
@lialan lialan force-pushed the users/lialan/avoid_dma_when_pad branch 4 times, most recently from 1885d10 to 198bdd3 on January 23, 2026 19:48
@lialan lialan changed the title [AMDGPU][LDS] Turn on coalesced gather dma by default [AMDGPU][LDS] Enable global load DMA by default Jan 23, 2026
@lialan lialan changed the title [AMDGPU][LDS] Enable global load DMA by default [AMDGPU][LDS] Enable global load DMA by default on gfx950+ Jan 23, 2026
@lialan lialan marked this pull request as ready for review January 24, 2026 01:34
Member

@kuhar kuhar left a comment


There are some clang-tidy warnings

Comment on lines 439 to 441
if (*maybeChipset < kGfx950) {
LDBG() << "Target arch " << targetArch << " is not CDNA4+, skipping pass";
return;
Member


Can we make sure this is not accidentally enabled on RDNA cards? Would it be possible to have a lit test for this?

Contributor Author


Indeed, so let's just restrict it to gfx950 only. I will add a test.

Contributor Author


Added a guard to limit this to gfx950+, and also to architectures that have global load LDS instructions.

Member


Is there something else that tells us if DMA is available? Maybe we could check for the dma_sizes target attribute?

Contributor Author


Now it is only enabled when both:

  • the target is gfx950+, and
  • the DMA size is >= 128 bits (see the sketch below).
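
A minimal sketch of the combined guard, assuming MLIR's amdgpu::Chipset utility; the `supportedDmaBitWidths` parameter is a hypothetical stand-in for reading the target's dma_sizes attribute, not IREE's actual accessor:

#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/StringRef.h"
#include "mlir/Dialect/AMDGPU/Utils/Chipset.h"

/// Returns true if the target may use coalesced global load LDS DMA:
/// gfx950 or newer, with at least one supported DMA width of 128+ bits.
static bool canUseGlobalLoadLdsDma(llvm::StringRef targetArch,
                                   llvm::ArrayRef<int64_t> supportedDmaBitWidths) {
  const mlir::amdgpu::Chipset kGfx950(9, 5, 0);
  mlir::FailureOr<mlir::amdgpu::Chipset> maybeChipset =
      mlir::amdgpu::Chipset::parse(targetArch);
  if (mlir::failed(maybeChipset) || *maybeChipset < kGfx950)
    return false;
  // The dma_sizes target attribute is assumed to be surfaced here as a list
  // of supported transfer widths in bits.
  return llvm::any_of(supportedDmaBitWidths,
                      [](int64_t bits) { return bits >= 128; });
}

Checking the DMA width via the target attribute keeps the architecture hardcoding down to the gfx950 floor alone.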

Contributor


We could actually remove the gfx950 check: "DMA sizes >= 128 bits" might be a good condition to use here.

It could make things a touch awkward if we later want to avoid this op on gfx1250 and need to handle the phase ordering, but for now it is a sufficient condition that doesn't involve mentioning an architecture by name.

@krzysz00
Contributor

Just going to stick a call for benchmarks here

Contributor

@qedawkins qedawkins left a comment


I'm closing my eyes to the amdgpu chipset hardcoding in the GPU dialect since the whole op needs to move anyway, but there's no way to do that right now.

I'll also echo Krzysztof's request. Providing some basic benchmark results alongside a feature enablement acts as proof that the feature works as intended. Easily reproducible results are best, but even a small hand-picked sweep is enough.

@lialan lialan force-pushed the users/lialan/avoid_dma_when_pad branch 2 times, most recently from 21519bf to 49eafdf on January 28, 2026 00:20
@lialan
Contributor Author

lialan commented Jan 28, 2026

@qedawkins @krzysz00 here are the benchmark numbers using turbine.

Column Z and Column AA are the baseline and the new results, respectively.

Overall it is positive, but the gains are very much diminished by a number of regressions, so I am investigating those regressions.

@lialan lialan marked this pull request as draft January 28, 2026 02:43
@qedawkins
Contributor

@qedawkins @krzysz00 here are the benchmark numbers using turbine.

Column Z and Column AA are the baseline and the new results, respectively.

Overall it is positive, but the gains are very much diminished by a number of regressions, so I am investigating those regressions.

Awesome, thanks for running the sweep. Getting a head start on the regressions sounds great, thanks!

@lialan
Contributor Author

lialan commented Feb 5, 2026

#23365 tries to enable DMA for unaligned cases, so we should see if we can merge that before we merge this one.

@Yu-Zhewen Yu-Zhewen force-pushed the users/lialan/avoid_dma_when_pad branch from 344d4bd to 5a5b9af on February 6, 2026 21:14