Fix 02-fused-softmax tutorial on BMG
#4383
Conversation
Signed-off-by: Anatoly Myachev <[email protected]>
```python
num_warps = min(max_num_warps, max(1, BLOCK_SIZE // (WARP_SIZE * 4)))
```

```python
# Allocate output
y = torch.empty_like(x)
```
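As a minimal sketch of the warp-count heuristic in the line above (the `WARP_SIZE` and `max_num_warps` values below are illustrative assumptions, not the tutorial's actual device-derived values):

```python
# Assumed placeholder values; in the tutorial these come from device properties.
WARP_SIZE = 32
max_num_warps = 8

def pick_num_warps(BLOCK_SIZE):
    # Roughly one warp per (WARP_SIZE * 4) row elements,
    # clamped to the range [1, max_num_warps].
    return min(max_num_warps, max(1, BLOCK_SIZE // (WARP_SIZE * 4)))

print(pick_num_warps(64))    # short rows still get at least 1 warp
print(pick_num_warps(1024))  # 1024 // 128 = 8
print(pick_num_warps(4096))  # clamped to max_num_warps
```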
Again, we need to reduce memory pressure for BMG. It should be fine for other GPUs as well.
Hmm, we don't use Kineto here, so it's not a fair comparison (judging by the measured time). We need to restore the allocation.
python/tutorials/02-fused-softmax.py
Outdated
```diff
  kernels = {}
  # Possible SLM allocation sizes in kB
- tg_slm_sizes = [i * 2**i for i in [0, 1, 2, 4, 8, 16, 24, 32, 48, 64, 96, 128]]  # TODO: Get from properties
+ tg_slm_sizes = [2**i for i in [0, 1, 2, 4, 8, 16, 24, 32, 48, 64, 96, 128]]  # TODO: Get from properties
```
Otherwise it returns a number (from occupancy) that is greater than the 196000 kb of SLM available on BMG, which breaks the program. I don't know why the heuristic was written this way.
@victor-eds is it OK to change it like this?
Ref: #1495
I'm not sure I understand this heuristic either, but your change completely alters the sizes returned for `allocated_slm_size`; it does not just restrict the SLM to a lower bound.
- previous: `tg_slm_sizes = [0, 2, 8, 64, 2048, 1048576, 402653184, 137438953472, 13510798882111488, 1180591620717411303424, 7605903601369376408980219232256, 43556142965880123323311949751266331066368]`
- new: `tg_slm_sizes = [1, 2, 4, 16, 256, 65536, 16777216, 4294967296, 281474976710656, 18446744073709551616, 79228162514264337593543950336, 340282366920938463463374607431768211456]`

So this change also modifies the behavior of this tutorial on PVC, but I don't know precisely what the impact is occupancy-wise.
Generally speaking, many parameters in this test are hard-coded with a comment `# TODO: Get from properties`. As we start to target multiple architectures, it would probably make sense to replace these hard-coded parameters with target-dependent values obtained from the device properties.
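To illustrate the suggestion, here is a rough sketch of deriving the candidate SLM sizes from a device-properties dict instead of hard-coding them. The `properties` dict below is a made-up placeholder; a real implementation would query the values from the driver for the target device:

```python
# Placeholder standing in for a device-properties query; the real values
# must come from the target device (this one is an assumed example).
properties = {"max_shared_mem": 131072}  # bytes

SIZE_SMEM = properties["max_shared_mem"]

# Candidate team-group SLM sizes in bytes, capped by the device limit
# rather than hard-coded per architecture.
tg_slm_sizes = [i * 2**10
                for i in [0, 1, 2, 4, 8, 16, 24, 32, 48, 64, 96, 128]
                if i * 2**10 <= SIZE_SMEM]

print(tg_slm_sizes[-1])  # 131072
```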
I don't have much context on this tutorial, but what makes sense to me is that this "heuristic" was wrong in the first place, and should have been:
```python
tg_slm_sizes = [i * 2**10 for i in [0, 1, 2, 4, 8, 16, 24, 32, 48, 64, 96, 128]]  # TODO: Get from properties
```
to simply convert the possible SLM memory sizes from KiB to bytes, as `SIZE_SMEM` is a size in bytes.
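A quick sketch comparing the three expressions shows that only the KiB-to-bytes version yields plausible sizes (pure Python, no assumptions beyond the size list from the snippet above):

```python
sizes_kib = [0, 1, 2, 4, 8, 16, 24, 32, 48, 64, 96, 128]

original = [i * 2**i for i in sizes_kib]   # blows up doubly exponentially
changed  = [2**i for i in sizes_kib]       # still exponential, wrong units
fixed    = [i * 2**10 for i in sizes_kib]  # KiB -> bytes

print(max(original))  # 128 * 2**128: astronomically larger than any SLM
print(max(fixed))     # 131072 bytes = 128 KiB, a plausible SLM size
```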
Makes sense. Locally it works on BMG. Do you mind leaving it like this for now, if there is no significant slowdown on PVC?
Yes, we can continue using hard-coded values for now. I've created issue #4413 to follow up on this.
Thanks @mfrancepillois!
Signed-off-by: Anatoly Myachev <[email protected]>
I don't know exactly why we disabled this test on A770. However, after change #4383, it should run more stably.
* https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/15451596487 (`02-fused-softmax` passed)

Signed-off-by: Anatoly Myachev <[email protected]>
…ersion (intel#4383)

This is the first PR that replaces the old distributed->distributed layout conversion using linear layout. We tried to match the original conversion mechanism as much as possible for now, but will try to improve its memory usage, reduce bank conflicts, and improve generalizability. There is a list of TODOs after this PR:
1. Remove the old code
2. Implement conversion within warps
3. Implement DotOpLayout conversion
4. Avoid bank conflicts using swizzling instead of padding
5. Update comments/revisit barriers for reduce/atomic operations

Co-authored-by: Justin Lebar <[email protected]>
For reference, I compared PVC perf between https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/15424009604?pr=4383 and https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/15422398403. Triton geomean: 656.5657 vs 662.7373. The difference is insignificant.