
Conversation

@anmyachev linked an issue May 31, 2025 that may be closed by this pull request: 02-fused-softmax tutorial fails on BMG
num_warps = min(max_num_warps, max(1, BLOCK_SIZE // (WARP_SIZE * 4)))

# Allocate output
y = torch.empty_like(x)
anmyachev (Contributor, Author):

Again, we need to reduce memory pressure for BMG. It should be fine for other GPUs as well.
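For context, here is a hedged sketch of how this heuristic plugs into the tutorial's launcher; WARP_SIZE, max_num_warps, and the commented-out kernel launch are assumptions based on the 02-fused-softmax tutorial, not verbatim from this diff:

    import torch
    import triton

    def softmax(x: torch.Tensor) -> torch.Tensor:
        n_rows, n_cols = x.shape
        # One program per row; the block must cover the whole row.
        BLOCK_SIZE = triton.next_power_of_2(n_cols)
        WARP_SIZE = 32      # assumed; on XPU this corresponds to the sub-group size
        max_num_warps = 8   # assumed cap
        # Fewer warps for narrow rows reduces register/SLM pressure.
        num_warps = min(max_num_warps, max(1, BLOCK_SIZE // (WARP_SIZE * 4)))
        y = torch.empty_like(x)
        # softmax_kernel[(n_rows,)](y, x, x.stride(0), y.stride(0), n_cols,
        #                           BLOCK_SIZE=BLOCK_SIZE, num_warps=num_warps)
        return y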

anmyachev (Contributor, Author):

Hm, we don't use kineto here, so it's not a fair comparison (judging by the reported times). We need to put the allocation back.
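A sketch of what "putting the allocation back" means: the output buffer is allocated inside the timed function, so every implementation pays the same allocation cost. triton.testing.do_bench is Triton's usual timing helper; the rest is illustrative:

    import torch
    import triton

    def bench_softmax(x: torch.Tensor) -> float:
        def run():
            # The allocation is inside the timed region, just like in the
            # torch baseline, so the comparison stays fair.
            y = torch.empty_like(x)
            # ... launch the softmax kernel writing into y ...
            return y
        return triton.testing.do_bench(run)  # reported time in ms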

kernels = {}
# Possible SLM allocation sizes in kB
- tg_slm_sizes = [i * 2**i for i in [0, 1, 2, 4, 8, 16, 24, 32, 48, 64, 96, 128]]  # TODO: Get from properties
+ tg_slm_sizes = [2**i for i in [0, 1, 2, 4, 8, 16, 24, 32, 48, 64, 96, 128]]  # TODO: Get from properties
anmyachev (Contributor, Author):

Otherwise it returns a number (from occupancy) greater than the 196000 kB of SLM available on BMG, which breaks the program. I don't know why the heuristic was written this way.

@victor-eds is it ok to change it like this?

anmyachev (Contributor, Author):

Ref: #1495

Reviewer:

I'm not sure I understand this heuristic either, but your change totally alters the sizes returned by allocated_slm_size and does not just restrict the SLM to a lower bound.

  • previous tg_slm_sizes = [0, 2, 8, 64, 2048, 1048576, 402653184, 137438953472, 13510798882111488, 1180591620717411303424, 7605903601369376408980219232256, 43556142965880123323311949751266331066368]
  • new tg_slm_sizes = [1, 2, 4, 16, 256, 65536, 16777216, 4294967296, 281474976710656, 18446744073709551616, 79228162514264337593543950336, 340282366920938463463374607431768211456]

So this change also modifies the behavior of this tutorial on PVC, but I don't know precisely what the impact is occupancy-wise.

But generally speaking, I would say that in this test many parameters are hard-coded, with a comment # TODO: Get from properties. As we start to target multiple architectures, it would probably make sense to replace these hard-coded parameters with target-dependent values obtained from the device properties.
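For example, the CUDA version of this tutorial already queries such values at runtime; something similar could be done here. The property keys below are the ones the CUDA tutorial uses and are an assumption for the XPU backend:

    import triton

    driver = triton.runtime.driver.active
    properties = driver.utils.get_device_properties(driver.get_current_device())
    NUM_SM = properties["multiprocessor_count"]  # number of compute units
    SIZE_SMEM = properties["max_shared_mem"]     # shared local memory, in bytes
    WARP_SIZE = properties["warpSize"]           # threads per warp / sub-group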

Reviewer:

I don't have much context on this tutorial, but what makes sense to me is that this "heuristic" was wrong in the first place and should have been:
tg_slm_sizes = [i * 2**10 for i in [0, 1, 2, 4, 8, 16, 24, 32, 48, 64, 96, 128]]  # TODO: Get from properties
to simply convert the possible SLM memory sizes from KiB to bytes, since SIZE_SMEM is a size in bytes.
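With that fix the buckets become byte sizes. A minimal sketch of the corrected list and of how a helper like allocated_slm_size could consume it (the helper body is an assumption for illustration, not the tutorial's actual code):

    # Possible SLM allocation sizes, converted from KiB to bytes.
    tg_slm_sizes = [i * 2**10 for i in [0, 1, 2, 4, 8, 16, 24, 32, 48, 64, 96, 128]]
    # -> [0, 1024, 2048, 4096, 8192, 16384, 24576, 32768, 49152, 65536, 98304, 131072]

    def allocated_slm_size(size_smem: int) -> int:
        # Round a kernel's shared-memory requirement (bytes) up to the
        # next available bucket; raises StopIteration if nothing fits.
        return next(size for size in tg_slm_sizes if size >= size_smem)

    assert allocated_slm_size(5000) == 8192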

anmyachev (Contributor, Author):

> I don't have much context on this tutorial, but what makes sense to me is that this "heuristic" was wrong in the first place and should have been: tg_slm_sizes = [i * 2**10 for i in [0, 1, 2, 4, 8, 16, 24, 32, 48, 64, 96, 128]] # TODO: Get from properties to simply convert the possible SLM memory sizes from KiB to bytes, since SIZE_SMEM is a size in bytes.

Makes sense. Locally it works on BMG. Do you mind leaving it like this for now, if there is no significant slowdown on PVC?

Reviewer:

Yes, we can continue using hard-coded values for now. I've created issue #4413 to follow up on this.

anmyachev (Contributor, Author):

Thanks @mfrancepillois!

@anmyachev marked this pull request as ready for review May 31, 2025 19:40
@whitneywhtsang requested a review from a team May 31, 2025 19:59
Signed-off-by: Anatoly Myachev <[email protected]>
@anmyachev merged commit a2f2285 into main Jun 4, 2025
19 of 20 checks passed
@anmyachev deleted the amyachev/fused-softmax branch June 4, 2025 10:53
anmyachev added a commit that referenced this pull request Jun 4, 2025
I don't know exactly why we disabled this test on A770. However, after this change (#4383), it should run more stably.

* https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/15451596487 (`02-fused-softmax` passed)

Signed-off-by: Anatoly Myachev <[email protected]>
david-hls pushed a commit to david-hls/intel-xpu-backend-for-triton that referenced this pull request Jun 18, 2025
…ersion (intel#4383)

This is the first PR that replaces the old distributed->distributed layout conversion with one based on linear layouts.
We tried to match the original conversion mechanism as much as possible for now, but will try to improve its memory usage, reduce bank conflicts, and promote generalizability.
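As background, here is a toy illustration of the linear-layout idea in pure Python; it is not Triton's actual C++ LinearLayout API, just the underlying math: a layout maps hardware indices (register, lane, warp, ...) to tensor element offsets by XOR-combining one basis vector per input bit.

    def apply_layout(bases, index):
        # XOR together the basis vector of every set bit in `index`
        # (a linear map over GF(2)).
        out, bit = 0, 0
        while index:
            if index & 1:
                out ^= bases[bit]
            index >>= 1
            bit += 1
        return out

    # Example: a 4-element dimension whose bit 0 maps to output bit 1 and
    # bit 1 maps to output bit 0 -- a transpose-like swap.
    bases = [0b10, 0b01]
    print([apply_layout(bases, i) for i in range(4)])  # [0, 2, 1, 3]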

There is a list of TODOs after this PR:

1. Remove the old code
2. Implement conversion within warps
3. Implement DotOpLayout conversion
4. Avoid bank conflicts using swizzling instead of padding
5. Update comments/revisit barriers for reduce/atomic operations

---------

Co-authored-by: Justin Lebar <[email protected]>