Fix 02-fused-softmax tutorial on BMG
#4383
Conversation
Signed-off-by: Anatoly Myachev <[email protected]>
```python
num_warps = min(max_num_warps, max(1, BLOCK_SIZE // (WARP_SIZE * 4)))
```

```python
# Allocate output
y = torch.empty_like(x)
```
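As a minimal sketch of the warp-count heuristic in the line above (the `WARP_SIZE` and `max_num_warps` values below are illustrative assumptions, not the tutorial's actual device-derived values):

```python
# Assumed placeholder values; in the tutorial these come from device properties.
WARP_SIZE = 32
max_num_warps = 8

def pick_num_warps(BLOCK_SIZE):
    # Roughly one warp per (WARP_SIZE * 4) row elements,
    # clamped to the range [1, max_num_warps].
    return min(max_num_warps, max(1, BLOCK_SIZE // (WARP_SIZE * 4)))

print(pick_num_warps(64))    # short rows still get at least 1 warp
print(pick_num_warps(1024))  # 1024 // 128 = 8
print(pick_num_warps(4096))  # clamped to max_num_warps
```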
Again, we need to reduce memory pressure for BMG. It should be fine for other GPUs as well.
Hmm, we don't use Kineto here, so it's not a fair comparison (judging by the measured time). We need to restore the allocation.
python/tutorials/02-fused-softmax.py
Outdated
```diff
  kernels = {}
  # Possible SLM allocation sizes in kB
- tg_slm_sizes = [i * 2**i for i in [0, 1, 2, 4, 8, 16, 24, 32, 48, 64, 96, 128]]  # TODO: Get from properties
+ tg_slm_sizes = [2**i for i in [0, 1, 2, 4, 8, 16, 24, 32, 48, 64, 96, 128]]  # TODO: Get from properties
```
Otherwise it returns a number (from occupancy) that is greater than the 196000 kb of SLM available on BMG, which breaks the program. I don't know why the heuristic was written this way.
@victor-eds is it OK to change it like this?
Ref: #1495
I'm not sure I understand this heuristic either, but your change completely alters the sizes returned for `allocated_slm_size`; it does not just restrict the SLM to a lower bound.
- previous: `tg_slm_sizes = [0, 2, 8, 64, 2048, 1048576, 402653184, 137438953472, 13510798882111488, 1180591620717411303424, 7605903601369376408980219232256, 43556142965880123323311949751266331066368]`
- new: `tg_slm_sizes = [1, 2, 4, 16, 256, 65536, 16777216, 4294967296, 281474976710656, 18446744073709551616, 79228162514264337593543950336, 340282366920938463463374607431768211456]`

So this change also modifies the behavior of this tutorial on PVC, but I don't know precisely what the impact is occupancy-wise.
Generally speaking, many parameters in this test are hard-coded with a comment `# TODO: Get from properties`. As we start to target multiple architectures, it would probably make sense to replace these hard-coded parameters with target-dependent values obtained from the device properties.
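To illustrate the suggestion, here is a rough sketch of deriving the candidate SLM sizes from a device-properties dict instead of hard-coding them. The `properties` dict below is a made-up placeholder; a real implementation would query the values from the driver for the target device:

```python
# Placeholder standing in for a device-properties query; the real values
# must come from the target device (this one is an assumed example).
properties = {"max_shared_mem": 131072}  # bytes

SIZE_SMEM = properties["max_shared_mem"]

# Candidate team-group SLM sizes in bytes, capped by the device limit
# rather than hard-coded per architecture.
tg_slm_sizes = [i * 2**10
                for i in [0, 1, 2, 4, 8, 16, 24, 32, 48, 64, 96, 128]
                if i * 2**10 <= SIZE_SMEM]

print(tg_slm_sizes[-1])  # 131072
```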
I don't have much context on this tutorial, but what makes sense to me is that this "heuristic" was wrong in the first place, and should have been:
```python
tg_slm_sizes = [i * 2**10 for i in [0, 1, 2, 4, 8, 16, 24, 32, 48, 64, 96, 128]]  # TODO: Get from properties
```
to simply convert the possible SLM memory sizes from KiB to bytes, as `SIZE_SMEM` is a size in bytes.
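A quick sketch comparing the three expressions shows that only the KiB-to-bytes version yields plausible sizes (pure Python, no assumptions beyond the size list from the snippet above):

```python
sizes_kib = [0, 1, 2, 4, 8, 16, 24, 32, 48, 64, 96, 128]

original = [i * 2**i for i in sizes_kib]   # blows up doubly exponentially
changed  = [2**i for i in sizes_kib]       # still exponential, wrong units
fixed    = [i * 2**10 for i in sizes_kib]  # KiB -> bytes

print(max(original))  # 128 * 2**128: astronomically larger than any SLM
print(max(fixed))     # 131072 bytes = 128 KiB, a plausible SLM size
```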
Makes sense. Locally it works on BMG. Do you mind leaving it like this for now, if there is no significant slowdown on PVC?
Yes, we can continue using hard-coded values for now. I've created issue #4413 to follow up on this.
Thanks @mfrancepillois!
Signed-off-by: Anatoly Myachev <[email protected]>
I don't know exactly why we disabled this test on A770. However, after change #4383, it should run more stably.
* https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/15451596487 (`02-fused-softmax` passed)

Signed-off-by: Anatoly Myachev <[email protected]>
…ersion (intel#4383)

This is the first PR that replaces the old distributed->distributed layout conversion using linear layout. We tried to match the original conversion mechanism as much as possible for now, but will try to improve its memory usage, reduce bank conflicts, and improve generalizability. There is a list of TODOs after this PR:
1. Remove the old code
2. Implement conversion within warps
3. Implement DotOpLayout conversion
4. Avoid bank conflicts using swizzling instead of padding
5. Update comments/revisit barriers for reduce/atomic operations

Co-authored-by: Justin Lebar <[email protected]>
For reference, I compared PVC perf between https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/15424009604?pr=4383 and https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/15422398403. Triton geomean: 656.5657 vs 662.7373. The difference is insignificant.