Fix float3 alignment bug on Metal for gradient accumulation #713
Conversation
Force-pushed from 9428780 to a902963
This is a neat and cleanly implemented workaround, but it comes with a performance cost on both CPU and GPU, and it doesn't help users who want to write atomic accumulators for structs. @csyonghe @bmillsNV the correct fix here seems to me to be an `alignedSizeOf` in Slang (or simply to make Metal return the correct size; which is correct depends on what you argue is useful). Is it practical to get that in? I'd rather we do this the right way than with a workaround. If we can agree that is the correct fix, I'm happy to merge the tensor work with that test disabled on Mac, provided we fix it before launching the tensor refactor a few weeks from now.

The implementation of sizeof does feel a little quirky to me here in general. Coming from other languages, I would expect as standard practice to assume that, for some array, &X[N] == &X[0] + N * sizeof(elementtype(X)). I appreciate that's questioning the definition of size, but the stride-based definition is the more useful one in practical terms, in my opinion.
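To make the addressing point concrete, here is a minimal sketch (the function and parameter names are illustrative, not from this PR) of the identity the comment assumes, and where Metal breaks it:

```slang
// Minimal sketch, not from this PR: element i of a buffer-backed array
// is expected to live at byte offset i * stride. With a sizeof-based
// stride that is i * 12 for float3, but Metal lays float3 out with a
// 16-byte stride, so every element past index 0 reads the wrong bytes.
float3 loadElement(RWByteAddressBuffer buf, uint i, uint elementByteStride)
{
    // elementByteStride must be 16 on Metal, even though sizeof(float3) == 12.
    return buf.Load<float3>(i * elementByteStride);
}
```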
@szihs so I think the correct fix here in the long run is https://discord.com/channels/1303735196696445038/1461282795262316605/1461287168625344671. However, I'm happy for this workaround to happen for now, provided it is Metal-only; there's no point paying the cost of the extra calculations on other platforms where it's not necessary. If we use the above code, the atomicAdd functions should be tweaked so they simply call atomicAddWithStride, passing in sizeof(T) as the stride.
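A minimal sketch of that forwarding pattern, written as free functions for brevity (the real methods in atomics.slang are stride-aware members of AtomicTensor and may differ in signature):

```slang
// Sketch only: the strided entry point does the stride math, and the
// plain overload forwards sizeof(T) as the stride. Non-Metal platforms
// thus pay nothing extra, while Metal callers pass the real buffer
// stride (16 for float3) instead of sizeof.
void atomicAddWithStride(RWByteAddressBuffer buf, uint index, float value, uint stride)
{
    buf.InterlockedAddF32(index * stride, value); // float atomic add
}

void atomicAdd(RWByteAddressBuffer buf, uint index, float value)
{
    // Default path, as suggested above: stride == sizeof(element type).
    atomicAddWithStride(buf, index, value, sizeof(float));
}
```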
ccummingsNV left a comment
Valid workaround in principle, but we should make it Metal-only and make a few tweaks, as documented.
AtomicTensor used sizeof(T) to compute byte offsets for RWByteAddressBuffer operations, but sizeof(float3) returns 12 while Metal buffers use a 16-byte stride for float3.

Changes:
- atomics.slang: add stride-aware atomic methods (atomicAddWithStride, etc.); non-stride versions now call the strided versions with sizeof(T) as the default
- tensor.slang: add an _element_byte_stride field to AtomicTensor; a runtime field on Metal, static const sizeof(T) on other platforms (see the sketch after this list)
- slangpytensor.cpp/h: extract and write _element_byte_stride for NativeTensor; only written when the field exists (Metal backend with AtomicTensor)
- torchtensormarshall.py: add _element_byte_stride to calldata on Metal
- test_differential_function_call.py: remove workarounds, use proper tensor reads

Fixes #118
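A sketch of the platform split described for tensor.slang; the guard macro name and the simplified struct body are illustrative, not the actual code:

```slang
// Sketch only: on Metal the element stride must be a runtime field,
// written from the host side (slangpytensor.cpp); on other platforms it
// collapses to the compile-time sizeof(T), so no extra data or math is
// introduced there.
struct AtomicTensor<T>
{
    RWByteAddressBuffer buffer;
#if defined(__METAL__)                    // guard macro illustrative
    uint _element_byte_stride;            // e.g. 16 for float3 on Metal
#else
    static const uint _element_byte_stride = sizeof(T);
#endif
};
```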
Force-pushed from 39a2834 to 02c6b0a
ccummingsNV left a comment
Happy with the work, though it probably needs a bit of merging with the latest torch integration, as a lot of the torch tensor marshall code is now native.