
Conversation

@szihs (Collaborator) commented Jan 14, 2026

The issue was that AtomicTensor used sizeof(T) to compute byte offsets for RWByteAddressBuffer operations, but sizeof(float3) returns 12 while Metal buffers use a 16-byte stride for float3.
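As a minimal sketch of the difference (illustrative names only, not the actual AtomicTensor code in tensor.slang):

```slang
RWByteAddressBuffer buffer;

float3 loadBroken(uint index)
{
    // sizeof(float3) == 12, so on Metal this reads the wrong address
    // for every index > 0 (elements actually sit on a 16-byte stride).
    return buffer.Load<float3>(index * sizeof(float3));
}

float3 loadFixed(uint index, uint elementByteStride)
{
    // elementByteStride is supplied by the host: 16 for float3 on Metal.
    return buffer.Load<float3>(index * elementByteStride);
}
```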

Changes:

  • Add stride-aware atomic methods (atomicAddWithStride, etc.) in atomics.slang
  • Add _element_byte_stride field to AtomicTensor in tensor.slang
  • Pass element_stride from C++ when binding tensors in slangpytensor.cpp
  • Update tests to use proper tensor reads now that alignment is fixed

Fixes #118
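A hedged sketch of the layout the changes above describe (field names follow the PR description; the real tensor.slang code has more machinery and may differ):

```slang
// Sketch only -- not the actual AtomicTensor implementation.
struct AtomicTensor<T>
{
    RWByteAddressBuffer _data;

    // Written from C++ when the tensor is bound: 16 for float3 on Metal,
    // equal to sizeof(T) on platforms where size and stride agree.
    uint _element_byte_stride;

    uint byteOffset(uint index)
    {
        return index * _element_byte_stride;
    }
};
```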

@szihs szihs marked this pull request as ready for review January 14, 2026 06:32
@szihs szihs requested a review from a team as a code owner January 14, 2026 06:32
@szihs szihs force-pushed the dev/haaggarwal/fix-float3-alignment-metal branch from 9428780 to a902963 Compare January 14, 2026 07:14
@ccummingsNV (Contributor) commented Jan 14, 2026

This is a neat and cleanly implemented workaround, but it comes with a performance cost on both CPU and GPU, and doesn't help users who want to write atomic accumulators for structs.

@csyonghe @bmillsNV the correct fix here seems to me to be an 'alignedSizeOf' in Slang (or simply to make Metal return the correct size - depends on what you argue is correct / useful). Is it practical to get that in? I'd rather we do this the right way than with a workaround.

If we can agree that is the correct fix, I'm happy to merge the tensor work in with that test disabled on Mac, provided we fix it before launching the tensor refactor a few weeks from now.

The implementation of sizeof does feel a little quirky to me here in general. Coming from other languages, I would expect it to be standard practice to assume that, for some array, &X[N] == &X[0] + N * sizeof(elementtype(X)). I appreciate that's questioning the definition of size, but in practical terms the stride-based definition is the more useful one in my opinion.
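For illustration, here is roughly what a call site using the proposed 'alignedSizeOf' would look like -- purely hypothetical, since no such builtin exists in Slang today:

```slang
// Hypothetical: 'alignedSizeOf' is the builtin proposed above, not a real
// Slang intrinsic. It would return the buffer stride of T (16 for float3
// on Metal) rather than its packed size (12), restoring the usual
// &X[N] == &X[0] + N * stride identity for byte-offset arithmetic.
uint byteOffset<T>(uint index)
{
    return index * alignedSizeOf<T>();
}
```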

@ccummingsNV (Contributor) commented Jan 15, 2026

@szihs so I think the correct fix here in the long run is https://discord.com/channels/1303735196696445038/1461282795262316605/1461287168625344671

However, I'm happy for this workaround to happen for now, provided it is Metal-only. There's no point paying the cost of extra calculations on other platforms where it's not necessary.

If we use the above code, the atomicAdd functions should be tweaked so they simply call atomicAddWithStride, passing in sizeof(T) as the stride.
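The suggested tweak would look roughly like this (sketch only; the actual signatures in atomics.slang are generic and may differ, and this assumes Slang's InterlockedAddF32 extension on RWByteAddressBuffer):

```slang
// The strided version does the real work, taking the element stride
// explicitly (16 for float3 on Metal).
void atomicAddWithStride(RWByteAddressBuffer buf, uint index,
                         uint elementByteStride, float value)
{
    buf.InterlockedAddF32(index * elementByteStride, value);
}

// The plain version just forwards sizeof(T) as the stride, so non-Metal
// targets keep their current behaviour at no extra cost.
void atomicAdd(RWByteAddressBuffer buf, uint index, float value)
{
    atomicAddWithStride(buf, index, sizeof(float), value);
}
```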

@ccummingsNV (Contributor) left a comment

Valid workaround in principle, but we should make it Metal-only and make a few tweaks, as documented.

Base automatically changed from dev/ccummings/tensor-refactor to main January 16, 2026 10:32
AtomicTensor used sizeof(T) to compute byte offsets for RWByteAddressBuffer
operations, but sizeof(float3) returns 12 while Metal buffers use 16-byte
stride for float3.

Changes:
- atomics.slang: Add stride-aware atomic methods (atomicAddWithStride, etc.)
  Non-stride versions now call strided versions with sizeof(T) as default
- tensor.slang: Add _element_byte_stride field to AtomicTensor
  Runtime field on Metal, static const sizeof(T) on other platforms
- slangpytensor.cpp/h: Extract and write _element_byte_stride for NativeTensor
  Only written when the field exists (Metal backend with AtomicTensor)
- torchtensormarshall.py: Add _element_byte_stride to calldata on Metal
- test_differential_function_call.py: Remove workarounds, use proper tensor reads

Fixes #118
@szihs szihs force-pushed the dev/haaggarwal/fix-float3-alignment-metal branch from 39a2834 to 02c6b0a Compare January 29, 2026 12:53
@szihs szihs requested a review from ccummingsNV January 29, 2026 13:11
@szihs szihs dismissed ccummingsNV’s stale review January 30, 2026 12:45

Updated. Can you take a look?

@ccummingsNV (Contributor) left a comment

Happy with the work, though it probably needs a bit of merging with the latest torch integration, as a lot of the torch tensor marshall code is now native.
