Fix float3 alignment bug on Metal for gradient accumulation #713
Conversation
Force-pushed from 9428780 to a902963
This is a neat and cleanly implemented workaround, but it comes with a performance cost on both CPU and GPU, and it doesn't help users who want to write atomic accumulators for structs. @csyonghe @bmillsNV the correct fix here seems to me to be an `alignedSizeOf` in Slang (or simply to make Metal return the correct size; which is correct depends on what you argue is useful). Is it practical to get that in? I'd rather we do this the right way than with a workaround. If we can agree that is the correct fix, I'm happy to merge the tensor work with that test disabled on Mac, provided we fix it before launching the tensor refactor a few weeks from now.

The implementation of sizeof does feel a little quirky to me here in general. Coming from other languages, I would expect as standard practice to assume that, for some array, &X[N] == &X[0] + N * sizeof(elementtype(X)). I appreciate that's questioning the definition of size, but the stride-based definition is the more useful one in practical terms, in my opinion.
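To make the addressing point concrete, here is a minimal sketch (the function and parameter names are illustrative, not from this PR) of the identity the comment assumes, and where Metal breaks it:

```slang
// Minimal sketch, not from this PR: element i of a buffer-backed array
// is expected to live at byte offset i * stride. With a sizeof-based
// stride that is i * 12 for float3, but Metal lays float3 out with a
// 16-byte stride, so every element past index 0 reads the wrong bytes.
float3 loadElement(RWByteAddressBuffer buf, uint i, uint elementByteStride)
{
    // elementByteStride must be 16 on Metal, even though sizeof(float3) == 12.
    return buf.Load<float3>(i * elementByteStride);
}
```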
@szihs so I think the correct fix here in the long run is https://discord.com/channels/1303735196696445038/1461282795262316605/1461287168625344671. However, I'm happy for this workaround to happen for now, provided it is Metal-only; there's no point paying the cost of the extra calculations on other platforms where it's not necessary. If we use the above code, the atomicAdd functions should be tweaked so they simply call atomicAddWithStride, passing in sizeof(T) as the stride.
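A minimal sketch of that forwarding pattern, written as free functions for brevity (the real methods in atomics.slang are stride-aware members of AtomicTensor and may differ in signature):

```slang
// Sketch only: the strided entry point does the stride math, and the
// plain overload forwards sizeof(T) as the stride. Non-Metal platforms
// thus pay nothing extra, while Metal callers pass the real buffer
// stride (16 for float3) instead of sizeof.
void atomicAddWithStride(RWByteAddressBuffer buf, uint index, float value, uint stride)
{
    buf.InterlockedAddF32(index * stride, value); // float atomic add
}

void atomicAdd(RWByteAddressBuffer buf, uint index, float value)
{
    // Default path, as suggested above: stride == sizeof(element type).
    atomicAddWithStride(buf, index, value, sizeof(float));
}
```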
ccummingsNV left a comment
Valid workaround in principle, but we should make it Metal-only and make a few tweaks, as documented.
AtomicTensor used sizeof(T) to compute byte offsets for RWByteAddressBuffer operations, but sizeof(float3) returns 12 while Metal buffers use a 16-byte stride for float3.

Changes:
- atomics.slang: add stride-aware atomic methods (atomicAddWithStride, etc.); non-stride versions now call the strided versions with sizeof(T) as the default
- tensor.slang: add an _element_byte_stride field to AtomicTensor; a runtime field on Metal, static const sizeof(T) on other platforms (see the sketch after this list)
- slangpytensor.cpp/h: extract and write _element_byte_stride for NativeTensor; only written when the field exists (Metal backend with AtomicTensor)
- torchtensormarshall.py: add _element_byte_stride to calldata on Metal
- test_differential_function_call.py: remove workarounds, use proper tensor reads

Fixes #118
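A sketch of the platform split described for tensor.slang; the guard macro name and the simplified struct body are illustrative, not the actual code:

```slang
// Sketch only: on Metal the element stride must be a runtime field,
// written from the host side (slangpytensor.cpp); on other platforms it
// collapses to the compile-time sizeof(T), so no extra data or math is
// introduced there.
struct AtomicTensor<T>
{
    RWByteAddressBuffer buffer;
#if defined(__METAL__)                    // guard macro illustrative
    uint _element_byte_stride;            // e.g. 16 for float3 on Metal
#else
    static const uint _element_byte_stride = sizeof(T);
#endif
};
```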
Force-pushed from 39a2834 to 02c6b0a
ccummingsNV left a comment
Happy with the work, though it probably needs a bit of merging with the latest torch integration, as a lot of the torch tensor marshall code is now native.