@dhonnappa-amd
Cherry-pick of #2553

Cherry-pick of pytorch#160979. This is a less-performant fix until pytorch#161180 is finalized.

* The global reduction path in the reduction kernel currently has two threadfence operations.
* The first threadfence is executed by all threads in all blocks, whereas the second threadfence is run only by threads in a single block.
* For AMD GPUs, threadfence is a heavyweight operation, especially when run by all threads in the system (due to cross-XCD synchronization).
* Using fine-grained fences therefore gives a significant performance boost on AMD GPUs.
* We do a release fence when threads write to the reduce buffer in global memory, and then an acquire fence when threads read from the reduce buffer, as shown in the sketch below.
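
The pattern looks roughly like the following sketch. It is a minimal illustration using libcu++'s `cuda::atomic_thread_fence`; the kernel structure and the names `reduce_buf`, `semaphore`, and `is_last_block` are illustrative assumptions, not the actual PyTorch identifiers, and the real patch may rely on compiler builtins instead.

```cuda
// Minimal sketch of release/acquire fencing in a two-stage reduction.
// Assumes CUDA 11+ with libcu++ (<cuda/atomic>). Names are illustrative.
#include <cuda/atomic>

__global__ void global_reduce(const float* in, float* reduce_buf,
                              unsigned int* semaphore, float* out) {
    // Stage 1: each block produces one partial result (details elided).
    float partial = in[blockIdx.x * blockDim.x + threadIdx.x];  // placeholder

    if (threadIdx.x == 0) {
        reduce_buf[blockIdx.x] = partial;
    }
    // Release fence: publish this block's write to reduce_buf device-wide
    // before signaling completion. Cheaper than a full __threadfence()
    // on AMD GPUs, where that implies cross-XCD synchronization.
    cuda::atomic_thread_fence(cuda::memory_order_release,
                              cuda::thread_scope_device);

    __shared__ bool is_last_block;
    if (threadIdx.x == 0) {
        is_last_block = (atomicAdd(semaphore, 1u) == gridDim.x - 1);
    }
    __syncthreads();

    if (is_last_block) {
        // Acquire fence: only the last-arriving block pays this cost; it
        // makes all blocks' released writes visible before they are read.
        cuda::atomic_thread_fence(cuda::memory_order_acquire,
                                  cuda::thread_scope_device);
        float total = 0.0f;
        for (unsigned int i = threadIdx.x; i < gridDim.x; i += blockDim.x) {
            total += reduce_buf[i];
        }
        // Stage 2: block-local reduction of `total` into out[0] (elided).
    }
}
```

The key point is the asymmetry: every block performs the cheap release fence once, but only the last block to arrive performs the acquire fence before the final pass, which is where the savings over two full `__threadfence()` calls come from.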

Co-authors: @amd-hhashemi, @jeffdaily

**Reproducer**:
```python
import torch

shapes = [(2, 896, 59, 91)]

dims = [(2, 3)]

for i, shape in enumerate(shapes):
    x = torch.randn(shape, device='cuda', dtype=torch.bfloat16)
    x = x.to(memory_format=torch.channels_last)
    # Warmup so timing excludes one-time compilation and caching effects
    for _ in range(20):
        _ = torch.sum(x, dims[i], keepdim=True, dtype=torch.bfloat16)
    torch.cuda.synchronize()

    # Time 100 iterations with CUDA events; report the average in microseconds
    start_evt = torch.cuda.Event(enable_timing=True)
    end_evt = torch.cuda.Event(enable_timing=True)
    start_evt.record()
    for _ in range(100):
        _ = torch.sum(x, dims[i], keepdim=True, dtype=torch.bfloat16)
    end_evt.record()
    torch.cuda.synchronize()
    print(f"Avg time for shape {shape}: {start_evt.elapsed_time(end_evt) / 100 * 1e3:.2f} us")
```

Fixes SWDEV-545710
@jerrymannil jerrymannil marked this pull request as ready for review August 22, 2025 19:14
@jerrymannil jerrymannil merged commit 93fa949 into rocm7.1_internal_testing Aug 22, 2025
1 check passed
@jerrymannil jerrymannil deleted the autogenerated/rocm7.1_internal_testing_cherry-pick_pr-2553 branch August 22, 2025 19:14
jerrymannil added a commit that referenced this pull request Sep 5, 2025
…e in reduction (#2563)

Cherry-pick of #2553

Co-authored-by: Jerry Mannil <[email protected]>