Commit 69d0dd8
[ROCm] Use fine-grain fence in reduction (#2553)
Cherry-pick of pytorch#160979. This is a less-performant fix until pytorch#161180 is finalized.
* The global reduction path in the reduction kernel currently has two threadfence operations.
* The first threadfence is executed by all threads in all blocks, whereas the second threadfence is only run by threads in a single block.
* For AMD GPUs, threadfence is a heavyweight operation, especially when run by all threads in the system (due to cross-XCD synchronizations).
* Using a fine-grain fence therefore gives a significant performance boost on AMD GPUs.
* We do a release fence when threads write to the reduce buffer in global memory, and then an acquire fence when threads read from the reduce buffer (see the sketch below this list).
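
To make the placement of the two fences concrete, here is a minimal sketch of the two-pass global reduction, modeled on the classic threadfence-reduction pattern rather than copied from the actual `Reduce.cuh` diff. The use of clang's `__builtin_amdgcn_fence` with the `"agent"` (device) scope is an assumption about how the fine-grain fence is expressed; `USE_ROCM` is PyTorch's build macro:

```cpp
// Sketch only: not the exact PyTorch Reduce.cuh code. Launch with 256
// threads per block. Pass -DUSE_ROCM to hipcc to exercise the AMD path.
#if defined(USE_ROCM)
#include <hip/hip_runtime.h>
#endif

// Retirement counter; zero-initialized at module load (a real kernel
// would reset it between launches).
__device__ unsigned int blocks_done = 0;

__device__ inline void fence_release() {
#if defined(USE_ROCM)
  // Fine-grain fence (assumed form): release ordering at device ("agent")
  // scope, cheaper on AMD GPUs than the sequentially consistent,
  // cross-XCD __threadfence().
  __builtin_amdgcn_fence(__ATOMIC_RELEASE, "agent");
#else
  __threadfence();
#endif
}

__device__ inline void fence_acquire() {
#if defined(USE_ROCM)
  __builtin_amdgcn_fence(__ATOMIC_ACQUIRE, "agent");
#else
  __threadfence();
#endif
}

__global__ void reduce_sum(const float* in, float* partials, float* out,
                           int n) {
  __shared__ float smem[256];
  __shared__ bool is_last;

  // Pass 1: each block reduces its slice into partials[blockIdx.x].
  float v = 0.f;
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += gridDim.x * blockDim.x)
    v += in[i];
  smem[threadIdx.x] = v;
  __syncthreads();
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s) smem[threadIdx.x] += smem[threadIdx.x + s];
    __syncthreads();
  }
  if (threadIdx.x == 0) partials[blockIdx.x] = smem[0];

  // Fence #1: executed by all threads in all blocks. Release the partial
  // result to global memory before signaling completion.
  fence_release();

  if (threadIdx.x == 0) {
    unsigned int ticket = atomicAdd(&blocks_done, 1u);
    is_last = (ticket == gridDim.x - 1);
  }
  __syncthreads();

  // Pass 2: only the last block to finish runs this path.
  if (is_last) {
    // Fence #2: executed only by threads of this single block. Acquire
    // ordering before reading the other blocks' partial results.
    fence_acquire();
    float total = 0.f;
    for (int i = threadIdx.x; i < gridDim.x; i += blockDim.x)
      total += partials[i];
    smem[threadIdx.x] = total;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
      if (threadIdx.x < s) smem[threadIdx.x] += smem[threadIdx.x + s];
      __syncthreads();
    }
    if (threadIdx.x == 0) *out = smem[0];
  }
}
```

The key point is that fence #1 runs in every block, so replacing a sequentially consistent device-wide `__threadfence()` there with a release fence at device scope avoids most of the cross-XCD traffic on multi-die AMD GPUs, while the matching acquire fence in the single finishing block preserves the ordering the algorithm needs.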
Co-authors: @amd-hhashemi, @jeffdaily
**Reproducer**:
```python
import time
import torch

shapes = [(2, 896, 59, 91),
          ]
dims = [(2, 3),
        ]

for i, shape in enumerate(shapes):
    x = torch.randn(shape, device='cuda', dtype=torch.bfloat16)
    x = x.to(memory_format=torch.channels_last)
    # Warmup
    for _ in range(20):
        _ = torch.sum(x, dims[i], keepdim=True, dtype=torch.bfloat16)
    torch.cuda.synchronize()

    start_evt = torch.cuda.Event(enable_timing=True)
    end_evt = torch.cuda.Event(enable_timing=True)
    start_evt.record()
    for _ in range(100):
        _ = torch.sum(x, dims[i], keepdim=True, dtype=torch.bfloat16)
    end_evt.record()
    torch.cuda.synchronize()
    # elapsed_time() is in ms; divide by iterations and convert to us
    print(f"Avg time for shape {shape}: {start_evt.elapsed_time(end_evt) / 100 * 1e3:.2f} us")
```
Fixes SWDEV-5457101
1 file changed: +9 −0 lines changed