Commit 6e3be13
[ROCm] Use opportunistic fastatomics based on heuristics (pytorch#159430)
* Opportunistic fast atomics work better with small sizes, because lanes are then more likely to perform atomics on the same address
Co-author: @amd-hhashemi
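To illustrate the point about lanes hitting the same address, here is a minimal CPU-only sketch. It assumes that "size" means the number of destination rows, that a wavefront has 64 lanes, and that each lane picks its row uniformly at random. The function name `shared_address_fraction` and the sampling scheme are hypothetical; they are not the heuristic added by this commit.

```python
import random
from collections import Counter


def shared_address_fraction(num_rows, lanes=64, trials=2000, seed=0):
    """Estimate the fraction of lanes whose target row is also hit by another lane.

    Assumptions (not taken from the commit): each lane picks a destination
    row uniformly at random, and a wavefront has `lanes` lanes.
    """
    rng = random.Random(seed)
    shared = 0
    for _ in range(trials):
        rows = [rng.randrange(num_rows) for _ in range(lanes)]
        counts = Counter(rows)
        # Count lanes whose row appears more than once in this wavefront.
        shared += sum(1 for r in rows if counts[r] > 1)
    return shared / (trials * lanes)


if __name__ == "__main__":
    for num_rows in (16, 1_000, 1_000_000):
        frac = shared_address_fraction(num_rows)
        print(f"rows={num_rows:>9}: {frac:.3f} of lanes share an address")
```

Under these assumptions, nearly every lane shares its address with another lane when there are only 16 rows, and almost none do with a million rows, which is the regime distinction the bullet above describes.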
Reproducer:
```
import time
import torch

x = torch.randn((1_632_960, 128), device='cuda', dtype=torch.float)
ind = torch.randint(0, x.size(0), size=(5_079_670,), device='cuda')
src = torch.randn((5_079_670, 128), device='cuda', dtype=torch.float)

for _ in range(20):
    x.index_add_(0, ind, src)

start_time = time.time()
for i in range(100):
    x.index_add_(0, ind, src)
torch.cuda.synchronize()
end_time = time.time()
mean_time = (end_time - start_time)/100
print(f"Avg time for index_add_: {mean_time * 1e6:.2f} us")
```
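For readers unfamiliar with the operation being timed, the following is a small pure-Python sketch of what `index_add_` along dimension 0 computes. The helper name `index_add_rows` is hypothetical and works on plain lists instead of tensors. It is a reference for the semantics only, not the GPU kernel.

```python
def index_add_rows(x, ind, src):
    """In-place reference for x.index_add_(0, ind, src) on lists of rows."""
    for i, row in enumerate(ind):
        for j, value in enumerate(src[i]):
            # Repeated entries in `ind` accumulate into the same row of x.
            x[row][j] += value
    return x


x = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
ind = [0, 2, 0]  # row 0 is targeted twice
src = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
index_add_rows(x, ind, src)
print(x)  # [[6.0, 8.0], [0.0, 0.0], [3.0, 4.0]]
```

Because several source rows can target the same destination row, a parallel implementation has to combine those updates safely, which is where atomic adds come in.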
Perf numbers:
```
Before:
Avg time for index_add_: 25652.16 us
After:
Avg time for index_add_: 2675.15 us
```
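As a quick check of the ratio implied by the two averages above, the arithmetic can be written out as follows. The variable names are illustrative.

```python
before_us = 25652.16  # "Before" average from the numbers above
after_us = 2675.15    # "After" average from the numbers above

speedup = before_us / after_us
print(f"Speedup: {speedup:.2f}x")  # Speedup: 9.59x
```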
Pull Request resolved: pytorch#159430
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
1 parent d533e56 commit 6e3be13
1 file changed: +8 -0 lines changed. The 8 added lines are lines 285-292 of the new file, inserted after original line 284.