Commit d533e56
[ROCm] Limit number of values per thread for reductions on three dimensions (pytorch#159652)
In the current implementation of reductions in three dimensions for AMD GPUs, the number of values per thread is unbounded and can reach the hundreds of thousands for certain tensors, which hurts performance. This patch fixes the issue by increasing the parallelism, thereby lowering the number of values per thread to a reasonable limit, i.e. fewer than 2048 values per thread. The performance gains can be 10x-17x for certain examples where the number of values per thread was originally very high.
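The arithmetic behind this trade-off can be sketched in a few lines. The following is only an illustration of the idea described above, not the code from this patch: the function names, the power-of-two growth of the split factor, and the example sizes are all assumptions made for the sketch. Only the 2048 bound comes from the commit message.

```python
# Hypothetical sketch: how spreading a reduction over more threads
# bounds the work done by each thread. Not the actual PyTorch kernel logic.

MAX_VALUES_PER_THREAD = 2048  # bound quoted in the commit message


def values_per_thread(reduction_size: int, threads: int) -> int:
    """Values each thread accumulates, using integer ceiling division."""
    return (reduction_size + threads - 1) // threads


def split_factor_for_cap(reduction_size: int, threads: int,
                         cap: int = MAX_VALUES_PER_THREAD) -> int:
    """Smallest power-of-two multiplier on the thread count (assumed policy)
    that brings the per-thread work strictly below `cap`."""
    factor = 1
    while values_per_thread(reduction_size, threads * factor) >= cap:
        factor *= 2
    return factor


if __name__ == "__main__":
    size, threads = 50_000_000, 256  # made-up example sizes
    before = values_per_thread(size, threads)
    factor = split_factor_for_cap(size, threads)
    after = values_per_thread(size, threads * factor)
    print(f"before: {before} values/thread, split x{factor}, after: {after}")
```

With these made-up sizes, each thread would start with roughly 195 thousand values, and multiplying the thread count by 128 brings that under the 2048 bound. A real kernel would also have to combine the partial results of the extra threads, which this sketch leaves out.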
Pull Request resolved: pytorch#159652
Approved by: https://github.com/jeffdaily
1 parent 3322e77 commit d533e56
1 file changed, +14 -1 lines changed

| Original file line number | Diff line number | Diff line change |
|---|---|---|
| 209 | 209 | |
| 210 | 210 | |
| 211 | 211 | |
| | 212 | + |
| | 213 | + |
| | 214 | + |
| | 215 | + |
| 212 | 216 | |
| 213 | 217 | |
| 214 | 218 | |
| | | |
| 1166 | 1170 | |
| 1167 | 1171 | |
| 1168 | 1172 | |
| 1169 | | - |
| | 1173 | + |
| 1170 | 1174 | |
| | 1175 | + |
| | 1176 | + |
| | 1177 | + |
| | 1178 | + |
| | 1179 | + |
| | 1180 | + |
| | 1181 | + |
| | 1182 | + |
| | 1183 | + |
| 1171 | 1184 | |
| 1172 | 1185 | |
| 1173 | 1186 | |