Commit bb4009a
[Inductor] Naive foreach autotune support (pytorch#162053)
Adds initial autotuning support for foreach kernels, giving roughly a 4x improvement for some kernels in an internal workload. More improvements can surely be made here in the future. This change removes num_warps from the kernel definition to enable autotune support in the generated wrapper code.
Before:
triton_for_fused_18.kd 🔍 | 4.986 ms | 4.986 ms | 2.493 ms | 2 |
triton_for_fused_6.kd 🔍 | 0.098 ms | 0.098 ms | 0.049 ms | 2 |
triton_for_fused_7.kd 🔍 | 0.036 ms | 0.036 ms | 0.018 ms | 2 |
After:
triton_for_fused_18.kd 🔍 | 1.273 ms | 1.273 ms | 0.636 ms | 2 |
triton_for_fused_6.kd 🔍 | 0.044 ms | 0.044 ms | 0.022 ms | 2 |
triton_for_fused_7.kd 🔍 | 0.024 ms | 0.024 ms | 0.012 ms | 2 |
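As a quick check on the figures above, the per-kernel speedup can be computed from the first timing column of each row. This is a minimal sketch: the dictionary names are illustrative, and the values are copied from the Before and After listings.

```python
# Timings in ms, copied from the first timing column of the listings above.
before_ms = {
    "triton_for_fused_18": 4.986,
    "triton_for_fused_6": 0.098,
    "triton_for_fused_7": 0.036,
}
after_ms = {
    "triton_for_fused_18": 1.273,
    "triton_for_fused_6": 0.044,
    "triton_for_fused_7": 0.024,
}

# Speedup = time before / time after, rounded to two decimals.
speedups = {
    name: round(before_ms[name] / after_ms[name], 2) for name in before_ms
}

for name, factor in speedups.items():
    print(f"{name}: {factor}x")
```

The largest kernel improves by about 3.9x, which matches the "4x" in the description. The two smaller kernels improve by about 2.2x and 1.5x.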
The default of num_warps=8 comes from https://github.com/pytorch/pytorch/blob/main/torch/_inductor/codegen/triton_combo_kernel.py#L374
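The general idea of naive autotuning over num_warps can be sketched as follows: time each candidate value and keep the fastest. This is not Inductor's implementation. The function names, the candidate list, and the stand-in benchmark are all hypothetical and exist only to illustrate the selection loop.

```python
from typing import Callable, Iterable


def pick_num_warps(
    candidates: Iterable[int], benchmark: Callable[[int], float]
) -> int:
    """Return the candidate num_warps with the lowest measured time."""
    best = None
    best_time = float("inf")
    for num_warps in candidates:
        elapsed = benchmark(num_warps)
        if elapsed < best_time:
            best, best_time = num_warps, elapsed
    if best is None:
        raise ValueError("no candidates supplied")
    return best


def fake_benchmark(num_warps: int) -> float:
    # Hypothetical stand-in for launching and timing a kernel:
    # a deterministic cost curve that is cheapest at num_warps=4.
    return abs(num_warps - 4) + 1.0


# Hypothetical candidate list; a real autotuner chooses its own configs.
best = pick_num_warps([1, 2, 4, 8, 16], fake_benchmark)
print(best)
```

With a single fixed value such as num_warps=8 there is nothing to select. Leaving the value out of the kernel definition is what lets a loop like this pick it per kernel.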
Pull Request resolved: pytorch#162053
Approved by: https://github.com/mlazos, https://github.com/naromero77amd, https://github.com/jeffdaily
Co-authored-by: Nichols A. Romero <[email protected]>
1 parent 9e9e8fa commit bb4009a
File tree
2 files changed
+14 -3 lines changed
- torch/_inductor
  - codegen
  - runtime

| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| 627 | 627 | | |
| 628 | 628 | | |
| 629 | 629 | | |
| 630 | | - | |
| | 630 | + | |
| 631 | 631 | | |
| 632 | 632 | | |
| 633 | 633 | | |

| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| 3621 | 3621 | | |
| 3622 | 3622 | | |
| 3623 | 3623 | | |
| 3624 | | - | |
| | 3624 | + | |
| 3625 | 3625 | | |
| 3626 | 3626 | | |
| 3627 | 3627 | | |
| | 3628 | + | |
| | 3629 | + | |
| | 3630 | + | |
| | 3631 | + | |
| | 3632 | + | |
| | 3633 | + | |
| | 3634 | + | |
| | 3635 | + | |
| | 3636 | + | |
| | 3637 | + | |
| | 3638 | + | |
| 3628 | 3639 | | |
| 3629 | 3640 | | |
| 3630 | | - | |
| | 3641 | + | |
| 3631 | 3642 | | |
| 3632 | 3643 | | |
| 3633 | 3644 | | |