Closed
Description
#2109 explores layout conversion in the advanced path to improve reduction performance (see #1637 for the investigation). Porting this to the default path would involve a transformation similar to the following, applied only after heuristics confirm it is profitable:
1. Reshape input tensor so no data movement is needed and we can perform reduction of elements within the work-item: `tt.reshape`
2. Perform reduction within the work-item: `tt.reduce`
3. Convert layout so a transposition within the sub-group as explained in the investigation is performed: `triton_gpu.convert_layout`
4. Finalize reduction (within work-item and possibly within the work-group): `tt.reduce`
5. Convert back to initial layout: `triton_gpu.convert_layout`
Note that step 5 can be dropped if the new layout is beneficial for performance.
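To make the data flow concrete, here is a minimal pure-Python sketch of steps 1 to 4 on a toy model. It is not Triton IR and not the actual pass: the sub-group size `S`, the function name `row_sums_via_transpose`, and the assumed initial layout (work-item `w` owns the columns `c` with `c % S == w`) are all hypothetical choices made for illustration, and the real layouts discussed in #1637 may differ.

```python
from typing import List

S = 4  # hypothetical sub-group size, kept small for readability


def row_sums_via_transpose(tensor: List[List[int]]) -> List[int]:
    """Sum each row of `tensor`, mimicking steps 1-4 on a toy model."""
    rows, cols = len(tensor), len(tensor[0])
    assert rows == S and cols % S == 0, "toy model: one row per work-item"
    chunks = cols // S

    # Assumed initial layout: work-item w owns, for every row, the columns c
    # with c % S == w.
    # Step 1 (tt.reshape): view [rows, cols] as [rows, chunks, S]. Ownership
    # is unchanged, so no data moves between work-items.
    reshaped = [[row[k * S:(k + 1) * S] for k in range(chunks)] for row in tensor]

    # Step 2 (tt.reduce): each work-item reduces the values it already owns.
    # registers[w][r] is the partial sum of row r held by work-item w.
    registers = [
        [sum(reshaped[r][k][w] for k in range(chunks)) for r in range(rows)]
        for w in range(S)
    ]

    # Step 3 (triton_gpu.convert_layout): transpose within the sub-group so
    # that work-item r holds every partial sum of row r.
    transposed = [list(column) for column in zip(*registers)]

    # Step 4 (tt.reduce): finish the reduction inside each work-item.
    # Step 5 is skipped here: the result stays in the new layout, with one
    # row total per work-item.
    return [sum(values) for values in transposed]
```

For example, calling `row_sums_via_transpose` on a 4x8 tensor returns the same values as summing each row directly. The point of the sketch is that both reductions (steps 2 and 4) only touch values held by a single work-item, and the transposition in step 3 is the only cross-work-item exchange.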