Port "sub-group transpose reduction" to default path #2266

@victor-eds

Description

#2109 explores layout conversion in the advanced path to improve reduction performance (see #1637 for the investigation). Porting this to the default path would involve, after heuristics to check profitability, a transformation similar to the following:

  1. Reshape the input tensor so that no data movement is needed and elements can be reduced within the work-item (`tt.reshape`).
  2. Perform the reduction within the work-item (`tt.reduce`).
  3. Convert the layout so that a transposition within the sub-group is performed, as explained in the investigation (`triton_gpu.convert_layout`).
  4. Finalize the reduction, within the work-item and possibly within the work-group (`tt.reduce`).
  5. Convert back to the initial layout (`triton_gpu.convert_layout`).

Note that step 5 can be dropped if the new layout is beneficial for performance.
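
To make the data flow concrete, here is a toy model of steps 1 to 4 in plain Python. It is only a sketch under stated assumptions, not the actual pass: nested lists stand in for per-work-item registers, `SUB_GROUP_SIZE`, `distribute`, `sub_group_transpose` and the other names are hypothetical, the starting layout (lane j owns columns j, j + S, ...) is assumed for illustration, the number of rows is assumed to equal the sub-group size, and the work-group part of step 4 is not modelled.

```python
# Toy model only: plain Python lists stand in for registers; no Triton API is used.
SUB_GROUP_SIZE = 4  # hypothetical; chosen small so the example is easy to follow


def distribute(tensor, lanes):
    """Assumed initial layout: lane j owns columns j, j + lanes, ... of every row.

    regs[lane][row] is the list of values that lane holds for that row, which is
    the per-work-item view that step 1 (the reshape) exposes without moving data.
    """
    rows, cols = len(tensor), len(tensor[0])
    assert cols % lanes == 0
    return [[[tensor[r][c] for c in range(j, cols, lanes)] for r in range(rows)]
            for j in range(lanes)]


def reduce_within_work_item(regs):
    """Step 2: each lane sums the values it already owns; no cross-lane traffic."""
    return [[sum(vals) for vals in lane] for lane in regs]


def sub_group_transpose(partials):
    """Step 3: after the exchange, lane i holds the partials of row i from every lane."""
    lanes = len(partials)
    assert all(len(lane) == lanes for lane in partials)  # rows == lanes in this model
    return [[partials[j][i] for j in range(lanes)] for i in range(lanes)]


def finalize(transposed):
    """Step 4 (work-item part only): each lane finishes one row locally."""
    return [sum(lane) for lane in transposed]


def transpose_reduce(tensor, lanes=SUB_GROUP_SIZE):
    regs = distribute(tensor, lanes)
    partials = reduce_within_work_item(regs)
    transposed = sub_group_transpose(partials)
    return finalize(transposed)


tensor = [[r * 8 + c for c in range(8)] for r in range(SUB_GROUP_SIZE)]
result = transpose_reduce(tensor)
assert result == [sum(row) for row in tensor]  # matches a plain row-wise sum
```

In this model the only cross-lane communication is the single transpose in step 3; both reductions operate on values each lane already holds, which is the property the transformation is after.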
