Port "sub-group transpose reduction" to default path #2266

#2109 explores layout conversion in the advanced path to improve reduction performance (see https://github.com/intel/intel-xpu-backend-for-triton/issues/1637 for the investigation). Porting this to the default path would involve, after heuristics to check profitability, a transformation along these lines:
1. Reshape the input tensor (`tt.reshape`) so that elements can be reduced within each work-item without any data movement.
2. Perform the reduction within the work-item (`tt.reduce`).
3. Convert the layout (`triton_gpu.convert_layout`) so that a transposition within the sub-group is performed, as explained in the investigation.
4. Finalize the reduction (`tt.reduce`), within the work-item and possibly within the work-group.
5. Convert back to the initial layout (`triton_gpu.convert_layout`).

Note: step 5 can be dropped if the new layout is beneficial for performance.
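
To make the data movement in steps 1 to 4 concrete, here is a minimal pure-Python model of the idea. It is only a sketch, not the Triton lowering: the sub-group size, the strided input layout, the "one row per lane" restriction, and all function names are assumptions made for illustration. Step 5 (converting back to the initial layout) is left out.

```python
SUB_GROUP_SIZE = 4  # assumed for illustration; not taken from the issue


def distribute(tile, sg_size):
    """Assumed input layout: lane j owns columns j, j+sg_size, ... of every row."""
    return [[row[lane::sg_size] for row in tile] for lane in range(sg_size)]


def reduce_within_work_item(per_lane):
    """Steps 1-2: each lane sums the elements it already owns, row by row."""
    return [[sum(chunk) for chunk in lane_rows] for lane_rows in per_lane]


def sub_group_transpose(partials):
    """Step 3: lane i receives the partial sum of row i from every lane."""
    return [list(column) for column in zip(*partials)]


def finalize(transposed):
    """Step 4: each lane finishes the reduction of its row locally."""
    return [sum(values) for values in transposed]


def transpose_reduce(tile, sg_size=SUB_GROUP_SIZE):
    """Row-wise sum of `tile`, modelled as a sub-group transpose reduction."""
    assert len(tile) == sg_size, "sketch assumes one row per lane"
    assert all(len(row) % sg_size == 0 for row in tile)
    per_lane = distribute(tile, sg_size)
    partials = reduce_within_work_item(per_lane)
    transposed = sub_group_transpose(partials)
    return finalize(transposed)


# Example: a 4x8 tile reduced along its rows.
tile = [[r * 8 + c for c in range(8)] for r in range(4)]
result = transpose_reduce(tile)  # [28, 92, 156, 220]
```

In this model the only cross-lane communication is the single transpose in step 3; every summation happens on data a lane already holds.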
