Commit ee78046

[XPU][OptRed] Revamp -tritonintelgpu-optimize-reduction-locality (#2800)
Original implementation had two critical issues:

- *Functional*: It did not preserve register order, so it was computing a different reduction.
- *Performance*: When converting back to the original tensor type, it did: `reshape(convert_layout(res))`. That means the `reshape` operation served as an anchor and the suboptimal slice layout was propagated.

This was fixed as follows:

- Keep register order.
- Do `convert_layout(reshape(res))` when converting back to the original type, thus propagating the more optimal layout.

See implementation for further details.

Signed-off-by: victor-eds <[email protected]>
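The functional issue described in this message can be illustrated with a toy analogy. The sketch below is not the pass's code: it uses plain Python lists in place of tensors, and the helper names (`reduce_rows`, `two_stage_reduce`, `keep_order`) are hypothetical. It only shows why regrouping elements in a different order before a multi-stage reduction yields a different reduction.

```python
# Toy analogy only: nested lists stand in for tensors, and the "scrambled"
# path stands in for a reshape that does not preserve element order.

def reduce_rows(rows):
    """Reference reduction: sum every row in one step."""
    return [sum(row) for row in rows]


def two_stage_reduce(rows, chunk, keep_order=True):
    """Reduce each row in two stages: within chunks, then across chunks.

    keep_order=True flattens row by row, so each regrouped row holds the
    same elements as the original row. keep_order=False flattens column
    by column, so elements from different rows end up grouped together.
    """
    n_rows, n_cols = len(rows), len(rows[0])
    if keep_order:
        flat = [v for row in rows for v in row]
    else:
        flat = [rows[r][c] for c in range(n_cols) for r in range(n_rows)]
    # Regroup the flat buffer back into n_rows rows of n_cols elements.
    regrouped = [flat[i * n_cols:(i + 1) * n_cols] for i in range(n_rows)]
    # Stage 1: partial sums within each chunk of a row.
    partial = [[sum(row[j:j + chunk]) for j in range(0, n_cols, chunk)]
               for row in regrouped]
    # Stage 2: combine the partial sums of each row.
    return [sum(p) for p in partial]


rows = [[0, 1, 2, 3], [4, 5, 6, 7]]
expected = reduce_rows(rows)
ordered = two_stage_reduce(rows, chunk=2, keep_order=True)
scrambled = two_stage_reduce(rows, chunk=2, keep_order=False)
print(expected, ordered, scrambled)
```

With order kept, the two-stage result matches the single-step reduction; with the order scrambled, each output mixes elements from both rows, so the results differ.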
1 parent da569f1 commit ee78046

3 files changed: +444 -438 lines changed

0 commit comments
