Improve performance of `tt.load` and `tt.store` for FP8 when converting block ptr to regular ptrs

We would like to remove the `RewriteTensorPointer` pass, which rewrites block pointers into regular pointers (except when it determines that load/store operations on block ptrs can be converted to 2D block reads/writes). The idea is to avoid losing semantic information too early, and instead handle block ptrs that cannot be used to generate 2D block reads/writes while lowering the load/store operation itself.
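For readers less familiar with the rewrite, here is a rough Python sketch of what turning a 2D block pointer into regular pointers amounts to: each element of the block gets its own linear offset plus a bounds mask. This is only an illustration of the idea, not the pass's actual code. The function name `expand_block_ptr` and its parameters are hypothetical, and it assumes offsets and strides are expressed in elements.

```python
def expand_block_ptr(base, shape, strides, offsets, block_shape):
    """Expand a 2D block pointer into per-element offsets and a bounds mask.

    Hypothetical helper for illustration only (not part of Triton).
    base        -- linear offset of the parent tensor's first element
    shape       -- (rows, cols) of the parent tensor
    strides     -- (row_stride, col_stride), in elements
    offsets     -- (row, col) of the block's top-left corner
    block_shape -- (rows, cols) of the block
    """
    elem_offsets = []
    mask = []
    for i in range(block_shape[0]):
        row = offsets[0] + i
        row_offsets = []
        row_mask = []
        for j in range(block_shape[1]):
            col = offsets[1] + j
            # Linear offset of element (row, col) in the parent tensor.
            row_offsets.append(base + row * strides[0] + col * strides[1])
            # Elements outside the parent tensor must be masked off.
            row_mask.append(0 <= row < shape[0] and 0 <= col < shape[1])
        elem_offsets.append(row_offsets)
        mask.append(row_mask)
    return elem_offsets, mask
```

Once a block pointer is expanded this way, the base/shape/strides structure is no longer explicit in the IR, which is the loss of semantic information mentioned above.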

For this scheme to work, we first need to improve the lowering code for `tt.load` and `tt.store` operations that use a block ptr whose element type is not (currently) supported by the 2D read instructions available on the target GPU (e.g. the element type is FP8).

See https://github.com/intel/intel-xpu-backend-for-triton/issues/2359#issuecomment-2378283042 for more context.


Improve performance of `tt.load` and `tt.store` for FP8 when converting block ptr to regular ptrs #2374
