Summary:
This PR enables tensor descriptor pipelining in TLX to improve performance of TMA operations on Hopper and Blackwell GPUs. The implementation includes a new make_tensor_descriptor API with custom MLIR parsing and support for automatic scratch memory allocation. More specifically, the Tensor Descriptor Pipelining Infrastructure includes:
- Pipelining support for tensor descriptors, enabling efficient asynchronous data movement
- Automatic scratch memory allocation for descriptor storage
- Updates to the TMA lowering pass to handle pipelined descriptor operations
- A new `tlx.make_tensor_descriptor` API
Example usage:
```python
# For cases requiring manual memory management
desc_ptr = tlx.global_alloc(nbytes=128, alignment=128)
desc = tlx.make_tensor_descriptor(
    desc_ptr=desc_ptr,
    base=tensor_ptr,
    shape=[M, N],
    strides=[N, tl.constexpr(1)],
    block_shape=[64, 64],
    padding_option="zero",  # handle out-of-bounds accesses with zero fill
)

# Use the descriptor with async load/store operations
buffer = tl.zeros([64, 64], dtype=tl.float16)
tlx.async_descriptor_load(desc, buffer, [row_offset, col_offset])
```
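To illustrate why descriptors need dedicated scratch slots when pipelined, here is a conceptual sketch in plain Python (not the TLX API; `DESC_BYTES`, `NUM_STAGES`, and `write_descriptor` are hypothetical names for illustration). Each pipeline stage packs its descriptor into a distinct slot of a scratch ring, so rewriting the descriptor for stage `i+1` never clobbers the one an in-flight load from stage `i` is still reading.

```python
import struct

DESC_BYTES = 128   # size of one descriptor slot (hypothetical value)
NUM_STAGES = 3     # pipeline depth: number of descriptors live at once

# Scratch region holding NUM_STAGES descriptor slots, standing in for the
# automatically allocated descriptor storage described above.
scratch = bytearray(DESC_BYTES * NUM_STAGES)

def write_descriptor(slot: int, base: int, shape, strides) -> int:
    """Pack a toy descriptor (base, shape, strides) into its slot;
    return the slot's byte offset within the scratch region."""
    off = slot * DESC_BYTES
    struct.pack_into("<Q2q2q", scratch, off, base, *shape, *strides)
    return off

# Issue six iterations; slots are reused modulo NUM_STAGES, so at most
# NUM_STAGES descriptors are ever in flight simultaneously.
offsets = [
    write_descriptor(i % NUM_STAGES, base=0x1000 + 64 * i,
                     shape=(64, 64), strides=(64, 1))
    for i in range(6)
]
```

The key invariant is that a slot is only rewritten once the pipeline guarantees its previous load has completed, which is what the pipelining pass enforces for real TMA descriptors.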
Existing lit tests were updated for compatibility with the new parser, avoiding the need to support legacy type parsing; upstream has already begun adding new lit tests in this style.
Pull Request resolved: #706
Reviewed By: njriasan
Differential Revision: D88103067
Pulled By: htyu
fbshipit-source-id: 0ca3340add4bae9693b81929c612e91499bb9b84