
Conversation

@silee2
Contributor

@silee2 silee2 commented Oct 22, 2025

create_nd_tdesc is currently lowered to a fixed-size vector that encodes the 2D shape and strides of the base memory.
Supporting base memory of rank > 2 requires a different approach.
The consumers of the create_nd_tdesc op - load_nd, store_nd, prefetch_nd - now get base memory information directly from create_nd_tdesc instead of going through the fixed payload.
As a result of this change, the result type of create_nd_tdesc is lowered to a single i64 value.
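
A minimal sketch of the intended effect (illustrative only; the shapes are made up and the xegpu assembly syntax is approximate, not copied from this PR):

```mlir
// Source-level IR: descriptor creation and one consumer.
%tdesc = xegpu.create_nd_tdesc %src : memref<64x128xf16>
    -> !xegpu.tensor_desc<8x16xf16>
%v = xegpu.load_nd %tdesc : !xegpu.tensor_desc<8x16xf16> -> vector<8x16xf16>

// Previously %tdesc lowered to a fixed-size vector payload holding the base
// address plus 2D shape/strides. With this change it lowers to a single i64
// base address, and load_nd/store_nd/prefetch_nd recover shape, strides, and
// offsets from the defining create_nd_tdesc during their own lowering.
```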

@charithaintc
Contributor

charithaintc commented Oct 22, 2025

the result type of create_nd_tdesc is lowered to a single i64 value.

Any specific reason for this? Why not increase the payload size to accommodate the additional rank information? I think having a payload is useful for retrieving info at the consumer. I would reconsider this unless we plan to get rid of createNd and move everything to the consumer side.

@silee2
Contributor Author

silee2 commented Oct 23, 2025

the result type of create_nd_tdesc is lowered to a single i64 value.

Any specific reason for this? Why not increase the payload size to accommodate the additional rank information? I think having a payload is useful for retrieving info at the consumer. I would reconsider this unless we plan to get rid of createNd and move everything to the consumer side.

Several reasons:

  • The payload cannot be a fixed 2D type. Offset calculation needs to happen as part of lowering the consumer op, and it needs the source rank plus shape/strides; the amount of information varies with the source.
  • The type converter cannot see the source rank when deciding the lowered type, so the lowered payload type would have to be generic and rank-independent, with a structure similar to that of an unranked memref.
  • The producer - create_nd_tdesc - and its consumers have a direct def-use connection. All source memory information needed for offset computation in a consumer op can be accessed directly through the create_nd_tdesc op in the consumer's lowering pattern. Creating a payload merely replicates this information and makes it available at runtime at the XeVM level; it provides no additional benefit.
  • Using a payload makes compile-time folding and other optimizations difficult. The test cases in this PR run canonicalize and cse to show the benefit of not having a payload structure: if the shape and strides are compile-time constants, the entire offset-computation chain is folded to constants, as shown in https://github.com/llvm/llvm-project/pull/164701/files#diff-c29451f2af1cb1d5540fe09d3b8056829598407289249e2165908cb3b075883c, and even when they are not constant, multiple tensor descriptors with the same shape and strides can share the common computation, as shown in https://github.com/llvm/llvm-project/pull/164701/files#diff-e16a5efc3f4f94de5091c890907f30ad57dbd3af2be49b0ce9d5e954acb79e13 (see the folding sketch below).
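
As a rough sketch of the folding point in the last bullet (shapes, offsets, and constant values are made up for illustration, not taken from the linked diffs):

```mlir
// Offset chain a consumer lowering might emit for a row-major source with
// 64 elements per row and constant offsets (16, 32):
%stride0 = arith.constant 64 : index
%off0    = arith.constant 16 : index
%off1    = arith.constant 32 : index
%t0      = arith.muli %off0, %stride0 : index   // 16 * 64
%linear  = arith.addi %t0, %off1 : index        // 1024 + 32
// After canonicalize + cse the whole chain folds to a single constant:
%folded  = arith.constant 1056 : index
```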

@silee2
Contributor Author

silee2 commented Oct 23, 2025

FYI, @dchigarev

@charithaintc
Contributor

the result type of create_nd_tdesc is lowered to a single i64 value.

Any specific reason for this? Why not increase the payload size to accommodate the additional rank information? I think having a payload is useful for retrieving info at the consumer. I would reconsider this unless we plan to get rid of createNd and move everything to the consumer side.

Several reasons:

  • The payload cannot be a fixed 2D type. Offset calculation needs to happen as part of lowering the consumer op, and it needs the source rank plus shape/strides; the amount of information varies with the source.
  • The type converter cannot see the source rank when deciding the lowered type, so the lowered payload type would have to be generic and rank-independent, with a structure similar to that of an unranked memref.
  • The producer - create_nd_tdesc - and its consumers have a direct def-use connection. All source memory information needed for offset computation in a consumer op can be accessed directly through the create_nd_tdesc op in the consumer's lowering pattern. Creating a payload merely replicates this information and makes it available at runtime at the XeVM level; it provides no additional benefit.
  • Using a payload makes compile-time folding and other optimizations difficult. The test cases in this PR run canonicalize and cse to show the benefit of not having a payload structure: if the shape and strides are compile-time constants, the entire offset-computation chain is folded to constants, as shown in https://github.com/llvm/llvm-project/pull/164701/files#diff-c29451f2af1cb1d5540fe09d3b8056829598407289249e2165908cb3b075883c, and even when they are not constant, multiple tensor descriptors with the same shape and strides can share the common computation, as shown in https://github.com/llvm/llvm-project/pull/164701/files#diff-e16a5efc3f4f94de5091c890907f30ad57dbd3af2be49b0ce9d5e954acb79e13.

I see.

In that case, can you clarify how this is handled when the tensor_desc is a block arg or a func arg? Sorry, I have not had time to take a closer look at the changes yet.

I feel like after this change create_nd does not really give us any value; rather, it becomes a burden because we have to look it up during lowering. We could simply move all the info to loadNd to simplify things. We should consider removing create_nd in that case.

For example, if passed as a block arg or func arg.
@silee2
Contributor Author

silee2 commented Oct 24, 2025

can you clarify how this is handled when the tensor_desc is a block arg or a func arg?

The short answer is that the lowering pattern will not handle those cases.
I think the func arg case will not happen in practice.
The ops are used inside a device kernel, and a tensor_desc is not used as a device kernel argument.
The only other case would be using a tensor_desc as a func arg inside the device kernel, but xegpu does not currently support that.

The block arg case can be an issue, since block args are used for loops and control flow.
The loop case could be solved by forcing create_nd_tdesc to be placed in the current block, but handling control flow cannot be done with such a simple restriction.
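
For example, in a hypothetical loop like the following (made-up shapes; xegpu assembly syntax approximate), the consumer only sees a block argument and has to trace through the loop's iter_args to reach the defining create_nd_tdesc:

```mlir
%c0  = arith.constant 0 : index
%c8  = arith.constant 8 : index
%c64 = arith.constant 64 : index
%td0 = xegpu.create_nd_tdesc %src : memref<64x64xf16> -> !xegpu.tensor_desc<8x16xf16>
%last = scf.for %i = %c0 to %c64 step %c8
    iter_args(%td = %td0) -> (!xegpu.tensor_desc<8x16xf16>) {
  // %td is a block argument here; it has no defining op of its own.
  %v = xegpu.load_nd %td : !xegpu.tensor_desc<8x16xf16> -> vector<8x16xf16>
  %next = xegpu.update_nd_offset %td, [%c8, %c0] : !xegpu.tensor_desc<8x16xf16>
  scf.yield %next : !xegpu.tensor_desc<8x16xf16>
}
```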

I will update the PR to search for create_nd_tdesc through block args, but not interprocedurally.

@silee2
Contributor Author

silee2 commented Oct 24, 2025

I agree with @charithaintc that create_nd_tdesc is a burden.
MemRef-like information - base addr, offset, shape, strides - can be carried in two different ways: static and dynamic.
For memref types, static information is carried as part of the type and is therefore visible in the printed form. Because it is part of the type, passes have access to it and can use it for compile-time optimization.
In the dynamic case, the information is carried through a runtime payload. For example, at the LLVM dialect level, ranked and unranked memrefs are handled as described in https://mlir.llvm.org/docs/TargetLLVMIR/#default-calling-convention-for-ranked-memref and
https://mlir.llvm.org/docs/TargetLLVMIR/#default-calling-convention-for-unranked-memref
The info is then accessed at runtime by extracting payload fields.
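
For reference, the ranked-memref payload those docs describe is, for a 2-D memref, an LLVM-dialect struct along these lines (pointer and address-space details simplified):

```mlir
// { allocated ptr, aligned ptr, offset, sizes[2], strides[2] }
!ranked_2d_desc = !llvm.struct<(ptr, ptr, i64, array<2 x i64>, array<2 x i64>)>
// Unranked form: just the rank and a type-erased pointer to a ranked descriptor.
!unranked_desc = !llvm.struct<(i64, ptr)>
```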

The issue with create_nd_tdesc is that it cuts off the static information channel completely.
A payload is the only solution that works for all cases, including block args and func args.
Then why not just use a payload? Because it prevents static optimizations and is bad for performance, and typical XeGPU usage is more often static than dynamic.

This PR tries to work around that issue by forwarding static values from the operands/attributes of create_nd_tdesc directly to the consumer ops, to compensate for the disconnected static information channel.

But as @charithaintc pointed out, forwarding becomes challenging in the presence of control flow and function calls. That is why a runtime payload/structure is used as the general solution for the memref type.

Deprecating the create_nd_tdesc op and feeding the memref directly to the consumer ops would solve the issue for memref-based source memory, but XeGPU would still need an op that creates a memref from an integer base address, offset, shape, and strides and that works for both static and dynamic cases.
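
Purely as a sketch of that missing piece (the op name and form below are invented for illustration; no such op is being proposed here):

```mlir
// Hypothetical op, written in MLIR's generic form: build a strided 2-D view
// from a raw i64 base address plus dynamic offset, shape, and strides.
%view = "xegpu.create_mem_view"(%base, %off, %s0, %s1, %st0, %st1)
    : (i64, index, index, index, index, index)
    -> memref<?x?xf16, strided<[?, ?], offset: ?>>
```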

@Jianhui-Li
Contributor

Jianhui-Li commented Oct 29, 2025

I think it is not worth the complexity to support 3D+ offsets in the core lowering. The requirement was only to create a 2D tensor tile out of a 3D+ tensor and offsets, not to walk the tensor tile with 3D+ offsets. So it can be addressed by asking the user to do a subview upfront down to a 2D tensor; xegpu then only allows a 2D tensor tile on top of a 2D flattened tensor, unless future HW relaxes it.
See the discussion #162095 (comment)
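
For example (made-up shapes), the upfront subview could look like this, producing a 2-D strided view for the descriptor to consume:

```mlir
// Rank-reducing subview: slice one 64x64 tile out of a 3-D source,
// where %i is some dynamic index into the leading dimension.
%tile = memref.subview %src[%i, 0, 0] [1, 64, 64] [1, 1, 1]
    : memref<8x64x64xf16> to memref<64x64xf16, strided<[64, 1], offset: ?>>
```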

@silee2
Contributor Author

silee2 commented Oct 29, 2025

Closing this PR, as an alternative solution was suggested above.

@silee2 silee2 closed this Oct 29, 2025