
Conversation

@silee2
Contributor

@silee2 silee2 commented Oct 22, 2025

create_nd_tdesc is currently lowered to a fixed-size vector that encodes the 2D shape and strides of the base memory.
Supporting base memory of rank > 2 requires a different approach.
The consumers of the create_nd_tdesc op - load_nd, store_nd, prefetch_nd - now get base memory information directly from create_nd_tdesc instead of going through the fixed payload.
As a result of this change, the result type of create_nd_tdesc is lowered to a single i64 value.
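
A minimal sketch of the intended effect (illustrative only; the shapes are made up and the xegpu assembly syntax is approximate, not copied from this PR):

```mlir
// Source-level IR: descriptor creation and one consumer.
%tdesc = xegpu.create_nd_tdesc %src : memref<64x128xf16>
    -> !xegpu.tensor_desc<8x16xf16>
%v = xegpu.load_nd %tdesc : !xegpu.tensor_desc<8x16xf16> -> vector<8x16xf16>

// Previously %tdesc lowered to a fixed-size vector payload holding the base
// address plus 2D shape/strides. With this change it lowers to a single i64
// base address, and load_nd/store_nd/prefetch_nd recover shape, strides, and
// offsets from the defining create_nd_tdesc during their own lowering.
```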

@charithaintc
Contributor

charithaintc commented Oct 22, 2025

the result type of create_nd_tdesc is lowered to a single i64 value.

Any specific reason for this? Why not increase the payload size to accommodate the additional rank information? I think having a payload is useful for retrieving info at the consumer. I would reconsider this unless we plan to get rid of createNd and move everything to the consumer side.

@silee2
Contributor Author

silee2 commented Oct 23, 2025

the result type of create_nd_tdesc is lowered to a single i64 value.

Any specific reason for this? Why not increase the payload size to accommodate the additional rank information? I think having a payload is useful for retrieving info at the consumer. I would reconsider this unless we plan to get rid of createNd and move everything to the consumer side.

Several reasons:

  • The payload cannot be a fixed 2D type. Offset calculation needs to happen as part of lowering the consumer op, and it needs the source rank plus shape/strides; the amount of information varies with the source.
  • The type converter cannot see the source rank when deciding the lowered type, so the lowered payload type would have to be generic and rank-independent, with a structure similar to that of an unranked memref.
  • The producer - create_nd_tdesc - and its consumers have a direct def-use connection. All source memory information needed for offset computation in a consumer op can be accessed directly through the create_nd_tdesc op in the consumer's lowering pattern. Creating a payload merely replicates this information and makes it available at runtime at the XeVM level; it provides no additional benefit.
  • Using a payload makes compile-time folding and other optimizations difficult. The test cases in this PR run canonicalize and cse to show the benefit of not having a payload structure: if the shape and strides are compile-time constants, the entire offset-computation chain is folded to constants, as shown in https://github.com/llvm/llvm-project/pull/164701/files#diff-c29451f2af1cb1d5540fe09d3b8056829598407289249e2165908cb3b075883c, and even when they are not constant, multiple tensor descriptors with the same shape and strides can share the common computation, as shown in https://github.com/llvm/llvm-project/pull/164701/files#diff-e16a5efc3f4f94de5091c890907f30ad57dbd3af2be49b0ce9d5e954acb79e13 (see the folding sketch below).
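
As a rough sketch of the folding point in the last bullet (shapes, offsets, and constant values are made up for illustration, not taken from the linked diffs):

```mlir
// Offset chain a consumer lowering might emit for a row-major source with
// 64 elements per row and constant offsets (16, 32):
%stride0 = arith.constant 64 : index
%off0    = arith.constant 16 : index
%off1    = arith.constant 32 : index
%t0      = arith.muli %off0, %stride0 : index   // 16 * 64
%linear  = arith.addi %t0, %off1 : index        // 1024 + 32
// After canonicalize + cse the whole chain folds to a single constant:
%folded  = arith.constant 1056 : index
```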

@silee2
Contributor Author

silee2 commented Oct 23, 2025

FYI, @dchigarev

@charithaintc
Contributor

the result type of create_nd_tdesc is lowered to a single i64 value.

Any specific reason for this? Why not increase the payload size to accommodate the additional rank information? I think having a payload is useful for retrieving info at the consumer. I would reconsider this unless we plan to get rid of createNd and move everything to the consumer side.

Several reasons:

  • The payload cannot be a fixed 2D type. Offset calculation needs to happen as part of lowering the consumer op, and it needs the source rank plus shape/strides; the amount of information varies with the source.
  • The type converter cannot see the source rank when deciding the lowered type, so the lowered payload type would have to be generic and rank-independent, with a structure similar to that of an unranked memref.
  • The producer - create_nd_tdesc - and its consumers have a direct def-use connection. All source memory information needed for offset computation in a consumer op can be accessed directly through the create_nd_tdesc op in the consumer's lowering pattern. Creating a payload merely replicates this information and makes it available at runtime at the XeVM level; it provides no additional benefit.
  • Using a payload makes compile-time folding and other optimizations difficult. The test cases in this PR run canonicalize and cse to show the benefit of not having a payload structure: if the shape and strides are compile-time constants, the entire offset-computation chain is folded to constants, as shown in https://github.com/llvm/llvm-project/pull/164701/files#diff-c29451f2af1cb1d5540fe09d3b8056829598407289249e2165908cb3b075883c, and even when they are not constant, multiple tensor descriptors with the same shape and strides can share the common computation, as shown in https://github.com/llvm/llvm-project/pull/164701/files#diff-e16a5efc3f4f94de5091c890907f30ad57dbd3af2be49b0ce9d5e954acb79e13.

I see.

In that case, can you clarify how this is handled when the tensor_desc is a block arg or a func arg? Sorry, I have not had time to take a closer look at the changes yet.

I feel like after this change create_nd does not really give us any value; rather, it becomes a burden because we have to look it up during lowering. We could simply move all the info to loadNd to simplify things. We should consider removing create_nd in that case.

For example, if passed as a block arg or func arg.
@silee2
Contributor Author

silee2 commented Oct 24, 2025

can you clarify how this is handled when the tensor_desc is a block arg or a func arg?

The short answer is that the lowering pattern will not handle those cases.
I think the func arg case will not happen in practice.
The ops are used inside a device kernel, and a tensor_desc is not used as a device kernel argument.
The only other case would be using a tensor_desc as a func arg inside the device kernel, but xegpu does not currently support that.

The block arg case can be an issue, since block args are used for loops and control flow.
The loop case could be solved by forcing create_nd_tdesc to be placed in the current block, but handling control flow cannot be done with such a simple restriction.
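
For example, in a hypothetical loop like the following (made-up shapes; xegpu assembly syntax approximate), the consumer only sees a block argument and has to trace through the loop's iter_args to reach the defining create_nd_tdesc:

```mlir
%c0  = arith.constant 0 : index
%c8  = arith.constant 8 : index
%c64 = arith.constant 64 : index
%td0 = xegpu.create_nd_tdesc %src : memref<64x64xf16> -> !xegpu.tensor_desc<8x16xf16>
%last = scf.for %i = %c0 to %c64 step %c8
    iter_args(%td = %td0) -> (!xegpu.tensor_desc<8x16xf16>) {
  // %td is a block argument here; it has no defining op of its own.
  %v = xegpu.load_nd %td : !xegpu.tensor_desc<8x16xf16> -> vector<8x16xf16>
  %next = xegpu.update_nd_offset %td, [%c8, %c0] : !xegpu.tensor_desc<8x16xf16>
  scf.yield %next : !xegpu.tensor_desc<8x16xf16>
}
```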

I will update the PR to search for create_nd_tdesc through block args, but not interprocedurally.

@silee2
Contributor Author

silee2 commented Oct 24, 2025

I agree with @charithaintc that create_nd_tdesc is a burden.
MemRef-like information - base addr, offset, shape, strides - can be carried in two different ways: static and dynamic.
For memref types, static information is carried as part of the type and is therefore visible in the printed form. Because it is part of the type, passes have access to it and can use it for compile-time optimization.
In the dynamic case, the information is carried through a runtime payload. For example, at the LLVM dialect level, ranked and unranked memrefs are handled as described in https://mlir.llvm.org/docs/TargetLLVMIR/#default-calling-convention-for-ranked-memref and
https://mlir.llvm.org/docs/TargetLLVMIR/#default-calling-convention-for-unranked-memref
The info is then accessed at runtime by extracting payload fields.
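
For reference, the ranked-memref payload those docs describe is, for a 2-D memref, an LLVM-dialect struct along these lines (pointer and address-space details simplified):

```mlir
// { allocated ptr, aligned ptr, offset, sizes[2], strides[2] }
!ranked_2d_desc = !llvm.struct<(ptr, ptr, i64, array<2 x i64>, array<2 x i64>)>
// Unranked form: just the rank and a type-erased pointer to a ranked descriptor.
!unranked_desc = !llvm.struct<(i64, ptr)>
```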

The issue with create_nd_tdesc is that it cuts off the static information channel completely.
A payload is the only solution that works for all cases, including block args and func args.
Then why not just use a payload? Because it prevents static optimizations and is bad for performance, and typical XeGPU usage is more often static than dynamic.

This PR tries to work around that issue by forwarding static values from the operands/attributes of create_nd_tdesc directly to the consumer ops, to compensate for the disconnected static information channel.

But as @charithaintc pointed out, forwarding becomes challenging in the presence of control flow and function calls. That is why a runtime payload/structure is used as the general solution for the memref type.

Deprecating the create_nd_tdesc op and feeding the memref directly to the consumer ops would solve the issue for memref-based source memory, but XeGPU would still need an op that creates a memref from an integer base address, offset, shape, and strides and that works for both static and dynamic cases.
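
Purely as a sketch of that missing piece (the op name and form below are invented for illustration; no such op is being proposed here):

```mlir
// Hypothetical op, written in MLIR's generic form: build a strided 2-D view
// from a raw i64 base address plus dynamic offset, shape, and strides.
%view = "xegpu.create_mem_view"(%base, %off, %s0, %s1, %st0, %st1)
    : (i64, index, index, index, index, index)
    -> memref<?x?xf16, strided<[?, ?], offset: ?>>
```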

@Jianhui-Li
Contributor

Jianhui-Li commented Oct 29, 2025

I think it is not worth the complexity to support 3D+ offsets in the core lowering. The requirement was only to create a 2D tensor tile out of a 3D+ tensor and offsets, not to walk the tensor tile with 3D+ offsets. So it can be addressed by asking the user to do a subview upfront down to a 2D tensor; xegpu then only allows a 2D tensor tile on top of a 2D flattened tensor, unless future HW relaxes it.
See the discussion #162095 (comment)
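
For example (made-up shapes), the upfront subview could look like this, producing a 2-D strided view for the descriptor to consume:

```mlir
// Rank-reducing subview: slice one 64x64 tile out of a 3-D source,
// where %i is some dynamic index into the leading dimension.
%tile = memref.subview %src[%i, 0, 0] [1, 64, 64] [1, 1, 1]
    : memref<8x64x64xf16> to memref<64x64xf16, strided<[64, 1], offset: ?>>
```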

@silee2
Contributor Author

silee2 commented Oct 29, 2025

Closing this PR, as an alternative solution was suggested above.

@silee2 silee2 closed this Oct 29, 2025