
Conversation

Jianhui-Li (Contributor)

Please review these guidelines to help with the review process:

  • Have you provided a meaningful PR description?
  • Have you added a test, a reproducer, or a reference to an issue with a reproducer?
  • Have you tested your changes locally for CPU and GPU devices?
  • Have you made sure that new changes do not introduce compiler warnings?
  • If this PR is a work in progress, are you filing the PR as a draft?
  • Have you organized your commits logically and ensured each can be built by itself?

@Jianhui-Li Jianhui-Li changed the title Add matrix_desc and operations XeGPU RFC update: Add matrix_desc and operations for share local memory Jul 8, 2025
@Jianhui-Li Jianhui-Li changed the title XeGPU RFC update: Add matrix_desc and operations for share local memory XeGPU RFC update: Add matrix_desc and operations for share local memory access Jul 8, 2025

## XeGPU operations to access shared local memory
Users must create a `matrix_desc` to hold a matrix in shared local memory. The matrix must be row-major. The matrix can carry an attribute describing its memory layout, for example a blocked layout or the original non-blocked row-major layout (a.k.a. linear layout).
Users can take a subview of an existing `matrix_desc` to obtain a new `matrix_desc`, potentially with a stride. They can then use load_matrix and store_matrix to move matrix data between shared local memory and vectors (registers). The matrix is typically 2D but can be multi-dimensional. XeGPU's load_matrix and store_matrix work at the workgroup level only. They use xegpu.layout to describe how the matrix is decomposed into data fragments and mapped to work items. The workgroup-level operation loads the entire matrix into a vector.
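To make the flow concrete, here is a rough sketch of a workgroup-level store/load round trip through SLM. The op name (`xegpu.create_matrix_desc`), the way the layout attaches, and the SLM address space are assumptions for illustration and may differ from the final op definitions:

```mlir
// Assumed syntax, for illustration only.
#wg_layout = {sg_layout = [8, 4], sg_data = [32, 8]}   // 8x4 subgroups, each owning a 32x8 fragment

// Create a matrix_desc over a 1D i8 SLM buffer (256x32 f16 = 16384 bytes).
// The op name create_matrix_desc is hypothetical.
%m = xegpu.create_matrix_desc %slm : memref<16384xi8, 3> -> matrix_desc<256x32xf16>

// Workgroup-level store and load of the entire matrix; #wg_layout describes how
// the matrix is decomposed into data fragments and mapped to subgroups.
xegpu.store_matrix %a, %m {layout = #wg_layout} : vector<256x32xf16>, matrix_desc<256x32xf16>
%v = xegpu.load_matrix %m {layout = #wg_layout} : matrix_desc<256x32xf16> -> vector<256x32xf16>
```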
Contributor


Since we're talking about the WG level here, I think #1033 should be merged before this one.

@chencha3 (Contributor) left a comment


LGTM


Users create a `matrix_desc` to represent a matrix stored in shared local memory (SLM). The operation takes a memory buffer (a 1D int8 memref with an empty layout) and creates a structured representation of the shared local memory. The resulting matrix_desc carries the relevant information, including shape, element type, and memory layout attributes (@block and @strides). The @block attribute indicates that the matrix follows a blocked layout, enabling optimized lowering to 1D block loads. The @strides attribute specifies the logical strides of each dimension and is typically used to support chunked loads.

When there is no input memref operand, it allocates SLM for the matrix, assuming a row-major contiguous layout.
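As a non-normative sketch of the two attributes (the attribute spelling, the hypothetical `create_matrix_desc` op name, and the concrete sizes below are assumptions for illustration):

```mlir
// Assumed syntax, for illustration only.

// @block: the 256x32 matrix is stored in 16x16 blocks, enabling lowering to 1D block loads.
%blocked = xegpu.create_matrix_desc %slm : memref<16384xi8, 3>
           -> matrix_desc<256x32xf16, @block = [16, 16]>

// @strides: logical strides per dimension, e.g. a 64x32 view into a row-major
// 64x128 matrix, typically used to support chunked loads.
%strided = xegpu.create_matrix_desc %slm2 : memref<16384xi8, 3>
           -> matrix_desc<64x32xf16, @strides = [128, 1]>
```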
Contributor


What is the purpose of making the memref operand optional?

Contributor Author


Removed. It was there to stay consistent with the earlier definition.

@Jianhui-Li Jianhui-Li changed the title XeGPU RFC update: Add matrix_desc and operations for share local memory access XeGPU RFC update: Add mem_desc and operations for share local memory access Aug 15, 2025
@Garra1980 (Contributor)

cc @dchigarev

```mlir
#dpas_wg = {sg_layout = [8, 4], sg_data = [32, 32], order = [1, 0]}
%at = xegpu.load_nd %tdesc: tensor_desc<32x256xf16, #Coop_t_wg> -> vector<32x256xf16>
%a = vector.transpose %1 {layout_result_0 = #Coop_wg}: vector<32x256xf16> to vector<256x32xf16>
```
Contributor


%1->%at?


The code is transformed to use store_matrix and load_matrix to implement the transpose cooperatively in shared local memory. Note that both load_nd and store_matrix use smaller sg_data values, meaning each subgroup processes a smaller fragment, enabling a cooperative transpose across threads.
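A non-normative sketch of the transformed sequence follows; the layout values, the hypothetical `create_matrix_desc` op name, and the use of a strided descriptor for the transposed view are assumptions chosen only to show that each subgroup touches a small fragment:

```mlir
// Assumed syntax, for illustration only.
#store_wg = {sg_layout = [8, 4], sg_data = [4, 64]}   // cooperative store of the 32x256 tile
#load_wg  = {sg_layout = [4, 8], sg_data = [64, 4]}   // cooperative load of the 256x32 result

// Row-major descriptor used to store the 32x256 tile into SLM.
%m = xegpu.create_matrix_desc %slm : memref<16384xi8, 3> -> matrix_desc<32x256xf16>
xegpu.store_matrix %at, %m {layout = #store_wg} : vector<32x256xf16>, matrix_desc<32x256xf16>
// (a workgroup barrier is needed between the store and the load)

// Strided descriptor over the same buffer presenting the transposed 256x32 view.
%mt = xegpu.create_matrix_desc %slm : memref<16384xi8, 3>
      -> matrix_desc<256x32xf16, @strides = [1, 256]>
%a = xegpu.load_matrix %mt {layout = #load_wg}
     : matrix_desc<256x32xf16, @strides = [1, 256]> -> vector<256x32xf16>
```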

It is generally preferable to detect and fuse the “transpose + convert_layout” pattern at the workgroup level early in the compilation pipeline. Early fusion directly influences the blocking strategy for `load_matrix` and `store_matrix`, which are the lowered forms of logical layout conversion and transpose. If this fusion is not performed at the workgroup level, later fusion passes may only fuse transpose with load at the subgroup level, potentially missing the most optimized code sequence.
Contributor


Is it correct that the "transpose + convert_layout" pattern should be generated by higher-level dialects?

Contributor Author


Yes.
