
Conversation

Jianhui-Li (Contributor)

Please review these guidelines to help with the review process:

  • Have you provided a meaningful PR description?
  • Have you added a test, a reproducer, or a reference to an issue with a reproducer?
  • Have you tested your changes locally for CPU and GPU devices?
  • Have you made sure that new changes do not introduce compiler warnings?
  • If this PR is a work in progress, are you filing the PR as a draft?
  • Have you organized your commits logically and ensured each can be built by itself?

@Jianhui-Li Jianhui-Li changed the title Add matrix_desc and operations XeGPU RFC update: Add matrix_desc and operations for share local memory Jul 8, 2025
@Jianhui-Li Jianhui-Li changed the title XeGPU RFC update: Add matrix_desc and operations for share local memory XeGPU RFC update: Add matrix_desc and operations for share local memory access Jul 8, 2025

## XeGPU operations to access shared local memory
Users must create a `matrix_desc` to hold a matrix in shared local memory. The matrix must be row-major. The matrix can carry an attribute describing its memory layout, for example a blocked layout or the original non-blocked row-major layout (a.k.a. linear layout).
Users can take a subview of an existing `matrix_desc` to obtain a new `matrix_desc`, potentially with a stride. They can then use load_matrix and store_matrix to move matrix data between shared local memory and vectors (registers). The matrix is typically 2D but can be multi-dimensional. XeGPU's load_matrix and store_matrix work at the workgroup level only. They use xegpu.layout to describe how the matrix is decomposed into data fragments and mapped to work items. The workgroup-level operation loads the entire matrix into a vector.
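To make the flow concrete, here is a rough sketch of a workgroup-level store/load round trip through SLM. The op name (`xegpu.create_matrix_desc`), the way the layout attaches, and the SLM address space are assumptions for illustration and may differ from the final op definitions:

```mlir
// Assumed syntax, for illustration only.
#wg_layout = {sg_layout = [8, 4], sg_data = [32, 8]}   // 8x4 subgroups, each owning a 32x8 fragment

// Create a matrix_desc over a 1D i8 SLM buffer (256x32 f16 = 16384 bytes).
// The op name create_matrix_desc is hypothetical.
%m = xegpu.create_matrix_desc %slm : memref<16384xi8, 3> -> matrix_desc<256x32xf16>

// Workgroup-level store and load of the entire matrix; #wg_layout describes how
// the matrix is decomposed into data fragments and mapped to subgroups.
xegpu.store_matrix %a, %m {layout = #wg_layout} : vector<256x32xf16>, matrix_desc<256x32xf16>
%v = xegpu.load_matrix %m {layout = #wg_layout} : matrix_desc<256x32xf16> -> vector<256x32xf16>
```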
Contributor


Since we're talking about the WG level here, I think #1033 should be merged before this one.

@chencha3 (Contributor) left a comment


LGTM


Users create a `matrix_desc` to represent a matrix stored in shared local memory (SLM). The operation takes a memory buffer (a 1D int8 memref with an empty layout) and creates a structured representation of the shared local memory. The resulting matrix_desc carries the relevant information, including shape, element type, and memory layout attributes (@block and @strides). The @block attribute indicates that the matrix follows a blocked layout, enabling optimized lowering to 1D block loads. The @strides attribute specifies the logical strides of each dimension and is typically used to support chunked loads.

When there is no input memref operand, it allocates SLM for the matrix, assuming a row-major contiguous layout.
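As a non-normative sketch of the two attributes (the attribute spelling, the hypothetical `create_matrix_desc` op name, and the concrete sizes below are assumptions for illustration):

```mlir
// Assumed syntax, for illustration only.

// @block: the 256x32 matrix is stored in 16x16 blocks, enabling lowering to 1D block loads.
%blocked = xegpu.create_matrix_desc %slm : memref<16384xi8, 3>
           -> matrix_desc<256x32xf16, @block = [16, 16]>

// @strides: logical strides per dimension, e.g. a 64x32 view into a row-major
// 64x128 matrix, typically used to support chunked loads.
%strided = xegpu.create_matrix_desc %slm2 : memref<16384xi8, 3>
           -> matrix_desc<64x32xf16, @strides = [128, 1]>
```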
Contributor


What is the purpose of making the memref operand optional?

Contributor Author


Removed. It was there to stay consistent with the earlier definition.

@Jianhui-Li Jianhui-Li changed the title XeGPU RFC update: Add matrix_desc and operations for share local memory access XeGPU RFC update: Add mem_desc and operations for share local memory access Aug 15, 2025
@Garra1980 (Contributor)

cc @dchigarev

```mlir
#dpas_wg = {sg_layout = [8, 4], sg_data = [32, 32], order = [1, 0]}
%at = xegpu.load_nd %tdesc: tensor_desc<32x256xf16, #Coop_t_wg> -> vector<32x256xf16>
%a = vector.transpose %1 {layout_result_0 = #Coop_wg}: vector<32x256xf16> to vector<256x32xf16>
```
Contributor


%1->%at?


The code is transformed to use store_matrix and load_matrix to implement the transpose cooperatively in shared local memory. Note that both load_nd and store_matrix use smaller sg_data values, meaning each subgroup processes a smaller fragment, enabling a cooperative transpose across threads.
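A non-normative sketch of the transformed sequence follows; the layout values, the hypothetical `create_matrix_desc` op name, and the use of a strided descriptor for the transposed view are assumptions chosen only to show that each subgroup touches a small fragment:

```mlir
// Assumed syntax, for illustration only.
#store_wg = {sg_layout = [8, 4], sg_data = [4, 64]}   // cooperative store of the 32x256 tile
#load_wg  = {sg_layout = [4, 8], sg_data = [64, 4]}   // cooperative load of the 256x32 result

// Row-major descriptor used to store the 32x256 tile into SLM.
%m = xegpu.create_matrix_desc %slm : memref<16384xi8, 3> -> matrix_desc<32x256xf16>
xegpu.store_matrix %at, %m {layout = #store_wg} : vector<32x256xf16>, matrix_desc<32x256xf16>
// (a workgroup barrier is needed between the store and the load)

// Strided descriptor over the same buffer presenting the transposed 256x32 view.
%mt = xegpu.create_matrix_desc %slm : memref<16384xi8, 3>
      -> matrix_desc<256x32xf16, @strides = [1, 256]>
%a = xegpu.load_matrix %mt {layout = #load_wg}
     : matrix_desc<256x32xf16, @strides = [1, 256]> -> vector<256x32xf16>
```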

It is generally preferable to detect and fuse the “transpose + convert_layout” pattern at the workgroup level early in the compilation pipeline. Early fusion directly influences the blocking strategy for `load_matrix` and `store_matrix`, which are the lowered forms of logical layout conversion and transpose. If this fusion is not performed at the workgroup level, later fusion passes may only fuse transpose with load at the subgroup level, potentially missing the most optimized code sequence.
Contributor


Is it correct that the "transpose + convert_layout" pattern should be generated by higher-level dialects?

Contributor Author


Yes.
