diff --git a/mlir/include/mlir/Dialect/XeGPU/IR/XeGPUAttrs.td b/mlir/include/mlir/Dialect/XeGPU/IR/XeGPUAttrs.td
index ab5fb4a4a7de9..f1bed70253ef3 100644
--- a/mlir/include/mlir/Dialect/XeGPU/IR/XeGPUAttrs.td
+++ b/mlir/include/mlir/Dialect/XeGPU/IR/XeGPUAttrs.td
@@ -183,53 +183,54 @@ def XeGPU_LayoutAttr : XeGPUAttr<"Layout", "layout"> {
     1-dimensional layout. The first dimension in the order list is the fastest-changing
     dimension. If it is not present, the default value is [1, 0].
 
-    ### Examples:
-    1. Subgroup level layout:
-    ```mlir
-    #xegpu.layout<lane_layout = [2, 8], lane_data = [1, 1]>
-    ```
-    In this example, there are 16 work-items per subgroup, and is organized as
-    [[0, 1, 2, .., 7],[8, 9, .., 15]]. The distribution unit is 1x1.
-
-    2. Subgroup level layout with order:
-    ```mlir
-    #xegpu.layout<lane_layout = [2, 8], lane_data = [1, 1], order = [0, 1]>
-    ```
-    In this example, there are 16 work-items per subgroup, and is organized as
-    [[0, 2, 4, ..., 14], [1, 3, 5, ..., 15]]. The distribution unit is 1x1.
-
-    3. Subgroup level layout with inst_data
-    ```mlir
-    #xegpu.layout<inst_data = [8, 16], lane_layout = [2, 8], lane_data = [2, 2]>
-    ```
-    In this example, the original problem size is partitioned into smaller subproblems of dimensions [8, 16],
-    which are then distributed among 16 work-items arranged as [[0, 1, 2, ..., 7], [8, 9, ..., 15]]. Each
-    work-item is assigned four 2x2 blocks in a round-robin manner.
-
-    4. Workgroup level layout:
-    ```mlir
-    #xegpu.layout<sg_layout = [2, 4], sg_data = [16, 16], lane_layout = [2, 8], lane_data = [1, 1]>
-    ```
-    In this example, the layout represents a workgroup distribution. A workgroup consists of 8 subgroups
-    arranged as [[0, 1, 2, 3], [4, 5, 6, 7]]. Each subgroup accesses a 16x16 block per instruction, which
-    is further distributed to 16 work items which is organized as [[0, 1, 2, .., 7],[8, 9, .., 15]].
-
-    5. Workgroup level layout with order:
-    ```mlir
-    #xegpu.layout<sg_layout = [2, 4], sg_data = [16, 16], lane_layout = [2, 8], lane_data = [1, 1], order = [0, 1]>
-    ```
-    In this example, the layout represents a workgroup distribution. A workgroup consists of 8 subgroups
-    arranged as [[0, 2, 4, 6], [1, 3, 5, 7]]. Each subgroup accesses a 16x16 block per instruction, which
-    is further distributed to 16 work items which is organized as [[0, 2, 4, ..., 14], [1, 3, 5, ..., 15]].
-
-    6. Workgroup level layout with inst_data:
-    ```mlir
-    #xegpu.layout<sg_layout = [2, 4], sg_data = [16, 16], inst_data = [8, 16], lane_layout = [2, 8], lane_data = [1, 1]>
-    ```
-    This example is similar to the previous ones, but the `inst_data` parameter divides `sg_data` into two instructions,
-    each processing an 8x16 block. These blocks are further distributed across 16 work-items with a distribution unit of 1x1.
-    Unlike the 2x2 distribution unit in example 3, which results in accessing contiguous 2x2 blocks, the 1x1 distribution
-    unit may result in non-contiguous access.
+    Examples:
+
+    1. Subgroup level layout:
+    ```mlir
+    #xegpu.layout<lane_layout = [2, 8], lane_data = [1, 1]>
+    ```
+    In this example, there are 16 work-items per subgroup, organized as
+    [[0, 1, 2, .., 7],[8, 9, .., 15]]. The distribution unit is 1x1.
+
+    2. Subgroup level layout with order:
+    ```mlir
+    #xegpu.layout<lane_layout = [2, 8], lane_data = [1, 1], order = [0, 1]>
+    ```
+    In this example, there are 16 work-items per subgroup, organized as
+    [[0, 2, 4, ..., 14], [1, 3, 5, ..., 15]]. The distribution unit is 1x1.
+
+    3. Subgroup level layout with inst_data:
+    ```mlir
+    #xegpu.layout<inst_data = [8, 16], lane_layout = [2, 8], lane_data = [2, 2]>
+    ```
+    In this example, the original problem size is partitioned into smaller subproblems of dimensions [8, 16],
+    which are then distributed among 16 work-items arranged as [[0, 1, 2, ..., 7], [8, 9, ..., 15]]. Each
+    work-item is assigned two 2x2 blocks in a round-robin manner.
+
+    4. Workgroup level layout:
+    ```mlir
+    #xegpu.layout<sg_layout = [2, 4], sg_data = [16, 16], lane_layout = [2, 8], lane_data = [1, 1]>
+    ```
+    In this example, the layout represents a workgroup distribution. A workgroup consists of 8 subgroups
+    arranged as [[0, 1, 2, 3], [4, 5, 6, 7]]. Each subgroup accesses a 16x16 block per instruction, which
+    is further distributed to 16 work-items organized as [[0, 1, 2, .., 7],[8, 9, .., 15]].
+
+    5. Workgroup level layout with order:
+    ```mlir
+    #xegpu.layout<sg_layout = [2, 4], sg_data = [16, 16], lane_layout = [2, 8], lane_data = [1, 1], order = [0, 1]>
+    ```
+    In this example, the layout represents a workgroup distribution. A workgroup consists of 8 subgroups
+    arranged as [[0, 2, 4, 6], [1, 3, 5, 7]]. Each subgroup accesses a 16x16 block per instruction, which
+    is further distributed to 16 work-items organized as [[0, 2, 4, ..., 14], [1, 3, 5, ..., 15]].
+
+    6. Workgroup level layout with inst_data:
+    ```mlir
+    #xegpu.layout<sg_layout = [2, 4], sg_data = [16, 16], inst_data = [8, 16], lane_layout = [2, 8], lane_data = [1, 1]>
+    ```
+    This example is similar to the previous ones, but the `inst_data` parameter divides `sg_data` into two instructions,
+    each processing an 8x16 block. These blocks are further distributed across 16 work-items with a distribution unit of 1x1.
+    Unlike the 2x2 distribution unit in example 3, which results in accessing contiguous 2x2 blocks, the 1x1 distribution
+    unit may result in non-contiguous access.
   }];
 
   let parameters = (ins
diff --git a/mlir/include/mlir/Dialect/XeGPU/IR/XeGPUDialect.td b/mlir/include/mlir/Dialect/XeGPU/IR/XeGPUDialect.td
index 765f218f95d26..fb5a1e6f1db0c 100644
--- a/mlir/include/mlir/Dialect/XeGPU/IR/XeGPUDialect.td
+++ b/mlir/include/mlir/Dialect/XeGPU/IR/XeGPUDialect.td
@@ -16,11 +16,20 @@ def XeGPU_Dialect : Dialect {
   let cppNamespace = "::mlir::xegpu";
   let summary = "The XeGPU dialect that models Intel GPU's ISA";
   let description = [{
-    The XeGPU dialect models Intel Xe ISA semantics but works at vector and
-    TensorDesc data type. It provides 1:1 mappings to match Xe instructions
-    like DPAS and 2D block load. The matrix size being processed at this level
-    exactly matches the hardware instructions or the intrinsic supported by
-    the lower-level GPU compiler.
+    The XeGPU dialect closely models a subset of the Xe GPU's ISA, providing an
+    abstraction to support high-performance GEMM code generation. It serves as a
+    bridge dialect in MLIR's gradual lowering process, working with MLIR memref
+    and vector types and complementing the Arith, Math, Vector, and MemRef
+    dialects. XeGPU operations are introduced for special Xe instructions not
+    modeled by the LLVM and SPIR-V dialects, such as DPAS and 2D block load and store.
+
+    It supports a tile-based programming model, decomposing the GEMM kernel into
+    large predefined tile sizes at the subgroup and workgroup levels. XeGPU allows
+    the high-level GEMM algorithm to be easily expressed. Underneath, it uses
+    target-specific recipes and hardware features to achieve optimal performance
+    on specific hardware. By decomposing GEMM at submatrix granularity and mapping
+    it to registers, it naturally supports optimizations such as fusing with
+    neighboring operations.
   }];
 
   let dependentDialects = ["arith::ArithDialect"];
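
For reviewer context (not part of the patch): a minimal, hypothetical sketch of how a layout attribute like the ones documented above typically appears in IR, attached to a tensor descriptor and consumed by a 2D block load. The function name, tile sizes, and the `#sg_map` alias are illustrative, and the exact op/type syntax may differ between XeGPU revisions.

```mlir
// Hypothetical illustration only: a subgroup-level layout attached to a
// tensor descriptor, then used by a 2D block load.
#sg_map = #xegpu.layout<lane_layout = [2, 8], lane_data = [1, 1]>

func.func @load_tile(%A: memref<128x128xf16>) -> vector<8x16xf16> {
  %c0 = arith.constant 0 : index
  // Describe an 8x16 tile of %A starting at (0, 0); the layout attribute
  // records how the tile's elements map onto the 16 work-items.
  %td = xegpu.create_nd_tdesc %A[%c0, %c0]
      : memref<128x128xf16> -> !xegpu.tensor_desc<8x16xf16, #sg_map>
  // 2D block load of the described tile into a vector.
  %tile = xegpu.load_nd %td
      : !xegpu.tensor_desc<8x16xf16, #sg_map> -> vector<8x16xf16>
  return %tile : vector<8x16xf16>
}
```

In a GEMM kernel, tiles loaded this way would typically feed `xegpu.dpas`; the layout attribute is what later distribution passes use to assign each work-item its per-lane fragment.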