**programming_examples/basic/matrix_multiplication/matrix_vector/README.md** (1 addition, 1 deletion)
@@ -17,7 +17,7 @@ In this design, one or multiple AI Engine compute cores (spread across hardware
## Differences from the [Whole-Array Matrix-Matrix Multiplication Design](../whole_array/README.md)
- A specialized matrix-*vector* microkernel, named `matvec_vectorized`, is used in this design, as opposed to the more general matrix-matrix microkernel (`matmul_vectorized`) used in the matrix-matrix-multiplication designs.
- - The data movement in this design varies as follows: An identical `32`-element chunk of the vector `B` is **broadcast** to the cores in all columns, whereas _distinct_ subsequent `32`×`32`-sized tiles of the `A` matrix are **distributed** to the cores. As such, each core is responsible for a distinct `32`-element chunk of the output vector `C`. These chunks are assembled (**joined**) at the shim tile level (in the `sequence()` function).
+ - The data movement in this design varies as follows: An identical `32`-element chunk of the vector `B` is **broadcast** to the cores in all columns, whereas _distinct_ subsequent `32`×`32`-sized tiles of the `A` matrix are **distributed** to the cores. As such, each core is responsible for a distinct `32`-element chunk of the output vector `C`. These chunks are assembled (**joined**) at the shim tile level (in the `aiex.runtime_sequence()`).
- This design does not use all available compute cores. Instead, it uses at most one core in each hardware column. The variable `n_cores` defines the number of columns to be used. It would however be possible to extend this design to use all cores.
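The broadcast/distribute/join data movement described in this hunk can be sketched in plain NumPy. This is a minimal illustration, not the IRON API: the sizes, the `n_cores` variable, and the explicit loops are hypothetical stand-ins for what the design expresses with ObjectFIFOs and DMA transfers.

```python
import numpy as np

# Hypothetical sizes: 4 cores, each owning a distinct 32-row block of A.
M, K = 128, 64
n_cores, tile = 4, 32

A = np.arange(M * K, dtype=np.float64).reshape(M, K)
B = np.arange(K, dtype=np.float64)

chunks = []
for core in range(n_cores):
    acc = np.zeros(tile)
    for kk in range(0, K, tile):
        # Distinct 32x32 tile of A is *distributed* to this core...
        a_tile = A[core * tile:(core + 1) * tile, kk:kk + tile]
        # ...while the same 32-element chunk of B is *broadcast* to all cores.
        b_chunk = B[kk:kk + tile]
        acc += a_tile @ b_chunk  # stand-in for the matvec microkernel
    chunks.append(acc)

# Each core's distinct 32-element chunk of C is *joined* into the result.
C = np.concatenate(chunks)
```

Concatenating the per-core chunks reproduces the full matrix-vector product `A @ B`, mirroring the join performed at the shim tile level.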
**programming_examples/basic/matrix_multiplication/whole_array/README.md** (3 additions, 3 deletions)
@@ -22,7 +22,7 @@ At a high level, the code does the following (in order):
1. [**Defining Core Computations:**](#4-defining-core-computations) The `core_body()` function contains the code that will be loaded onto each AIE core. This code describes the matrix multiplication using the input submatrices `a` and `b` acquired through the ObjectFIFOs. The results are accumulated in the output submatrix `c`.
- 1. [**Defining External Data Transfer Sequences:**](#5-defining-external-data-transfer-sequences) The `sequence()` function sets up matrix data movement from the host into the AIE compute cores, and back to the host after computation. It initializes Data Movement Accelerator (DMA) transfers, sets memory access patterns, and performs synchronization.
+ 1. [**Defining External Data Transfer Sequences:**](#5-defining-external-data-transfer-sequences) The `aie.runtime_sequence()` op sets up matrix data movement from the host into the AIE compute cores, and back to the host after computation. It initializes Data Movement Accelerator (DMA) transfers, sets memory access patterns, and performs synchronization.
1. **Generating the Design:** The `my_matmul()` function triggers the code generation process and represents the main entry point of the design. The final print statement outputs the MLIR representation of the AIE array configuration.
@@ -72,7 +72,7 @@ The input and output matrix sizes are given by the user. We subdivide the input
1. **Tiling to Compute Core Submatrix Chunks:** The input and output matrices stream to/from the AIE compute cores in chunks of size `m`×`k`, `k`×`n` and `m`×`n`. Tiling into these chunks allows each of the compute cores to work concurrently on distinct sub-sections of the input matrices, which improves performance. It also reduces on-chip memory requirements. The final result is re-assembled from the sub-matrix results of all cores.
- > This tiling occurs in the `sequence()` function describing the host-to-memory-tile transfer.
+ > This tiling occurs in the `aie.runtime_sequence()` operation describing the host-to-memory-tile transfer.
We describe it further below, in section *"5. Defining External Data Transfer Sequences"*.
1. **Tiling to Vector Intrinsic Size:** The AIE compute cores calculate the matrix multiplication using efficient "multiply-accumulate" vector intrinsic instructions (`MAC` instructions). These hardware instructions process very small blocks of the matrix: size `r`×`s` blocks of `A` and size `s`×`t` blocks of `B`, producing an output of size `r`×`t` (`C`).
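The two tiling levels described in this hunk can be illustrated in plain NumPy. The block sizes below are hypothetical, and the nested loops are a software stand-in for what the hardware performs with `MAC` vector intrinsics; this is a sketch of the blocking scheme, not the actual microkernel.

```python
import numpy as np

# One m x k by k x n submatrix product, computed as an accumulation of
# r x s by s x t micro-blocks, mimicking the MAC intrinsic's blocking.
m, k, n = 8, 8, 8     # hypothetical submatrix chunk sizes
r, s, t = 4, 4, 4     # hypothetical intrinsic block sizes

a = np.arange(m * k, dtype=np.float64).reshape(m, k)
b = np.arange(k * n, dtype=np.float64).reshape(k, n)
c = np.zeros((m, n))

for i in range(0, m, r):
    for j in range(0, n, t):
        for p in range(0, k, s):
            # One "MAC" step: multiply an r x s block of A by an s x t
            # block of B, accumulating into the r x t output block of C.
            c[i:i + r, j:j + t] += a[i:i + r, p:p + s] @ b[p:p + s, j:j + t]
```

After the loops, `c` equals the full product `a @ b`: accumulating over the `p` (reduction) dimension block by block is exactly how the vector intrinsic builds up each `r`×`t` output block.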
@@ -198,7 +198,7 @@ We define a `core_body()` function for each compute core `i`, inside of which we
### 5. Defining External Data Transfer Sequences
- The function signature of the `sequence()` function lists as its arguments all the external buffers from the host that we wish to read from or write to on the AI Engine's shim tiles. The body of this function describes how these buffers are transferred from and to the host, including tiling the input matrices into `m`×`k` and `k`×`n`-sized sub-matrices, and combining the `m`×`n`-sized output tiles into the larger output `M`×`N` matrix buffer.
+ The signature of the `aie.runtime_sequence()` operation lists as its arguments all the external buffers from the host that we wish to read from or write to on the AI Engine's shim tiles. The body of this operation describes how these buffers are transferred from and to the host, including tiling the input matrices into `m`×`k` and `k`×`n`-sized sub-matrices, and combining the `m`×`n`-sized output tiles into the larger output `M`×`N` matrix buffer.
* The `tile_row_block` variable segments the M (rows of A) into smaller chunks, each containing `rows_per_block` tile rows. This is done so the buffer descriptors (BDs) can be reused for efficient DMA transfers.
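The `tile_row_block` segmentation can be sketched as follows. The numbers are hypothetical and the list-building loop is purely illustrative: the real design reuses a fixed set of buffer descriptors per block rather than materializing index lists.

```python
# Hypothetical sizes: M rows of A, tiled into tile rows of height m.
M, m = 256, 32
n_tile_rows = M // m          # 8 tile rows in this example
rows_per_block = 4            # tile rows handled per reusable BD setup

blocks = []
for tile_row_block in range(0, n_tile_rows, rows_per_block):
    # Tile rows covered by this block; the same BDs are reprogrammed
    # (reused) once per block of DMA transfers.
    block = list(range(tile_row_block,
                       min(tile_row_block + rows_per_block, n_tile_rows)))
    blocks.append(block)
```

With these numbers the 8 tile rows split into 2 blocks of 4, and every tile row is covered exactly once, which is the property the BD-reuse scheme relies on.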