docs_sphinx/chapters/overview.rst
@@ -52,3 +52,39 @@ by introducing loops over the *K*, *M*, and *N* dimensions over our microkernel.
We then address edge cases in which the *M* dimension of the matrix is not a multiple of 4, a precondition assumed up to this point.
After that, we extend the microkernel to support batch-reduced matrix multiplication (defined below), a widely used operation in machine learning workloads.
Finally, we explore how to transpose a fixed-size :math:`8\times8` matrix in memory using Neon instructions.
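
For reference, the batch-reduced operation mentioned above accumulates the products of a whole batch of matrix pairs into a single result matrix; a common way to write it (the exact convention used later may differ slightly) is

.. math::

   C = C + \sum_{i} A_{i} \, B_{i}.

The following sketch illustrates the transpose idea on a smaller :math:`4\times4` block using a de-interleaving load; the chapter itself works on :math:`8\times8` blocks with dedicated permute instructions, so treat this only as a rough analogue.

.. code-block:: cpp

   #include <arm_neon.h>

   // Transpose a row-major 4x4 float block.
   // vld4q_f32 de-interleaves with a stride of four elements, so each resulting
   // vector holds one column of the input, i.e. one row of the transpose.
   void transpose_4x4(const float *in, float *out) {
       float32x4x4_t cols = vld4q_f32(in);
       vst1q_f32(out + 0,  cols.val[0]);
       vst1q_f32(out + 4,  cols.val[1]);
       vst1q_f32(out + 8,  cols.val[2]);
       vst1q_f32(out + 12, cols.val[3]);
   }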

Code Generation
---------------

In this chapter, we look at how to JIT (Just-In-Time) generate the kernels we wrote in the :doc:`neon` chapter.
Furthermore, we dynamically adjust the generated machine code to perform arbitrary-sized matrix multiplications using fast kernels.

To begin, we wrap the necessary machine-code instructions in assembly-like C++ functions to make kernel generation easier (a minimal sketch of this idea follows this paragraph).
We then implement kernel generation for Batch-Reduce General Matrix-Matrix Multiplication (BRGEMM) and for the unary operations zero, identity, and ReLU.
Finally, we measure the performance of our generated kernels across different size configurations.
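
As a rough illustration of what such an instruction wrapper and the surrounding JIT machinery could look like (all names, and the choice of a trivial add kernel, are our own illustration, not the project's actual generator):

.. code-block:: cpp

   #include <cstdint>
   #include <cstring>
   #include <vector>
   #include <sys/mman.h>

   // Each helper returns the 32-bit AArch64 encoding of one instruction.
   namespace inst {
     // ADD Xd, Xn, Xm (64-bit, shifted-register form, zero shift)
     inline uint32_t add(uint32_t rd, uint32_t rn, uint32_t rm) {
       return 0x8B000000u | (rm << 16) | (rn << 5) | rd;
     }
     // RET (return through X30)
     inline uint32_t ret() { return 0xD65F03C0u; }
   }

   int main() {
     std::vector<uint32_t> code = { inst::add(0, 0, 1),   // add x0, x0, x1
                                    inst::ret() };

     // Copy the emitted words into a writable page, flush the instruction
     // cache and flip the page to executable (POSIX/AArch64 assumed,
     // error handling omitted for brevity).
     size_t size = code.size() * sizeof(uint32_t);
     void *mem = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
     std::memcpy(mem, code.data(), size);
     __builtin___clear_cache(static_cast<char *>(mem),
                             static_cast<char *>(mem) + size);
     mprotect(mem, size, PROT_READ | PROT_EXEC);

     auto kernel = reinterpret_cast<int64_t (*)(int64_t, int64_t)>(mem);
     int64_t result = kernel(40, 2);   // 42
     (void)result;
     munmap(mem, size);
     return 0;
   }

A real GEMM kernel consists of loop nests and FMA instructions rather than a single add, but the write, cache-flush, protect, and call pattern stays the same.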

Tensor Operation
----------------

This chapter introduces an additional layer of abstraction on top of code generation by describing higher-level tensor operations.
To this end, we examine how to generate the correct kernel from a provided tensor configuration object, which serves as this abstraction.
This object describes the parameters required to generate and execute a kernel, such as the size and type of each dimension, the execution type, and the strides of the involved tensors (a sketch of such an object follows this paragraph).
Furthermore, we perform optimization passes such as the identification of primitive and shared dimensions, dimension splitting, dimension fusion, and dimension reordering.
These optimizations help to boost the performance of the generated kernel for a given tensor operation.
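
As a rough sketch of what such a configuration object could contain (all type and field names below are placeholders chosen for illustration, not necessarily the project's actual interface):

.. code-block:: cpp

   #include <cstdint>
   #include <vector>

   enum class dim_t   { c, m, n, k };                // dimension type (shared, M, N, K)
   enum class exec_t  { seq, prim };                 // executed as a loop or inside the primitive
   enum class ptype_t { none, zero, identity, relu, gemm, brgemm };

   struct TensorConfig {
     ptype_t first_touch    = ptype_t::none;         // unary applied before the main primitive
     ptype_t main_primitive = ptype_t::gemm;         // kernel to generate
     ptype_t last_touch     = ptype_t::none;         // unary applied after the main primitive
     std::vector<dim_t>   dim_types;                 // type of every dimension
     std::vector<exec_t>  exec_types;                // execution type of every dimension
     std::vector<int64_t> dim_sizes;                 // size of every dimension
     std::vector<int64_t> strides_in0;               // strides of the first input tensor
     std::vector<int64_t> strides_in1;               // strides of the second input tensor
     std::vector<int64_t> strides_out;               // strides of the output tensor
   };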

Einsum Tree
-----------

In this chapter, we introduce an additional layer of abstraction by defining a tree representation of multiple chained contractions over a set of two or more input tensors (an index-notation example follows this paragraph).
To do so, we parse a string representation of nested tensor operations together with a list of the dimension sizes of the tensors involved.
We then generate a tree representation from these inputs, where each non-leaf node represents a single tensor operation. These operations are lowered to kernels as described in the :doc:`tensor_operations` chapter.
Furthermore, we optimize this tree representation by applying the optimization passes Swap, Reorder, and Permutation Insert to individual nodes of the tree.
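
To give a flavour of what such a nested expression describes, consider two chained contractions written in index notation (a generic example, independent of the project's concrete string syntax):

.. math::

   T_{a,c} = \sum_{b} A_{a,b} \, B_{b,c}, \qquad
   D_{a,d} = \sum_{c} T_{a,c} \, C_{c,d}

In the corresponding tree, :math:`A`, :math:`B` and :math:`C` are leaf nodes, while the two contractions form the non-leaf nodes that are lowered to tensor-operation kernels.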

Individual Phase
----------------

In the final chapter, we outline a plan for the further development of the project.
We first create a draft for converting the project into a CMake library with a convenient tensor interface.
We then provide a step-by-step description of how we converted our project into a CMake library.
We also present our library interface, which defines a high-level tensor structure and operations such as unary operations, GEMM, contractions, and Einsum expressions (a purely illustrative sketch is given below).
Finally, to help users work with our library, we provide an example project that uses all the tensor operations, as well as extensive documentation with examples.
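
Purely as an illustration of what such an interface could declare (every identifier below is a placeholder and not taken from the actual library):

.. code-block:: cpp

   #include <cstdint>
   #include <vector>

   namespace tensorlib {  // placeholder namespace

   // Minimal tensor handle: dimension sizes plus a raw data pointer.
   struct Tensor {
     std::vector<int64_t> sizes;
     float *data = nullptr;
   };

   enum class UnaryType { zero, identity, relu };

   // Element-wise unary operation: out = op(in).
   void unary(const Tensor &in, Tensor &out, UnaryType op);

   // General matrix-matrix multiplication: C += A * B.
   void gemm(const Tensor &a, const Tensor &b, Tensor &c);

   // Binary tensor contraction over the dimensions shared by a and b.
   void contraction(const Tensor &a, const Tensor &b, Tensor &c);

   // Evaluation of a full Einsum expression over the given input tensors.
   void einsum(const char *expr, const std::vector<Tensor> &in, Tensor &out);

   }  // namespace tensorlib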