
Commit 158ec0c (parent fb1a40f)

doc: finished overview

5 files changed: 56 additions, 10 deletions

docs_sphinx/chapters/code_generation.rst

Lines changed: 3 additions & 3 deletions

@@ -631,7 +631,7 @@ Unary Primitives
 
 Now we further extend our kernel with primitives.
 Primitives are operation which only operate one one input i.e. B:=op(A).
-We will take a look at the Zero, Identity and ReLu primitives and their transpose variants.
+We will take a look at the Zero, Identity and ReLU primitives and their transpose variants.
 
 Zero Primitive
 ^^^^^^^^^^^^^^

@@ -858,13 +858,13 @@ because we already load and store the transposed matrix.
 - **BM_unary_identity_transpose/M:512/N:512** kernel: :math:`4.409` GiB/s
 - **BM_unary_identity_transpose/M:2048/N:2048** kernel: :math:`3.817` GiB/s
 
-ReLu Primitive
+ReLU Primitive
 ^^^^^^^^^^^^^^
 
 1. generate
 """""""""""
 
-**Task**: Extend the implementation of the ``mini_jit::Unary::generate`` function to support the ReLu primitive.
+**Task**: Extend the implementation of the ``mini_jit::Unary::generate`` function to support the ReLU primitive.
 
 Files: ``unary_relu.cpp`` & ``unary_relu_transpose.cpp``

docs_sphinx/chapters/einsum_trees.rst

Lines changed: 3 additions & 5 deletions

@@ -547,8 +547,6 @@ On the three example we get the following performance:
 BM_einsum_tree_optimize_third_example/config:4/optimize:1/min_warmup_time:0.300_stddev 853382 ns 535716 ns 10 1.23652G/s
 BM_einsum_tree_optimize_third_example/config:4/optimize:1/min_warmup_time:0.300_cv 0.70 % 0.45 % 10 0.44%
 
-**First Example:** 142.7 GFLOPS
-
-**Second Example:** 276.9 GFLOPS
-
-**Third Example:** 277.8 GFLOPS
+- **First Example:** 142.7 GFLOPS
+- **Second Example:** 276.9 GFLOPS
+- **Third Example:** 277.8 GFLOPS

docs_sphinx/chapters/overview.rst

Lines changed: 36 additions & 0 deletions

@@ -52,3 +52,39 @@ by introducing loops over the *K*, *M*, and *N* dimensions over our microkernel.
 We then address edge cases where the *M* dimension of the matrix is not a multiple of 4, a prerequirement assumed up to this point.
 After that, we extend the microkernel to support batch-reduced matrix multiplication, a widely used operation in machine learning workloads.
 Finally, we explore how to transpose a matrix in memory using Neon instructions on a fixed-sized :math:`8 \times 8` matrix.
+
+Code Generation
+---------------
+
+In this chapter, we take a look at how to JIT (Just-In-Time) generate the kernels we wrote in the :doc:`neon` chapter.
+Furthermore, we dynamically adjust the generated machine code to perform arbitrary-sized matrix multiplications using fast kernels.
+
+To begin, we wrap the necessary machine code instructions in an assembly-like manner using C++ functions to make kernel generation easier.
+We then implement kernel generation for Batch-Reduce General Matrix-Matrix Multiplication (BRGEMM) and unary operations for zero, identity and ReLU.
+Finally, we measure the performance of our generated kernels across different size configurations.
+
+Tensor Operation
+----------------
+
+This chapter introduces an additional layer of abstraction to code generation by describing higher-level tensor operations.
+We therefore examine how to generate the correct kernel based on a provided tensor configuration object, i.e. the abstraction.
+This object describes the operation via parameters such as the size and type of the dimensions, the execution type, and the strides of the involved tensors, all of which are required to generate and execute a kernel.
+Furthermore, we perform optimization passes such as primitive and shared identification, dimension splitting, dimension fusion and dimension reordering.
+These optimizations help to boost the performance of the generated kernel for a given tensor operation.
+
+Einsum Tree
+-----------
+
+In this chapter, we introduce an additional layer of abstraction by defining a tree representation of multiple chained contractions on a set of two or more input tensors.
+We therefore process a string representation of nested tensor operations alongside a list of the dimension sizes of the tensors used.
+We then generate a tree representation from these input values, where each non-leaf node represents a single tensor operation. These operations are lowered to kernels, as described in the :doc:`tensor_operations` chapter.
+Furthermore, we optimize this tree representation by applying the optimization passes Swap, Reorder and Permutation Insert to nodes of the tree.
+
+Individual Phase
+----------------
+
+In the final chapter, we develop a plan for the further development of the project.
+We create a draft for converting the project into a CMake library with a convenient tensor interface.
+We then provide a step-by-step description of how we converted our project into a CMake library.
+We also present our library interface, which defines a high-level tensor structure and operations such as unary, GEMM, contraction and Einsum expressions.
+Finally, to help users work with our library, we provide an example project that uses all the tensor operations, as well as extensive documentation with examples.

docs_sphinx/chapters/tensor_operations.rst

Lines changed: 12 additions & 0 deletions

@@ -407,6 +407,10 @@ Performance Benchmarking
 BM_tensor_Zero+BRGEMM/size_a:262144/size_b:262144/size_c:1048576/config:3/min_warmup_time:0.300_stddev 8350 ns 7959 ns 10 217.4M/s
 BM_tensor_Zero+BRGEMM/size_a:262144/size_b:262144/size_c:1048576/config:3/min_warmup_time:0.300_cv 0.19 % 0.18 % 10 0.18%
 
+.. raw:: html
+
+   <hr>
+
 - Last: Relu
 - A: 8388608, B: 8192, C: 8388608

@@ -421,6 +425,10 @@ Performance Benchmarking
 BM_tensor_Relu/size_a:8388608/size_b:8192/size_c:8388608/config:4/min_warmup_time:0.300_stddev 11637 ns 11124 ns 10 65.7127M/s
 BM_tensor_Relu/size_a:8388608/size_b:8192/size_c:8388608/config:4/min_warmup_time:0.300_cv 0.69 % 0.66 % 10 0.66%
 
+.. raw:: html
+
+   <hr>
+
 - Main: BRGEMM & Last: RELU
 - A: 262144, B: 262144, C: 1048576

@@ -436,6 +444,10 @@ Performance Benchmarking
 BM_tensor_BRGEMM+RELU/size_a:262144/size_b:262144/size_c:1048576/config:5/min_warmup_time:0.300_stddev 9309 ns 9001 ns 10 243.248M/s
 BM_tensor_BRGEMM+RELU/size_a:262144/size_b:262144/size_c:1048576/config:5/min_warmup_time:0.300_cv 0.21 % 0.20 % 10 0.20%
 
+.. raw:: html
+
+   <hr>
+
 - Main: BRGEMM & Last: RELU
 - A: 524288, B: 524288, C: 1048576

docs_sphinx/submissions/report_25_05_22.rst

Lines changed: 2 additions & 2 deletions

@@ -252,10 +252,10 @@ File: ``unary_identity.cpp`` & File: ``unary_identity_transpose.cpp``
 - **BM_unary_identity_transpose/M:512/N:512** kernel: :math:`4.409` GiB/s
 - **BM_unary_identity_transpose/M:2048/N:2048** kernel: :math:`3.817` GiB/s
 
-ReLu Primitive
+ReLU Primitive
 ^^^^^^^^^^^^^^
 
-1. mini_jit::Unary::generate function to support the ReLu primitive
+1. mini_jit::Unary::generate function to support the ReLU primitive
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 File: ``unary_relu.cpp`` & File: ``unary_relu_transpose.cpp``
