
Commit 158ec0c (parent fb1a40f)

doc: finished overview

5 files changed: 56 additions, 10 deletions

docs_sphinx/chapters/code_generation.rst

Lines changed: 3 additions & 3 deletions

@@ -631,7 +631,7 @@ Unary Primitives
 
 Now we further extend our kernel with primitives.
 Primitives are operation which only operate one one input i.e. B:=op(A).
-We will take a look at the Zero, Identity and ReLu primitives and their transpose variants.
+We will take a look at the Zero, Identity and ReLU primitives and their transpose variants.
 
 Zero Primitive
 ^^^^^^^^^^^^^^

@@ -858,13 +858,13 @@ because we already load and store the transposed matrix.
 - **BM_unary_identity_transpose/M:512/N:512** kernel: :math:`4.409` GiB/s
 - **BM_unary_identity_transpose/M:2048/N:2048** kernel: :math:`3.817` GiB/s
 
-ReLu Primitive
+ReLU Primitive
 ^^^^^^^^^^^^^^
 
 1. generate
 """""""""""
 
-**Task**: Extend the implementation of the ``mini_jit::Unary::generate`` function to support the ReLu primitive.
+**Task**: Extend the implementation of the ``mini_jit::Unary::generate`` function to support the ReLU primitive.
 
 Files: ``unary_relu.cpp`` & ``unary_relu_transpose.cpp``

docs_sphinx/chapters/einsum_trees.rst

Lines changed: 3 additions & 5 deletions

@@ -547,8 +547,6 @@ On the three example we get the following performance:
 BM_einsum_tree_optimize_third_example/config:4/optimize:1/min_warmup_time:0.300_stddev 853382 ns 535716 ns 10 1.23652G/s
 BM_einsum_tree_optimize_third_example/config:4/optimize:1/min_warmup_time:0.300_cv 0.70 % 0.45 % 10 0.44%
 
-**First Example:** 142.7 GFLOPS
-
-**Second Example:** 276.9 GFLOPS
-
-**Third Example:** 277.8 GFLOPS
+- **First Example:** 142.7 GFLOPS
+- **Second Example:** 276.9 GFLOPS
+- **Third Example:** 277.8 GFLOPS

docs_sphinx/chapters/overview.rst

Lines changed: 36 additions & 0 deletions

@@ -52,3 +52,39 @@ by introducing loops over the *K*, *M*, and *N* dimensions over our microkernel.
 We then address edge cases where the *M* dimension of the matrix is not a multiple of 4, a prerequirement assumed up to this point.
 After that, we extend the microkernel to support batch-reduced matrix multiplication, a widely used operation in machine learning workloads.
 Finally, we explore how to transpose a matrix in memory using Neon instructions on a fixed-sized :math:`8 \times 8` matrix.
+
+Code Generation
+---------------
+
+In this chapter, we take a look at how to JIT (Just-In-Time) generate the kernels we wrote in the :doc:`neon` chapter.
+Furthermore, we dynamically adjust the generated machine code to perform arbitrary-sized matrix multiplications using fast kernels.
+
+To begin, we wrap the necessary machine code instructions in an assembly-like manner using C++ functions to make kernel generation easier.
+We then implement kernel generation for Batch-Reduce General Matrix-Matrix Multiplication (BRGEMM) and unary operations for zero, identity and ReLU.
+Finally, we measure the performance of our generated kernels across different size configurations.
+
+Tensor Operation
+----------------
+
+This chapter introduces an additional layer of abstraction to code generation by describing higher-level tensor operations.
+We therefore examine how to generate the correct kernel based on a provided tensor configuration object, i.e. the abstraction.
+This object describes the operation via parameters such as the size and type of the dimensions, the execution type, and the strides of the involved tensors, all of which are required to generate and execute a kernel.
+Furthermore, we perform optimization passes such as primitive and shared identification, dimension splitting, dimension fusion and dimension reordering.
+These optimizations help to boost the performance of the generated kernel for a given tensor operation.
+
+Einsum Tree
+-----------
+
+In this chapter, we introduce an additional layer of abstraction by defining a tree representation of multiple chained contractions on a set of two or more input tensors.
+We therefore process a string representation of nested tensor operations alongside a list of the dimension sizes of the tensors used.
+We then generate a tree representation from these input values, where each non-leaf node represents a single tensor operation. These operations are lowered to kernels, as described in the :doc:`tensor_operations` chapter.
+Furthermore, we optimize this tree representation by applying the optimization passes Swap, Reorder and Permutation Insert to nodes of the tree.
+
+Individual Phase
+----------------
+
+In the final chapter, we develop a plan for the further development of the project.
+We create a draft for converting the project into a CMake library with a convenient tensor interface.
+We then provide a step-by-step description of how we converted our project into a CMake library.
+We also present our library interface, which defines a high-level tensor structure and operations such as unary, GEMM, contraction and Einsum expressions.
+Finally, to help users work with our library, we provide an example project that uses all the tensor operations, as well as extensive documentation with examples.

docs_sphinx/chapters/tensor_operations.rst

Lines changed: 12 additions & 0 deletions

@@ -407,6 +407,10 @@ Performance Benchmarking
 BM_tensor_Zero+BRGEMM/size_a:262144/size_b:262144/size_c:1048576/config:3/min_warmup_time:0.300_stddev 8350 ns 7959 ns 10 217.4M/s
 BM_tensor_Zero+BRGEMM/size_a:262144/size_b:262144/size_c:1048576/config:3/min_warmup_time:0.300_cv 0.19 % 0.18 % 10 0.18%
 
+.. raw:: html
+
+   <hr>
+
 - Last: Relu
 - A: 8388608, B: 8192, C: 8388608

@@ -421,6 +425,10 @@ Performance Benchmarking
 BM_tensor_Relu/size_a:8388608/size_b:8192/size_c:8388608/config:4/min_warmup_time:0.300_stddev 11637 ns 11124 ns 10 65.7127M/s
 BM_tensor_Relu/size_a:8388608/size_b:8192/size_c:8388608/config:4/min_warmup_time:0.300_cv 0.69 % 0.66 % 10 0.66%
 
+.. raw:: html
+
+   <hr>
+
 - Main: BRGEMM & Last: RELU
 - A: 262144, B: 262144, C: 1048576

@@ -436,6 +444,10 @@ Performance Benchmarking
 BM_tensor_BRGEMM+RELU/size_a:262144/size_b:262144/size_c:1048576/config:5/min_warmup_time:0.300_stddev 9309 ns 9001 ns 10 243.248M/s
 BM_tensor_BRGEMM+RELU/size_a:262144/size_b:262144/size_c:1048576/config:5/min_warmup_time:0.300_cv 0.21 % 0.20 % 10 0.20%
 
+.. raw:: html
+
+   <hr>
+
 - Main: BRGEMM & Last: RELU
 - A: 524288, B: 524288, C: 1048576

docs_sphinx/submissions/report_25_05_22.rst

Lines changed: 2 additions & 2 deletions

@@ -252,10 +252,10 @@ File: ``unary_identity.cpp`` & File: ``unary_identity_transpose.cpp``
 - **BM_unary_identity_transpose/M:512/N:512** kernel: :math:`4.409` GiB/s
 - **BM_unary_identity_transpose/M:2048/N:2048** kernel: :math:`3.817` GiB/s
 
-ReLu Primitive
+ReLU Primitive
 ^^^^^^^^^^^^^^
 
-1. mini_jit::Unary::generate function to support the ReLu primitive
+1. mini_jit::Unary::generate function to support the ReLU primitive
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 File: ``unary_relu.cpp`` & File: ``unary_relu_transpose.cpp``
