@@ -183,10 +183,10 @@ These 3 different ``fmla`` blocks get repeated with ``.rept 2`` to achieve the
 
 **Benchmarks**
 
-We run the benchmark with the following command:
+We run the benchmark with the following command:
 
-.. code-block::
-
+.. code-block::
+
     ./benchmarks --benchmark_counters_tabular=true --benchmark_repetitions=10 --benchmark_report_aggregates_only=true
 
 Therefore we do 10 repetitions of the benchmark which do about ``120 000 000`` iterations each on our matmul kernels.
@@ -197,17 +197,17 @@ Therefore we do 10 repetitions of the benchmark which do about ``120 000 000`` i
 ----------------------------------------------------------------------------------------------------------------------------------
 Benchmark Time CPU Iterations FLOPS
 ----------------------------------------------------------------------------------------------------------------------------------
-Gemm16x6x1Fixture/BM_matmul_16_6_1_simple/min_warmup_time:1.000_mean 5.89 ns 5.87 ns 10 32.7048G/s
-Gemm16x6x1Fixture/BM_matmul_16_6_1_simple/min_warmup_time:1.000_median 5.89 ns 5.87 ns 10 32.7228G/s
-Gemm16x6x1Fixture/BM_matmul_16_6_1_simple/min_warmup_time:1.000_stddev 0.046 ns 0.044 ns 10 244.331M/s
-Gemm16x6x1Fixture/BM_matmul_16_6_1_simple/min_warmup_time:1.000_cv 0.77 % 0.75 % 10 0.75 %
-Gemm16x6x1Fixture/BM_matmul_16_6_1_unrolled/min_warmup_time:1.000_mean 5.74 ns 5.72 ns 10 33.5453G/s
-Gemm16x6x1Fixture/BM_matmul_16_6_1_unrolled/min_warmup_time:1.000_median 5.73 ns 5.71 ns 10 33.6103G/s
-Gemm16x6x1Fixture/BM_matmul_16_6_1_unrolled/min_warmup_time:1.000_stddev 0.051 ns 0.050 ns 10 291.918M/s
-Gemm16x6x1Fixture/BM_matmul_16_6_1_unrolled/min_warmup_time:1.000_cv 0.90 % 0.88 % 10 0.87 %
+Gemm16x6x1Fixture/BM_matmul_16_6_1_simple/min_warmup_time:1.000_mean 5.84 ns 5.82 ns 10 33.0036G/s
+Gemm16x6x1Fixture/BM_matmul_16_6_1_simple/min_warmup_time:1.000_median 5.83 ns 5.81 ns 10 33.0317G/s
+Gemm16x6x1Fixture/BM_matmul_16_6_1_simple/min_warmup_time:1.000_stddev 0.025 ns 0.025 ns 10 143.339M/s
+Gemm16x6x1Fixture/BM_matmul_16_6_1_simple/min_warmup_time:1.000_cv 0.43 % 0.44 % 10 0.43 %
+Gemm16x6x1Fixture/BM_matmul_16_6_1_unrolled/min_warmup_time:1.000_mean 5.71 ns 5.69 ns 10 33.7234G/s
+Gemm16x6x1Fixture/BM_matmul_16_6_1_unrolled/min_warmup_time:1.000_median 5.70 ns 5.68 ns 10 33.7732G/s
+Gemm16x6x1Fixture/BM_matmul_16_6_1_unrolled/min_warmup_time:1.000_stddev 0.038 ns 0.038 ns 10 224.892M/s
+Gemm16x6x1Fixture/BM_matmul_16_6_1_unrolled/min_warmup_time:1.000_cv 0.67 % 0.67 % 10 0.67 %
 
-We see that the simple first implementation of our matmul kernel gets about **32.7 GFLOPS**.
-The optimized unrolled version gets about 0.8 GFLOPS more resulting in **33.5 GFLOPS**.
+We see that the simple first implementation of our matmul kernel gets about **33.0 GFLOPS**.
+The optimized unrolled version gets about 0.7 GFLOPS more resulting in **33.7 GFLOPS**.
 
 
 Loops
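
The first table above is produced by Google Benchmark fixtures: the ``min_warmup_time:1.000`` suffix and the ``_mean``/``_median``/``_stddev``/``_cv`` rows are what ``--benchmark_repetitions=10`` together with ``--benchmark_report_aggregates_only=true`` emit. A minimal sketch of how such a fixture and its FLOPS rate counter could be wired up is shown below; the kernel signature, the operand layout, and the per-call FLOP count of 2 · 16 · 6 · 1 = 192 (which matches roughly 33.7 GFLOPS at about 5.7 ns per call) are assumptions for illustration, not taken from the repository.

.. code-block:: cpp

    #include <cstdint>
    #include <vector>

    #include <benchmark/benchmark.h>

    // Assumed signature of the assembly microkernel: C(16x6) += A(16x1) * B(1x6),
    // all operands column-major with leading dimensions lda/ldb/ldc.
    extern "C" void matmul_16_6_1(float const* a, float const* b, float* c,
                                  int64_t lda, int64_t ldb, int64_t ldc);

    class Gemm16x6x1Fixture : public benchmark::Fixture {
     public:
      void SetUp(const benchmark::State&) override {
        a.assign(16, 1.0f);      // A is 16x1
        b.assign(6, 1.0f);       // B is 1x6
        c.assign(16 * 6, 0.0f);  // C is 16x6
      }
      std::vector<float> a, b, c;
    };

    BENCHMARK_DEFINE_F(Gemm16x6x1Fixture, BM_matmul_16_6_1_unrolled)(benchmark::State& state) {
      for (auto _ : state) {
        matmul_16_6_1(a.data(), b.data(), c.data(), /*lda=*/16, /*ldb=*/1, /*ldc=*/16);
        benchmark::DoNotOptimize(c.data());
        benchmark::ClobberMemory();
      }
      // 16 * 6 * 1 = 96 fused multiply-adds, i.e. 2 * 96 = 192 FLOPs per call;
      // kIsRate turns the total into the per-second figure in the FLOPS column.
      state.counters["FLOPS"] = benchmark::Counter(
          192.0 * static_cast<double>(state.iterations()), benchmark::Counter::kIsRate);
    }
    BENCHMARK_REGISTER_F(Gemm16x6x1Fixture, BM_matmul_16_6_1_unrolled)->MinWarmUpTime(1.0);

    BENCHMARK_MAIN();
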
@@ -395,7 +395,7 @@ Loops
 
 **Optimization**
 
-Usage of already optmiized `matmul_16_6_1` from task 2.
+Usage of already optimized `matmul_16_6_1` from task 2.
 
 **Benchmarks**
 
@@ -412,20 +412,20 @@ We run the benchmark with the following command:
 ----------------------------------------------------------------------------------------------------------------------------------
 Benchmark Time CPU Iterations FLOPS
 ----------------------------------------------------------------------------------------------------------------------------------
-GemmMxNxKFixture<16, 6, 64>/BM_matmul_16_6_64/min_warmup_time:1.000_mean 396 ns 396 ns 10 31.0266G/s
-GemmMxNxKFixture<16, 6, 64>/BM_matmul_16_6_64/min_warmup_time:1.000_median 396 ns 396 ns 10 31.0281G/s
-GemmMxNxKFixture<16, 6, 64>/BM_matmul_16_6_64/min_warmup_time:1.000_stddev 0.069 ns 0.057 ns 10 4.50274M/s
-GemmMxNxKFixture<16, 6, 64>/BM_matmul_16_6_64/min_warmup_time:1.000_cv 0.02 % 0.01 % 10 0.01 %
-GemmMxNxKFixture<64, 6, 64>/BM_matmul_64_6_64/min_warmup_time:1.000_mean 1728 ns 1728 ns 10 28.4438G/s
-GemmMxNxKFixture<64, 6, 64>/BM_matmul_64_6_64/min_warmup_time:1.000_median 1728 ns 1728 ns 10 28.4445G/s
-GemmMxNxKFixture<64, 6, 64>/BM_matmul_64_6_64/min_warmup_time:1.000_stddev 0.115 ns 0.106 ns 10 1.7484M/s
-GemmMxNxKFixture<64, 6, 64>/BM_matmul_64_6_64/min_warmup_time:1.000_cv 0.01 % 0.01 % 10 0.01 %
-GemmMxNxKFixture<64, 48, 64>/BM_matmul_64_48_64/min_warmup_time:1.000_mean 13078 ns 13077 ns 10 22.5524G/s
-GemmMxNxKFixture<64, 48, 64>/BM_matmul_64_48_64/min_warmup_time:1.000_median 13078 ns 13077 ns 10 22.552G/s
-GemmMxNxKFixture<64, 48, 64>/BM_matmul_64_48_64/min_warmup_time:1.000_stddev 1.83 ns 1.60 ns 10 2.76464M/s
-GemmMxNxKFixture<64, 48, 64>/BM_matmul_64_48_64/min_warmup_time:1.000_cv 0.01 % 0.01 % 10 0.01 %
-
-
-- Mean FLOPS for loop over K: **31.0 GFLOPS**.
-- Mean FLOPS for loop over M: **28.4 GFLOPS**.
-- Mean FLOPS for loop over N: **22.6 GFLOPS**.
+GemmMxNxKFixture<16, 6, 64>/BM_matmul_16_6_64/min_warmup_time:1.000_mean 368 ns 367 ns 10 33.4632G/s
+GemmMxNxKFixture<16, 6, 64>/BM_matmul_16_6_64/min_warmup_time:1.000_median 368 ns 367 ns 10 33.5034G/s
+GemmMxNxKFixture<16, 6, 64>/BM_matmul_16_6_64/min_warmup_time:1.000_stddev 1.78 ns 1.75 ns 10 158.697M/s
+GemmMxNxKFixture<16, 6, 64>/BM_matmul_16_6_64/min_warmup_time:1.000_cv 0.48 % 0.48 % 10 0.47 %
+GemmMxNxKFixture<64, 6, 64>/BM_matmul_64_6_64/min_warmup_time:1.000_mean 1526 ns 1520 ns 10 32.3285G/s
+GemmMxNxKFixture<64, 6, 64>/BM_matmul_64_6_64/min_warmup_time:1.000_median 1526 ns 1520 ns 10 32.3321G/s
+GemmMxNxKFixture<64, 6, 64>/BM_matmul_64_6_64/min_warmup_time:1.000_stddev 10.2 ns 9.97 ns 10 211.542M/s
+GemmMxNxKFixture<64, 6, 64>/BM_matmul_64_6_64/min_warmup_time:1.000_cv 0.67 % 0.66 % 10 0.65 %
+GemmMxNxKFixture<64, 48, 64>/BM_matmul_64_48_64/min_warmup_time:1.000_mean 12177 ns 12135 ns 10 24.3028G/s
+GemmMxNxKFixture<64, 48, 64>/BM_matmul_64_48_64/min_warmup_time:1.000_median 12167 ns 12126 ns 10 24.3211G/s
+GemmMxNxKFixture<64, 48, 64>/BM_matmul_64_48_64/min_warmup_time:1.000_stddev 54.9 ns 54.1 ns 10 107.995M/s
+GemmMxNxKFixture<64, 48, 64>/BM_matmul_64_48_64/min_warmup_time:1.000_cv 0.45 % 0.45 % 10 0.44 %
+
+
+- Mean FLOPS for loop over K: **33.5 GFLOPS**.
+- Mean FLOPS for loop over M: **32.3 GFLOPS**.
+- Mean FLOPS for loop over N: **24.3 GFLOPS**.
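
The three results above correspond to progressively larger loop nests around the ``matmul_16_6_1`` microkernel from task 2: a K loop gives ``matmul_16_6_64``, an M loop around that gives ``matmul_64_6_64``, and an N loop around that gives ``matmul_64_48_64``. A C-level sketch of that nesting is shown below; the real kernels are presumably written directly in assembly, and the column-major layout and leading-dimension arguments are assumptions for illustration.

.. code-block:: cpp

    #include <cstdint>

    // Assumed C-level view of the task-2 microkernel:
    // C(16x6) += A(16x1) * B(1x6), column-major, leading dimensions lda/ldb/ldc.
    extern "C" void matmul_16_6_1(float const* a, float const* b, float* c,
                                  int64_t lda, int64_t ldb, int64_t ldc);

    // Loop over K: C(16x6) += A(16x64) * B(64x6).
    // Step k adds the rank-1 update of A's k-th column with B's k-th row.
    void matmul_16_6_64(float const* a, float const* b, float* c,
                        int64_t lda, int64_t ldb, int64_t ldc) {
      for (int64_t k = 0; k < 64; ++k) {
        matmul_16_6_1(a + k * lda, b + k, c, lda, ldb, ldc);
      }
    }

    // Loop over M around the K loop: C(64x6) += A(64x64) * B(64x6),
    // handling 16 rows of A and C per iteration.
    void matmul_64_6_64(float const* a, float const* b, float* c,
                        int64_t lda, int64_t ldb, int64_t ldc) {
      for (int64_t m = 0; m < 64; m += 16) {
        matmul_16_6_64(a + m, b, c + m, lda, ldb, ldc);
      }
    }

    // Loop over N around the M loop: C(64x48) += A(64x64) * B(64x48),
    // handling 6 columns of B and C per iteration.
    void matmul_64_48_64(float const* a, float const* b, float* c,
                         int64_t lda, int64_t ldb, int64_t ldc) {
      for (int64_t n = 0; n < 48; n += 6) {
        matmul_64_6_64(a, b + n * ldb, c + n * ldc, lda, ldb, ldc);
      }
    }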