# General Matrix-Vector Multiplication (GEMV)

<div style="text-align: left;">

Example code can be found at `examples/gemv/example_gemv.py`.

General matrix-vector multiplication (GEMV) can be viewed as a specialized case of general matrix-matrix multiplication (GEMM). It plays a critical role in deep learning, especially during the inference phase of large language models. In this tutorial, we will optimize GEMV from a thread-level perspective step by step using `TileLang`.

## Triton Implementation

When implementing a GEMV kernel, you might start with a high-level approach using a tool like `Triton`.

A simple Triton kernel for GEMV might look like this:
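
(The snippet below is an illustrative sketch rather than the exact `_gemv_naive` kernel from `examples/gemv/example_gemv.py`; it assumes `A` is the length-`K` vector, `B` is a row-major `(K, N)` matrix, and each program instance produces `BLOCK_N` outputs.)

```python
import triton
import triton.language as tl


@triton.jit
def _gemv_naive(
    A, B, C,                      # A: (K,) vector, B: (K, N) matrix, C: (N,) output
    N, K,
    BLOCK_N: tl.constexpr,
    BLOCK_K: tl.constexpr,
):
    # Each program instance computes BLOCK_N contiguous elements of C.
    pid = tl.program_id(axis=0)
    offs_n = pid * BLOCK_N + tl.arange(0, BLOCK_N)
    mask_n = offs_n < N

    acc = tl.zeros((BLOCK_N,), dtype=tl.float32)
    for k0 in range(0, K, BLOCK_K):
        offs_k = k0 + tl.arange(0, BLOCK_K)
        mask_k = offs_k < K
        # One BLOCK_K slice of the vector and the matching (BLOCK_K, BLOCK_N) tile of B.
        a = tl.load(A + offs_k, mask=mask_k, other=0.0)
        b = tl.load(
            B + offs_k[:, None] * N + offs_n[None, :],
            mask=mask_k[:, None] & mask_n[None, :],
            other=0.0,
        )
        # Partial dot products for this K slice, reduced over the K axis.
        acc += tl.sum(a.to(tl.float32)[:, None] * b.to(tl.float32), axis=0)

    tl.store(C + offs_n, acc, mask=mask_n)
```

A launch along the lines of `_gemv_naive[(triton.cdiv(N, BLOCK_N),)](a, b, c, N, K, BLOCK_N=128, BLOCK_K=128)` would then cover the whole output vector.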

`Triton` is straightforward to use, as it operates at the block level. However, this approach may not allow for fine-grained thread-level optimization. In this tutorial, we will demonstrate how to write an optimized GEMV kernel in `TileLang` that exposes more low-level control.

## Naive Implementation in TileLang

If you have a basic understanding of CUDA C, it is natural to start with a naive GEMV kernel by adapting a GEMM tiling strategy. You can think of GEMV as a `(1, k) * (k, n)` GEMM. Below is a simple example:
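
(The snippet below is an illustrative sketch rather than the exact kernel from `examples/gemv/example_gemv.py`: it stages one `BLOCK_K` slice of the vector and a `(BLOCK_K, BLOCK_N)` tile of the matrix through shared memory at a time and accumulates in registers. The buffer layouts, block sizes, and the 256-thread launch are assumptions, and `N` and `K` are assumed divisible by the block sizes.)

```python
import tilelang.language as T


def naive_gemv(N, K, BLOCK_N=128, BLOCK_K=64, dtype="float16", accum_dtype="float"):

    @T.prim_func
    def main(
        A: T.Buffer((K,), dtype),      # the (1, k) vector
        B: T.Buffer((K, N), dtype),    # the (k, n) matrix
        C: T.Buffer((N,), dtype),      # the (1, n) output
    ):
        # One thread block per BLOCK_N-wide slice of the output.
        with T.Kernel(T.ceildiv(N, BLOCK_N), threads=256) as bn:
            A_shared = T.alloc_shared((BLOCK_K,), dtype)
            B_shared = T.alloc_shared((BLOCK_K, BLOCK_N), dtype)
            acc = T.alloc_fragment((BLOCK_N, BLOCK_K), accum_dtype)
            c_frag = T.alloc_fragment((BLOCK_N,), accum_dtype)

            T.clear(acc)
            for bk in T.Pipelined(T.ceildiv(K, BLOCK_K), num_stages=2):
                # Stage the current K tile of the vector and the matrix.
                T.copy(A[bk * BLOCK_K:(bk + 1) * BLOCK_K], A_shared)
                T.copy(B[bk * BLOCK_K:(bk + 1) * BLOCK_K, bn * BLOCK_N:(bn + 1) * BLOCK_N], B_shared)
                # Accumulate element-wise products for this tile.
                for n, k in T.Parallel(BLOCK_N, BLOCK_K):
                    acc[n, k] += A_shared[k].astype(accum_dtype) * B_shared[k, n].astype(accum_dtype)
            # Reduce over the K axis and write the BLOCK_N results back.
            T.reduce_sum(acc, c_frag, dim=1)
            T.copy(c_frag, C[bn * BLOCK_N:(bn + 1) * BLOCK_N])

    return main
```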

In this design, the first 128 threads act as the data producer and the last 128 threads act as the consumer.

At this level, we use only a small fraction of the GPU's compute capability: the kernel takes **~0.17 ms**, compared to torch/cuBLAS's **~0.008 ms**, which is roughly 20x slower.

## More Concurrency

To further increase the concurrency of our kernel, we can exploit finer-grained thread-level parallelism. Instead of assigning each thread to compute a single output element of C, we can introduce parallelism along the K dimension: each thread computes a partial accumulation, and these partial results are then combined. This approach requires primitives like `atomicAdd` in CUDA.
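
(Again a sketch rather than the tutorial's exact `naive_splitk_gemv`: the grid is now 2-D, each block reduces a single `(BLOCK_K, BLOCK_N)` tile, and the per-block partial sums are combined with atomic adds. It assumes `C` is a float32 output that is zero-initialized before launch and that the sizes divide evenly.)

```python
import tilelang.language as T


def naive_splitk_gemv(N, K, BLOCK_N=32, BLOCK_K=128, dtype="float16", accum_dtype="float"):

    @T.prim_func
    def main(
        A: T.Buffer((K,), dtype),
        B: T.Buffer((K, N), dtype),
        C: T.Buffer((N,), accum_dtype),   # assumed zero-initialized before launch
    ):
        # 2-D grid: bn tiles the N dimension, bk splits the K dimension.
        with T.Kernel(T.ceildiv(N, BLOCK_N), T.ceildiv(K, BLOCK_K), threads=128) as (bn, bk):
            A_shared = T.alloc_shared((BLOCK_K,), dtype)
            B_shared = T.alloc_shared((BLOCK_K, BLOCK_N), dtype)
            prod = T.alloc_fragment((BLOCK_N, BLOCK_K), accum_dtype)
            partial = T.alloc_fragment((BLOCK_N,), accum_dtype)

            T.copy(A[bk * BLOCK_K:(bk + 1) * BLOCK_K], A_shared)
            T.copy(B[bk * BLOCK_K:(bk + 1) * BLOCK_K, bn * BLOCK_N:(bn + 1) * BLOCK_N], B_shared)
            for n, k in T.Parallel(BLOCK_N, BLOCK_K):
                prod[n, k] = A_shared[k].astype(accum_dtype) * B_shared[k, n].astype(accum_dtype)
            # Partial dot products for this block's K slice ...
            T.reduce_sum(prod, partial, dim=1)
            # ... combined across K blocks with atomic adds.
            for n in T.Parallel(BLOCK_N):
                T.atomic_add(C[bn * BLOCK_N + n], partial[n])

    return main
```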

By introducing parallelism along the K dimension, our kernel now achieves **~0.024 ms**, an improvement, but still not on par with torch/cuBLAS.

### Customizing Parallelism in K Dimension

If your K dimension is large, you can further customize how many elements each thread processes by introducing a `reduce_threads` parameter, so that each thread handles multiple elements per iteration; this is what the tutorial's `splitk_gemv` kernel does.
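
To make the decomposition concrete, here is a plain PyTorch emulation (not TileLang) of how `reduce_threads` lanes would split the K dimension, each producing a partial result that is summed at the end; the shapes below are arbitrary:

```python
import torch


def splitk_reference(A: torch.Tensor, B: torch.Tensor, reduce_threads: int = 32) -> torch.Tensor:
    """Emulate the split: lane t accumulates its own K chunk, partials are summed at the end."""
    K, N = B.shape
    assert K % reduce_threads == 0
    chunk = K // reduce_threads
    partials = torch.stack([
        A[t * chunk:(t + 1) * chunk] @ B[t * chunk:(t + 1) * chunk]   # lane t's partial result
        for t in range(reduce_threads)
    ])
    return partials.sum(dim=0)  # the combination step (atomicAdd / all-reduce in the kernel)


A = torch.randn(4096, dtype=torch.float64)
B = torch.randn(4096, 1024, dtype=torch.float64)
torch.testing.assert_close(splitk_reference(A, B), A @ B)
```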

## Vectorized Reads

GEMV is less compute-intensive than GEMM: its arithmetic intensity is low, so memory throughput becomes the optimization bottleneck. One effective strategy is to use vectorized load/store operations (e.g., `float2`, `float4`). In `TileLang`, you can specify vectorized operations via `T.vectorized`; the tutorial's `splitk_gemv_vectorized` kernel applies this to the split-K kernel above.

With vectorized reads, the kernel now finishes in **~0.0084 ms**, which is getting close to cuBLAS performance.

## `tvm_thread_allreduce` Instead of `atomicAdd`

[`tvm_thread_allreduce`](https://tvm.apache.org/docs/reference/api/python/tir/tir.html#tvm.tir.tvm_thread_allreduce) implements an optimized all-reduce across a group of threads, which should outperform our plain shared-memory + `atomicAdd` combination.

With this optimization, the kernel latency drops from **~0.0084 ms** to **~0.0069 ms**, which is faster than torch/cuBLAS!

## Autotune

`BLOCK_N`, `BLOCK_K`, and `reduce_threads` are hyperparameters of our kernel that can be tuned to improve performance. We can use the `tilelang.autotune` feature to automatically search for optimal configurations.
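
`tilelang.autotune` drives this search for you; purely as an illustration of what is being searched, here is a plain-Python sweep in which `build_and_bench` is a hypothetical placeholder standing in for compiling and timing the kernel with one configuration:

```python
import itertools


def build_and_bench(N, K, BLOCK_N, BLOCK_K, reduce_threads):
    """Hypothetical placeholder: compile the GEMV kernel with this tile
    configuration and return its measured latency in milliseconds."""
    raise NotImplementedError


def grid_search(N, K):
    # Enumerate candidate configurations and keep the fastest one.
    best_cfg, best_ms = None, float("inf")
    for BLOCK_N, BLOCK_K, reduce_threads in itertools.product(
        (32, 64, 128), (64, 128, 256), (8, 16, 32)
    ):
        ms = build_and_bench(N, K, BLOCK_N, BLOCK_K, reduce_threads)
        if ms < best_ms:
            best_cfg, best_ms = (BLOCK_N, BLOCK_K, reduce_threads), ms
    return best_cfg, best_ms
```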