
Commit 966f533

Update docs

1 parent b1565df commit 966f533

File tree

139 files changed: +59 additions, −993 deletions


_sources/deeplearning_operators/gemv.md.txt (11 additions, 11 deletions)
@@ -1,4 +1,4 @@
-General Matrix-Vector Multiplication (GEMV)
+# General Matrix-Vector Multiplication (GEMV)
 ===========================================
 
 <div style="text-align: left;">
@@ -16,7 +16,7 @@ Example code can be found at `examples/gemv/example_gemv.py`.
 
 General matrix-vector multiplication (GEMV) can be viewed as a specialized case of general matrix-matrix multiplication (GEMM). It plays a critical role in deep learning, especially during the inference phase of large language models. In this tutorial, we will optimize GEMV from a thread-level perspective step by step using `TileLang`.
 
-# Triton implementation
+## Triton Implementation
 When implementing a GEMV kernel, you might start with a high-level approach using a tool like `Triton`.
 
 A simple Triton kernel for GEMV might look like this:
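The kernel body of `_gemv_naive` is elided in this hunk. For context, the semantics any such kernel must match can be sketched in NumPy (names here are illustrative, not the tutorial's code):

```python
import numpy as np

# Reference GEMV: treating the vector as a (1, k) matrix, C = A @ B.
# The explicit loops mirror the naive kernel's structure: one output
# element per "thread", with a serial reduction over K.
def gemv_reference(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """a has shape (k,), b has shape (k, n); returns shape (n,)."""
    k, n = b.shape
    c = np.zeros(n, dtype=np.float32)
    for j in range(n):        # one output element at a time
        for kk in range(k):   # serial reduction over the K dimension
            c[j] += a[kk] * b[kk, j]
    return c

a = np.random.rand(64).astype(np.float32)
b = np.random.rand(64, 32).astype(np.float32)
assert np.allclose(gemv_reference(a, b), a @ b, atol=1e-4)
```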
@@ -39,7 +39,7 @@ def _gemv_naive(
 
 `Triton` is straightforward to use, as it operates at the block level. However, this approach may not allow for fine-grained thread-level optimization. In this tutorial, we will demonstrate how to write an optimized GEMV kernel in `TileLang` that exposes more low-level control.
 
-# Naive Implementation in TileLang
+## Naive Implementation in TileLang
 If you have a basic understanding of CUDA C, it is natural to start with a naive GEMV kernel by adapting a GEMM tiling strategy. You can think of GEMV as a `(1, k) * (k, n)` GEMM. Below is a simple example:
 
 ```python
@@ -120,7 +120,7 @@ In this design, the first 128 threads act as the data producer and the last 128
 
 At this level, we extract very little of the GPU's compute power: around **~0.17 ms** compared to torch/cuBLAS's **~0.008 ms**, roughly 20x slower.
 
-# More concurrency
+## More Concurrency
 
 To further increase the concurrency of our kernel, we can exploit finer thread-level parallelism. Instead of assigning each thread to compute a single output element in C, you can introduce parallelism along the K dimension. Each thread computes a partial accumulation, and you then combine these partial results. This approach requires primitives like `atomicAdd` in CUDA.
 
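The split-K idea the paragraph describes can be sketched in NumPy (a model of the data flow, not the TileLang kernel itself; `split` is an illustrative parameter):

```python
import numpy as np

# Split-K sketch: partition the K dimension across "threads"; each computes
# a partial dot product, and the partials are combined by addition, which is
# what atomicAdd provides on the GPU.
def splitk_gemv(a, b, split=4):
    k, n = b.shape
    assert k % split == 0, "sketch assumes K divides evenly"
    c = np.zeros(n, dtype=np.float64)
    chunk = k // split
    for t in range(split):                # each "thread" owns one K-chunk
        lo, hi = t * chunk, (t + 1) * chunk
        partial = a[lo:hi] @ b[lo:hi, :]  # per-thread partial accumulation
        c += partial                      # stands in for atomicAdd into C
    return c

a = np.random.rand(64)
b = np.random.rand(64, 8)
assert np.allclose(splitk_gemv(a, b), a @ b)
```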
@@ -163,7 +163,7 @@ def naive_splitk_gemv(
 
 By introducing parallelism along the K dimension, our kernel now achieves **~0.024 ms**, an improvement, but still not on par with torch/cuBLAS.
 
-## Customizing Parallelism in K Dimension
+### Customizing Parallelism in K Dimension
 If your K dimension is large, you can further customize how many elements each thread processes by introducing a `reduce_threads` parameter. This way, each thread handles multiple elements per iteration:
 
 ```python
@@ -207,9 +207,9 @@ def splitk_gemv(
 ```
 
 
-# Vectorized Reads
+## Vectorized Reads
 
-GEMV is less computation intensive than GEMM as the computation intensity and memory throuput will be the optimization bottleneck. One effective strategy is to use vectorized load/store operations (e.g., `float2`, `float4`). In `TileLang`, you can specify vectorized operations via `T.vectorized`:
+GEMV is less compute-intensive than GEMM, so arithmetic intensity is low and memory throughput becomes the optimization bottleneck. One effective strategy is to use vectorized load/store operations (e.g., `float2`, `float4`). In `TileLang`, you can specify vectorized operations via `T.vectorized`:
 
 ```python
 def splitk_gemv_vectorized(
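The effect of vectorized reads can be modeled in NumPy: instead of loading one scalar of A per step, each "thread" loads a contiguous chunk of `vec` elements (float4-style) and accumulates it in one shot. `vec` here is an illustrative parameter, not TileLang API:

```python
import numpy as np

# Vectorized-read sketch: process the K dimension in chunks of `vec`
# contiguous elements, mimicking float2/float4 loads from global memory.
def gemv_vectorized_reads(a, b, vec=4):
    k, n = b.shape
    assert k % vec == 0, "sketch assumes K divides by the vector width"
    c = np.zeros(n)
    for base in range(0, k, vec):           # one vectorized load per step
        a_vec = a[base:base + vec]          # e.g. a float4 load of A
        c += a_vec @ b[base:base + vec, :]  # accumulate the chunk
    return c

a = np.random.rand(32)
b = np.random.rand(32, 6)
assert np.allclose(gemv_vectorized_reads(a, b), a @ b)
```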
@@ -255,7 +255,7 @@ def splitk_gemv_vectorized(
 With vectorized reads, the kernel now finishes in **~0.0084 ms**, which is getting close to cuBLAS performance.
 
 
-# `tvm_thread_allreduce` Instead of `atomicAdd`
+## `tvm_thread_allreduce` Instead of `atomicAdd`
 
 [`tvm_thread_allreduce`](https://tvm.apache.org/docs/reference/api/python/tir/tir.html#tvm.tir.tvm_thread_allreduce) implements an optimized all-reduce across a group of threads, which should outperform our plain smem + `atomicAdd` approach:
 
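The reason an all-reduce can beat serial atomics is the reduction tree: T partial sums are combined in roughly log2(T) rounds rather than T serialized additions. A minimal sketch of that tree (a model only; the real `tvm_thread_allreduce` uses warp shuffles and shared memory under the hood):

```python
# Tree reduction over per-thread partial sums. With a stride-doubling
# schedule, the number of combining rounds is ceil(log2(T)) instead of the
# T serialized updates that contended atomicAdds would cost.
def tree_allreduce(partials):
    vals = list(partials)
    stride = 1
    while stride < len(vals):                  # log2(T) combining rounds
        for i in range(0, len(vals), 2 * stride):
            if i + stride < len(vals):
                vals[i] += vals[i + stride]    # pairwise combine
        stride *= 2
    return vals[0]                             # the fully reduced value

assert tree_allreduce([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]) == 36.0
```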
@@ -315,7 +315,7 @@ def splitk_gemv_vectorized_tvm(
 
 With this optimization, the kernel latency drops from **~0.0084 ms** to **~0.0069 ms**, which is faster than torch/cuBLAS!
 
-# Autotune
+## Autotune
 
 `BLOCK_N`, `BLOCK_K`, and `reduce_threads` are hyperparameters in our kernel, which can be tuned to improve performance. We can use the `tilelang.autotune` feature to automatically search for optimal configurations:
 
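At its core, what such an autotuner does can be sketched as a grid search over candidate configurations. `measure_latency` below is a stand-in cost function so the sketch is runnable; `tilelang.autotune` compiles and times real kernels on-device instead:

```python
import itertools

# Autotuning sketch: enumerate hyperparameter configurations and keep the
# one with the lowest measured latency.
def autotune(candidate_configs, measure_latency):
    return min(candidate_configs, key=measure_latency)

configs = [
    {"BLOCK_N": bn, "BLOCK_K": bk, "reduce_threads": rt}
    for bn, bk, rt in itertools.product([32, 64, 128], [64, 128], [16, 32])
]

# A fake cost model standing in for on-device timing of each compiled kernel.
def fake_latency(cfg):
    return abs(cfg["BLOCK_N"] - 64) + abs(cfg["BLOCK_K"] - 128) + cfg["reduce_threads"]

best = autotune(configs, fake_latency)
assert best == {"BLOCK_N": 64, "BLOCK_K": 128, "reduce_threads": 16}
```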
@@ -450,9 +450,9 @@ extern "C" __global__ void __launch_bounds__(64, 1) main_kernel(half_t* __restri
 
 This corresponds closely to our `TileLang` program, with necessary synchronization and low-level optimizations inserted automatically.
 
-# Conclusion
+## Conclusion
 
-## Benchmark Table on Hopper GPU
+### Benchmark Table on Hopper GPU
 
 | Kernel Name | Latency |
 |------------|------------|

api/modules.html (0 additions, 7 deletions)

@@ -178,13 +178,6 @@
 <ul>
 <li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/elementwise.html">ElementWise Operators</a></li>
 <li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/gemv.html">General Matrix-Vector Multiplication (GEMV)</a></li>
-<li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/gemv.html#triton-implementation">Triton implementation</a></li>
-<li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/gemv.html#naive-implementation-in-tilelang">Naive Implementation in TileLang</a></li>
-<li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/gemv.html#more-concurrency">More concurrency</a></li>
-<li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/gemv.html#vectorized-reads">Vectorized Reads</a></li>
-<li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/gemv.html#tvm-thread-allreduce-instead-of-atomicadd"><code class="docutils literal notranslate"><span class="pre">tvm_thread_allreduce</span></code> Instead of <code class="docutils literal notranslate"><span class="pre">atomicAdd</span></code></a></li>
-<li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/gemv.html#autotune">Autotune</a></li>
-<li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/gemv.html#conclusion">Conclusion</a></li>
 <li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/matmul.html">General Matrix-Matrix Multiplication with Tile Library</a></li>
 <li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/matmul_dequant.html">General Matrix-Matrix Multiplication with Dequantization</a></li>
 <li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/flash_attention.html">Flash Attention</a></li>

api/tilelang.autotuner.html (0 additions, 7 deletions)

@@ -178,13 +178,6 @@
 <ul>
 <li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/elementwise.html">ElementWise Operators</a></li>
 <li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/gemv.html">General Matrix-Vector Multiplication (GEMV)</a></li>
-<li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/gemv.html#triton-implementation">Triton implementation</a></li>
-<li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/gemv.html#naive-implementation-in-tilelang">Naive Implementation in TileLang</a></li>
-<li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/gemv.html#more-concurrency">More concurrency</a></li>
-<li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/gemv.html#vectorized-reads">Vectorized Reads</a></li>
-<li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/gemv.html#tvm-thread-allreduce-instead-of-atomicadd"><code class="docutils literal notranslate"><span class="pre">tvm_thread_allreduce</span></code> Instead of <code class="docutils literal notranslate"><span class="pre">atomicAdd</span></code></a></li>
-<li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/gemv.html#autotune">Autotune</a></li>
-<li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/gemv.html#conclusion">Conclusion</a></li>
 <li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/matmul.html">General Matrix-Matrix Multiplication with Tile Library</a></li>
 <li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/matmul_dequant.html">General Matrix-Matrix Multiplication with Dequantization</a></li>
 <li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/flash_attention.html">Flash Attention</a></li>

api/tilelang.cache.html (0 additions, 7 deletions)

@@ -178,13 +178,6 @@
 <ul>
 <li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/elementwise.html">ElementWise Operators</a></li>
 <li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/gemv.html">General Matrix-Vector Multiplication (GEMV)</a></li>
-<li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/gemv.html#triton-implementation">Triton implementation</a></li>
-<li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/gemv.html#naive-implementation-in-tilelang">Naive Implementation in TileLang</a></li>
-<li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/gemv.html#more-concurrency">More concurrency</a></li>
-<li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/gemv.html#vectorized-reads">Vectorized Reads</a></li>
-<li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/gemv.html#tvm-thread-allreduce-instead-of-atomicadd"><code class="docutils literal notranslate"><span class="pre">tvm_thread_allreduce</span></code> Instead of <code class="docutils literal notranslate"><span class="pre">atomicAdd</span></code></a></li>
-<li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/gemv.html#autotune">Autotune</a></li>
-<li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/gemv.html#conclusion">Conclusion</a></li>
 <li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/matmul.html">General Matrix-Matrix Multiplication with Tile Library</a></li>
 <li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/matmul_dequant.html">General Matrix-Matrix Multiplication with Dequantization</a></li>
 <li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/flash_attention.html">Flash Attention</a></li>

api/tilelang.cache.kernel_cache.html (0 additions, 7 deletions)

@@ -178,13 +178,6 @@
 <ul>
 <li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/elementwise.html">ElementWise Operators</a></li>
 <li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/gemv.html">General Matrix-Vector Multiplication (GEMV)</a></li>
-<li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/gemv.html#triton-implementation">Triton implementation</a></li>
-<li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/gemv.html#naive-implementation-in-tilelang">Naive Implementation in TileLang</a></li>
-<li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/gemv.html#more-concurrency">More concurrency</a></li>
-<li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/gemv.html#vectorized-reads">Vectorized Reads</a></li>
-<li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/gemv.html#tvm-thread-allreduce-instead-of-atomicadd"><code class="docutils literal notranslate"><span class="pre">tvm_thread_allreduce</span></code> Instead of <code class="docutils literal notranslate"><span class="pre">atomicAdd</span></code></a></li>
-<li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/gemv.html#autotune">Autotune</a></li>
-<li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/gemv.html#conclusion">Conclusion</a></li>
 <li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/matmul.html">General Matrix-Matrix Multiplication with Tile Library</a></li>
 <li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/matmul_dequant.html">General Matrix-Matrix Multiplication with Dequantization</a></li>
 <li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/flash_attention.html">Flash Attention</a></li>

api/tilelang.carver.analysis.html (0 additions, 7 deletions)

@@ -178,13 +178,6 @@
 <ul>
 <li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/elementwise.html">ElementWise Operators</a></li>
 <li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/gemv.html">General Matrix-Vector Multiplication (GEMV)</a></li>
-<li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/gemv.html#triton-implementation">Triton implementation</a></li>
-<li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/gemv.html#naive-implementation-in-tilelang">Naive Implementation in TileLang</a></li>
-<li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/gemv.html#more-concurrency">More concurrency</a></li>
-<li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/gemv.html#vectorized-reads">Vectorized Reads</a></li>
-<li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/gemv.html#tvm-thread-allreduce-instead-of-atomicadd"><code class="docutils literal notranslate"><span class="pre">tvm_thread_allreduce</span></code> Instead of <code class="docutils literal notranslate"><span class="pre">atomicAdd</span></code></a></li>
-<li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/gemv.html#autotune">Autotune</a></li>
-<li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/gemv.html#conclusion">Conclusion</a></li>
 <li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/matmul.html">General Matrix-Matrix Multiplication with Tile Library</a></li>
 <li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/matmul_dequant.html">General Matrix-Matrix Multiplication with Dequantization</a></li>
 <li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/flash_attention.html">Flash Attention</a></li>

api/tilelang.carver.arch.arch_base.html (0 additions, 7 deletions)

@@ -178,13 +178,6 @@
 <ul>
 <li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/elementwise.html">ElementWise Operators</a></li>
 <li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/gemv.html">General Matrix-Vector Multiplication (GEMV)</a></li>
-<li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/gemv.html#triton-implementation">Triton implementation</a></li>
-<li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/gemv.html#naive-implementation-in-tilelang">Naive Implementation in TileLang</a></li>
-<li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/gemv.html#more-concurrency">More concurrency</a></li>
-<li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/gemv.html#vectorized-reads">Vectorized Reads</a></li>
-<li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/gemv.html#tvm-thread-allreduce-instead-of-atomicadd"><code class="docutils literal notranslate"><span class="pre">tvm_thread_allreduce</span></code> Instead of <code class="docutils literal notranslate"><span class="pre">atomicAdd</span></code></a></li>
-<li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/gemv.html#autotune">Autotune</a></li>
-<li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/gemv.html#conclusion">Conclusion</a></li>
 <li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/matmul.html">General Matrix-Matrix Multiplication with Tile Library</a></li>
 <li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/matmul_dequant.html">General Matrix-Matrix Multiplication with Dequantization</a></li>
 <li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/flash_attention.html">Flash Attention</a></li>

api/tilelang.carver.arch.cdna.html (0 additions, 7 deletions)

@@ -178,13 +178,6 @@
 <ul>
 <li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/elementwise.html">ElementWise Operators</a></li>
 <li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/gemv.html">General Matrix-Vector Multiplication (GEMV)</a></li>
-<li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/gemv.html#triton-implementation">Triton implementation</a></li>
-<li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/gemv.html#naive-implementation-in-tilelang">Naive Implementation in TileLang</a></li>
-<li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/gemv.html#more-concurrency">More concurrency</a></li>
-<li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/gemv.html#vectorized-reads">Vectorized Reads</a></li>
-<li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/gemv.html#tvm-thread-allreduce-instead-of-atomicadd"><code class="docutils literal notranslate"><span class="pre">tvm_thread_allreduce</span></code> Instead of <code class="docutils literal notranslate"><span class="pre">atomicAdd</span></code></a></li>
-<li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/gemv.html#autotune">Autotune</a></li>
-<li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/gemv.html#conclusion">Conclusion</a></li>
 <li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/matmul.html">General Matrix-Matrix Multiplication with Tile Library</a></li>
 <li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/matmul_dequant.html">General Matrix-Matrix Multiplication with Dequantization</a></li>
 <li class="toctree-l1"><a class="reference internal" href="../deeplearning_operators/flash_attention.html">Flash Attention</a></li>
