# ElementWise Operators

<div style="text-align: left;">
<em>Author:</em> <a href="https://github.com/chenghuaWang">Chenghua Wang</a>
</div>

:::{warning}
:class: myclass1 myclass2
:name: a-tip-reference

This document is still **experimental** and may be incomplete.
Suggestions and improvements are highly encouraged—please submit a PR!
:::

Elementwise operators are widely used in deep learning and often serve as the first example encountered by those beginning to explore parallel programming. This tutorial will analyze several implementations of the elementwise addition operator using TileLang and compare them with the corresponding CUDA implementation. By the end of this tutorial, you will learn:

1. How to implement an elementwise operator using TileLang.
2. How to compile operators with dynamic shapes.
3. How TileLang addresses boundary-related issues.
4. The similarities and differences between operators implemented in TileLang and those implemented in CUDA/CuTe.

Please note that this tutorial does not delve deeply into the design principles of TileLang. For a broader understanding of TileLang, we recommend consulting the [Overview](../get_started/overview.md).

## Elementwise add in TileLang

```python
import tilelang
import tilelang.language as T


def elementwise_add(N, threads=256, dtype="bfloat16"):

    @T.prim_func
    def main(A: T.Buffer((N), dtype), B: T.Buffer((N), dtype), C: T.Buffer((N), dtype)):
        with T.Kernel(T.ceildiv(N, threads), threads=threads) as (b_x):
            # vector add.
            for i in T.Parallel(threads):
                C[b_x * threads + i] = A[b_x * threads + i] + B[b_x * threads + i]

    return main
```

All logic for TileLang kernels must be implemented within the `T.Kernel(...)` scope. In this example, initializing `T.Kernel(...)` requires specifying both the grid size and the number of threads per block. The returned value `b_x` corresponds to `blockIdx.x` in CUDA. For example, with `N = 1024` and `threads = 256`, `T.ceildiv(N, threads)` yields a grid of 4 blocks, each handling a contiguous tile of 256 elements. In this implementation, `T.Parallel` is used to process the data tile (of size `1 x threads`) assigned to each block.

Those familiar with CUDA programming might wonder where `threadIdx` fits into this. Note that the code inside `T.Kernel` operates at the **block level**, not the **thread level**. In this example, your focus is solely on defining the block-level logic. During compilation, TileLang automatically maps computations to the corresponding threads and applies further optimizations. The optimized code generated by TileLang may closely align with carefully handcrafted computational logic, as demonstrated with a concrete example in Section 2. While TileLang also supports thread-level programming semantics, this will be covered in subsequent discussions.

The program can be compiled using the following code:

```python
program = elementwise_add(1024, threads=256, dtype="bfloat16")
kernel = tilelang.compile(program, out_idx=-1, target="cuda", execution_backend="cython")
```

Launching the kernel is straightforward: just call it directly like a function:

```python
C = kernel(A, B)
```
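
For a complete round trip, here is a minimal sketch (it assumes PyTorch tensors on a CUDA device; with `out_idx=-1`, the output tensor is allocated and returned by the compiled kernel, as in the call above):

```python
import torch

# Inputs matching the compiled shape (N = 1024) and dtype (bfloat16).
A = torch.randn(1024, dtype=torch.bfloat16, device="cuda")
B = torch.randn(1024, dtype=torch.bfloat16, device="cuda")

# The compiled kernel allocates and returns C.
C = kernel(A, B)

# Sanity check against the PyTorch reference.
torch.testing.assert_close(C, A + B)
```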

The vector add operation can also be extended to the two-dimensional case, where both implementations demonstrate comparable efficiency in practice. A complete test that readers can refer to is available here: [example](https://github.com/tile-ai/tilelang/blob/main/testing/python/kernel/test_tilelang_kernel_element_wise_add.py). The code for this kernel is provided below:

```python
import tilelang.language as T

def elementwise_add(
    M,
    N,
    block_M,
    block_N,
    in_dtype,
    out_dtype,
    threads,
):
    @T.prim_func
    def main(
        A: T.Buffer((M, N), in_dtype),
        B: T.Buffer((M, N), in_dtype),
        C: T.Buffer((M, N), out_dtype),
    ):
        with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=threads) as (bx, by):
            start_x = bx * block_N
            start_y = by * block_M

            for (local_y, local_x) in T.Parallel(block_M, block_N):
                y = start_y + local_y
                x = start_x + local_x

                C[y, x] = A[y, x] + B[y, x]

    return main
```
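
The 2D kernel is compiled and launched the same way as the 1D version. The tile sizes below are illustrative rather than tuned values (a minimal sketch, assuming the same `tilelang.compile` call and PyTorch inputs as before):

```python
import torch

# Illustrative configuration: each block covers a 32 x 32 tile using 128 threads.
program = elementwise_add(1024, 1024, 32, 32, "bfloat16", "bfloat16", 128)
kernel = tilelang.compile(program, out_idx=-1, target="cuda", execution_backend="cython")

A = torch.randn(1024, 1024, dtype=torch.bfloat16, device="cuda")
B = torch.randn(1024, 1024, dtype=torch.bfloat16, device="cuda")
C = kernel(A, B)
```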

### How to compile operators with dynamic shapes?

In the compilation above, a fixed shape was used. In practical usage, however, we often want the kernel to support dynamic shapes. How can we compile such a kernel in TileLang? We can simply replace the concrete size with a dynamic symbolic value, which makes that dimension dynamic. The following example illustrates this:

```python
program = elementwise_add(T.symbolic("N"), threads=256, dtype="bfloat16")
kernel = tilelang.compile(program, out_idx=-1, target="cuda", execution_backend="cython")
```

The resulting CUDA code for the kernel will include an additional `int N` parameter after the `bfloat16_t* __restrict__ A`, `bfloat16_t* __restrict__ B`, and `bfloat16_t* __restrict__ C` parameters.
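
Because `N` is now symbolic, the same compiled kernel can be called with inputs of different lengths. Below is a minimal sketch; it assumes the execution backend resolves the symbolic `N` from the shapes of the runtime tensors:

```python
import torch

a_small = torch.randn(1024, dtype=torch.bfloat16, device="cuda")
b_small = torch.randn(1024, dtype=torch.bfloat16, device="cuda")
a_large = torch.randn(1 << 20, dtype=torch.bfloat16, device="cuda")
b_large = torch.randn(1 << 20, dtype=torch.bfloat16, device="cuda")

# One compiled kernel serves both sizes; N is taken from the input shapes at call time.
c_small = kernel(a_small, b_small)
c_large = kernel(a_large, b_large)
```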

### How TileLang addresses boundary-related issues

TileLang automatically inserts boundary-checking conditions; however, this comes at a cost: the checks may prevent TileLang from applying more aggressive optimizations. The example below is taken from the next section and shown here in advance; because it involves the associated CUDA code, readers may prefer to read the next section first and then return to this paragraph.

When compiling the example below, let's set `N` to 2047:

```python
def elementwise_add(N, num_per_thread=8, threads=256, dtype="bfloat16"):

    @T.prim_func
    def main(A: T.Buffer((N), dtype), B: T.Buffer((N), dtype), C: T.Buffer((N), dtype)):
        with T.Kernel(T.ceildiv(N, threads * num_per_thread), threads=threads) as (b_x):
            # vector add.
            for i, j in T.Parallel(threads, num_per_thread):
                offsets = (b_x * threads + i) * num_per_thread
                C[offsets + j] = A[offsets + j] + B[offsets + j]

    return main
```

TileLang will generate the following CUDA code:

```c++
extern "C" __global__ void __launch_bounds__(256) main_kernel(bfloat16_t* __restrict__ A, bfloat16_t* __restrict__ B, bfloat16_t* __restrict__ C) {
  #pragma unroll
  for (int i = 0; i < 8; ++i) {
    if (((i * 256) + ((int)threadIdx.x)) < 2047) {
      C[((i * 256) + ((int)threadIdx.x))] = (A[((i * 256) + ((int)threadIdx.x))] + B[((i * 256) + ((int)threadIdx.x))]);
    }
  }
}
```

We can observe that TileLang did not apply optimizations such as vectorized 128-bit memory access here because of the boundary check. In fact, except for the tail group of data, all other threads could have executed more optimized code.
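
One way to see the cost of the boundary check is to compile the same program for a size that tiles evenly and compare the generated sources. The sketch below reuses the compile call from above; with `N = 2048` (a multiple of `threads * num_per_thread`) the guard should disappear and vectorized accesses should be emitted:

```python
import tilelang

# Tail case: 2047 is not a multiple of threads * num_per_thread, so bounds checks remain.
src_tail = tilelang.compile(elementwise_add(2047), out_idx=-1, target="cuda",
                            execution_backend="cython").get_kernel_source()

# Aligned case: 2048 tiles evenly, so no guard should be needed.
src_aligned = tilelang.compile(elementwise_add(2048), out_idx=-1, target="cuda",
                               execution_backend="cython").get_kernel_source()

print(src_tail)
print(src_aligned)
```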

## Comparison of TileLang, CUDA, and CuTe

For the remaining examples, this tutorial uses the vector add operation for simplicity and brevity.

Those new to CUDA programming typically write code in a style similar to this:

```c++
// vector add
__global__ void elementwise_add(float* a, float* b, float* c, int N) {
  int idx = threadIdx.x + blockIdx.x * blockDim.x;
  if (idx < N) {
    c[idx] = a[idx] + b[idx];
  }
}
```

The code above assigns each thread to compute a single element, which is inefficient because acceleration techniques such as vectorized memory access are not utilized. However, TileLang code written with similar logic (e.g., loop-based traversal) can be optimized by the compiler into a highly efficient implementation, making it more accessible for beginners. Additionally, the final code generated by the compiler remains observable, providing transparency into the optimization process.

The CUDA code generated by TileLang for the compiled kernel can be retrieved using the `kernel.get_kernel_source()` method. Below is the CUDA code produced for the vector addition example from Section 1:

```c++
extern "C" __global__ void __launch_bounds__(256) main_kernel(bfloat16_t* __restrict__ A, bfloat16_t* __restrict__ B, bfloat16_t* __restrict__ C) {
  if (((int)threadIdx.x) < 32) {
    uint4 __1;
    uint4 v_ = *(uint4*)(A + ((((int)blockIdx.x) * 256) + (((int)threadIdx.x) * 8)));
    uint4 v__1 = *(uint4*)(B + ((((int)blockIdx.x) * 256) + (((int)threadIdx.x) * 8)));
    ((nv_bfloat162*)(&(__1.x)))->x = (((nv_bfloat162*)(&(v_.x)))->x+((nv_bfloat162*)(&(v__1.x)))->x);
    ((nv_bfloat162*)(&(__1.x)))->y = (((nv_bfloat162*)(&(v_.x)))->y+((nv_bfloat162*)(&(v__1.x)))->y);
    ((nv_bfloat162*)(&(__1.y)))->x = (((nv_bfloat162*)(&(v_.y)))->x+((nv_bfloat162*)(&(v__1.y)))->x);
    ((nv_bfloat162*)(&(__1.y)))->y = (((nv_bfloat162*)(&(v_.y)))->y+((nv_bfloat162*)(&(v__1.y)))->y);
    ((nv_bfloat162*)(&(__1.z)))->x = (((nv_bfloat162*)(&(v_.z)))->x+((nv_bfloat162*)(&(v__1.z)))->x);
    ((nv_bfloat162*)(&(__1.z)))->y = (((nv_bfloat162*)(&(v_.z)))->y+((nv_bfloat162*)(&(v__1.z)))->y);
    ((nv_bfloat162*)(&(__1.w)))->x = (((nv_bfloat162*)(&(v_.w)))->x+((nv_bfloat162*)(&(v__1.w)))->x);
    ((nv_bfloat162*)(&(__1.w)))->y = (((nv_bfloat162*)(&(v_.w)))->y+((nv_bfloat162*)(&(v__1.w)))->y);
    *(uint4*)(C + ((((int)blockIdx.x) * 256) + (((int)threadIdx.x) * 8))) = __1;
  }
}
```

In the code above, TileLang not only automatically maps block-level parallelism onto threads but also applies optimizations such as vectorization and coalesced memory access.

While TileLang applies these optimizations automatically, its behavior may sometimes appear counterintuitive. For example, when 256 threads are requested, vectorization results in each thread computing 8 elements, so only 32 threads are actually active. Interestingly, the kernel launch configuration still retains the original allocation of 256 threads.

In such scenarios, explicitly specifying the number of elements computed per thread can help "guide" TileLang's code generation, leading to an implementation that is more closely aligned with the intended design:

```python
def elementwise_add(N, num_per_thread=8, threads=256, dtype="bfloat16"):

    @T.prim_func
    def main(A: T.Buffer((N), dtype), B: T.Buffer((N), dtype), C: T.Buffer((N), dtype)):
        with T.Kernel(T.ceildiv(N, threads * num_per_thread), threads=threads) as (b_x):
            # vector add.
            for i, j in T.Parallel(threads, num_per_thread):
                offsets = (b_x * threads + i) * num_per_thread
                C[offsets + j] = A[offsets + j] + B[offsets + j]

    return main
```

The corresponding CUDA code generated for the above example is presented below:

```c++
extern "C" __global__ void __launch_bounds__(256) main_kernel(bfloat16_t* __restrict__ A, bfloat16_t* __restrict__ B, bfloat16_t* __restrict__ C) {
  uint4 __1;
  uint4 v_ = *(uint4*)(A + (((int)threadIdx.x) * 8));
  uint4 v__1 = *(uint4*)(B + (((int)threadIdx.x) * 8));
  ((nv_bfloat162*)(&(__1.x)))->x = (((nv_bfloat162*)(&(v_.x)))->x+((nv_bfloat162*)(&(v__1.x)))->x);
  ((nv_bfloat162*)(&(__1.x)))->y = (((nv_bfloat162*)(&(v_.x)))->y+((nv_bfloat162*)(&(v__1.x)))->y);
  ((nv_bfloat162*)(&(__1.y)))->x = (((nv_bfloat162*)(&(v_.y)))->x+((nv_bfloat162*)(&(v__1.y)))->x);
  ((nv_bfloat162*)(&(__1.y)))->y = (((nv_bfloat162*)(&(v_.y)))->y+((nv_bfloat162*)(&(v__1.y)))->y);
  ((nv_bfloat162*)(&(__1.z)))->x = (((nv_bfloat162*)(&(v_.z)))->x+((nv_bfloat162*)(&(v__1.z)))->x);
  ((nv_bfloat162*)(&(__1.z)))->y = (((nv_bfloat162*)(&(v_.z)))->y+((nv_bfloat162*)(&(v__1.z)))->y);
  ((nv_bfloat162*)(&(__1.w)))->x = (((nv_bfloat162*)(&(v_.w)))->x+((nv_bfloat162*)(&(v__1.w)))->x);
  ((nv_bfloat162*)(&(__1.w)))->y = (((nv_bfloat162*)(&(v_.w)))->y+((nv_bfloat162*)(&(v__1.w)))->y);
  *(uint4*)(C + (((int)threadIdx.x) * 8)) = __1;
}
```

Aha, this CUDA code aligns closely with conventional hand-written practice, making it more familiar and intuitive.

But what happens if we give TileLang additional hints, for instance by explicitly specifying register copies with the `T.copy(...)` operation? The example below demonstrates such a vector addition implementation. Unlike the previous examples, this code explicitly loads data into registers before performing the computation.

```python
def elementwise_add(N, NUM_ELE_PER_THREAD=8, threads=256, dtype="bfloat16"):

    @T.prim_func
    def main(A: T.Buffer((N), dtype), B: T.Buffer((N), dtype), C: T.Buffer((N), dtype)):
        with T.Kernel(T.ceildiv(N, threads * NUM_ELE_PER_THREAD), threads=threads) as (b_x):
            A_register = T.alloc_fragment((threads * NUM_ELE_PER_THREAD), dtype)
            B_register = T.alloc_fragment((threads * NUM_ELE_PER_THREAD), dtype)
            C_register = T.alloc_fragment((threads * NUM_ELE_PER_THREAD), dtype)

            s_start = b_x * threads * NUM_ELE_PER_THREAD
            s_end = (b_x + 1) * threads * NUM_ELE_PER_THREAD

            # LDG.128
            T.copy(
                A[s_start:s_end],
                A_register,
            )
            T.copy(
                B[s_start:s_end],
                B_register,
            )

            # vector add.
            for tid, i in T.Parallel(threads, NUM_ELE_PER_THREAD):
                C_register[tid * NUM_ELE_PER_THREAD + i] = (
                    A_register[tid * NUM_ELE_PER_THREAD + i] +
                    B_register[tid * NUM_ELE_PER_THREAD + i])

            # STG.128
            T.copy(
                C_register,
                C[s_start:s_end],
            )

    return main
```

In the example above, each thread is responsible for computing 8 elements. The `T.copy(...)` method functions at the block level, and TileLang automatically maps data movement operations to individual threads. This design may resonate more intuitively with CUDA developers. Let us now analyze the CUDA code generated from this implementation.

```c++
// N is set to 8192 * 8192 when compiling
extern "C" __global__ void __launch_bounds__(256) main_kernel(bfloat16_t* __restrict__ A, bfloat16_t* __restrict__ B, bfloat16_t* __restrict__ C) {
  bfloat16_t A_register[8];
  bfloat16_t B_register[8];
  *(uint4*)(A_register + 0) = *(uint4*)(A + ((((int)blockIdx.x) * 2048) + (((int)threadIdx.x) * 8)));
  *(uint4*)(B_register + 0) = *(uint4*)(B + ((((int)blockIdx.x) * 2048) + (((int)threadIdx.x) * 8)));
  uint4 __1;
  uint4 v_ = *(uint4*)(A_register + 0);
  uint4 v__1 = *(uint4*)(B_register + 0);
  ((nv_bfloat162*)(&(__1.x)))->x = (((nv_bfloat162*)(&(v_.x)))->x+((nv_bfloat162*)(&(v__1.x)))->x);
  ((nv_bfloat162*)(&(__1.x)))->y = (((nv_bfloat162*)(&(v_.x)))->y+((nv_bfloat162*)(&(v__1.x)))->y);
  ((nv_bfloat162*)(&(__1.y)))->x = (((nv_bfloat162*)(&(v_.y)))->x+((nv_bfloat162*)(&(v__1.y)))->x);
  ((nv_bfloat162*)(&(__1.y)))->y = (((nv_bfloat162*)(&(v_.y)))->y+((nv_bfloat162*)(&(v__1.y)))->y);
  ((nv_bfloat162*)(&(__1.z)))->x = (((nv_bfloat162*)(&(v_.z)))->x+((nv_bfloat162*)(&(v__1.z)))->x);
  ((nv_bfloat162*)(&(__1.z)))->y = (((nv_bfloat162*)(&(v_.z)))->y+((nv_bfloat162*)(&(v__1.z)))->y);
  ((nv_bfloat162*)(&(__1.w)))->x = (((nv_bfloat162*)(&(v_.w)))->x+((nv_bfloat162*)(&(v__1.w)))->x);
  ((nv_bfloat162*)(&(__1.w)))->y = (((nv_bfloat162*)(&(v_.w)))->y+((nv_bfloat162*)(&(v__1.w)))->y);
  *(uint4*)(A_register + 0) = __1;
  *(uint4*)(C + ((((int)blockIdx.x) * 2048) + (((int)threadIdx.x) * 8))) = *(uint4*)(A_register + 0);
}
```

We can observe two additional register arrays, `A_register` and `B_register`. During the actual computation, however, their contents are simply reloaded into `v_` and `v__1`, respectively; the compiler can typically eliminate such redundant register-to-register moves.

To gauge the relative complexity, one can implement the same elementwise addition operator using CuTe and compare it with the TileLang version. The corresponding CuTe code is provided below:

```c++
template<int NUM_ELE_PER_THREAD=8>
__global__ void elementwise_add(nv_bfloat16* C,
                                const nv_bfloat16* A,
                                const nv_bfloat16* B,
                                int N) {
  using namespace cute;

  const int idx = threadIdx.x + blockIdx.x * blockDim.x;

  Tensor t_C = make_tensor(make_gmem_ptr(C), make_shape(N));
  Tensor t_A = make_tensor(make_gmem_ptr(A), make_shape(N));
  Tensor t_B = make_tensor(make_gmem_ptr(B), make_shape(N));

  Tensor t_C_tile = local_tile(t_C, make_shape(Int<NUM_ELE_PER_THREAD>{}), make_coord(idx));
  Tensor t_A_tile = local_tile(t_A, make_shape(Int<NUM_ELE_PER_THREAD>{}), make_coord(idx));
  Tensor t_B_tile = local_tile(t_B, make_shape(Int<NUM_ELE_PER_THREAD>{}), make_coord(idx));

  Tensor reg_buffer_A = make_tensor_like(t_A_tile);
  Tensor reg_buffer_B = make_tensor_like(t_B_tile);
  Tensor reg_buffer_C = make_tensor_like(t_C_tile);

  // LDG.128
  copy(t_A_tile, reg_buffer_A);
  copy(t_B_tile, reg_buffer_B);

  auto reg_C_vector = recast<nv_bfloat162>(reg_buffer_C);
  auto reg_A_vector = recast<nv_bfloat162>(reg_buffer_A);
  auto reg_B_vector = recast<nv_bfloat162>(reg_buffer_B);

  // Perform vectorized addition
  #pragma unroll
  for (int vec_idx = 0; vec_idx < size(reg_C_vector); ++vec_idx) {
    reg_C_vector(vec_idx) = reg_A_vector(vec_idx) + reg_B_vector(vec_idx);
  }

  auto reg_C_flat = recast<nv_bfloat16>(reg_C_vector);

  // STG.128
  copy(reg_C_flat, t_C_tile);
}
```

## Conclusion

This tutorial showcases the implementation of the elementwise addition operator using TileLang, while also comparing various design approaches. TileLang significantly reduces the complexity of CUDA programming, enabling high performance with minimal code. Nevertheless, working with TileLang demands careful attention to specific implementation details. To ensure computational efficiency, it is essential to thoroughly examine the generated CUDA kernels.
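
A quick way to perform that examination is to print the generated source after compilation (a minimal sketch; `kernel` is the object returned by `tilelang.compile` in the earlier snippets):

```python
print(kernel.get_kernel_source())
```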

---

**Reference:**

[1] The CuTe code implementation draws inspiration from the techniques discussed in this blog: https://zhuanlan.zhihu.com/p/690703999
