Commit 2f75697

feat: Add a naive ws-hgemm for sm8x (#366)
* Use 128-bit data loading
* add a naive ws-hgemm in sm8x
* rename wshgemm to ws-hgemm
1 parent cb6a049 commit 2f75697

5 files changed: +592 -2 lines changed

README.md

Lines changed: 1 addition & 1 deletion
@@ -39,7 +39,7 @@
 <div id="news"></div>
 
 - [2025-07-13] **[🤗flux-faster](https://github.com/xlite-dev/flux-faster)** is released! A forked version of [huggingface/flux-fast](https://github.com/huggingface/flux-fast) that **makes flux-fast even faster** with **[cache-dit](https://github.com/vipshop/cache-dit)**, **3.3x** speedup on NVIDIA L20 while still maintaining **high precision**.
-
+
 - [2025-06-16]: [🤗CacheDiT](https://github.com/vipshop/cache-dit) is released! A **Training-free** and **Easy-to-use** Cache Acceleration Toolbox for Diffusion Transformers (**DBCache**, **DBPrune**, **TaylorSeer**, **FBCache**, etc)🔥. Feel free to take a try!
 
 <div align='center'>

kernels/layer-norm/layer_norm.py

Lines changed: 1 addition & 1 deletion
@@ -98,7 +98,7 @@ def run_benchmark(
 
 print(" " * 40 + f"f16 overflow without f32")
 print("-" * 85)
-x_f16 = x.half() * 100 # this will cause overflow for kernels without `f32`
+x_f16 = x.half() * 100 # this will cause overflow for kernels without `f32`
 run_benchmark(lib.layer_norm_f16_f16, x_f16, "f16f16", out_f16)
 run_benchmark(lib.layer_norm_f16_f32, x_f16, "f16f32", out_f16)
 run_benchmark(lib.layer_norm_f16x2_f16, x_f16, "f16x2f16", out_f16)
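The comment in this hunk says that scaling the input by 100 overflows kernels that do not use `f32`. A minimal sketch of why that might happen, assuming the overflow comes from accumulating squared values in half precision (the kernels themselves are not shown in this diff). The helpers `to_f16`, `sum_squares_f16` and `sum_squares_wide` are hypothetical names used only for illustration, and a Python float (f64) stands in for the f32 accumulator.

```python
import struct


def to_f16(v: float) -> float:
    """Round v to the nearest IEEE 754 half-precision value (inf on overflow)."""
    try:
        return struct.unpack("e", struct.pack("e", v))[0]
    except OverflowError:
        # Magnitude is too large for f16 (max finite value is 65504).
        return float("inf") if v > 0 else float("-inf")


def sum_squares_f16(xs):
    """Accumulate x*x with every intermediate result rounded to f16."""
    acc = 0.0
    for x in xs:
        x16 = to_f16(x)
        acc = to_f16(acc + to_f16(x16 * x16))
    return acc


def sum_squares_wide(xs):
    """Accumulate x*x in a wider type (f64 here, standing in for f32)."""
    acc = 0.0
    for x in xs:
        x16 = to_f16(x)
        acc += x16 * x16
    return acc


# Eight inputs of magnitude 100: each square is 10000, the sum is 80000,
# which is above the largest finite f16 value.
xs = [100.0] * 8
print("f16 accumulator: ", sum_squares_f16(xs))
print("wide accumulator:", sum_squares_wide(xs))
```

With inputs of this size the half-precision accumulator saturates to infinity after a handful of elements, while the wider accumulator keeps a finite sum. That would be consistent with the benchmark running the `f16_f16` and `f16_f32` variants on the same scaled input.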

kernels/ws-hgemm/README.md

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
+# Warp Specialization HGemm
+
+## 0x00 Description
+
+Includes the following:
+
+- [X] ws_hgemm_naive_cute_kernel
+
+
+## Testing
+
+```bash
+python3 ws_hgemm.py
+```
+
+Output:
+
+```bash
+--------------------------------------------------------------------------------
+out_ws_hgemm_naive_cute: [4096.0, 4096.0, 4096.0], time:3.71974587ms
+out_f16_th: [4096.0, 4096.0, 4096.0], time:5.05561471ms
+--------------------------------------------------------------------------------
+```
