Releases: ashvardanian/less_slow.cpp
Release v0.10.3
v0.10.2: Fast Math Patches
- Improve: Horner method (cab8824)
- Make: Default to `-O2` (56016d5)
- Fix: Compiling w/out Intel TBB (2346e03)
- Docs: Typo (#39) (99a91ba)
- Improve: Stricter range limits & fast-math (7ae2c01)
- Make: Formatting CMake (0e3c916)
- Improve: Detecting CUDA availability (91c5f4e)
Thanks to @corneliusroemer, @dzaima, @DrChr 🤗
Release v0.10.1
v0.10: cuBLASLt examples for `fp8_e4m3` GEMM
DeepSeek has just released their mixed-precision FP8 GEMM implementation, and it felt like a good time to introduce some cuBLASLt snippets as a baseline for such work. On Nvidia H200, the results for different input sizes look like this:
```
--------------------------------------------------------------------------------------------------
Benchmark                                            Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------
cublaslt_tops<fp8_e4m3_t, float>/256             12496 ns        12496 ns        56284 TOP=2.67999T/s
cublaslt_tops<fp8_e4m3_t, float>/512             13089 ns        13089 ns        53100 TOP=20.4883T/s
cublaslt_tops<fp8_e4m3_t, float>/1024            14882 ns        14882 ns        46918 TOP=144.23T/s
cublaslt_tops<fp8_e4m3_t, float>/2048            25802 ns        25802 ns        26869 TOP=665.679T/s
cublaslt_tops<fp8_e4m3_t, float>/4096           109316 ns       109313 ns         6021 TOP=1.25715P/s
cublaslt_tops<fp8_e4m3_t, float>/8192           821080 ns       821050 ns          629 TOP=1.33907P/s
cublaslt_tops<fp8_e4m3_t, float>/16384         7135472 ns      7135461 ns           93 TOP=1.23269P/s
cublaslt_tops<fp8_e4m3_t, float>_BigO             0.00 N^3        0.00 N^3
cublaslt_tops<fp8_e4m3_t, float>_RMS                 2 %             2 %
```
The advertised throughput for H100 and H200 in the SXM form factor is 2 Peta-Ops, and cuBLASLt achieves around 67% of that in the shared benchmarks. So, one should definitely be able to squeeze out more.
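For readers new to cuBLASLt, the baseline boils down to describing the operand layouts and compute precision, then calling `cublasLtMatmul`. Below is a minimal sketch of such an FP8 `e4m3` GEMM with FP32 accumulation. It is not the repository's exact benchmark code: the function name and parameters are illustrative, error handling is omitted, and the algorithm choice is left to the library's internal heuristic.

```cpp
#include <cstddef>
#include <cublasLt.h> // cuBLASLt handle, descriptors, and `cublasLtMatmul`
#include <cuda_fp8.h> // `__nv_fp8_e4m3` storage type

// Sketch: D = A^T * B for square n*n matrices, A and B in FP8 `e4m3`, D in FP32.
// Assumes `a`, `b`, and `d` are already allocated on the device.
void fp8_gemm_sketch(__nv_fp8_e4m3 const *a, __nv_fp8_e4m3 const *b, float *d, std::size_t n, //
                     void *workspace, std::size_t workspace_size, cudaStream_t stream) {
    cublasLtHandle_t handle;
    cublasLtCreate(&handle);

    // FP8 matmuls accumulate in FP32 and use the "TN" layout: A transposed, B not.
    cublasLtMatmulDesc_t op;
    cublasLtMatmulDescCreate(&op, CUBLAS_COMPUTE_32F, CUDA_R_32F);
    cublasOperation_t trans_a = CUBLAS_OP_T, trans_b = CUBLAS_OP_N;
    cublasLtMatmulDescSetAttribute(op, CUBLASLT_MATMUL_DESC_TRANSA, &trans_a, sizeof(trans_a));
    cublasLtMatmulDescSetAttribute(op, CUBLASLT_MATMUL_DESC_TRANSB, &trans_b, sizeof(trans_b));

    // Column-major layouts: A and B are FP8 `e4m3`, the output D is FP32.
    cublasLtMatrixLayout_t a_layout, b_layout, d_layout;
    cublasLtMatrixLayoutCreate(&a_layout, CUDA_R_8F_E4M3, n, n, n);
    cublasLtMatrixLayoutCreate(&b_layout, CUDA_R_8F_E4M3, n, n, n);
    cublasLtMatrixLayoutCreate(&d_layout, CUDA_R_32F, n, n, n);

    float alpha = 1.0f, beta = 0.0f;
    // Passing `nullptr` for the algorithm lets cuBLASLt pick one internally;
    // a tuned path would query `cublasLtMatmulAlgoGetHeuristic` instead.
    cublasLtMatmul(handle, op, &alpha, a, a_layout, b, b_layout, &beta, //
                   d, d_layout, d, d_layout, nullptr, workspace, workspace_size, stream);

    cublasLtMatrixLayoutDestroy(a_layout);
    cublasLtMatrixLayoutDestroy(b_layout);
    cublasLtMatrixLayoutDestroy(d_layout);
    cublasLtMatmulDescDestroy(op);
    cublasLtDestroy(handle);
}
```

As of recent cuBLASLt versions, FP8 operands require that "TN" combination, which is why the transpose attributes are set explicitly above.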
I haven't tried implementing synthetic ALU benchmarks for different FP8-oriented PTX instructions, so if you have time and want to try something new - feel free to submit a PR 🤗
Release v0.9.2
Release v0.9.1
How to count GPU Tensor operations correctly 🤯
Measuring Tensor-Core throughput is tricky! Many families of matrix-multiplication instructions exist. Practically every Nvidia GPU generation brings new tiles, new numeric types, mixed-precision schemes, and "structured sparsity" models. All of those together form some of the longest PTX IR instructions. To make things worse, the Tensor Core scheduling and the scale of collective execution differ across generations!
- Before Volta and Tensor Cores, each GPU thread would execute its own scalar Fused-Multiply-Add — easy-peasy, as long as you know how to choose the optimal grid size for your GPU model.
- On Volta, with new `mma.*` instructions and `wmma::` intrinsics, 8 threads would execute every tiled Mat-Mul together. This scale of collaboration was creatively called by Nvidia engineers an "octet", or a "quadpair", of course 🤦♂️
- On Ampere, with new `wmma.mma.*` instructions, all of the 32 threads in a single "warp" would work together. This abstraction makes sense to people familiar with CUDA C++ and how scheduling works on the GPU. Great!
- On Hopper, things changed again, of course, with `wgmma.mma_async.sync.*`, which supports basic asynchronous primitives at the hardware level. It involves 128 threads across 4 consecutive "warps", forming a "warp group".
- On Blackwell, you would be wise to expect a new change, and it came with a broader set of functionality refactored into an all-new `tcgen05.*` namespace of instructions 🧠 🔫
This new PR addresses this by explicitly marking the collaboration "scale" and counting TOPS differently for each family of instructions.
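The bookkeeping itself is simple once the "scale" is explicit: every tile multiplication costs the same 2·M·N·K scalar operations, but only one tile is produced per group of cooperating threads. Here is a hedged sketch of that idea; the enumeration and helper names are illustrative, not the repository's actual API.

```cpp
#include <cstddef>
#include <cstdint>

// The "collaboration scale" is the number of threads sharing one MMA tile.
enum class tensor_core_scale_t : std::uint32_t {
    single_k = 1,     // Pre-Volta: one scalar FMA per thread
    quadpair_k = 8,   // Volta: 8 threads share one `mma.*` tile
    warp_k = 32,      // Ampere: a full 32-thread warp per `wmma.mma.*` tile
    warpgroup_k = 128 // Hopper: 4 warps (128 threads) per `wgmma.mma_async.*` tile
};

inline double tops_per_gpu(std::size_t tile_m, std::size_t tile_n, std::size_t tile_k, //
                           std::size_t repetitions, std::size_t total_threads,         //
                           tensor_core_scale_t scale, double seconds) {
    // Each tile multiplication is 2*M*N*K scalar operations (multiply + add).
    double const ops_per_tile = 2.0 * tile_m * tile_n * tile_k;
    // Only one tile is computed per `scale` cooperating threads.
    double const tiles_per_launch =
        static_cast<double>(total_threads) / static_cast<double>(static_cast<std::size_t>(scale)) * repetitions;
    return ops_per_tile * tiles_per_launch / seconds;
}
```

With `single_k` this degenerates into counting one FMA per thread, while the same launch with `warpgroup_k` produces 128× fewer tiles, which is exactly the distinction the updated counters respect.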
Almost equally tricky is making sure that the PTXAS assembler doesn't optimize out the relevant code blocks. In the past, one approach I'd use was to put an impossible condition at the end of a CUDA C++ kernel, like this:
```cuda
template <typename input_type_, typename output_type_, int m_, int n_, int k_, int repetitions_ = 128>
__device__ inline void tops_tc_cuda_kernel() {
    using namespace nvcuda;
    // Uninitialized fragments are fine here - only the instruction throughput matters.
    wmma::fragment<wmma::matrix_a, m_, n_, k_, input_type_, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, m_, n_, k_, input_type_, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, m_, n_, k_, output_type_> c_frag;
    for (int i = 0; i != repetitions_; ++i) wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    // An "impossible" condition: `threadIdx.x` never reaches `INT_MAX`, but the
    // compiler still keeps the store, and therefore the MMA loop, alive.
    if (threadIdx.x == 2147483647) wmma::store_matrix_sync(nullptr, c_frag, 16, wmma::mem_row_major);
}
```
This way, the compiler will see that I'm trying to export the accumulated value and will not remove our `mma_sync` calls, even if the target address is a NULL pointer. Another approach I'd often use in PTX is to define dummy global variables and export a few values there:
```ptx
.visible .global .align 4 .s32 dummy_sink_s32[32];
.visible .global .align 4 .f32 dummy_sink_f32[32];

.visible .entry tops_f16f32_sm90tc_m64n256k16_loop128_ptx_kernel() {
    ...
loop_exit:
    // Zero argument means - wait for all committed WGMMAs to complete.
    wgmma.wait_group.sync.aligned 0;
    // Use volatile stores to force the accumulator values to be written out.
    // This dummy write (to a global variable) makes the work observable and
    // prevents the multiplication pipeline from being optimized out.
    st.global.volatile.f32 [dummy_sink_f32], accum0;
    st.global.volatile.f32 [dummy_sink_f32+4], accum1;
    ret;
}
```
But with WGMMA, the PTXAS tool will optimize away our multiplications if the shared-memory tile descriptors aren't valid, even if it's just for a benchmark. So this PR shows how to assemble valid descriptors 🤗
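For reference, such a descriptor is just a packed 64-bit integer. Below is a hedged sketch of assembling one, following the layout described in the PTX ISA documentation as I read it: encoded shared-memory address, leading-dimension and stride byte offsets, and the swizzle mode in the top bits. The helper names are mine, and the repository's implementation may differ.

```cuda
#include <cstdint>

// `matrix-descriptor-encode(x) = (x & 0x3FFFF) >> 4`, per the PTX ISA docs.
__device__ inline std::uint64_t matrix_descriptor_encode(std::uint64_t x) {
    return (x & 0x3FFFFull) >> 4u;
}

// Packs a WGMMA shared-memory matrix descriptor: address (bits 0-13), leading
// byte offset (bits 16-29), stride byte offset (bits 32-45), swizzle (bits 62-63).
// The base-offset bits (49-51) are left at zero in this sketch.
__device__ inline std::uint64_t make_wgmma_descriptor( //
    void const *smem_tile, std::uint32_t leading_byte_offset, std::uint32_t stride_byte_offset,
    std::uint32_t swizzle_mode /* 0: none, 1: 128B, 2: 64B, 3: 32B */) {
    // `wgmma` expects a `.shared` state-space address, not a generic pointer.
    std::uint64_t const smem_address = static_cast<std::uint64_t>(__cvta_generic_to_shared(smem_tile));
    std::uint64_t descriptor = 0;
    descriptor |= matrix_descriptor_encode(smem_address);
    descriptor |= matrix_descriptor_encode(leading_byte_offset) << 16;
    descriptor |= matrix_descriptor_encode(stride_byte_offset) << 32;
    descriptor |= static_cast<std::uint64_t>(swizzle_mode) << 62;
    return descriptor;
}
```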
This PR fixes those issues and adds more PTX kernels to highlight the different aspects of GPGPU development 🤗
Minor
- Add: `f16f32` WMMA variant for Ampere (28e639e)
- Add: `f16f32` MMA variant for Volta (1359ca7)
- Add: Inline-PTX in C++ for WGMMA (6e16165)
- Add: WGMMA synchronization (0207843)
- Add: Inlined PTX kernels in CUDA C++ (e2a1bfc)
Patch
- Docs: New H200 stats (b5d4610)
- Docs: Naming temporary compilation results (da36475)
- Improve: Drop small WGMMA for conciseness (7f63ef2)
- Fix: Invoke `f16f32` in WGMMA (4423421)
- Fix: `tf32` perf and waiting on fences (ea4a3e0)
- Fix: Counting TOPS across TC generations (85f78c3)
- Make: Split Hopper and Ampere PTX (733cbac)
- Make: Target SM 9.0a over SM 9.0 (726c1e1)
Release v0.8.2
Release v0.8.1
v0.8: Mat-Muls on Nvidia Hopper and Blackwell
This release answers a few questions:
- CUTLASS vs cuBLAS performance: which to choose?
- How did `MMA` instructions change with Hopper H100?
- How did they change again with Blackwell B200?
Minor
- Add: Warp-Group Binary MMA (d6daf3a)
- Add: Larger `m64n256k8` WGMMA variant (3e3530e)
- Add: Warp-Group Async kernels (6cc7e34)
- Add: `f64` MMA PTX variant (ae450e5)
- Add: CuTe draft (fdea727)
- Add: CUTLASS placeholders (b1ab93d)
- Add: Hopper `sm90a` PTX kernels (4bcf74a)
Patch
- Improve: `CUresult` error handling (d74d430)
- Improve: Logging CUDA errors (953a696)
- Fix: Synchronize TCs (494ba52)
- Improve: Impossible `%tid` condition against NVCC (8a9c9c5)
- Make: Temporarily block CUTLASS (df1b39c)
- Improve: Cleaner PTX code (71dea0c)
- Improve: Avoid NVCC-specific features (3d65c7f)
- Fix: Re-creating a CUDA stream (e831650)
- Make: Compile in parallel by default (8e671c6)
- Make: Separate host-only code (f751fbf)
- Docs: Counter-intuitive PTX facts (822fa2f)
- Docs: H200 vs MI 300X vs GB200 specs (cc36bcd)
- Make: CUTLASS dependency (f272c40)
- Fix: Synchronize cuBLAS for profiling (4077f26)
- Docs: Blackwell tensor cores (ec35b35)
- Fix: Missing `_Float16` in NVCC, use `half` (71cadca)
- Improve: Same size range for GEMM (d914fce)
- Fix: Different output size for `cublasGemmEx` (304c880)