Releases: ashvardanian/less_slow.cpp
Release v0.10.3
v0.10.2: Fast Math Patches
- Improve: Horner method (cab8824)
- Make: Default to `-O2` (56016d5)
- Fix: Compiling w/out Intel TBB (2346e03)
- Docs: Typo (#39) (99a91ba)
- Improve: Stricter range limits & fast-math (7ae2c01)
- Make: Formatting CMake (0e3c916)
- Improve: Detecting CUDA availability (91c5f4e)
Thanks to @corneliusroemer, @dzaima, @DrChr 🤗
Release v0.10.1
v0.10: cuBLASLt examples for `fp8_e4m3` GEMM
DeepSeek has just released their mixed-precision FP8 GEMM implementation, and it felt like a good time to introduce some cuBLASLt snippets as a baseline for such work. On Nvidia H200, the results for different input sizes look like this:
```
--------------------------------------------------------------------------------------------------
Benchmark                                            Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------
cublaslt_tops<fp8_e4m3_t, float>/256             12496 ns        12496 ns        56284 TOP=2.67999T/s
cublaslt_tops<fp8_e4m3_t, float>/512             13089 ns        13089 ns        53100 TOP=20.4883T/s
cublaslt_tops<fp8_e4m3_t, float>/1024            14882 ns        14882 ns        46918 TOP=144.23T/s
cublaslt_tops<fp8_e4m3_t, float>/2048            25802 ns        25802 ns        26869 TOP=665.679T/s
cublaslt_tops<fp8_e4m3_t, float>/4096           109316 ns       109313 ns         6021 TOP=1.25715P/s
cublaslt_tops<fp8_e4m3_t, float>/8192           821080 ns       821050 ns          629 TOP=1.33907P/s
cublaslt_tops<fp8_e4m3_t, float>/16384         7135472 ns      7135461 ns           93 TOP=1.23269P/s
cublaslt_tops<fp8_e4m3_t, float>_BigO             0.00 N^3        0.00 N^3
cublaslt_tops<fp8_e4m3_t, float>_RMS                 2 %             2 %
```
The advertised throughput for H100 and H200 in the SXM form factor is 2 Peta-Ops, and cuBLASLt achieves around 67% of that in the shared benchmarks. So, one should definitely be able to squeeze out more.
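For readers new to cuBLASLt, the baseline boils down to describing the operand layouts and compute precision, then calling `cublasLtMatmul`. Below is a minimal sketch of such an FP8 `e4m3` GEMM with FP32 accumulation. It is not the repository's exact benchmark code: the function name and parameters are illustrative, error handling is omitted, and the algorithm choice is left to the library's internal heuristic.

```cpp
#include <cstddef>
#include <cublasLt.h> // cuBLASLt handle, descriptors, and `cublasLtMatmul`
#include <cuda_fp8.h> // `__nv_fp8_e4m3` storage type

// Sketch: D = A^T * B for square n*n matrices, A and B in FP8 `e4m3`, D in FP32.
// Assumes `a`, `b`, and `d` are already allocated on the device.
void fp8_gemm_sketch(__nv_fp8_e4m3 const *a, __nv_fp8_e4m3 const *b, float *d, std::size_t n, //
                     void *workspace, std::size_t workspace_size, cudaStream_t stream) {
    cublasLtHandle_t handle;
    cublasLtCreate(&handle);

    // FP8 matmuls accumulate in FP32 and use the "TN" layout: A transposed, B not.
    cublasLtMatmulDesc_t op;
    cublasLtMatmulDescCreate(&op, CUBLAS_COMPUTE_32F, CUDA_R_32F);
    cublasOperation_t trans_a = CUBLAS_OP_T, trans_b = CUBLAS_OP_N;
    cublasLtMatmulDescSetAttribute(op, CUBLASLT_MATMUL_DESC_TRANSA, &trans_a, sizeof(trans_a));
    cublasLtMatmulDescSetAttribute(op, CUBLASLT_MATMUL_DESC_TRANSB, &trans_b, sizeof(trans_b));

    // Column-major layouts: A and B are FP8 `e4m3`, the output D is FP32.
    cublasLtMatrixLayout_t a_layout, b_layout, d_layout;
    cublasLtMatrixLayoutCreate(&a_layout, CUDA_R_8F_E4M3, n, n, n);
    cublasLtMatrixLayoutCreate(&b_layout, CUDA_R_8F_E4M3, n, n, n);
    cublasLtMatrixLayoutCreate(&d_layout, CUDA_R_32F, n, n, n);

    float alpha = 1.0f, beta = 0.0f;
    // Passing `nullptr` for the algorithm lets cuBLASLt pick one internally;
    // a tuned path would query `cublasLtMatmulAlgoGetHeuristic` instead.
    cublasLtMatmul(handle, op, &alpha, a, a_layout, b, b_layout, &beta, //
                   d, d_layout, d, d_layout, nullptr, workspace, workspace_size, stream);

    cublasLtMatrixLayoutDestroy(a_layout);
    cublasLtMatrixLayoutDestroy(b_layout);
    cublasLtMatrixLayoutDestroy(d_layout);
    cublasLtMatmulDescDestroy(op);
    cublasLtDestroy(handle);
}
```

As of recent cuBLASLt versions, FP8 operands require that "TN" combination, which is why the transpose attributes are set explicitly above.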
I haven't tried implementing synthetic ALU benchmarks for different FP8-oriented PTX instructions, so if you have time and want to try something new - feel free to submit a PR 🤗
Release v0.9.2
Release v0.9.1
How to count GPU Tensor operations correctly 🤯
Measuring Tensor-Core throughput is tricky! Many families of matrix-multiplication instructions exist. Practically every Nvidia GPU generation brings new tiles, new numeric types, mixed-precision schemes, and "structured sparsity" models. All of those together form some of the longest PTX IR instructions. To make things worse, the Tensor Core scheduling and the scale of collective execution differ across generations!
- Before Volta and Tensor Cores, each GPU thread would execute its own scalar Fused-Multiply-Add — easy-peasy, as long as you know how to choose the optimal grid size for your GPU model.
- On Volta, with new `mma.*` instructions and `wmma::` intrinsics, 8 threads would execute every tiled Mat-Mul together. This scale of collaboration was creatively called by Nvidia engineers an "octet", or a "quadpair", of course 🤦♂️
- On Ampere, with new `wmma.mma.*` instructions, all of the 32 threads in a single "warp" would work together. This abstraction makes sense to people familiar with CUDA C++ and how scheduling works on the GPU. Great!
- On Hopper, things changed again, of course, with `wgmma.mma_async.sync.*`, which supports basic asynchronous primitives at the hardware level. It involves 128 threads across 4 consecutive "warps", forming a "warp group".
- On Blackwell, you would be wise to expect a new change, and it came with a broader set of functionality refactored into an all-new `tcgen05.*` namespace of instructions 🧠 🔫
This new PR addresses this by explicitly marking the collaboration "scale" and counting TOPS differently for each family of instructions.
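The bookkeeping itself is simple once the "scale" is explicit: every tile multiplication costs the same 2·M·N·K scalar operations, but only one tile is produced per group of cooperating threads. Here is a hedged sketch of that idea; the enumeration and helper names are illustrative, not the repository's actual API.

```cpp
#include <cstddef>
#include <cstdint>

// The "collaboration scale" is the number of threads sharing one MMA tile.
enum class tensor_core_scale_t : std::uint32_t {
    single_k = 1,     // Pre-Volta: one scalar FMA per thread
    quadpair_k = 8,   // Volta: 8 threads share one `mma.*` tile
    warp_k = 32,      // Ampere: a full 32-thread warp per `wmma.mma.*` tile
    warpgroup_k = 128 // Hopper: 4 warps (128 threads) per `wgmma.mma_async.*` tile
};

inline double tops_per_gpu(std::size_t tile_m, std::size_t tile_n, std::size_t tile_k, //
                           std::size_t repetitions, std::size_t total_threads,         //
                           tensor_core_scale_t scale, double seconds) {
    // Each tile multiplication is 2*M*N*K scalar operations (multiply + add).
    double const ops_per_tile = 2.0 * tile_m * tile_n * tile_k;
    // Only one tile is computed per `scale` cooperating threads.
    double const tiles_per_launch =
        static_cast<double>(total_threads) / static_cast<double>(static_cast<std::size_t>(scale)) * repetitions;
    return ops_per_tile * tiles_per_launch / seconds;
}
```

With `single_k` this degenerates into counting one FMA per thread, while the same launch with `warpgroup_k` produces 128× fewer tiles, which is exactly the distinction the updated counters respect.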
Almost equally tricky is making sure that the PTXAS assembler doesn't optimize out the relevant code blocks. In the past, one approach I'd use was to put an impossible condition at the end of a CUDA C++ kernel, like this:
```cuda
template <typename input_type_, typename output_type_, int m_, int n_, int k_, int repetitions_ = 128>
__device__ inline void tops_tc_cuda_kernel() {
    using namespace nvcuda;
    // Uninitialized fragments are fine here - only the instruction throughput matters.
    wmma::fragment<wmma::matrix_a, m_, n_, k_, input_type_, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, m_, n_, k_, input_type_, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, m_, n_, k_, output_type_> c_frag;
    for (int i = 0; i != repetitions_; ++i) wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    // An "impossible" condition: `threadIdx.x` never reaches `INT_MAX`, but the
    // compiler still keeps the store, and therefore the MMA loop, alive.
    if (threadIdx.x == 2147483647) wmma::store_matrix_sync(nullptr, c_frag, 16, wmma::mem_row_major);
}
```
This way, the compiler will see that I'm trying to export the accumulated value and will not remove our `mma_sync` calls, even if the target address is a NULL pointer. Another approach I'd often use in PTX is to define dummy global variables and export a few values there:
```ptx
.visible .global .align 4 .s32 dummy_sink_s32[32];
.visible .global .align 4 .f32 dummy_sink_f32[32];

.visible .entry tops_f16f32_sm90tc_m64n256k16_loop128_ptx_kernel() {
    ...
loop_exit:
    // Zero argument means - wait for all committed WGMMAs to complete.
    wgmma.wait_group.sync.aligned 0;
    // Use volatile stores to force the accumulator values to be written out.
    // This dummy write (to a global variable) makes the work observable and
    // prevents the multiplication pipeline from being optimized out.
    st.global.volatile.f32 [dummy_sink_f32], accum0;
    st.global.volatile.f32 [dummy_sink_f32+4], accum1;
    ret;
}
```
But with WGMMA, the PTXAS tool will optimize away our multiplications if the shared-memory tile descriptors aren't valid, even if it's just for a benchmark. So this PR shows how to assemble valid descriptors 🤗
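For reference, such a descriptor is just a packed 64-bit integer. Below is a hedged sketch of assembling one, following the layout described in the PTX ISA documentation as I read it: encoded shared-memory address, leading-dimension and stride byte offsets, and the swizzle mode in the top bits. The helper names are mine, and the repository's implementation may differ.

```cuda
#include <cstdint>

// `matrix-descriptor-encode(x) = (x & 0x3FFFF) >> 4`, per the PTX ISA docs.
__device__ inline std::uint64_t matrix_descriptor_encode(std::uint64_t x) {
    return (x & 0x3FFFFull) >> 4u;
}

// Packs a WGMMA shared-memory matrix descriptor: address (bits 0-13), leading
// byte offset (bits 16-29), stride byte offset (bits 32-45), swizzle (bits 62-63).
// The base-offset bits (49-51) are left at zero in this sketch.
__device__ inline std::uint64_t make_wgmma_descriptor( //
    void const *smem_tile, std::uint32_t leading_byte_offset, std::uint32_t stride_byte_offset,
    std::uint32_t swizzle_mode /* 0: none, 1: 128B, 2: 64B, 3: 32B */) {
    // `wgmma` expects a `.shared` state-space address, not a generic pointer.
    std::uint64_t const smem_address = static_cast<std::uint64_t>(__cvta_generic_to_shared(smem_tile));
    std::uint64_t descriptor = 0;
    descriptor |= matrix_descriptor_encode(smem_address);
    descriptor |= matrix_descriptor_encode(leading_byte_offset) << 16;
    descriptor |= matrix_descriptor_encode(stride_byte_offset) << 32;
    descriptor |= static_cast<std::uint64_t>(swizzle_mode) << 62;
    return descriptor;
}
```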
This PR fixes those issues and adds more PTX kernels to highlight the different aspects of GPGPU development 🤗
Minor
- Add: `f16f32` WMMA variant for Ampere (28e639e)
- Add: `f16f32` MMA variant for Volta (1359ca7)
- Add: Inline-PTX in C++ for WGMMA (6e16165)
- Add: WGMMA synchronization (0207843)
- Add: Inlined PTX kernels in CUDA C++ (e2a1bfc)
Patch
- Docs: New H200 stats (b5d4610)
- Docs: Naming temporary compilation results (da36475)
- Improve: Drop small WGMMA for conciseness (7f63ef2)
- Fix: Invoke `f16f32` in WGMMA (4423421)
- Fix: `tf32` perf and waiting on fences (ea4a3e0)
- Fix: Counting TOPS across TC generations (85f78c3)
- Make: Split Hopper and Ampere PTX (733cbac)
- Make: Target SM 9.0a over SM 9.0 (726c1e1)
Release v0.8.2
Release v0.8.1
v0.8: Mat-Muls on Nvidia Hopper and Blackwell
This release answers a few questions:
- CUTLASS vs cuBLAS performance: which to choose?
- How did `MMA` instructions change with Hopper H100?
- How did they change again with Blackwell B200?
Minor
- Add: Warp-Group Binary MMA (d6daf3a)
- Add: Larger `m64n256k8` WGMMA variant (3e3530e)
- Add: Warp-Group Async kernels (6cc7e34)
- Add: `f64` MMA PTX variant (ae450e5)
- Add: CuTe draft (fdea727)
- Add: CUTLASS placeholders (b1ab93d)
- Add: Hopper `sm90a` PTX kernels (4bcf74a)
Patch
- Improve: `CUresult` error handling (d74d430)
- Improve: Logging CUDA errors (953a696)
- Fix: Synchronize TCs (494ba52)
- Improve: Impossible `%tid` condition against NVCC (8a9c9c5)
- Make: Temporarily block CUTLASS (df1b39c)
- Improve: Cleaner PTX code (71dea0c)
- Improve: Avoid NVCC-specific features (3d65c7f)
- Fix: Re-creating a CUDA stream (e831650)
- Make: Compile in parallel by default (8e671c6)
- Make: Separate host-only code (f751fbf)
- Docs: Counter-intuitive PTX facts (822fa2f)
- Docs: H200 vs MI 300X vs GB200 specs (cc36bcd)
- Make: CUTLASS dependency (f272c40)
- Fix: Synchronize cuBLAS for profiling (4077f26)
- Docs: Blackwell tensor cores (ec35b35)
- Fix: Missing `_Float16` in NVCC, use `half` (71cadca)
- Improve: Same size range for GEMM (d914fce)
- Fix: Different output size for `cublasGemmEx` (304c880)