Releases: ashvardanian/less_slow.cpp

Release v0.10.3

18 Apr 22:13

Release: v0.10.3 [skip ci]

Patch

  • Docs: Cleaner stance on std::sin approximation (d4cbe85)

v0.10.2: Fast Math Patches

18 Apr 21:44

  • Improve: Horner method (cab8824)
  • Make: Default to -O2 (56016d5)
  • Fix: Compiling w/out Intel TBB (2346e03)
  • Docs: Typo (#39) (99a91ba)
  • Improve: Stricter range limits & fast-math (7ae2c01)
  • Make: Formatting CMake (0e3c916)
  • Improve: Detecting CUDA availability (91c5f4e)

Thanks to @corneliusroemer, @dzaima, @DrChr 🤗

Release v0.10.1

09 Apr 06:16

Release: v0.10.1 [skip ci]

Patch

  • Fix: Destroy CUDA events (c50e2e7)

v0.10: cuBLASLt examples for `fp8_e4m3` GEMM

27 Feb 12:56

DeepSeek has just released their mixed-precision FP8 GEMM implementation, and it felt like a good time to introduce some cuBLASLt snippets as a baseline for such work. On Nvidia H200, the results for different input sizes look like this:

--------------------------------------------------------------------------------------------------
Benchmark                                        Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------
cublaslt_tops<fp8_e4m3_t, float>/256         12496 ns        12496 ns        56284 TOP=2.67999T/s
cublaslt_tops<fp8_e4m3_t, float>/512         13089 ns        13089 ns        53100 TOP=20.4883T/s
cublaslt_tops<fp8_e4m3_t, float>/1024        14882 ns        14882 ns        46918 TOP=144.23T/s
cublaslt_tops<fp8_e4m3_t, float>/2048        25802 ns        25802 ns        26869 TOP=665.679T/s
cublaslt_tops<fp8_e4m3_t, float>/4096       109316 ns       109313 ns         6021 TOP=1.25715P/s
cublaslt_tops<fp8_e4m3_t, float>/8192       821080 ns       821050 ns          629 TOP=1.33907P/s
cublaslt_tops<fp8_e4m3_t, float>/16384     7135472 ns      7135461 ns           93 TOP=1.23269P/s
cublaslt_tops<fp8_e4m3_t, float>_BigO         0.00 N^3        0.00 N^3  
cublaslt_tops<fp8_e4m3_t, float>_RMS             2 %             2 % 

The advertised FP8 throughput for the H100 and H200 in the SXM form factor is about 2 Peta-Ops, and cuBLASLt achieves around 67% of that in the benchmarks above. So one should definitely be able to squeeze out more.
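
For context, the shape of such a call looks roughly like the sketch below: a minimal host-side outline of an fp8_e4m3 GEMM through cuBLASLt with float accumulation. The function name fp8_gemm_sketch, the pre-allocated device pointers, and the square n-by-n problem are hypothetical placeholders, and all error handling, workspace allocation, and heuristic algorithm selection are omitted.

#include <cublasLt.h>
#include <cuda_runtime.h>

// Minimal sketch of an FP8 e4m3 GEMM via cuBLASLt with `float` accumulation.
// Pointers are assumed to be pre-allocated device buffers of the right size.
void fp8_gemm_sketch(void *a_ptr, void *b_ptr, void *c_ptr, void *d_ptr, int n) {
    cublasLtHandle_t handle;
    cublasLtCreate(&handle);

    // FP8 GEMMs accumulate in `float` and expect the "TN" layout: A transposed, B not.
    cublasLtMatmulDesc_t desc;
    cublasLtMatmulDescCreate(&desc, CUBLAS_COMPUTE_32F, CUDA_R_32F);
    cublasOperation_t trans_a = CUBLAS_OP_T, trans_b = CUBLAS_OP_N;
    cublasLtMatmulDescSetAttribute(desc, CUBLASLT_MATMUL_DESC_TRANSA, &trans_a, sizeof(trans_a));
    cublasLtMatmulDescSetAttribute(desc, CUBLASLT_MATMUL_DESC_TRANSB, &trans_b, sizeof(trans_b));

    // Inputs are `fp8_e4m3`; the output here is `float`, other output types are possible.
    cublasLtMatrixLayout_t a_layout, b_layout, c_layout;
    cublasLtMatrixLayoutCreate(&a_layout, CUDA_R_8F_E4M3, n, n, n);
    cublasLtMatrixLayoutCreate(&b_layout, CUDA_R_8F_E4M3, n, n, n);
    cublasLtMatrixLayoutCreate(&c_layout, CUDA_R_32F, n, n, n);

    float alpha = 1.0f, beta = 0.0f;
    cublasLtMatmul(handle, desc, &alpha, a_ptr, a_layout, b_ptr, b_layout, &beta, c_ptr, c_layout,
                   d_ptr, c_layout, /*algo=*/nullptr, /*workspace=*/nullptr, /*workspace_size=*/0,
                   /*stream=*/0);

    cublasLtMatrixLayoutDestroy(a_layout);
    cublasLtMatrixLayoutDestroy(b_layout);
    cublasLtMatrixLayoutDestroy(c_layout);
    cublasLtMatmulDescDestroy(desc);
    cublasLtDestroy(handle);
}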

I haven't tried implementing synthetic ALU benchmarks for the different FP8-oriented PTX instructions, so if you have time and want to try something new, feel free to submit a PR 🤗

Release v0.9.2

23 Feb 13:36

Release: v0.9.2 [skip ci]

Patch

  • Docs: Counting PTX as Assembly lines (cb470dd)

Release v0.9.1

12 Feb 17:10

Release: v0.9.1 [skip ci]

Patch

  • Docs: List project structure (5f603c7)

How to count GPU Tensor operations correctly 🤯

11 Feb 12:08

Measuring Tensor-Core throughput is tricky! Many families of matrix-multiplication instructions exist. Practically every Nvidia GPU generation brings new tiles, new numeric types, mixed-precision schemes, and "structured sparsity" models. All of those together form some of the longest PTX IR instructions. To make things worse, Tensor Core scheduling and the scale of collective execution differ across generations!

  • Before Volta and Tensor Cores, each GPU thread would execute its own scalar Fused-Multiply-Add — easy-peasy, as long as you know how to choose the optimal grid size for your GPU model.
  • On Volta, with new mma.* instructions and wmma:: intrinsics, 8 threads would execute every tiled Mat-Mul together. This scale of collaboration was creatively called by Nvidia engineers not an "octet" but a "quadpair", of course 🤦‍♂️
  • On Ampere, with new wmma.mma.* instructions, all of the 32 threads in a single "warp" would work together. This abstraction makes sense to people familiar with CUDA C++ and how scheduling works on the GPU. Great!
  • On Hopper, things changed again, of course, with wgmma.mma_async.sync.*, which supports basic asynchronous primitives at the hardware level. There, 128 threads across 4 consecutive "warps" form a "warp group".
  • On Blackwell, you would be wise to expect a new change, and it came with a broader set of functionality refactored into an all-new tcgen05.* namespace of instructions 🧠 🔫

This new PR addresses that by explicitly marking the collaboration "scale" and counting TOPS differently for each family of instructions.
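
As a rough sketch of that accounting: every tile-level MMA performs 2·m·n·k scalar operations, but it is shared by all the threads collaborating on it, so the total must be divided by that collaboration scale to avoid double-counting. A minimal host-side helper, with hypothetical launch sizes in the example below:

#include <cstddef>

// Back-of-the-envelope TOPS accounting, assuming one `m x n x k` tile-level MMA
// is issued per group of `threads_per_mma` collaborating threads.
constexpr std::size_t tensor_ops(                   //
    std::size_t m, std::size_t n, std::size_t k,    // tile shape of a single MMA instruction
    std::size_t repetitions,                        // MMAs issued per collaborating group
    std::size_t total_threads,                      // total threads launched across the grid
    std::size_t threads_per_mma) {                  // 1 scalar FMA, 8 Volta quadpair, 32 warp, 128 warp group
    // Each tile MMA performs m*n*k multiply-accumulates, i.e. 2*m*n*k scalar ops,
    // shared by `threads_per_mma` threads, so divide to avoid double-counting.
    return 2 * m * n * k * repetitions * (total_threads / threads_per_mma);
}

// Example: a Hopper m64n256k16 WGMMA tile, 128 repetitions, one warp group
// (128 threads) per block, 1024 blocks - all hypothetical numbers.
static_assert(tensor_ops(64, 256, 16, 128, 128 * 1024, 128) ==
              2ull * 64 * 256 * 16 * 128 * 1024);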


Almost equally tricky is making sure that the PTXAS assembler doesn't optimize out relevant code blocks. In the past, one approach I'd use was putting an impossible condition at the end of a CUDA C++ kernel, like this:

template <typename input_type_, typename output_type_, int m_, int n_, int k_, int repetitions_ = 128>
__device__ inline void tops_tc_cuda_kernel() {
    using namespace nvcuda;
    // Uninitialized fragments are fine: we only care about issuing the MMAs, not the math.
    wmma::fragment<wmma::matrix_a, m_, n_, k_, input_type_, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, m_, n_, k_, input_type_, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, m_, n_, k_, output_type_> c_frag;
    // Chain the accumulator through `repetitions_` back-to-back Tensor Core MMAs.
    for (int i = 0; i != repetitions_; ++i) wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    // Impossible condition: never true, but keeps the MMAs observable to the compiler.
    if (threadIdx.x == 2147483647) wmma::store_matrix_sync(nullptr, c_frag, 16, wmma::mem_row_major);
}

This way, the compiler will see that I'm trying to export the accumulated value and will not remove our mma_sync call, even if the target address is a NULL pointer. Another approach I'd often use in PTX is to define dummy global variables and export a few values there:

.visible .global .align 4 .s32 dummy_sink_s32[32];
.visible .global .align 4 .f32 dummy_sink_f32[32];
.visible .entry tops_f16f32_sm90tc_m64n256k16_loop128_ptx_kernel() {
    ...
loop_exit:
    // A zero argument means: wait for all committed WGMMAs to complete.
    wgmma.wait_group.sync.aligned 0;

    // Use volatile stores to force the accumulator values to be written out.
    // This dummy write (to a global variable) makes the work observable and
    // prevents the multiplication pipeline from being optimized out.
    st.global.volatile.f32 [dummy_sink_f32],      accum0;
    st.global.volatile.f32 [dummy_sink_f32+4],    accum1;
    ret;
}

But with WGMMA, the PTXAS tool will optimize out our multiplications if the shared-memory tile descriptors aren't valid, even if it's just for a benchmark. So this PR shows how to assemble valid descriptors 🤗
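
For reference, the PTX ISA describes that wgmma shared-memory matrix descriptor as a single 64-bit value packing an encoded start address, leading- and stride-dimension byte offsets, and a swizzle mode. Below is a rough packing sketch based on my reading of the spec (the field positions and the encode() helper are my reconstruction, not the exact code from this repository), keeping the 3-bit base-offset field at zero:

// Rough sketch of packing a WGMMA shared-memory matrix descriptor.
__device__ inline unsigned long long wgmma_descriptor( //
    unsigned smem_address,        // shared state-space address, e.g. from __cvta_generic_to_shared
    unsigned leading_byte_offset, // byte distance between repeats along the leading dimension
    unsigned stride_byte_offset,  // byte distance between repeats along the strided dimension
    unsigned swizzle_mode) {      // 0: none, 1: 128-byte, 2: 64-byte, 3: 32-byte swizzle
    // Address-like fields are encoded as `(x & 0x3FFFF) >> 4` and occupy 14-bit slots
    // starting at bits 0, 16, and 32; the swizzle mode lives in the top two bits.
    auto encode = [](unsigned x) { return (static_cast<unsigned long long>(x) & 0x3FFFF) >> 4; };
    return encode(smem_address)                  //
           | (encode(leading_byte_offset) << 16) //
           | (encode(stride_byte_offset) << 32)  //
           | (static_cast<unsigned long long>(swizzle_mode) << 62);
}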


This PR fixes those issues and adds more PTX kernels to highlight the different aspects of GPGPU development 🤗

Minor

  • Add: f16f32 WMMA variant for Ampere (28e639e)
  • Add: f16f32 MMA variant for Volta (1359ca7)
  • Add: Inline-PTX in C++ for WGMMA (6e16165)
  • Add: WGMMA synchronization (0207843)
  • Add: Inlined PTX kernels in CUDA C++ (e2a1bfc)

Patch

  • Docs: New H200 stats (b5d4610)
  • Docs: Naming temporary compilation results (da36475)
  • Improve: Drop small WGMMA for conciseness (7f63ef2)
  • Fix: Invoke f16f32 in WGMMA (4423421)
  • Fix: tf32 perf and waiting on fences (ea4a3e0)
  • Fix: Counting TOPS across TC generations (85f78c3)
  • Make: Split Hopper and Ampere PTX (733cbac)
  • Make: Target SM 9.0a over SM 9.0 (726c1e1)

Release v0.8.2

07 Feb 23:45

Release: v0.8.2 [skip ci]

Patch

  • Docs: Recommend CMake from PyPI (3ff4265)

Release v0.8.1

07 Feb 22:39

Release: v0.8.1 [skip ci]

Patch

  • Improve: Shrink PTX loops (152e59a)
  • Fix: Reopen ASIO compilation (d1909f9)
  • Improve: Naming scheme for PTX kernels (c1b21c8)
  • Fix: Illegal memory access on Volta (b7eab6d)

v0.8: Mat-Muls on Nvidia Hopper and Blackwell

07 Feb 21:08

This release answers a few questions:

  • CUTLASS vs CUBLAS performance: which to choose?
  • How did MMA instructions change with Hopper H100?
  • How did they change again with Blackwell B200?

Minor

  • Add: Warp-Group Binary MMA (d6daf3a)
  • Add: Larger m64n256k8 WGMMA variant (3e3530e)
  • Add: Warp-Group Async kernels (6cc7e34)
  • Add: f64 MMA PTX variant (ae450e5)
  • Add: CuTe draft (fdea727)
  • Add: CUTLASS placeholders (b1ab93d)
  • Add: Hopper sm90a PTX kernels (4bcf74a)

Patch

  • Improve: CUresult error handling (d74d430)
  • Improve: Logging CUDA errors (953a696)
  • Fix: Synchronize TCs (494ba52)
  • Improve: Impossible %tid condition against NVCC (8a9c9c5)
  • Make: Temporarily block CUTLASS (df1b39c)
  • Improve: Cleaner PTX code (71dea0c)
  • Improve: Avoid NVCC-specific features (3d65c7f)
  • Fix: Re-creating a CUDA stream (e831650)
  • Make: Compile in parallel by default (8e671c6)
  • Make: Separate host-only code (f751fbf)
  • Docs: Counter-intuitive PTX facts (822fa2f)
  • Docs: H200 vs MI 300X vs GB200 specs (cc36bcd)
  • Make: CUTLASS dependency (f272c40)
  • Fix: Synchronize cuBLAS for profiling (4077f26)
  • Docs: Blackwell tensor cores (ec35b35)
  • Fix: Missing _Float16 in NVCC, use half (71cadca)
  • Improve: Same size range for GEMM (d914fce)
  • Fix: Different output size for cublasGemmEx (304c880)