
Feature/prefetch2 #1604

Open
maddyscientist wants to merge 118 commits into develop from feature/prefetch2
Conversation

@maddyscientist
Member

This work is the latest towards optimizing QUDA for Blackwell:

  • Adds support for "spatial prefetching", where we over-fetch data into L2 when issuing a global load. Exposed as an optional template parameter to vector_load. At present, not deployed anywhere.
  • Adds support for prefetch instructions, in the form of both per-thread prefetching (which works on all CUDA architectures) and TMA-based prefetching, which is Hopper+ only. The prefetching type is set using the QUDA_DSLASH_PREFETCH CMake parameter, with 0=per-thread, 1=TMA bulk, and 2=TMA descriptor.
  • Adds an experimental L1 prefetch (using LDGSTS). Disabled, but left in for future experiments.
  • Adds the single-threaded execution-region helper function target::is_thread_zero(), which should be used for TMA issuance.
  • Optionally stores the backward-shifted gauge field. This simplifies all dslash indexing, as all spatial indices then correspond to "this" site. Enabled with QUDA_DSLASH_DOUBLE_STORE=ON, which is required for TMA-based prefetching (for alignment reasons).
  • Prefetching is exposed for both ColorSpinorFields and GaugeFields, though only the latter is actually used at present.
  • Added prefetching support to both Wilson and Staggered dslash kernels, parameterized using QUDA_DSLASH_PREFETCH_DISTANCE_WILSON and QUDA_DSLASH_PREFETCH_DISTANCE_STAGGERED CMake parameters.
  • Optimizes the neighbor indexing for the dslash kernels. This reduces integer instruction overhead.
  • Reduces pointer-arithmetic overhead (using more 32-bit integer operations where possible). Adds three-operand and four-operand variants of vector_load and vector_store, respectively, to this end.
  • Optimizes FFMA2 issuance to reduce the total number of floating-point instructions on Blackwell.
  • Optimizes short <-> float conversion to reduce instruction overheads.
  • Optimization of staggered packing kernels (replace division by int with division by fast_intdiv)
  • Extends OpenMP parallelization to some host code where it was missing.
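As a rough illustration of the per-thread prefetching described in the list above, a sketch might look like the following. Only the ideas come from this PR; the names (prefetch_l2, dslash_site, Arg::gauge, Arg::site_floats) are assumptions for the sketch, not the actual QUDA code.

```cuda
// Hypothetical sketch of per-thread L2 prefetching with a compile-time
// prefetch distance; names are illustrative, not QUDA's actual API.
__device__ inline void prefetch_l2(const void *ptr)
{
  // per-thread prefetch into L2; works on all CUDA architectures
  asm volatile("prefetch.global.L2 [%0];" ::"l"(ptr));
}

template <int prefetch_distance, typename Arg>
__device__ void dslash_site(const Arg &arg, int x_cb)
{
  if constexpr (prefetch_distance > 0) {
    // prefetch the gauge field for a site this thread will visit later
    prefetch_l2(arg.gauge + (x_cb + prefetch_distance) * Arg::site_floats);
  }
  // ... load and apply the gauge field for site x_cb as usual ...
}
```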

The end result of this work is that both the Staggered and Wilson dslash kernels can sustain over 90% of memory bandwidth for most variants. The outstanding exceptions are the half-precision variants using reconstruction, which still lag; these will be the focus of a subsequent PR.
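The single-threaded execution-region helper mentioned in the list above might be used along these lines; only the helper name target::is_thread_zero() is taken from the PR description, and the kernel body is purely illustrative.

```cuda
// Hypothetical sketch of gating TMA issuance on a single-threaded region.
namespace target
{
  __device__ inline bool is_thread_zero()
  {
    // single-threaded region: the first thread of the block
    return threadIdx.x == 0 && threadIdx.y == 0 && threadIdx.z == 0;
  }
} // namespace target

__global__ void kernel_with_tma_prefetch()
{
  if (target::is_thread_zero()) {
    // issue the TMA bulk prefetch here: exactly one issuance per block
  }
  __syncthreads(); // other threads wait until the prefetch is in flight
}
```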

…tead of logic operations when computing the neighboring index; this is branch-free and uses fewer operations
…for executing single-thread regions of code. On CUDA, install the latest version of CCCL via CPM since we need some new features
…slash kernels. Disabled by default (set with the Arg::prefetch_distance parameter), and TMA prefetch will be added in next push
…ith QUDA_DSLASH_PREFETCH_BULK=ON). Prefetch distance is now set via CMake (QUDA_DSLASH_PREFETCH_DISTANCE_WILSON and QUDA_DSLASH_PREFETCH_DISTANCE_STAGGERED)
…ants of vector_load and vector_store: these allow for the pointer offset and the index to be computed together first in 32-bit, before accumulation to the pointer in 64-bit, reducing pointer arithmetic overheads
…d and vector_store to reduce indexing overheads
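The commit above describes combining the offset and index in 32-bit before the 64-bit pointer accumulation; a minimal sketch of a three-operand variant, with an assumed signature rather than QUDA's actual one, is:

```cuda
#include <cstdint>

// Illustrative three-operand vector_load: offset and index are combined
// in cheap 32-bit arithmetic first, leaving a single 64-bit pointer
// accumulation. The signature is an assumption for this sketch.
template <typename VectorType>
__device__ inline VectorType vector_load(const VectorType *ptr, int32_t offset, int32_t idx)
{
  // 32-bit add is cheaper than carrying both terms into 64-bit math
  const int32_t combined = offset + idx;
  return ptr[combined]; // one 64-bit accumulation onto the pointer
}
```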
TMA (Tensor Memory Accelerator) is only available on Hopper (sm_90+) and
later architectures. This commit wraps the cuTensorMapEncodeTiled calls
with a compile-time guard to prevent runtime errors on Volta/Ampere GPUs.
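A sketch of such a guard follows. QUDA_TARGET_SM is a hypothetical macro standing in for however the build exposes the target architecture, and the rank-1 descriptor arguments are illustrative; cuTensorMapEncodeTiled itself is the real CUDA 12 driver-API entry point.

```cuda
#include <cuda.h>

// Guard sketch: only compile the TMA descriptor call for Hopper (sm_90+),
// so pre-Hopper builds can never hit it at runtime.
static CUresult encode_tma_map(CUtensorMap *map, void *global_ptr, cuuint64_t n)
{
#if defined(QUDA_TARGET_SM) && (QUDA_TARGET_SM >= 90)
  cuuint64_t global_dim[1] = {n};
  cuuint32_t box_dim[1] = {256};  // example box size
  cuuint32_t elem_stride[1] = {1};
  return cuTensorMapEncodeTiled(map, CU_TENSOR_MAP_DATA_TYPE_FLOAT32, 1, global_ptr,
                                global_dim, nullptr, // rank 1: no outer strides
                                box_dim, elem_stride, CU_TENSOR_MAP_INTERLEAVE_NONE,
                                CU_TENSOR_MAP_SWIZZLE_NONE, CU_TENSOR_MAP_L2_PROMOTION_NONE,
                                CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);
#else
  // Volta/Ampere: never reach the TMA API
  (void)map; (void)global_ptr; (void)n;
  return CUDA_ERROR_NOT_SUPPORTED;
#endif
}
```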
CMakeLists.txt Outdated

option(QUDA_DSLASH_DOUBLE_STORE "store a forwards shifted copy of the gauge fields for simplified Dslash indexing" OFF)
mark_as_advanced(QUDA_DSLASH_DOUBLE_STORE)
set(QUDA_DSLASH_PREFETCH_TMA "0" CACHE STRING "enable TMA prefetching (Hopper+, 0 - disable, 1 - bulk, 2 - tensor)")
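For reference, a configure line combining the options above might look like the following (build directory and other options omitted; QUDA_DSLASH_DOUBLE_STORE=ON is required for TMA-based prefetching per the PR description):

```shell
# example: enable double-store and bulk TMA prefetching
cmake -S . -B build \
  -DQUDA_DSLASH_DOUBLE_STORE=ON \
  -DQUDA_DSLASH_PREFETCH_TMA=1   # 0 - disable, 1 - bulk, 2 - tensor
```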
Contributor

Sticking a pin in our offline discussion about replacing numbers with string descriptors (iirc)

Member Author

…, placing the end face (which is otherwise lost) into the ghost
…the gauge field from the ghost region - ensures coalesced access regardless of partitioning
… - comms partitioning was effectively disabled for testing
…ow created unless TENSOR prefetching type is enabled
@havogt
Contributor

havogt commented Feb 6, 2026

cscs-ci run

3 participants