
Feature/prefetch2 #1604

Open
maddyscientist wants to merge 118 commits into develop from feature/prefetch2
Conversation

@maddyscientist
Member

This work is the latest towards optimizing QUDA for Blackwell:

  • Adds support for "spatial prefetching", where we over-fetch data into L2 when issuing a global load. Exposed as an optional template parameter to vector_load. At present, not deployed anywhere.
  • Adds support for prefetch instructions, in the form of both per-thread prefetching (which works on all CUDA architectures) and TMA-based prefetching, which is Hopper+ only. The prefetching type is set using the QUDA_DSLASH_PREFETCH CMake parameter, with 0=per-thread, 1=TMA bulk, and 2=TMA descriptor.
  • Adds an experimental L1 prefetch (using LDGSTS). Disabled, but left in for future experiments.
  • Adds the single-threaded execution-region helper function target::is_thread_zero(), which should be used for TMA issuance.
  • Optionally stores the backward-shifted gauge field. This simplifies all dslash indexing, as all spatial indices then correspond to "this" site. Enabled with QUDA_DSLASH_DOUBLE_STORE=ON, which is required for TMA-based prefetching (for alignment reasons).
  • Prefetching is exposed for both ColorSpinorFields and GaugeFields, though only the latter is actually used at present.
  • Added prefetching support to both Wilson and Staggered dslash kernels, parameterized using QUDA_DSLASH_PREFETCH_DISTANCE_WILSON and QUDA_DSLASH_PREFETCH_DISTANCE_STAGGERED CMake parameters.
  • Optimizes the neighbor indexing for the dslash kernels. This reduces integer instruction overhead.
  • Reduces pointer-arithmetic overhead (using more 32-bit integer operations where possible). Adds three-operand and four-operand variants of vector_load and vector_store, respectively, to this end.
  • Optimizes FFMA2 issuance to reduce the total number of floating-point instructions on Blackwell.
  • Optimizes short <-> float conversion to reduce instruction overheads.
  • Optimization of staggered packing kernels (replace division by int with division by fast_intdiv)
  • Extends OpenMP parallelization to some host code where it was missing.
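As a rough illustration of the per-thread prefetching described in the list above, a sketch might look like the following. Only the ideas come from this PR; the names (prefetch_l2, dslash_site, Arg::gauge, Arg::site_floats) are assumptions for the sketch, not the actual QUDA code.

```cuda
// Hypothetical sketch of per-thread L2 prefetching with a compile-time
// prefetch distance; names are illustrative, not QUDA's actual API.
__device__ inline void prefetch_l2(const void *ptr)
{
  // per-thread prefetch into L2; works on all CUDA architectures
  asm volatile("prefetch.global.L2 [%0];" ::"l"(ptr));
}

template <int prefetch_distance, typename Arg>
__device__ void dslash_site(const Arg &arg, int x_cb)
{
  if constexpr (prefetch_distance > 0) {
    // prefetch the gauge field for a site this thread will visit later
    prefetch_l2(arg.gauge + (x_cb + prefetch_distance) * Arg::site_floats);
  }
  // ... load and apply the gauge field for site x_cb as usual ...
}
```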

The end result of this work is that both the Staggered and Wilson dslash kernels can sustain over 90% of memory bandwidth for most variants. The outstanding exceptions are the half-precision variants using reconstruction, which still lag; these will be the focus of a subsequent PR.
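The single-threaded execution-region helper mentioned in the list above might be used along these lines; only the helper name target::is_thread_zero() is taken from the PR description, and the kernel body is purely illustrative.

```cuda
// Hypothetical sketch of gating TMA issuance on a single-threaded region.
namespace target
{
  __device__ inline bool is_thread_zero()
  {
    // single-threaded region: the first thread of the block
    return threadIdx.x == 0 && threadIdx.y == 0 && threadIdx.z == 0;
  }
} // namespace target

__global__ void kernel_with_tma_prefetch()
{
  if (target::is_thread_zero()) {
    // issue the TMA bulk prefetch here: exactly one issuance per block
  }
  __syncthreads(); // other threads wait until the prefetch is in flight
}
```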

…tead of logic operations when computing the neighboring index; this is branch-free and uses fewer operations
…for executing single-thread regions of code. On CUDA, install the latest version of CCCL via CPM since we need some new features
…slash kernels. Disabled by default (set with the Arg::prefetch_distance parameter), and TMA prefetch will be added in next push
…ith QUDA_DSLASH_PREFETCH_BULK=ON). Prefetch distance is now set via CMake (QUDA_DSLASH_PREFETCH_DISTANCE_WILSON and QUDA_DSLASH_PREFETCH_DISTANCE_STAGGERED)
…ants of vector_load and vector_store: these allow for the pointer offset and the index to be computed together first in 32-bit, before accumulation to the pointer in 64-bit, reducing pointer arithmetic overheads
…d and vector_store to reduce indexing overheads
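The commit above describes combining the offset and index in 32-bit before the 64-bit pointer accumulation; a minimal sketch of a three-operand variant, with an assumed signature rather than QUDA's actual one, is:

```cuda
#include <cstdint>

// Illustrative three-operand vector_load: offset and index are combined
// in cheap 32-bit arithmetic first, leaving a single 64-bit pointer
// accumulation. The signature is an assumption for this sketch.
template <typename VectorType>
__device__ inline VectorType vector_load(const VectorType *ptr, int32_t offset, int32_t idx)
{
  // 32-bit add is cheaper than carrying both terms into 64-bit math
  const int32_t combined = offset + idx;
  return ptr[combined]; // one 64-bit accumulation onto the pointer
}
```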
TMA (Tensor Memory Accelerator) is only available on Hopper (sm_90+) and
later architectures. This commit wraps the cuTensorMapEncodeTiled calls
with a compile-time guard to prevent runtime errors on Volta/Ampere GPUs.
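A sketch of such a guard follows. QUDA_TARGET_SM is a hypothetical macro standing in for however the build exposes the target architecture, and the rank-1 descriptor arguments are illustrative; cuTensorMapEncodeTiled itself is the real CUDA 12 driver-API entry point.

```cuda
#include <cuda.h>

// Guard sketch: only compile the TMA descriptor call for Hopper (sm_90+),
// so pre-Hopper builds can never hit it at runtime.
static CUresult encode_tma_map(CUtensorMap *map, void *global_ptr, cuuint64_t n)
{
#if defined(QUDA_TARGET_SM) && (QUDA_TARGET_SM >= 90)
  cuuint64_t global_dim[1] = {n};
  cuuint32_t box_dim[1] = {256};  // example box size
  cuuint32_t elem_stride[1] = {1};
  return cuTensorMapEncodeTiled(map, CU_TENSOR_MAP_DATA_TYPE_FLOAT32, 1, global_ptr,
                                global_dim, nullptr, // rank 1: no outer strides
                                box_dim, elem_stride, CU_TENSOR_MAP_INTERLEAVE_NONE,
                                CU_TENSOR_MAP_SWIZZLE_NONE, CU_TENSOR_MAP_L2_PROMOTION_NONE,
                                CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);
#else
  // Volta/Ampere: never reach the TMA API
  (void)map; (void)global_ptr; (void)n;
  return CUDA_ERROR_NOT_SUPPORTED;
#endif
}
```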
CMakeLists.txt Outdated

option(QUDA_DSLASH_DOUBLE_STORE "store a forwards shifted copy of the gauge fields for simplified Dslash indexing" OFF)
mark_as_advanced(QUDA_DSLASH_DOUBLE_STORE)
set(QUDA_DSLASH_PREFETCH_TMA "0" CACHE STRING "enable TMA prefetching (Hopper+, 0 - disable, 1 - bulk, 2 - tensor)")
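For reference, a configure line combining the options above might look like the following (build directory and other options omitted; QUDA_DSLASH_DOUBLE_STORE=ON is required for TMA-based prefetching per the PR description):

```shell
# example: enable double-store and bulk TMA prefetching
cmake -S . -B build \
  -DQUDA_DSLASH_DOUBLE_STORE=ON \
  -DQUDA_DSLASH_PREFETCH_TMA=1   # 0 - disable, 1 - bulk, 2 - tensor
```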
Contributor

Sticking a pin in our offline discussion about replacing numbers with string descriptors (iirc)

Member Author

…, placing the end face (which is otherwise lost) into the ghost
…the gauge field from the ghost region - ensures coalesced access regardless of partitioning
… - comms partitioning was effectively disabled for testing
…ow created unless TENSOR prefetching type is enabled
@havogt
Contributor

havogt commented Feb 6, 2026

cscs-ci run

3 participants