
Conversation

@celsowm celsowm commented Jun 8, 2025

This commit introduces an initial implementation of a paged key-value (KV) cache
and corresponding paged attention mechanisms for CUDA-enabled GPUs in llama.cpp.
The primary goal is to improve memory efficiency for handling long or multiple
sequences by mitigating KV cache fragmentation.
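
As a rough sketch of the idea (the names below are illustrative only, not the
actual `llama_kv_page` / `llama_paged_kv_cells` types added here): each sequence
keeps a small page table rather than one contiguous K/V slab, a logical token
position resolves to a page plus an offset, and freeing a sequence returns whole
fixed-size pages to the pool instead of leaving fragmented holes.

```cuda
// Illustrative sketch only -- the real bookkeeping lives in
// llama_paged_kv_cells / llama_paged_kv_cache.
#include <cstdint>
#include <vector>

struct kv_slot {
    int32_t page_id;   // index of a fixed-size page in the shared pool
    int32_t offset;    // token slot within that page
};

struct seq_page_table {
    int32_t tokens_per_page;
    std::vector<int32_t> pages;   // pages owned by this sequence, in order

    // Resolve a logical token position to its physical page/offset.
    kv_slot locate(int32_t pos) const {
        return { pages[pos / tokens_per_page], pos % tokens_per_page };
    }
};
```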

Key Components:

1.  **CPU Paged KV Cache:**
    *   `llama_kv_page.h`: Defines `struct llama_kv_page`.
    *   `llama_paged_kv_cells.h/.cpp`: Implements `llama_paged_kv_cells` for
        managing fixed-size memory pages allocated from a larger GGML pool.
        Handles token-to-page/offset mapping.
    *   `llama_paged_kv_cache.h/.cpp`: Implements `llama_paged_kv_cache`
        (inheriting from `llama_memory_i`). This class allocates its main
        page pool via GGML (intended to use a paged allocator) and uses
        `llama_paged_kv_cells` for page management. Sequence operations
        (`seq_add`, `seq_rm`, `seq_cp`, `seq_div`) and state serialization
        (`state_write`, `state_read`) are implemented.

2.  **GGML Allocator Modifications:**
    *   `ggml-alloc.c/.h`:
        *   `ggml_dyn_tallocr` now supports a `paged` mode, managing its
            memory in page-sized units.
        *   `ggml_gallocr` can now instantiate paged `ggml_dyn_tallocr`
            instances for specific buffer types via a new
            `get_page_size` interface method in `ggml_backend_buffer_type_i`.
    *   `llama.cpp` is updated to enable paged allocation for the KV cache
        buffer type when `use_paged_kv_cache` is true.

3.  **CUDA Paged Attention Kernels:**
    *   `ggml-cuda/paged_attn_common.cuh`: Defines GPU data structures
        (`paged_kv_token_mapping_gpu`, `paged_kv_sequence_view_gpu`) and
        a device helper (`get_paged_kv_data_ptr_cuda`) for paged access;
        a simplified sketch of this lookup follows the component list.
    *   `ggml-cuda/fattn-mma-f16.cuh`: Implements paged versions of the MMA F16
        attention kernels. Supports F16 and Q8_0 K/V data (Q8_0 is
        dequantized to F16 in shared memory). Includes the data gather from
        pages and the integration of the computation logic.
    *   `ggml-cuda/fattn-tile-f16.cuh`: Implements paged versions of the Tile F16
        attention kernels, including data gather and computation.
    *   `ggml-cuda.cu`: The main Flash Attention dispatcher
        (`ggml_cuda_flash_attn_ext`) now uses an `op_params` flag and
        `ggml_tensor->extra` to differentiate paged calls and pass necessary
        view information to the paged CUDA kernels.

4.  **Unit Tests (`tests/test-paged-kv-cache.cpp`):**
    *   Comprehensive checks for CPU-side `llama_paged_kv_cells` and
        `llama_paged_kv_cache` functionalities (allocation, sequence ops,
        state R/W).
    *   Correctness checks for CUDA MMA F16/Q8_0 and Tile F16 paged
        attention paths, comparing outputs against non-paged reference
        implementations. Includes GPU memory management for test data.
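
To give a feel for the device-side lookup described in item 3, here is a
simplified stand-in for the sequence view and pointer-resolution helper. The
field names and layout below are hypothetical; the actual definitions live in
`ggml-cuda/paged_attn_common.cuh` and may look quite different.

```cuda
// Simplified stand-in for the paged K/V lookup on the device side.
// Field names and layout are illustrative, not the actual structures.
#include <cstddef>
#include <cstdint>

struct paged_kv_view_sketch {
    const char * const * page_ptrs;        // device pointers to the K (or V) pages
    const int32_t      * token_to_page;    // per token: which page
    const int32_t      * token_to_offset;  // per token: slot within the page
    size_t               bytes_per_token;  // stride of one token's K (or V) row
};

// Resolve the K/V row for one token of the sequence view.
__device__ inline const char * paged_kv_row(const paged_kv_view_sketch & v, int token) {
    const char * page = v.page_ptrs[v.token_to_page[token]];
    return page + (size_t) v.token_to_offset[token] * v.bytes_per_token;
}
```

In this PR the host side packs such a view and hands it to the paged kernels
via the `op_params` flag and `ggml_tensor->extra`, as noted for the dispatcher
bullet above.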

**Current Status & Limitations:**

*   **CUDA Focus**: This implementation primarily targets CUDA.
*   **Metal Deferred**: Metal paged attention implementation was blocked by
    persistent tooling issues and is not included.
*   **Performance**: The CUDA paged attention kernels are functional but have
    not been profiled or tuned beyond an initially sensible structure. The
    data gather step, in particular, may introduce overhead compared to
    contiguous access (a simplified illustration follows this list).
*   **Documentation**: Essential comments have been added to key new structures
    and logic, but comprehensive documentation across all modified components
    is not yet complete.
*   **CUDA Variants**: The core MMA and Tile F16/Q8_0 paths are covered. Other
    CUDA variants (e.g., WMMA for older GPUs, or dedicated F32 paths that do
    not simply reuse the F16 logic with a type change) may not yet have paged
    versions.
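
To make the gather-overhead concern concrete, compare a contiguous K-tile load
with a paged one. This hypothetical snippet reuses `paged_kv_view_sketch` /
`paged_kv_row` from the sketch above and is not the actual kernel code.

```cuda
// Illustrative only: contrasts a contiguous K-tile load with a paged gather.
// Builds on paged_kv_view_sketch / paged_kv_row from the earlier sketch.
constexpr int TILE = 32;   // hypothetical tile height (number of K rows)

__device__ void load_k_tile_contiguous(char * dst, const char * k_base,
                                       int first_token, int row_bytes) {
    // Non-paged: the TILE rows are adjacent, so one regular strided copy suffices.
    for (int i = threadIdx.x; i < TILE * row_bytes; i += blockDim.x) {
        dst[i] = k_base[(size_t) first_token * row_bytes + i];
    }
}

__device__ void load_k_tile_paged(char * dst, const paged_kv_view_sketch & v,
                                  int first_token, int row_bytes) {
    // Paged: each row needs an indirection through the page table, and
    // consecutive rows may live in different pages, so loads are less regular.
    for (int r = 0; r < TILE; ++r) {
        const char * src = paged_kv_row(v, first_token + r);
        for (int i = threadIdx.x; i < row_bytes; i += blockDim.x) {
            dst[(size_t) r * row_bytes + i] = src[i];
        }
    }
}
```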

This change provides a foundational implementation of paged KV cache and
CUDA paged attention, paving the way for further enhancements and broader
GPU support.
@celsowm celsowm requested a review from JohannesGaessler as a code owner June 8, 2025 18:34
@github-actions github-actions bot added the testing, Nvidia GPU, ggml, and Apple Metal labels Jun 8, 2025
@JohannesGaessler (Collaborator)

Don't just submit 100% machine-generated PRs. This code doesn't even compile.

@celsowm celsowm closed this Jun 8, 2025
@celsowm celsowm deleted the paged-attention-cuda branch June 8, 2025 19:01