
Conversation

@celsowm celsowm commented Jun 8, 2025

This commit introduces an initial implementation of a paged key-value (KV) cache
and corresponding paged attention mechanisms for CUDA-enabled GPUs in llama.cpp.
The primary goal is to improve memory efficiency for handling long or multiple
sequences by mitigating KV cache fragmentation.
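
As a rough sketch of the idea (the names below are illustrative only, not the
actual `llama_kv_page` / `llama_paged_kv_cells` types added here): each sequence
keeps a small page table rather than one contiguous K/V slab, a logical token
position resolves to a page plus an offset, and freeing a sequence returns whole
fixed-size pages to the pool instead of leaving fragmented holes.

```cuda
// Illustrative sketch only -- the real bookkeeping lives in
// llama_paged_kv_cells / llama_paged_kv_cache.
#include <cstdint>
#include <vector>

struct kv_slot {
    int32_t page_id;   // index of a fixed-size page in the shared pool
    int32_t offset;    // token slot within that page
};

struct seq_page_table {
    int32_t tokens_per_page;
    std::vector<int32_t> pages;   // pages owned by this sequence, in order

    // Resolve a logical token position to its physical page/offset.
    kv_slot locate(int32_t pos) const {
        return { pages[pos / tokens_per_page], pos % tokens_per_page };
    }
};
```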

Key Components:

1.  **CPU Paged KV Cache:**
    *   `llama_kv_page.h`: Defines `struct llama_kv_page`.
    *   `llama_paged_kv_cells.h/.cpp`: Implements `llama_paged_kv_cells` for
        managing fixed-size memory pages allocated from a larger GGML pool.
        Handles token-to-page/offset mapping.
    *   `llama_paged_kv_cache.h/.cpp`: Implements `llama_paged_kv_cache`
        (inheriting from `llama_memory_i`). This class allocates its main
        page pool via GGML (intended to use a paged allocator) and uses
        `llama_paged_kv_cells` for page management. Sequence operations
        (`seq_add`, `seq_rm`, `seq_cp`, `seq_div`) and state serialization
        (`state_write`, `state_read`) are implemented.

2.  **GGML Allocator Modifications:**
    *   `ggml-alloc.c/.h`:
        *   `ggml_dyn_tallocr` now supports a `paged` mode, managing its
            memory in page-sized units.
        *   `ggml_gallocr` can now instantiate paged `ggml_dyn_tallocr`
            instances for specific buffer types via a new
            `get_page_size` interface method in `ggml_backend_buffer_type_i`.
    *   `llama.cpp` is updated to enable paged allocation for the KV cache
        buffer type when `use_paged_kv_cache` is true.

3.  **CUDA Paged Attention Kernels:**
    *   `ggml-cuda/paged_attn_common.cuh`: Defines GPU data structures
        (`paged_kv_token_mapping_gpu`, `paged_kv_sequence_view_gpu`) and
        a device helper (`get_paged_kv_data_ptr_cuda`) for paged access;
        a simplified sketch of this lookup follows the component list.
    *   `ggml-cuda/fattn-mma-f16.cuh`: Implements paged versions of the MMA F16
        attention kernels. Supports F16 and Q8_0 K/V data (Q8_0 is
        dequantized to F16 in shared memory). Includes the data gather from
        pages and the integration of the computation logic.
    *   `ggml-cuda/fattn-tile-f16.cuh`: Implements paged versions of the Tile F16
        attention kernels, including data gather and computation.
    *   `ggml-cuda.cu`: The main Flash Attention dispatcher
        (`ggml_cuda_flash_attn_ext`) now uses an `op_params` flag and
        `ggml_tensor->extra` to differentiate paged calls and pass necessary
        view information to the paged CUDA kernels.

4.  **Unit Tests (`tests/test-paged-kv-cache.cpp`):**
    *   Comprehensive checks for CPU-side `llama_paged_kv_cells` and
        `llama_paged_kv_cache` functionalities (allocation, sequence ops,
        state R/W).
    *   Correctness checks for CUDA MMA F16/Q8_0 and Tile F16 paged
        attention paths, comparing outputs against non-paged reference
        implementations. Includes GPU memory management for test data.
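
To give a feel for the device-side lookup described in item 3, here is a
simplified stand-in for the sequence view and pointer-resolution helper. The
field names and layout below are hypothetical; the actual definitions live in
`ggml-cuda/paged_attn_common.cuh` and may look quite different.

```cuda
// Simplified stand-in for the paged K/V lookup on the device side.
// Field names and layout are illustrative, not the actual structures.
#include <cstddef>
#include <cstdint>

struct paged_kv_view_sketch {
    const char * const * page_ptrs;        // device pointers to the K (or V) pages
    const int32_t      * token_to_page;    // per token: which page
    const int32_t      * token_to_offset;  // per token: slot within the page
    size_t               bytes_per_token;  // stride of one token's K (or V) row
};

// Resolve the K/V row for one token of the sequence view.
__device__ inline const char * paged_kv_row(const paged_kv_view_sketch & v, int token) {
    const char * page = v.page_ptrs[v.token_to_page[token]];
    return page + (size_t) v.token_to_offset[token] * v.bytes_per_token;
}
```

In this PR the host side packs such a view and hands it to the paged kernels
via the `op_params` flag and `ggml_tensor->extra`, as noted for the dispatcher
bullet above.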

**Current Status & Limitations:**

*   **CUDA Focus**: This implementation primarily targets CUDA.
*   **Metal Deferred**: Metal paged attention implementation was blocked by
    persistent tooling issues and is not included.
*   **Performance**: The CUDA paged attention kernels are functional but have
    not been profiled or tuned beyond an initially sensible structure. The
    data gather step, in particular, may introduce overhead compared to
    contiguous access (a simplified illustration follows this list).
*   **Documentation**: Essential comments have been added to key new structures
    and logic, but comprehensive documentation across all modified components
    is not yet complete.
*   **CUDA Variants**: The core MMA and Tile F16/Q8_0 paths are covered. Other
    CUDA variants (e.g., WMMA for older GPUs, or dedicated F32 paths that do
    not simply reuse the F16 logic with a type change) may not yet have paged
    versions.
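
To make the gather-overhead concern concrete, compare a contiguous K-tile load
with a paged one. This hypothetical snippet reuses `paged_kv_view_sketch` /
`paged_kv_row` from the sketch above and is not the actual kernel code.

```cuda
// Illustrative only: contrasts a contiguous K-tile load with a paged gather.
// Builds on paged_kv_view_sketch / paged_kv_row from the earlier sketch.
constexpr int TILE = 32;   // hypothetical tile height (number of K rows)

__device__ void load_k_tile_contiguous(char * dst, const char * k_base,
                                       int first_token, int row_bytes) {
    // Non-paged: the TILE rows are adjacent, so one regular strided copy suffices.
    for (int i = threadIdx.x; i < TILE * row_bytes; i += blockDim.x) {
        dst[i] = k_base[(size_t) first_token * row_bytes + i];
    }
}

__device__ void load_k_tile_paged(char * dst, const paged_kv_view_sketch & v,
                                  int first_token, int row_bytes) {
    // Paged: each row needs an indirection through the page table, and
    // consecutive rows may live in different pages, so loads are less regular.
    for (int r = 0; r < TILE; ++r) {
        const char * src = paged_kv_row(v, first_token + r);
        for (int i = threadIdx.x; i < row_bytes; i += blockDim.x) {
            dst[(size_t) r * row_bytes + i] = src[i];
        }
    }
}
```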

This change provides a foundational implementation of paged KV cache and
CUDA paged attention, paving the way for further enhancements and broader
GPU support.
@celsowm celsowm requested a review from JohannesGaessler as a code owner June 8, 2025 18:34
@github-actions github-actions bot added the testing, Nvidia GPU, ggml, and Apple Metal labels Jun 8, 2025
@JohannesGaessler (Collaborator)

Don't just submit 100% machine-generated PRs. This code doesn't even compile.

@celsowm celsowm closed this Jun 8, 2025
@celsowm celsowm deleted the paged-attention-cuda branch June 8, 2025 19:01