feat: Implement Paged KV Cache and CUDA Paged Attention #14070
Closed
This commit introduces an initial implementation of a paged key-value (KV) cache and corresponding paged attention mechanisms for CUDA-enabled GPUs in llama.cpp. The primary goal is to improve memory efficiency for handling long or multiple sequences by mitigating KV cache fragmentation.
Key Components:
CPU Paged KV Cache:
- `llama_kv_page.h`: Defines `struct llama_kv_page`.
- `llama_paged_kv_cells.h/.cpp`: Implements `llama_paged_kv_cells` for managing fixed-size memory pages allocated from a larger GGML pool. Handles the token-to-page/offset mapping (sketched below).
- `llama_paged_kv_cache.h/.cpp`: Implements `llama_paged_kv_cache` (inheriting from `llama_memory_i`). This class allocates its main page pool via GGML (intended to use a paged allocator) and uses `llama_paged_kv_cells` for page management. Sequence operations (`seq_add`, `seq_rm`, `seq_cp`, `seq_div`) and state serialization (`state_write`, `state_read`) are implemented.
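To make the token-to-page/offset mapping concrete, here is a minimal sketch of the idea under assumed data layouts; apart from `llama_kv_page`, the class, field, and method names are placeholders and not the actual definitions in this PR.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Page descriptor; the real struct in llama_kv_page.h may carry more state.
struct llama_kv_page {
    uint32_t id     = 0; // index of this page inside the GGML-backed pool
    uint32_t n_used = 0; // number of KV cells currently occupied in the page
};

// Minimal mapping logic: logical cell index -> (physical page id, in-page offset).
class paged_cells_sketch {
public:
    explicit paged_cells_sketch(uint32_t cells_per_page) : cells_per_page(cells_per_page) {}

    // Grow the cache by one physical page taken from the pool.
    void push_page(uint32_t physical_page_id) { page_table.push_back(physical_page_id); }

    // Translate a logical cell index into its page and offset within that page.
    std::pair<uint32_t, uint32_t> locate(uint32_t cell) const {
        const uint32_t slot   = cell / cells_per_page; // which page-table entry
        const uint32_t offset = cell % cells_per_page; // position inside that page
        return { page_table.at(slot), offset };
    }

private:
    uint32_t cells_per_page;
    std::vector<uint32_t> page_table; // logical page slot -> physical page id
};
```

Because sequences only hold page references rather than one contiguous slab, operations such as `seq_cp` or `seq_rm` can work at page granularity instead of moving the underlying K/V data.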
GGML Allocator Modifications:
- `ggml-alloc.c/.h`: `ggml_dyn_tallocr` now supports a paged mode, managing its memory in page-sized units (see the sketch below).
- `ggml_gallocr` can now instantiate paged `ggml_dyn_tallocr` instances for specific buffer types via a new `get_page_size` interface method in `ggml_backend_buffer_type_i`.
- `llama.cpp` is updated to enable paged allocation for the KV cache buffer type when `use_paged_kv_cache` is true.
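The paged allocator mode essentially manages the pool in whole pages. The helper below is an illustration of the arithmetic involved, not ggml code; `get_page_size` is the only name taken from this PR, and here it appears only as a comment.

```cpp
#include <cstddef>

// Number of whole pages needed to satisfy an allocation request, as a paged
// allocator must compute when it manages memory in page-sized units.
static size_t pages_needed(size_t size, size_t page_size) {
    return (size + page_size - 1) / page_size;
}

// Actual pool space consumed by the request after rounding up to pages.
static size_t paged_alloc_size(size_t size, size_t page_size) {
    return pages_needed(size, page_size) * page_size;
}

// Example: with a 64 KiB page size reported by the buffer type (via something
// like get_page_size), a 100 KiB KV-cache slice occupies 2 pages = 128 KiB.
```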
CUDA Paged Attention Kernels:
- `ggml-cuda/paged_attn_common.cuh`: Defines GPU data structures (`paged_kv_token_mapping_gpu`, `paged_kv_sequence_view_gpu`) and a device helper (`get_paged_kv_data_ptr_cuda`) for paged access (see the gather sketch below).
- `ggml-cuda/fattn-mma-f16.cuh`: Implements paged versions of the MMA F16 attention kernels. Supports F16 and Q8_0 K/V data (Q8_0 is dequantized to F16 in shared memory). Includes the data gather from pages and its integration into the computation logic.
- `ggml-cuda/fattn-tile-f16.cuh`: Implements paged versions of the tile F16 attention kernels, including data gather and computation.
- `ggml-cuda.cu`: The main flash attention dispatcher (`ggml_cuda_flash_attn_ext`) now uses an `op_params` flag and `ggml_tensor->extra` to differentiate paged calls and pass the necessary view information to the paged CUDA kernels.
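The gather step these kernels perform can be pictured as a per-token page-table lookup. The structs and helper below are an illustrative reconstruction, not the actual definitions in `paged_attn_common.cuh`: only the two type names are borrowed from the PR, and their field layouts are assumptions.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed layout: where one token's K/V row lives in the page pool.
struct paged_kv_token_mapping_gpu {
    int32_t page_idx;    // which physical page holds this token's K/V data
    int32_t page_offset; // token slot inside that page
};

// Assumed layout: the per-sequence view the host passes to the kernel.
struct paged_kv_sequence_view_gpu {
    const paged_kv_token_mapping_gpu * mappings; // one entry per token in the view
    void * const * page_ptrs;                    // device base addresses of the pages
    int32_t n_tokens;
    int32_t bytes_per_token;                     // row size of one token's K (or V) data
};

// Resolve the device address of token i's K/V row through the page table,
// analogous in spirit to get_paged_kv_data_ptr_cuda.
__device__ inline const char * paged_kv_row(const paged_kv_sequence_view_gpu & view, int i) {
    const paged_kv_token_mapping_gpu m = view.mappings[i];
    const char * page = (const char *) view.page_ptrs[m.page_idx];
    return page + (size_t) m.page_offset * view.bytes_per_token;
}
```

This indirection is what replaces the contiguous pointer arithmetic of the non-paged kernels, and is also where the extra gather overhead noted under the limitations comes from.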
Unit Tests (`tests/test-paged-kv-cache.cpp`):
- Covers `llama_paged_kv_cells` and `llama_paged_kv_cache` functionality (allocation, sequence ops, state R/W).
- Exercises the paged attention paths, comparing outputs against non-paged reference implementations. Includes GPU memory management for test data (a comparison sketch follows below).
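For reference, the paged-vs-reference comparison in such a test reduces to an element-wise tolerance check along these lines; this is an illustrative sketch, not the code in `tests/test-paged-kv-cache.cpp`, and the tolerance value is an assumption.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Compare the paged attention output against the non-paged reference output.
static bool outputs_match(const std::vector<float> & paged,
                          const std::vector<float> & reference,
                          float tol = 1e-3f) {
    if (paged.size() != reference.size()) {
        return false;
    }
    for (size_t i = 0; i < paged.size(); ++i) {
        if (std::fabs(paged[i] - reference[i]) > tol) {
            std::fprintf(stderr, "mismatch at %zu: %f vs %f\n", i, paged[i], reference[i]);
            return false;
        }
    }
    return true;
}
```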
Current Status & Limitations:
- One planned addition ran into persistent tooling issues and is not included.
- The paged code paths have not undergone specific performance profiling or optimization beyond an initial, sensible structuring. The data gather step, in particular, might introduce overhead compared to contiguous access.
- Comments explain the key structures and logic, but comprehensive documentation across all modified components is not yet complete.
- Some CUDA variants (e.g., WMMA for older GPUs, or specific F32 paths if they don't reuse the F16 logic with type changes) may not have paged versions.
This change provides a foundational implementation of paged KV cache and CUDA paged attention, paving the way for further enhancements and broader GPU support.