Commit fb29ea2
feat: Implement Paged KV Cache and CUDA Paged Attention
This commit introduces an initial implementation of a paged key-value (KV) cache
and corresponding paged attention mechanisms for CUDA-enabled GPUs in llama.cpp.
The primary goal is to improve memory efficiency for handling long or multiple
sequences by mitigating KV cache fragmentation.
Key Components:
1. **CPU Paged KV Cache:**
* `llama_kv_page.h`: Defines `struct llama_kv_page`.
* `llama_paged_kv_cells.h/.cpp`: Implements `llama_paged_kv_cells` for
managing fixed-size memory pages allocated from a larger GGML pool.
Handles the token-to-page/offset mapping (sketched below this item).
* `llama_paged_kv_cache.h/.cpp`: Implements `llama_paged_kv_cache`
(inheriting from `llama_memory_i`). This class allocates its main
page pool via GGML (intended to use a paged allocator) and uses
`llama_paged_kv_cells` for page management. Sequence operations
(`seq_add`, `seq_rm`, `seq_cp`, `seq_div`) and state serialization
(`state_write`, `state_read`) are implemented.
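As a rough illustration of the bookkeeping described in item 1, the sketch below shows a minimal token-to-(page, offset) mapping. The type names `llama_kv_page` and `llama_paged_kv_cells` come from this commit, but the fields and the `place_token` helper are assumptions for illustration, not the actual code:

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical fields; the commit's actual struct may differ.
struct llama_kv_page {
    uint32_t id;        // index of this page within the GGML-allocated pool
    uint32_t n_used;    // token slots currently occupied
    uint32_t capacity;  // token slots per page (fixed across the pool)
};

struct llama_paged_kv_cells {
    std::vector<llama_kv_page> pages;  // all pages in the pool, indexed by id
    // token position -> (page id, offset within that page)
    std::unordered_map<int64_t, std::pair<uint32_t, uint32_t>> token_map;

    // Hypothetical helper: place a token in the first page with a free slot.
    bool place_token(int64_t pos, uint32_t & page_id, uint32_t & offset) {
        for (auto & p : pages) {
            if (p.n_used < p.capacity) {
                page_id = p.id;
                offset  = p.n_used++;
                token_map[pos] = {page_id, offset};
                return true;
            }
        }
        return false;  // pool exhausted; the cache must free pages or grow
    }
};
```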
2. **GGML Allocator Modifications:**
* `ggml-alloc.c/.h`:
* `ggml_dyn_tallocr` now supports a `paged` mode, managing its
memory in page-sized units.
* `ggml_gallocr` can now instantiate paged `ggml_dyn_tallocr`
instances for specific buffer types via a new
`get_page_size` interface method in `ggml_backend_buffer_type_i`.
* `llama.cpp` is updated to enable paged allocation for the KV cache
buffer type when `use_paged_kv_cache` is true (the new hook is
sketched below this item).
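The `get_page_size` hook named above could look roughly like the sketch below. `ggml_backend_buffer_type_i` is GGML's existing buffer-type vtable; the abbreviated members shown and the exact hook signature are assumptions based on the commit description:

```cpp
#include <cstddef>

typedef struct ggml_backend_buffer_type * ggml_backend_buffer_type_t;

struct ggml_backend_buffer_type_i {
    const char * (*get_name)     (ggml_backend_buffer_type_t buft);
    size_t       (*get_alignment)(ggml_backend_buffer_type_t buft);
    // ... other existing members elided ...

    // New in this commit (signature assumed): returns the page granularity
    // for paged allocation, or 0 if this buffer type has no paged mode.
    size_t       (*get_page_size)(ggml_backend_buffer_type_t buft);
};
```

With a hook like this, `ggml_gallocr` can instantiate a paged `ggml_dyn_tallocr` whenever the buffer type reports a nonzero page size, and fall back to the contiguous allocator otherwise.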
3. **CUDA Paged Attention Kernels:**
* `ggml-cuda/paged_attn_common.cuh`: Defines GPU data structures
(`paged_kv_token_mapping_gpu`, `paged_kv_sequence_view_gpu`) and
a device helper (`get_paged_kv_data_ptr_cuda`) for paged access
(sketched below this item).
* `ggml-cuda/fattn-mma-f16.cuh`: Implements paged versions of the MMA F16
attention kernels. Supports F16 and Q8_0 K/V data (Q8_0 is
dequantized to F16 in shared memory); K/V data is gathered from the
pages and fed into the existing computation logic.
* `ggml-cuda/fattn-tile-f16.cuh`: Implements paged versions of the Tile F16
attention kernels, including the data gather and computation.
* `ggml-cuda.cu`: The main Flash Attention dispatcher
(`ggml_cuda_flash_attn_ext`) now uses an `op_params` flag and
`ggml_tensor->extra` to differentiate paged calls and pass necessary
view information to the paged CUDA kernels.
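The paged-access structures and device helper named in item 3 might look like the following sketch; the field layout and the helper's signature are assumptions for illustration:

```cuda
#include <cstdint>

// One entry per token: where its K/V rows live in the page pool.
struct paged_kv_token_mapping_gpu {
    int32_t page_id;  // page in the pool holding this token's K/V data
    int32_t offset;   // token slot within that page
};

// Hypothetical per-sequence view handed to the paged kernels.
struct paged_kv_sequence_view_gpu {
    const paged_kv_token_mapping_gpu * mappings;   // one entry per token
    const char * const *               page_ptrs;  // device base pointer per page
    int32_t                            n_tokens;   // tokens covered by this view
    int32_t                            stride;     // bytes per token slot
};

// Resolve the device pointer for token i's K (or V) row by indirecting
// through the page table instead of assuming one contiguous buffer.
static __device__ const char * get_paged_kv_data_ptr_cuda(
        const paged_kv_sequence_view_gpu & view, const int i) {
    const paged_kv_token_mapping_gpu m = view.mappings[i];
    return view.page_ptrs[m.page_id] + (size_t) m.offset * view.stride;
}
```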
4. **Unit Tests (`tests/test-paged-kv-cache.cpp`):**
* Comprehensive checks of the CPU-side `llama_paged_kv_cells` and
`llama_paged_kv_cache` functionality (allocation, sequence ops,
state read/write).
* Correctness checks for the CUDA MMA F16/Q8_0 and Tile F16 paged
attention paths, comparing outputs against non-paged reference
implementations; includes GPU memory management for test data
(the shape of the comparison is sketched below this item).
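In spirit, the CUDA checks run the paged path and a non-paged reference on the same inputs and compare the output buffers. A simplified shape of that comparison follows; the helper name and tolerance are illustrative, not the test's actual code:

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Illustrative tolerance check: F16 accumulation makes exact equality
// between the paged and non-paged paths too strict.
static bool outputs_match(const std::vector<float> & paged,
                          const std::vector<float> & reference,
                          const float tol = 1e-3f) {
    if (paged.size() != reference.size()) {
        return false;
    }
    for (size_t i = 0; i < reference.size(); ++i) {
        if (std::fabs(paged[i] - reference[i]) > tol) {
            fprintf(stderr, "mismatch at %zu: %f vs %f\n", i, paged[i], reference[i]);
            return false;
        }
    }
    return true;
}
```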
**Current Status & Limitations:**
* **CUDA Focus**: This implementation primarily targets CUDA.
* **Metal Deferred**: Metal paged attention implementation was blocked by
persistent tooling issues and is not included.
* **Performance**: While functional, the CUDA paged attention kernels have
not been profiled or tuned beyond an initial, sensible structure. The
data gather step, in particular, may introduce overhead compared to
contiguous access.
* **Documentation**: Essential comments have been added to key new structures
and logic, but comprehensive documentation across all modified components
is not yet complete.
* **CUDA Variants**: Core MMA and Tile F16/Q8_0 paths are covered. Other
CUDA variants (e.g., WMMA for older GPUs, specific F32 paths if they
don't reuse F16 logic with type changes) may not have paged versions.
This change provides a foundational implementation of paged KV cache and
CUDA paged attention, paving the way for further enhancements and broader
GPU support.
File tree (18 files changed: +6720, −1413)
- ggml
  - include
  - src
    - ggml-cuda
    - ggml-metal
- include
- src
- tests