
Enabling Prefix Cache Mode in KTransformers

Balance serve now supports prefix cache reuse. To enable Prefix Cache Mode in KTransformers, modify the configuration file and recompile the project.

Step 1: Modify the Configuration File

Edit ./ktransformers/configs/config.yaml so that it contains the following settings (adjust the values to your needs):

attn:
  page_size: 16 # Size of a page in KV Cache.
  chunk_size: 256
kvc2:
  gpu_only: false # Set to false to enable prefix cache mode (Disk + CPU + GPU KV storage)
  utilization_percentage: 1.0
  cpu_memory_size_GB: 500 # Amount of CPU memory allocated for KV Cache
  disk_path: /mnt/data/kvc # Path to store KV Cache on disk
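Before recompiling, it can help to sanity-check the values you just edited. The following is a minimal sketch, not part of KTransformers: the function name `check_prefix_cache_config` is hypothetical, it expects the YAML already loaded into a plain dict, and the checks themselves (for example, that the `disk_path` directory already exists) are assumptions about what a reasonable setup looks like rather than documented requirements.

```python
from pathlib import Path


def check_prefix_cache_config(cfg: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means nothing was flagged.

    Hypothetical helper: `cfg` is the parsed content of config.yaml as a dict.
    """
    problems = []
    kvc2 = cfg.get("kvc2", {})

    # Prefix cache mode is enabled by setting gpu_only to false.
    if kvc2.get("gpu_only") is not False:
        problems.append("kvc2.gpu_only is not set to false")

    # Assumption: a zero or negative CPU memory budget is a mistake.
    if kvc2.get("cpu_memory_size_GB", 0) <= 0:
        problems.append("kvc2.cpu_memory_size_GB should be a positive number")

    # Assumption: the on-disk cache directory should exist before starting.
    disk_path = kvc2.get("disk_path")
    if not disk_path:
        problems.append("kvc2.disk_path is not set")
    elif not Path(disk_path).is_dir():
        problems.append(f"kvc2.disk_path does not exist yet: {disk_path}")

    return problems
```

An empty return value only means that none of these illustrative checks fired; it does not guarantee that the server accepts the configuration.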

Step 2: Update Submodules and Recompile

If this is your first time using prefix cache mode, please update the submodules first:

git submodule update --init --recursive # Update PhotonLibOS submodule

Then recompile the project:

# Install single NUMA dependencies
USE_BALANCE_SERVE=1 bash ./install.sh
# For machines with two CPUs and 1 TB of RAM (dual NUMA):
USE_BALANCE_SERVE=1 USE_NUMA=1 bash ./install.sh

Note

Balance serve uses a three-layer (GPU, CPU, disk) scheme to store and reuse KVCache. Deleting KVCache through the server is not supported yet. If the cache grows too large, delete the KVCache files on disk manually (the directory set by disk_path).
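Manual cleanup can be scripted. The sketch below rests on assumptions the note does not confirm: that every regular file under the configured `disk_path` is KVCache data, and that the server is stopped while files are removed. The function names `kvcache_disk_usage` and `remove_kvcache_files` are hypothetical, and deletion defaults to a dry run so that nothing is removed by accident.

```python
from pathlib import Path


def kvcache_disk_usage(disk_path: str) -> int:
    """Total size in bytes of all regular files under disk_path."""
    return sum(p.stat().st_size for p in Path(disk_path).rglob("*") if p.is_file())


def remove_kvcache_files(disk_path: str, dry_run: bool = True) -> list[Path]:
    """List every regular file under disk_path and, if dry_run is False, delete them.

    Assumption: everything under disk_path belongs to the KV cache.
    Directories are left in place; only files are removed.
    """
    files = [p for p in Path(disk_path).rglob("*") if p.is_file()]
    if not dry_run:
        for p in files:
            p.unlink()
    return files
```

A cautious workflow is to call `remove_kvcache_files("/mnt/data/kvc")` first, inspect the returned list, and only then repeat the call with `dry_run=False`.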
