FlexKV and Dynamo Integration Guide

This document shows how to integrate FlexKV with NVIDIA's Dynamo framework and how to benchmark the resulting setup.

Dynamo is a framework designed by NVIDIA for large-scale distributed deployment. It supports multiple backend engines, including TensorRT-LLM, vLLM, and SGLang. The KV Router is Dynamo's request routing component: it tracks the KV caches stored on each worker and assigns every request to the most suitable worker, based on how much of the request overlaps with that worker's KV cache and on the worker's current load. This reduces expensive KV cache recomputation and improves inference efficiency. This document also explains how to integrate FlexKV into Dynamo when the KV Router is enabled.
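
As a rough illustration of the routing idea (this is not Dynamo's actual cost function), the toy sketch below scores each worker by the number of cached blocks that match the request, minus the number of requests it is already serving, and picks the highest score. The `pick_worker` name, the input format, and the weighting are assumptions made only for this example.

```shell
# Toy worker selection, NOT Dynamo's real algorithm.
# Reads lines of the form "<worker> <matching_cached_blocks> <active_requests>"
# and prints the worker with the highest score:
#   score = overlap_weight * matching_cached_blocks - active_requests
# The first worker wins ties.
pick_worker() {
    awk -v w="${1:-1}" '
        {
            score = w * $2 - $3
            if (NR == 1 || score > best) { best = score; name = $1 }
        }
        END { print name }
    '
}

# worker-0 has the largest overlap but is busy; worker-1 scores highest here.
printf '%s\n' "worker-0 8 6" "worker-1 5 1" "worker-2 0 0" | pick_worker 1
```

Raising the overlap weight (the first argument) makes the toy router prefer cache hits over an idle worker.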

Caution

This feature conflicts with namespace isolation.
This feature is not intended to be used together with distributed KV cache reuse.

1. Environment Setup

Install vLLM

Refer to the vLLM adaptation README.

Install Dynamo

# 1. Clone Dynamo repo
git clone https://github.com/ai-dynamo/dynamo.git
cd dynamo
export DYNAMO_WORKSPACE=$(pwd) # Repo root, used by step 4

# 2. Apply PR #5858
gh pr checkout 5858 # Make sure GitHub CLI is installed first

# 3. Install NIXL
uv pip install 'nixl[cu12]' # Or 'nixl[cu13]'

# 4. Install Dynamo
cd $DYNAMO_WORKSPACE/lib/bindings/python
maturin develop --uv
cd ../../..
uv pip install -e . # No need to specify backend as vLLM is already installed

# 5. Install nats-server and etcd
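
Step 5 depends on your platform's package manager, so no command is given here. Whichever way you install them, a quick check that both executables are on `PATH` can save a confusing failure later. The `check_tools` helper below is a hypothetical name used for this sketch.

```shell
# Report any required executables that are missing from PATH.
# Returns non-zero if at least one is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

check_tools nats-server etcd || echo "install the missing tools before continuing"
```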

FlexKV Code Preparation

git clone https://github.com/taco-project/FlexKV

Install FlexKV

apt update && apt install liburing-dev

cd FlexKV && ./build.sh

Refer to the GPUDirect Storage (GDS) README to enable GDS.
Refer to the KV cache reuse README to enable KV cache sharing between peer nodes in a distributed setup.

FlexKV Verification

Please refer to the test scripts in vLLM online serving.

2. Starting and Verifying Dynamo Services

Starting Dynamo + FlexKV

The following example starts 4 Dynamo vLLM workers on an 8-GPU compute node with KV router enabled.
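
Before launching anything, it can help to print the per-worker layout that the launch script derives from the worker index: two consecutive GPUs, a KV-events endpoint on port 2008N, and a FlexKV IPC address. The sketch below is a dry run only and starts no processes; `worker_plan` is a hypothetical helper name.

```shell
# Print the GPU pair, KV-events endpoint, and FlexKV IPC address for worker $1.
# Mirrors the arithmetic used in the launch loop; nothing is started.
worker_plan() {
    printf 'worker=%s gpus=%s,%s kv_endpoint=tcp://*:2008%s flexkv_port=ipc:///tmp/flexkv_server_%s\n' \
        "$1" "$(($1 * 2))" "$(($1 * 2 + 1))" "$1" "$1"
}

NUM_WORKERS=4
for w in $(seq 0 $((NUM_WORKERS - 1))); do
    worker_plan "$w"
done
```

Note that the `2008${i}` port pattern only yields distinct valid ports for worker indices 0 through 9.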

# Start NATS with JetStream
nats-server -js -a 127.0.0.1 -p 4222 --store_dir $NATS_DIR &

# Start etcd
etcd --data-dir /tmp/etcd \
  --listen-client-urls http://127.0.0.1:2379 \
  --advertise-client-urls http://YOUR_IP:2379 & # YOUR_IP is the IP address of this node.

sleep 3

export NATS_SERVER="nats://127.0.0.1:4222"
export ETCD_ENDPOINTS="http://127.0.0.1:2379"

# Start Dynamo frontend
python -m dynamo.frontend --router-mode kv &

# Define number of worker nodes
NUM_WORKERS=4

# Enable collecting KV events in FlexKV
export DYNAMO_USE_FLEXKV=1
# Configure FlexKV using environment variables, disabling config file
unset FLEXKV_CONFIG_PATH
# Adjust CPU and SSD space sizes according to your server configuration
export FLEXKV_CPU_CACHE_GB=32
export FLEXKV_SSD_CACHE_GB=128
# Launch one worker using the variables set in the loop below.
# When using multiple workers, each needs a distinct FlexKV port
# (FLEXKV_SERVER_RECV_PORT); otherwise workers hang at FlexKV init.
start_worker() {
    FLEXKV_SSD_CACHE_DIR="/data/flexkv_ssd/worker_${i}" \
    FLEXKV_SERVER_RECV_PORT="ipc:///tmp/flexkv_server_${i}" \
    CUDA_VISIBLE_DEVICES=${GPU_START},${GPU_END} \
    python3 -m dynamo.vllm \
    --model "$YOUR_MODEL" \
    --tensor-parallel-size 2 \
    --connector flexkv \
    --kv-events-config "$KV_EVENTS_CONFIG"
}

# Use a loop to start worker nodes
for i in $(seq 0 $((NUM_WORKERS-1))); do
    # Calculate GPU device IDs (two consecutive GPUs per worker)
    GPU_START=$((i*2))
    GPU_END=$((i*2+1))

    # Export these before launching: a variable assigned as a command prefix is
    # not yet set when "$KV_EVENTS_CONFIG" is expanded in that command's arguments
    export KV_ENDPOINT="tcp://*:2008${i}"
    export KV_EVENTS_CONFIG="$(printf '{"publisher":"zmq","topic":"kv-events","endpoint":"%s","enable_kv_cache_events":true}' "$KV_ENDPOINT")"

    if [ "$i" -lt $((NUM_WORKERS-1)) ]; then
        start_worker & # Run in the background
    else
        start_worker # Keep the last worker in the foreground
    fi
done
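
The fixed `sleep 3` in the script above is a simple way to give NATS and etcd time to start. A slightly more robust option is to poll until a readiness check succeeds. The `wait_until` helper below is a hypothetical sketch; the usage comment assumes `curl` is installed and relies on the `/health` endpoint that etcd serves on its client URL.

```shell
# Retry a command once per second until it succeeds.
# Usage: wait_until <timeout_seconds> <command> [args...]
# Returns 0 as soon as the command succeeds, 1 if the timeout expires first.
wait_until() {
    timeout_s="$1"
    shift
    elapsed=0
    until "$@" >/dev/null 2>&1; do
        [ "$elapsed" -ge "$timeout_s" ] && return 1
        sleep 1
        elapsed=$((elapsed + 1))
    done
}

# Example (run after starting etcd, in place of `sleep 3`):
#   wait_until 30 curl -sf http://127.0.0.1:2379/health || echo "etcd not ready"
```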

Note

You can also configure FlexKV using YAML or JSON files. The configuration above is only a simple example. For the full list of parameters, refer to docs/flexkv_config_reference/README_en.md.

Verification

You can verify that the Dynamo service has started correctly with the following command. Replace YOUR_MODEL with the name of the model you are serving:

curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "YOUR_MODEL",
    "messages": [
    {
        "role": "user",
        "content": "Tell me a joke."
    }
    ],
    "stream":false,
    "max_tokens": 30
  }'
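
Because the request body is wrapped in single quotes, shell variables are not expanded inside it, so the model name has to be typed in by hand. One way around this is to build the body with `printf`. The `chat_payload` helper is a hypothetical name, and the sketch assumes the model name contains no double quotes or backslashes.

```shell
# Build the chat completion request body with the model name substituted in.
# Assumes the model name needs no JSON escaping.
chat_payload() {
    printf '{"model": "%s", "messages": [{"role": "user", "content": "Tell me a joke."}], "stream": false, "max_tokens": 30}' "$1"
}

# Print the body for inspection.
chat_payload "my-model"
echo

# Usage against a running service (YOUR_MODEL must be set in the shell):
#   curl -s localhost:8000/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -d "$(chat_payload "$YOUR_MODEL")"
```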

3. Benchmark

We use aiperf as our benchmarking tool and the Mooncake trace as our dataset to evaluate the performance of Dynamo + FlexKV.

The Mooncake trace is an open-source request trace stored in JSONL format. Each record holds the request's arrival timestamp, input sequence length (ISL), output sequence length (OSL), and KV cache-related hash IDs. The trace contains 23,608 requests over a 1-hour period. For our experiment with 4 LLaMA-70B workers, the load in the full trace was too high, so we sampled every 6th request to build our benchmark dataset.
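
The sampling step can be sketched with `awk`. This is one possible approach, not necessarily the exact script used for the results here: it keeps lines 1, 7, 13, and so on, and leaves timestamps untouched. The `sample_trace` name and the starting offset are assumptions.

```shell
# Keep every n-th line of a JSONL trace, starting with the first line.
# Usage: sample_trace <n> <trace_file>
sample_trace() {
    awk -v n="$1" '(NR - 1) % n == 0' "$2"
}

# Demo on a stand-in file with 12 numbered lines instead of real trace records.
tmp_trace="$(mktemp)"
seq 1 12 > "$tmp_trace"
sample_trace 6 "$tmp_trace"   # prints 1 and 7
rm -f "$tmp_trace"

# Usage on a real trace (file names are placeholders):
#   sample_trace 6 mooncake_trace.jsonl > sampled_trace.jsonl
```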

aiperf can send requests according to the timestamps in the trace file and calculate metrics such as TTFT (Time To First Token) and TPOT (Time Per Output Token) for the LLM service. The command is as follows:

aiperf profile \
  --model $YOUR_MODEL \
  --tokenizer $YOUR_TOKENIZER \
  --endpoint-type 'chat' \
  --endpoint '/v1/chat/completions' \
  --streaming \
  --url http://localhost:8000 \
  --input-file $YOUR_TRACE \
  --random-seed 100 \
  -H 'Authorization: Bearer NOT USED'
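
For reading the results, it may help to see how the two latency metrics relate. Under one common definition, TPOT is the time spent after the first token divided by the number of remaining output tokens. aiperf's own metric definitions may differ in detail, so treat the `tpot_ms` helper below as an illustrative sketch only.

```shell
# TPOT under a common definition:
#   (end-to-end latency - TTFT) / (output tokens - 1)
# Usage: tpot_ms <end_to_end_latency_ms> <ttft_ms> <output_tokens>
tpot_ms() {
    awk -v e2e="$1" -v ttft="$2" -v n="$3" 'BEGIN {
        if (n < 2) { print "n/a"; exit }   # undefined for a single output token
        printf "%.2f\n", (e2e - ttft) / (n - 1)
    }'
}

# A request that took 1000 ms end to end, with a 200 ms TTFT and 9 output tokens:
tpot_ms 1000 200 9   # prints 100.00
```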
