Commit 6a5c7dd

add test scripts to examples

1 parent 0c74189 commit 6a5c7dd
File tree

6 files changed: +537 -0 lines changed

examples/lmcache/hpu/README.md

Lines changed: 74 additions & 0 deletions
@@ -0,0 +1,74 @@
# LMCache Examples

Please note: HPU integration for LMCache will be upstreamed; once that lands, the following test cases can be used.

This folder demonstrates how to use LMCache for disaggregated prefilling, CPU offloading, and KV cache sharing.

## 1. Disaggregated Prefill in vLLM v1

This example demonstrates how to run LMCache with disaggregated prefill using an lm or redis remote server on a single node.
### Prerequisites

- At least 2 HPU cards
- A valid Hugging Face token (`HF_TOKEN`) for Llama 3.1 8B Instruct, exported in your environment (see the sketch after this list)
- https://github.com/LMCache/LMCache/pull/1066 is needed for LMCache
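As a sketch of the setup that the helper checks in `disagg_example.sh` expect (a token starting with `hf_`, and at least two HPUs visible to `hl-smi`), you might do something like:

```bash
# Hypothetical environment setup; replace the placeholder token with your own.
export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxx   # must start with "hf_" to pass the script's check

# Count visible HPU cards the same way disagg_example.sh does.
num_hpus=$(hl-smi --query-gpu=name --format=csv,noheader | wc -l)
echo "Found ${num_hpus} HPU(s); at least 2 are required for disaggregated prefill."
```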
### Usage

Run `cd disagg_prefill_lmcache_v1` to enter the `disagg_prefill_lmcache_v1` folder, and then run

```bash
PT_HPU_GPU_MIGRATION=1 VLLM_USE_V1=1 VLLM_SKIP_WARMUP=True PT_HPU_ENABLE_LAZY_COLLECTIVES=true bash disagg_example.sh
```

to run disaggregated prefill and benchmark the performance.

The lm server is the default remote server; the remote server type, tensor_parallel_size, and model name are all configurable.

Example: redis server, tensor_parallel_size 4, and the Llama-3.1-70B-Instruct model

```bash
PT_HPU_GPU_MIGRATION=1 VLLM_USE_V1=1 VLLM_SKIP_WARMUP=True PT_HPU_ENABLE_LAZY_COLLECTIVES=true bash disagg_example.sh -s redis -t 4 -m meta-llama/Llama-3.1-70B-Instruct
```
### Components

#### Server Scripts
- `disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh` - Launches the individual vLLM servers for prefill and decode
- `../disagg_prefill_lmcache_v1/disagg_proxy_server.py` - FastAPI proxy server that coordinates between the prefiller and the decoder (see the example request below)
- `disagg_prefill_lmcache_v1/disagg_example.sh` - Main script that runs the example through the lm/redis remote server and also launches the proxy server
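Once `disagg_example.sh` has started the prefiller (port 1100), the decoder (port 1200), and the proxy (port 1000), you can also send a request to the proxy by hand. A minimal sketch, assuming the OpenAI-compatible `/v1/completions` endpoint that the scripts poll and the default model name:

```bash
# Hypothetical manual request against the proxy started by disagg_example.sh (port 1000).
curl -s http://localhost:1000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "prompt": "Explain disaggregated prefill in one sentence.",
        "max_tokens": 64
      }'
```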
#### Configuration
- `disagg_prefill_lmcache_v1/configs/lmcache-config-lm.yaml` - Configuration for the prefiller/decoder servers when using the lm server
- `disagg_prefill_lmcache_v1/configs/lmcache-config-redis.yaml` - Configuration for the prefiller/decoder servers when using the redis server

#### Log Files
The main script generates several log files (you can follow them as shown below):
- `prefiller.log` - Logs from the prefill server
- `decoder.log` - Logs from the decode server
- `proxy.log` - Logs from the proxy server
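For example, to watch all three servers come up while the example runs, follow the logs from another terminal:

```bash
# Follow the prefiller, decoder, and proxy logs side by side.
tail -f prefiller.log decoder.log proxy.log
```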
## 2. KV Cache Sharing

The `kv_cache_sharing_lmcache_v1.py` example demonstrates how to share KV caches between vLLM v1 instances.

### Usage

```bash
PT_HPU_GPU_MIGRATION=1 VLLM_USE_V1=1 VLLM_SKIP_WARMUP=True PT_HPU_ENABLE_LAZY_COLLECTIVES=true python kv_cache_sharing_lmcache_v1.py
```

The lm server is the default remote server; the remote server type and tensor_parallel_size are both configurable.

Example 1: redis server with port 6380

```bash
PT_HPU_GPU_MIGRATION=1 VLLM_USE_V1=1 VLLM_SKIP_WARMUP=True PT_HPU_ENABLE_LAZY_COLLECTIVES=true python kv_cache_sharing_lmcache_v1.py --remote_server redis --redis_port 6380
```

Example 2: lm server with port 8108 and tensor_parallel_size 2

```bash
PT_HPU_GPU_MIGRATION=1 VLLM_USE_V1=1 VLLM_SKIP_WARMUP=True PT_HPU_ENABLE_LAZY_COLLECTIVES=true python kv_cache_sharing_lmcache_v1.py --lm_port 8108 --tp_size 2
```
examples/lmcache/hpu/disagg_prefill_lmcache_v1/configs/lmcache-config-lm.yaml

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
chunk_size: 256                    # tokens per KV cache chunk
local_cpu: False                   # disable local CPU offloading
max_local_cpu_size: 5.0            # max local CPU cache size in GB (unused while local_cpu is False)
#local_disk:
max_local_disk_size: 0             # no local disk cache
remote_serde: naive                # serializer for KV chunks sent to the remote backend
remote_url: "lm://localhost:8100"  # LMCache lm server started by disagg_example.sh
examples/lmcache/hpu/disagg_prefill_lmcache_v1/configs/lmcache-config-redis.yaml

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
chunk_size: 256                        # tokens per KV cache chunk
local_cpu: False                       # disable local CPU offloading
max_local_cpu_size: 5.0                # max local CPU cache size in GB (unused while local_cpu is False)
#local_disk:
max_local_disk_size: 0                 # no local disk cache
remote_serde: naive                    # serializer for KV chunks sent to the remote backend
remote_url: "redis://localhost:6379"   # redis server started by disagg_example.sh
examples/lmcache/hpu/disagg_prefill_lmcache_v1/disagg_example.sh

Lines changed: 178 additions & 0 deletions
@@ -0,0 +1,178 @@
#!/bin/bash

echo "Warning: LMCache disaggregated prefill support for vLLM v1 is experimental and subject to change."

usage() {
    echo ""
    echo "Runs the LMCache disaggregated prefill example (prefiller, decoder, and proxy) using vLLM"
    echo
    echo "usage: ${0} <options>"
    echo
    echo " -s - remote_server (redis/lm). default:lm"
    echo " -t - tensor parallel size. default:1"
    echo " -m - model. default:meta-llama/Llama-3.1-8B-Instruct"
    echo
}
PIDS=()

# Switch to the directory of the current script
cd "$(dirname "${BASH_SOURCE[0]}")"

check_hf_token() {
    if [ -z "$HF_TOKEN" ]; then
        echo "HF_TOKEN is not set. Please set it to your Hugging Face token."
        exit 1
    fi
    if [[ "$HF_TOKEN" != hf_* ]]; then
        echo "HF_TOKEN is not a valid Hugging Face token. Please set it to your Hugging Face token."
        exit 1
    fi
    echo "HF_TOKEN is set and valid."
}
check_num_gpus() {
    # Check that at least 2 HPU devices are visible via hl-smi
    num_gpus=$(hl-smi --query-gpu=name --format=csv,noheader | wc -l)
    if [ "$num_gpus" -lt 2 ]; then
        echo "You need at least 2 HPUs to run disaggregated prefill."
        exit 1
    else
        echo "Found $num_gpus HPUs."
    fi
}
ensure_python_library_installed() {
    echo "Checking if $1 is installed..."
    python -c "import $1" > /dev/null 2>&1
    if [ $? -ne 0 ]; then
        if [ "$1" == "nixl" ]; then
            echo "$1 is not installed. Please refer to https://github.com/ai-dynamo/nixl for installation."
        else
            echo "$1 is not installed. Please install it via pip install $1."
        fi
        exit 1
    else
        echo "$1 is installed."
    fi
}

cleanup() {
    echo "Stopping everything…"
    trap - INT TERM        # prevent re-entrancy
    kill -- -$$            # negative PID == "this whole process-group"
    wait                   # reap children so we don't leave zombies
    exit 0
}
wait_for_server() {
    local port=$1
    local timeout_seconds=1200
    local start_time=$(date +%s)

    echo "Waiting for server on port $port..."

    while true; do
        if curl -s "localhost:${port}/v1/completions" > /dev/null; then
            return 0
        fi

        local now=$(date +%s)
        if (( now - start_time >= timeout_seconds )); then
            echo "Timeout waiting for server"
            return 1
        fi

        sleep 1
    done
}

SERVER="lm"
TP_SIZE=1
MODEL="llama3.1/Meta-Llama-3.1-8B-Instruct"

main() {
    while [[ "$#" -gt 0 ]]; do
        case $1 in
            -s) SERVER="$2"; shift ;;
            -t) TP_SIZE="$2"; shift ;;
            -m) MODEL="$2"; shift ;;
            *) echo "Unknown parameter passed: $1"; exit 1 ;;
        esac
        shift
    done

    echo "server: $SERVER"
    echo "tensor parallel size: $TP_SIZE"
    echo "model: $MODEL"

    #check_hf_token
    check_num_gpus
    ensure_python_library_installed lmcache
    ensure_python_library_installed pandas
    ensure_python_library_installed datasets
    ensure_python_library_installed vllm

    trap cleanup INT
    trap cleanup USR1
    trap cleanup TERM

    echo "Launching prefiller, decoder and proxy..."
    echo "Please check prefiller.log, decoder.log and proxy.log for logs."

    if [[ $SERVER == "lm" ]]; then
        echo "starting lmcache "
        python -m lmcache.v1.server localhost 8100 2>&1 &
    elif [[ $SERVER == "redis" ]]; then
        echo "starting redis-server "
        redis-server --port 6379 &
    else
        echo "Invalid server: $SERVER"
        exit 1
    fi

    echo "start prefiller "
    bash disagg_vllm_launcher.sh prefiller $SERVER $TP_SIZE $MODEL \
        > >(tee prefiller.log) 2>&1 &
    prefiller_pid=$!
    PIDS+=($prefiller_pid)

    echo "start decoder "
    bash disagg_vllm_launcher.sh decoder $SERVER $TP_SIZE $MODEL \
        > >(tee decoder.log) 2>&1 &
    decoder_pid=$!
    PIDS+=($decoder_pid)

    python3 ../../disagg_prefill_lmcache_v1/disagg_proxy_server.py \
        --host localhost \
        --port 1000 \
        --prefiller-host localhost \
        --prefiller-port 1100 \
        --decoder-host localhost \
        --decoder-port 1200 \
        > >(tee proxy.log) 2>&1 &
    proxy_pid=$!
    PIDS+=($proxy_pid)

    wait_for_server 1100
    wait_for_server 1200
    wait_for_server 1000

    echo "All servers are up. Starting benchmark..."

    # begin benchmark
    cd ../../../../../benchmarks/
    python benchmark_serving.py --port 1000 --seed 12345 \
        --model $MODEL \
        --dataset-name random --random-input-len 8000 --random-output-len 200 \
        --num-prompts 200 --burstiness 100 --request-rate 3.6 | tee benchmark.log

    echo "Benchmarking done. Cleaning up..."

    cleanup
}

main "$@"
examples/lmcache/hpu/disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh

Lines changed: 82 additions & 0 deletions
@@ -0,0 +1,82 @@
#!/bin/bash

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

if [[ $# -lt 1 ]]; then
    echo "Usage: $0 <prefiller | decoder> [server] [tp] [model]"
    exit 1
fi

SERVER="lm"
TP_SIZE=1
MODEL="llama3.1/Meta-Llama-3.1-8B-Instruct"

if [[ $# -eq 1 ]]; then
    echo "Using default server: $SERVER"
    echo "Using default tp: $TP_SIZE"
    echo "Using default model: $MODEL"
else
    SERVER=$2
    TP_SIZE=$3
    MODEL=$4
    echo "Using server: $SERVER"
    echo "Using tp: $TP_SIZE"
    echo "Using model: $MODEL"
fi


if [[ $1 == "prefiller" ]]; then
    if [[ $SERVER == "lm" ]]; then
        # Remote lm server backend (remote_url lm://localhost:8100)
        prefill_config_file=$SCRIPT_DIR/configs/lmcache-config-lm.yaml
    elif [[ $SERVER == "redis" ]]; then
        # Remote redis backend (remote_url redis://localhost:6379)
        prefill_config_file=$SCRIPT_DIR/configs/lmcache-config-redis.yaml
    else
        echo "Invalid server: $SERVER"
        exit 1
    fi

    #UCX_TLS=tcp \
    LMCACHE_CONFIG_FILE=$prefill_config_file \
    VLLM_ENABLE_V1_MULTIPROCESSING=1 \
    VLLM_WORKER_MULTIPROC_METHOD=spawn \
    RANK=0 \
    vllm serve $MODEL \
        --port 1100 \
        --disable-log-requests \
        --tensor_parallel_size $TP_SIZE \
        --kv-transfer-config \
        '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_producer","kv_connector_extra_config": {"discard_partial_chunks": false, "lmcache_rpc_port": "producer1"}}'


elif [[ $1 == "decoder" ]]; then
    if [[ $SERVER == "lm" ]]; then
        # Remote lm server backend (remote_url lm://localhost:8100)
        decode_config_file=$SCRIPT_DIR/configs/lmcache-config-lm.yaml
    elif [[ $SERVER == "redis" ]]; then
        # Remote redis backend (remote_url redis://localhost:6379)
        decode_config_file=$SCRIPT_DIR/configs/lmcache-config-redis.yaml
    else
        echo "Invalid server: $SERVER"
        exit 1
    fi

    #UCX_TLS=tcp \
    LMCACHE_CONFIG_FILE=$decode_config_file \
    VLLM_ENABLE_V1_MULTIPROCESSING=1 \
    VLLM_WORKER_MULTIPROC_METHOD=spawn \
    RANK=1 \
    vllm serve $MODEL \
        --port 1200 \
        --disable-log-requests \
        --tensor_parallel_size $TP_SIZE \
        --kv-transfer-config \
        '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_consumer","kv_connector_extra_config": {"discard_partial_chunks": false, "lmcache_rpc_port": "consumer1"}}'


else
    echo "Invalid role: $1"
    echo "Role should be either prefiller or decoder"
    exit 1
fi
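For reference, `disagg_example.sh` drives this launcher as `bash disagg_vllm_launcher.sh <role> <server> <tp> <model>`. A minimal sketch of running the two roles by hand from the `disagg_prefill_lmcache_v1` folder, mirroring that invocation (default lm backend and the README's model name assumed):

```bash
# Hypothetical manual launch, mirroring how disagg_example.sh calls this script.
bash disagg_vllm_launcher.sh prefiller lm 1 meta-llama/Llama-3.1-8B-Instruct > >(tee prefiller.log) 2>&1 &
bash disagg_vllm_launcher.sh decoder   lm 1 meta-llama/Llama-3.1-8B-Instruct > >(tee decoder.log) 2>&1 &
```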
