
Commit 0101665

[docs]add doc for pipeline store (#612)
1 parent 6556bfd commit 0101665

5 files changed: 287 additions & 17 deletions


docs/source/getting-started/quickstart_vllm.md

Lines changed: 2 additions & 0 deletions
@@ -91,6 +91,8 @@ Download the pre-built `vllm/vllm-openai:v0.9.2` docker image and build unified-
export PLATFORM=cuda
pip install uc-manager
```
+> **Note:** If installing via `pip install`, you need to manually add the `config.yaml` file, similar to `unified-cache-management/examples/ucm_config_example.yaml`, because PyPI packages do not include YAML files.
+

## Step 2: Configuration

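One minimal way to satisfy the note above, assuming the repository has been cloned next to your working directory (both paths here are illustrative, not mandated by UCM):

```bash
# Copy the bundled example configuration and edit it for your environment
cp unified-cache-management/examples/ucm_config_example.yaml ./config.yaml
```
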

docs/source/getting-started/quickstart_vllm_ascend.md

Lines changed: 2 additions & 0 deletions
@@ -30,6 +30,8 @@ Install by pip or find the pre-build wheels on [Pypi](https://pypi.org/project/u
export PLATFORM=ascend
pip install uc-manager
```
+> **Note:** If installing via `pip install`, you need to manually add the `config.yaml` file, similar to `unified-cache-management/examples/ucm_config_example.yaml`, because PyPI packages do not include YAML files.
+

### Option 3: Setup from docker
Download the pre-built `vllm-ascend` docker image and build unified-cache-management docker image by commands below:

docs/source/user-guide/prefix-cache/index.md

Lines changed: 1 addition & 0 deletions
@@ -80,4 +80,5 @@ performance.
:::{toctree}
:maxdepth: 1
nfs_store
+pipeline_store
:::

docs/source/user-guide/prefix-cache/nfs_store.md

Lines changed: 33 additions & 17 deletions
@@ -86,26 +86,40 @@ ucm_connectors:
  - ucm_connector_name: "UcmNfsStore"
    ucm_connector_config:
      storage_backends: "/mnt/test"
-      use_direct: false
+      io_direct: false

load_only_first_rank: false
```
+### Required Parameters

-Explanation:
-
-* ucm_connector_name: "UcmNfsStore":
+* **ucm_connector_name**:
  Specifies `UcmNfsStore` as the UCM connector.

-* storage_backends:
-  Specify the directory used for storing KV blocks. It can be a local directory or an NFS-mounted path. UCM will store KV blocks here.
-  **⚠️ Make sure to replace `"/mnt/test"` with your actual storage directory.**
+* **storage_backends**:
+  Directory used for storing KV blocks. Can be a local path or an NFS-mounted path.
+  **⚠️ Replace `"/mnt/test"` with your actual storage directory.**
+
+### Optional Parameters
+
+* **io_direct** *(optional, default: `false`)*
+  Whether to enable direct I/O.
+
+* **stream_number** *(optional, default: 8)*
+  Number of concurrent streams used for data transfer.
+
+* **timeout_ms** *(optional, default: 30000)*
+  Timeout in milliseconds for external interfaces.
+
+* **buffer_number** *(optional, default: 4096)*
+  Number of intermediate buffers for data transfer.
+
+* **shard_data_dir** *(optional, default: true)*
+  Whether files are spread across subdirectories or stored in a single directory.

-* use_direct:
-  Whether to enable direct I/O (optional). Default is `false`.
+### Must-be-Set Parameters

-* load_only_first_rank:
-  Controls whether only rank 0 loads KV cache and broadcasts it to other ranks.
-  This feature is currently not supported on Ascend, so it must be set to `false` (all ranks load/dump independently).
+* **load_only_first_rank** (must be `false`):
+  This feature is currently disabled.

## Launching Inference

@@ -185,7 +199,7 @@ Running the same benchmark again produces:

```
---------------Time to First Token----------------
-Mean TTFT (ms): 1920.68
+Mean TTFT (ms): 3183.97
```

The vLLM server logs now contain similar entries:
@@ -194,16 +208,18 @@ The vLLM server logs now contain similar entries:
INFO ucm_connector.py:228: request_id: xxx, total_blocks_num: 125, hit hbm: 0, hit external: 125
```

-This indicates that during the second request, UCM successfully retrieved all 125 cached KV blocks from the storage backend. Leveraging the fully cached prefix significantly reduces the initial latency observed by the model, yielding an approximate **8× improvement in TTFT** compared to the initial run.
+This indicates that during the second request, UCM successfully retrieved all 125 cached KV blocks from the storage backend. Leveraging the fully cached prefix significantly reduces the initial latency observed by the model, yielding an approximate **5× improvement in TTFT** compared to the initial run.

### Log Message Structure
+> If you want to view detailed transfer information, set the environment variable `UC_LOGGER_LEVEL` to `debug`.
```text
-[UCMNFSSTORE] [I] Task(<task_id>,<direction>,<task_count>,<size>) finished, elapsed <time>s
+[UC][D] Task(<task_id>,<direction>,<task_count>,<size>) finished, costs=<time>s, bw={speed}GB/s
```
| Component | Description |
|--------------|-----------------------------------------------------------------------------|
| `task_id` | Unique identifier for the task |
-| `direction` | `D2S`: Dump to Storage (Device → SSD)<br>`S2D`: Load from Storage (SSD → Device) |
+| `direction` | `PC::D2S`: Dump to Storage (Device → SSD)<br>`PC::S2D`: Load from Storage (SSD → Device) |
| `task_count` | Number of tasks executed in this operation |
| `size` | Total size of data transferred in bytes (across all tasks) |
-| `time` | Time taken for the complete operation in seconds |
+| `time` | Time taken for the complete operation in seconds |
+| `speed` | Task transfer speed between Device and Storage |
docs/source/user-guide/prefix-cache/pipeline_store.md

Lines changed: 249 additions & 0 deletions
@@ -0,0 +1,249 @@
# PipelineStore

**PipelineStore** is a composite store built by **chaining multiple Store implementations** together to form a data transfer pipeline.

Currently, the pipeline supports a chain composed of **Cache Store** and **Posix Store**.

In this chained pipeline:
- **Cache Store** handles data transfer between the **Device and Host**.
- Once the data flows from the Device to the Host, **Posix Store** is responsible for transferring the data between the **Host and POSIX-compliant persistent storage**, such as local disks, SSDs, or remote NFS (including NFS over RDMA) mount points.

At present, only this Store chain is supported.
Additional Store implementations will be developed in the future and **chained** into the pipeline to enable more flexible and extensible transfer paths.

## Performance

### Overview

The following are multi-concurrency performance test results for UCM in the Prefix Cache scenario under a CUDA environment, showing the performance improvements UCM delivers.
During the tests, the HBM cache was disabled, and the KV cache was retrieved and matched only from SSD.

Here, Full Compute refers to pure vLLM inference, while SSD80% indicates that, after UCM pooling, the SSD hit rate of the KV cache is 80%.

The following table shows the results on the QwQ-32B model (**4 × H100 GPUs**):

| **Input length** | **Concurrent** | **Full Compute (ms)** | **SSD80% (ms)** | **Speedup (%)** |
| ---------------: | -------------: | --------------------: | --------------: | :-------------- |
| 4 000 | 1 | 223.05 | 156.54 | **+42.5%** |
| 8 000 | 1 | 350.47 | 228.27 | **+53.5%** |
| 16 000 | 1 | 708.94 | 349.17 | **+103.0%** |
| 32 000 | 1 | 1512.04 | 635.18 | **+138.0%** |
| 4 000 | 8 | 908.52 | 625.92 | **+45.1%** |
| 8 000 | 8 | 1578.72 | 955.25 | **+65.3%** |
| 16 000 | 8 | 3139.03 | 1647.72 | **+90.5%** |
| 32 000 | 8 | 6735.25 | 3025.23 | **+122.6%** |
| 4 000 | 16 | 1509.79 | 919.53 | **+64.2%** |
| 8 000 | 16 | 2602.34 | 1480.30 | **+75.8%** |
| 16 000 | 16 | 5732.49 | 2393.54 | **+139.5%** |
| 32 000 | 16 | 11891.61 | 4790.00 | **+148.3%** |
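
For reference, the **Speedup** column is consistent with the ratio of the two latency columns; taking the first row of the QwQ-32B table as a worked check:

```math
\text{Speedup} = \left(\frac{\text{Full Compute}}{\text{SSD80\%}} - 1\right) \times 100\% = \left(\frac{223.05}{156.54} - 1\right) \times 100\% \approx +42.5\%
```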

The following table shows the results on the DeepSeek-R1-awq model (**8 × H100 GPUs**):

| **Input length** | **Concurrent** | **Full Compute (ms)** | **SSD80% (ms)** | **Speedup (%)** |
| ---------------: | -------------: | --------------------: | --------------: | :-------------- |
| 4 000 | 1 | 429.30 | 261.34 | **+64.3%** |
| 8 000 | 1 | 762.23 | 363.37 | **+109.8%** |
| 16 000 | 1 | 1426.06 | 586.17 | **+143.3%** |
| 32 000 | 1 | 3086.85 | 1073.25 | **+187.6%** |
| 4 000 | 8 | 1823.55 | 1017.72 | **+79.2%** |
| 8 000 | 8 | 3214.76 | 1511.16 | **+112.7%** |
| 16 000 | 8 | 6417.81 | 2596.70 | **+147.2%** |
| 32 000 | 8 | 14278.00 | 5111.67 | **+179.3%** |
| 4 000 | 16 | 3205.22 | 1534.00 | **+108.9%** |
| 8 000 | 16 | 5813.09 | 2208.60 | **+163.2%** |
| 16 000 | 16 | 11752.48 | 4000.46 | **+193.8%** |
| 32 000 | 16 | 38643.73 | 19910.41 | **+94.1%** |

## Configuration for Prefix Caching

Modify the UCM configuration file to specify which UCM connector to use and where KV blocks should be stored.
You may directly edit the example file at:

`unified-cache-management/examples/ucm_config_example.yaml`

A minimal configuration looks like this:

```yaml
ucm_connectors:
  - ucm_connector_name: "UcmPipelineStore"
    ucm_connector_config:
      store_pipeline: "Cache|Posix"
      storage_backends: "/mnt/test"

load_only_first_rank: false
```

### Required Parameters

* **ucm_connector_name**:
  Specifies `UcmPipelineStore` as the UCM connector.

* **store_pipeline: "Cache|Posix"**
  Specifies a pipeline built by **chaining the Cache Store and the Posix Store**.
  In this chained pipeline, the Cache Store handles data transfer between the **Device and Host**,
  and once the data reaches the Host, the Posix Store transfers it between the **Host and POSIX-compliant persistent storage**.

  The pipeline must be registered in advance in `unified-cache-management/ucm/store/pipeline/connector.py` under `PIPELINE_REGISTRY`.

  Currently, **only this Store chain is supported**.

* **storage_backends**:
  Directory used for storing KV blocks. Can be a local path or an NFS-mounted path.
  **⚠️ Replace `"/mnt/test"` with your actual storage directory.**

### Optional Parameters

* **io_direct** *(optional, default: `false`)*
  Whether to enable direct I/O.

* **stream_number** *(optional, default: 8)*
  Number of concurrent streams used for data transfer.

* **waiting_queue_depth** *(optional, default: 1024)*
  Depth of the waiting queue for transfer tasks.

* **running_queue_depth** *(optional, default: 32768)*
  Depth of the running queue for transfer tasks.

* **timeout_ms** *(optional, default: 30000)*
  Timeout in milliseconds for external interfaces.

* **buffer_size** *(optional, default: 64GB)*
  Amount of DRAM pinned memory used by a single worker process.

### Must-be-Set Parameters

* **load_only_first_rank** (must be `false`):
  This feature is currently disabled.

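Putting the parameters above together, a fuller configuration might look like the sketch below. It is illustrative rather than authoritative: the optional values shown are simply the documented defaults, the storage path is a placeholder, and the assumption that each optional key sits under `ucm_connector_config` (as in the minimal example) should be checked against `ucm_config_example.yaml`.

```yaml
ucm_connectors:
  - ucm_connector_name: "UcmPipelineStore"
    ucm_connector_config:
      store_pipeline: "Cache|Posix"     # chain Cache Store (Device<->Host) with Posix Store (Host<->storage)
      storage_backends: "/mnt/test"     # replace with your actual storage directory
      io_direct: false                  # direct I/O disabled (documented default)
      stream_number: 8                  # concurrent transfer streams (documented default)
      waiting_queue_depth: 1024         # waiting queue depth (documented default)
      running_queue_depth: 32768        # running queue depth (documented default)
      timeout_ms: 30000                 # external-interface timeout in ms (documented default)
      buffer_size: 64GB                 # pinned host memory per worker; the exact value format here is an assumption

load_only_first_rank: false             # must remain false
```
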
## Launching Inference

In this guide, we describe **online inference** using vLLM with the UCM connector, deployed as an OpenAI-compatible server. For best performance with UCM, it is recommended to set `block_size` to 128.

To start the vLLM server with the Qwen/Qwen2.5-14B-Instruct model, run:

```bash
vllm serve Qwen/Qwen2.5-14B-Instruct \
  --max-model-len 20000 \
  --tensor-parallel-size 2 \
  --gpu_memory_utilization 0.87 \
  --block_size 128 \
  --trust-remote-code \
  --port 7800 \
  --enforce-eager \
  --no-enable-prefix-caching \
  --kv-transfer-config \
  '{
    "kv_connector": "UCMConnector",
    "kv_role": "kv_both",
    "kv_connector_module_path": "ucm.integration.vllm.ucm_connector",
    "kv_connector_extra_config": {"UCM_CONFIG_FILE": "/vllm-workspace/unified-cache-management/examples/ucm_config_example.yaml"}
  }'
```

**⚠️ Make sure to replace `"/vllm-workspace/unified-cache-management/examples/ucm_config_example.yaml"` with your actual config file path.**

If you see logs like the following:

```bash
INFO: Started server process [1049932]
INFO: Waiting for application startup.
INFO: Application startup complete.
```

Congratulations, you have successfully started the vLLM server with the UCM connector!

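As a quick sanity check (an optional step, assuming the standard OpenAI-compatible routes exposed by vLLM), you can query the models endpoint on the port used above:

```bash
# Should return a model list that includes Qwen/Qwen2.5-14B-Instruct
curl http://127.0.0.1:7800/v1/models
```
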
## Evaluating UCM Prefix Caching Performance
After launching the vLLM server with `UCMConnector` enabled, the easiest way to observe the prefix caching effect is to run the built-in `vllm bench` CLI. Executing the following command **twice** in a separate terminal shows the improvement clearly.

```bash
vllm bench serve \
  --backend vllm \
  --model Qwen/Qwen2.5-14B-Instruct \
  --host 127.0.0.1 \
  --port 7800 \
  --dataset-name random \
  --num-prompts 12 \
  --random-input-len 16000 \
  --random-output-len 2 \
  --request-rate inf \
  --seed 123456 \
  --percentile-metrics "ttft,tpot,itl,e2el" \
  --metric-percentiles "90,99" \
  --ignore-eos
```

### After the first execution
The `vllm bench` terminal prints the benchmark result:

```
---------------Time to First Token----------------
Mean TTFT (ms): 15001.64
```

Inspecting the vLLM server logs reveals entries like:

```
INFO ucm_connector.py:317: request_id: xxx, total_blocks_num: 125, hit hbm: 0, hit external: 0
```

This indicates that for the first inference request, UCM did not hit any cached KV blocks (the 16K-token prompt maps to 125 blocks at `block_size` 128, since 125 × 128 = 16,000 tokens). As a result, the full 16K-token prefill must be computed, leading to a relatively large TTFT.

### After the second execution
Running the same benchmark again produces:

```
---------------Time to First Token----------------
Mean TTFT (ms): 2874.21
```

The vLLM server logs now contain entries like:

```
INFO ucm_connector.py:317: request_id: xxx, total_blocks_num: 125, hit hbm: 0, hit external: 125
```

This indicates that during the second request, UCM successfully retrieved all 125 cached KV blocks from the storage backend. Leveraging the fully cached prefix significantly reduces the initial latency observed by the model, yielding an approximate **5× improvement in TTFT** (15001.64 ms → 2874.21 ms) compared to the initial run.

### Log Message Structure
> If you want to view detailed transfer information, set the environment variable `UC_LOGGER_LEVEL` to `debug`.

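For example, assuming the server's output is captured in a file (the `vllm_server.log` name and the redirection are illustrative, not something UCM creates for you), you can enable debug logging and follow just the task-level messages:

```bash
# Enable UCM debug logging before starting the vLLM server
export UC_LOGGER_LEVEL=debug

# Then, with the server output redirected to vllm_server.log,
# follow only the UCM Cache/Posix task messages
tail -f vllm_server.log | grep -E '\[UC\]\[D\] (Cache|Posix) task'
```
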
You may see log messages like the following in the server output.

```text
[UC][D] Cache task({task_id},{operation},{subtask_number},{size}) dispatching. [PID,TID]
```
This log indicates that the **Cache Store** has received a **load or dump task**.

| Component | Description |
|--------------|-----------------------------------------------------------------------------|
| `task_id` | Unique identifier for the Cache Store task |
| `operation` | `DUMP`: Dump to Host (Device → Host)<br>`LOAD`: Load from Host (Host → Device) |
| `subtask_number` | Number of subtasks executed in this operation |
| `size` | Total size of data transferred in bytes (across all subtasks) |

```text
[UC][D] Cache task({task_id},{operation},{subtask_number},{size}) finished, cost {time}ms. [PID,TID]
```
This log indicates that a load or dump task in the **Cache Store** has completed, along with its execution time in milliseconds.

```text
[UC][D] Posix task({task_id},{operation},{subtask_number},{size}) dispatching. [PID,TID]
```
This log indicates that the **Posix Store** has received a **load or dump task**.

| Component | Description |
|--------------|-----------------------------------------------------------------------------|
| `task_id` | Unique identifier for the Posix Store task |
| `operation` | `Cache2Backend`: Dump data from the Cache Store to the Posix Store.<br>`Backend2Cache`: Load data from the Posix Store back to the Cache Store. |
| `subtask_number` | Number of subtasks executed in this operation |
| `size` | Total size of data transferred in bytes (across all subtasks) |

```text
[UC][D] Posix task({task_id},{operation},{subtask_number},{size}) finished, cost {time}ms. [PID,TID]
```
This log indicates that a load or dump task in the **Posix Store** has completed, along with its execution time in milliseconds.
