Commit 54dca3d

update docs for pipeline store
1 parent 89dedc5 commit 54dca3d

2 files changed: 23 additions, 59 deletions

docs/source/user-guide/prefix-cache/nfs_store.md

Lines changed: 2 additions & 2 deletions
@@ -198,7 +198,7 @@ Running the same benchmark again produces:
 
 ```
 ---------------Time to First Token----------------
-Mean TTFT (ms): 1920.68
+Mean TTFT (ms): 3183.97
 ```
 
 The vLLM server logs now contain similar entries:
@@ -207,7 +207,7 @@ The vLLM server logs now contain similar entries:
 INFO ucm_connector.py:228: request_id: xxx, total_blocks_num: 125, hit hbm: 0, hit external: 125
 ```
 
-This indicates that during the second request, UCM successfully retrieved all 125 cached KV blocks from the storage backend. Leveraging the fully cached prefix significantly reduces the initial latency observed by the model, yielding an approximate **8× improvement in TTFT** compared to the initial run.
+This indicates that during the second request, UCM successfully retrieved all 125 cached KV blocks from the storage backend. Leveraging the fully cached prefix significantly reduces the initial latency observed by the model, yielding an approximate **5× improvement in TTFT** compared to the initial run.
 
 ### Log Message Structure
 > If you want to view detailed transfer information, set the environment variable `UC_LOGGER_LEVEL` to `debug`.
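
Reviewer note: the log line quoted in this hunk has a regular shape, so the hit counts can be extracted mechanically when checking a run. The following is a minimal sketch that assumes the field layout is exactly as in the sample line above; `parse_hit_line` and `LOG_PATTERN` are hypothetical helper names and not part of UCM or vLLM.

```python
import re

# Assumed layout, taken from the sample log line in the docs:
# "... request_id: xxx, total_blocks_num: 125, hit hbm: 0, hit external: 125"
LOG_PATTERN = re.compile(
    r"request_id: (?P<request_id>[^,\s]+), "
    r"total_blocks_num: (?P<total>\d+), "
    r"hit hbm: (?P<hbm>\d+), "
    r"hit external: (?P<external>\d+)"
)


def parse_hit_line(line):
    """Return block hit statistics for one log line, or None if it does not match."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    total = int(match["total"])
    external = int(match["external"])
    return {
        "request_id": match["request_id"],
        "total": total,
        "hbm": int(match["hbm"]),
        "external": external,
        # Fraction of blocks served from the external store; 0.0 for an empty request.
        "external_hit_rate": external / total if total else 0.0,
    }


sample = (
    "INFO ucm_connector.py:228: request_id: xxx, "
    "total_blocks_num: 125, hit hbm: 0, hit external: 125"
)
stats = parse_hit_line(sample)
print(stats)
```

For the sample line, all 125 blocks come from the external store, which matches the "retrieved all 125 cached KV blocks" statement in the changed paragraph.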

docs/source/user-guide/prefix-cache/pipline_store.md

Lines changed: 21 additions & 57 deletions
@@ -15,63 +15,27 @@ Additional Store implementations will be developed in the future and **chained**
 ## Performance
 
 ### Overview
-The following are the multi-concurrency performance test results of UCM in the Prefix Cache scenario under a CUDA environment, showing the performance improvements of UCM on two different models.
-During the tests, HBM cache was disabled, and KV Cache was retrieved and matched only from SSD.
-
-In the QwQ-32B model, the test used one H20 server with 2 GPUs.
-In the DeepSeek-V3 model, the test used two H20 servers with 16 GPUs.
+The following are the multi-concurrency performance test results of UCM in the Prefix Cache scenario under a CUDA environment, showing the performance improvements of UCM in the QwQ-32B model.
+During the tests, HBM cache was disabled, and KV Cache was retrieved and matched only from SSD. the test used **4 x H100 GPUs**.
 
 Here, Full Compute refers to pure VLLM inference, while SSD80% indicates that after UCM pooling, the SSD hit rate of the KV cache is 80%.
 
 The following table shows the results on the QwQ-32B model:
-| **QwQ-32B** | | | | |
-| ---------------: | -------------: | ------------------: | -------------: | :----------- |
-| **Input length** | **Concurrent** | **Full Compute(s)** | **SSD80%(s)** | **Speedup** |
-| 2 000 | 1 | 0.5311 | 0.2053 | **+158.7 %** |
-| 4 000 | 1 | 1.0269 | 0.3415 | **+200.7 %** |
-| 8 000 | 1 | 2.0902 | 0.6429 | **+225.1 %** |
-| 16 000 | 1 | 4.4852 | 1.3598 | **+229.8 %** |
-| 32 000 | 1 | 10.2037 | 3.0713 | **+232.2 %** |
-| 2 000 | 2 | 0.7938 | 0.3039 | **+161.2 %** |
-| 4 000 | 2 | 1.5383 | 0.4968 | **+209.6 %** |
-| 8 000 | 2 | 3.1323 | 0.9544 | **+228.2 %** |
-| 16 000 | 2 | 6.7984 | 2.0149 | **+237.4 %** |
-| 32 000 | 2 | 15.3395 | 4.5619 | **+236.3 %** |
-| 2 000 | 4 | 1.6572 | 0.5998 | **+176.3 %** |
-| 4 000 | 4 | 2.8173 | 1.2657 | **+122.6 %** |
-| 8 000 | 4 | 5.2643 | 1.9829 | **+165.5 %** |
-| 16 000 | 4 | 11.3651 | 3.9776 | **+185.7 %** |
-| 32 000 | 4 | 25.6718 | 8.2881 | **+209.7 %** |
-| 2 000 | 8 | 2.8559 | 1.2250 | **+133.1 %** |
-| 4 000 | 8 | 5.0003 | 2.0995 | **+138.2 %** |
-| 8 000 | 8 | 9.5365 | 3.6584 | **+160.7 %** |
-| 16 000 | 8 | 20.3839 | 6.8949 | **+195.6 %** |
-| 32 000 | 8 | 46.2107 | 14.8704 | **+210.8 %** |
-
-The following table shows the results on the DeepSeek-V3 model:
-| **DeepSeek-V3** | | | | |
-| ---------------: | -------------: | ------------------: | -------------: | :----------- |
-| **Input length** | **Concurrent** | **Full Compute(s)** | **SSD80%(s)** | **Speedup** |
-| 2 000 | 1 | 0.66971 | 0.33960 | **+97.2 %** |
-| 4 000 | 1 | 1.73146 | 0.48720 | **+255.4 %** |
-| 8 000 | 1 | 3.33155 | 0.86782 | **+283.9 %** |
-| 16 000 | 1 | 6.71235 | 2.09067 | **+221.1 %** |
-| 32 000 | 1 | 14.16003 | 4.26111 | **+232.3 %** |
-| 2 000 | 2 | 0.94628 | 0.50635 | **+86.9 %** |
-| 4 000 | 2 | 2.56590 | 0.71750 | **+257.6 %** |
-| 8 000 | 2 | 4.98428 | 1.32238 | **+276.9 %** |
-| 16 000 | 2 | 10.08294 | 3.10009 | **+225.2 %** |
-| 32 000 | 2 | 21.11799 | 6.35784 | **+232.2 %** |
-| 2 000 | 4 | 2.86674 | 0.84273 | **+240.2 %** |
-| 4 000 | 4 | 5.42761 | 1.35695 | **+300.0 %** |
-| 8 000 | 4 | 10.90076 | 3.02942 | **+259.8 %** |
-| 16 000 | 4 | 22.43841 | 6.59230 | **+240.4 %** |
-| 32 000 | 4 | 43.29353 | 14.51481 | **+198.3 %** |
-| 2 000 | 8 | 5.69329 | 1.82275 | **+212.3 %** |
-| 4 000 | 8 | 11.80801 | 3.36708 | **+250.7 %** |
-| 8 000 | 8 | 23.93016 | 7.01634 | **+241.1 %** |
-| 16 000 | 8 | 42.04222 | 14.78947 | **+184.3 %** |
-| 32 000 | 8 | 78.55850 | 35.63042 | **+120.5 %** |
+| **QwQ-32B** | | | | |
+| ---------------: | -------------: | -------------------: | -------------: | :------------ |
+| **Input length** | **Concurrent** | **Full Compute (ms)** | **SSD80% (ms)** | **Speedup (%)** |
+| 4 000 | 1 | 223.05 | 156.54 | **+42.5%** |
+| 8 000 | 1 | 350.47 | 228.27 | **+53.5%** |
+| 16 000 | 1 | 708.94 | 349.17 | **+103.0%** |
+| 32 000 | 1 | 1512.04 | 635.18 | **+138.0%** |
+| 4 000 | 8 | 908.52 | 625.92 | **+45.1%** |
+| 8 000 | 8 | 1578.72 | 955.25 | **+65.3%** |
+| 16 000 | 8 | 3139.03 | 1647.72 | **+90.5%** |
+| 32 000 | 8 | 6735.25 | 3025.23 | **+122.6%** |
+| 4 000 | 16 | 1509.79 | 919.53 | **+64.2%** |
+| 8 000 | 16 | 2602.34 | 1480.30 | **+75.8%** |
+| 16 000 | 16 | 5732.49 | 2393.54 | **+139.5%** |
+| 32 000 | 16 | 11891.61 | 4790.00 | **+148.3%** |
 
 ## Configuration for Prefix Caching
 
@@ -130,7 +94,7 @@ load_only_first_rank: false
 Timeout in milliseconds for external interfaces.
 
 * **buffer_size** *(optional, default: 64GB)*
-Amount of HBM memory used by a single worker process.
+Amount of dram pinned memory used by a single worker process.
 
 ### Must-be-Set Parameters
 
@@ -200,7 +164,7 @@ The `vllm bench` terminal prints the benchmark result:
 
 ```
 ---------------Time to First Token----------------
-Mean TTFT (ms): 15323.87
+Mean TTFT (ms): 15001.64
 ```
 
 Inspecting the vLLM server logs reveals entries like:
@@ -216,7 +180,7 @@ Running the same benchmark again produces:
 
 ```
 ---------------Time to First Token----------------
-Mean TTFT (ms): 1920.68
+Mean TTFT (ms): 2874.21
 ```
 
 The vLLM server logs now contain similar entries:
@@ -225,7 +189,7 @@ The vLLM server logs now contain similar entries:
 INFO ucm_connector.py:317: request_id: xxx, total_blocks_num: 125, hit hbm: 0, hit external: 125
 ```
 
-This indicates that during the second request, UCM successfully retrieved all 125 cached KV blocks from the storage backend. Leveraging the fully cached prefix significantly reduces the initial latency observed by the model, yielding an approximate **8× improvement in TTFT** compared to the initial run.
+This indicates that during the second request, UCM successfully retrieved all 125 cached KV blocks from the storage backend. Leveraging the fully cached prefix significantly reduces the initial latency observed by the model, yielding an approximate **5× improvement in TTFT** compared to the initial run.
 
 ### Log Message Structure
 > If you want to view detailed transfer information, set the environment variable `UC_LOGGER_LEVEL` to `debug`.
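
Reviewer note: the figures added to `pipline_store.md` can be cross-checked with a few lines of arithmetic. The sketch below assumes the Speedup column is defined as `(Full Compute / SSD80% - 1) × 100`, which the commit does not state explicitly but which is consistent with the numbers. The helper name `speedup_percent` is hypothetical, and the rows are copied from the new QwQ-32B table.

```python
def speedup_percent(baseline_ms, cached_ms):
    """Relative speedup of the cached run over the baseline, in percent (assumed definition)."""
    return (baseline_ms / cached_ms - 1.0) * 100.0


# (input length, concurrency, Full Compute ms, SSD80% ms, reported speedup %)
rows = [
    (4000, 1, 223.05, 156.54, 42.5),
    (8000, 1, 350.47, 228.27, 53.5),
    (16000, 1, 708.94, 349.17, 103.0),
    (32000, 1, 1512.04, 635.18, 138.0),
    (4000, 8, 908.52, 625.92, 45.1),
    (8000, 8, 1578.72, 955.25, 65.3),
    (16000, 8, 3139.03, 1647.72, 90.5),
    (32000, 8, 6735.25, 3025.23, 122.6),
    (4000, 16, 1509.79, 919.53, 64.2),
    (8000, 16, 2602.34, 1480.30, 75.8),
    (16000, 16, 5732.49, 2393.54, 139.5),
    (32000, 16, 11891.61, 4790.00, 148.3),
]

# Largest gap between the recomputed and the reported speedup, in percentage points.
max_deviation = max(
    abs(speedup_percent(full, cached) - reported)
    for _, _, full, cached, reported in rows
)
print(f"largest deviation from reported speedup: {max_deviation:.2f} points")

# TTFT of the first (cold) and second (fully cached) benchmark runs from this file's diff.
cold_ttft_ms = 15001.64
warm_ttft_ms = 2874.21
ttft_ratio = cold_ttft_ms / warm_ttft_ms
print(f"TTFT improvement: {ttft_ratio:.2f}x")
```

Under that assumed definition, every reported speedup agrees with the recomputed value to within 0.1 percentage points, and the TTFT ratio of about 5.2 supports the change from "8×" to "5×" in this file. The `nfs_store.md` diff shows only the second-run TTFT, so the same check cannot be made for that file from this commit alone.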
