**`docs/source/user-guide/prefix-cache/nfs_store.md`** (2 additions, 2 deletions)
````diff
@@ -198,7 +198,7 @@ Running the same benchmark again produces:
 
 ```
 ---------------Time to First Token----------------
-Mean TTFT (ms): 1920.68
+Mean TTFT (ms): 3183.97
 ```
 
 The vLLM server logs now contain similar entries:
@@ -207,7 +207,7 @@ The vLLM server logs now contain similar entries:
 INFO ucm_connector.py:228: request_id: xxx, total_blocks_num: 125, hit hbm: 0, hit external: 125
 ```
 
-This indicates that during the second request, UCM successfully retrieved all 125 cached KV blocks from the storage backend. Leveraging the fully cached prefix significantly reduces the initial latency observed by the model, yielding an approximate **8× improvement in TTFT** compared to the initial run.
+This indicates that during the second request, UCM successfully retrieved all 125 cached KV blocks from the storage backend. Leveraging the fully cached prefix significantly reduces the initial latency observed by the model, yielding an approximate **5× improvement in TTFT** compared to the initial run.
 
 ### Log Message Structure
 > If you want to view detailed transfer information, set the environment variable `UC_LOGGER_LEVEL` to `debug`.
````
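The `ucm_connector` hit-statistics line shown in the diff above is easy to consume programmatically, for example when aggregating cache hit rates across many requests. As a sketch (the `parse_hit_stats` helper and its field names are hypothetical, not part of UCM), the counts could be pulled out with a regular expression:

```python
import re

# Hypothetical helper: parse a UCM connector hit-statistics log line.
LOG_PATTERN = re.compile(
    r"total_blocks_num: (?P<total>\d+), "
    r"hit hbm: (?P<hbm>\d+), hit external: (?P<ext>\d+)"
)

def parse_hit_stats(line: str) -> dict:
    """Extract block counts from a log line and compute the overall hit rate."""
    m = LOG_PATTERN.search(line)
    if m is None:
        raise ValueError("not a UCM hit-statistics log line")
    stats = {k: int(v) for k, v in m.groupdict().items()}
    stats["hit_rate"] = (stats["hbm"] + stats["ext"]) / stats["total"]
    return stats

line = ("INFO ucm_connector.py:228: request_id: xxx, "
        "total_blocks_num: 125, hit hbm: 0, hit external: 125")
print(parse_hit_stats(line))  # hit_rate of 1.0: the full prefix came from cache
```

For the log line in the diff, all 125 blocks are served from the external store (`hbm: 0`, `ext: 125`), so the computed hit rate is 1.0.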
**`docs/source/user-guide/prefix-cache/pipline_store.md`** (21 additions, 57 deletions)
````diff
@@ -15,63 +15,27 @@ Additional Store implementations will be developed in the future and **chained**
 ## Performance
 
 ### Overview
-The following are the multi-concurrency performance test results of UCM in the Prefix Cache scenario under a CUDA environment, showing the performance improvements of UCM on two different models.
-During the tests, HBM cache was disabled, and KV Cache was retrieved and matched only from SSD.
-
-In the QwQ-32B model, the test used one H20 server with 2 GPUs.
-In the DeepSeek-V3 model, the test used two H20 servers with 16 GPUs.
+The following are the multi-concurrency performance test results of UCM in the Prefix Cache scenario under a CUDA environment, showing the performance improvements of UCM in the QwQ-32B model.
+During the tests, HBM cache was disabled, and KV Cache was retrieved and matched only from SSD. The test used **4 x H100 GPUs**.
 
 Here, Full Compute refers to pure VLLM inference, while SSD80% indicates that after UCM pooling, the SSD hit rate of the KV cache is 80%.
 
 The following table shows the results on the QwQ-32B model:
@@ … @@
 Amount of HBM memory used by a single worker process.
+Amount of DRAM pinned memory used by a single worker process.
 
 ### Must-be-Set Parameters
@@ -200,7 +164,7 @@ The `vllm bench` terminal prints the benchmark result:
 
 ```
 ---------------Time to First Token----------------
-Mean TTFT (ms): 15323.87
+Mean TTFT (ms): 15001.64
 ```
 
 Inspecting the vLLM server logs reveals entries like:
@@ -216,7 +180,7 @@ Running the same benchmark again produces:
 
 ```
 ---------------Time to First Token----------------
-Mean TTFT (ms): 1920.68
+Mean TTFT (ms): 2874.21
 ```
 
 The vLLM server logs now contain similar entries:
@@ -225,7 +189,7 @@ The vLLM server logs now contain similar entries:
 INFO ucm_connector.py:317: request_id: xxx, total_blocks_num: 125, hit hbm: 0, hit external: 125
 ```
 
-This indicates that during the second request, UCM successfully retrieved all 125 cached KV blocks from the storage backend. Leveraging the fully cached prefix significantly reduces the initial latency observed by the model, yielding an approximate **8× improvement in TTFT** compared to the initial run.
+This indicates that during the second request, UCM successfully retrieved all 125 cached KV blocks from the storage backend. Leveraging the fully cached prefix significantly reduces the initial latency observed by the model, yielding an approximate **5× improvement in TTFT** compared to the initial run.
 
 ### Log Message Structure
 > If you want to view detailed transfer information, set the environment variable `UC_LOGGER_LEVEL` to `debug`.
````
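The corrected **5×** figure follows directly from the two TTFT means reported in this diff; a quick sanity computation (not UCM code):

```python
# TTFT means from the updated pipline_store.md benchmark runs
cold_ttft_ms = 15001.64  # first run: no cached KV blocks available
warm_ttft_ms = 2874.21   # second run: all 125 blocks hit the external store

speedup = cold_ttft_ms / warm_ttft_ms
print(f"TTFT improvement: {speedup:.1f}x")  # → 5.2x
```

The earlier docs claimed 8× against a warm-run TTFT of 1920.68 ms; with the re-measured 2874.21 ms warm run, the ratio drops to roughly 5×, which is what the diff now states.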
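Both guides close with the same tip about `UC_LOGGER_LEVEL`. A minimal sketch of enabling it before starting the server (the follow-up `vllm serve <model>` invocation mentioned in the comment is an assumption, not taken from the diff):

```shell
# Turn on detailed UCM transfer logging for processes launched from this shell;
# then start the server as usual, e.g. `vllm serve <model>`, to see
# per-transfer debug entries in the logs.
export UC_LOGGER_LEVEL=debug
echo "UC_LOGGER_LEVEL=$UC_LOGGER_LEVEL"
```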