update docs for pipeline store

qyh111 · qyh111 · commit 54dca3d8635f · 2025-12-30T03:46:03.000-08:00
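
Note on the updated numbers: the Speedup column in the new QwQ-32B table appears to be computed as `(Full Compute / SSD80% - 1) * 100`. The docs do not state this formula; it is inferred from the values, so treat it as an assumption. The sketch below recomputes every row from the table and the pipeline store TTFT ratio from the two benchmark runs. The function name `speedup_percent` is made up for illustration and is not part of UCM or vLLM.

```python
def speedup_percent(full_ms: float, cached_ms: float) -> float:
    """Relative speedup in percent (assumed formula, inferred from the table)."""
    return (full_ms / cached_ms - 1.0) * 100.0


# (input length, concurrency, Full Compute ms, SSD80% ms, listed speedup %)
ROWS = [
    (4000, 1, 223.05, 156.54, 42.5),
    (8000, 1, 350.47, 228.27, 53.5),
    (16000, 1, 708.94, 349.17, 103.0),
    (32000, 1, 1512.04, 635.18, 138.0),
    (4000, 8, 908.52, 625.92, 45.1),
    (8000, 8, 1578.72, 955.25, 65.3),
    (16000, 8, 3139.03, 1647.72, 90.5),
    (32000, 8, 6735.25, 3025.23, 122.6),
    (4000, 16, 1509.79, 919.53, 64.2),
    (8000, 16, 2602.34, 1480.30, 75.8),
    (16000, 16, 5732.49, 2393.54, 139.5),
    (32000, 16, 11891.61, 4790.00, 148.3),
]

# Largest gap between a recomputed speedup and the value listed in the table.
max_deviation = max(
    abs(speedup_percent(full, cached) - listed)
    for _, _, full, cached, listed in ROWS
)

# Pipeline store TTFT: first (cold) run divided by second (cached) run.
ttft_ratio = 15001.64 / 2874.21

if __name__ == "__main__":
    print(f"max deviation from listed speedup: {max_deviation:.2f} points")
    print(f"pipeline store TTFT ratio: {ttft_ratio:.2f}x")
```

Every listed speedup agrees with the recomputed value to within 0.1 percentage points, and the pipeline store TTFT ratio is a little over 5, which matches the "5×" wording in the updated text.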
diff --git a/docs/source/user-guide/prefix-cache/nfs_store.md b/docs/source/user-guide/prefix-cache/nfs_store.md
@@ -198,7 +198,7 @@ Running the same benchmark again produces:
 
 ```
 ---------------Time to First Token----------------
-Mean TTFT (ms):                            1920.68
+Mean TTFT (ms):                            3183.97
 ```
 
 The vLLM server logs now contain similar entries:
@@ -207,7 +207,7 @@ The vLLM server logs now contain similar entries:
 INFO ucm_connector.py:228: request_id: xxx, total_blocks_num: 125, hit hbm: 0, hit external: 125
 ```
 
-This indicates that during the second request, UCM successfully retrieved all 125 cached KV blocks from the storage backend. Leveraging the fully cached prefix significantly reduces the initial latency observed by the model, yielding an approximate **8× improvement in TTFT** compared to the initial run.
+This indicates that during the second request, UCM retrieved all 125 cached KV blocks from the storage backend. Reusing the fully cached prefix significantly reduces the time to first token, yielding roughly a **5× improvement in TTFT** compared to the initial run.
 
 ### Log Message Structure
 > If you want to view detailed transfer information, set the environment variable `UC_LOGGER_LEVEL` to `debug`.
diff --git a/docs/source/user-guide/prefix-cache/pipline_store.md b/docs/source/user-guide/prefix-cache/pipline_store.md
@@ -15,63 +15,27 @@ Additional Store implementations will be developed in the future and **chained**
 ## Performance
 
 ### Overview
-The following are the multi-concurrency performance test results of UCM in the Prefix Cache scenario under a CUDA environment, showing the performance improvements of UCM on two different models.
-During the tests, HBM cache was disabled, and KV Cache was retrieved and matched only from SSD.
-
-In the QwQ-32B model, the test used one H20 server with 2 GPUs.
-In the DeepSeek-V3 model, the test used two H20 servers with 16 GPUs.
+The following multi-concurrency test results show the performance improvement that UCM delivers in the Prefix Cache scenario on the QwQ-32B model in a CUDA environment.
+During the tests, HBM cache was disabled, and KV Cache was retrieved and matched only from SSD. The tests used **4 x H100 GPUs**.
 
 Here, Full Compute refers to pure VLLM inference, while SSD80% indicates that after UCM pooling, the SSD hit rate of the KV cache is 80%.
 
 The following table shows the results on the QwQ-32B model:
-|      **QwQ-32B** |                |                     |                |              |
-| ---------------: | -------------: | ------------------: | -------------: | :----------- |
-| **Input length** | **Concurrent** | **Full Compute(s)** | **SSD80%(s)** | **Speedup**  |
-|            2 000 |              1 |              0.5311 |         0.2053 | **+158.7 %** |
-|            4 000 |              1 |              1.0269 |         0.3415 | **+200.7 %** |
-|            8 000 |              1 |              2.0902 |         0.6429 | **+225.1 %** |
-|           16 000 |              1 |              4.4852 |         1.3598 | **+229.8 %** |
-|           32 000 |              1 |             10.2037 |         3.0713 | **+232.2 %** |
-|            2 000 |              2 |              0.7938 |         0.3039 | **+161.2 %** |
-|            4 000 |              2 |              1.5383 |         0.4968 | **+209.6 %** |
-|            8 000 |              2 |              3.1323 |         0.9544 | **+228.2 %** |
-|           16 000 |              2 |              6.7984 |         2.0149 | **+237.4 %** |
-|           32 000 |              2 |             15.3395 |         4.5619 | **+236.3 %** |
-|            2 000 |              4 |              1.6572 |         0.5998 | **+176.3 %** |
-|            4 000 |              4 |              2.8173 |         1.2657 | **+122.6 %** |
-|            8 000 |              4 |              5.2643 |         1.9829 | **+165.5 %** |
-|           16 000 |              4 |             11.3651 |         3.9776 | **+185.7 %** |
-|           32 000 |              4 |             25.6718 |         8.2881 | **+209.7 %** |
-|            2 000 |              8 |              2.8559 |         1.2250 | **+133.1 %** |
-|            4 000 |              8 |              5.0003 |         2.0995 | **+138.2 %** |
-|            8 000 |              8 |              9.5365 |         3.6584 | **+160.7 %** |
-|           16 000 |              8 |             20.3839 |         6.8949 | **+195.6 %** |
-|           32 000 |              8 |             46.2107 |        14.8704 | **+210.8 %** |
-
-The following table shows the results on the DeepSeek-V3 model:
-|  **DeepSeek-V3** |                |                     |                |              |
-| ---------------: | -------------: | ------------------: | -------------: | :----------- |
-| **Input length** | **Concurrent** | **Full Compute(s)** | **SSD80%(s)** | **Speedup**  |
-|            2 000 |              1 |             0.66971 |        0.33960 | **+97.2 %**  |
-|            4 000 |              1 |             1.73146 |        0.48720 | **+255.4 %** |
-|            8 000 |              1 |             3.33155 |        0.86782 | **+283.9 %** |
-|           16 000 |              1 |             6.71235 |        2.09067 | **+221.1 %** |
-|           32 000 |              1 |            14.16003 |        4.26111 | **+232.3 %** |
-|            2 000 |              2 |             0.94628 |        0.50635 | **+86.9 %**  |
-|            4 000 |              2 |             2.56590 |        0.71750 | **+257.6 %** |
-|            8 000 |              2 |             4.98428 |        1.32238 | **+276.9 %** |
-|           16 000 |              2 |            10.08294 |        3.10009 | **+225.2 %** |
-|           32 000 |              2 |            21.11799 |        6.35784 | **+232.2 %** |
-|            2 000 |              4 |             2.86674 |        0.84273 | **+240.2 %** |
-|            4 000 |              4 |             5.42761 |        1.35695 | **+300.0 %** |
-|            8 000 |              4 |            10.90076 |        3.02942 | **+259.8 %** |
-|           16 000 |              4 |            22.43841 |        6.59230 | **+240.4 %** |
-|           32 000 |              4 |            43.29353 |       14.51481 | **+198.3 %** |
-|            2 000 |              8 |             5.69329 |        1.82275 | **+212.3 %** |
-|            4 000 |              8 |            11.80801 |        3.36708 | **+250.7 %** |
-|            8 000 |              8 |            23.93016 |        7.01634 | **+241.1 %** |
-|           16 000 |              8 |            42.04222 |       14.78947 | **+184.3 %** |
-|           32 000 |              8 |            78.55850 |       35.63042 | **+120.5 %** |
+|      **QwQ-32B** |                |                      |                |               |
+| ---------------: | -------------: | -------------------: | -------------: | :------------ |
+| **Input length** | **Concurrent** | **Full Compute (ms)** | **SSD80% (ms)** | **Speedup (%)** |
+|            4 000 |              1 |              223.05 |         156.54 | **+42.5%**   |
+|            8 000 |              1 |              350.47 |         228.27 | **+53.5%**   |
+|           16 000 |              1 |              708.94 |         349.17 | **+103.0%**  |
+|           32 000 |              1 |             1512.04 |         635.18 | **+138.0%**  |
+|            4 000 |              8 |              908.52 |         625.92 | **+45.1%**   |
+|            8 000 |              8 |             1578.72 |         955.25 | **+65.3%**   |
+|           16 000 |              8 |             3139.03 |        1647.72 | **+90.5%**   |
+|           32 000 |              8 |             6735.25 |        3025.23 | **+122.6%**  |
+|            4 000 |             16 |             1509.79 |         919.53 | **+64.2%**   |
+|            8 000 |             16 |             2602.34 |        1480.30 | **+75.8%**   |
+|           16 000 |             16 |             5732.49 |        2393.54 | **+139.5%**  |
+|           32 000 |             16 |            11891.61 |        4790.00 | **+148.3%**  |
 
 ## Configuration for Prefix Caching
 
@@ -130,7 +94,7 @@ load_only_first_rank: false
   Timeout in milliseconds for external interfaces.
 
 * **buffer_size** *(optional, default: 64GB)*  
-  Amount of HBM memory used by a single worker process.
+  Amount of pinned DRAM used by a single worker process.
 
 ### Must-be-Set Parameters
 
@@ -200,7 +164,7 @@ The `vllm bench` terminal prints the benchmark result:
 
 ```
 ---------------Time to First Token----------------
-Mean TTFT (ms):                           15323.87
+Mean TTFT (ms):                           15001.64
 ```
 
 Inspecting the vLLM server logs reveals entries like:
@@ -216,7 +180,7 @@ Running the same benchmark again produces:
 
 ```
 ---------------Time to First Token----------------
-Mean TTFT (ms):                            1920.68
+Mean TTFT (ms):                            2874.21
 ```
 
 The vLLM server logs now contain similar entries:
@@ -225,7 +189,7 @@ The vLLM server logs now contain similar entries:
 INFO ucm_connector.py:317: request_id: xxx, total_blocks_num: 125, hit hbm: 0, hit external: 125
 ```
 
-This indicates that during the second request, UCM successfully retrieved all 125 cached KV blocks from the storage backend. Leveraging the fully cached prefix significantly reduces the initial latency observed by the model, yielding an approximate **8× improvement in TTFT** compared to the initial run.
+This indicates that during the second request, UCM retrieved all 125 cached KV blocks from the storage backend. Reusing the fully cached prefix significantly reduces the time to first token, yielding roughly a **5× improvement in TTFT** compared to the initial run.
 
 ### Log Message Structure
 > If you want to view detailed transfer information, set the environment variable `UC_LOGGER_LEVEL` to `debug`.