### PrisKV clusters: from individual nodes to a shared memory pool
In the above examples, PrisKV is deployed as a single node. At larger scale, you can turn it into a **cluster-level shared KV memory pool**. AIBrix's orchestration layer takes care of turning multiple PrisKV servers into a coherent cluster:
* Cluster specs (capacity, number of nodes, tiers) are described declaratively via CRDs.
* PrisKV servers are sharded using a consistent-hash–style scheme, and membership/routing metadata is kept in a small control component.
For cluster setup, refer to the [documentation](https://github.com/aibrix/PrisKV/tree/main/samples/cluster).
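The consistent-hash–style sharding mentioned above can be illustrated with a minimal sketch. This is not AIBrix's actual implementation; the node names, virtual-node count, and hash choice are all illustrative assumptions, shown only to convey how KV block keys map stably onto PrisKV servers:

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Stable 64-bit hash used to place both keys and virtual nodes on the ring.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


class ConsistentHashRing:
    """Toy consistent-hash ring mapping KV block keys to server names."""

    def __init__(self, nodes, vnodes=64):
        # Each physical node contributes `vnodes` points for smoother balance.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._ring]

    def lookup(self, key: str) -> str:
        # The key is owned by the first ring point at or after its hash (wrapping).
        idx = bisect.bisect(self._hashes, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


ring = ConsistentHashRing(["priskv-0", "priskv-1", "priskv-2"])
owner = ring.lookup("kv-block:layer0:chunk42")
```

The practical property this buys is minimal remapping: when a node is added or removed, only the keys adjacent to its ring points move, so most cached KV blocks stay on their current server.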
## PrisKV Performance
Before diving into end-to-end engine benchmarks, we micro-benchmark PrisKV with **value sizes of 512KB, 1MB, 2MB, 4MB, and 8MB** on H20 GPUs with a 400Gbps RDMA network, which roughly match the KV footprint of 16-64 tokens for 8B/30B/70B-class models (such as Llama-8B, Qwen-32B, and Llama-70B). Under this setting, a single PrisKV node sustains **tens of thousands of QPS with sub-millisecond average latency** over RDMA, indicating that the KV store itself has ample headroom and is unlikely to be the bottleneck for the L2 KVCache path.
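The token-to-bytes correspondence can be sanity-checked with back-of-the-envelope arithmetic. The model config below is an assumption for an 8B-class GQA model (e.g. 32 layers, 8 KV heads, head dim 128, fp16), not a spec pulled from the benchmark itself:

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int = 2) -> int:
    # K and V tensors (factor of 2) across all layers, fp16/bf16 elements.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Assumed 8B-class config: 32 layers, 8 KV heads (GQA), head_dim 128.
per_token = kv_bytes_per_token(32, 8, 128)  # 128 KB per token
block_16 = 16 * per_token                   # 2 MB for a 16-token block
block_64 = 64 * per_token                   # 8 MB for a 64-token block
```

Under these assumptions a 16-64 token block lands at 2-8MB, with larger 30B/70B-class models (more layers, hence more KV bytes per token) and smaller blocks filling out the rest of the 512KB-8MB range.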
### End-to-End Benchmarking
Across end-to-end vLLM benchmarking on Nvidia H20 GPUs with Qwen3-32B (TP=4) on 8k-token prompts and 200-token outputs at 16 and 32 concurrent requests, PrisKV-powered KVCache offloading consistently delivers substantial throughput and latency improvements over the baseline: at 16 concurrency, request and token throughputs increase by about 4.8x while mean TTFT drops by ~90%; TPOT also falls by ~75%. At 32 concurrency, gains are even larger: throughput improves by roughly 6.35x, mean TTFT decreases by ~90.7% (4842ms → 450ms), and TPOT falls by 83-84%, as shown in the following figures.
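The headline TTFT figure is straightforward to verify from the reported raw numbers:

```python
# Reported mean TTFT at 32 concurrency: 4842ms baseline vs 450ms with PrisKV.
baseline_ttft_ms = 4842
priskv_ttft_ms = 450

reduction = 1 - priskv_ttft_ms / baseline_ttft_ms  # ~0.907, i.e. the ~90.7% drop
```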