Commit 61cb683

Fix some typos in PrisKV blog post (#29)
1 parent 12112af commit 61cb683

File tree

1 file changed: +4 -4 lines changed


content/posts/2025-11-26-priskv-intro.md

Lines changed: 4 additions & 4 deletions
@@ -1,7 +1,7 @@
 ---
 date: '2025-11-26T12:00:00-00:00'
 draft: false
-title: 'PrisKV: An Colocated Tiered KVCache Store for LLM Serving'
+title: 'PrisKV: A Colocated Tiered KVCache Store for LLM Serving'
 author: ["Xu Wang", "Jinlong Xuan", "Yi Wang", "Haiyang Shi", "Bo Liu", "Jiaxin Shan"]
 
 disableShare: true
@@ -117,7 +117,7 @@ To address this, AIBrix provides a production-grade **[KVCache Offloading Framew
 
 ### PrisKV clusters: from individual nodes to a shared memory pool
 
-In the above examples, PrisKV is deployed as a single node. On a larger scale, you can make it as **cluster-level shared KV memory pool**.AIBrix's orchestration layer takes care of turning multiple PrisKV servers into a coherent cluster:
+In the above examples, PrisKV is deployed as a single node. On a larger scale, you can make it as a **cluster-level shared KV memory pool**. AIBrix's orchestration layer takes care of turning multiple PrisKV servers into a coherent cluster:
 
 * Cluster specs (capacity, number of nodes, tiers) are described declaratively via CRDs.
 * PrisKV servers are sharded using a consistent-hash–style scheme, and membership/routing metadata is kept in a small control component.
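The consistent-hash–style sharding mentioned in the hunk above is a standard ring technique; here is a minimal, self-contained sketch of the idea. The `HashRing` class, virtual-node count, and node names are illustrative assumptions, not PrisKV's actual routing code:

```python
import bisect
import hashlib

# Illustrative sketch of consistent-hash routing across KV servers.
# Names and parameters are hypothetical, not PrisKV's implementation.
class HashRing:
    def __init__(self, nodes, vnodes=128):
        self._ring = []  # sorted (hash, node) points on the ring
        for node in nodes:
            for i in range(vnodes):  # virtual nodes smooth the distribution
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()

    @staticmethod
    def _hash(key: str) -> int:
        return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")

    def node_for(self, key: str) -> str:
        # The first ring point clockwise from the key's hash owns the key.
        idx = bisect.bisect(self._ring, (self._hash(key),))
        return self._ring[idx % len(self._ring)][1]

ring = HashRing(["priskv-0", "priskv-1", "priskv-2"])
print(ring.node_for("kvblock:prefix-hash-42"))
```

With virtual nodes, adding or removing a server only remaps the small slice of keys adjacent to its ring points, which keeps KVCache churn bounded during scale events.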
@@ -257,7 +257,7 @@ curl localhost:18000/v1/chat/completions \
 
 For cluster setup, feel free to refer to [documentation](https://github.com/aibrix/PrisKV/tree/main/samples/cluster) for more references.
 
-## How PrisKV performs?
+## PrisKV Performance
 
 Before diving into end-to-end engine benchmarks, we micro-benchmark PrisKV with **value sizes 512KB, 1MB, 2MB, 4MB, and 8MB** on H20 with 400Gbps RDMA Network, which roughly match the KV footprint of 16-64 tokens for 8B/30B/70B-class models (such as Llama-8B, Qwen-32B, and Llama-70B). Under this setting, a single PrisKV node sustains **tens of thousands of QPS with sub-millisecond average latency** over RDMA, indicating that the KV store itself has ample headroom and is unlikely to be the bottleneck for the L2 KVCache path.
 
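For intuition on how those value sizes map to 16-64 tokens, the per-token KVCache footprint of a dense transformer is roughly 2 × layers × kv_heads × head_dim × dtype_bytes. A back-of-the-envelope check, assuming an 8B-class shape (32 layers, 8 KV heads of dimension 128, fp16; these numbers are assumptions, not taken from the post):

```python
# Back-of-the-envelope per-token KVCache size for a dense transformer.
# Model shape is an assumed 8B-class config; substitute your model's values.
def kv_bytes_per_token(layers: int = 32, kv_heads: int = 8,
                       head_dim: int = 128, dtype_bytes: int = 2) -> int:
    return 2 * layers * kv_heads * head_dim * dtype_bytes  # 2 = K and V

per_token = kv_bytes_per_token()
print(f"{per_token / 1024:.0f} KiB/token")              # 128 KiB/token
print(f"{16 * per_token / 2**20:.1f} MiB / 16 tokens")  # 2.0 MiB
print(f"{64 * per_token / 2**20:.1f} MiB / 64 tokens")  # 8.0 MiB
```

which lands 16 tokens at about 2 MiB and 64 tokens at about 8 MiB, consistent with the 512KB-8MB range quoted above.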
@@ -267,7 +267,7 @@ Before diving into end-to-end engine benchmarks, we micro-benchmark PrisKV with
 
 ### End-to-End Benchmarking
 
-Across end-to-end vLLM benchmarking on Nvidia H20 GPUs with Qwen3-32B (TP=4) on 8k-token prompts and 200-token outputs at 16 and 32 concurrent requests, PrisKV-powered KVCache offloading consistently delivers substantial throughput and latency improvements over the baseline: at 16 concurrency, request and token throughputs increase by about 4.8x while mean TTFT drops by \~90%; TPOT also falls by \~75%. At 32 concurrency, gains are even larger: throughput improves by roughly 6.35x mean TTFT decreases by \~90.7% (4842ms→450ms)with TPOT reductions of 83–84%, as shown in following figures.
+Across end-to-end vLLM benchmarking on Nvidia H20 GPUs with Qwen3-32B (TP=4) on 8k-token prompts and 200-token outputs at 16 and 32 concurrent requests, PrisKV-powered KVCache offloading consistently delivers substantial throughput and latency improvements over the baseline: at 16 concurrency, request and token throughputs increase by about 4.8x while mean TTFT drops by ~90%; TPOT also falls by ~75%. At 32 concurrency, gains are even larger: throughput improves by roughly 6.35x, mean TTFT decreases by ~90.7% (4842ms→450ms) with TPOT reductions of 83–84%, as shown in the following figures.
 
 <p align="center">
 <img src="/images/priskv/priskv-e2e-benchmark-throughput.png" width="30%" style="display:inline-block; margin-right:1%" />
