## KV-Cache Indexer Overview
The major component of this project is the **KV-Cache Indexer**: a high-performance library that keeps a global, near-real-time view of KV-Cache block locality across a fleet of vLLM pods.
It is powered by `KVEvents` streamed from vLLM, which provide structured metadata as KV-blocks are created or evicted from a vLLM instance's KV-cache.
This allows the indexer to track which blocks reside on which nodes and on which tier (e.g., GPU or CPU).
This metadata is the foundation for intelligent routing, enabling schedulers to make optimal, KV-cache-aware placement decisions.
The diagram below shows the primary data flows: the **Read Path** (scoring) and the **Write Path** (event ingestion).
```mermaid
graph TD
    subgraph "Scheduler"
        A[Scheduler]
    end

    subgraph "KV-Cache Manager"
        B[KVCache Indexer API]
        C[KV-Block Index]
        D[Event Subscriber]
    end

    subgraph "vLLM Fleet"
        E[vLLM Pod 1]
        F[vLLM Pod 2]
        G[...]
    end

    A -->|"1. Score(prompt, pods)"| B
    B -->|2. Query Index| C
    B -->|3. Return Scores| A

    E -->|A. Emit KVEvents| D
    F -->|A. Emit KVEvents| D
    D -->|B. Update Index| C
```
_Note: 1-3 represent the Read Path for scoring pods, while A-B represent the Write Path for ingesting KVEvents._
1. **Scoring Request**: A scheduler asks the **KVCache Indexer** to score a set of pods for a given prompt.
2. **Index Query**: The indexer calculates the necessary KV-block keys from the prompt and queries the **KV-Block Index** to see which pods have those blocks.
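From the router's side, the two steps above collapse into a single scoring call. The sketch below is illustrative only: the `Indexer` type, its `Score` signature, and `bestPod` are hypothetical stand-ins for this interaction, not the project's actual Go API.

```go
package main

import "fmt"

// Indexer is an illustrative stand-in for the KVCache Indexer's scoring
// surface; the real API and types may differ.
type Indexer struct {
	// consecutive cached blocks per pod, as the index would report them
	cached map[string]int
}

// Score returns, for each candidate pod, how many consecutive KV-blocks
// of the prompt's prefix that pod already holds.
func (ix *Indexer) Score(prompt string, pods []string) map[string]int {
	scores := make(map[string]int, len(pods))
	for _, p := range pods {
		scores[p] = ix.cached[p] // pods unknown to the index score 0
	}
	return scores
}

// bestPod picks the highest-scoring pod, as a router might.
func bestPod(scores map[string]int, pods []string) string {
	best := pods[0]
	for _, p := range pods[1:] {
		if scores[p] > scores[best] {
			best = p
		}
	}
	return best
}

func main() {
	ix := &Indexer{cached: map[string]int{"vllm-pod-1": 3, "vllm-pod-2": 1}}
	pods := []string{"vllm-pod-1", "vllm-pod-2", "vllm-pod-3"}
	scores := ix.Score("Once upon a time", pods)
	fmt.Println(bestPod(scores, pods)) // vllm-pod-1 has the longest cached prefix
}
```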
-----

The following sections are from `docs/architecture.md`:
# KV-Cache Indexer: Architecture
The **KV-Cache Indexer** is a high-performance library that keeps a global, near-real-time view of KV-Cache block locality across a fleet of vLLM pods.
Its purpose is to enable smart routing and scheduling by exposing a fast, intelligent scoring mechanism for vLLM pods based on their cached KV-blocks.
-----
## System Architecture
The Indexer is built from several modules that work together, each with clear responsibilities.
Separating concerns is a guiding principle in the design of this system.
| Module | Purpose | Default Implementation |
|:---|:---|:---|
| **`kvcache.Indexer`** | The main orchestrator that handles scoring requests | Coordinates all internal modules |
| **`kvevents.Pool`** | Ingests and processes KV-cache events from vLLM pods | A sharded worker pool using ZMQ for event subscription |
| **`kvblock.Index`** | The core data store mapping KV-block hashes to pod locations | An in-memory, two-level LRU cache |
| **`tokenization.PrefixStore`** | Caches tokenized prompt prefixes to avoid re-work | An LRU cache storing text chunks and their corresponding tokens |
| **`kvblock.TokenProcessor`** | Converts token sequences into KV-block keys | Uses a chunking and hashing algorithm compatible with vLLM |
| **`kvblock.Scorer`** | Scores pods based on the sequence of cache hits | Implements a longest consecutive prefix matching strategy |
-----
The system has two primary data flows: the **Read Path** for scoring pods and the **Write Path** for ingesting cache events.
### Read Path: Scoring a Prompt
When a router needs to pick the best pod for a new prompt, it triggers the Read Path.
The goal is to find the pod that has the longest sequence of relevant KV-blocks already in its cache.
A list of pods with their scores is returned to the router.
```mermaid
sequenceDiagram
    %% ...
```
4. **Scoring**: The `Scorer` takes the hit data and scores each pod based on its number of consecutive matching blocks.
5. **Response**: A final map of pod scores is sent back to the router.
Note: step (1) means that the first time a prompt is scored, it may return an empty result while the tokenization happens in the background.
It is assumed that this cache will be populated with common prompts, so the first scoring request is an edge case.
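The scoring step above (longest consecutive prefix matching) can be sketched as follows. `scoreConsecutive` is a simplified stand-in for `kvblock.Scorer`: it takes the per-block hit sets an index query would produce, in prompt order, and counts each pod's unbroken run of leading hits.

```go
package main

import "fmt"

// scoreConsecutive counts, per pod, how many of the prompt's block keys it
// holds consecutively from the start. hits[i] is the set of pods holding the
// i-th block. Simplified stand-in for kvblock.Scorer's prefix strategy.
func scoreConsecutive(hits []map[string]bool, pods []string) map[string]int {
	scores := make(map[string]int, len(pods))
	for _, pod := range pods {
		n := 0
		for _, holders := range hits {
			if !holders[pod] {
				break // the chain of consecutive hits ends here
			}
			n++
		}
		scores[pod] = n
	}
	return scores
}

func main() {
	hits := []map[string]bool{
		{"pod-a": true, "pod-b": true}, // block 0: both pods
		{"pod-a": true},                // block 1: pod-b misses
		{"pod-a": true, "pod-b": true}, // block 2: pod-b's chain already broke
	}
	scores := scoreConsecutive(hits, []string{"pod-a", "pod-b"})
	fmt.Println(scores["pod-a"], scores["pod-b"]) // pod-a: 3, pod-b: 1
}
```

Note that pod-b scores 1, not 2: a hit on block 2 does not count once the chain broke at block 1, which is exactly why consecutive prefix length (rather than total hits) is the metric.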
### Write Path: Processing Cache Events
The Write Path keeps the index up-to-date by processing a constant stream of events from the vLLM fleet.
To guarantee compatibility, the indexer perfectly matches vLLM's content-addressing logic.
* **Token Chunking**: Prompts are converted to tokens, which are then grouped into fixed-size chunks (default: 16).
* **Hash Algorithm**: A chained hash is computed. Each block's key is the **lower 64 bits of a SHA-256 hash**, generated from the CBOR-encoded `[parentHash, tokenChunk, extraKeys]` tuple.
* **Initialization**: The hash chain starts with a configurable `HashSeed`. This value's source **must** align with the `PYTHONHASHSEED` environment variable in the vLLM pods to ensure hashes are consistent across the entire system.
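The chaining described above can be sketched in Go. One caveat: the real implementation CBOR-encodes the `[parentHash, tokenChunk, extraKeys]` tuple before hashing, while this sketch substitutes a plain binary encoding, so it illustrates the chaining and the lower-64-bit truncation but will not reproduce vLLM's actual key values.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// chainedBlockKeys sketches the chained-hash scheme: each block key is
// derived from the previous block's key plus the block's token chunk.
// NOTE: a stand-in serialization is used here instead of CBOR, so these
// keys demonstrate the structure only, not vLLM-compatible values.
func chainedBlockKeys(seed uint64, chunks [][]uint32) []uint64 {
	keys := make([]uint64, 0, len(chunks))
	parent := seed // the chain starts from the configurable seed
	for _, chunk := range chunks {
		buf := make([]byte, 8, 8+4*len(chunk))
		binary.BigEndian.PutUint64(buf, parent) // chain in the parent hash
		for _, tok := range chunk {
			buf = binary.BigEndian.AppendUint32(buf, tok)
		}
		sum := sha256.Sum256(buf)
		// take the lower 64 bits of the 256-bit digest as the block key
		parent = binary.BigEndian.Uint64(sum[24:])
		keys = append(keys, parent)
	}
	return keys
}

func main() {
	chunks := [][]uint32{{1, 2, 3}, {4, 5, 6}}
	fmt.Println(chainedBlockKeys(0, chunks))
}
```

Because each key folds in its parent, two prompts that share a prefix share exactly the keys of the common leading chunks, which is what makes prefix matching in the index possible.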
#### Index Backends
The `kvblock.Index` is an interface with swappable backends.
* **In-Memory (Default)**: A very fast, thread-safe, two-level LRU cache using `hashicorp/golang-lru`. The first level maps a block key to a second-level cache of pods that have the block. It prioritizes speed over persistence, which is usually the right trade-off for ephemeral cache data.
* **Redis (Optional)**: A distributed backend that can be shared by multiple indexer replicas. It can offer scalability and persistence, but this may be overkill given the short lifetime of most KV-cache blocks.
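The shape of the default backend can be sketched as a two-level map (block key → pod → tier). This is a deliberately simplified stand-in: it omits the LRU eviction and concurrency control that the real `hashicorp/golang-lru`-based implementation provides.

```go
package main

import "fmt"

// blockIndex is a simplified, non-evicting stand-in for the in-memory
// kvblock.Index: the first level maps a block key to the set of pods
// (with their cache tier) that hold it.
type blockIndex struct {
	blocks map[uint64]map[string]string // key -> pod -> tier ("gpu"/"cpu")
}

func newBlockIndex() *blockIndex {
	return &blockIndex{blocks: make(map[uint64]map[string]string)}
}

// Add records that a pod holds a block on a given tier (Write Path,
// e.g. on a BlockStored event).
func (ix *blockIndex) Add(key uint64, pod, tier string) {
	if ix.blocks[key] == nil {
		ix.blocks[key] = make(map[string]string)
	}
	ix.blocks[key][pod] = tier
}

// Remove drops a pod's entry for a block (e.g. on a BlockRemoved event).
func (ix *blockIndex) Remove(key uint64, pod string) {
	delete(ix.blocks[key], pod)
}

// Lookup returns the pods holding a block (Read Path).
func (ix *blockIndex) Lookup(key uint64) map[string]string {
	return ix.blocks[key]
}

func main() {
	ix := newBlockIndex()
	ix.Add(42, "vllm-pod-1", "gpu")
	ix.Add(42, "vllm-pod-2", "cpu")
	ix.Remove(42, "vllm-pod-2")
	fmt.Println(ix.Lookup(42)) // only vllm-pod-1 remains
}
```

The two-level layout matters because both dimensions need bounding independently: the number of tracked block keys and, per key, the number of pods, which is why the real implementation uses an LRU at each level.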
#### Tokenization Subsystem
Efficiently handling tokenization is critical for performance.
* **Tokenizer Caching**: The actual tokenization is handled by a `CachedHFTokenizer`, which wraps Hugging Face's high-performance Rust tokenizers. To avoid the overhead of repeatedly loading tokenizer models from disk, it maintains an LRU cache of active tokenizer instances.
* **PrefixStore Backends**: The token cache (`PrefixStore`) is an interface with two available implementations:
  * **`LRUTokenStore` (Default)**: This implementation chunks incoming text, hashes it, and stores blocks of tokens in an LRU cache. It's fast and memory-bounded, making it a reliable default. It's designed to find the longest chain of *blocks* that match a prompt's prefix. It is not the default due to its higher complexity and lower performance in most scenarios.
  * **`TrieTokenStore`**: An alternative implementation that uses a character-based trie. Each node in the trie stores information about the last token that was fully contained within the prefix leading to that node. This approach can be more memory-efficient for prompts with highly repetitive or overlapping prefixes, but is generally slower than the LRU-based store. It is not the default due to its higher complexity and lower performance in most scenarios.
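The `LRUTokenStore` idea can be sketched minimally, under loud assumptions: a toy 4-character chunk size, a plain map instead of an LRU, and position-keyed chunks. It shows how a lookup walks the prompt chunk by chunk from the start and stops at the first miss, yielding the tokens of the longest cached prefix.

```go
package main

import "fmt"

const chunkChars = 4 // illustrative chunk size; the real store uses larger text chunks

// prefixStore is a simplified, non-LRU stand-in for tokenization.PrefixStore:
// it maps fixed-size text chunks to the tokens produced for that chunk.
type prefixStore struct {
	chunks map[string][]uint32
}

// key ties a chunk to its position so identical text at different
// offsets does not collide.
func key(i int, chunk string) string { return fmt.Sprintf("%d:%s", i, chunk) }

// AddPrefix stores the tokens for each full chunk of a tokenized prompt;
// toks[i] holds the tokens for the i-th chunk of text.
func (s *prefixStore) AddPrefix(text string, toks [][]uint32) {
	for i := 0; i+chunkChars <= len(text) && i/chunkChars < len(toks); i += chunkChars {
		s.chunks[key(i/chunkChars, text[i:i+chunkChars])] = toks[i/chunkChars]
	}
}

// LongestPrefixTokens walks the prompt chunk by chunk from the start and
// concatenates cached tokens until the first miss.
func (s *prefixStore) LongestPrefixTokens(text string) []uint32 {
	var out []uint32
	for i := 0; i+chunkChars <= len(text); i += chunkChars {
		toks, ok := s.chunks[key(i/chunkChars, text[i:i+chunkChars])]
		if !ok {
			break // cache miss: the known prefix ends here
		}
		out = append(out, toks...)
	}
	return out
}

func main() {
	s := &prefixStore{chunks: make(map[string][]uint32)}
	s.AddPrefix("abcdefgh", [][]uint32{{10, 11}, {12}})
	fmt.Println(s.LongestPrefixTokens("abcdxxxx")) // only the first chunk matches
}
```

Any tokens past the cached prefix would still need a real tokenizer pass; the store only short-circuits the work for the shared leading chunks.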
-----
## Dependencies
The Indexer relies on several libraries and tools:
147
+
***[daulet/tokenizers](https://github.com/daulet/tokenizers)**: Go bindings for the HuggingFace Tokenizers library.
148
+
* Used for tokenization of prompts.
149
+
***[pebbe/zmq4](https://github.com/pebbe/zmq4)**: Go bindings for ZeroMQ.
150
+
* Used for the event processing pool and communication between components.
151
+
* Requires `libzmq` library to be installed on the system.
152
+
***Python**: Required to run a CGO binding for the `chat_completions_template` package.
153
+
* Used for jinja2 templating of chat completions requests.
-----

Configures how tokens are converted to KV-block keys. An illustrative `TokenProcessorConfig` snippet (field names are assumptions following the camelCase convention used in the Notes below; the default chunk size is 16):

```json
{
  "blockSize": 16,
  "hashSeed": ""
}
```
---
## Notes
1. **Hash Seed Alignment**: The `hashSeed` in `TokenProcessorConfig` should be aligned with vLLM's `PYTHONHASHSEED` environment variable to ensure consistent hashing across the system.
2. **Memory Considerations**: The `size` parameter in `InMemoryIndexConfig` directly affects memory usage. Each key-value pair consumes memory proportional to the number of associated pods.
3. **Performance Tuning**:
   - Increase `workersCount` in tokenization config for higher tokenization throughput
   - Adjust `concurrency` in event processing for better event handling performance
   - Tune cache sizes based on available memory and expected workload
4. **Cache Directories**: If used, ensure the `tokenizersCacheDir` has sufficient disk space and appropriate permissions for the application to read/write tokenizer files.
5. **Redis Configuration**: When using the Redis backend, ensure the Redis server is accessible and has sufficient memory. The `address` field supports full Redis URLs including authentication: `redis://user:pass@host:port/db`.