Commit 7b075d6 (parent 24e5251)

Doc Enhancements (#73)

* doc updates; zmq-local setup fix
* typo fix

Signed-off-by: Maroon Ayoub <[email protected]>

File tree: 8 files changed, +105 −71 lines

LICENSE

Lines changed: 0 additions & 14 deletions

```diff
@@ -1,17 +1,3 @@
-Copyright 2025-present The llm-d Authors.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use these files except in compliance with the License.
-You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-
                                  Apache License
                            Version 2.0, January 2004
                         http://www.apache.org/licenses/
```

Makefile

Lines changed: 36 additions & 3 deletions

```diff
@@ -75,19 +75,19 @@ verify-boilerplate: $(TOOLS_DIR)/verify_boilerplate.py
 	$(TOOLS_DIR)/verify_boilerplate.py --boilerplate-dir=hack/boilerplate --skip docs
 
 .PHONY: unit-test
-unit-test: download-tokenizer
+unit-test: download-tokenizer download-zmq
 	@printf "\033[33;1m==== Running unit tests ====\033[0m\n"
 	go test -ldflags="$(LDFLAGS)" ./pkg/...
 
 .PHONY: e2e-test
-e2e-test: download-tokenizer
+e2e-test: download-tokenizer download-zmq
 	@printf "\033[33;1m==== Running unit tests ====\033[0m\n"
 	go test -v -ldflags="$(LDFLAGS)" ./tests/...
 
 ##@ Build
 
 .PHONY: build
-build: check-go download-tokenizer ##
+build: check-go download-tokenizer download-zmq
 	@printf "\033[33;1m==== Building ====\033[0m\n"
 	go build -ldflags="$(LDFLAGS)" -o bin/$(PROJECT_NAME) examples/kv_cache_index/main.go
 
@@ -354,3 +354,36 @@ print-project-name: ## Print the current project name
 .PHONY: install-hooks
 install-hooks: ## Install git hooks
 	git config core.hooksPath hooks
+
+
+##@ ZMQ Setup
+
+.PHONY: download-zmq
+download-zmq: ## Install ZMQ dependencies based on OS/ARCH
+	@echo "Checking if ZMQ is already installed..."
+	@if pkg-config --exists libzmq; then \
+		echo "✅ ZMQ is already installed."; \
+	else \
+		echo "Installing ZMQ dependencies..."; \
+		if [ "$(TARGETOS)" = "linux" ]; then \
+			if [ -x "$(command -v apt)" ]; then \
+				apt update && apt install -y libzmq3-dev; \
+			elif [ -x "$(command -v dnf)" ]; then \
+				dnf install -y zeromq-devel; \
+			else \
+				echo "Unsupported Linux package manager. Install libzmq manually."; \
+				exit 1; \
+			fi; \
+		elif [ "$(TARGETOS)" = "darwin" ]; then \
+			if [ -x "$(command -v brew)" ]; then \
+				brew install zeromq; \
+			else \
+				echo "Homebrew is not installed and is required to install zeromq. Install it from https://brew.sh/"; \
+				exit 1; \
+			fi; \
+		else \
+			echo "Unsupported OS: $(TARGETOS). Install libzmq manually - check https://zeromq.org/download/ for guidance."; \
+			exit 1; \
+		fi; \
+		echo "✅ ZMQ dependencies installed."; \
+	fi
```

README.md

Lines changed: 18 additions & 13 deletions

````diff
@@ -1,3 +1,8 @@
+[![Go Report Card](https://goreportcard.com/badge/github.com/llm-d/llm-d-kv-cache-manager)](https://goreportcard.com/report/github.com/llm-d/llm-d-kv-cache-manager)
+[![Go Reference](https://pkg.go.dev/badge/github.com/llm-d/llm-d-kv-cache-manager.svg)](https://pkg.go.dev/github.com/llm-d/llm-d-kv-cache-manager)
+[![License](https://img.shields.io/github/license/llm-d/llm-d-kv-cache-manager)](LICENSE)
+[![Join Slack](https://img.shields.io/badge/Join_Slack-blue?logo=slack)](https://llm-d.slack.com/archives/C08TB7ZDV7S)
+
 # KV-Cache Manager
 
 ### Introduction
@@ -16,41 +21,41 @@ See the [Project Northstar](https://docs.google.com/document/d/1EM1QtDUaw7pVRkbH
 
 ## KV-Cache Indexer Overview
 
-One of the major component of this project is the **KVCache Indexer**: a high-performance Go service that maintains a global, near-real-time view of KV-Cache block locality.
+The major component of this project is the **KV-Cache Indexer**: a high-performance library that keeps a global, near-real-time view of KV-Cache block locality across a fleet of vLLM pods.
 
 It is powered by `KVEvents` streamed from vLLM, which provide structured metadata as KV-blocks are created or evicted from a vLLM instance's KV-cache.
 This allows the indexer to track which blocks reside on which nodes and on which tier (e.g., GPU or CPU).
-This metadata is the foundation for intelligent routing, enabling schedulers to make optimal, cache-aware placement decisions.
+This metadata is the foundation for intelligent routing, enabling schedulers to make optimal, KV-cache-aware placement decisions.
 
 The diagram below shows the primary data flows: the **Read Path** (scoring) and the **Write Path** (event ingestion).
 
 ```mermaid
 graph TD
-    subgraph Scheduler / Router
+    subgraph "Scheduler"
         A[Scheduler]
     end
-
-    subgraph KVCacheManager["KV-Cache Manager"]
+
+    subgraph "KV-Cache Manager"
         B[KVCache Indexer API]
         C[KV-Block Index]
         D[Event Subscriber]
     end
 
-    subgraph vLLM Fleet
+    subgraph "vLLM Fleet"
         E[vLLM Pod 1]
         F[vLLM Pod 2]
         G[...]
     end
 
-    A -- "1. Score(prompt, pods)" --> B
-    B -- "2. Query Index" --> C
-    B -- "3. Return Scores" --> A
-
-    E -- "4. Emit KVEvents" --> D
-    F -- "4. Emit KVEvents" --> D
-    D -- "5. Update Index" --> C
+    A -->|"1. Score(prompt, pods)"| B
+    B -->|2. Query Index| C
+    B -->|3. Return Scores| A
 
+    E -->|A. Emit KVEvents| D
+    F -->|A. Emit KVEvents| D
+    D -->|B. Update Index| C
 ```
+_Note: 1-3 represent the Read Path for scoring pods, while A-B represent the Write Path for ingesting KVEvents._
 
 1. **Scoring Request**: A scheduler asks the **KVCache Indexer** to score a set of pods for a given prompt
 2. **Index Query**: The indexer calculates the necessary KV-block keys from the prompt and queries the **KV-Block Index** to see which pods have those blocks
````
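The Write Path above — pods emitting `KVEvents` as blocks are stored or evicted, and the indexer folding them into the KV-Block Index — can be sketched as a simple event-apply loop. This is an illustrative sketch, not the library's implementation: the `event` and `index` types here are invented, and real `KVEvents` payloads carry richer metadata (tier, batched block hashes, etc.).

```go
package main

import "fmt"

// event is a stand-in for the KVEvents metadata streamed from vLLM.
type event struct {
	kind  string // "BlockStored" or "BlockRemoved"
	pod   string
	block uint64
}

// index maps a KV-block key to the set of pods currently holding that block.
type index map[uint64]map[string]bool

// apply folds one event into the index, mirroring the Write Path.
func (ix index) apply(e event) {
	switch e.kind {
	case "BlockStored":
		if ix[e.block] == nil {
			ix[e.block] = make(map[string]bool)
		}
		ix[e.block][e.pod] = true
	case "BlockRemoved":
		delete(ix[e.block], e.pod)
		if len(ix[e.block]) == 0 {
			delete(ix, e.block) // prune empty entries
		}
	}
}

func main() {
	ix := index{}
	for _, e := range []event{
		{"BlockStored", "vllm-pod-1", 0xabc},
		{"BlockStored", "vllm-pod-2", 0xabc},
		{"BlockRemoved", "vllm-pod-1", 0xabc},
	} {
		ix.apply(e)
	}
	fmt.Println(ix[0xabc]) // map[vllm-pod-2:true]
}
```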

docs/architecture.md

Lines changed: 38 additions & 25 deletions

````diff
@@ -1,29 +1,23 @@
-# KV-Cache Indexer: A Technical Architecture
+# KV-Cache Indexer: Architecture
 
-The **KV-Cache Indexer** is a high-performance Go service that keeps a global, near-real-time view of KV-Cache block locality across a fleet of vLLM pods.
-Its purpose is to enable smart routing and scheduling by telling request routers which pods are best-equipped to handle an incoming prompt with the lowest possible latency.
-
-### Core Responsibilities
-
-* **Global Cache Tracking**: Maintains a central index of KV-block locations across all vLLM pods.
-* **Intelligent Pod Scoring**: Scores candidate pods for incoming prompts based on how much of the prompt's prefix they already have cached.
-* **Real-Time Event Processing**: Ingests a high-throughput stream of cache events (`BlockStored`, `BlockRemoved`) from vLLM pods to keep the index fresh.
-* **Ultra-Low-Latency Lookups**: Delivers pod scoring results in sub-millisecond time to ensure scheduling decisions are fast.
+The **KV-Cache Indexer** is a high-performance library that keeps a global, near-real-time view of KV-Cache block locality across a fleet of vLLM pods.
+Its purpose is to enable smart routing and scheduling by exposing a fast, intelligent scoring mechanism for vLLM pods based on their cached KV-blocks.
 
 -----
 
 ## System Architecture
 
-The Indexer is built from several modules that work together, each with a clear job.
+The Indexer is built from several modules that work together, each with clear responsibilities.
+Separating concerns is a guiding principle in the design of this system.
 
-| Module | Purpose | Default Implementation |
-| :--- | :--- | :--- |
-| **`kvcache.Indexer`** | The main orchestrator that handles scoring requests. | Coordinates all internal modules. |
-| **`kvevents.Pool`** | Ingests and processes KV-cache events from vLLM pods. | A sharded worker pool using ZMQ for event subscription. |
-| **`kvblock.Index`** | The core data store mapping KV-block hashes to pod locations. | An in-memory, two-level LRU cache. |
-| **`tokenization.PrefixStore`** | Caches tokenized prompt prefixes to avoid re-work. | An LRU cache storing text chunks and their corresponding tokens. |
-| **`kvblock.TokenProcessor`** | Converts token sequences into content-addressable block keys. | Uses a chunking and hashing algorithm compatible with vLLM. |
-| **`kvblock.Scorer`** | Scores pods based on the sequence of cache hits. | Implements a longest consecutive prefix matching strategy. |
+| Module | Purpose | Default Implementation |
+| :--- | :--- | :--- |
+| **`kvcache.Indexer`** | The main orchestrator that handles scoring requests | Coordinates all internal modules |
+| **`kvevents.Pool`** | Ingests and processes KV-cache events from vLLM pods | A sharded worker pool using ZMQ for event subscription |
+| **`kvblock.Index`** | The core data store mapping KV-block hashes to pod locations | An in-memory, two-level LRU cache |
+| **`tokenization.PrefixStore`** | Caches tokenized prompt prefixes to avoid re-work | An LRU cache storing text chunks and their corresponding tokens |
+| **`kvblock.TokenProcessor`** | Converts token sequences into KV-block keys | Uses a chunking and hashing algorithm compatible with vLLM |
+| **`kvblock.Scorer`** | Scores pods based on the sequence of cache hits | Implements a longest consecutive prefix matching strategy |
 
 -----
 
@@ -33,7 +27,9 @@ The system has two primary data flows: the **Read Path** for scoring pods and th
 
 ### Read Path: Scoring a Prompt
 
-When a router needs to pick the best pod for a new prompt, it triggers the Read Path. The goal is to find the pod that has the longest sequence of relevant KV-blocks already in its cache.
+When a router needs to pick the best pod for a new prompt, it triggers the Read Path.
+The goal is to find the pod that has the longest sequence of relevant KV-blocks already in its cache.
+A list of pods with their scores is returned to the router.
 
 ```mermaid
 sequenceDiagram
````
```diff
@@ -70,6 +66,9 @@
 4. **Scoring**: The `Scorer` takes the hit data and scores each pod based on its number of consecutive matching blocks.
 5. **Response**: A final map of pod scores is sent back to the router.
 
+Note: step (1) means that the first time a prompt is scored, it may return an empty result while the tokenization happens in the background.
+It is assumed that this cache will be populated with common prompts, so the first scoring request is an edge case.
+
 ### Write Path: Processing Cache Events
 
 The Write Path keeps the index up-to-date by processing a constant stream of events from the vLLM fleet.
```
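The longest-consecutive-prefix scoring in steps 4-5 can be sketched as follows. The function shapes are hypothetical, not the `kvblock.Scorer` API: it counts, from the start of the prompt's block-key sequence, how many consecutive blocks a pod already holds.

```go
package main

import "fmt"

// score counts the longest run of consecutive block keys, starting from the
// beginning of the prompt's key sequence, that a pod already holds.
func score(blockKeys []uint64, podBlocks map[uint64]bool) int {
	n := 0
	for _, k := range blockKeys {
		if !podBlocks[k] {
			break // a gap ends the consecutive prefix
		}
		n++
	}
	return n
}

// scorePods returns a per-pod score for the given block-key sequence.
func scorePods(blockKeys []uint64, index map[string]map[uint64]bool) map[string]int {
	scores := make(map[string]int, len(index))
	for pod, blocks := range index {
		scores[pod] = score(blockKeys, blocks)
	}
	return scores
}

func main() {
	keys := []uint64{11, 22, 33, 44}
	index := map[string]map[uint64]bool{
		"pod-a": {11: true, 22: true, 44: true}, // gap at 33 -> score 2
		"pod-b": {11: true, 22: true, 33: true}, // consecutive prefix of 3
	}
	fmt.Println(scorePods(keys, index)) // map[pod-a:2 pod-b:3]
}
```

Note that holding a block deep in the prompt (pod-a's block 44) contributes nothing once the chain is broken, which is exactly what makes the strategy prefix-oriented.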
```diff
@@ -118,16 +117,16 @@ sequenceDiagram
 
 To guarantee compatibility, the indexer perfectly matches vLLM's content-addressing logic.
 
-* **Token Chunking**: Prompts are converted to tokens, which are then grouped into fixed-size chunks (default: 256).
-* **Hash Algorithm**: A chained hash is computed. Each block's key is the **lower 64 bits of a SHA-256 hash**, generated from the CBOR-encoded `[parentHash, tokenChunk]` tuple.
-* **Initialization**: The hash chain starts with a configurable `HashSeed`. This value **must** align with the `PYTHONHASHSEED` environment variable in the vLLM pods to ensure hashes are consistent across the entire system.
+* **Token Chunking**: Prompts are converted to tokens, which are then grouped into fixed-size chunks (default: 16).
+* **Hash Algorithm**: A chained hash is computed. Each block's key is the **lower 64 bits of a SHA-256 hash**, generated from the CBOR-encoded `[parentHash, tokenChunk, extraKeys]` tuple.
+* **Initialization**: The hash chain starts with a configurable `HashSeed`. This value's source **must** align with the `PYTHONHASHSEED` environment variable in the vLLM pods to ensure hashes are consistent across the entire system.
 
 #### Index Backends
 
 The `kvblock.Index` is an interface with swappable backends.
 
 * **In-Memory (Default)**: A very fast, thread-safe, two-level LRU cache using `hashicorp/golang-lru`. The first level maps a block key to a second-level cache of pods that have the block. It prioritizes speed over persistence, which is usually the right trade-off for ephemeral cache data.
-* **Redis (Optional)**: A distributed backend that can be shared by multiple indexer replicas. It offers scalability and persistence, but this may be overkill given the short lifetime of most KV-cache blocks.
+* **Redis (Optional)**: A distributed backend that can be shared by multiple indexer replicas. It can offer scalability and persistence, but this may be overkill given the short lifetime of most KV-cache blocks.
 
 #### Tokenization Subsystem
 
```
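The chunk-then-chain hashing described in the diff above can be sketched as follows. This is an illustrative sketch only: the real indexer CBOR-encodes the `[parentHash, tokenChunk, extraKeys]` tuple and seeds the chain from `HashSeed`, while this version uses a plain big-endian byte encoding and a zero seed, so the resulting keys will not match vLLM's. The chaining structure (each block key depends on its parent's key) is the point being shown.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// chunkTokens groups a token sequence into fixed-size chunks; a trailing
// partial chunk is dropped, since only full blocks are cached.
func chunkTokens(tokens []uint32, size int) [][]uint32 {
	var chunks [][]uint32
	for i := 0; i+size <= len(tokens); i += size {
		chunks = append(chunks, tokens[i:i+size])
	}
	return chunks
}

// blockKey hashes the parent key and the token chunk with SHA-256 and keeps
// the lower 64 bits of the digest.
func blockKey(parent uint64, chunk []uint32) uint64 {
	h := sha256.New()
	var pb [8]byte
	binary.BigEndian.PutUint64(pb[:], parent)
	h.Write(pb[:]) // chain: fold in the parent's key first
	var tb [4]byte
	for _, t := range chunk {
		binary.BigEndian.PutUint32(tb[:], t)
		h.Write(tb[:])
	}
	sum := h.Sum(nil)
	return binary.BigEndian.Uint64(sum[24:32]) // lower 64 bits
}

func main() {
	tokens := make([]uint32, 40) // 40 tokens -> two full chunks of 16
	for i := range tokens {
		tokens[i] = uint32(i)
	}
	key := uint64(0) // the real chain starts from a configurable seed
	for _, chunk := range chunkTokens(tokens, 16) {
		key = blockKey(key, chunk)
		fmt.Printf("block key: %016x\n", key)
	}
}
```

Because each key folds in its parent, two prompts produce identical keys only for their shared prefix blocks — the property the index relies on.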
```diff
@@ -137,4 +136,18 @@ Efficiently handling tokenization is critical for performance. The system is des
 * **Tokenizer Caching**: The actual tokenization is handled by a `CachedHFTokenizer`, which wraps Hugging Face's high-performance Rust tokenizers. To avoid the overhead of repeatedly loading tokenizer models from disk, it maintains an LRU cache of active tokenizer instances.
 * **PrefixStore Backends**: The token cache (`PrefixStore`) is an interface with two available implementations:
     * **`LRUTokenStore` (Default)**: This implementation chunks incoming text, hashes it, and stores blocks of tokens in an LRU cache. It's fast and memory-bounded, making it a reliable default. It's designed to find the longest chain of *blocks* that match a prompt's prefix.
-    * **`TrieTokenStore`**: An alternative implementation that uses a character-based trie. Each node in the trie stores information about the last token that was fully contained within the prefix leading to that node. This approach can be more memory-efficient for prompts with highly repetitive or overlapping prefixes.
+    * **`TrieTokenStore`**: An alternative implementation that uses a character-based trie. Each node in the trie stores information about the last token that was fully contained within the prefix leading to that node. This approach can be more memory-efficient for prompts with highly repetitive or overlapping prefixes, but is generally slower than the LRU-based store.
+      It is not the default due to its higher complexity and lower performance in most scenarios.
+
+-----
+
+## Dependencies
+
+The Indexer relies on several libraries and tools:
+* **[daulet/tokenizers](https://github.com/daulet/tokenizers)**: Go bindings for the HuggingFace Tokenizers library.
+    * Used for tokenization of prompts.
+* **[pebbe/zmq4](https://github.com/pebbe/zmq4)**: Go bindings for ZeroMQ.
+    * Used for the event processing pool and communication between components.
+    * Requires the `libzmq` library to be installed on the system.
+* **Python**: Required to run a CGO binding for the `chat_completions_template` package.
+    * Used for jinja2 templating of chat completions requests.
```
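The `LRUTokenStore` behavior described above — chunking text and returning tokens for the longest run of consecutive cached chunks — can be sketched as below. This sketch is not the library's implementation: it uses an unbounded map where the real store hashes chunks and bounds memory with an LRU, the four-character chunk size is arbitrary, and the stored "tokens" are stand-ins.

```go
package main

import "fmt"

const chunkSize = 4 // characters per chunk; illustrative only

// prefixStore maps a text chunk to its tokens. A real implementation hashes
// chunks and bounds the map with an LRU cache.
type prefixStore map[string][]uint32

// chunkAt returns the full chunk starting at i, or false past the end.
func chunkAt(text string, i int) (string, bool) {
	if i+chunkSize > len(text) {
		return "", false
	}
	return text[i : i+chunkSize], true
}

// addPrefix caches one stand-in token per full chunk of text.
func (p prefixStore) addPrefix(text string) {
	for i := 0; ; i += chunkSize {
		c, ok := chunkAt(text, i)
		if !ok {
			return
		}
		p[c] = []uint32{uint32(i / chunkSize)}
	}
}

// longestCachedPrefix walks the prompt from the start and concatenates tokens
// for the longest run of consecutive cached chunks.
func (p prefixStore) longestCachedPrefix(prompt string) []uint32 {
	var tokens []uint32
	for i := 0; ; i += chunkSize {
		c, ok := chunkAt(prompt, i)
		if !ok {
			return tokens
		}
		toks, hit := p[c]
		if !hit {
			return tokens // first miss ends the chain
		}
		tokens = append(tokens, toks...)
	}
}

func main() {
	store := prefixStore{}
	store.addPrefix("You are a helpful assistant")
	got := store.longestCachedPrefix("You are a helpful robot")
	fmt.Println(len(got)) // 4 chunks shared with the cached prompt
}
```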

docs/configuration.md

Lines changed: 8 additions & 6 deletions

````diff
@@ -1,14 +1,16 @@
-# Configuration Documentation
+# Configuration
 
 This document describes all configuration options available in the llm-d KV Cache Manager.
-All configurations are JSON-serializable and can be provided via configuration files or environment variables.
+All configurations are JSON-serializable.
 
 ## Main Configuration
 
 This package consists of two components:
 1. **KV Cache Indexer**: Manages the KV cache index, allowing efficient retrieval of cached blocks.
 2. **KV Event Processing**: Handles events from vLLM to update the cache index.
 
+See the [Architecture Overview](architecture.md) for a high-level view of how these components work and interact.
+
 The two components are configured separately, but share the index backend for storing KV block localities.
 The latter is configured via the `kvBlockIndexConfig` field in the KV Cache Indexer configuration.
 
@@ -118,7 +120,7 @@ Configures the Redis-backed KV block index implementation.
 
 ### Token Processor Configuration (`TokenProcessorConfig`)
 
-Configures how tokens are converted to KV block keys.
+Configures how tokens are converted to KV-block keys.
 
 ```json
 {
@@ -221,15 +223,15 @@ For the ZMQ event processing pool:
 ---
 ## Notes
 
-1. **Hash Seed Alignment**: The `hash_seed` in `TokenProcessorConfig` should be aligned with vLLM's `PYTHONHASHSEED` environment variable to ensure consistent hashing across the system.
+1. **Hash Seed Alignment**: The `hashSeed` in `TokenProcessorConfig` should be aligned with vLLM's `PYTHONHASHSEED` environment variable to ensure consistent hashing across the system.
 
 2. **Memory Considerations**: The `size` parameter in `InMemoryIndexConfig` directly affects memory usage. Each key-value pair consumes memory proportional to the number of associated pods.
 
 3. **Performance Tuning**:
-   - Increase `workers_count` in tokenization config for higher tokenization throughput
+   - Increase `workersCount` in tokenization config for higher tokenization throughput
    - Adjust `concurrency` in event processing for better event handling performance
    - Tune cache sizes based on available memory and expected workload
 
-4. **Cache Directories**: Ensure the `tokenizers_cache_dir` has sufficient disk space and appropriate permissions for the application to read/write tokenizer files.
+4. **Cache Directories**: If used, ensure the `tokenizersCacheDir` has sufficient disk space and appropriate permissions for the application to read/write tokenizer files.
 
 5. **Redis Configuration**: When using Redis backend, ensure Redis server is accessible and has sufficient memory. The `address` field supports full Redis URLs including authentication: `redis://user:pass@host:port/db`.
````

docs/context-aware-routing.md

Lines changed: 0 additions & 9 deletions
This file was deleted.

docs/deployment/README.md

Lines changed: 4 additions & 0 deletions

```diff
@@ -0,0 +1,4 @@
+# Deployment
+
+See the [vLLM Deployment Chart](../../vllm-setup-helm/README.md) for deploying vLLM with the KV-cache indexer.
+Also see the [Examples](../../examples) directory for runnable examples.
```

docs/deployment/setup.md renamed to docs/deployment/v0.1.0/setup.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -72,7 +72,7 @@ You should see:
 
 ### Configuration Options
 
-The Helm chart supports various configuration options. See [values.yaml](../../vllm-setup-helm/values.yaml) for all available options.
+The Helm chart supports various configuration options. See [values.yaml](../../../vllm-setup-helm/values.yaml) for all available options.
 
 Key configuration parameters:
```
