## KV-Cache Indexer Overview
The major component of this project is the **KV-Cache Indexer**: a high-performance library that keeps a global, near-real-time view of KV-Cache block locality across a fleet of vLLM pods.
It is powered by `KVEvents` streamed from vLLM, which provide structured metadata as KV-blocks are created or evicted from a vLLM instance's KV-cache.
This allows the indexer to track which blocks reside on which nodes and on which tier (e.g., GPU or CPU).
This metadata is the foundation for intelligent routing, enabling schedulers to make optimal, KV-cache-aware placement decisions.
The diagram below shows the primary data flows: the **Read Path** (scoring) and the **Write Path** (event ingestion).
```mermaid
graph TD
    subgraph "Scheduler"
        A[Scheduler]
    end

    subgraph "KV-Cache Manager"
        B[KVCache Indexer API]
        C[KV-Block Index]
        D[Event Subscriber]
    end

    subgraph "vLLM Fleet"
        E[vLLM Pod 1]
        F[vLLM Pod 2]
        G[...]
    end

    A -->|"1. Score(prompt, pods)"| B
    B -->|2. Query Index| C
    B -->|3. Return Scores| A

    E -->|A. Emit KVEvents| D
    F -->|A. Emit KVEvents| D
    D -->|B. Update Index| C
```
_Note: 1-3 represent the Read Path for scoring pods, while A-B represent the Write Path for ingesting KVEvents._
1. **Scoring Request**: A scheduler asks the **KVCache Indexer** to score a set of pods for a given prompt.
2. **Index Query**: The indexer calculates the necessary KV-block keys from the prompt and queries the **KV-Block Index** to see which pods have those blocks.
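From the router's side, the two steps above collapse into a single scoring call. The sketch below is illustrative only: the `Indexer` type, its `Score` signature, and `bestPod` are hypothetical stand-ins for this interaction, not the project's actual Go API.

```go
package main

import "fmt"

// Indexer is an illustrative stand-in for the KVCache Indexer's scoring
// surface; the real API and types may differ.
type Indexer struct {
	// consecutive cached blocks per pod, as the index would report them
	cached map[string]int
}

// Score returns, for each candidate pod, how many consecutive KV-blocks
// of the prompt's prefix that pod already holds.
func (ix *Indexer) Score(prompt string, pods []string) map[string]int {
	scores := make(map[string]int, len(pods))
	for _, p := range pods {
		scores[p] = ix.cached[p] // pods unknown to the index score 0
	}
	return scores
}

// bestPod picks the highest-scoring pod, as a router might.
func bestPod(scores map[string]int, pods []string) string {
	best := pods[0]
	for _, p := range pods[1:] {
		if scores[p] > scores[best] {
			best = p
		}
	}
	return best
}

func main() {
	ix := &Indexer{cached: map[string]int{"vllm-pod-1": 3, "vllm-pod-2": 1}}
	pods := []string{"vllm-pod-1", "vllm-pod-2", "vllm-pod-3"}
	scores := ix.Score("Once upon a time", pods)
	fmt.Println(bestPod(scores, pods)) // vllm-pod-1 has the longest cached prefix
}
```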
-----

The following sections are from `docs/architecture.md`:
# KV-Cache Indexer: Architecture
The **KV-Cache Indexer** is a high-performance library that keeps a global, near-real-time view of KV-Cache block locality across a fleet of vLLM pods.
Its purpose is to enable smart routing and scheduling by exposing a fast, intelligent scoring mechanism for vLLM pods based on their cached KV-blocks.
-----
## System Architecture
The Indexer is built from several modules that work together, each with clear responsibilities.
Separating concerns is a guiding principle in the design of this system.
| Module | Purpose | Default Implementation |
|:---|:---|:---|
| **`kvcache.Indexer`** | The main orchestrator that handles scoring requests | Coordinates all internal modules |
| **`kvevents.Pool`** | Ingests and processes KV-cache events from vLLM pods | A sharded worker pool using ZMQ for event subscription |
| **`kvblock.Index`** | The core data store mapping KV-block hashes to pod locations | An in-memory, two-level LRU cache |
| **`tokenization.PrefixStore`** | Caches tokenized prompt prefixes to avoid re-work | An LRU cache storing text chunks and their corresponding tokens |
| **`kvblock.TokenProcessor`** | Converts token sequences into KV-block keys | Uses a chunking and hashing algorithm compatible with vLLM |
| **`kvblock.Scorer`** | Scores pods based on the sequence of cache hits | Implements a longest consecutive prefix matching strategy |
-----
The system has two primary data flows: the **Read Path** for scoring pods and the **Write Path** for ingesting cache events.
### Read Path: Scoring a Prompt
When a router needs to pick the best pod for a new prompt, it triggers the Read Path.
The goal is to find the pod that has the longest sequence of relevant KV-blocks already in its cache.
A list of pods with their scores is returned to the router.
```mermaid
sequenceDiagram
    %% ...
```
4. **Scoring**: The `Scorer` takes the hit data and scores each pod based on its number of consecutive matching blocks.
5. **Response**: A final map of pod scores is sent back to the router.
Note: step (1) means that the first time a prompt is scored, it may return an empty result while the tokenization happens in the background.
It is assumed that this cache will be populated with common prompts, so the first scoring request is an edge case.
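The scoring step above (longest consecutive prefix matching) can be sketched as follows. `scoreConsecutive` is a simplified stand-in for `kvblock.Scorer`: it takes the per-block hit sets an index query would produce, in prompt order, and counts each pod's unbroken run of leading hits.

```go
package main

import "fmt"

// scoreConsecutive counts, per pod, how many of the prompt's block keys it
// holds consecutively from the start. hits[i] is the set of pods holding the
// i-th block. Simplified stand-in for kvblock.Scorer's prefix strategy.
func scoreConsecutive(hits []map[string]bool, pods []string) map[string]int {
	scores := make(map[string]int, len(pods))
	for _, pod := range pods {
		n := 0
		for _, holders := range hits {
			if !holders[pod] {
				break // the chain of consecutive hits ends here
			}
			n++
		}
		scores[pod] = n
	}
	return scores
}

func main() {
	hits := []map[string]bool{
		{"pod-a": true, "pod-b": true}, // block 0: both pods
		{"pod-a": true},                // block 1: pod-b misses
		{"pod-a": true, "pod-b": true}, // block 2: pod-b's chain already broke
	}
	scores := scoreConsecutive(hits, []string{"pod-a", "pod-b"})
	fmt.Println(scores["pod-a"], scores["pod-b"]) // pod-a: 3, pod-b: 1
}
```

Note that pod-b scores 1, not 2: a hit on block 2 does not count once the chain broke at block 1, which is exactly why consecutive prefix length (rather than total hits) is the metric.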
### Write Path: Processing Cache Events
The Write Path keeps the index up-to-date by processing a constant stream of events from the vLLM fleet.
To guarantee compatibility, the indexer perfectly matches vLLM's content-addressing logic.
* **Token Chunking**: Prompts are converted to tokens, which are then grouped into fixed-size chunks (default: 16).
* **Hash Algorithm**: A chained hash is computed. Each block's key is the **lower 64 bits of a SHA-256 hash**, generated from the CBOR-encoded `[parentHash, tokenChunk, extraKeys]` tuple.
* **Initialization**: The hash chain starts with a configurable `HashSeed`. This value's source **must** align with the `PYTHONHASHSEED` environment variable in the vLLM pods to ensure hashes are consistent across the entire system.
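The chaining described above can be sketched in Go. One caveat: the real implementation CBOR-encodes the `[parentHash, tokenChunk, extraKeys]` tuple before hashing, while this sketch substitutes a plain binary encoding, so it illustrates the chaining and the lower-64-bit truncation but will not reproduce vLLM's actual key values.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// chainedBlockKeys sketches the chained-hash scheme: each block key is
// derived from the previous block's key plus the block's token chunk.
// NOTE: a stand-in serialization is used here instead of CBOR, so these
// keys demonstrate the structure only, not vLLM-compatible values.
func chainedBlockKeys(seed uint64, chunks [][]uint32) []uint64 {
	keys := make([]uint64, 0, len(chunks))
	parent := seed // the chain starts from the configurable seed
	for _, chunk := range chunks {
		buf := make([]byte, 8, 8+4*len(chunk))
		binary.BigEndian.PutUint64(buf, parent) // chain in the parent hash
		for _, tok := range chunk {
			buf = binary.BigEndian.AppendUint32(buf, tok)
		}
		sum := sha256.Sum256(buf)
		// take the lower 64 bits of the 256-bit digest as the block key
		parent = binary.BigEndian.Uint64(sum[24:])
		keys = append(keys, parent)
	}
	return keys
}

func main() {
	chunks := [][]uint32{{1, 2, 3}, {4, 5, 6}}
	fmt.Println(chainedBlockKeys(0, chunks))
}
```

Because each key folds in its parent, two prompts that share a prefix share exactly the keys of the common leading chunks, which is what makes prefix matching in the index possible.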
#### Index Backends
The `kvblock.Index` is an interface with swappable backends.
* **In-Memory (Default)**: A very fast, thread-safe, two-level LRU cache using `hashicorp/golang-lru`. The first level maps a block key to a second-level cache of pods that have the block. It prioritizes speed over persistence, which is usually the right trade-off for ephemeral cache data.
* **Redis (Optional)**: A distributed backend that can be shared by multiple indexer replicas. It can offer scalability and persistence, but this may be overkill given the short lifetime of most KV-cache blocks.
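The shape of the default backend can be sketched as a two-level map (block key → pod → tier). This is a deliberately simplified stand-in: it omits the LRU eviction and concurrency control that the real `hashicorp/golang-lru`-based implementation provides.

```go
package main

import "fmt"

// blockIndex is a simplified, non-evicting stand-in for the in-memory
// kvblock.Index: the first level maps a block key to the set of pods
// (with their cache tier) that hold it.
type blockIndex struct {
	blocks map[uint64]map[string]string // key -> pod -> tier ("gpu"/"cpu")
}

func newBlockIndex() *blockIndex {
	return &blockIndex{blocks: make(map[uint64]map[string]string)}
}

// Add records that a pod holds a block on a given tier (Write Path,
// e.g. on a BlockStored event).
func (ix *blockIndex) Add(key uint64, pod, tier string) {
	if ix.blocks[key] == nil {
		ix.blocks[key] = make(map[string]string)
	}
	ix.blocks[key][pod] = tier
}

// Remove drops a pod's entry for a block (e.g. on a BlockRemoved event).
func (ix *blockIndex) Remove(key uint64, pod string) {
	delete(ix.blocks[key], pod)
}

// Lookup returns the pods holding a block (Read Path).
func (ix *blockIndex) Lookup(key uint64) map[string]string {
	return ix.blocks[key]
}

func main() {
	ix := newBlockIndex()
	ix.Add(42, "vllm-pod-1", "gpu")
	ix.Add(42, "vllm-pod-2", "cpu")
	ix.Remove(42, "vllm-pod-2")
	fmt.Println(ix.Lookup(42)) // only vllm-pod-1 remains
}
```

The two-level layout matters because both dimensions need bounding independently: the number of tracked block keys and, per key, the number of pods, which is why the real implementation uses an LRU at each level.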
#### Tokenization Subsystem
Efficiently handling tokenization is critical for performance.
* **Tokenizer Caching**: The actual tokenization is handled by a `CachedHFTokenizer`, which wraps Hugging Face's high-performance Rust tokenizers. To avoid the overhead of repeatedly loading tokenizer models from disk, it maintains an LRU cache of active tokenizer instances.
* **PrefixStore Backends**: The token cache (`PrefixStore`) is an interface with two available implementations:
  * **`LRUTokenStore` (Default)**: This implementation chunks incoming text, hashes it, and stores blocks of tokens in an LRU cache. It's fast and memory-bounded, making it a reliable default. It's designed to find the longest chain of *blocks* that match a prompt's prefix. It is not the default due to its higher complexity and lower performance in most scenarios.
  * **`TrieTokenStore`**: An alternative implementation that uses a character-based trie. Each node in the trie stores information about the last token that was fully contained within the prefix leading to that node. This approach can be more memory-efficient for prompts with highly repetitive or overlapping prefixes, but is generally slower than the LRU-based store. It is not the default due to its higher complexity and lower performance in most scenarios.
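The `LRUTokenStore` idea can be sketched minimally, under loud assumptions: a toy 4-character chunk size, a plain map instead of an LRU, and position-keyed chunks. It shows how a lookup walks the prompt chunk by chunk from the start and stops at the first miss, yielding the tokens of the longest cached prefix.

```go
package main

import "fmt"

const chunkChars = 4 // illustrative chunk size; the real store uses larger text chunks

// prefixStore is a simplified, non-LRU stand-in for tokenization.PrefixStore:
// it maps fixed-size text chunks to the tokens produced for that chunk.
type prefixStore struct {
	chunks map[string][]uint32
}

// key ties a chunk to its position so identical text at different
// offsets does not collide.
func key(i int, chunk string) string { return fmt.Sprintf("%d:%s", i, chunk) }

// AddPrefix stores the tokens for each full chunk of a tokenized prompt;
// toks[i] holds the tokens for the i-th chunk of text.
func (s *prefixStore) AddPrefix(text string, toks [][]uint32) {
	for i := 0; i+chunkChars <= len(text) && i/chunkChars < len(toks); i += chunkChars {
		s.chunks[key(i/chunkChars, text[i:i+chunkChars])] = toks[i/chunkChars]
	}
}

// LongestPrefixTokens walks the prompt chunk by chunk from the start and
// concatenates cached tokens until the first miss.
func (s *prefixStore) LongestPrefixTokens(text string) []uint32 {
	var out []uint32
	for i := 0; i+chunkChars <= len(text); i += chunkChars {
		toks, ok := s.chunks[key(i/chunkChars, text[i:i+chunkChars])]
		if !ok {
			break // cache miss: the known prefix ends here
		}
		out = append(out, toks...)
	}
	return out
}

func main() {
	s := &prefixStore{chunks: make(map[string][]uint32)}
	s.AddPrefix("abcdefgh", [][]uint32{{10, 11}, {12}})
	fmt.Println(s.LongestPrefixTokens("abcdxxxx")) // only the first chunk matches
}
```

Any tokens past the cached prefix would still need a real tokenizer pass; the store only short-circuits the work for the shared leading chunks.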
-----
## Dependencies
The Indexer relies on several libraries and tools:
147
+
***[daulet/tokenizers](https://github.com/daulet/tokenizers)**: Go bindings for the HuggingFace Tokenizers library.
148
+
* Used for tokenization of prompts.
149
+
***[pebbe/zmq4](https://github.com/pebbe/zmq4)**: Go bindings for ZeroMQ.
150
+
* Used for the event processing pool and communication between components.
151
+
* Requires `libzmq` library to be installed on the system.
152
+
***Python**: Required to run a CGO binding for the `chat_completions_template` package.
153
+
* Used for jinja2 templating of chat completions requests.
-----

Configures how tokens are converted to KV-block keys. An illustrative `TokenProcessorConfig` snippet (field names are assumptions following the camelCase convention used in the Notes below; the default chunk size is 16):

```json
{
  "blockSize": 16,
  "hashSeed": ""
}
```
---
## Notes
1. **Hash Seed Alignment**: The `hashSeed` in `TokenProcessorConfig` should be aligned with vLLM's `PYTHONHASHSEED` environment variable to ensure consistent hashing across the system.
2. **Memory Considerations**: The `size` parameter in `InMemoryIndexConfig` directly affects memory usage. Each key-value pair consumes memory proportional to the number of associated pods.
3. **Performance Tuning**:
   - Increase `workersCount` in tokenization config for higher tokenization throughput
   - Adjust `concurrency` in event processing for better event handling performance
   - Tune cache sizes based on available memory and expected workload
4. **Cache Directories**: If used, ensure the `tokenizersCacheDir` has sufficient disk space and appropriate permissions for the application to read/write tokenizer files.
5. **Redis Configuration**: When using the Redis backend, ensure the Redis server is accessible and has sufficient memory. The `address` field supports full Redis URLs including authentication: `redis://user:pass@host:port/db`.