# Chat Template Integration for OpenAI-API Compatibility

## Why Templating is Needed

When processing OpenAI ChatCompletions requests, the model doesn't see the raw message structure. Instead, it sees a **flattened prompt** created by applying a model-specific chat template (Jinja2 format) that converts messages, tools, and documents into the exact text the model will tokenize.

**Example:**
```json
// Input: ChatCompletions request
{
  "messages": [
    {"role": "user", "content": "What's 2+2?"},
    {"role": "assistant", "content": "Let me calculate that."},
    {"role": "user", "content": "Thanks!"}
  ]
}
```

```jinja2
<!-- Model template (e.g., Llama-2) -->
{% for message in messages %}
{% if message['role'] == 'user' %}
{{ '<s>[INST] ' + message['content'] + ' [/INST]' }}
{% elif message['role'] == 'assistant' %}
{{ message['content'] + '</s>' }}
{% endif %}
{% endfor %}
```

```text
<!-- Flattened prompt the model actually sees -->
<s>[INST] What's 2+2? [/INST]Let me calculate that.</s><s>[INST] Thanks! [/INST]
```

**Without templating**, we'd tokenize the raw JSON structure, producing completely different tokens than what the model will actually process, leading to incorrect KV cache lookups.

## Integration with Existing Pipeline

The chat template integration adds a **pre-processing step** to the existing KV cache pipeline:

1. **Template Fetching**: Get the model-specific chat template from Hugging Face
2. **Template Rendering**: Apply the Jinja2 template to flatten the request structure
3. **Continue with existing pipeline**: Tokenize → KV Block Keys → Pod Scoring

See the main documentation for the complete pipeline details.

## Usage

### Unified API

The indexer provides a unified `GetPodScores()` function that handles both regular prompts and chat completion requests:

```go
// For regular prompts (default behavior)
scores, err := indexer.GetPodScores(ctx, prompt, modelName, podIdentifiers, false)
// or use the convenience function
scores, err := indexer.GetPodScoresDefault(ctx, prompt, modelName, podIdentifiers)

// For chat completion requests (prompt is a JSON-encoded ChatTemplateRequest)
scores, err := indexer.GetPodScores(ctx, prompt, modelName, podIdentifiers, true)
```
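
The `chatCompletion` flag selects between the two underlying scoring paths: a plain-prompt path and a chat-template path. Here is a minimal sketch of that dispatch; the helper names, receiver type, and score-map type are illustrative assumptions, not the actual implementation:

```go
// Sketch only — internal helper names and the score-map type are assumptions.
func (k *Indexer) GetPodScores(ctx context.Context, prompt, modelName string,
	podIdentifiers []string, chatCompletion bool) (map[string]int, error) {
	if !chatCompletion {
		// Plain prompt: tokenize and score directly.
		return k.getPromptPodScores(ctx, prompt, modelName, podIdentifiers)
	}
	// Chat completion: the prompt argument carries a JSON-encoded
	// ChatTemplateRequest that must be rendered before tokenization.
	var req chattemplatego.ChatTemplateRequest
	if err := json.Unmarshal([]byte(prompt), &req); err != nil {
		return nil, fmt.Errorf("invalid ChatTemplateRequest JSON: %w", err)
	}
	return k.getCompletionsPodScores(ctx, req, modelName, podIdentifiers)
}
```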

### For ChatCompletions Requests

The router can receive a standard OpenAI ChatCompletions request and convert it to a JSON string representing our `ChatTemplateRequest`:

**`ChatTemplateRequest` accepts these fields:**

- `Conversations` - List of message lists (role/content pairs)
- `Tools` - (Optional) List of tool schemas
- `Documents` - (Optional) List of document dicts
- `ChatTemplate` - (Optional) Override for the chat template
- `ReturnAssistantTokensMask` - (Optional) Whether to return assistant token indices
- `ContinueFinalMessage` - (Optional) Whether to continue from the final message
- `AddGenerationPrompt` - (Optional) Whether to add a generation prompt
- `TemplateVars` - (Optional) Special tokens for template rendering
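
For reference, these fields map onto the `chattemplatego` Go structs:

```go
// ChatTemplateRequest represents the request to render a chat template
type ChatTemplateRequest struct {
	Conversations             [][]ChatMessage        `json:"conversations"`
	Tools                     []interface{}          `json:"tools,omitempty"`
	Documents                 []interface{}          `json:"documents,omitempty"`
	ChatTemplate              string                 `json:"chat_template,omitempty"`
	ReturnAssistantTokensMask bool                   `json:"return_assistant_tokens_mask,omitempty"`
	ContinueFinalMessage      bool                   `json:"continue_final_message,omitempty"`
	AddGenerationPrompt       bool                   `json:"add_generation_prompt,omitempty"`
	TemplateVars              map[string]interface{} `json:"template_vars,omitempty"`
}

// ChatMessage represents a single message in a conversation
type ChatMessage struct {
	Role    string `json:"role"`    // e.g., "user", "assistant", "system"
	Content string `json:"content"` // the message text
}
```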

```json
// Input: OpenAI ChatCompletions request
{
  "model": "llama-2-7b-chat",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the weather in Paris?"},
    {"role": "assistant", "content": "Let me check that for you."},
    {"role": "user", "content": "Thanks!"}
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get weather information",
        "parameters": {...}
      }
    }
  ],
  "documents": [
    {
      "type": "text",
      "content": "Paris weather data..."
    }
  ]
}
```

```go
// Converted to ChatTemplateRequest and then to JSON string
req := chattemplatego.ChatTemplateRequest{
	Conversations: [][]chattemplatego.ChatMessage{
		{
			{Role: "system", Content: "You are a helpful assistant."},
			{Role: "user", Content: "What's the weather in Paris?"},
			{Role: "assistant", Content: "Let me check that for you."},
			{Role: "user", Content: "Thanks!"},
		},
	},
	Tools:     []interface{}{...}, // From OpenAI request
	Documents: []interface{}{...}, // From OpenAI request
	// Other fields are optional and can be set as needed
}

// Convert to JSON string for the unified API
reqJSON, err := json.Marshal(req)
if err != nil {
	return err
}

scores, err := indexer.GetPodScores(ctx, string(reqJSON), modelName, podIdentifiers, true)
```

### Template Processing Flow

The templating process (steps 1.1-1.4) handles the conversion from structured request to flattened prompt:

```
1.1. **CGO Binding**: chattemplatego.NewChatTemplateCGoWrapper()
     └── cgo_functions.go:NewChatTemplateCGoWrapper()
         └── Creates ChatTemplateCGoWrapper struct with initialized=false

1.2. **Template Fetching**: wrapper.GetModelChatTemplate(getReq)
     ├── cgo_functions.go:GetModelChatTemplate(req)
     │   ├── Initialize() Python interpreter via CGO
     │   ├── executePythonCode() - **CGO Binding** to Python
     │   └── **Python Wrapper**: chat_template_wrapper.py:get_model_chat_template()
     │       └── Uses Hugging Face AutoTokenizer to fetch the model template
     └── Returns: (template, template_vars)

1.3. **Template Rendering**: wrapper.RenderChatTemplate(req)
     ├── cgo_functions.go:RenderChatTemplate(req)
     │   ├── Initialize() Python interpreter via CGO (if not already done)
     │   ├── executePythonCode() - **CGO Binding** to Python
     │   └── **Python Wrapper**: chat_template_wrapper.py:render_jinja_template()
     │       ├── Imports render_jinja_template from transformers.utils.chat_template_utils
     │       └── Uses the transformers library's core template rendering functionality
     └── Returns: ChatTemplateResponse

1.4. **Extract Flattened Prompt**
     └── prompt := resp.RenderedChats[0]
         └── Continue with existing pipeline: Tokenize → KV Block Keys → Pod Scoring
```
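
Expressed as Go calls, the four steps compose roughly as follows. This is a sketch based on the flow above: `RenderChatTemplate` and `resp.RenderedChats[0]` appear in the flow, while the return shape of `GetModelChatTemplate` and the template-injection step are assumptions.

```go
// 1.1: create the CGO wrapper (Python interpreter initialized lazily).
wrapper := chattemplatego.NewChatTemplateCGoWrapper()

// 1.2: fetch the model-specific template and its special-token vars.
// getReq identifies the model; its construction is omitted here, and
// this return shape is an assumption.
template, templateVars, err := wrapper.GetModelChatTemplate(getReq)
if err != nil {
	return err
}

// 1.3: render the Jinja2 template over the ChatTemplateRequest.
// (Assumption: the fetched template is injected via these fields.)
req.ChatTemplate = template
req.TemplateVars = templateVars
resp, err := wrapper.RenderChatTemplate(req)
if err != nil {
	return err
}

// 1.4: extract the flattened prompt and hand it to the existing pipeline:
// Tokenize → KV Block Keys → Pod Scoring.
prompt := resp.RenderedChats[0]
_ = prompt
```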

### API Functions

- **`GetPodScores(ctx, prompt, modelName, podIdentifiers, chatCompletion)`** - Unified function that handles both regular prompts and chat completions
- **`GetPodScoresDefault(ctx, prompt, modelName, podIdentifiers)`** - Convenience function for regular prompts (equivalent to `GetPodScores` with `chatCompletion=false`)
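
Given that equivalence, the convenience function is effectively a one-line forwarder (sketch; receiver and return types assumed as above):

```go
// Sketch: forwards to the unified API with chatCompletion=false.
func (k *Indexer) GetPodScoresDefault(ctx context.Context, prompt, modelName string,
	podIdentifiers []string) (map[string]int, error) {
	return k.GetPodScores(ctx, prompt, modelName, podIdentifiers, false)
}
```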

The integration ensures tokenization matches exactly what the model will process, enabling accurate KV cache lookups for chat completion requests.