# HOWTO: Using `GetCompletionsPodScores` for OpenAI-API ChatCompletions Requests with kv-cache-manager

## Overview

`GetCompletionsPodScores` in `indexer.go` enables the kv-cache-manager to support OpenAI-compatible ChatCompletions requests by rendering the full message structure (including tools and documents) into a prompt with a Python Jinja2 template, before tokenization and KV block key calculation.

---

## What struct do I need to receive from the router?

You must provide a `chattemplatego.ChatTemplateRequest` with the following fields:

```go
// ChatTemplateRequest represents the request to render a chat template
type ChatTemplateRequest struct {
	Conversations             [][]ChatMessage        `json:"conversations"`
	Tools                     []interface{}          `json:"tools,omitempty"`
	Documents                 []interface{}          `json:"documents,omitempty"`
	ChatTemplate              string                 `json:"chat_template,omitempty"`
	ReturnAssistantTokensMask bool                   `json:"return_assistant_tokens_mask,omitempty"`
	ContinueFinalMessage      bool                   `json:"continue_final_message,omitempty"`
	AddGenerationPrompt       bool                   `json:"add_generation_prompt,omitempty"`
	TemplateVars              map[string]interface{} `json:"template_vars,omitempty"`
}
```

- **Conversations**: List of message lists (role/content pairs)
- **Tools**: (Optional) List of tool schemas
- **Documents**: (Optional) List of document dicts
- **ChatTemplate**: (Optional) Override for the chat template
- **ReturnAssistantTokensMask**: (Optional) Whether to return assistant token indices
- **ContinueFinalMessage**: (Optional) Whether to continue from the final message
- **AddGenerationPrompt**: (Optional) Whether to add a generation prompt
- **TemplateVars**: (Optional) Special tokens for template rendering

This struct mirrors the OpenAI ChatCompletions request, supporting messages, tools, documents, and advanced template options.

### ChatMessage Struct

The `ChatMessage` struct represents individual messages within conversations:

```go
// ChatMessage represents a single message in a conversation
type ChatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}
```

- **Role**: The role of the message sender (e.g., "user", "assistant", "system")
- **Content**: The actual message content/text

**Example usage:**
```go
conversation := []chattemplatego.ChatMessage{
	{Role: "user", Content: "What is the weather in Paris?"},
	{Role: "assistant", Content: "Let me check that for you."},
	{Role: "user", Content: "Thank you!"},
}
```

This structure follows the OpenAI ChatCompletions API format, making it compatible with existing chat-based applications.

---

## How do the three scoring functions differ?

- **`GetPromptPodScores`**:
  Accepts a simple prompt string, tokenizes it, and calculates KV block keys directly.

- **`GetCompletionsPodScores`**:
  Accepts a full `ChatTemplateRequest` (with messages, tools, etc.), uses the Python Jinja2 template (via CGO) to flatten the structure into a prompt, then tokenizes and calculates KV block keys. This ensures the prompt matches what the model actually sees.

- **`GetPodScores`**:
  A unified interface that automatically dispatches to either `GetPromptPodScores` or `GetCompletionsPodScores` based on the input type:
  - If the input is a `string` → calls `GetPromptPodScores`
  - If the input is a `ChatTemplateRequest` → calls `GetCompletionsPodScores`
  - This provides a single entry point for both simple prompts and complex chat completions.
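
The dispatch described above can be sketched with a Go type switch. This is an illustration of the pattern, not the real implementation — the struct is trimmed and the returned strings merely name the path taken:

```go
package main

import "fmt"

// Minimal stand-in for the real request type; fields trimmed for brevity.
type ChatMessage struct {
	Role, Content string
}

type ChatTemplateRequest struct {
	Conversations [][]ChatMessage
}

// dispatch routes on the dynamic type of its input, mirroring how a unified
// entry point like GetPodScores can serve both prompt and chat requests.
func dispatch(input interface{}) string {
	switch v := input.(type) {
	case string:
		return fmt.Sprintf("GetPromptPodScores(%q)", v)
	case ChatTemplateRequest:
		return fmt.Sprintf("GetCompletionsPodScores(%d conversation(s))", len(v.Conversations))
	default:
		return "error: unsupported input type"
	}
}

func main() {
	fmt.Println(dispatch("Hello"))
	fmt.Println(dispatch(ChatTemplateRequest{
		Conversations: [][]ChatMessage{{{Role: "user", Content: "Hi"}}},
	}))
}
```

A `default` arm is worth keeping in any real dispatcher so unsupported payload types fail loudly instead of silently scoring nothing.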

---

## Detailed Flow: `GetCompletionsPodScores` Pipeline

When `indexer.go:GetCompletionsPodScores()` is called, here's the complete flow through files and functions:

```
1. indexer.go:GetCompletionsPodScores(ctx, req, modelName, podIdentifiers)
   │
   ├── 1.1. **CGO Binding**: chattemplatego.NewChatTemplateCGoWrapper()
   │    └── cgo_functions.go:NewChatTemplateCGoWrapper()
   │         └── Creates ChatTemplateCGoWrapper struct with initialized=false
   │
   ├── 1.2. **CGO Binding**: wrapper.GetModelChatTemplate(getReq)
   │    ├── cgo_functions.go:GetModelChatTemplate(req)
   │    │    ├── Initialize() Python interpreter via CGO
   │    │    ├── executePythonCode() - **CGO Binding** to Python
   │    │    └── **Python Wrapper**: chat_template_wrapper.py:get_model_chat_template()
   │    │         └── Uses Hugging Face AutoTokenizer to fetch the model template
   │    └── Returns: (template, template_vars)
   │
   ├── 1.3. **CGO Binding**: wrapper.RenderChatTemplate(req)
   │    ├── cgo_functions.go:RenderChatTemplate(req)
   │    │    ├── Initialize() Python interpreter via CGO (if not already done)
   │    │    ├── executePythonCode() - **CGO Binding** to Python
   │    │    └── **Python Wrapper**: chat_template_wrapper.py:render_jinja_template()
   │    │         ├── _compile_jinja_template() - Compiles the Jinja2 template
   │    │         ├── AssistantTracker class - Tracks assistant token indices
   │    │         └── Returns: (rendered_chats, generation_indices)
   │    └── Returns: ChatTemplateResponse
   │
   ├── 1.4. Extract prompt from response
   │    └── prompt := resp.RenderedChats[0]
   │
   ├── 1.5. **Tokenization**: k.tokenizersPool.AddTask(prompt, modelName)
   │    └── tokenization/pool.go:AddTask() - Queues tokenization task
   │
   ├── 1.6. **Prefix Store**: k.tokensIndexer.FindLongestContainedTokens(prompt, modelName)
   │    └── prefixstore/lru-store.go:FindLongestContainedTokens() - Finds cached tokens
   │
   ├── 1.7. **Token Processing**: k.tokensProcessor.TokensToKVBlockKeys(tokens, modelName)
   │    └── kv-cache/token-processor.go:TokensToKVBlockKeys() - Converts tokens to block keys
   │
   ├── 1.8. **KV Block Indexing**: k.kvBlockIndexer.GetPodsForKeys(ctx, blockKeys, podSet)
   │    └── kv-cache/kvblock-indexer.go:GetPodsForKeys() - Queries Redis for pod mappings
   │
   └── 1.9. **Scoring**: k.kvBlockScorer.Score(strBlockKeys, keyToPods)
        └── kv-cache/kvblock-scorer.go:Score() - Calculates pod scores
```

### Key Components in the Pipeline

**🔗 CGO Bindings** (Go → Python):
- `cgo_functions.go` - Provides the bridge between Go and Python
- Uses Python's C API via CGO to call Python functions directly
- Manages the Python interpreter lifecycle (Initialize/Finalize)

**📦 Python Wrapper** (Python → Hugging Face):
- `chat_template_wrapper.py` - Wraps Hugging Face's complex template system
- Provides a clean API for template rendering and model template fetching
- Handles Jinja2 compilation, assistant tracking, and error handling

**🔄 Data Flow**:
1. **Input**: `ChatTemplateRequest` (messages, tools, documents)
2. **Template Fetching**: Model-specific chat template from Hugging Face
3. **Template Rendering**: Jinja2 template processing with tools/documents
4. **Tokenization**: Convert the rendered prompt to tokens
5. **KV Cache Lookup**: Find cached token blocks and associated pods
6. **Scoring**: Calculate pod scores based on cache hits

This pipeline ensures that chat completion requests are properly templated, tokenized, and scored against the KV cache, providing accurate pod recommendations for efficient request routing.
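
The data flow above can be sketched end to end with stub stages. Everything here is a toy stand-in: template rendering, tokenization, and the Redis lookup are replaced by in-memory versions, and the block size and function names are illustrative only — the real stages live in `cgo_functions.go`, `tokenization/`, `prefixstore/`, and `kv-cache/`.

```go
package main

import (
	"fmt"
	"strings"
)

// renderTemplate stands in for the Jinja2 rendering step (1.2-1.4).
func renderTemplate(messages []string) string { return strings.Join(messages, "\n") }

// tokenize stands in for the tokenizer pool (1.5); real code produces token IDs.
func tokenize(prompt string) []string { return strings.Fields(prompt) }

// tokensToBlockKeys groups tokens into fixed-size blocks and derives one key
// per block (1.7); real code hashes token IDs, and the block size is illustrative.
func tokensToBlockKeys(tokens []string) []string {
	const blockSize = 2
	var keys []string
	for i := 0; i < len(tokens); i += blockSize {
		end := i + blockSize
		if end > len(tokens) {
			end = len(tokens)
		}
		keys = append(keys, strings.Join(tokens[i:end], "|"))
	}
	return keys
}

// scorePods counts cache hits per pod (1.8-1.9): one point for every block
// key a pod already holds.
func scorePods(keys []string, keyToPods map[string][]string) map[string]int {
	scores := map[string]int{}
	for _, k := range keys {
		for _, pod := range keyToPods[k] {
			scores[pod]++
		}
	}
	return scores
}

func main() {
	prompt := renderTemplate([]string{"user: What is the weather in Paris?"})
	keys := tokensToBlockKeys(tokenize(prompt))
	// Pretend pod-a has the first block cached; real data comes from Redis.
	keyToPods := map[string][]string{keys[0]: {"pod-a"}}
	fmt.Println(scorePods(keys, keyToPods))
}
```

The key property the sketch preserves is that scoring operates on block keys derived from the *rendered* prompt — which is why skipping the template step would produce keys that never match the model's actual KV cache.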

---

## Summary

- The router should send a `ChatTemplateRequest` (not just a prompt string) to the indexer.
- `GetCompletionsPodScores` handles template rendering and tokenization internally, ensuring correct KV block key calculation for all supported models.
- The integration uses a CGO bridge (`cgo_functions.go`) to call Python (`chat_template_wrapper.py`) for template rendering, matching vLLM and OpenAI API behavior.