# Chat Template Integration for OpenAI-API Compatibility

## Why Templating is Needed

When processing OpenAI ChatCompletions requests, the model doesn't see the raw message structure. Instead, it sees a **flattened prompt** created by applying a model-specific chat template (Jinja2 format) that converts messages, tools, and documents into the exact text the model will tokenize.

**Example:**
```json
// Input: ChatCompletions request
{
  "messages": [
    {"role": "user", "content": "What's 2+2?"},
    {"role": "assistant", "content": "Let me calculate that."},
    {"role": "user", "content": "Thanks!"}
  ]
}
```

```jinja2
<!-- Model template (e.g., Llama-2) -->
{% for message in messages %}
{% if message['role'] == 'user' %}
{{ '<s>[INST] ' + message['content'] + ' [/INST]' }}
{% elif message['role'] == 'assistant' %}
{{ message['content'] + '</s>' }}
{% endif %}
{% endfor %}
```

```text
<!-- Flattened prompt the model actually sees -->
<s>[INST] What's 2+2? [/INST]Let me calculate that.</s><s>[INST] Thanks! [/INST]
```

**Without templating**, we'd tokenize the raw JSON structure, producing completely different tokens than what the model will actually process, leading to incorrect KV cache lookups.

## Integration with Existing Pipeline

The chat template integration adds a **pre-processing step** to the existing KV cache pipeline:

1. **Template Fetching**: Get the model-specific chat template from Hugging Face
2. **Template Rendering**: Apply the Jinja2 template to flatten the request structure
3. **Continue with existing pipeline**: Tokenize → KV Block Keys → Pod Scoring

See the main documentation for the complete pipeline details.

## Usage

### Unified API

The indexer provides a unified `GetPodScores()` function that handles both regular prompts and chat completion requests:

```go
// For regular prompts (default behavior)
scores, err := indexer.GetPodScores(ctx, prompt, modelName, podIdentifiers, false)
// or use the convenience function
scores, err := indexer.GetPodScoresDefault(ctx, prompt, modelName, podIdentifiers)

// For chat completion requests (prompt is a JSON-encoded ChatTemplateRequest)
scores, err := indexer.GetPodScores(ctx, prompt, modelName, podIdentifiers, true)
```
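
The `chatCompletion` flag selects between the two underlying scoring paths: a plain-prompt path and a chat-template path. Here is a minimal sketch of that dispatch; the helper names, receiver type, and score-map type are illustrative assumptions, not the actual implementation:

```go
// Sketch only — internal helper names and the score-map type are assumptions.
func (k *Indexer) GetPodScores(ctx context.Context, prompt, modelName string,
	podIdentifiers []string, chatCompletion bool) (map[string]int, error) {
	if !chatCompletion {
		// Plain prompt: tokenize and score directly.
		return k.getPromptPodScores(ctx, prompt, modelName, podIdentifiers)
	}
	// Chat completion: the prompt argument carries a JSON-encoded
	// ChatTemplateRequest that must be rendered before tokenization.
	var req chattemplatego.ChatTemplateRequest
	if err := json.Unmarshal([]byte(prompt), &req); err != nil {
		return nil, fmt.Errorf("invalid ChatTemplateRequest JSON: %w", err)
	}
	return k.getCompletionsPodScores(ctx, req, modelName, podIdentifiers)
}
```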

### For ChatCompletions Requests

The router can receive a standard OpenAI ChatCompletions request and convert it to a JSON string representing our `ChatTemplateRequest`:

**`ChatTemplateRequest` accepts these fields:**

- `Conversations` - List of message lists (role/content pairs)
- `Tools` - (Optional) List of tool schemas
- `Documents` - (Optional) List of document dicts
- `ChatTemplate` - (Optional) Override for the chat template
- `ReturnAssistantTokensMask` - (Optional) Whether to return assistant token indices
- `ContinueFinalMessage` - (Optional) Whether to continue from the final message
- `AddGenerationPrompt` - (Optional) Whether to add a generation prompt
- `TemplateVars` - (Optional) Special tokens for template rendering
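
For reference, these fields map onto the `chattemplatego` Go structs:

```go
// ChatTemplateRequest represents the request to render a chat template
type ChatTemplateRequest struct {
	Conversations             [][]ChatMessage        `json:"conversations"`
	Tools                     []interface{}          `json:"tools,omitempty"`
	Documents                 []interface{}          `json:"documents,omitempty"`
	ChatTemplate              string                 `json:"chat_template,omitempty"`
	ReturnAssistantTokensMask bool                   `json:"return_assistant_tokens_mask,omitempty"`
	ContinueFinalMessage      bool                   `json:"continue_final_message,omitempty"`
	AddGenerationPrompt       bool                   `json:"add_generation_prompt,omitempty"`
	TemplateVars              map[string]interface{} `json:"template_vars,omitempty"`
}

// ChatMessage represents a single message in a conversation
type ChatMessage struct {
	Role    string `json:"role"`    // e.g., "user", "assistant", "system"
	Content string `json:"content"` // the message text
}
```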

```json
// Input: OpenAI ChatCompletions request
{
  "model": "llama-2-7b-chat",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the weather in Paris?"},
    {"role": "assistant", "content": "Let me check that for you."},
    {"role": "user", "content": "Thanks!"}
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get weather information",
        "parameters": {...}
      }
    }
  ],
  "documents": [
    {
      "type": "text",
      "content": "Paris weather data..."
    }
  ]
}
```

```go
// Converted to ChatTemplateRequest and then to JSON string
req := chattemplatego.ChatTemplateRequest{
	Conversations: [][]chattemplatego.ChatMessage{
		{
			{Role: "system", Content: "You are a helpful assistant."},
			{Role: "user", Content: "What's the weather in Paris?"},
			{Role: "assistant", Content: "Let me check that for you."},
			{Role: "user", Content: "Thanks!"},
		},
	},
	Tools:     []interface{}{...}, // From OpenAI request
	Documents: []interface{}{...}, // From OpenAI request
	// Other fields are optional and can be set as needed
}

// Convert to JSON string for the unified API
reqJSON, err := json.Marshal(req)
if err != nil {
	return err
}

scores, err := indexer.GetPodScores(ctx, string(reqJSON), modelName, podIdentifiers, true)
```

### Template Processing Flow

The templating process (steps 1.1-1.4) handles the conversion from structured request to flattened prompt:

```
1.1. **CGO Binding**: chattemplatego.NewChatTemplateCGoWrapper()
     └── cgo_functions.go:NewChatTemplateCGoWrapper()
         └── Creates ChatTemplateCGoWrapper struct with initialized=false

1.2. **Template Fetching**: wrapper.GetModelChatTemplate(getReq)
     ├── cgo_functions.go:GetModelChatTemplate(req)
     │   ├── Initialize() Python interpreter via CGO
     │   ├── executePythonCode() - **CGO Binding** to Python
     │   └── **Python Wrapper**: chat_template_wrapper.py:get_model_chat_template()
     │       └── Uses Hugging Face AutoTokenizer to fetch the model template
     └── Returns: (template, template_vars)

1.3. **Template Rendering**: wrapper.RenderChatTemplate(req)
     ├── cgo_functions.go:RenderChatTemplate(req)
     │   ├── Initialize() Python interpreter via CGO (if not already done)
     │   ├── executePythonCode() - **CGO Binding** to Python
     │   └── **Python Wrapper**: chat_template_wrapper.py:render_jinja_template()
     │       ├── Imports render_jinja_template from transformers.utils.chat_template_utils
     │       └── Uses the transformers library's core template rendering functionality
     └── Returns: ChatTemplateResponse

1.4. **Extract Flattened Prompt**
     └── prompt := resp.RenderedChats[0]
         └── Continue with existing pipeline: Tokenize → KV Block Keys → Pod Scoring
```
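
Expressed as Go calls, the four steps compose roughly as follows. This is a sketch based on the flow above: `RenderChatTemplate` and `resp.RenderedChats[0]` appear in the flow, while the return shape of `GetModelChatTemplate` and the template-injection step are assumptions.

```go
// 1.1: create the CGO wrapper (Python interpreter initialized lazily).
wrapper := chattemplatego.NewChatTemplateCGoWrapper()

// 1.2: fetch the model-specific template and its special-token vars.
// getReq identifies the model; its construction is omitted here, and
// this return shape is an assumption.
template, templateVars, err := wrapper.GetModelChatTemplate(getReq)
if err != nil {
	return err
}

// 1.3: render the Jinja2 template over the ChatTemplateRequest.
// (Assumption: the fetched template is injected via these fields.)
req.ChatTemplate = template
req.TemplateVars = templateVars
resp, err := wrapper.RenderChatTemplate(req)
if err != nil {
	return err
}

// 1.4: extract the flattened prompt and hand it to the existing pipeline:
// Tokenize → KV Block Keys → Pod Scoring.
prompt := resp.RenderedChats[0]
_ = prompt
```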

### API Functions

- **`GetPodScores(ctx, prompt, modelName, podIdentifiers, chatCompletion)`** - Unified function that handles both regular prompts and chat completions
- **`GetPodScoresDefault(ctx, prompt, modelName, podIdentifiers)`** - Convenience function for regular prompts (equivalent to `GetPodScores` with `chatCompletion=false`)
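
Given that equivalence, the convenience function is effectively a one-line forwarder (sketch; receiver and return types assumed as above):

```go
// Sketch: forwards to the unified API with chatCompletion=false.
func (k *Indexer) GetPodScoresDefault(ctx context.Context, prompt, modelName string,
	podIdentifiers []string) (map[string]int, error) {
	return k.GetPodScores(ctx, prompt, modelName, podIdentifiers, false)
}
```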

The integration ensures tokenization matches exactly what the model will process, enabling accurate KV cache lookups for chat completion requests.