Commit 5db1e8e

guygirv authored and Maroon committed
Refactor chat template integration
Signed-off-by: Guy Girmonsky <[email protected]>
1 parent 022a2ec · commit 5db1e8e

File tree

12 files changed: 234 additions, 753 deletions

.gitignore

Lines changed: 7 additions & 0 deletions

@@ -15,6 +15,12 @@
 # Dependency directories
 vendor/

+# Python cache
+__pycache__/
+*.pyc
+*.pyo
+*.pyd
+
 # Go workspace file
 go.work
 go.work.sum
@@ -32,6 +38,7 @@ go.work.sum
 *.swo
 *.bak
 *.tmp
+*.code-workspace

 # OS-specific
 .DS_Store

Makefile

Lines changed: 0 additions & 5 deletions

@@ -23,12 +23,7 @@ help: ## Print help
 	@awk 'BEGIN {FS = ":.*##"; printf "\nUsage:\n  make \033[36m<target>\033[0m\n"} /^[a-zA-Z_0-9-]+:.*?##/ { printf "  \033[36m%-15s\033[0m %s\n", $$1, $$2 } /^##@/ { printf "\n\033[1m%s\033[0m\n", substr($$0, 5) } ' $(MAKEFILE_LIST)

 ##@ Tokenizer & Linking
-##
-
-
-##
 LDFLAGS ?= -extldflags '-L$(shell pwd)/lib'
-#LDFLAGS ?= -extldflags '-L$(shell pwd)/lib'
 CGO_ENABLED=1
 TOKENIZER_LIB = lib/libtokenizers.a
 TOKENIZER_RELEASE = v1.20.2

Lines changed: 139 additions & 130 deletions

@@ -1,158 +1,167 @@
-# HOWTO: Using `GetCompletionsPodScores` for OpenAI-API ChatCompletions Requests with kv-cache-manager
+# Chat Template Integration for OpenAI-API Compatibility

-## Overview
+## Why Templating is Needed

-`GetCompletionsPodScores` in `indexer.go` enables the kv-cache-manager to support OpenAI-compatible ChatCompletions requests by rendering the full message structure (including tools and documents) into a prompt using a Python Jinja2 template, before tokenization and KV block key calculation.
+When processing OpenAI ChatCompletions requests, the model doesn't see the raw message structure. Instead, it sees a **flattened prompt** created by applying a model-specific chat template (Jinja2 format) that converts messages, tools, and documents into the exact text the model will tokenize.

----
-
-## What struct do I need to receive from the router?
-
-You must provide a `chattemplatego.ChatTemplateRequest` with the following fields:
-
-```go
-// ChatTemplateRequest represents the request to render a chat template
-type ChatTemplateRequest struct {
-    Conversations             [][]ChatMessage        `json:"conversations"`
-    Tools                     []interface{}          `json:"tools,omitempty"`
-    Documents                 []interface{}          `json:"documents,omitempty"`
-    ChatTemplate              string                 `json:"chat_template,omitempty"`
-    ReturnAssistantTokensMask bool                   `json:"return_assistant_tokens_mask,omitempty"`
-    ContinueFinalMessage      bool                   `json:"continue_final_message,omitempty"`
-    AddGenerationPrompt       bool                   `json:"add_generation_prompt,omitempty"`
-    TemplateVars              map[string]interface{} `json:"template_vars,omitempty"`
-}
-```
-
-- **Conversations**: List of message lists (role/content pairs)
-- **Tools**: (Optional) List of tool schemas
-- **Documents**: (Optional) List of document dicts
-- **ChatTemplate**: (Optional) Override for the chat template
-- **ReturnAssistantTokensMask**: (Optional) Whether to return assistant token indices
-- **ContinueFinalMessage**: (Optional) Whether to continue from the final message
-- **AddGenerationPrompt**: (Optional) Whether to add a generation prompt
-- **TemplateVars**: (Optional) Special tokens for template rendering
-
-This struct mirrors the OpenAI ChatCompletions request, supporting messages, tools, documents, and advanced template options.
-
-### ChatMessage Struct
-
-The `ChatMessage` struct represents individual messages within conversations:
-
-```go
-// ChatMessage represents a single message in a conversation
-type ChatMessage struct {
-    Role    string `json:"role"`
-    Content string `json:"content"`
+**Example:**
+```json
+// Input: ChatCompletions request
+{
+  "messages": [
+    {"role": "user", "content": "What's 2+2?"},
+    {"role": "assistant", "content": "Let me calculate that."},
+    {"role": "user", "content": "Thanks!"}
+  ]
 }
 ```

-- **Role**: The role of the message sender (e.g., "user", "assistant", "system")
-- **Content**: The actual message content/text
+```jinja2
+<!-- Model template (e.g., Llama-2) -->
+{% for message in messages %}
+  {% if message['role'] == 'user' %}
+    {{ '<s>[INST] ' + message['content'] + ' [/INST]' }}
+  {% elif message['role'] == 'assistant' %}
+    {{ message['content'] + '</s>' }}
+  {% endif %}
+{% endfor %}
+```

-**Example usage:**
-```go
-conversation := []chattemplatego.ChatMessage{
-    {Role: "user", Content: "What is the weather in Paris?"},
-    {Role: "assistant", Content: "Let me check that for you."},
-    {Role: "user", Content: "Thank you!"},
-}
+```text
+<!-- Flattened prompt the model actually sees -->
+<s>[INST] What's 2+2? [/INST]Let me calculate that.</s><s>[INST] Thanks! [/INST]
 ```

-This structure follows the OpenAI ChatCompletions API format, making it compatible with existing chat-based applications.
+**Without templating**, we'd tokenize the raw JSON structure, producing completely different tokens than what the model will actually process, leading to incorrect KV cache lookups.

----
+## Integration with Existing Pipeline

-## How do the three scoring functions differ?
+The chat template integration adds a **pre-processing step** to the existing KV cache pipeline:

-- **`GetPromptPodScores`**:
-  Accepts a simple prompt string, tokenizes it, and calculates KV block keys directly.
+1. **Template Fetching**: Get the model-specific chat template from Hugging Face
+2. **Template Rendering**: Apply the Jinja2 template to flatten the request structure
+3. **Continue with existing pipeline**: Tokenize → KV Block Keys → Pod Scoring

-- **`GetCompletionsPodScores`**:
-  Accepts a full `ChatTemplateRequest` (with messages, tools, etc.), uses the Python Jinja2 template (via CGO) to flatten the structure into a prompt, then tokenizes and calculates KV block keys. This ensures the prompt matches what the model would actually see.
+See the main documentation for the complete pipeline details.

-- **`GetPodScores`**:
-  A unified interface that automatically dispatches to either `GetPromptPodScores` or `GetCompletionsPodScores` based on the input type:
-  - If input is a `string` → calls `GetPromptPodScores`
-  - If input is a `ChatTemplateRequest` → calls `GetCompletionsPodScores`
-  - This provides a single entry point for both simple prompts and complex chat completions.
+## Usage

----
+### Unified API

-## Detailed Flow: `GetCompletionsPodScores` Pipeline
+The indexer provides a unified `GetPodScores()` function that handles both regular prompts and chat completion requests:

-When `indexer.go:GetCompletionsPodScores()` is called, here's the complete flow through files and functions:
+```go
+// For regular prompts (default behavior)
+scores, err := indexer.GetPodScores(ctx, prompt, modelName, podIdentifiers, false)
+// or use the convenience function
+scores, err := indexer.GetPodScoresDefault(ctx, prompt, modelName, podIdentifiers)

+// For chat completion requests
+scores, err := indexer.GetPodScores(ctx, prompt, modelName, podIdentifiers, true)
 ```
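
Under the hood, the boolean flag presumably selects between the plain-prompt path and the templating path. Here is a minimal sketch of that dispatch, assuming a JSON-encoded `ChatTemplateRequest` is passed as the prompt when `chatCompletion` is true; the `Indexer` receiver, the return type, and the lowercase helper names are hypothetical, only the exported signature comes from this document:

```go
// Hypothetical sketch of the dispatch inside GetPodScores. The receiver,
// return type, and the two unexported helpers are illustrative assumptions.
func (k *Indexer) GetPodScores(ctx context.Context, prompt, modelName string,
    podIdentifiers []string, chatCompletion bool) (map[string]float64, error) {
    if !chatCompletion {
        // Plain prompt: straight to tokenization and KV block scoring.
        return k.getPromptPodScores(ctx, prompt, modelName, podIdentifiers)
    }
    // Chat completion: the prompt is a JSON-encoded ChatTemplateRequest
    // (see "For ChatCompletions Requests" below); render it first.
    var req chattemplatego.ChatTemplateRequest
    if err := json.Unmarshal([]byte(prompt), &req); err != nil {
        return nil, err
    }
    return k.getCompletionsPodScores(ctx, req, modelName, podIdentifiers)
}
```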
-1. indexer.go:GetCompletionsPodScores(ctx, req, modelName, podIdentifiers)
-
-├── 1.1. **CGO Binding**: chattemplatego.NewChatTemplateCGoWrapper()
-│   └── cgo_functions.go:NewChatTemplateCGoWrapper()
-│       └── Creates ChatTemplateCGoWrapper struct with initialized=false
-
-├── 1.2. **CGO Binding**: wrapper.GetModelChatTemplate(getReq)
-│   ├── cgo_functions.go:GetModelChatTemplate(req)
-│   │   ├── Initialize() Python interpreter via CGO
-│   │   ├── executePythonCode() - **CGO Binding** to Python
-│   │   └── **Python Wrapper**: chat_template_wrapper.py:get_model_chat_template()
-│   │       └── Uses Hugging Face AutoTokenizer to fetch model template
-│   └── Returns: (template, template_vars)
-
-├── 1.3. **CGO Binding**: wrapper.RenderChatTemplate(req)
-│   ├── cgo_functions.go:RenderChatTemplate(req)
-│   │   ├── Initialize() Python interpreter via CGO (if not already done)
-│   │   ├── executePythonCode() - **CGO Binding** to Python
-│   │   └── **Python Wrapper**: chat_template_wrapper.py:render_jinja_template()
-│   │       ├── _compile_jinja_template() - Compiles Jinja2 template
-│   │       ├── AssistantTracker class - Tracks assistant token indices
-│   │       └── Returns: (rendered_chats, generation_indices)
-│   └── Returns: ChatTemplateResponse
-
-├── 1.4. Extract prompt from response
-│   └── prompt := resp.RenderedChats[0]
-
-├── 1.5. **Tokenization**: k.tokenizersPool.AddTask(prompt, modelName)
-│   └── tokenization/pool.go:AddTask() - Queues tokenization task
-
-├── 1.6. **Prefix Store**: k.tokensIndexer.FindLongestContainedTokens(prompt, modelName)
-│   └── prefixstore/lru-store.go:FindLongestContainedTokens() - Finds cached tokens
-
-├── 1.7. **Token Processing**: k.tokensProcessor.TokensToKVBlockKeys(tokens, modelName)
-│   └── kv-cache/token-processor.go:TokensToKVBlockKeys() - Converts tokens to block keys
-
-├── 1.8. **KV Block Indexing**: k.kvBlockIndexer.GetPodsForKeys(ctx, blockKeys, podSet)
-│   └── kv-cache/kvblock-indexer.go:GetPodsForKeys() - Queries Redis for pod mappings
-
-└── 1.9. **Scoring**: k.kvBlockScorer.Score(strBlockKeys, keyToPods)
-    └── kv-cache/kvblock-scorer.go:Score() - Calculates pod scores
+
+### For ChatCompletions Requests
+
+The router can receive a standard OpenAI ChatCompletions request and convert it to a JSON string representing our `ChatTemplateRequest`:
+
+**ChatTemplateRequest accepts these fields:**
+- `Conversations` - List of message lists (role/content pairs)
+- `Tools` - (Optional) List of tool schemas
+- `Documents` - (Optional) List of document dicts
+- `ChatTemplate` - (Optional) Override for the chat template
+- `ReturnAssistantTokensMask` - (Optional) Whether to return assistant token indices
+- `ContinueFinalMessage` - (Optional) Whether to continue from the final message
+- `AddGenerationPrompt` - (Optional) Whether to add a generation prompt
+- `TemplateVars` - (Optional) Special tokens for template rendering
+
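
For reference, these fields correspond one-to-one to the Go struct definitions that appear in the removed section of this diff above, reproduced here so the field list can be read against concrete types:

```go
// ChatTemplateRequest represents the request to render a chat template
// (as defined in the previous revision of this document).
type ChatTemplateRequest struct {
    Conversations             [][]ChatMessage        `json:"conversations"`
    Tools                     []interface{}          `json:"tools,omitempty"`
    Documents                 []interface{}          `json:"documents,omitempty"`
    ChatTemplate              string                 `json:"chat_template,omitempty"`
    ReturnAssistantTokensMask bool                   `json:"return_assistant_tokens_mask,omitempty"`
    ContinueFinalMessage      bool                   `json:"continue_final_message,omitempty"`
    AddGenerationPrompt       bool                   `json:"add_generation_prompt,omitempty"`
    TemplateVars              map[string]interface{} `json:"template_vars,omitempty"`
}

// ChatMessage represents a single message in a conversation.
type ChatMessage struct {
    Role    string `json:"role"` // e.g., "system", "user", "assistant"
    Content string `json:"content"`
}
```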
+```json
+// Input: OpenAI ChatCompletions request
+{
+  "model": "llama-2-7b-chat",
+  "messages": [
+    {"role": "system", "content": "You are a helpful assistant."},
+    {"role": "user", "content": "What's the weather in Paris?"},
+    {"role": "assistant", "content": "Let me check that for you."},
+    {"role": "user", "content": "Thanks!"}
+  ],
+  "tools": [
+    {
+      "type": "function",
+      "function": {
+        "name": "get_weather",
+        "description": "Get weather information",
+        "parameters": {...}
+      }
+    }
+  ],
+  "documents": [
+    {
+      "type": "text",
+      "content": "Paris weather data..."
+    }
+  ]
+}
 ```

-### Key Components in the Pipeline:
+```go
+// Converted to ChatTemplateRequest and then to JSON string
+req := chattemplatego.ChatTemplateRequest{
+    Conversations: [][]chattemplatego.ChatMessage{
+        {
+            {Role: "system", Content: "You are a helpful assistant."},
+            {Role: "user", Content: "What's the weather in Paris?"},
+            {Role: "assistant", Content: "Let me check that for you."},
+            {Role: "user", Content: "Thanks!"},
+        },
+    },
+    Tools:     []interface{}{...}, // From OpenAI request
+    Documents: []interface{}{...}, // From OpenAI request
+    // Other fields are optional and can be set as needed
+}
+
+// Convert to JSON string for the unified API
+reqJSON, err := json.Marshal(req)
+if err != nil {
+    return err
+}

-**🔗 CGO Bindings** (Go → Python):
-- `cgo_functions.go` - Provides the bridge between Go and Python
-- Uses Python's C API via CGO to call Python functions directly
-- Manages Python interpreter lifecycle (Initialize/Finalize)
+scores, err := indexer.GetPodScores(ctx, string(reqJSON), modelName, podIdentifiers, true)
+```

-**📦 Python Wrapper** (Python → Hugging Face):
-- `chat_template_wrapper.py` - Wraps Hugging Face's complex template system
-- Provides clean API for template rendering and model template fetching
-- Handles Jinja2 compilation, assistant tracking, and error handling
+### Template Processing Flow

-**🔄 Data Flow**:
-1. **Input**: `ChatTemplateRequest` (messages, tools, documents)
-2. **Template Fetching**: Model-specific chat template from Hugging Face
-3. **Template Rendering**: Jinja2 template processing with tools/documents
-4. **Tokenization**: Convert rendered prompt to tokens
-5. **KV Cache Lookup**: Find cached token blocks and associated pods
-6. **Scoring**: Calculate pod scores based on cache hits
+The templating process (steps 1.1-1.4) handles the conversion from structured request to flattened prompt:

-This pipeline ensures that chat completion requests are properly templated, tokenized, and scored against the KV cache, providing accurate pod recommendations for efficient request routing.
+```
+1.1. **CGO Binding**: chattemplatego.NewChatTemplateCGoWrapper()
+     └── cgo_functions.go:NewChatTemplateCGoWrapper()
+         └── Creates ChatTemplateCGoWrapper struct with initialized=false
+
+1.2. **Template Fetching**: wrapper.GetModelChatTemplate(getReq)
+     ├── cgo_functions.go:GetModelChatTemplate(req)
+     │   ├── Initialize() Python interpreter via CGO
+     │   ├── executePythonCode() - **CGO Binding** to Python
+     │   └── **Python Wrapper**: chat_template_wrapper.py:get_model_chat_template()
+     │       └── Uses Hugging Face AutoTokenizer to fetch model template
+     └── Returns: (template, template_vars)
+
+1.3. **Template Rendering**: wrapper.RenderChatTemplate(req)
+     ├── cgo_functions.go:RenderChatTemplate(req)
+     │   ├── Initialize() Python interpreter via CGO (if not already done)
+     │   ├── executePythonCode() - **CGO Binding** to Python
+     │   └── **Python Wrapper**: chat_template_wrapper.py:render_jinja_template()
+     │       ├── Imports render_jinja_template from transformers.utils.chat_template_utils
+     │       └── Uses transformers library's core template rendering functionality
+     └── Returns: ChatTemplateResponse
+
+1.4. **Extract Flattened Prompt**
+     ├── prompt := resp.RenderedChats[0]
+     └── Continue with existing pipeline: Tokenize → KV Block Keys → Pod Scoring
+```
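
Expressed as Go, steps 1.1-1.4 amount to roughly the following. This is a sketch only: the function and field names come from the flow above, but the error returns and exact request/response shapes of the `chattemplatego` wrapper are assumptions:

```go
// Sketch of steps 1.1-1.4; signatures are assumptions, names follow the flow above.
wrapper := chattemplatego.NewChatTemplateCGoWrapper() // 1.1: create the CGO wrapper

// 1.2: fetch the model's chat template and special-token vars from Hugging Face.
template, templateVars, err := wrapper.GetModelChatTemplate(getReq) // getReq identifies the model
if err != nil {
    return nil, err
}

// 1.3: render the Jinja2 template over the structured request.
req.ChatTemplate = template
req.TemplateVars = templateVars
resp, err := wrapper.RenderChatTemplate(req)
if err != nil {
    return nil, err
}

// 1.4: extract the flattened prompt and continue with the existing pipeline
// (tokenize -> KV block keys -> pod scoring).
prompt := resp.RenderedChats[0]
```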

----
+### API Functions

-## Summary
+- **`GetPodScores(ctx, prompt, modelName, podIdentifiers, chatCompletion)`** - Unified function that handles both regular prompts and chat completions
+- **`GetPodScoresDefault(ctx, prompt, modelName, podIdentifiers)`** - Convenience function for regular prompts (equivalent to `GetPodScores` with `chatCompletion=false`)
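
The bullets above imply the convenience function is a thin wrapper; a sketch of the presumed relationship, with the receiver and return types assumed as before:

```go
// Presumed shape of GetPodScoresDefault: GetPodScores with chatCompletion
// fixed to false. Receiver and return types are illustrative assumptions.
func (k *Indexer) GetPodScoresDefault(ctx context.Context, prompt, modelName string,
    podIdentifiers []string) (map[string]float64, error) {
    return k.GetPodScores(ctx, prompt, modelName, podIdentifiers, false)
}
```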

-- The router should send a `ChatTemplateRequest` (not just a prompt string) to the indexer.
-- `GetCompletionsPodScores` will handle template rendering and tokenization internally, ensuring correct KV block key calculation for all supported models.
-- The integration uses a CGO bridge (`cgo_functions.go`) to call Python (`chat_template_wrapper.py`) for template rendering, matching vLLM and OpenAI API behavior.
+The integration ensures tokenization matches exactly what the model will process, enabling accurate KV cache lookups for chat completion requests.
