-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Implement extension tokenization #2763
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR implements extension-contributed tokenization by creating a new ExtensionContributedChatTokenizer class that leverages the VS Code Language Model API for token counting. This replaces the previous approach where extension-contributed endpoints relied on the generic ITokenizerProvider service.
Key Changes
- Adds
ExtensionContributedChatTokenizerthat delegates token counting to the VS Code language model API - Removes
ITokenizerProviderdependency fromExtensionContributedChatEndpoint - Adds comprehensive unit tests for the new tokenizer implementation
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/platform/endpoint/vscode-node/extChatTokenizer.ts | New tokenizer implementation that uses VS Code's LanguageModelChat.countTokens API for token counting with special handling for different content part types |
| src/platform/endpoint/vscode-node/test/extChatTokenizer.spec.ts | Comprehensive test suite covering string tokens, content parts, messages, and tools with mock implementations |
| src/platform/endpoint/vscode-node/extChatEndpoint.ts | Updated to instantiate ExtensionContributedChatTokenizer directly, removing ITokenizerProvider dependency injection |
Comments suppressed due to low confidence (2)
src/platform/endpoint/vscode-node/test/extChatTokenizer.spec.ts:125
- The countMessageTokens method returns 0 when convertToApiChatMessage returns an empty array, but this case is not tested. Since convertToApiChatMessage can skip certain content parts (like non-base64 images), it's possible for valid input messages to result in empty API messages. Consider adding a test case that verifies this behavior, such as a message with only non-base64 image URLs that get skipped during conversion.
describe('countMessageTokens', () => {
it('should count tokens for a user message', async () => {
const message: Raw.ChatMessage = {
role: Raw.ChatRole.User,
content: [{ type: Raw.ChatCompletionContentPartKind.Text, text: 'Hello there' }]
};
const result = await tokenizer.countMessageTokens(message);
// BaseTokensPerMessage (3) + message content tokens
expect(result).toBeGreaterThanOrEqual(3);
});
it('should count tokens for an assistant message', async () => {
const message: Raw.ChatMessage = {
role: Raw.ChatRole.Assistant,
content: [{ type: Raw.ChatCompletionContentPartKind.Text, text: 'I can help with that' }]
};
const result = await tokenizer.countMessageTokens(message);
expect(result).toBeGreaterThanOrEqual(3);
});
it('should count tokens for a system message', async () => {
const message: Raw.ChatMessage = {
role: Raw.ChatRole.System,
content: [{ type: Raw.ChatCompletionContentPartKind.Text, text: 'You are a helpful assistant' }]
};
const result = await tokenizer.countMessageTokens(message);
expect(result).toBeGreaterThanOrEqual(3);
});
});
src/platform/endpoint/vscode-node/extChatTokenizer.ts:114
- The _countObjectTokens method doesn't handle arrays correctly. When it encounters an array (which is typeof 'object'), it will cast it to Record<string, unknown> and use Object.entries, which will iterate over array indices as keys. This could lead to incorrect token counting for tool parameters that contain arrays. Consider adding explicit handling for arrays, such as checking Array.isArray(value) before the typeof check and iterating over array elements appropriately.
} else if (typeof value === 'object') {
numTokens += await this._countObjectTokens(value as Record<string, unknown>);
}
| private async _countObjectTokens(obj: Record<string, unknown>): Promise<number> { | ||
| let numTokens = 0; | ||
| for (const [key, value] of Object.entries(obj)) { | ||
| if (!value) { |
Copilot
AI
Jan 9, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The check if (!value) on line 105 will skip over falsy values including 0, false, empty strings, etc. While this may be intentional for null/undefined, it could incorrectly skip legitimate values. For example, a tool parameter with a default value of 0 or false would not contribute to the token count. Consider using a more explicit check like if (value === null || value === undefined) to only skip truly absent values.
This issue also appears in the following locations of the same file:
- line 112
| if (!value) { | |
| if (value === null || value === undefined) { |
| describe('countToolTokens', () => { | ||
| it('should count tokens for a single tool', async () => { | ||
| const tools = [{ | ||
| name: 'get_weather', | ||
| description: 'Get the current weather', | ||
| inputSchema: { | ||
| type: 'object', | ||
| properties: { | ||
| location: { type: 'string' } | ||
| } | ||
| } | ||
| }]; | ||
| const result = await tokenizer.countToolTokens(tools); | ||
| // baseToolTokens (16) + baseTokensPerTool (8) + object tokens * 1.1 | ||
| expect(result).toBeGreaterThan(24); | ||
| }); | ||
|
|
||
| it('should count tokens for multiple tools', async () => { | ||
| const tools = [ | ||
| { | ||
| name: 'get_weather', | ||
| description: 'Get weather info', | ||
| inputSchema: { type: 'object' } | ||
| }, | ||
| { | ||
| name: 'search', | ||
| description: 'Search the web', | ||
| inputSchema: { type: 'object' } | ||
| } | ||
| ]; | ||
| const result = await tokenizer.countToolTokens(tools); | ||
| // baseToolTokens (16) + 2 * baseTokensPerTool (8) + object tokens | ||
| expect(result).toBeGreaterThan(32); | ||
| }); | ||
|
|
||
| it('should return 0 for empty tools array', async () => { | ||
| const result = await tokenizer.countToolTokens([]); | ||
| expect(result).toBe(0); | ||
| }); | ||
| }); |
Copilot
AI
Jan 9, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test coverage is missing for tool schemas with non-string/non-object values (numbers, booleans, null). These values should be considered when counting tokens, but the current implementation may skip them due to the falsy check or lack of handling. Consider adding tests for tool schemas with:
- Numeric values (e.g., maxLength: 100)
- Boolean values (e.g., required: true)
- Null values
- Arrays (e.g., required: ['field1', 'field2'])
| if (text.imageUrl.url.startsWith('data:image/')) { | ||
| try { | ||
| return calculateImageTokenCost(text.imageUrl.url, text.imageUrl.detail); | ||
| } catch { | ||
| return this._textTokenLength(text.imageUrl.url); | ||
| } | ||
| } | ||
| return this._textTokenLength(text.imageUrl.url); |
Copilot
AI
Jan 9, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no null safety check for text.imageUrl before accessing text.imageUrl.url. If the imageUrl property is undefined or null, this will throw a runtime error. Consider adding a check like if (text.imageUrl?.url.startsWith('data:image/')) or handling the case where imageUrl might be missing.
| if (text.imageUrl.url.startsWith('data:image/')) { | |
| try { | |
| return calculateImageTokenCost(text.imageUrl.url, text.imageUrl.detail); | |
| } catch { | |
| return this._textTokenLength(text.imageUrl.url); | |
| } | |
| } | |
| return this._textTokenLength(text.imageUrl.url); | |
| if (!text.imageUrl || !text.imageUrl.url) { | |
| return 0; | |
| } | |
| { | |
| const imageUrl = text.imageUrl; | |
| if (imageUrl.url.startsWith('data:image/')) { | |
| try { | |
| return calculateImageTokenCost(imageUrl.url, imageUrl.detail); | |
| } catch { | |
| return this._textTokenLength(imageUrl.url); | |
| } | |
| } | |
| return this._textTokenLength(imageUrl.url); | |
| } |
| it('should return 0 for cache breakpoint content parts', async () => { | ||
| const cacheBreakpoint: Raw.ChatCompletionContentPart = { | ||
| type: Raw.ChatCompletionContentPartKind.CacheBreakpoint | ||
| }; | ||
| const result = await tokenizer.tokenLength(cacheBreakpoint); | ||
| expect(result).toBe(0); | ||
| }); |
Copilot
AI
Jan 9, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test suite is missing test coverage for the Image content part type. The implementation in extChatTokenizer.ts handles Image content parts with special logic for base64 data URLs and calls calculateImageTokenCost. Consider adding tests for:
- Image parts with base64 data URLs (data:image/...)
- Image parts with regular URLs
- Image parts where calculateImageTokenCost throws an exception
This issue also appears in the following locations of the same file:
- line 97
| // Convert to VS Code message format and use the language model's countTokens | ||
| const apiMessages = convertToApiChatMessage([message]); | ||
| if (apiMessages.length === 0) { | ||
| return 0; |
Copilot
AI
Jan 9, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When convertToApiChatMessage returns an empty array (e.g., for messages with only non-base64 images), this method returns 0. However, this behavior may not accurately represent the token cost of such messages, as the empty result is due to content filtering during conversion rather than the message truly being empty. Consider whether returning BaseTokensPerMessage (3) would be more appropriate in this case, or adding a comment to document this edge case behavior.
| return 0; | |
| // Edge case: convertToApiChatMessage can return an empty array when the message | |
| // only contains filtered content (e.g., non-base64 images). In that case, we still | |
| // account for the base per-message token overhead. | |
| return BaseTokensPerMessage; |
Fix microsoft/vscode#278685