Implement extension tokenization #2763

lramos15 · 2026-01-09T15:43:15Z

Copilot

Pull request overview

This PR implements extension-contributed tokenization by creating a new ExtensionContributedChatTokenizer class that leverages the VS Code Language Model API for token counting. This replaces the previous approach where extension-contributed endpoints relied on the generic ITokenizerProvider service.

Key Changes

Adds ExtensionContributedChatTokenizer that delegates token counting to the VS Code language model API
Removes ITokenizerProvider dependency from ExtensionContributedChatEndpoint
Adds comprehensive unit tests for the new tokenizer implementation

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File	Description
src/platform/endpoint/vscode-node/extChatTokenizer.ts	New tokenizer implementation that uses VS Code's LanguageModelChat.countTokens API for token counting with special handling for different content part types
src/platform/endpoint/vscode-node/test/extChatTokenizer.spec.ts	Comprehensive test suite covering string tokens, content parts, messages, and tools with mock implementations
src/platform/endpoint/vscode-node/extChatEndpoint.ts	Updated to instantiate ExtensionContributedChatTokenizer directly, removing ITokenizerProvider dependency injection

Comments suppressed due to low confidence (2)

src/platform/endpoint/vscode-node/test/extChatTokenizer.spec.ts:125

The countMessageTokens method returns 0 when convertToApiChatMessage returns an empty array, but this case is not tested. Since convertToApiChatMessage can skip certain content parts (like non-base64 images), it's possible for valid input messages to result in empty API messages. Consider adding a test case that verifies this behavior, such as a message with only non-base64 image URLs that get skipped during conversion.

	describe('countMessageTokens', () => {
		it('should count tokens for a user message', async () => {
			const message: Raw.ChatMessage = {
				role: Raw.ChatRole.User,
				content: [{ type: Raw.ChatCompletionContentPartKind.Text, text: 'Hello there' }]
			};
			const result = await tokenizer.countMessageTokens(message);
			// BaseTokensPerMessage (3) + message content tokens
			expect(result).toBeGreaterThanOrEqual(3);
		});

		it('should count tokens for an assistant message', async () => {
			const message: Raw.ChatMessage = {
				role: Raw.ChatRole.Assistant,
				content: [{ type: Raw.ChatCompletionContentPartKind.Text, text: 'I can help with that' }]
			};
			const result = await tokenizer.countMessageTokens(message);
			expect(result).toBeGreaterThanOrEqual(3);
		});

		it('should count tokens for a system message', async () => {
			const message: Raw.ChatMessage = {
				role: Raw.ChatRole.System,
				content: [{ type: Raw.ChatCompletionContentPartKind.Text, text: 'You are a helpful assistant' }]
			};
			const result = await tokenizer.countMessageTokens(message);
			expect(result).toBeGreaterThanOrEqual(3);
		});
	});

src/platform/endpoint/vscode-node/extChatTokenizer.ts:114

The _countObjectTokens method doesn't handle arrays correctly. When it encounters an array (which is typeof 'object'), it will cast it to Record<string, unknown> and use Object.entries, which will iterate over array indices as keys. This could lead to incorrect token counting for tool parameters that contain arrays. Consider adding explicit handling for arrays, such as checking Array.isArray(value) before the typeof check and iterating over array elements appropriately.

			} else if (typeof value === 'object') {
				numTokens += await this._countObjectTokens(value as Record<string, unknown>);
			}

Copilot · 2026-01-09T15:48:11Z

src/platform/endpoint/vscode-node/extChatTokenizer.ts

+	private async _countObjectTokens(obj: Record<string, unknown>): Promise<number> {
+		let numTokens = 0;
+		for (const [key, value] of Object.entries(obj)) {
+			if (!value) {


The check if (!value) on line 105 will skip over falsy values including 0, false, empty strings, etc. While this may be intentional for null/undefined, it could incorrectly skip legitimate values. For example, a tool parameter with a default value of 0 or false would not contribute to the token count. Consider using a more explicit check like if (value === null || value === undefined) to only skip truly absent values.

This issue also appears in the following locations of the same file:

line 112

Suggested change

if (!value) {

if (value === null || value === undefined) {

Copilot · 2026-01-09T15:48:12Z

src/platform/endpoint/vscode-node/test/extChatTokenizer.spec.ts

+	describe('countToolTokens', () => {
+		it('should count tokens for a single tool', async () => {
+			const tools = [{
+				name: 'get_weather',
+				description: 'Get the current weather',
+				inputSchema: {
+					type: 'object',
+					properties: {
+						location: { type: 'string' }
+					}
+				}
+			}];
+			const result = await tokenizer.countToolTokens(tools);
+			// baseToolTokens (16) + baseTokensPerTool (8) + object tokens * 1.1
+			expect(result).toBeGreaterThan(24);
+		});
+
+		it('should count tokens for multiple tools', async () => {
+			const tools = [
+				{
+					name: 'get_weather',
+					description: 'Get weather info',
+					inputSchema: { type: 'object' }
+				},
+				{
+					name: 'search',
+					description: 'Search the web',
+					inputSchema: { type: 'object' }
+				}
+			];
+			const result = await tokenizer.countToolTokens(tools);
+			// baseToolTokens (16) + 2 * baseTokensPerTool (8) + object tokens
+			expect(result).toBeGreaterThan(32);
+		});
+
+		it('should return 0 for empty tools array', async () => {
+			const result = await tokenizer.countToolTokens([]);
+			expect(result).toBe(0);
+		});
+	});


Test coverage is missing for tool schemas with non-string/non-object values (numbers, booleans, null). These values should be considered when counting tokens, but the current implementation may skip them due to the falsy check or lack of handling. Consider adding tests for tool schemas with:

Numeric values (e.g., maxLength: 100)

Boolean values (e.g., required: true)

Null values

Arrays (e.g., required: ['field1', 'field2'])

Copilot · 2026-01-09T15:48:12Z

src/platform/endpoint/vscode-node/extChatTokenizer.ts

+				if (text.imageUrl.url.startsWith('data:image/')) {
+					try {
+						return calculateImageTokenCost(text.imageUrl.url, text.imageUrl.detail);
+					} catch {
+						return this._textTokenLength(text.imageUrl.url);
+					}
+				}
+				return this._textTokenLength(text.imageUrl.url);


There's no null safety check for text.imageUrl before accessing text.imageUrl.url. If the imageUrl property is undefined or null, this will throw a runtime error. Consider adding a check like if (text.imageUrl?.url.startsWith('data:image/')) or handling the case where imageUrl might be missing.

Suggested change

if (text.imageUrl.url.startsWith('data:image/')) {

try {

return calculateImageTokenCost(text.imageUrl.url, text.imageUrl.detail);

} catch {

return this._textTokenLength(text.imageUrl.url);

}

}

return this._textTokenLength(text.imageUrl.url);

if (!text.imageUrl || !text.imageUrl.url) {

return 0;

}

{

const imageUrl = text.imageUrl;

if (imageUrl.url.startsWith('data:image/')) {

try {

return calculateImageTokenCost(imageUrl.url, imageUrl.detail);

} catch {

return this._textTokenLength(imageUrl.url);

}

}

return this._textTokenLength(imageUrl.url);

}

Copilot · 2026-01-09T15:48:12Z

src/platform/endpoint/vscode-node/test/extChatTokenizer.spec.ts

+		it('should return 0 for cache breakpoint content parts', async () => {
+			const cacheBreakpoint: Raw.ChatCompletionContentPart = {
+				type: Raw.ChatCompletionContentPartKind.CacheBreakpoint
+			};
+			const result = await tokenizer.tokenLength(cacheBreakpoint);
+			expect(result).toBe(0);
+		});


The test suite is missing test coverage for the Image content part type. The implementation in extChatTokenizer.ts handles Image content parts with special logic for base64 data URLs and calls calculateImageTokenCost. Consider adding tests for:

Image parts with base64 data URLs (data:image/...)

Image parts with regular URLs

Image parts where calculateImageTokenCost throws an exception

This issue also appears in the following locations of the same file:

line 97

Copilot · 2026-01-09T15:48:13Z

src/platform/endpoint/vscode-node/extChatTokenizer.ts

+		// Convert to VS Code message format and use the language model's countTokens
+		const apiMessages = convertToApiChatMessage([message]);
+		if (apiMessages.length === 0) {
+			return 0;


When convertToApiChatMessage returns an empty array (e.g., for messages with only non-base64 images), this method returns 0. However, this behavior may not accurately represent the token cost of such messages, as the empty result is due to content filtering during conversion rather than the message truly being empty. Consider whether returning BaseTokensPerMessage (3) would be more appropriate in this case, or adding a comment to document this edge case behavior.

Suggested change

return 0;

// Edge case: convertToApiChatMessage can return an empty array when the message

// only contains filtered content (e.g., non-base64 images). In that case, we still

// account for the base per-message token overhead.

return BaseTokensPerMessage;

Implement extension tokenization

ba06468

Copilot AI review requested due to automatic review settings January 9, 2026 15:43

lramos15 enabled auto-merge January 9, 2026 15:43

lramos15 self-assigned this Jan 9, 2026

lramos15 requested a review from bhavyaus January 9, 2026 15:43

Copilot started reviewing on behalf of lramos15 January 9, 2026 15:43 View session

vs-code-engineering bot added this to the January 2026 milestone Jan 9, 2026

Copilot AI reviewed Jan 9, 2026

View reviewed changes

bhavyaus approved these changes Jan 12, 2026

View reviewed changes

lramos15 added this pull request to the merge queue Jan 12, 2026

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jan 12, 2026

lramos15 added this pull request to the merge queue Jan 13, 2026

Merged via the queue into main with commit abf24aa Jan 13, 2026
24 checks passed

lramos15 deleted the lramos15/legislative-bat branch January 13, 2026 16:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Implement extension tokenization #2763

Implement extension tokenization #2763

Uh oh!

lramos15 commented Jan 9, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Jan 9, 2026

Uh oh!

Copilot AI Jan 9, 2026

Uh oh!

Copilot AI Jan 9, 2026

Uh oh!

Copilot AI Jan 9, 2026

Uh oh!

Copilot AI Jan 9, 2026

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

-			return 0;
+			// Edge case: convertToApiChatMessage can return an empty array when the message
+			// only contains filtered content (e.g., non-base64 images). In that case, we still
+			// account for the base per-message token overhead.
+			return BaseTokensPerMessage;

Implement extension tokenization #2763

Implement extension tokenization #2763

Uh oh!

Conversation

lramos15 commented Jan 9, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Key Changes

Reviewed changes

Uh oh!

Copilot AI Jan 9, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 9, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 9, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 9, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 9, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants