"Correct" Tokenizer for AzureOpenAITextEmbeddingGenerator for models text-embedding-ada-002, text-embedding-3-large and text-embedding-3-small #880

christianarg · 2024-11-04T11:34:46Z

When configuring WithAzureOpenAITextEmbeddingGeneration I'm getting the warning:

warn: Microsoft.KernelMemory.AI.AzureOpenAI.AzureOpenAITextEmbeddingGenerator[0]
Tokenizer not specified, will use GPT4oTokenizer. The token count might be incorrect, causing unexpected errors

As I understand it, text-embedding-ada-002, text-embedding-3-large and text-embedding-3-small all use the cl100k_base encoding.
To avoid this warning, I'm passing new GPT4Tokenizer() as a parameter, since this tokenizer uses cl100k_base, the same encoding that all three embedding models use.
Is this correct? The GPT4Tokenizer summary states that it uses cl100k_base.tiktoken + special tokens, so I'm not sure.
If this is not correct, which tokenizer should I use for these three embedding models?
Even if it is correct, maybe you could provide a built-in Cl100k_baseTokenizer so the choice is more obvious?
Thanks

Answered by dluc

dluc · 2024-11-04T17:59:18Z

Maintainer

The solution is correct:

text-embedding-ada-002 => cl100k_base
text-embedding-3-small / text-embedding-3-large => cl100k_base

The tokenizers are named after the text models, so I agree it's not obvious which one to pick for embedding models. Even if we provided a tokenizer called Cl100k_baseTokenizer, I'm sure some people would not know when to use it. I think we could address this by improving the documentation, e.g. adding a Tokenizer page in the repo and adding the URL in the code.
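
For readers landing here, the setup confirmed above can be sketched roughly as follows. This is a minimal illustration rather than an official sample: the endpoint, deployment name and API key are placeholders, the namespace of GPT4Tokenizer is an assumption that may differ between Kernel Memory versions, and the tokenizer is assumed to be accepted as the second argument of WithAzureOpenAITextEmbeddingGeneration.

```csharp
using Microsoft.KernelMemory;
using Microsoft.KernelMemory.AI.OpenAI; // assumed location of GPT4Tokenizer; check your package version

// Placeholder values: replace with your own Azure OpenAI resource details.
var embeddingConfig = new AzureOpenAIConfig
{
    Auth = AzureOpenAIConfig.AuthTypes.APIKey,
    Endpoint = "https://<your-resource>.openai.azure.com/",
    Deployment = "<your-embedding-deployment>", // e.g. a text-embedding-3-small deployment
    APIKey = "<your-api-key>",
};

// GPT4Tokenizer uses cl100k_base, the encoding shared by text-embedding-ada-002,
// text-embedding-3-small and text-embedding-3-large. Passing it explicitly avoids
// the "Tokenizer not specified" warning.
var builder = new KernelMemoryBuilder()
    .WithAzureOpenAITextEmbeddingGeneration(embeddingConfig, new GPT4Tokenizer());

// A text generation model and other dependencies still need to be configured
// before building the memory instance; that part is omitted here.
```

The only change from a default setup is the explicit tokenizer argument; everything else in the configuration stays the same.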

1 reply

christianarg Nov 6, 2024
Author

Great. As a suggestion, maybe mention this in the warning that AzureOpenAITextEmbeddingGenerator emits when no tokenizer is specified. Something like this (it could probably be worded better):

"Tokenizer not specified, will use {0}. The token count might be incorrect, causing unexpected errors. For text-embedding-ada-002, text-embedding-3-small and text-embedding-3-large, use GPT4Tokenizer, as it uses the cl100k_base encoding"
