"Correct" Tokenizer for AzureOpenAITextEmbeddingGenerator for models text-embeddings-ada-002, text-embeddings-3-large and text-embeddings-3-small #880
When configuring WithAzureOpenAITextEmbeddingGeneration, I get this warning: warn: Microsoft.KernelMemory.AI.AzureOpenAI.AzureOpenAITextEmbeddingGenerator[0]
As I understand it, text-embedding-ada-002, text-embedding-3-large and text-embedding-3-small all use the cl100k_base encoding.
The solution is correct:
The tokenizers are named after the text models, so I agree it's not obvious which one to pick for embedding models. Even if we provided a tokenizer called Cl100k_baseTokenizer, I'm sure some people would not know when to use it. I think we could address this by improving the documentation, e.g. adding a Tokenizer page to the repo and linking to it from the code.
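To illustrate what such a Tokenizer page could record, here is a minimal Python sketch of a model-to-encoding lookup. The encoding names are taken from the model table published with OpenAI's tiktoken library, as far as I know; the table and the helper name `models_sharing_encoding` are hypothetical and are not part of Kernel Memory.

```python
# Hypothetical lookup table: model name -> tokenizer encoding name.
# Assumption: values follow tiktoken's published model table.
ENCODING_BY_MODEL = {
    "text-embedding-ada-002": "cl100k_base",
    "text-embedding-3-small": "cl100k_base",
    "text-embedding-3-large": "cl100k_base",
    "gpt-3.5-turbo": "cl100k_base",
    "gpt-4": "cl100k_base",
}


def models_sharing_encoding(model: str) -> list[str]:
    """Return the other known models that use the same encoding as `model`."""
    encoding = ENCODING_BY_MODEL[model]
    return sorted(
        name
        for name, enc in ENCODING_BY_MODEL.items()
        if enc == encoding and name != model
    )


if __name__ == "__main__":
    # Show which encoding each embedding model uses and which models share it.
    for name in ("text-embedding-ada-002", "text-embedding-3-large"):
        print(name, "->", ENCODING_BY_MODEL[name], models_sharing_encoding(name))
```

Under this assumption, a tokenizer named after a text model such as GPT-4 counts tokens the same way as the three embedding models, because they all map to the same encoding.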