The solution is correct:

  • text-embedding-ada-002 => cl100k_base
  • text-embedding-3-small / text-embedding-3-large => cl100k_base

The tokenizers are named after the text models, so I agree it's not obvious which one to pick for embedding models. Even if we provided a tokenizer called Cl100k_baseTokenizer, I'm sure some people would not know when to use it. I think we could address this by improving the documentation, e.g. adding a Tokenizer page in the repo and adding its URL in the code.
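
Until such a page exists, the mapping above could be kept next to the code as a small lookup. This is only a sketch in Python: the dictionary and function names are hypothetical and not part of any library, and the entries are just the three models listed in this answer.

```python
# Model-to-encoding mapping copied from the answer above.
# EMBEDDING_MODEL_ENCODINGS and encoding_for_embedding_model are
# hypothetical names used for illustration only.
EMBEDDING_MODEL_ENCODINGS = {
    "text-embedding-ada-002": "cl100k_base",
    "text-embedding-3-small": "cl100k_base",
    "text-embedding-3-large": "cl100k_base",
}


def encoding_for_embedding_model(model: str) -> str:
    """Return the tokenizer encoding name for a known embedding model."""
    try:
        return EMBEDDING_MODEL_ENCODINGS[model]
    except KeyError:
        # Fail loudly instead of guessing an encoding for an unknown model.
        raise ValueError(f"No known tokenizer encoding for model {model!r}") from None


if __name__ == "__main__":
    print(encoding_for_embedding_model("text-embedding-3-small"))
```

If the `tiktoken` package is available, the returned name can presumably be passed to `tiktoken.get_encoding` to obtain the tokenizer itself. Raising on unknown models is a deliberate choice here, so that a newly released model does not silently get the wrong token counts.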

Answer selected by christianarg