Added information about the use of slow tokenizers #2517
base: main
Conversation
Added information about the use of slow tokenizers to generate vocab files in ML.
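For context, a minimal sketch of what generating a vocab file with a slow tokenizer can look like using the HuggingFace transformers API. The model ID, output filename, and JSON layout below are illustrative assumptions, not taken from this PR or from Eland's actual implementation:

```python
import json
from transformers import AutoTokenizer

# use_fast=False selects the slow, pure-Python tokenizer implementation
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)

# get_vocab() returns a {token: id} mapping; sorting by id recovers the
# vocabulary order expected in a token-per-index vocab file
vocab = sorted(tokenizer.get_vocab().items(), key=lambda kv: kv[1])

# Hypothetical output layout, for illustration only
with open("vocab.json", "w") as f:
    json.dump({"vocabulary": [token for token, _ in vocab]}, f)
```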
I am not familiar with the content, but the writing LGTM
FYI I started work on switching to the fast tokenizers for Eland in elastic/eland#803. This change is required to support more of the models found on HuggingFace; the Jina AI Reranker is an example. However, some tests failed after the switch, so it is not a simple change and we must first understand why those failures are occurring.
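For readers less familiar with the slow/fast distinction, a rough sketch of how the two tokenizer flavours are selected in transformers. The model ID is a placeholder; the comment about fast-only models restates the motivation described above, not a claim about any specific model:

```python
from transformers import AutoTokenizer

# use_fast toggles between the two implementations; model ID is illustrative
slow = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)
fast = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

print(slow.is_fast)  # False: pure-Python implementation
print(fast.is_fast)  # True: backed by the Rust `tokenizers` library

# Some newer models on HuggingFace ship only a fast tokenizer, which is
# why a slow-tokenizer-only import path cannot handle them.
```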
Co-authored-by: David Kyle <[email protected]>
We've lost the part about the slow tokenizers now
@davidkyle Oh, I thought it was intentional because we are about to support fast tokenizers 😄 Do you think we should hold off on this PR until fast tokenizer support is available, and make the statement about slow/fast tokenizers then? WDYT?
@ppf2 I disabled the auto-merge as I saw your question about holding off on this PR, just to be sure this doesn't get merged until you want it to.
LGTM
Good point. Let's merge as is, and I will concentrate on the fast tokenizer work. If I don't make any progress next week, I will create another PR here to document the use of slow tokenizers.
Co-authored-by: shainaraskas <[email protected]>