Added information about the use of slow tokenizers #2517


Open · wants to merge 4 commits into main

Conversation

ppf2 (Contributor) commented Aug 11, 2025

Added information about the use of slow tokenizers to generate vocab files in ML.

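As a rough sketch of what the docs change describes, assuming the Hugging Face `transformers` API (the model name and output directory below are illustrative, not taken from the docs change itself):

```python
import os

from transformers import AutoTokenizer

MODEL_ID = "bert-base-uncased"  # illustrative model, not from this PR
OUT_DIR = "./vocab_out"         # illustrative output directory

# use_fast=False forces the slow (pure-Python) tokenizer implementation
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=False)

# save_vocabulary writes the tokenizer's vocab file(s) to the directory,
# e.g. vocab.txt for a BERT-style tokenizer
os.makedirs(OUT_DIR, exist_ok=True)
vocab_files = tokenizer.save_vocabulary(OUT_DIR)
print(vocab_files)  # tuple of paths to the files that were written
```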


benironside (Contributor) left a comment

I am not familiar with the content, but the writing LGTM

davidkyle (Member) commented

FYI I started work on switching to the fast tokenizers for Eland in elastic/eland#803. This change is required to support more of the models found on HuggingFace; the Jina AI Reranker is one example.

However, some tests failed after the switch, so it is not a simple change; we must first understand why those failures are occurring.
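As a minimal sketch of the kind of check involved, assuming the Hugging Face `transformers` API (the model name and sample text are illustrative); any divergence between the two outputs is the sort of thing that would explain those failures:

```python
from transformers import AutoTokenizer

MODEL_ID = "bert-base-uncased"  # illustrative model, not one from this PR

# Load both implementations of the same tokenizer
slow = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=False)
fast = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)

text = "Fast and slow tokenizers should agree on this sentence."
slow_ids = slow(text)["input_ids"]
fast_ids = fast(text)["input_ids"]

# A mismatch here is exactly the kind of divergence that breaks tests
assert slow_ids == fast_ids, f"slow={slow_ids}\nfast={fast_ids}"
print("slow and fast tokenizers agree for this input")
```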

davidkyle (Member) left a comment

We've lost the part about the slow tokenizers now

ppf2 (Contributor, Author) commented Aug 14, 2025

@davidkyle Oh, I thought it was intentional because we are about to support fast tokenizers 😄 Do you think we should hold off on this PR until fast tokenizer support is available, and make the statement about slow/fast tokenizers then? WDYT?

vishaangelova (Contributor) commented

@ppf2 I disabled auto-merge when I saw your question about holding off on this PR, just to be sure it doesn't get merged until you want it to.

davidkyle (Member) left a comment

LGTM

davidkyle (Member) commented

> Oh I thought it was intentional because we are about to support fast tokenizers 😄

Good point. Let's merge as is, and I will concentrate on the fast tokenizer work. If I don't make any progress next week, I will create another PR here to document the use of slow tokenizers.
