Conversation

Contributor

@ppf2 ppf2 commented Aug 11, 2025

Added information about the use of slow tokenizers to generate vocab files in ML.

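For readers unfamiliar with the distinction: slow tokenizers are the pure-Python implementations in `transformers`, and one practical property is that they can read and write plain-text vocab files directly. A minimal sketch, assuming `transformers` is installed; the tiny vocab and paths below are illustrative, not taken from the docs change:

```python
# Hedged sketch: a slow (pure-Python) Hugging Face tokenizer reading and
# re-exporting a plain-text vocab file. The vocab is written locally so the
# example needs no network access; contents are illustrative only.
import os
import tempfile

from transformers import BertTokenizer  # a "slow" tokenizer class

tmpdir = tempfile.mkdtemp()
vocab_path = os.path.join(tmpdir, "vocab.txt")
with open(vocab_path, "w") as f:
    f.write("\n".join(["[PAD]", "[UNK]", "[CLS]", "[SEP]", "hello", "world"]))

# Slow tokenizers expose their vocabulary as a token -> id mapping ...
tok = BertTokenizer(vocab_path)
vocab = tok.get_vocab()

# ... and can write it back out, which is how a vocab file can be generated.
out_dir = os.path.join(tmpdir, "exported")
os.makedirs(out_dir)
(exported,) = tok.save_vocabulary(out_dir)
print(vocab["hello"], os.path.basename(exported))
```

Fast (Rust-backed) tokenizers bundle everything into a single `tokenizer.json` instead, which is part of why the switch discussed below is not a drop-in change.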
@ppf2 ppf2 requested a review from davidkyle August 11, 2025 22:06
@ppf2 ppf2 requested review from a team as code owners August 11, 2025 22:06
Contributor

@benironside benironside left a comment


I am not familiar with the content, but the writing LGTM

@davidkyle
Member

FYI I started work on switching to the fast tokenizers for Eland in elastic/eland#803. This change is required to support more of the models found on HuggingFace; the Jina AI Reranker is one example.

However, some tests failed after the switch, so it is not a simple change and we must first understand why those failures are occurring.

@ppf2 ppf2 enabled auto-merge (squash) August 13, 2025 22:48
Member

@davidkyle davidkyle left a comment


We've lost the part about the slow tokenizers now

@ppf2
Contributor Author

ppf2 commented Aug 14, 2025

@davidkyle Oh, I thought it was intentional because we are about to support fast tokenizers 😄 Do you think we should hold off on this PR until fast tokenizer support is available, and make the statement about slow/fast tokenizers then? WDYT?

@vishaangelova vishaangelova disabled auto-merge August 15, 2025 07:06
@vishaangelova
Contributor

@ppf2 I disabled the auto-merge as I saw your question on holding off on this PR, just to be sure this doesn’t get merged until you want it to.

Member

@davidkyle davidkyle left a comment


LGTM

@davidkyle
Member

> Oh I thought it was intentional because we are about to support fast tokenizers 😄

Good point. Let's merge as is, and I will concentrate on the fast tokenizer work. If I don't make any progress next week, I will create another PR here to document the use of slow tokenizers.

@vishaangelova
Contributor

@ppf2 please feel free to merge at your convenience as the auto-merge is not enabled.

@bmorelli25 bmorelli25 enabled auto-merge (squash) August 22, 2025 17:19
@bmorelli25 bmorelli25 merged commit b8bedb2 into main Aug 22, 2025
6 of 7 checks passed
@bmorelli25 bmorelli25 deleted the ppf2-slow-tokenizers branch August 22, 2025 17:21