Tokenizers tokenizer #1261
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchchat/1261
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 4a20f69 with merge base f20f5e7.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Force-pushed from f2cba4c to 3554c3e.
@Jack-Khuu This PR is now the tip of the chain. I've opened it up to review, but I suspect this one will need a lot more discussion than the others. As an FYI, I'm working on a c++ implementation that would support […]
Moving conversation on the various open questions here. I think I've just discovered part of why converting from […] One of the main differences between the […]

UPDATE: Further digging shows this might still be ok for standard cases. For Granite Code at least, the ordering of the tokens in the […]
Force-pushed from 3554c3e to c66ac78.
Pardon the delay: I've been OOO (still am). Thanks again!!
Not a problem at all, I've been distracted on other threads too. I have some partial work towards a native […]
Thanks again @gabe-l-hart, feel free to loop me into the other threads (HF?) if you think it'll help
Out of scope of this PR (i.e. we'll fix afterwards), but we should probably move toward using an enum for the tokenizer to save us some headache
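For illustration, a rough sketch of what that enum could look like (the `TokenizerType` name and its members are hypothetical here, not something defined in this PR):

```python
from enum import Enum, auto


class TokenizerType(Enum):
    """Hypothetical enum standing in for the is_tiktoken / is_hf_tokenizers /
    is_sentencepiece boolean flags."""
    SENTENCEPIECE = auto()
    TIKTOKEN = auto()
    HF_TOKENIZERS = auto()


def describe(tok_type: TokenizerType) -> str:
    # Dispatch on a single enum value instead of checking several boolean flags;
    # "exactly one tokenizer selected" then holds by construction.
    if tok_type is TokenizerType.HF_TOKENIZERS:
        return "Hugging Face tokenizers (tokenizer.json)"
    if tok_type is TokenizerType.TIKTOKEN:
        return "tiktoken BPE (tokenizer.model)"
    return "sentencepiece (tokenizer.model)"


print(describe(TokenizerType.HF_TOKENIZERS))
```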
    import tiktoken
    from tiktoken.load import load_tiktoken_bpe

    from .base import TokenizerBase
Any reason not to use the full path?
Suggested change:
    - from .base import TokenizerBase
    + from tokenizer.base import TokenizerBase
Heh, no, I have tended towards relative imports for local files (the mental equivalent of #include "foo.h" vs #include <string> for local files vs standard/third party). Definitely no strong preference though! I'd much rather stay consistent with the rest of the project.
torchchat/cli/builder.py
Outdated
    tokenizer_path: Optional[Union[Path, str]] = None
    is_sentencepiece: bool = False
    is_tiktoken: bool = False
    is_tokenizers: bool = False
Since tokenizers as a general term is overloaded
Suggested change:
    - is_tokenizers: bool = False
    + is_hf_tokenizers: bool = False
tokenizer/tokenizers.py
Outdated
    from .base import TokenizerBase


    class TokenizersTokenizer(TokenizerBase):
Suggested change:
    - class TokenizersTokenizer(TokenizerBase):
    + class HFTokenizer(TokenizerBase):
Ditto with the file name
Nice, I like that. I was struggling with the generic name and things like TokenizersTokenizer just sound bad. I'll rename the file in a separate commit since I can't stage that as a suggestion.
torchchat/cli/builder.py
Outdated
    return

    - if self.is_tiktoken == self.is_sentencepiece:
    + if len(list(filter(lambda x: x, [self.is_tiktoken, self.is_tokenizers, self.is_sentencepiece]))) != 1:
Suggested change:
    - if len(list(filter(lambda x: x, [self.is_tiktoken, self.is_tokenizers, self.is_sentencepiece]))) != 1:
    + if sum([self.is_tiktoken, self.is_tokenizers, self.is_sentencepiece]) != 1:
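For context on why the simpler form works: Python's `bool` is a subclass of `int`, so summing the flags counts how many are set.

```python
# True == 1 and False == 0 in Python, so sum() counts how many flags are set
flags = [True, False, False]  # e.g. is_tiktoken, is_hf_tokenizers, is_sentencepiece
assert sum(flags) == 1
```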
Nice, that's way simpler!
torchchat/model.py
Outdated
    ffn_dim_multiplier: Optional[int] = None
    # Select the desired tokenizer. Defaults to sentencepiece
    use_tiktoken: bool = False
    use_tokenizers: bool = False
Suggested change:
    - use_tokenizers: bool = False
    + use_hf_tokenizers: bool = False
torchchat/model.py
Outdated
    model_type: ModelType
    transformer_args: Dict[str, Dict[str, Any]]
    use_tiktoken: bool
    use_tokenizers: bool
Suggested change:
    - use_tokenizers: bool
    + use_hf_tokenizers: bool
Force-pushed from c66ac78 to 87bcf5c.
Mini Update: We have some engineers internally who may be interested in helping on the C++ front if you get stuck, btw
That's great! I'm finally getting back to this. Will push updates for your suggestions and will push a branch with the very WIP c++ stuff.
Just pushed changes with all the renames. Thanks for the suggestion!
Force-pushed from ca7f7ee to 5f332e7.
Oops, missed one place to change the name. Should be fixed now.
very WIP on the […]
…support Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <[email protected]>
…tokenizers This allows for all HF tokenizers to be supported in the python layer. It will need significant work to offer similar compatibility at the c++ layer. Signed-off-by: Gabe Goodhart <[email protected]>
Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <[email protected]>
…kenizer Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <[email protected]>
Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <[email protected]>
pytorch#1251 Branch: TokenizersTokenizer-1251 Co-Authored-By: [email protected] Signed-off-by: Gabe Goodhart <[email protected]>
Force-pushed from 5f332e7 to 4a20f69.
Once all the changes needed to support granite have landed, be sure to add the models to the known model json (added: just saw #1336 which does that) and the README.md model list, please? Also, at that point, when the granite models work with the code that's checked in, is there a smallish granite model (ideally without a special license that needs to be accepted, to avoid having to deal with HF tokens as github secrets?) that could be run as an end-to-end test?
All Granite models (starting with the Granite Code ones) are under Apache-2.0. The smallest Granite Code model is the 3b one, which is admittedly not CI/CD sized. Once we start tackling the […]
For the discussion around […]
Dependencies
This PR is part of a sequence in support of adding Granite Code. It depends on merging the following PRs:
Issues
Closes #1251
Description
This PR adds partial support for models that use the `tokenizers` library (as opposed to `tiktoken` or `sentencepiece`) for tokenization. This PR only addresses support in the `python` runner, and it does so by creating a new class in the `tokenizer` module that simply wraps `tokenizers`.
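As a rough illustration of that wrapping approach (a minimal sketch, not the actual class from this PR; it assumes the Hugging Face `tokenizers` package and a `tokenizer.json` file on disk):

```python
from tokenizers import Tokenizer


class HFTokenizerSketch:
    """Minimal sketch of delegating to the HF `tokenizers` library; the real class
    in this PR additionally implements the project's TokenizerBase interface."""

    def __init__(self, tokenizer_json_path: str):
        # tokenizer.json bundles the vocab, merges, pre-tokenizer, and special tokens
        self._tok = Tokenizer.from_file(tokenizer_json_path)

    def encode(self, text: str) -> list[int]:
        return self._tok.encode(text).ids

    def decode(self, ids: list[int]) -> str:
        return self._tok.decode(ids)
```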
Discussion
I'm not sure this is the correct direction to go for solving this, since the `tokenizers` library is not (to the best of my knowledge) portable to the various export formats (yet). There are two main challenges to extending more tokenizer support outside of simply wrapping `tokenizers`:
Pre-tokenizers
For many tokenizers, multiple regexes are used in sequence to split the raw string. Not being a regex expert myself, it's not immediately clear to me if it's possible to merge this kind of multi-pass splitting into a single regex. For other tokenizers, a single regex is used, but it is a different expression than any of those currently implemented in `tiktoken`.
From my investigation, I think there are a few candidate paths forward:
1. A `c++` implementation of the various tokenization routines from `tokenizers` in a separate implementation of the `Tokenizer` class.
2. Extending the `c++` `TikToken` class to support multiple regexes in the pre-tokenizer.
3. Encoding the pre-tokenizer configuration in the `tokenizer.model` artifact, or somehow making these tokenizer arguments an argument at instantiation time.
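To make the multi-regex point concrete, the pre-tokenizer configuration can be inspected directly in a model's `tokenizer.json` (a sketch; the file path is illustrative and the exact layout varies by model):

```python
import json

with open("tokenizer.json", "r", encoding="utf-8") as f:
    config = json.load(f)

pre_tok = config.get("pre_tokenizer") or {}
if pre_tok.get("type") == "Sequence":
    # Several pre-tokenization steps applied in order (e.g. multiple Split regexes)
    for step in pre_tok.get("pretokenizers", []):
        print(step.get("type"), step.get("pattern"))
else:
    # A single pre-tokenizer step (e.g. one ByteLevel or Split entry)
    print(pre_tok.get("type"), pre_tok.get("pattern"))
```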
NOTE: The corresponding tokenization in `llama.cpp` lives here. This code is a full implementation of a unified tokenizer with configuration to dispatch between known patterns and optimized implementations. The config for the model that indicates which tokenizer to use is stored in the model's `GGUF` file directly, so at load time, the correct tokenizer is found based on that value.
Special Tokens
Even for models that use a single regex (and even the `llama` regex), models may use different special tokens for special functionality (chat template, FIM, tool calling, other custom prompting). In the `tokenizer.model` artifact, only the vocab is stored, so there is not currently any way to note the special tokens in serialization (similar to the need for configuration of pre-tokenizers).
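By contrast, `tokenizer.json` does record this information in its `added_tokens` section; a small sketch of listing them (field names follow the HF `tokenizers` serialization, path illustrative):

```python
import json

with open("tokenizer.json", "r", encoding="utf-8") as f:
    config = json.load(f)

# Entries flagged "special" cover things like chat-template, FIM, and EOS markers
for tok in config.get("added_tokens", []):
    if tok.get("special"):
        print(tok["id"], tok["content"])
```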