Tokenizers tokenizer #1261
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchchat/1261
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 4a20f69 with merge base f20f5e7.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Force-pushed from f2cba4c to 3554c3e.
@Jack-Khuu This PR is now the tip of the chain. I've opened it up to review, but I suspect this one will need a lot more discussion than the others. As an FYI, I'm working on a c++ implementation that would support […]
Moving conversation on the various open questions here. I think I've just discovered part of why converting from […] One of the main differences between the […]

UPDATE: Further digging shows this might still be ok for standard cases. For Granite Code at least, the ordering of the tokens in the […]
Force-pushed from 3554c3e to c66ac78.
Pardon the delay: I've been OOO (still am). Thanks again!!
Not a problem at all, I've been distracted on other threads too. I have some partial work towards a native […]
Thanks again @gabe-l-hart, feel free to loop me into the other threads (HF?) if you think it'll help
Out of scope of this PR (i.e. we'll fix afterwards), but we should probably move toward using an enum for the tokenizer to save us some headache
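For illustration, a rough sketch of what that enum could look like (the `TokenizerType` name and its members are hypothetical here, not something defined in this PR):

```python
from enum import Enum, auto


class TokenizerType(Enum):
    """Hypothetical enum standing in for the is_tiktoken / is_hf_tokenizers /
    is_sentencepiece boolean flags."""
    SENTENCEPIECE = auto()
    TIKTOKEN = auto()
    HF_TOKENIZERS = auto()


def describe(tok_type: TokenizerType) -> str:
    # Dispatch on a single enum value instead of checking several boolean flags;
    # "exactly one tokenizer selected" then holds by construction.
    if tok_type is TokenizerType.HF_TOKENIZERS:
        return "Hugging Face tokenizers (tokenizer.json)"
    if tok_type is TokenizerType.TIKTOKEN:
        return "tiktoken BPE (tokenizer.model)"
    return "sentencepiece (tokenizer.model)"


print(describe(TokenizerType.HF_TOKENIZERS))
```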
    import tiktoken
    from tiktoken.load import load_tiktoken_bpe

    from .base import TokenizerBase
Any reason not to use the full path?
Suggested change:
    - from .base import TokenizerBase
    + from tokenizer.base import TokenizerBase
Heh, no, I have tended towards relative imports for local files (the mental equivalent of #include "foo.h" vs #include <string> for local files vs standard/third party). Definitely no strong preference though! I'd much rather stay consistent with the rest of the project.
torchchat/cli/builder.py
Outdated
    tokenizer_path: Optional[Union[Path, str]] = None
    is_sentencepiece: bool = False
    is_tiktoken: bool = False
    is_tokenizers: bool = False
Since tokenizers as a general term is overloaded
Suggested change:
    - is_tokenizers: bool = False
    + is_hf_tokenizers: bool = False
tokenizer/tokenizers.py
Outdated
    from .base import TokenizerBase


    class TokenizersTokenizer(TokenizerBase):
Suggested change:
    - class TokenizersTokenizer(TokenizerBase):
    + class HFTokenizer(TokenizerBase):
Ditto with the file name
Nice, I like that. I was struggling with the generic name and things like TokenizersTokenizer just sound bad. I'll rename the file in a separate commit since I can't stage that as a suggestion.
torchchat/cli/builder.py
Outdated
    return

    - if self.is_tiktoken == self.is_sentencepiece:
    + if len(list(filter(lambda x: x, [self.is_tiktoken, self.is_tokenizers, self.is_sentencepiece]))) != 1:
Suggested change:
    - if len(list(filter(lambda x: x, [self.is_tiktoken, self.is_tokenizers, self.is_sentencepiece]))) != 1:
    + if sum([self.is_tiktoken, self.is_tokenizers, self.is_sentencepiece]) != 1:
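For context on why the simpler form works: Python's `bool` is a subclass of `int`, so summing the flags counts how many are set.

```python
# True == 1 and False == 0 in Python, so sum() counts how many flags are set
flags = [True, False, False]  # e.g. is_tiktoken, is_hf_tokenizers, is_sentencepiece
assert sum(flags) == 1
```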
Nice, that's way simpler!
torchchat/model.py
Outdated
    ffn_dim_multiplier: Optional[int] = None
    # Select the desired tokenizer. Defaults to sentencepiece
    use_tiktoken: bool = False
    use_tokenizers: bool = False
Suggested change:
    - use_tokenizers: bool = False
    + use_hf_tokenizers: bool = False
torchchat/model.py
Outdated
    model_type: ModelType
    transformer_args: Dict[str, Dict[str, Any]]
    use_tiktoken: bool
    use_tokenizers: bool
Suggested change:
    - use_tokenizers: bool
    + use_hf_tokenizers: bool
Force-pushed from c66ac78 to 87bcf5c.
Mini Update: We have some engineers internally who may be interested in helping on the C++ front if you get stuck, btw
That's great! I'm finally getting back to this. Will push updates for your suggestions and will push a branch with the very WIP c++ stuff.
Just pushed changes with all the renames. Thanks for the suggestion!
Force-pushed from ca7f7ee to 5f332e7.
Oops, missed one place to change the name. Should be fixed now.
very WIP on the […]
…support Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <[email protected]>
…tokenizers This allows for all HF tokenizers to be supported in the python layer. It will need significant work to offer similar compatibility at the c++ layer. Signed-off-by: Gabe Goodhart <[email protected]>
Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <[email protected]>
…kenizer Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <[email protected]>
Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <[email protected]>
pytorch#1251 Branch: TokenizersTokenizer-1251 Co-Authored-By: [email protected] Signed-off-by: Gabe Goodhart <[email protected]>
Force-pushed from 5f332e7 to 4a20f69.
Once all the changes needed to support granite have landed, be sure to add the models to the known model json (added: just saw #1336 which does that) and the README.md model list, please? Also, at that point, when the granite models work with the code that's checked in, is there a smallish granite model (ideally without a special license that needs to be accepted, to avoid having to deal with HF tokens as github secrets?) that could be run as an end-to-end test?
All Granite models (starting with the Granite Code ones) are under Apache-2.0. The smallest Granite Code model is the 3b one, which is admittedly not CI/CD sized. Once we start tackling the […]
For the discussion around […]
Dependencies
This PR is part of a sequence in support of adding Granite Code. It depends on merging the following PRs:
Issues
Closes #1251
Description
This PR adds partial support for models that use the `tokenizers` library (as opposed to `tiktoken` or `sentencepiece`) for tokenization. This PR only addresses support in the `python` runner, and it does so by creating a new class in the `tokenizer` module that simply wraps `tokenizers`.
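As a rough illustration of that wrapping approach (a minimal sketch, not the actual class from this PR; it assumes the Hugging Face `tokenizers` package and a `tokenizer.json` file on disk):

```python
from tokenizers import Tokenizer


class HFTokenizerSketch:
    """Minimal sketch of delegating to the HF `tokenizers` library; the real class
    in this PR additionally implements the project's TokenizerBase interface."""

    def __init__(self, tokenizer_json_path: str):
        # tokenizer.json bundles the vocab, merges, pre-tokenizer, and special tokens
        self._tok = Tokenizer.from_file(tokenizer_json_path)

    def encode(self, text: str) -> list[int]:
        return self._tok.encode(text).ids

    def decode(self, ids: list[int]) -> str:
        return self._tok.decode(ids)
```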
Discussion
I'm not sure this is the correct direction to go for solving this, since the `tokenizers` library is not (to the best of my knowledge) portable to the various export formats (yet). There are two main challenges to extending more tokenizer support outside of simply wrapping `tokenizers`:
Pre-tokenizers
For many tokenizers, multiple regexes are used in sequence to split the raw string. Not being a regex expert myself, it's not immediately clear to me if it's possible to merge this kind of multi-pass splitting into a single regex. For other tokenizers, a single regex is used, but it is a different expression than any of those currently implemented in `tiktoken`.
From my investigation, I think there are a few candidate paths forward:
1. A `c++` implementation of the various tokenization routines from `tokenizers` in a separate implementation of the `Tokenizer` class.
2. Extending the `c++` `TikToken` class to support multiple regexes in the pre-tokenizer.
3. Encoding the pre-tokenizer configuration in the `tokenizer.model` artifact, or somehow making these tokenizer arguments an argument at instantiation time.
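To make the multi-regex point concrete, the pre-tokenizer configuration can be inspected directly in a model's `tokenizer.json` (a sketch; the file path is illustrative and the exact layout varies by model):

```python
import json

with open("tokenizer.json", "r", encoding="utf-8") as f:
    config = json.load(f)

pre_tok = config.get("pre_tokenizer") or {}
if pre_tok.get("type") == "Sequence":
    # Several pre-tokenization steps applied in order (e.g. multiple Split regexes)
    for step in pre_tok.get("pretokenizers", []):
        print(step.get("type"), step.get("pattern"))
else:
    # A single pre-tokenizer step (e.g. one ByteLevel or Split entry)
    print(pre_tok.get("type"), pre_tok.get("pattern"))
```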
NOTE: The corresponding tokenization in `llama.cpp` lives here. This code is a full implementation of a unified tokenizer with configuration to dispatch between known patterns and optimized implementations. The config for the model that indicates which tokenizer to use is stored in the model's `GGUF` file directly, so at load time, the correct tokenizer is found based on that value.
Special Tokens
Even for models that use a single regex (and even the `llama` regex), models may use different special tokens for special functionality (chat template, FIM, tool calling, other custom prompting). In the `tokenizer.model` artifact, only the vocab is stored, so there is not currently any way to note the special tokens in serialization (similar to the need for configuration of pre-tokenizers).
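By contrast, `tokenizer.json` does record this information in its `added_tokens` section; a small sketch of listing them (field names follow the HF `tokenizers` serialization, path illustrative):

```python
import json

with open("tokenizer.json", "r", encoding="utf-8") as f:
    config = json.load(f)

# Entries flagged "special" cover things like chat-template, FIM, and EOS markers
for tok in config.get("added_tokens", []):
    if tok.get("special"):
        print(tok["id"], tok["content"])
```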