Skip to content

Path-based huggingface tokenizers do not load from the checkpoint when using load_tokenizer(...) #1900

@jstjohn

Description

@jstjohn

Describe the bug

If you use a path based hugging face checkpoint, the object saved into the mbridge checkpoint will not load from the config saved into the checkpoint. The problem is that the config is saved with the tokenizer_model set to PosixPath(".") which would only work if you ran the load function from within the iter_XXXXX/tokenizer directory.

A loaded config from the checkpoint looks like:

(Pdb) cfg
TokenizerConfig(vocab_size=None, vocab_file=None, merge_file=None, vocab_extra_ids=0, tokenizer_type='HuggingFaceTokenizer', tokenizer_model=PosixPath('.'), tiktoken_pattern=None, tiktoken_num_special_tokens=1000, tiktoken_special_tokens=None, tokenizer_prompt_format=None, special_tokens=None, image_tag_type=None, hf_tokenizer_kwargs={'trust_remote_code': True})

Where originally the tokenizer_model=PosixPath("/path/to/original/tokenizer/dir"). Leaving it as is would not be portable. However maybe we could add special handling of paths to tokenizer = load_tokenizer(mbridge_ckpt_path) so that it sets tokenizer_model=Path(mbridge_ckpt_path)/"tokenizer") at runtime if some condition is met? Maybe when isinstance(cfg.tokenizer_model, Path) then override that probably invalid path with ckpt_dir/"tokenizer"?

To be clear the following works still with the tokenizer saved into the checkpoint, so the issue is that the path saved in the config does not point to the right place (it couldn't though with a portable config):

tokenizer = _HuggingFaceTokenizer(mbridge_ckpt_path / "tokenizer")

Another option would be to use some magic special symbol like __CKPT_BASEDIR__/tokenizer and then resolve that checkpoint basedir at runtime from the config to a path? I think the default Path(".") that is in there now is probably enough of a clue that we can change the value safely to something valid at runtime, unless we have other places we want to resolve paths that are relative to the base of a checkpoint.

Expected behavior

A clear and concise description of what you expected to happen.

Additional context

Add any other context about the problem here.

Metadata

Metadata

Assignees

Labels

bugSomething isn't working

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions