Description
Describe the bug
If you use a path-based Hugging Face checkpoint, the tokenizer object saved into the mbridge checkpoint cannot be loaded from the config saved alongside it. The config is saved with tokenizer_model set to PosixPath("."), which would only work if you ran the load function from within the iter_XXXXX/tokenizer directory.
A loaded config from the checkpoint looks like:
(Pdb) cfg
TokenizerConfig(vocab_size=None, vocab_file=None, merge_file=None, vocab_extra_ids=0, tokenizer_type='HuggingFaceTokenizer', tokenizer_model=PosixPath('.'), tiktoken_pattern=None, tiktoken_num_special_tokens=1000, tiktoken_special_tokens=None, tokenizer_prompt_format=None, special_tokens=None, image_tag_type=None, hf_tokenizer_kwargs={'trust_remote_code': True})
Originally, tokenizer_model=PosixPath("/path/to/original/tokenizer/dir"). Leaving that original path in the config would not be portable. However, maybe we could add special handling of paths to tokenizer = load_tokenizer(mbridge_ckpt_path) so that it sets tokenizer_model=Path(mbridge_ckpt_path)/"tokenizer" at runtime if some condition is met? For example, when isinstance(cfg.tokenizer_model, Path), override that probably-invalid path with ckpt_dir/"tokenizer"?
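A minimal sketch of that runtime override, assuming nothing about how load_tokenizer is actually implemented. StubTokenizerConfig is a hypothetical stand-in for the real TokenizerConfig, and resolve_tokenizer_model is a hypothetical helper name. The extra check that the bundled tokenizer directory exists is an added assumption, not part of the proposal above.

```python
from dataclasses import dataclass, replace
from pathlib import Path
from typing import Optional, Union


@dataclass
class StubTokenizerConfig:
    """Hypothetical stand-in for the TokenizerConfig stored in the checkpoint."""

    tokenizer_type: str = "HuggingFaceTokenizer"
    tokenizer_model: Optional[Union[str, Path]] = None


def resolve_tokenizer_model(cfg, ckpt_dir):
    """Point tokenizer_model at <ckpt_dir>/tokenizer when the saved value is a Path.

    A Path saved in a portable checkpoint is probably stale, so it is replaced
    with the tokenizer directory bundled in the checkpoint, if that exists.
    String values (for example a hub model name) are left untouched.
    """
    bundled = Path(ckpt_dir) / "tokenizer"
    if isinstance(cfg.tokenizer_model, Path) and bundled.is_dir():
        # Return a copy so the config object that was loaded is not mutated.
        return replace(cfg, tokenizer_model=bundled)
    return cfg
```

With this, a loader could call resolve_tokenizer_model(cfg, mbridge_ckpt_path) before building the tokenizer, and configs that name a hub model by string would pass through unchanged.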
To be clear, the following still works with the tokenizer saved into the checkpoint, so the issue is only that the path saved in the config does not point to the right place (which it could not do anyway in a portable config):
tokenizer = _HuggingFaceTokenizer(mbridge_ckpt_path / "tokenizer")
Another option would be to use a special placeholder such as __CKPT_BASEDIR__/tokenizer and resolve the checkpoint base directory to a real path at runtime when the config is loaded. That said, the default Path(".") that is in there now is probably enough of a clue that we can safely change the value to something valid at runtime, unless there are other places where we want to resolve paths relative to the base of a checkpoint.
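The placeholder idea could look roughly like the sketch below. It assumes the placeholder is stored as a plain string prefix in the config. expand_ckpt_placeholder is a hypothetical helper name, and nothing here reflects existing mbridge behavior.

```python
from pathlib import Path

# Placeholder string proposed in this issue; it would be written at save time.
CKPT_BASEDIR_TOKEN = "__CKPT_BASEDIR__"


def expand_ckpt_placeholder(value, ckpt_dir):
    """Expand a leading __CKPT_BASEDIR__ in a config value to a real path.

    Values that do not start with the placeholder are returned unchanged.
    """
    if isinstance(value, str) and value.startswith(CKPT_BASEDIR_TOKEN):
        # Drop the placeholder and any separator that follows it.
        suffix = value[len(CKPT_BASEDIR_TOKEN):].lstrip("/")
        return Path(ckpt_dir) / suffix
    return value
```

The same helper could be applied to any other config field that needs to be resolved relative to the checkpoint base, which is the main advantage over special-casing Path(".").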