Commit b5b7ffb
[HF] Deprecate tokenizer_path in Toml Files (#1592)
This PR deprecates `model.tokenizer_path` in .toml files and replaces it with `model.hf_assets_path`. See #1526 for more details.

Reasoning: `tokenizer_path` is still supported in .toml files by naively overriding `hf_assets_path` when it is specified. This preserves backwards compatibility, but it is not meant to be a well-maintained option going forward.

`tokenizer_path` is used for:
- loading a tokenizer

`hf_assets_path` can be used for:
- a one-stop shop for accessing an HF repo's files
- loading a tokenizer (or multiple tokenizers)
- loading safetensors checkpoints
- loading other HF assets within the same repo (encoders, autoencoders, etc.)

We rename `tokenizer_path` to `hf_assets_path` to be consistent with this new functionality, and we additionally change the default path from `assets/tokenizer` to `assets/hf/<model-name>` to reflect the new `download_hf_assets.py` script.

Breaking changes:
- You may have to download or move tokenizers to the new default HF assets path `./assets/hf/<model-name>`.
- You may have to download "duplicate" tokenizers for different versions of the same model, e.g. Llama-3.1-8B and Llama-3.1-405B will each require a tokenizer in their respective HF assets path.
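The backward-compatibility behavior described above (a specified `tokenizer_path` naively overriding `hf_assets_path`) can be sketched as follows. `ModelConfig` and `resolve_hf_assets_path` are hypothetical names chosen for illustration, not torchtitan's actual config code.

```python
import warnings
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelConfig:
    """Hypothetical stand-in for the [model] section of a .toml config."""
    hf_assets_path: str
    tokenizer_path: Optional[str] = None  # deprecated


def resolve_hf_assets_path(cfg: ModelConfig) -> str:
    """If the deprecated tokenizer_path is set, it naively overrides hf_assets_path."""
    if cfg.tokenizer_path is not None:
        warnings.warn(
            "model.tokenizer_path is deprecated; use model.hf_assets_path instead",
            DeprecationWarning,
        )
        return cfg.tokenizer_path
    return cfg.hf_assets_path


# New-style config resolves to the HF assets path directly:
new_cfg = ModelConfig(hf_assets_path="./assets/hf/Llama-3.1-8B")

# Old-style config still works, and its tokenizer_path wins:
old_cfg = ModelConfig(
    hf_assets_path="./assets/hf/Llama-3.1-8B",
    tokenizer_path="./assets/tokenizer/Llama-3.1-8B",
)
```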
1 parent 9f47ceb commit b5b7ffb

File tree

10 files changed: +10 −10 lines
torchtitan/experiments/deepseek_v3/train_configs/deepseek_v2.toml

Lines changed: 1 addition & 1 deletion
```diff
@@ -22,7 +22,7 @@ enable_wandb = false
 name = "deepseek_v2"
 flavor = "deepseek-ai/DeepSeek-V2-Lite"
 # test tokenizer.model, for debug purpose only
-tokenizer_path = "./tests/assets/tokenizer"
+hf_assets_path = "./tests/assets/tokenizer"
 # converters = ["float8"]

 [optimizer]
```

torchtitan/experiments/llama4/train_configs/debug_model.toml

Lines changed: 1 addition & 1 deletion
```diff
@@ -22,7 +22,7 @@ enable_wandb = false
 name = "llama4"
 flavor = "debugmodel"
 # test tokenizer.model, for debug purpose only
-tokenizer_path = "./tests/assets/tokenizer"
+hf_assets_path = "./tests/assets/tokenizer"
 # converters = ["float8"]

 [optimizer]
```

torchtitan/experiments/llama4/train_configs/llama4_17bx128e.toml

Lines changed: 1 addition & 1 deletion
```diff
@@ -17,7 +17,7 @@ save_tb_folder = "tb"
 [model]
 name = "llama4"
 flavor = "17bx128e"
-tokenizer_path = "./assets/tokenizer/Llama-4-Scout-17B-16E"
+hf_assets_path = "./assets/hf/Llama-4-Scout-17B-128E"
 # converters = ["float8"]

 [optimizer]
```

torchtitan/experiments/llama4/train_configs/llama4_17bx16e.toml

Lines changed: 1 addition & 1 deletion
```diff
@@ -17,7 +17,7 @@ save_tb_folder = "tb"
 [model]
 name = "llama4"
 flavor = "17bx16e"
-tokenizer_path = "./assets/tokenizer/Llama-4-Scout-17B-16E"
+hf_assets_path = "./assets/hf/Llama-4-Scout-17B-16E"
 # converters = ["float8"]

 [optimizer]
```

torchtitan/experiments/qwen3/README.md

Lines changed: 1 addition & 1 deletion
````diff
@@ -10,7 +10,7 @@ Other model sizes are added to the args, but toml file configs need to be added
 
 #### Download Qwen3 tokenizer
 
-```python scripts/download_tokenizer.py --repo_id Qwen/Qwen3-0.6B```
+```python scripts/download_hf_assets.py --repo_id Qwen/Qwen3-0.6B --asset tokenizer```
 
 
 #### Parity with HF
````
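The download command above places tokenizer files under the new `./assets/hf/<model-name>` layout. A rough sketch of the equivalent logic using `huggingface_hub` directly is shown below; the file patterns and the `tokenizer_download_plan` helper are assumptions for illustration, and the repo's own `scripts/download_hf_assets.py` remains the supported path.

```python
def tokenizer_download_plan(repo_id: str) -> tuple[str, list[str]]:
    """Return (local_dir, allow_patterns) matching the ./assets/hf/<model-name> layout.

    The pattern list is an assumption about which files tokenizers typically need.
    """
    model_name = repo_id.split("/")[-1]
    local_dir = f"./assets/hf/{model_name}"
    patterns = ["tokenizer*", "*.model", "special_tokens_map.json"]
    return local_dir, patterns


# To actually download (requires the third-party huggingface_hub package and network):
#   from huggingface_hub import snapshot_download
#   local_dir, patterns = tokenizer_download_plan("Qwen/Qwen3-0.6B")
#   snapshot_download(repo_id="Qwen/Qwen3-0.6B", local_dir=local_dir,
#                     allow_patterns=patterns)
```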

torchtitan/models/deepseek_v3/train_configs/debug_model.toml

Lines changed: 1 addition & 1 deletion
```diff
@@ -22,7 +22,7 @@ enable_wandb = false
 name = "deepseek_v3"
 flavor = "debugmodel"
 # test tokenizer, for debug purpose only
-tokenizer_path = "./tests/assets/tokenizer"
+hf_assets_path = "./tests/assets/tokenizer"
 # converters = ["float8"]

 [optimizer]
```

torchtitan/models/deepseek_v3/train_configs/deepseek_v3_16b.toml

Lines changed: 1 addition & 1 deletion
```diff
@@ -20,7 +20,7 @@ enable_wandb = false
 [model]
 name = "deepseek_v3"
 flavor = "16B"
-tokenizer_path = "./assets/tokenizer/deepseek-moe-16b-base"
+hf_assets_path = "./assets/hf/deepseek-moe-16b-base"
 # converters = ["float8"]

 [optimizer]
```

torchtitan/models/deepseek_v3/train_configs/deepseek_v3_671b.toml

Lines changed: 1 addition & 1 deletion
```diff
@@ -20,7 +20,7 @@ enable_wandb = false
 [model]
 name = "deepseek_v3"
 flavor = "671B"
-tokenizer_path = "./assets/tokenizer/DeepSeek-V3"
+hf_assets_path = "./assets/hf/DeepSeek-V3"
 # converters = ["float8"]

 [optimizer]
```

torchtitan/models/llama3/train_configs/llama3_405b.toml

Lines changed: 1 addition & 1 deletion
```diff
@@ -18,7 +18,7 @@ save_tb_folder = "tb"
 [model]
 name = "llama3"
 flavor = "405B"
-tokenizer_path = "./assets/tokenizer/Llama-3.1-8B"
+hf_assets_path = "./assets/hf/Llama-3.1-405B"
 converters = ["float8"]

 [optimizer]
```

torchtitan/models/llama3/train_configs/llama3_70b.toml

Lines changed: 1 addition & 1 deletion
```diff
@@ -18,7 +18,7 @@ save_tb_folder = "tb"
 [model]
 name = "llama3"
 flavor = "70B"
-tokenizer_path = "./assets/tokenizer/Llama-3.1-8B"
+hf_assets_path = "./assets/hf/Llama-3.1-70B"
 # converters = ["float8"]

 [optimizer]
```
