Place the downloaded dataset files in the `./dataset/` directory (✨ are recommended):
```
├── lora_medical.jsonl (34MB)
├── pretrain_hq.jsonl (1.6GB, ✨)
├── r1_mix_1024.jsonl (340MB)
├── rlaif-mini.jsonl (1MB, ✨)
├── sft_1024.jsonl (5.6GB)
├── sft_2048.jsonl (9GB)
├── sft_512.jsonl (7.5GB)
```
* `dpo.jsonl`✨ -- RLHF-stage dataset (optimized and simplified, suitable for fast training)
* `lora_identity.jsonl` -- Self-awareness dataset (e.g., "Who are you?" "I am minimind..."), recommended for LoRA training (can also be used for full-parameter SFT; don't be limited by the name)
* `lora_medical.jsonl` -- Medical Q&A dataset, recommended for LoRA training (can also be used for full-parameter SFT; don't be limited by the name)
* `pretrain_hq.jsonl`✨ -- Pretraining dataset, integrated from JiangShu Technology (recommended `max_seq_len≈320`)
* `r1_mix_1024.jsonl` -- DeepSeek-R1-1.5B distilled data, maximum character length per entry is 1024 (recommended `max_seq_len≈720`)
* `rlaif-mini.jsonl` -- RLAIF training dataset: 10,000 high-quality conversations randomly sampled from the SFT dataset, for training reinforcement learning algorithms such as PPO/GRPO/SPO
* `sft_1024.jsonl` -- Integrated from Qwen2.5 distilled data (a subset of sft_2048), maximum character length per entry is 1024 (recommended `max_seq_len≈650`)
* `sft_2048.jsonl` -- Integrated from Qwen2.5 distilled data, maximum character length per entry is 2048 (recommended `max_seq_len≈1400`)
* `sft_512.jsonl` -- Integrated from JiangShu Technology SFT data, maximum character length per entry is 512 (recommended `max_seq_len≈350`)
* `sft_mini_512.jsonl`✨ -- Minimal integration of JiangShu Technology SFT data + Qwen2.5 distilled data (for quickly training Zero models), maximum character length per entry is 512 (recommended `max_seq_len≈340`)
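Before committing to a `max_seq_len`, it can help to check how entry lengths are actually distributed in a dataset. The following is a minimal sketch (not part of this project); it assumes each `.jsonl` line is a JSON object and simply counts all string characters it contains, including nested conversation turns:

```python
import json

def char_length_stats(path, limit=None):
    """Scan a .jsonl dataset and report per-entry character lengths.

    Assumes each line is a JSON object; every string value found in it
    (including strings nested in lists/dicts of conversation turns) is
    counted, so this slightly overestimates pure content length.
    """
    def count_chars(obj):
        if isinstance(obj, str):
            return len(obj)
        if isinstance(obj, dict):
            return sum(count_chars(v) for v in obj.values())
        if isinstance(obj, list):
            return sum(count_chars(v) for v in obj)
        return 0

    lengths = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if limit is not None and i >= limit:
                break
            lengths.append(count_chars(json.loads(line)))
    lengths.sort()
    return {
        "max": lengths[-1],
        "p95": lengths[int(len(lengths) * 0.95)],
        "mean": sum(lengths) / len(lengths),
    }
```

Looking at the 95th percentile rather than the maximum is often enough, since a `max_seq_len` that covers every outlier wastes padding on the vast majority of entries.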
Note: the training parameter `max_seq_len` refers to the **token length**, not the absolute number of characters. With this project's tokenizer, typical Chinese text is roughly `1.5~1.7 chars/token`, while pure English text is roughly `4~5 chars/token` (it varies with the data distribution). The "max length" annotated in the dataset names is measured in **characters**; for example, a 100-character Chinese string converts to roughly `100/1.5 ≈ 67` tokens.

For example:

* Chinese: `白日依山尽` (5 chars) may be tokenized into [`白日`, `依`, `山`, `尽`] (4 tokens)
* English: `The sun sets in the west` (24 chars) may be tokenized into [`The `, `sun `, `sets `, `in `, `the `, `west`] (6 tokens)

The "recommended settings" above give a rough estimate of the maximum token length for each dataset. Note that `max_seq_len` can be tuned aggressively, conservatively, or somewhere in between: a larger value increases padding waste, while a smaller value increases truncation. Just find a balance between `compute efficiency` <---> `semantic completeness`.
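The character-to-token conversion described above can be sketched as a small helper. This is a hypothetical illustration, not project code: the function name and the blending by Chinese-text fraction are assumptions, and the ratios are just the rough figures quoted above:

```python
def estimate_max_seq_len(max_chars, chinese_ratio=1.0,
                         zh_chars_per_token=1.6, en_chars_per_token=4.5):
    """Estimate a token-length budget from a character-length cap.

    Blends the rough chars/token ratios quoted above (Chinese ~1.5-1.7,
    English ~4-5) by the expected fraction of Chinese text.
    This is a ballpark estimate, not an exact tokenizer count.
    """
    chars_per_token = (chinese_ratio * zh_chars_per_token
                       + (1 - chinese_ratio) * en_chars_per_token)
    return round(max_chars / chars_per_token)
```

For mostly-Chinese data this yields, e.g., `estimate_max_seq_len(512) = 320` and `estimate_max_seq_len(1024) = 640`, in the same ballpark as the recommended settings listed above.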