
Commit fe24501

[feat] adjust seq length
1 parent fa82707 commit fe24501

9 files changed: +50 −25 lines

README.md
Lines changed: 22 additions & 7 deletions

```diff
@@ -565,7 +565,7 @@ MiniMind training dataset download: [ModelScope](https://www.modelscope.cn/da
 ├── lora_medical.jsonl (34MB)
 ├── pretrain_hq.jsonl (1.6GB, ✨)
 ├── r1_mix_1024.jsonl (340MB)
-├── rlaif-mini.jsonl (1MB)
+├── rlaif-mini.jsonl (1MB, ✨)
 ├── sft_1024.jsonl (5.6GB)
 ├── sft_2048.jsonl (9GB)
 ├── sft_512.jsonl (7.5GB)
@@ -578,13 +578,28 @@ MiniMind training dataset download: [ModelScope](https://www.modelscope.cn/da
 * `dpo.jsonl`✨ --RLHF stage dataset (optimized and simplified, suitable for fast training)
 * `lora_identity.jsonl` --self-awareness dataset (e.g., Who are you? I am minimind...), recommended for LoRA training (can also be used for full-parameter SFT, don't be limited by the name)
 * `lora_medical.jsonl` --medical Q&A dataset, recommended for LoRA training (can also be used for full-parameter SFT, don't be limited by the name)
-* `pretrain_hq.jsonl`✨ --pretraining dataset, integrated from JiangShu Technology
-* `r1_mix_1024.jsonl` --DeepSeek-R1-1.5B distilled data, maximum character length per entry is 1024 (therefore set max_seq_len=1024 when training)
+* `pretrain_hq.jsonl`✨ --pretraining dataset, integrated from JiangShu Technology (recommended `max_seq_len≈320`)
+* `r1_mix_1024.jsonl` --DeepSeek-R1-1.5B distilled data, maximum character length per entry is 1024 (recommended `max_seq_len≈720`)
 * `rlaif-mini.jsonl` --RLAIF training dataset, 10,000 high-quality conversations randomly sampled from the SFT data, for reinforcement learning algorithms such as PPO/GRPO/SPO
-* `sft_1024.jsonl` --integrated from Qwen2.5 distilled data (a subset of sft_2048), maximum character length per entry is 1024 (therefore set max_seq_len=1024 when training)
-* `sft_2048.jsonl` --integrated from Qwen2.5 distilled data, maximum character length per entry is 2048 (therefore set max_seq_len=2048 when training)
-* `sft_512.jsonl` --integrated from JiangShu Technology SFT data, maximum character length per entry is 512 (therefore set max_seq_len=512 when training)
-* `sft_mini_512.jsonl`✨ --minimal mix of JiangShu Technology SFT data + Qwen2.5 distilled data (for quickly training a Zero model), maximum character length per entry is 512 (therefore set max_seq_len=512 when training)
+* `sft_1024.jsonl` --integrated from Qwen2.5 distilled data (a subset of sft_2048), maximum character length per entry is 1024 (recommended `max_seq_len≈650`)
+* `sft_2048.jsonl` --integrated from Qwen2.5 distilled data, maximum character length per entry is 2048 (recommended `max_seq_len≈1400`)
+* `sft_512.jsonl` --integrated from JiangShu Technology SFT data, maximum character length per entry is 512 (recommended `max_seq_len≈350`)
+* `sft_mini_512.jsonl`✨ --minimal mix of JiangShu Technology SFT data + Qwen2.5 distilled data (for quickly training a Zero model), maximum character length per entry is 512 (recommended `max_seq_len≈340`)
+
+The training parameter `max_seq_len` refers to token length, not absolute character count.
+On Chinese text this project's tokenizer yields roughly `1.5~1.7 chars/token`; on pure English the compression is about `4~5 chars/token`, with some variation across data distributions.
+The "maximum length" in the dataset names is measured in characters, so a 100-character string converts to roughly `100/1.5≈67` tokens.
+
+For example:
+
+* Chinese: `白日依山尽` (5 chars) may be split into [`白日`, `依`, `山`, `尽`] (4 tokens);
+* English: `The sun sets in the west` (24 chars) may be split into [`The `, `sun `, `sets `, `in `, `the `, `west`] (6 tokens).
+
+The "recommended setting" gives a rough estimate of the maximum token length of each dataset.
+Note that `max_seq_len` can be tuned aggressively, conservatively, or anywhere in between; no value avoids side effects entirely: samples shorter than max_seq_len are padded, wasting compute, while samples longer than max_seq_len are truncated, losing meaning.
+
+Just find a balance point between `compute efficiency` <---> `semantic completeness`.
 
 </details>
```

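The `1.5~1.7 chars/token` figure is easy to verify before committing to a `max_seq_len`. A minimal sketch, assuming an HF-compatible tokenizer saved under `../model` and a local copy of one of the jsonl files above (the paths and sampling scheme are my illustration, not code from this commit):

```python
import json
from transformers import AutoTokenizer

# Assumed local paths; adjust to your checkout.
tokenizer = AutoTokenizer.from_pretrained("../model")

chars = tokens = 0
with open("../dataset/sft_mini_512.jsonl", encoding="utf-8") as f:
    for i, line in enumerate(f):
        if i == 10_000:  # a modest sample gives a stable ratio
            break
        # Crude but serviceable: measure the serialized record itself.
        text = json.dumps(json.loads(line), ensure_ascii=False)
        chars += len(text)
        tokens += len(tokenizer(text)["input_ids"])

print(f"≈{chars / tokens:.2f} chars/token")
```

Dividing a dataset's per-entry character cap by the measured ratio lands near the recommendations above (e.g. `512/1.5≈341` for sft_mini_512 versus the recommended 340).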
README_en.md
Lines changed: 22 additions & 7 deletions

```diff
@@ -564,7 +564,7 @@ Place the downloaded dataset files in the `./dataset/` directory (✨ are recomm
 ├── lora_medical.jsonl (34MB)
 ├── pretrain_hq.jsonl (1.6GB, ✨)
 ├── r1_mix_1024.jsonl (340MB)
-├── rlaif-mini.jsonl (1MB)
+├── rlaif-mini.jsonl (1MB, ✨)
 ├── sft_1024.jsonl (5.6GB)
 ├── sft_2048.jsonl (9GB)
 ├── sft_512.jsonl (7.5GB)
@@ -577,13 +577,28 @@ Place the downloaded dataset files in the `./dataset/` directory (✨ are recomm
 * `dpo.jsonl`✨ --RLHF stage dataset (optimized and simplified, suitable for fast training)
 * `lora_identity.jsonl` --Self-awareness dataset (e.g., Who are you? I am minimind...), recommended for lora training (can also be used for full-parameter SFT, don't be limited by the name)
 * `lora_medical.jsonl` --Medical Q&A dataset, recommended for lora training (can also be used for full-parameter SFT, don't be limited by the name)
-* `pretrain_hq.jsonl`✨ --Pretraining dataset, integrated from JiangShu Technology
-* `r1_mix_1024.jsonl` --DeepSeek-R1-1.5B distilled data, maximum character length per entry is 1024 (therefore set max_seq_len=1024 when training)
+* `pretrain_hq.jsonl`✨ --Pretraining dataset, integrated from JiangShu Technology (recommended `max_seq_len≈320`)
+* `r1_mix_1024.jsonl` --DeepSeek-R1-1.5B distilled data, maximum character length per entry is 1024 (recommended `max_seq_len≈720`)
 * `rlaif-mini.jsonl` --RLAIF training dataset, 10,000 high-quality conversations randomly sampled from the SFT dataset, for training reinforcement learning algorithms such as PPO/GRPO/SPO
-* `sft_1024.jsonl` --Integrated from Qwen2.5 distilled data (a subset of sft_2048), maximum character length per entry is 1024 (therefore set max_seq_len=1024 when training)
-* `sft_2048.jsonl` --Integrated from Qwen2.5 distilled data, maximum character length per entry is 2048 (therefore set max_seq_len=2048 when training)
-* `sft_512.jsonl` --Integrated from JiangShu Technology SFT data, maximum character length per entry is 512 (therefore set max_seq_len=512 when training)
-* `sft_mini_512.jsonl`✨ --Minimal integration of JiangShu Technology SFT data + Qwen2.5 distilled data (for quick training of Zero models), maximum character length per entry is 512 (therefore set max_seq_len=512 when training)
+* `sft_1024.jsonl` --Integrated from Qwen2.5 distilled data (a subset of sft_2048), maximum character length per entry is 1024 (recommended `max_seq_len≈650`)
+* `sft_2048.jsonl` --Integrated from Qwen2.5 distilled data, maximum character length per entry is 2048 (recommended `max_seq_len≈1400`)
+* `sft_512.jsonl` --Integrated from JiangShu Technology SFT data, maximum character length per entry is 512 (recommended `max_seq_len≈350`)
+* `sft_mini_512.jsonl`✨ --Minimal integration of JiangShu Technology SFT data + Qwen2.5 distilled data (for quick training of Zero models), maximum character length per entry is 512 (recommended `max_seq_len≈340`)
+
+The training parameter `max_seq_len` refers to the **token length**, not the absolute number of characters.
+For this project's tokenizer, typical Chinese text is roughly `1.5~1.7 chars/token`, while pure English text is roughly `4~5 chars/token` (it varies with the data distribution).
+The "max length" annotated in dataset names is measured in **characters**; for example, a 100-character Chinese string converts to roughly `100/1.5≈67` tokens.
+
+For example:
+
+* Chinese: `白日依山尽` (5 chars) may be tokenized into [`白日`, `依`, `山`, `尽`] (4 tokens);
+* English: `The sun sets in the west` (24 chars) may be tokenized into [`The `, `sun `, `sets `, `in `, `the `, `west`] (6 tokens).
+
+The "recommended setting" above gives a rough estimate of the max token length for each dataset.
+Note that `max_seq_len` can be tuned aggressively / conservatively / in a balanced way: a larger value increases padding waste, while a smaller value increases truncation.
+
+Just find a balance between `compute efficiency` <---> `semantic completeness`.
 
 </details>
```

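The aggressive/conservative trade-off can be made concrete: for each candidate `max_seq_len`, count how many samples would be truncated and how much of each fixed-length batch would be padding. A sketch under the assumption that you already have per-sample token counts (the synthetic lengths here are placeholders):

```python
import numpy as np

def seq_len_tradeoff(token_lengths, candidates):
    """Print truncation rate vs padding waste for candidate max_seq_len values."""
    lengths = np.asarray(token_lengths)
    for cap in candidates:
        truncated = (lengths > cap).mean()           # fraction of samples losing tail tokens
        used = np.minimum(lengths, cap).sum()        # non-pad token slots actually filled
        padding = 1.0 - used / (cap * len(lengths))  # fraction of slots spent on padding
        print(f"max_seq_len={cap:4d}: truncated {truncated:6.1%}, padding {padding:6.1%}")

# Placeholder distribution; substitute real token counts from your dataset.
rng = np.random.default_rng(0)
lengths = rng.normal(300, 80, 50_000).clip(10, 512).astype(int)
seq_len_tradeoff(lengths, [256, 340, 512])
```

Whichever candidate keeps both percentages tolerable for your compute budget is the balance point the README asks for.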
dataset/lm_dataset.py
Lines changed: 0 additions & 5 deletions

```diff
@@ -1,14 +1,9 @@
 import json
-import random
-import re
-
 import pandas as pd
 import numpy as np
 from torch.utils.data import Dataset, DataLoader
 import torch
-from sklearn.model_selection import train_test_split
 import os
-import ast
 
 os.environ["TOKENIZERS_PARALLELISM"] = "false"
 
```
trainer/train_distill_reason.py
Lines changed: 1 addition & 1 deletion

```diff
@@ -110,7 +110,7 @@ def train_epoch(epoch, loader, iters, tokenizer, lm_config, start_step=0, wandb=
 parser.add_argument("--save_interval", type=int, default=100, help="model save interval")
 parser.add_argument('--hidden_size', default=512, type=int, help="hidden size")
 parser.add_argument('--num_hidden_layers', default=8, type=int, help="number of hidden layers")
-parser.add_argument('--max_seq_len', default=1024, type=int, help="maximum truncation length for training")
+parser.add_argument('--max_seq_len', default=720, type=int, help="maximum truncation length for training (Chinese: 1 token ≈ 1.5~1.7 chars)")
 parser.add_argument('--use_moe', default=0, type=int, choices=[0, 1], help="whether to use MoE architecture (0=no, 1=yes)")
 parser.add_argument("--data_path", type=str, default="../dataset/r1_mix_1024.jsonl", help="reasoning distillation data path")
 parser.add_argument('--from_weight', default='dpo', type=str, help="which weights to train from; default dpo")
```
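As a quick sanity check (my arithmetic, not an official derivation), the new defaults and README recommendations are consistent with the chars/token rule applied to each dataset's per-entry character cap:

```python
# (max chars per entry, recommended max_seq_len) taken from the README above
caps = {
    "sft_mini_512.jsonl": (512, 340),
    "sft_512.jsonl": (512, 350),
    "sft_1024.jsonl": (1024, 650),
    "r1_mix_1024.jsonl": (1024, 720),
    "sft_2048.jsonl": (2048, 1400),
}
for name, (max_chars, max_seq_len) in caps.items():
    print(f"{name}: {max_chars}/{max_seq_len} ≈ {max_chars / max_seq_len:.2f} chars/token")
# All ratios land around 1.4~1.6, i.e. the caps budget for near-worst-case
# token density so that only unusually token-dense entries get truncated.
```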

trainer/train_distillation.py
Lines changed: 1 addition & 1 deletion

```diff
@@ -146,7 +146,7 @@ def train_epoch(epoch, loader, iters, teacher_model, lm_config_student, start_st
 parser.add_argument("--grad_clip", type=float, default=1.0, help="gradient clipping threshold")
 parser.add_argument("--log_interval", type=int, default=100, help="logging interval")
 parser.add_argument("--save_interval", type=int, default=100, help="model save interval")
-parser.add_argument("--max_seq_len", type=int, default=512, help="maximum truncation length for training")
+parser.add_argument("--max_seq_len", type=int, default=340, help="maximum truncation length for training (Chinese: 1 token ≈ 1.5~1.7 chars)")
 parser.add_argument("--data_path", type=str, default="../dataset/sft_mini_512.jsonl", help="training data path")
 parser.add_argument('--student_hidden_size', default=512, type=int, help="student model hidden size")
 parser.add_argument('--student_num_layers', default=8, type=int, help="student model number of hidden layers")
```

trainer/train_dpo.py
Lines changed: 1 addition & 1 deletion

```diff
@@ -136,7 +136,7 @@ def train_epoch(epoch, loader, iters, ref_model, lm_config, start_step=0, wandb=
 parser.add_argument("--save_interval", type=int, default=100, help="model save interval")
 parser.add_argument('--hidden_size', default=512, type=int, help="hidden size")
 parser.add_argument('--num_hidden_layers', default=8, type=int, help="number of hidden layers")
-parser.add_argument('--max_seq_len', default=1024, type=int, help="maximum truncation length for training")
+parser.add_argument('--max_seq_len', default=1024, type=int, help="maximum truncation length for training (Chinese: 1 token ≈ 1.5~1.7 chars)")
 parser.add_argument('--use_moe', default=0, type=int, choices=[0, 1], help="whether to use MoE architecture (0=no, 1=yes)")
 parser.add_argument("--data_path", type=str, default="../dataset/dpo.jsonl", help="DPO training data path")
 parser.add_argument('--from_weight', default='full_sft', type=str, help="which weights to train from")
```

trainer/train_full_sft.py
Lines changed: 1 addition & 1 deletion

```diff
@@ -98,7 +98,7 @@ def train_epoch(epoch, loader, iters, start_step=0, wandb=None):
 parser.add_argument("--save_interval", type=int, default=100, help="model save interval")
 parser.add_argument('--hidden_size', default=512, type=int, help="hidden size")
 parser.add_argument('--num_hidden_layers', default=8, type=int, help="number of hidden layers")
-parser.add_argument('--max_seq_len', default=512, type=int, help="maximum truncation length for training")
+parser.add_argument('--max_seq_len', default=340, type=int, help="maximum truncation length for training (Chinese: 1 token ≈ 1.5~1.7 chars)")
 parser.add_argument('--use_moe', default=0, type=int, choices=[0, 1], help="whether to use MoE architecture (0=no, 1=yes)")
 parser.add_argument("--data_path", type=str, default="../dataset/sft_mini_512.jsonl", help="training data path")
 parser.add_argument('--from_weight', default='pretrain', type=str, help="which weights to train from; none = train from no prior weights")
```

trainer/train_lora.py
Lines changed: 1 addition & 1 deletion

```diff
@@ -92,7 +92,7 @@ def train_epoch(epoch, loader, iters, lora_params, start_step=0, wandb=None):
 parser.add_argument("--save_interval", type=int, default=1, help="model save interval")
 parser.add_argument('--hidden_size', default=512, type=int, help="hidden size")
 parser.add_argument('--num_hidden_layers', default=8, type=int, help="number of hidden layers")
-parser.add_argument('--max_seq_len', default=512, type=int, help="maximum truncation length for training")
+parser.add_argument('--max_seq_len', default=340, type=int, help="maximum truncation length for training (Chinese: 1 token ≈ 1.5~1.7 chars)")
 parser.add_argument('--use_moe', default=0, type=int, choices=[0, 1], help="whether to use MoE architecture (0=no, 1=yes)")
 parser.add_argument("--data_path", type=str, default="../dataset/lora_identity.jsonl", help="LoRA training data path")
 parser.add_argument('--from_weight', default='full_sft', type=str, help="which weights to train from; default full_sft")
```

trainer/train_pretrain.py
Lines changed: 1 addition & 1 deletion

```diff
@@ -97,7 +97,7 @@ def train_epoch(epoch, loader, iters, start_step=0, wandb=None):
 parser.add_argument("--save_interval", type=int, default=100, help="model save interval")
 parser.add_argument('--hidden_size', default=512, type=int, help="hidden size")
 parser.add_argument('--num_hidden_layers', default=8, type=int, help="number of hidden layers")
-parser.add_argument('--max_seq_len', default=512, type=int, help="maximum truncation length for training")
+parser.add_argument('--max_seq_len', default=340, type=int, help="maximum truncation length for training (Chinese: 1 token ≈ 1.5~1.7 chars)")
 parser.add_argument('--use_moe', default=0, type=int, choices=[0, 1], help="whether to use MoE architecture (0=no, 1=yes)")
 parser.add_argument("--data_path", type=str, default="../dataset/pretrain_hq.jsonl", help="pretraining data path")
 parser.add_argument('--from_weight', default='none', type=str, help="which weights to train from; none = train from scratch")
```
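For context on what `--max_seq_len` controls downstream: each sample is truncated and padded to a fixed token length before batching. A minimal sketch with an HF-style tokenizer (the tokenizer path and `encode` helper are illustrative, not this repo's actual dataset code):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("../model")  # assumed tokenizer directory

def encode(text: str, max_seq_len: int = 340):
    return tokenizer(
        text,
        max_length=max_seq_len,
        truncation=True,        # samples longer than max_seq_len lose their tail
        padding="max_length",   # shorter samples are padded up to max_seq_len
        return_tensors="pt",
    )

batch = encode("白日依山尽,黄河入海流。")
print(batch["input_ids"].shape)  # torch.Size([1, 340])
```

Raising the default buys fewer truncations at the cost of more pad tokens per batch, which is why the new defaults differ per dataset.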
