Commit 125a350

Fix resume from checkpointing (#827)

1 parent e4c792f

98 files changed: +132 / -221 lines

Note: large commits have some content hidden by default; only a subset of the 98 changed files is shown below.

docs/source/LLM/Grok训练和推理.md

Lines changed: 2 additions & 3 deletions

@@ -72,11 +72,10 @@ torchrun \
  --save_steps 100 \
  --save_total_limit 2 \
  --logging_steps 10 \
- --deepspeed_config_path scripts/grok-1/lora_ddp_ds/zero3.json \
- --save_only_model true \
+ --deepspeed zero3-offload \
  ```

- This script requires a zero3.json file. The complete training files can be found [here](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/grok-1/lora_ddp_ds).
+ The complete training files can be found [here](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/grok-1/lora_ddp_ds).

  Here are some benchmarks from the training process:

docs/source/LLM/命令行参数.md

Lines changed: 2 additions & 1 deletion

@@ -94,6 +94,7 @@
  - `--save_on_each_node`: Takes effect during multi-machine training, default is `True`.
  - `--save_strategy`: Strategy for saving checkpoint, default is `'steps'`, options include: 'steps', 'no'.
  - `--save_safetensors`: Default is `True`.
+ - `--include_num_input_tokens_seen`: Default is `False`. Tracks the number of input tokens seen throughout training.
  - `--max_new_tokens`: Default is `2048`. This parameter only takes effect when `predict_with_generate` is set to True.
  - `--do_sample`: Default is `True`. This parameter only takes effect when `predict_with_generate` is set to True.
  - `--temperature`: Default is `0.3`. This parameter only takes effect when `predict_with_generate` is set to True.

@@ -209,7 +210,7 @@ dpo parameters inherit from the sft parameters, with the following additional parameters:
  - `--bnb_4bit_comp_dtype`: Default is `'AUTO'`. See the `sft.sh command-line arguments` for details. This parameter has no effect if `quantization_bit` is set to 0.
  - `--bnb_4bit_quant_type`: Default is `'nf4'`. See the `sft.sh command-line arguments` for details. This parameter has no effect if `quantization_bit` is set to 0.
  - `--bnb_4bit_use_double_quant`: Default is `True`. See the `sft.sh command-line arguments` for details. This parameter has no effect if `quantization_bit` is set to 0.
- - `--bnb_4bit_quant_storage`: Default is `True`. See the `sft.sh command-line arguments` for details. This parameter has no effect if `quantization_bit` is set to 0.
+ - `--bnb_4bit_quant_storage`: Default is `True`. See the `sft.sh command-line arguments` for details. This parameter has no effect if `quantization_bit` is set to 0.
  - `--max_new_tokens`: Maximum number of new tokens to generate, default is `2048`.
  - `--do_sample`: Whether to sample during generation rather than use greedy decoding, default is `True`.
  - `--temperature`: Default is `0.3`. This parameter only takes effect when `do_sample` is set to True. This parameter is used as the default value in the deployment parameters.

docs/source_en/LLM/Command-line-parameters.md

Lines changed: 1 addition & 0 deletions

@@ -93,6 +93,7 @@
  - `--save_on_each_node`: Takes effect during multi-machine training, default is `True`.
  - `--save_strategy`: Strategy for saving checkpoint, default is `'steps'`, options include: 'steps', 'no'.
  - `--save_safetensors`: Default is `True`.
+ - `--include_num_input_tokens_seen`: Default is `False`. Tracks the number of input tokens seen throughout training.
  - `--max_new_tokens`: Default is `2048`. This parameter only takes effect when `predict_with_generate` is set to True.
  - `--do_sample`: Default is `True`. This parameter only takes effect when `predict_with_generate` is set to True.
  - `--temperature`: Default is `0.3`. This parameter only takes effect when `do_sample` is set to True. This parameter will be used as default value in deployment parameters.
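
The flag added above is an ordinary boolean switch on the training command line. A minimal sketch of enabling it, with model and dataset names inferred from the example scripts later in this commit (all other arguments left at their defaults, so the command is illustrative rather than prescriptive):

# sketch: track the total number of input tokens seen during training
swift sft \
    --model_type baichuan2-13b-chat \
    --dataset dureader-robust-zh \
    --include_num_input_tokens_seen true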

docs/source_en/LLM/Grok-1-best-practice.md

Lines changed: 2 additions & 3 deletions

@@ -70,11 +70,10 @@ torchrun \
  --save_steps 100 \
  --save_total_limit 2 \
  --logging_steps 10 \
- --deepspeed_config_path scripts/grok-1/lora_ddp_ds/zero3.json \
- --save_only_model true \
+ --deepspeed zero3-offload \
  ```

- This script requires a zero3.json file. The complete training files can be found [here](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/grok-1/lora_ddp_ds).
+ The complete training files can be found [here](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/grok-1/lora_ddp_ds).

  Here are some benchmarks from the training process:
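
Both the Chinese and English Grok docs make the same swap: rather than pointing `--deepspeed_config_path` at a hand-maintained zero3.json, the command now passes one of swift's bundled DeepSpeed presets to `--deepspeed`. A short before/after sketch, with the preset spellings taken from this commit (the annotations are my reading of the preset names, not text from the docs):

# before (removed): explicit JSON config shipped alongside the example
#   --deepspeed_config_path scripts/grok-1/lora_ddp_ds/zero3.json \
# after (added): bundled preset, presumably ZeRO stage 3 with offload
    --deepspeed zero3-offload \
# the baichuan2 scripts below pass the stage-2 preset the same way
#   --deepspeed default-zero2 \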

examples/pytorch/llm/scripts/baichuan2_13b_chat/lora_ddp_ds/sft.sh

Lines changed: 1 addition & 2 deletions

@@ -12,7 +12,7 @@ torchrun \
  --model_revision master \
  --sft_type lora \
  --tuner_backend peft \
- --template_type baichuan \
+ --template_type AUTO \
  --dtype AUTO \
  --output_dir output \
  --ddp_backend nccl \

@@ -37,4 +37,3 @@ torchrun \
  --save_total_limit 2 \
  --logging_steps 10 \
  --deepspeed default-zero2 \
- --save_only_model true \
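
The `--template_type baichuan` to `--template_type AUTO` edit above repeats across the remaining baichuan2 scripts; with `AUTO`, swift is left to resolve the chat template from the model type instead of the script naming it explicitly. A minimal sketch of the resulting pattern (model type inferred from the script's directory name, dataset taken from the lora_mp script below, everything else omitted):

# sketch: let the chat template be resolved from the model type
swift sft \
    --model_type baichuan2-13b-chat \
    --template_type AUTO \
    --dataset dureader-robust-zh \
    --sft_type lora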

examples/pytorch/llm/scripts/baichuan2_13b_chat/lora_mp/sft.sh

Lines changed: 1 addition & 1 deletion

@@ -7,7 +7,7 @@ swift sft \
  --model_revision master \
  --sft_type lora \
  --tuner_backend peft \
- --template_type baichuan \
+ --template_type AUTO \
  --dtype AUTO \
  --output_dir output \
  --dataset dureader-robust-zh \

examples/pytorch/llm/scripts/baichuan2_13b_chat/lora_mp_ddp/sft.sh

Lines changed: 1 addition & 1 deletion

@@ -12,7 +12,7 @@ torchrun \
  --model_revision master \
  --sft_type lora \
  --tuner_backend peft \
- --template_type baichuan \
+ --template_type AUTO \
  --dtype AUTO \
  --output_dir output \
  --ddp_backend nccl \

examples/pytorch/llm/scripts/baichuan2_13b_chat/qlora_ddp_ds/sft.sh

Lines changed: 1 addition & 2 deletions

@@ -12,7 +12,7 @@ torchrun \
  --model_revision master \
  --sft_type lora \
  --tuner_backend peft \
- --template_type baichuan \
+ --template_type AUTO \
  --dtype AUTO \
  --output_dir output \
  --ddp_backend nccl \

@@ -39,4 +39,3 @@ torchrun \
  --save_total_limit 2 \
  --logging_steps 10 \
  --deepspeed default-zero2 \
- --save_only_model true \

examples/pytorch/llm/scripts/baichuan2_13b_chat_int4/qlora_ddp_ds/sft.sh

Lines changed: 2 additions & 3 deletions

@@ -12,7 +12,7 @@ torchrun \
  --model_revision master \
  --sft_type lora \
  --tuner_backend peft \
- --template_type baichuan \
+ --template_type AUTO \
  --dtype AUTO \
  --output_dir output \
  --ddp_backend nccl \

@@ -40,5 +40,4 @@ torchrun \
  --hub_model_id baichuan2-13b-chat-int4-qlora \
  --hub_private_repo true \
  --hub_token 'your-sdk-token' \
- --deepspeed_config_path default-zero2 \
- --save_only_model true \
+ --deepspeed default-zero2 \

examples/pytorch/llm/scripts/baichuan2_7b_chat/lora_ddp/sft.sh

Lines changed: 1 addition & 1 deletion

@@ -12,7 +12,7 @@ torchrun \
  --model_revision master \
  --sft_type lora \
  --tuner_backend peft \
- --template_type baichuan \
+ --template_type AUTO \
  --dtype AUTO \
  --output_dir output \
  --ddp_backend nccl \
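
Dropping `--save_only_model true` from these example scripts is what ties them to the commit title: a checkpoint that contains only model weights carries no optimizer, scheduler, or RNG state, so a run cannot be properly resumed from it. With full checkpoints saved again, an interrupted run can be picked up roughly as follows (a hedged sketch assuming swift's `--resume_from_checkpoint` argument; the checkpoint path is a placeholder, not one produced by this commit):

# sketch: resume an interrupted SFT run from a full checkpoint
swift sft \
    --model_type baichuan2-13b-chat \
    --dataset dureader-robust-zh \
    --sft_type lora \
    --resume_from_checkpoint output/baichuan2-13b-chat/<run-dir>/checkpoint-100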
