
Commit 7c64582

Merge branch 'main' into release/2.0
2 parents: ba3e277 + 5850472

37 files changed: +1231, -51 lines

docs/source/LLM/命令行参数.md

Lines changed: 2 additions & 1 deletion
@@ -37,7 +37,7 @@
- Finer-grained control over subsets: by default, the subsets specified at registration are used ('default' if none were specified at registration), e.g. 'sharegpt-gpt4'. If subsets are specified, the corresponding subsets of the dataset are used, e.g. 'sharegpt-gpt4:default/V3_format#2000'. Separate them with '/'.
- Support for dataset_id, e.g. 'AI-ModelScope/alpaca-gpt4-data-zh#2000', 'HF::llm-wizard/alpaca-gpt4-data-zh#2000', 'hurner/alpaca-gpt4-data-zh#2000', 'HF::shibing624/alpaca-zh#2000'. If the dataset_id is already registered, the preprocessing function, subsets, split, etc. from registration are used. Otherwise `SmartPreprocessor` is used, which supports 4 dataset formats, with the 'default' subsets and the split set to 'train'. The supported dataset formats are described in the [dataset customization and extension document](自定义与拓展.md#自定义数据集).
- Support for dataset_path, e.g. '1.jsonl#5000' (a relative path is resolved relative to the working directory).
-- `--val_dataset`: specifies a separate validation set, in the same format as the `dataset` argument. If this argument is used, `dataset_test_ratio` no longer takes effect.
+- `--val_dataset`: specifies a separate validation set, in the same format as the `dataset` argument, default is `[]`. If this argument is used, `dataset_test_ratio` no longer takes effect.
- `--dataset_seed`: specifies the seed for dataset processing, default is `42`. It exists as a random_state and does not affect the global seed.
- `--dataset_test_ratio`: specifies the ratio for splitting the sub-dataset into a training set and a validation set, default is `0.01`. If `--val_dataset` is set, this argument has no effect.
- `--train_dataset_sample`: the number of samples drawn from the training set, default is `-1`, i.e. the full training set is used for training. This argument is deprecated; please use `--dataset {dataset_name}#{dataset_sample}` instead.
@@ -240,6 +240,7 @@ The dpo parameters inherit the sft parameters; in addition, the following parameters are added:
- `--seed`: default is `42`; see `sft.sh command line arguments` for details.
- `--dtype`: default is `'AUTO'`; see `sft.sh command line arguments` for details.
- `--dataset`: default is `[]`; see `sft.sh command line arguments` for details.
+- `--val_dataset`: default is `[]`; see `sft.sh command line arguments` for details.
- `--dataset_seed`: default is `42`; see `sft.sh command line arguments` for details.
- `--dataset_test_ratio`: default is `0.01`; see `sft.sh command line arguments` for details.
- `--show_dataset_sample`: the number of validation-set samples to evaluate and display, default is `10`.

docs/source_en/LLM/Command-line-parameters.md

Lines changed: 2 additions & 1 deletion
@@ -35,7 +35,7 @@
- More fine-grained control over subsets: It uses the subsets specified during registration by default (if not specified during registration, it uses 'default'). For example, 'sharegpt-gpt4'. If subsets are specified, it uses the corresponding subset of the dataset. For example, 'sharegpt-gpt4:default/V3_format#2000'. Separated by '/'.
- Support for dataset_id. For example, 'AI-ModelScope/alpaca-gpt4-data-zh#2000', 'HF::llm-wizard/alpaca-gpt4-data-zh#2000', 'hurner/alpaca-gpt4-data-zh#2000', 'HF::shibing624/alpaca-zh#2000'. If the dataset_id has been registered, it will use the preprocessing function, subsets, split, etc. specified during registration. Otherwise, it will use `SmartPreprocessor`, which supports 4 dataset formats and uses the 'default' subsets, with split set to 'train'. The supported dataset formats can be found in the [Customizing and Extending Datasets document](Customization.md#custom-dataset).
- Support for dataset_path. For example, '1.jsonl#5000' (if it is a relative path, it is relative to the running directory).
-- `--val_dataset`: Specify separate validation datasets with the same format as the `dataset` argument. If using `val_dataset`, the `dataset_test_ratio` will be ignored.
+- `--val_dataset`: Specify separate validation datasets with the same format as the `dataset` argument, default is `[]`. If using `val_dataset`, the `dataset_test_ratio` will be ignored.
- `--dataset_seed`: Seed for dataset processing, default is `42`. Exists as random_state, does not affect global seed.
- `--dataset_test_ratio`: Used to specify the ratio for splitting the sub-dataset into training and validation sets. The default value is `0.01`. If `--val_dataset` is set, this parameter becomes ineffective.
- `--train_dataset_sample`: The number of samples for the training dataset, default is `-1`, which means using the complete training dataset for training. This parameter is deprecated, please use `--dataset {dataset_name}#{dataset_sample}` instead.
@@ -239,6 +239,7 @@ dpo parameters inherit from sft parameters, with the following added parameters:
- `--seed`: Default is `42`, see `sft.sh command line arguments` for parameter details.
- `--dtype`: Default is `'AUTO'`, see `sft.sh command line arguments` for parameter details.
- `--dataset`: Default is `[]`, see `sft.sh command line arguments` for parameter details.
+- `--val_dataset`: Default is `[]`, see `sft.sh command line arguments` for parameter details.
- `--dataset_seed`: Default is `42`, see `sft.sh command line arguments` for parameter details.
- `--dataset_test_ratio`: Default value is `0.01`. For specific parameter details, refer to the `sft.sh command line arguments`.
- `--show_dataset_sample`: Represents the number of validation set samples to evaluate and display, default is `10`.
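
As an illustrative sketch (not part of this commit's diff), the arguments documented above can be combined as follows; the model ID and sample counts are placeholders borrowed from the examples in this commit, and `--dataset_test_ratio` is ignored once `--val_dataset` is given:

# Hypothetical usage sketch: dataset sampling syntax plus a separate validation set.
CUDA_VISIBLE_DEVICES=0 \
swift sft \
    --model_id_or_path ZhipuAI/chatglm3-6b \
    --sft_type lora \
    --dataset AI-ModelScope/alpaca-gpt4-data-zh#2000 \
    --val_dataset HF::shibing624/alpaca-zh#200 \
    --output_dir output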
Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
# Experimental environment: 2 * A100
# 80GB GPU memory
# Note: TorchAcc is currently only available internally.
# torchacc dp
export USE_TORCHACC=1
export XLA_FLAGS='--xla_gpu_force_compilation_parallelism=32 --xla_multiheap_size_constraint_per_heap=4831838208 --xla_disable_hlo_passes=all-gather-combiner,all-reduce-combiner,reduce-scatter-combiner,gpu-convert-async-collectives-to-sync,rematerialization'
export XLA_IR_SHAPE_CACHE_SIZE=100000000
export XLA_ALLOCATOR_FRACTION=0.95
export XLA_EXPERIMENTAL=nonzero:masked_select

NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
MASTER_PORT=27829 \
swift sft \
    --model_id_or_path baichuan-inc/Baichuan2-13B-Chat \
    --model_layer_cls_name BaichuanLayer \
    --dataset codefuse-python-en \
    --sft_type lora \
    --output_dir output \
    --num_train_epochs 1 \
    --max_length 2048 \
    --batch_size 12 \
    --use_flash_attn true \
    --gradient_accumulation_steps 1 \
    --gradient_checkpointing no \
    --tuner_backend 'peft' \
    --dataset_test_ratio 0 \
    --save_strategy no \
    --eval_steps 2000000 \
    --save_steps 2000000 \
    --logging_steps 100 \
    --preprocess_num_proc 1 \
    --metric_warmup_step 0.1 \
    --report_to 'none'
Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
# Experimental environment: 2 * A100
# 80GB GPU memory
# Note: TorchAcc is currently only available internally.
# torchacc fsdp
export USE_TORCHACC=1
export XLA_FLAGS='--xla_gpu_force_compilation_parallelism=32 --xla_multiheap_size_constraint_per_heap=4831838208 --xla_disable_hlo_passes=all-gather-combiner,all-reduce-combiner,reduce-scatter-combiner,gpu-convert-async-collectives-to-sync,rematerialization'
export XLA_IR_SHAPE_CACHE_SIZE=100000000
export XLA_ALLOCATOR_FRACTION=0.95
export XLA_EXPERIMENTAL=nonzero:masked_select

NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
swift sft \
    --model_id_or_path baichuan-inc/Baichuan2-13B-Chat \
    --model_layer_cls_name BaichuanLayer \
    --dataset codefuse-python-en \
    --sft_type lora \
    --output_dir output \
    --num_train_epochs 1 \
    --max_length 2048 \
    --batch_size 16 \
    --use_flash_attn true \
    --gradient_accumulation_steps 1 \
    --gradient_checkpointing no \
    --tuner_backend 'peft' \
    --dataset_test_ratio 0 \
    --save_strategy no \
    --eval_steps 2000000 \
    --save_steps 2000000 \
    --logging_steps 100 \
    --preprocess_num_proc 1 \
    --metric_warmup_step 0.1 \
    --fsdp_num 2 \
    --report_to 'none'
Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
# Experimental environment: 2 * A100
# 80GB GPU memory
# Note: TorchAcc is currently only available internally.

# MASTER_ADDR=127.0.0.1 \

NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
swift sft \
    --model_id_or_path baichuan-inc/Baichuan2-13B-Chat \
    --dataset codefuse-python-en \
    --sft_type lora \
    --dtype AUTO \
    --output_dir output \
    --num_train_epochs 1 \
    --max_length 2048 \
    --batch_size 2 \
    --use_flash_attn true \
    --gradient_accumulation_steps 1 \
    --dataset_test_ratio 0 \
    --save_strategy no \
    --eval_steps 2000000 \
    --save_steps 2000000 \
    --logging_steps 100 \
    --preprocess_num_proc 1 \
    --metric_warmup_step 0.1 \
    --report_to 'none'
Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
# Experimental environment: 2 * A100
# 80GB GPU memory
# Note: TorchAcc is currently only available internally.
# torchacc dp
export USE_TORCHACC=1
export XLA_FLAGS='--xla_gpu_force_compilation_parallelism=32 --xla_multiheap_size_constraint_per_heap=4831838208 --xla_disable_hlo_passes=all-gather-combiner,all-reduce-combiner,reduce-scatter-combiner,gpu-convert-async-collectives-to-sync,rematerialization'
export XLA_IR_SHAPE_CACHE_SIZE=100000000
export XLA_ALLOCATOR_FRACTION=0.95
export XLA_EXPERIMENTAL=nonzero:masked_select


NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
MASTER_PORT=27829 \
swift sft \
    --model_id_or_path ZhipuAI/chatglm3-6b \
    --model_layer_cls_name GLMBlock \
    --dataset codefuse-python-en \
    --sft_type lora \
    --output_dir output \
    --num_train_epochs 1 \
    --max_length 2048 \
    --batch_size 16 \
    --use_flash_attn true \
    --gradient_accumulation_steps 1 \
    --gradient_checkpointing no \
    --tuner_backend 'peft' \
    --dataset_test_ratio 0 \
    --save_strategy no \
    --eval_steps 2000000 \
    --save_steps 2000000 \
    --logging_steps 100 \
    --preprocess_num_proc 1 \
    --metric_warmup_step 0.1 \
    --report_to 'none'
Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
# Experimental environment: 2 * A100
# 80GB GPU memory
# Note: TorchAcc is currently only available internally.
# torchacc fsdp
export USE_TORCHACC=1
export XLA_FLAGS='--xla_gpu_force_compilation_parallelism=32 --xla_multiheap_size_constraint_per_heap=4831838208 --xla_disable_hlo_passes=all-gather-combiner,all-reduce-combiner,reduce-scatter-combiner,gpu-convert-async-collectives-to-sync,rematerialization'
export XLA_IR_SHAPE_CACHE_SIZE=100000000
export XLA_ALLOCATOR_FRACTION=0.95
export XLA_EXPERIMENTAL=nonzero:masked_select


NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
swift sft \
    --model_id_or_path ZhipuAI/chatglm3-6b \
    --model_layer_cls_name GLMBlock \
    --dataset codefuse-python-en \
    --sft_type lora \
    --output_dir output \
    --num_train_epochs 1 \
    --max_length 2048 \
    --batch_size 16 \
    --use_flash_attn true \
    --gradient_accumulation_steps 1 \
    --gradient_checkpointing no \
    --tuner_backend 'peft' \
    --dataset_test_ratio 0 \
    --save_strategy no \
    --eval_steps 2000000 \
    --save_steps 2000000 \
    --logging_steps 100 \
    --preprocess_num_proc 1 \
    --metric_warmup_step 0.1 \
    --fsdp_num 2 \
    --report_to 'none'
Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
# Experimental environment: 2 * A100
# 80GB GPU memory
# Note: TorchAcc is currently only available internally.

# MASTER_ADDR=127.0.0.1 \
# MASTER_PORT=12356 \
NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
swift sft \
    --model_id_or_path ZhipuAI/chatglm3-6b \
    --dataset codefuse-python-en \
    --sft_type lora \
    --dtype AUTO \
    --output_dir output \
    --num_train_epochs 1 \
    --max_length 2048 \
    --batch_size 4 \
    --use_flash_attn true \
    --gradient_accumulation_steps 1 \
    --dataset_test_ratio 0 \
    --save_strategy no \
    --eval_steps 2000000 \
    --save_steps 2000000 \
    --logging_steps 100 \
    --preprocess_num_proc 1 \
    --metric_warmup_step 0.1 \
    --report_to 'none'
Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
# Experimental environment: 2 * A100
# 80GB GPU memory
# Note: TorchAcc is currently only available internally.

export USE_TORCHACC=1
export TORCHACC_TRIM_GRAPH=1
export XLA_FLAGS='--xla_gpu_force_compilation_parallelism=32 --xla_multiheap_size_constraint_per_heap=4831838208 --xla_disable_hlo_passes=all-gather-combiner,all-reduce-combiner,reduce-scatter-combiner,gpu-convert-async-collectives-to-sync,rematerialization'
export XLA_IR_SHAPE_CACHE_SIZE=100000000
export XLA_ALLOCATOR_FRACTION=0.95
export XLA_EXPERIMENTAL=nonzero:masked_select

NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
swift sft \
    --model_id_or_path modelscope/Llama-2-13b-chat-ms \
    --model_layer_cls_name LlamaDecoderLayer \
    --dataset codefuse-python-en \
    --template_type llama \
    --sft_type lora \
    --output_dir output \
    --num_train_epochs 1 \
    --max_length 2048 \
    --batch_size 16 \
    --use_flash_attn true \
    --gradient_accumulation_steps 1 \
    --gradient_checkpointing no \
    --tuner_backend 'peft' \
    --dataset_test_ratio 0 \
    --save_strategy no \
    --eval_steps 2000000 \
    --save_steps 2000000 \
    --logging_steps 100 \
    --preprocess_num_proc 1 \
    --metric_warmup_step 0.1 \
    --report_to 'none'
Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
# Experimental environment: 2 * A100
# 80GB GPU memory
# Note: TorchAcc is currently only available internally.
export USE_TORCHACC=1
export TORCHACC_TRIM_GRAPH=1
export XLA_FLAGS='--xla_gpu_force_compilation_parallelism=32 --xla_multiheap_size_constraint_per_heap=4831838208 --xla_disable_hlo_passes=all-gather-combiner,all-reduce-combiner,reduce-scatter-combiner,gpu-convert-async-collectives-to-sync,rematerialization'
export XLA_IR_SHAPE_CACHE_SIZE=100000000
export XLA_ALLOCATOR_FRACTION=0.95
export XLA_EXPERIMENTAL=nonzero:masked_select

NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
MASTER_PORT=27829 \
swift sft \
    --model_id_or_path modelscope/Llama-2-13b-chat-ms \
    --model_layer_cls_name LlamaDecoderLayer \
    --dataset codefuse-python-en \
    --template_type llama \
    --sft_type lora \
    --output_dir output \
    --num_train_epochs 1 \
    --max_length 2048 \
    --batch_size 24 \
    --use_flash_attn true \
    --gradient_accumulation_steps 1 \
    --gradient_checkpointing no \
    --tuner_backend 'peft' \
    --dataset_test_ratio 0 \
    --save_strategy no \
    --eval_steps 2000000 \
    --save_steps 2000000 \
    --logging_steps 100 \
    --preprocess_num_proc 1 \
    --metric_warmup_step 0.1 \
    --fsdp_num 2 \
    --report_to 'none'
