
Commit d1224e0

support transformers==4.41 (#979)
1 parent 14a5283 commit d1224e0

6 files changed: +48 -24 lines changed

docs/source/LLM/命令行参数.md

Lines changed: 5 additions & 5 deletions
@@ -39,15 +39,15 @@
 - Support for dataset_path. E.g. '1.jsonl#5000' (if a relative path is given, it is relative to the running directory).
 - `--val_dataset`: Specifies a separate validation set, with the same format as the `dataset` argument. If this argument is used, `dataset_test_ratio` no longer takes effect.
 - `--dataset_seed`: Seed for dataset processing, default is `42`. Exists as a random_state and does not affect the global seed.
-- `--dataset_test_ratio`: Ratio for splitting the sub-dataset into training and validation sets, default is `0.01`.
+- `--dataset_test_ratio`: Ratio for splitting the sub-dataset into training and validation sets, default is `0.01`. If `--val_dataset` is set, this argument no longer takes effect.
 - `--train_dataset_sample`: Number of samples taken from the training set, default is `-1`, i.e. train on the complete training set. This argument is deprecated, please use `--dataset {dataset_name}#{dataset_sample}`.
-- `--val_dataset_sample`: Samples the validation set, default is `None`, which automatically selects a suitable number of samples for validation. If set to `-1`, the complete validation set is used. This argument is deprecated; the size of the validation set is fully controlled by `dataset_test_ratio`.
+- `--val_dataset_sample`: Samples the validation set, default is `None`, which automatically selects a suitable number of samples for validation. If set to `-1`, the complete validation set is used. This argument is deprecated; the size of the validation set is controlled by `--dataset_test_ratio` or `--val_dataset {dataset_name}#{dataset_sample}`.
 - `--system`: The system used in the dialogue template, default is `None`, i.e. the model's default system is used. If set to '', no system is used.
 - `--max_length`: Maximum token length, default is `2048`. Avoids OOM problems caused by individual overly long samples. With `--truncation_strategy delete`, samples whose length exceeds max_length are removed. With `--truncation_strategy truncation_left`, the leading tokens are cut off: `input_ids[-max_length:]`. If set to -1, there is no limit.
 - `--truncation_strategy`: Default is `'delete'`, which removes samples exceeding max_length from the dataset. `'truncation_left'` cuts off the excess text on the left, which may remove special tokens and hurt performance; not recommended.
 - `--check_dataset_strategy`: Default is `'none'`, i.e. no checking. If you are training an LLM, `'warning'` is the recommended data check strategy. For training targets such as sentence classification, `'none'` is recommended.
 - `--custom_train_dataset_path`: Default is `[]`. This argument is deprecated, please use `--dataset {dataset_path}`.
-- `--custom_val_dataset_path`: Default is `[]`. This argument is deprecated; training and validation sets are no longer distinguished and are split uniformly via `dataset_test_ratio`. Please use `--dataset {dataset_path}`.
+- `--custom_val_dataset_path`: Default is `[]`. This argument is deprecated. Please use `--val_dataset {dataset_path}`.
 - `--self_cognition_sample`: Number of samples from the self-cognition dataset, default is `0`. When setting this to a value >0, you must also specify `--model_name` and `--model_author`. This argument is deprecated, please use `--dataset self-cognition#{self_cognition_sample}`.
 - `--model_name`: Default is `[None, None]`. If self-cognition dataset sampling is enabled (i.e. `--dataset self-cognition` is specified or self_cognition_sample>0), you need to pass two values: the model's Chinese and English names. E.g. `--model_name 小黄 'Xiao Huang'`. For more, see the [Self-Cognition Fine-tuning Best Practices](自我认知微调最佳实践.md).
 - `--model_author`: Default is `[None, None]`. If self-cognition dataset sampling is enabled, you need to pass two values: the author's Chinese and English names. E.g. `--model_author 魔搭 ModelScope`.
@@ -241,14 +241,14 @@ dpo parameters inherit the sft parameters and additionally add the following:
 - `--dtype`: Default is `'AUTO'`; see `sft.sh command line arguments` for details.
 - `--dataset`: Default is `[]`; see `sft.sh command line arguments` for details.
 - `--dataset_seed`: Default is `42`; see `sft.sh command line arguments` for details.
-- `--dataset_test_ratio`: Default is `None`; if `--load_dataset_config true` is set, the dataset_test_ratio from training is used, otherwise it is set to 1. See `sft.sh command line arguments` for details.
+- `--dataset_test_ratio`: Default is `0.01`. See `sft.sh command line arguments` for details.
 - `--show_dataset_sample`: Number of validation-set samples to evaluate and display, default is `10`.
 - `--system`: Default is `None`. See `sft.sh command line arguments` for details.
 - `--max_length`: Default is `-1`. See `sft.sh command line arguments` for details.
 - `--truncation_strategy`: Default is `'delete'`. See `sft.sh command line arguments` for details.
 - `--check_dataset_strategy`: Default is `'none'`; see `sft.sh command line arguments` for details.
 - `--custom_train_dataset_path`: Default is `[]`. This argument is deprecated, please use `--dataset {dataset_path}`.
-- `--custom_val_dataset_path`: Default is `[]`. This argument is deprecated; training and validation sets are no longer distinguished and are split uniformly via `dataset_test_ratio`. Please use `--dataset {dataset_path}`.
+- `--custom_val_dataset_path`: Default is `[]`. This argument is deprecated. Please use `--val_dataset {dataset_path}`.
 - `--quantization_bit`: Default is 0. See `sft.sh command line arguments` for details.
 - `--quant_method`: Quantization method, default is `None`. Options are 'bnb', 'hqq', 'eetq'.
 - `--hqq_axis`: hqq quantization argument; the axis along which grouping is performed. Default is `0`; options are `0` and `1`.
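
These documentation changes boil down to one rule: an explicit `--val_dataset` now takes precedence over the ratio-based split. Below is a minimal training-side sketch of that rule; the dataset paths are placeholders, and the programmatic entry points mirror the ones used in this commit's own tests:

```python
# Sketch only: train.jsonl / val.jsonl are placeholder paths, not part of this commit.
from swift.llm import ModelType, SftArguments, sft_main

output = sft_main(
    SftArguments(
        model_type=ModelType.qwen1half_1_8b,
        dataset=['train.jsonl'],     # no validation split is carved from this
        val_dataset=['val.jsonl']))  # explicit validation set; dataset_test_ratio is ignored
```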

docs/source_en/LLM/Command-line-parameters.md

Lines changed: 5 additions & 5 deletions
@@ -37,16 +37,16 @@
 - Support for dataset_path. For example, '1.jsonl#5000' (if it is a relative path, it is relative to the running directory).
 - `--val_dataset`: Specify separate validation datasets with the same format as the `dataset` argument. If `val_dataset` is used, `dataset_test_ratio` will be ignored.
 - `--dataset_seed`: Seed for dataset processing, default is `42`. Exists as a random_state and does not affect the global seed.
-- `--dataset_test_ratio`: Ratio for splitting the sub-dataset into train and validation sets, default is `0.01`.
+- `--dataset_test_ratio`: Used to specify the ratio for splitting the sub-dataset into training and validation sets. The default value is `0.01`. If `--val_dataset` is set, this parameter becomes ineffective.
 - `--train_dataset_sample`: The number of samples for the training dataset, default is `-1`, which means using the complete training dataset. This parameter is deprecated, please use `--dataset {dataset_name}#{dataset_sample}` instead.
-- `--val_dataset_sample`: Sampling for the validation dataset, default is `None`, which automatically selects an appropriate number of samples for validation. If you specify `-1`, the complete validation dataset is used. This parameter is deprecated; the number of samples in the validation dataset is fully controlled by dataset_test_ratio.
+- `--val_dataset_sample`: Used to sample the validation set, default is `None`, which automatically selects a suitable number of samples for validation. If you specify `-1`, the complete validation set is used. This parameter is deprecated; the number of samples in the validation set is controlled by `--dataset_test_ratio` or `--val_dataset {dataset_name}#{dataset_sample}`.
 - `--system`: System used in the dialogue template, default is `None`, i.e. the model's default system is used. If set to '', no system is used.
 - `--max_length`: Maximum token length, default is `2048`. Avoids OOM issues caused by individual overly long samples. When `--truncation_strategy delete` is specified, samples exceeding max_length will be deleted. When `--truncation_strategy truncation_left` is specified, the leftmost tokens will be truncated: `input_ids[-max_length:]`. If set to -1, there is no limit.
 - `--truncation_strategy`: Default is `'delete'`, which removes samples exceeding max_length from the dataset. `'truncation_left'` truncates excess text from the left, which may cut special tokens and affect performance; not recommended.
 - `--check_dataset_strategy`: Default is `'none'`, i.e. no checking. If training an LLM, `'warning'` is recommended as the data check strategy. If your training target is sentence classification etc., `'none'` is recommended.

 - `--custom_train_dataset_path`: Default value is `[]`. This parameter has been deprecated, please use `--dataset {dataset_path}`.
-- `--custom_val_dataset_path`: Default value is `[]`. This parameter has been deprecated. There is no longer a distinction between training and validation datasets; the split is done uniformly via `dataset_test_ratio`. Please use `--dataset {dataset_path}`.
+- `--custom_val_dataset_path`: Default value is `[]`. This parameter is deprecated. Please use `--val_dataset {dataset_path}` instead.
 - `--self_cognition_sample`: The number of samples for the self-cognition dataset. Default is `0`. If you set this value to >0, you need to specify `--model_name` and `--model_author` at the same time. This parameter has been deprecated, please use `--dataset self-cognition#{self_cognition_sample}` instead.
 - `--model_name`: Default value is `[None, None]`. If self-cognition dataset sampling is enabled (i.e., specifying `--dataset self-cognition` or self_cognition_sample>0), you need to provide two values, representing the Chinese and English names of the model respectively. For example: `--model_name 小黄 'Xiao Huang'`. If you want to learn more, you can refer to the [Self-Cognition Fine-tuning Best Practices](Self-cognition-best-practice.md).
 - `--model_author`: Default is `[None, None]`. If self-cognition dataset sampling is enabled, you need to pass two values, representing the author's Chinese and English names respectively. E.g. `--model_author 魔搭 ModelScope`.
@@ -240,14 +240,14 @@ dpo parameters inherit from sft parameters, with the following added parameters:
 - `--dtype`: Default is `'AUTO'`, see `sft.sh command line arguments` for parameter details.
 - `--dataset`: Default is `[]`, see `sft.sh command line arguments` for parameter details.
 - `--dataset_seed`: Default is `42`, see `sft.sh command line arguments` for parameter details.
-- `--dataset_test_ratio`: Default value is `None`; if `--load_dataset_config true` is set, the dataset_test_ratio from training is used, otherwise it is set to 1. For specific parameter details, refer to the `sft.sh command line arguments`.
+- `--dataset_test_ratio`: Default value is `0.01`. For specific parameter details, refer to the `sft.sh command line arguments`.
 - `--show_dataset_sample`: Number of validation-set samples to evaluate and display, default is `10`.
 - `--system`: Default is `None`. See `sft.sh command line arguments` for parameter details.
 - `--max_length`: Default is `-1`. See `sft.sh command line arguments` for parameter details.
 - `--truncation_strategy`: Default is `'delete'`. See `sft.sh command line arguments` for parameter details.
 - `--check_dataset_strategy`: Default is `'none'`, see `sft.sh command line arguments` for parameter details.
 - `--custom_train_dataset_path`: Default value is `[]`. This parameter has been deprecated, please use `--dataset {dataset_path}`.
-- `--custom_val_dataset_path`: Default value is `[]`. This parameter has been deprecated. There is no longer a distinction between training and validation datasets; the split is done uniformly via `dataset_test_ratio`. Please use `--dataset {dataset_path}`.
+- `--custom_val_dataset_path`: Default value is `[]`. This parameter is deprecated. Please use `--val_dataset {dataset_path}` instead.
 - `--quantization_bit`: Default is 0. See `sft.sh command line arguments` for parameter details.
 - `--quant_method`: Quantization method, default is `None`. You can choose from 'bnb', 'hqq', 'eetq'.
 - `--hqq_axis`: hqq argument; the axis along which grouping is performed. Supported values are `0` and `1`. Default is `0`.
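
The inference side follows the same precedence: when `val_dataset` is passed, the entire named dataset is used for validation rather than a `dataset_test_ratio` slice. A hedged usage sketch, with a hypothetical checkpoint directory and placeholder jsonl file:

```python
# Sketch only: ckpt_dir and val.jsonl are placeholders.
from swift.llm import InferArguments, infer_main

result = infer_main(
    InferArguments(
        ckpt_dir='output/checkpoint-100',  # hypothetical checkpoint directory
        val_dataset=['val.jsonl'],         # evaluated in full (ratio 1.0 internally)
        show_dataset_sample=10))           # evaluate/display at most 10 samples
```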

requirements/framework.txt

Lines changed: 1 addition & 2 deletions
@@ -1,6 +1,5 @@
 accelerate
 dacite
-datasets<=2.18 # modelscope
 jieba
 matplotlib
 modelscope>=1.14
@@ -14,6 +13,6 @@ rouge
 safetensors
 tensorboard
 tqdm
-transformers>=4.33,<4.41
+transformers>=4.33,<4.42
 transformers_stream_generator
 trl>=0.8.2
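
The pin change admits transformers 4.41.x while still excluding 4.42. A quick environment sanity check, not part of the repo (`packaging` already ships as a transformers dependency):

```python
import transformers
from packaging import version

v = version.parse(transformers.__version__)
assert version.parse('4.33') <= v < version.parse('4.42'), (
    f'transformers {v} is outside the supported >=4.33,<4.42 window')
```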

swift/llm/infer.py

Lines changed: 10 additions & 7 deletions
@@ -386,13 +386,16 @@ def llm_infer(args: InferArguments) -> None:
                 append_to_jsonl(jsonl_path, obj)
             result.append(obj)
     else:
-        _, val_dataset = get_dataset(
-            args.dataset,
-            args.dataset_test_ratio,
-            args.dataset_seed,
-            check_dataset_strategy=args.check_dataset_strategy,
-            model_name=args.model_name,
-            model_author=args.model_author)
+        dataset_kwargs = {
+            'dataset_seed': args.dataset_seed,
+            'check_dataset_strategy': args.check_dataset_strategy,
+            'model_name': args.model_name,
+            'model_author': args.model_author
+        }
+        if args.val_dataset is None:
+            _, val_dataset = get_dataset(args.dataset, args.dataset_test_ratio, **dataset_kwargs)
+        else:
+            _, val_dataset = get_dataset(args.val_dataset, 1.0, **dataset_kwargs)
         _, val_dataset = args._handle_dataset_compat(_, val_dataset)
         if args.show_dataset_sample >= 0 and val_dataset.shape[0] > args.show_dataset_sample:
             random_state = np.random.RandomState(args.dataset_seed)
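
Why the `val_dataset` branch passes `1.0`: the second positional argument of `get_dataset` is the fraction of samples routed to the validation split, so `1.0` sends everything in `--val_dataset` there. A toy illustration of that assumed semantics (this is not the real `get_dataset`):

```python
# Toy split, assuming "ratio" means the fraction that becomes validation data.
def toy_split(samples, ratio):
    n_val = int(len(samples) * ratio)
    return samples[n_val:], samples[:n_val]  # (train, val)

train, val = toy_split(list(range(10)), 1.0)
assert train == [] and val == list(range(10))  # everything lands in validation
```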

swift/llm/utils/argument.py

Lines changed: 7 additions & 5 deletions
@@ -966,8 +966,9 @@ class InferArguments(ArgumentsBase):

     dataset: List[str] = field(
         default_factory=list, metadata={'help': f'dataset choices: {list(DATASET_MAPPING.keys())}'})
+    val_dataset: List[str] = field(default=None, metadata={'help': f'dataset choices: {list(DATASET_MAPPING.keys())}'})
     dataset_seed: int = 42
-    dataset_test_ratio: Optional[float] = None
+    dataset_test_ratio: float = 0.01
     show_dataset_sample: int = 10
     save_result: bool = True
     system: Optional[str] = None
@@ -1035,6 +1036,9 @@ def __post_init__(self) -> None:
                 'the dir contains a `configuration.json` file.')
         self.handle_compatibility()
         self._register_self_cognition()
+        if self.val_dataset is not None:
+            self.dataset_test_ratio = 0.0 if self.val_dataset is not None else self.dataset_test_ratio
+            logger.info('Using val_dataset, ignoring dataset_test_ratio')
         self.handle_path()
         logger.info(f'ckpt_dir: {self.ckpt_dir}')
         if self.ckpt_dir is None and self.load_args_from_ckpt_dir:
@@ -1054,8 +1058,6 @@ def __post_init__(self) -> None:

         self.torch_dtype, _, _ = self.select_dtype()
         self.prepare_template()
-        if self.dataset_test_ratio is None:
-            self.dataset_test_ratio = 1
         if self.eval_human is None:
             if not len(self.dataset) > 0:
                 self.eval_human = True
@@ -1139,8 +1141,8 @@ def load_from_ckpt_dir(self) -> None:
         ]
         if self.load_dataset_config:
             imported_keys += [
-                'dataset', 'dataset_seed', 'dataset_test_ratio', 'check_dataset_strategy', 'self_cognition_sample',
-                'model_name', 'model_author', 'train_dataset_sample', 'val_dataset_sample'
+                'dataset', 'val_dataset', 'dataset_seed', 'dataset_test_ratio', 'check_dataset_strategy',
+                'self_cognition_sample', 'model_name', 'model_author', 'train_dataset_sample', 'val_dataset_sample'
             ]
         for key in imported_keys:
             value = getattr(self, key)
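
Because `val_dataset` is now in the `load_dataset_config` import list, inference can be pointed back at the exact validation data recorded during training. A hedged sketch (the checkpoint path is hypothetical):

```python
# Sketch only: restores dataset, val_dataset, dataset_test_ratio, etc. from the
# training arguments saved alongside the checkpoint.
from swift.llm import InferArguments, infer_main

result = infer_main(
    InferArguments(
        ckpt_dir='output/checkpoint-100',  # hypothetical path
        load_dataset_config=True))
```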

tests/llm/test_run.py

Lines changed: 20 additions & 0 deletions
@@ -39,6 +39,25 @@ def setUp(self):
     def tearDown(self):
         shutil.rmtree(self.tmp_dir)

+    def test_template(self):
+        if not __name__ == '__main__':
+            # ignore citest error in github
+            return
+        torch.cuda.empty_cache()
+        output = sft_main(
+            SftArguments(
+                model_type=ModelType.qwen1half_1_8b,
+                model_id_or_path='../models/Qwen1.5-1.8B',
+                template_type='qwen',
+                sft_type='full',
+                dataset=f'{DatasetName.jd_sentiment_zh}#200',
+                eval_steps=5))
+        best_model_checkpoint = output['best_model_checkpoint']
+        torch.cuda.empty_cache()
+        result = infer_main(
+            InferArguments(ckpt_dir=best_model_checkpoint, load_dataset_config=True, val_dataset_sample=2))
+        assert len(result['result'][0]['response']) < 20
+
     def test_basic(self):
         output_dir = 'output'
         quantization_bit_list = [0, 4]
@@ -481,6 +500,7 @@ def tokenize_func(examples):
             metric_for_best_model='loss',
             greater_is_better=False,
             gradient_accumulation_steps=1,
+            logging_steps=5,
             eval_steps=10,
             save_only_model=save_only_model)
         trainer_args._n_gpu = 1
