
Commit d1224e0

support transformers==4.41 (#979)
1 parent 14a5283 commit d1224e0

6 files changed: +48 -24 lines changed

docs/source/LLM/命令行参数.md

Lines changed: 5 additions & 5 deletions
@@ -39,15 +39,15 @@
 - Support for dataset_path. E.g. '1.jsonl#5000' (if a relative path is given, it is relative to the running directory).
 - `--val_dataset`: Specifies a separate validation set, with the same format as the `dataset` argument. If this argument is used, `dataset_test_ratio` no longer takes effect.
 - `--dataset_seed`: Seed for dataset processing, default is `42`. Exists as a random_state and does not affect the global seed.
-- `--dataset_test_ratio`: Ratio for splitting the sub-dataset into training and validation sets, default is `0.01`.
+- `--dataset_test_ratio`: Ratio for splitting the sub-dataset into training and validation sets, default is `0.01`. If `--val_dataset` is set, this argument no longer takes effect.
 - `--train_dataset_sample`: Number of samples taken from the training set, default is `-1`, i.e. train on the complete training set. This argument is deprecated, please use `--dataset {dataset_name}#{dataset_sample}`.
-- `--val_dataset_sample`: Samples the validation set, default is `None`, which automatically selects a suitable number of samples for validation. If set to `-1`, the complete validation set is used. This argument is deprecated; the size of the validation set is fully controlled by `dataset_test_ratio`.
+- `--val_dataset_sample`: Samples the validation set, default is `None`, which automatically selects a suitable number of samples for validation. If set to `-1`, the complete validation set is used. This argument is deprecated; the size of the validation set is controlled by `--dataset_test_ratio` or `--val_dataset {dataset_name}#{dataset_sample}`.
 - `--system`: The system used in the dialogue template, default is `None`, i.e. the model's default system is used. If set to '', no system is used.
 - `--max_length`: Maximum token length, default is `2048`. Avoids OOM problems caused by individual overly long samples. With `--truncation_strategy delete`, samples whose length exceeds max_length are removed. With `--truncation_strategy truncation_left`, the leading tokens are cut off: `input_ids[-max_length:]`. If set to -1, there is no limit.
 - `--truncation_strategy`: Default is `'delete'`, which removes samples exceeding max_length from the dataset. `'truncation_left'` cuts off the excess text on the left, which may remove special tokens and hurt performance; not recommended.
 - `--check_dataset_strategy`: Default is `'none'`, i.e. no checking. If you are training an LLM, `'warning'` is the recommended data check strategy. For training targets such as sentence classification, `'none'` is recommended.
 - `--custom_train_dataset_path`: Default is `[]`. This argument is deprecated, please use `--dataset {dataset_path}`.
-- `--custom_val_dataset_path`: Default is `[]`. This argument is deprecated; training and validation sets are no longer distinguished and are split uniformly via `dataset_test_ratio`. Please use `--dataset {dataset_path}`.
+- `--custom_val_dataset_path`: Default is `[]`. This argument is deprecated. Please use `--val_dataset {dataset_path}`.
 - `--self_cognition_sample`: Number of samples from the self-cognition dataset, default is `0`. When setting this to a value >0, you must also specify `--model_name` and `--model_author`. This argument is deprecated, please use `--dataset self-cognition#{self_cognition_sample}`.
 - `--model_name`: Default is `[None, None]`. If self-cognition dataset sampling is enabled (i.e. `--dataset self-cognition` is specified or self_cognition_sample>0), you need to pass two values: the model's Chinese and English names. E.g. `--model_name 小黄 'Xiao Huang'`. For more, see the [Self-Cognition Fine-tuning Best Practices](自我认知微调最佳实践.md).
 - `--model_author`: Default is `[None, None]`. If self-cognition dataset sampling is enabled, you need to pass two values: the author's Chinese and English names. E.g. `--model_author 魔搭 ModelScope`.
@@ -241,14 +241,14 @@ dpo parameters inherit the sft parameters and additionally add the following:
 - `--dtype`: Default is `'AUTO'`; see `sft.sh command line arguments` for details.
 - `--dataset`: Default is `[]`; see `sft.sh command line arguments` for details.
 - `--dataset_seed`: Default is `42`; see `sft.sh command line arguments` for details.
-- `--dataset_test_ratio`: Default is `None`; if `--load_dataset_config true` is set, the dataset_test_ratio from training is used, otherwise it is set to 1. See `sft.sh command line arguments` for details.
+- `--dataset_test_ratio`: Default is `0.01`. See `sft.sh command line arguments` for details.
 - `--show_dataset_sample`: Number of validation-set samples to evaluate and display, default is `10`.
 - `--system`: Default is `None`. See `sft.sh command line arguments` for details.
 - `--max_length`: Default is `-1`. See `sft.sh command line arguments` for details.
 - `--truncation_strategy`: Default is `'delete'`. See `sft.sh command line arguments` for details.
 - `--check_dataset_strategy`: Default is `'none'`; see `sft.sh command line arguments` for details.
 - `--custom_train_dataset_path`: Default is `[]`. This argument is deprecated, please use `--dataset {dataset_path}`.
-- `--custom_val_dataset_path`: Default is `[]`. This argument is deprecated; training and validation sets are no longer distinguished and are split uniformly via `dataset_test_ratio`. Please use `--dataset {dataset_path}`.
+- `--custom_val_dataset_path`: Default is `[]`. This argument is deprecated. Please use `--val_dataset {dataset_path}`.
 - `--quantization_bit`: Default is 0. See `sft.sh command line arguments` for details.
 - `--quant_method`: Quantization method, default is `None`. Options are 'bnb', 'hqq', 'eetq'.
 - `--hqq_axis`: hqq quantization argument; the axis along which grouping is performed. Default is `0`; options are `0` and `1`.
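
These documentation changes boil down to one rule: an explicit `--val_dataset` now takes precedence over the ratio-based split. Below is a minimal training-side sketch of that rule; the dataset paths are placeholders, and the programmatic entry points mirror the ones used in this commit's own tests:

```python
# Sketch only: train.jsonl / val.jsonl are placeholder paths, not part of this commit.
from swift.llm import ModelType, SftArguments, sft_main

output = sft_main(
    SftArguments(
        model_type=ModelType.qwen1half_1_8b,
        dataset=['train.jsonl'],     # no validation split is carved from this
        val_dataset=['val.jsonl']))  # explicit validation set; dataset_test_ratio is ignored
```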

docs/source_en/LLM/Command-line-parameters.md

Lines changed: 5 additions & 5 deletions
@@ -37,16 +37,16 @@
 - Support for dataset_path. For example, '1.jsonl#5000' (if it is a relative path, it is relative to the running directory).
 - `--val_dataset`: Specify separate validation datasets with the same format as the `dataset` argument. If `val_dataset` is used, `dataset_test_ratio` will be ignored.
 - `--dataset_seed`: Seed for dataset processing, default is `42`. Exists as a random_state and does not affect the global seed.
-- `--dataset_test_ratio`: Ratio for splitting the sub-dataset into train and validation sets, default is `0.01`.
+- `--dataset_test_ratio`: Used to specify the ratio for splitting the sub-dataset into training and validation sets. The default value is `0.01`. If `--val_dataset` is set, this parameter becomes ineffective.
 - `--train_dataset_sample`: The number of samples for the training dataset, default is `-1`, which means using the complete training dataset. This parameter is deprecated, please use `--dataset {dataset_name}#{dataset_sample}` instead.
-- `--val_dataset_sample`: Sampling for the validation dataset, default is `None`, which automatically selects an appropriate number of samples for validation. If you specify `-1`, the complete validation dataset is used. This parameter is deprecated; the number of samples in the validation dataset is fully controlled by dataset_test_ratio.
+- `--val_dataset_sample`: Used to sample the validation set, default is `None`, which automatically selects a suitable number of samples for validation. If you specify `-1`, the complete validation set is used. This parameter is deprecated; the number of samples in the validation set is controlled by `--dataset_test_ratio` or `--val_dataset {dataset_name}#{dataset_sample}`.
 - `--system`: System used in the dialogue template, default is `None`, i.e. the model's default system is used. If set to '', no system is used.
 - `--max_length`: Maximum token length, default is `2048`. Avoids OOM issues caused by individual overly long samples. When `--truncation_strategy delete` is specified, samples exceeding max_length will be deleted. When `--truncation_strategy truncation_left` is specified, the leftmost tokens will be truncated: `input_ids[-max_length:]`. If set to -1, there is no limit.
 - `--truncation_strategy`: Default is `'delete'`, which removes samples exceeding max_length from the dataset. `'truncation_left'` truncates excess text from the left, which may cut special tokens and affect performance; not recommended.
 - `--check_dataset_strategy`: Default is `'none'`, i.e. no checking. If training an LLM, `'warning'` is recommended as the data check strategy. If your training target is sentence classification etc., `'none'` is recommended.

 - `--custom_train_dataset_path`: Default value is `[]`. This parameter has been deprecated, please use `--dataset {dataset_path}`.
-- `--custom_val_dataset_path`: Default value is `[]`. This parameter has been deprecated. There is no longer a distinction between training and validation datasets; the split is done uniformly via `dataset_test_ratio`. Please use `--dataset {dataset_path}`.
+- `--custom_val_dataset_path`: Default value is `[]`. This parameter is deprecated. Please use `--val_dataset {dataset_path}` instead.
 - `--self_cognition_sample`: The number of samples for the self-cognition dataset. Default is `0`. If you set this value to >0, you need to specify `--model_name` and `--model_author` at the same time. This parameter has been deprecated, please use `--dataset self-cognition#{self_cognition_sample}` instead.
 - `--model_name`: Default value is `[None, None]`. If self-cognition dataset sampling is enabled (i.e., specifying `--dataset self-cognition` or self_cognition_sample>0), you need to provide two values, representing the Chinese and English names of the model respectively. For example: `--model_name 小黄 'Xiao Huang'`. If you want to learn more, you can refer to the [Self-Cognition Fine-tuning Best Practices](Self-cognition-best-practice.md).
 - `--model_author`: Default is `[None, None]`. If self-cognition dataset sampling is enabled, you need to pass two values, representing the author's Chinese and English names respectively. E.g. `--model_author 魔搭 ModelScope`.
@@ -240,14 +240,14 @@ dpo parameters inherit from sft parameters, with the following added parameters:
 - `--dtype`: Default is `'AUTO'`, see `sft.sh command line arguments` for parameter details.
 - `--dataset`: Default is `[]`, see `sft.sh command line arguments` for parameter details.
 - `--dataset_seed`: Default is `42`, see `sft.sh command line arguments` for parameter details.
-- `--dataset_test_ratio`: Default value is `None`; if `--load_dataset_config true` is set, the dataset_test_ratio from training is used, otherwise it is set to 1. For specific parameter details, refer to the `sft.sh command line arguments`.
+- `--dataset_test_ratio`: Default value is `0.01`. For specific parameter details, refer to the `sft.sh command line arguments`.
 - `--show_dataset_sample`: Number of validation-set samples to evaluate and display, default is `10`.
 - `--system`: Default is `None`. See `sft.sh command line arguments` for parameter details.
 - `--max_length`: Default is `-1`. See `sft.sh command line arguments` for parameter details.
 - `--truncation_strategy`: Default is `'delete'`. See `sft.sh command line arguments` for parameter details.
 - `--check_dataset_strategy`: Default is `'none'`, see `sft.sh command line arguments` for parameter details.
 - `--custom_train_dataset_path`: Default value is `[]`. This parameter has been deprecated, please use `--dataset {dataset_path}`.
-- `--custom_val_dataset_path`: Default value is `[]`. This parameter has been deprecated. There is no longer a distinction between training and validation datasets; the split is done uniformly via `dataset_test_ratio`. Please use `--dataset {dataset_path}`.
+- `--custom_val_dataset_path`: Default value is `[]`. This parameter is deprecated. Please use `--val_dataset {dataset_path}` instead.
 - `--quantization_bit`: Default is 0. See `sft.sh command line arguments` for parameter details.
 - `--quant_method`: Quantization method, default is `None`. You can choose from 'bnb', 'hqq', 'eetq'.
 - `--hqq_axis`: hqq argument; the axis along which grouping is performed. Supported values are `0` and `1`. Default is `0`.
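
The inference side follows the same precedence: when `val_dataset` is passed, the entire named dataset is used for validation rather than a `dataset_test_ratio` slice. A hedged usage sketch, with a hypothetical checkpoint directory and placeholder jsonl file:

```python
# Sketch only: ckpt_dir and val.jsonl are placeholders.
from swift.llm import InferArguments, infer_main

result = infer_main(
    InferArguments(
        ckpt_dir='output/checkpoint-100',  # hypothetical checkpoint directory
        val_dataset=['val.jsonl'],         # evaluated in full (ratio 1.0 internally)
        show_dataset_sample=10))           # evaluate/display at most 10 samples
```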

requirements/framework.txt

Lines changed: 1 addition & 2 deletions
@@ -1,6 +1,5 @@
 accelerate
 dacite
-datasets<=2.18 # modelscope
 jieba
 matplotlib
 modelscope>=1.14
@@ -14,6 +13,6 @@ rouge
 safetensors
 tensorboard
 tqdm
-transformers>=4.33,<4.41
+transformers>=4.33,<4.42
 transformers_stream_generator
 trl>=0.8.2
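
The pin change admits transformers 4.41.x while still excluding 4.42. A quick environment sanity check, not part of the repo (`packaging` already ships as a transformers dependency):

```python
import transformers
from packaging import version

v = version.parse(transformers.__version__)
assert version.parse('4.33') <= v < version.parse('4.42'), (
    f'transformers {v} is outside the supported >=4.33,<4.42 window')
```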

swift/llm/infer.py

Lines changed: 10 additions & 7 deletions
@@ -386,13 +386,16 @@ def llm_infer(args: InferArguments) -> None:
                 append_to_jsonl(jsonl_path, obj)
             result.append(obj)
     else:
-        _, val_dataset = get_dataset(
-            args.dataset,
-            args.dataset_test_ratio,
-            args.dataset_seed,
-            check_dataset_strategy=args.check_dataset_strategy,
-            model_name=args.model_name,
-            model_author=args.model_author)
+        dataset_kwargs = {
+            'dataset_seed': args.dataset_seed,
+            'check_dataset_strategy': args.check_dataset_strategy,
+            'model_name': args.model_name,
+            'model_author': args.model_author
+        }
+        if args.val_dataset is None:
+            _, val_dataset = get_dataset(args.dataset, args.dataset_test_ratio, **dataset_kwargs)
+        else:
+            _, val_dataset = get_dataset(args.val_dataset, 1.0, **dataset_kwargs)
         _, val_dataset = args._handle_dataset_compat(_, val_dataset)
         if args.show_dataset_sample >= 0 and val_dataset.shape[0] > args.show_dataset_sample:
             random_state = np.random.RandomState(args.dataset_seed)
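
Why the `val_dataset` branch passes `1.0`: the second positional argument of `get_dataset` is the fraction of samples routed to the validation split, so `1.0` sends everything in `--val_dataset` there. A toy illustration of that assumed semantics (this is not the real `get_dataset`):

```python
# Toy split, assuming "ratio" means the fraction that becomes validation data.
def toy_split(samples, ratio):
    n_val = int(len(samples) * ratio)
    return samples[n_val:], samples[:n_val]  # (train, val)

train, val = toy_split(list(range(10)), 1.0)
assert train == [] and val == list(range(10))  # everything lands in validation
```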

swift/llm/utils/argument.py

Lines changed: 7 additions & 5 deletions
@@ -966,8 +966,9 @@ class InferArguments(ArgumentsBase):

     dataset: List[str] = field(
         default_factory=list, metadata={'help': f'dataset choices: {list(DATASET_MAPPING.keys())}'})
+    val_dataset: List[str] = field(default=None, metadata={'help': f'dataset choices: {list(DATASET_MAPPING.keys())}'})
     dataset_seed: int = 42
-    dataset_test_ratio: Optional[float] = None
+    dataset_test_ratio: float = 0.01
     show_dataset_sample: int = 10
     save_result: bool = True
     system: Optional[str] = None
@@ -1035,6 +1036,9 @@ def __post_init__(self) -> None:
                 'the dir contains a `configuration.json` file.')
         self.handle_compatibility()
         self._register_self_cognition()
+        if self.val_dataset is not None:
+            self.dataset_test_ratio = 0.0 if self.val_dataset is not None else self.dataset_test_ratio
+            logger.info('Using val_dataset, ignoring dataset_test_ratio')
         self.handle_path()
         logger.info(f'ckpt_dir: {self.ckpt_dir}')
         if self.ckpt_dir is None and self.load_args_from_ckpt_dir:
@@ -1054,8 +1058,6 @@ def __post_init__(self) -> None:

         self.torch_dtype, _, _ = self.select_dtype()
         self.prepare_template()
-        if self.dataset_test_ratio is None:
-            self.dataset_test_ratio = 1
         if self.eval_human is None:
             if not len(self.dataset) > 0:
                 self.eval_human = True
@@ -1139,8 +1141,8 @@ def load_from_ckpt_dir(self) -> None:
         ]
         if self.load_dataset_config:
             imported_keys += [
-                'dataset', 'dataset_seed', 'dataset_test_ratio', 'check_dataset_strategy', 'self_cognition_sample',
-                'model_name', 'model_author', 'train_dataset_sample', 'val_dataset_sample'
+                'dataset', 'val_dataset', 'dataset_seed', 'dataset_test_ratio', 'check_dataset_strategy',
+                'self_cognition_sample', 'model_name', 'model_author', 'train_dataset_sample', 'val_dataset_sample'
             ]
         for key in imported_keys:
             value = getattr(self, key)
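
Because `val_dataset` is now in the `load_dataset_config` import list, inference can be pointed back at the exact validation data recorded during training. A hedged sketch (the checkpoint path is hypothetical):

```python
# Sketch only: restores dataset, val_dataset, dataset_test_ratio, etc. from the
# training arguments saved alongside the checkpoint.
from swift.llm import InferArguments, infer_main

result = infer_main(
    InferArguments(
        ckpt_dir='output/checkpoint-100',  # hypothetical path
        load_dataset_config=True))
```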

tests/llm/test_run.py

Lines changed: 20 additions & 0 deletions
@@ -39,6 +39,25 @@ def setUp(self):
     def tearDown(self):
         shutil.rmtree(self.tmp_dir)

+    def test_template(self):
+        if not __name__ == '__main__':
+            # ignore citest error in github
+            return
+        torch.cuda.empty_cache()
+        output = sft_main(
+            SftArguments(
+                model_type=ModelType.qwen1half_1_8b,
+                model_id_or_path='../models/Qwen1.5-1.8B',
+                template_type='qwen',
+                sft_type='full',
+                dataset=f'{DatasetName.jd_sentiment_zh}#200',
+                eval_steps=5))
+        best_model_checkpoint = output['best_model_checkpoint']
+        torch.cuda.empty_cache()
+        result = infer_main(
+            InferArguments(ckpt_dir=best_model_checkpoint, load_dataset_config=True, val_dataset_sample=2))
+        assert len(result['result'][0]['response']) < 20
+
     def test_basic(self):
         output_dir = 'output'
         quantization_bit_list = [0, 4]
@@ -481,6 +500,7 @@ def tokenize_func(examples):
             metric_for_best_model='loss',
             greater_is_better=False,
             gradient_accumulation_steps=1,
+            logging_steps=5,
             eval_steps=10,
             save_only_model=save_only_model)
         trainer_args._n_gpu = 1
