Merge branch 'main' into v2.1

Jintao-Huang · Jintao-Huang · commit dcbf9e4ba85a · 2024-06-04T15:13:39.000+08:00
diff --git a/docs/source/LLM/LLM微调文档.md b/docs/source/LLM/LLM微调文档.md
@@ -83,6 +83,7 @@ CUDA_VISIBLE_DEVICES=0 swift sft \
     --output_dir output \
 
 # Using your own dataset
+# Custom dataset format: https://github.com/modelscope/swift/blob/main/docs/source/LLM/%E8%87%AA%E5%AE%9A%E4%B9%89%E4%B8%8E%E6%8B%93%E5%B1%95.md#%E8%87%AA%E5%AE%9A%E4%B9%89%E6%95%B0%E6%8D%AE%E9%9B%86
 CUDA_VISIBLE_DEVICES=0 swift sft \
     --model_id_or_path qwen/Qwen-7B-Chat \
     --dataset chatml.jsonl \
diff --git a/docs/source/LLM/自定义与拓展.md b/docs/source/LLM/自定义与拓展.md
@@ -7,11 +7,11 @@
 ## Custom Datasets
 We support three methods for **customizing datasets**.
 
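-1. [Recommended] **Command-line arguments**: **a more convenient way to support custom datasets**. Supports four dataset formats (i.e., uses `SmartPreprocessor`), and supports `dataset_id` and `dataset_path`.
+1. [Recommended] Pass arguments directly on the command line by specifying `--dataset xxx.json yyy.jsonl zzz.csv`: **a more convenient way to support custom datasets**. Supports five dataset formats (i.e., uses `SmartPreprocessor`; the supported formats are listed below), and supports `dataset_id` and `dataset_path`.
 2. Add the dataset to `dataset_info.json`: more flexible but more cumbersome than the first method. Supports applying two preprocessors to the dataset and specifying their parameters: `RenameColumnsPreprocessor` and `ConversationsPreprocessor` (`SmartPreprocessor` is used by default). You can modify swift's built-in `dataset_info.json` directly, or pass in an external JSON file via `--custom_dataset_info xxx.json` (convenient for users who installed with pip install rather than git clone).
 3. **Register the dataset**: more flexible but more cumbersome than methods 1 and 2. Supports preprocessing the dataset with a function. Methods 1 and 2 are implemented on top of method 3. You can extend swift by modifying the source code directly, or pass in a file via `--custom_register_path xxx.py`, which the script will parse (convenient for pip install users).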
 
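-### 📌 [Recommended] Command-line arguments
+### 📌 [Recommended] Passing arguments directly on the command line
 Supports directly passing in a custom **dataset_id** (compatible with MS and HF) or **dataset_path**, as well as passing in multiple custom datasets at once together with their sample counts; the script preprocesses and concatenates them automatically. If a `dataset_id` is passed in, the 'default' subset of that dataset\_id is used by default and the split is set to 'train'. If the dataset\_id is already registered, the subsets, split, and preprocessing function given at registration are used. If a `dataset_path` is passed in, it can be a relative or an absolute path; relative paths are resolved against the current working directory.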
 
 ```bash
diff --git a/docs/source_en/LLM/Customization.md b/docs/source_en/LLM/Customization.md
@@ -8,11 +8,11 @@
 
 We support three methods for **customizing datasets**.
 
-1. \[Recommended\] using command line arguments: It is more convenient to support custom datasets, and it supports four dataset formats (using `SmartPreprocessor`) as well as the `dataset_id` and `dataset_path`.
+1. \[Recommended\] Use the command line argument directly to specify `--dataset xxx.json yyy.jsonl zzz.csv`, which is more convenient for supporting custom datasets. It supports five data formats (using `SmartPreprocessor`, supported dataset formats are listed below) and supports `dataset_id` and `dataset_path`.
 2. Adding datasets to `dataset_info.json` is more flexible but cumbersome compared to the first method, and supports using two preprocessors and specifying their parameters: `RenameColumnsPreprocessor`, `ConversationsPreprocessor` (default is to use `SmartPreprocessor`). You can directly modify the built-in `dataset_info.json` in Swift, or pass in an external json file using `--custom_dataset_info xxx.json` (for users who prefer pip install over git clone to expand datasets).
 3. Registering datasets: More flexible but cumbersome compared to the first and second methods, it supports using functions to preprocess datasets. Methods 1 and 2 are implemented by leveraging method 3. You can directly modify the source code for expansion, or pass in a custom registration path using `--custom_register_path xxx.py`, where the script will parse the py file (for pip install users).
 
-### 📌 \[Recommended\] using Command Line Arguments
+### 📌 \[Recommended\] Using Command Line Arguments Directly
 
 Supports directly passing in custom `dataset_id` (compatible with MS and HF) and `dataset_path`, as well as simultaneously passing in multiple custom datasets and their respective sample sizes. The script will automatically preprocess and concatenate the datasets. If a `dataset_id` is passed in, it will default to using the 'default' subset in the dataset_id and set the split to 'train'. If the dataset_id has already been registered, it will use the subsets, split, and preprocessing functions that were passed in during registration. If a `dataset_path` is passed in, it can be specified as a relative path or an absolute path, where the relative path is relative to the current running directory.
 
diff --git a/docs/source_en/LLM/LLM-fine-tuning.md b/docs/source_en/LLM/LLM-fine-tuning.md
@@ -79,6 +79,7 @@ CUDA_VISIBLE_DEVICES=0 swift sft \
     --output_dir output \
 
 # Using your own dataset
+# custom dataset format: https://github.com/modelscope/swift/blob/main/docs/source_en/LLM/Customization.md#custom-datasets
 CUDA_VISIBLE_DEVICES=0 swift sft \
     --model_id_or_path qwen/Qwen-7B-Chat \
     --dataset chatml.jsonl \
diff --git a/swift/llm/data/dataset_info.json b/swift/llm/data/dataset_info.json
@@ -217,8 +217,9 @@
         "tags": ["chat", "medical"]
     },
     "self-cognition": {
-        "dataset_path": "self_cognition.jsonl",
+        "dataset_id": "swift/self-cognition",
+        "hf_dataset_id": "modelscope/self-cognition",
         "remove_useless_columns": false,
-        "tags": ["chat", "self_cognition", "🔥"]
+        "tags": ["chat", "self-cognition", "🔥"]
     }
 }
diff --git a/swift/llm/data/self_cognition.jsonl b/swift/llm/data/self_cognition.jsonl
diff --git a/swift/llm/utils/argument.py b/swift/llm/utils/argument.py
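
The `dataset_info.json` hunk above switches the `self-cognition` entry from a bundled local file (`dataset_path`) to hub identifiers (`dataset_id`, `hf_dataset_id`). As a rough illustration only, the Python sketch below parses the new entry and reports which kind of source it declares. The helper `describe_source` is hypothetical and is not part of swift; how swift itself resolves these keys is not shown in this diff.

```python
import json

# Entry as it appears after this commit (copied from the hunk above).
ENTRY_JSON = """
{
    "self-cognition": {
        "dataset_id": "swift/self-cognition",
        "hf_dataset_id": "modelscope/self-cognition",
        "remove_useless_columns": false,
        "tags": ["chat", "self-cognition", "🔥"]
    }
}
"""


def describe_source(entry: dict) -> str:
    """Hypothetical helper (not a swift API): classify an entry by its keys."""
    if "dataset_id" in entry or "hf_dataset_id" in entry:
        return "hub"    # entry points at a hub dataset ID
    if "dataset_path" in entry:
        return "local"  # entry points at a file on disk
    return "unknown"


dataset_info = json.loads(ENTRY_JSON)
entry = dataset_info["self-cognition"]
print(describe_source(entry))
```

Passing the pre-commit form of the entry, `{"dataset_path": "self_cognition.jsonl"}`, to the same helper yields `"local"`, which mirrors the change the hunk makes.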
