
Commit e349174: Fix citest test_run.py (#1059)
Parent: c2e93a7

File tree: 7 files changed, +12 additions, -142 deletions


docs/source/LLM/LLM微调文档.md

Lines changed: 1 addition & 0 deletions
@@ -83,6 +83,7 @@ CUDA_VISIBLE_DEVICES=0 swift sft \
     --output_dir output \

 # Use your own dataset
+# For the custom dataset format, see: https://github.com/modelscope/swift/blob/main/docs/source/LLM/%E8%87%AA%E5%AE%9A%E4%B9%89%E4%B8%8E%E6%8B%93%E5%B1%95.md#%E8%87%AA%E5%AE%9A%E4%B9%89%E6%95%B0%E6%8D%AE%E9%9B%86
 CUDA_VISIBLE_DEVICES=0 swift sft \
     --model_id_or_path qwen/Qwen-7B-Chat \
     --dataset chatml.jsonl \
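The `--dataset chatml.jsonl` argument above points at a local JSONL file. As a minimal sketch of how such a file could be generated (the `messages` field names are an assumption here; the schemas `SmartPreprocessor` actually accepts are listed in the custom dataset format document linked above):

```python
import json

# Hypothetical ChatML-style samples; the exact schema accepted by swift's
# SmartPreprocessor should be checked against the custom dataset guide.
samples = [
    {"messages": [
        {"role": "user", "content": "Hello!"},
        {"role": "assistant", "content": "Hi, how can I help you?"},
    ]},
    {"messages": [
        {"role": "user", "content": "What does swift sft do?"},
        {"role": "assistant", "content": "It fine-tunes a model on supervised data."},
    ]},
]

# Write one JSON object per line (the JSONL convention).
with open("chatml.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```

The resulting file can then be passed directly on the command line as shown in the diff above.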

docs/source/LLM/自定义与拓展.md

Lines changed: 2 additions & 2 deletions
@@ -7,11 +7,11 @@
 ## Custom Datasets
 We support three methods for **custom datasets**.

-1. [Recommended] The **command-line argument** approach: **more convenient for supporting custom datasets**; supports four dataset formats (i.e. using `SmartPreprocessor`) and supports `dataset_id` and `dataset_path`.
+1. [Recommended] Pass arguments directly on the command line, specifying `--dataset xxx.json yyy.jsonl zzz.csv`: **more convenient for supporting custom datasets**; supports five dataset formats (i.e. using `SmartPreprocessor`; the supported formats are listed below) and supports `dataset_id` and `dataset_path`.
 2. Add the dataset to `dataset_info.json`: more flexible than the first approach but more cumbersome; supports applying two preprocessors to the dataset and specifying their parameters: `RenameColumnsPreprocessor` and `ConversationsPreprocessor` (`SmartPreprocessor` is used by default). You can modify swift's built-in `dataset_info.json` directly, or pass an external json file via `--custom_dataset_info xxx.json` (convenient for users who pip install rather than git clone to extend datasets).
 3. The **dataset registration** approach: more flexible than approaches 1 and 2 but more cumbersome; supports preprocessing the dataset with functions. Approaches 1 and 2 are implemented on top of approach 3. You can modify the source code directly to extend it, or pass `--custom_register_path xxx.py`; the script parses the py file (convenient for pip install users).

-### 📌 [Recommended] The Command-Line Argument Approach
+### 📌 [Recommended] Passing Arguments Directly on the Command Line
 Supports directly passing a custom **dataset_id** (compatible with MS and HF) or **dataset_path**, as well as passing multiple custom datasets with their corresponding sample counts at once; the script automatically preprocesses and concatenates them. If a `dataset_id` is passed, the 'default' subset of that dataset_id is used by default and the split is set to 'train'. If the dataset_id is already registered, the subsets, split, and preprocessing function supplied at registration are used. If a `dataset_path` is passed, it may be a relative or absolute path; a relative path is relative to the current working directory.

 ```bash

docs/source_en/LLM/Customization.md

Lines changed: 2 additions & 2 deletions
@@ -8,11 +8,11 @@

 We support three methods for **customizing datasets**.

-1. [Recommended] Using command-line arguments: It is more convenient to support custom datasets, and it supports four dataset formats (using `SmartPreprocessor`) as well as `dataset_id` and `dataset_path`.
+1. [Recommended] Use the command-line argument directly to specify `--dataset xxx.json yyy.jsonl zzz.csv`, which is more convenient for supporting custom datasets. It supports five data formats (using `SmartPreprocessor`; the supported dataset formats are listed below) and supports `dataset_id` and `dataset_path`.
 2. Adding datasets to `dataset_info.json`: more flexible but more cumbersome than the first method; supports using two preprocessors and specifying their parameters: `RenameColumnsPreprocessor` and `ConversationsPreprocessor` (`SmartPreprocessor` is used by default). You can directly modify the built-in `dataset_info.json` in swift, or pass in an external json file using `--custom_dataset_info xxx.json` (for users who prefer pip install over git clone to extend datasets).
 3. Registering datasets: more flexible but more cumbersome than the first and second methods; supports using functions to preprocess datasets. Methods 1 and 2 are implemented by leveraging method 3. You can directly modify the source code for extension, or pass in a custom registration path using `--custom_register_path xxx.py`; the script will parse the py file (for pip install users).

-### 📌 [Recommended] Using Command Line Arguments
+### 📌 [Recommended] Using Command-Line Arguments Directly

 Supports directly passing in a custom `dataset_id` (compatible with MS and HF) or `dataset_path`, as well as simultaneously passing in multiple custom datasets and their respective sample sizes. The script will automatically preprocess and concatenate the datasets. If a `dataset_id` is passed in, it defaults to the 'default' subset of the dataset_id and sets the split to 'train'. If the dataset_id has already been registered, the subsets, split, and preprocessing functions passed in during registration are used. If a `dataset_path` is passed in, it can be a relative or an absolute path, where the relative path is relative to the current working directory.
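As an illustration of one of the simpler tabular formats the text says `SmartPreprocessor` can handle (the `query`/`response` column names are an assumption here, not taken from swift's authoritative format list), a small CSV dataset could be produced like this:

```python
import csv

# Hypothetical query/response pairs; the column names are assumed and
# should be checked against swift's documented dataset formats.
rows = [
    {"query": "Translate 'hello' to French.", "response": "bonjour"},
    {"query": "What is 2 + 2?", "response": "4"},
]

# Write a CSV that could then be passed as `--dataset my_dataset.csv`.
with open("my_dataset.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["query", "response"])
    writer.writeheader()
    writer.writerows(rows)
```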

docs/source_en/LLM/LLM-fine-tuning.md

Lines changed: 1 addition & 0 deletions
@@ -79,6 +79,7 @@ CUDA_VISIBLE_DEVICES=0 swift sft \
     --output_dir output \

 # Using your own dataset
+# Custom dataset format: https://github.com/modelscope/swift/blob/main/docs/source_en/LLM/Customization.md#custom-datasets
 CUDA_VISIBLE_DEVICES=0 swift sft \
     --model_id_or_path qwen/Qwen-7B-Chat \
     --dataset chatml.jsonl \

swift/llm/data/dataset_info.json

Lines changed: 3 additions & 2 deletions
@@ -217,8 +217,9 @@
         "tags": ["chat", "medical"]
     },
     "self-cognition": {
-        "dataset_path": "self_cognition.jsonl",
+        "dataset_id": "swift/self-cognition",
+        "hf_dataset_id": "modelscope/self-cognition",
         "remove_useless_columns": false,
-        "tags": ["chat", "self_cognition", "🔥"]
+        "tags": ["chat", "self-cognition", "🔥"]
     }
 }
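The diff above swaps a local `dataset_path` for a `dataset_id` with a HuggingFace mirror. Per the customization docs changed in this same commit, such an entry can also live in an external file passed via `--custom_dataset_info xxx.json` rather than editing the built-in `dataset_info.json`. A sketch (the entry name `my-self-cognition` and file name are made up for illustration; the keys come from the diff above):

```python
import json

# External dataset_info entry mirroring the "self-cognition" entry above.
# The top-level key "my-self-cognition" is a hypothetical name.
custom_info = {
    "my-self-cognition": {
        "dataset_id": "swift/self-cognition",           # ModelScope dataset id
        "hf_dataset_id": "modelscope/self-cognition",   # HuggingFace mirror
        "remove_useless_columns": False,
        "tags": ["chat", "self-cognition"],
    }
}

# Write the file to pass as `--custom_dataset_info my_info.json`.
with open("my_info.json", "w", encoding="utf-8") as f:
    json.dump(custom_info, f, ensure_ascii=False, indent=2)
```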

0 commit comments