Fix windows (#1096)

Jintao-Huang · Jintao-Huang · commit 9858e4f1a4a8 · 2024-06-07T09:31:17.000+08:00
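
The diff does not say why `'swift_#:#.jsonl'` is dropped from `val_dataset_fnames` in `test_custom_dataset` and removed from `tests/llm/data`. One plausible reason, given the commit title, is that Windows does not allow `:` in file names, so a file with that name cannot be created there. The sketch below illustrates such a check under that assumption. `WINDOWS_RESERVED_CHARS` and `invalid_windows_chars` are hypothetical names for illustration, not part of swift.

```python
# Hypothetical helper, not part of swift: reports which characters in a
# file name are reserved by Windows (< > : " / \ | ? * and control characters).
WINDOWS_RESERVED_CHARS = set('<>:"/\\|?*')


def invalid_windows_chars(fname: str) -> list:
    """Return the sorted characters in `fname` that Windows rejects in a file name."""
    return sorted({ch for ch in fname if ch in WINDOWS_RESERVED_CHARS or ord(ch) < 32})


# The validation file names as they stood before this commit.
val_dataset_fnames = [
    'alpaca.jsonl',
    'alpaca2.csv',
    'conversations.jsonl',
    'swift_pre.csv',
    'swift_single.jsonl',
    'swift_#:#.jsonl',
]

# Keep only the names that can exist on a Windows file system.
portable = [f for f in val_dataset_fnames if not invalid_windows_chars(f)]
print(portable)
```

Under this assumption only `swift_#:#.jsonl` is filtered out: `#` is a legal character on Windows, while `:` is not.
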
diff --git a/docs/source/LLM/自定义与拓展.md b/docs/source/LLM/自定义与拓展.md
@@ -7,7 +7,7 @@
 ## 自定义数据集
 我们支持三种**自定义数据集**的方法.
 
-1. 【推荐】直接命令行传参的方式，指定`--dataset xxx.json yyy.jsonl zzz.csv`, **更加方便支持自定义数据集**, 支持五种数据集格式（即使用`SmartPreprocessor`，支持的数据集格式见下方）, 支持`dataset_id`和`dataset_path`.
+1. 【推荐】直接命令行传参的方式，指定`--dataset xxx.json yyy.jsonl zzz.csv`, **更加方便支持自定义数据集**, 支持五种数据集格式（即使用`SmartPreprocessor`，支持的数据集格式见下方）, 支持`dataset_id`和`dataset_path`. 不需要修改`dataset_info.json`文件.
 2. 添加数据集到`dataset_info.json`中, 比第一种方式更灵活但繁琐, 支持对数据集使用两种预处理器并指定其参数: `RenameColumnsPreprocessor`, `ConversationsPreprocessor`（默认使用`SmartPreprocessor`）. 支持直接修改swift内置的`dataset_info.json`, 或者通过`--custom_dataset_info xxx.json`的方式传入外置的json文件（方便pip install而非git clone的用户拓展数据集）.
 3. **注册数据集**的方式: 比第1、2种方式更加灵活但繁琐, 支持使用函数对数据集进行预处理. 方法1、2在实现上借助了方法3. 可以直接修改源码进行拓展, 或者通过`--custom_register_path xxx.py`的方式传入, 脚本会对py文件进行解析（方便pip install的用户）.
 
diff --git a/docs/source_en/LLM/Customization.md b/docs/source_en/LLM/Customization.md
@@ -8,7 +8,7 @@
 
 We support three methods for **customizing datasets**.
 
-1. \[Recommended] Use the command line argument directly to specify `--dataset xxx.json yyy.jsonl zzz.csv`, which is more convenient for supporting custom datasets. It supports five data formats (using `SmartPreprocessor`, supported dataset formats are listed below) and supports `dataset_id` and `dataset_path`.
+1. \[Recommended] Use the command line argument directly to specify `--dataset xxx.json yyy.jsonl zzz.csv`, which is more convenient for supporting custom datasets. It supports five data formats (using `SmartPreprocessor`, supported dataset formats are listed below) and supports `dataset_id` and `dataset_path`. No need to modify the `dataset_info.json` file.
 2. Adding datasets to `dataset_info.json` is more flexible but cumbersome compared to the first method, and supports using two preprocessors and specifying their parameters: `RenameColumnsPreprocessor`, `ConversationsPreprocessor` (default is to use `SmartPreprocessor`). You can directly modify the built-in `dataset_info.json` in Swift, or pass in an external json file using `--custom_dataset_info xxx.json` (for users who prefer pip install over git clone to expand datasets).
 3. Registering datasets: More flexible but cumbersome compared to the first and second methods, it supports using functions to preprocess datasets. Methods 1 and 2 are implemented by leveraging method 3. You can directly modify the source code for expansion, or pass in a custom registration path using `--custom_register_path xxx.py`, where the script will parse the py file (for pip install users).
 
diff --git a/tests/llm/data/swift_#:#.jsonl b/tests/llm/data/swift_#:#.jsonl
diff --git a/tests/llm/test_run.py b/tests/llm/test_run.py
@@ -169,8 +169,12 @@ def test_custom_dataset(self):
             'swift_multi.json', 'sharegpt.jsonl'
         ]
         val_dataset_fnames = [
-            'alpaca.jsonl', 'alpaca2.csv', 'conversations.jsonl', 'swift_pre.csv', 'swift_single.jsonl',
-            'swift_#:#.jsonl'
+            'alpaca.jsonl',
+            'alpaca2.csv',
+            'conversations.jsonl',
+            'swift_pre.csv',
+            'swift_single.jsonl',
+            # 'swift_#:#.jsonl'
         ]
         folder = os.path.join(os.path.dirname(__file__), 'data')
         resume_from_checkpoint = None