Merge branch 'main' into release/3.0

Jintao-Huang · Jintao-Huang · commit 80ac76249f8b · 2024-12-26T15:16:29.000+08:00
diff --git a/docs/source/Customization/自定义数据集.md b/docs/source/Customization/自定义数据集.md
@@ -67,6 +67,13 @@ query-response格式：
 {"messages": [{"role": "system", "content": "你是个有用无害的数学计算器"}, {"role": "user", "content": "1+1等于几"}, {"role": "assistant", "content": "等于2"}, {"role": "user", "content": "再加1呢"}, {"role": "assistant", "content": "等于3"}], "label": true}
 ```
 
+### 序列分类
+```jsonl
+{"messages": [{"role": "user", "content": "今天天气真好呀"}], "label": 1}
+{"messages": [{"role": "user", "content": "今天真倒霉"}], "label": 0}
+{"messages": [{"role": "user", "content": "好开心"}], "label": 1}
+```
+
 ### 多模态
 
 对于多模态数据集，和上述任务的格式相同。区别在于增加了`images`, `videos`, `audios`几个key，分别代表多模态资源，`<image>` `<video>` `<audio>`标签代表了插入图片/视频/音频的位置。下面给出的四条示例分别展示了纯文本，以及包含图像、视频和音频数据的数据格式。
diff --git a/docs/source_en/Customization/Custom-dataset.md b/docs/source_en/Customization/Custom-dataset.md
@@ -66,6 +66,13 @@ The following provides the recommended dataset format for ms-swift, where the sy
 {"messages": [{"role": "system", "content": "You are a useful and harmless math calculator"}, {"role": "user", "content": "What is 1 + 1?"}, {"role": "assistant", "content": "It equals 2"}, {"role": "user", "content": "What about adding 1?"}, {"role": "assistant", "content": "It equals 3"}], "label": true}
 ```
 
+### Sequence Classification
+```jsonl
+{"messages": [{"role": "user", "content": "The weather is really nice today"}], "label": 1}
+{"messages": [{"role": "user", "content": "Today is really unlucky"}], "label": 0}
+{"messages": [{"role": "user", "content": "So happy"}], "label": 1}
+```
+
 ### Multimodal
 
 For multimodal datasets, the format is the same as the tasks mentioned above. The difference is the addition of several keys: `images`, `videos`, and `audios`, which represent multimodal resources. The tags `<image>`, `<video>`, and `<audio>` indicate the positions where images, videos, and audio are inserted, respectively. The four examples provided below demonstrate the data format for pure text, as well as formats that include image, video, and audio data.
diff --git a/swift/llm/dataset/loader.py b/swift/llm/dataset/loader.py
@@ -170,11 +170,11 @@ def _load_dataset_path(dataset_meta: DatasetMeta,
         dataset_path = dataset_meta.dataset_path
 
         ext = os.path.splitext(dataset_path)[1].lstrip('.')
-        ext = ext if ext != 'jsonl' else 'json'
+        file_type = {'jsonl': 'json', 'txt': 'text'}.get(ext) or ext
         kwargs = {'split': 'train', 'streaming': streaming, 'num_proc': num_proc}
-        if ext == 'csv':
+        if file_type == 'csv':
             kwargs['na_filter'] = False
-        dataset = hf_load_dataset(ext, data_files=dataset_path, **kwargs)
+        dataset = hf_load_dataset(file_type, data_files=dataset_path, **kwargs)
 
         dataset = dataset_meta.preprocess_func(
             dataset, num_proc=num_proc, strict=strict, load_from_cache_file=load_from_cache_file)
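
Review note: the loader hunk above replaces the single `jsonl` special case with a small lookup table, so that `.txt` files are also routed to a builder name that `datasets.load_dataset` recognizes (`json`, `text`, `csv`). The following is a minimal standalone sketch of that mapping, not the repository's code: `resolve_file_type` and `build_load_kwargs` are hypothetical helper names introduced here for illustration, and the default argument values are assumptions. The commit itself performs these steps inline in `_load_dataset_path`.

```python
import os

# Extensions whose name differs from the Hugging Face `datasets` builder name.
# Mirrors the dict literal added in the hunk above.
_EXT_TO_FILE_TYPE = {'jsonl': 'json', 'txt': 'text'}


def resolve_file_type(dataset_path: str) -> str:
    """Map a dataset path to a builder name (hypothetical helper)."""
    # splitext returns ('.../train', '.jsonl'); drop the leading dot.
    ext = os.path.splitext(dataset_path)[1].lstrip('.')
    # Fall back to the raw extension when no alias is registered.
    return _EXT_TO_FILE_TYPE.get(ext) or ext


def build_load_kwargs(file_type: str, streaming: bool = False, num_proc: int = 1) -> dict:
    """Assemble the keyword arguments the hunk passes to the loader (hypothetical helper)."""
    kwargs = {'split': 'train', 'streaming': streaming, 'num_proc': num_proc}
    if file_type == 'csv':
        # As in the diff: stop pandas from turning empty cells into NaN.
        kwargs['na_filter'] = False
    return kwargs


if __name__ == '__main__':
    for path in ['data/train.jsonl', 'corpus.txt', 'table.csv', 'data.json']:
        file_type = resolve_file_type(path)
        print(path, file_type, build_load_kwargs(file_type))
```

One consequence worth noting: a path with no extension resolves to an empty string, exactly as it did before this change, so that case is still left to the downstream loader to reject.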