Commit 80ac762

Merge branch 'main' into release/3.0
2 parents dac37de + 37cb3c6 commit 80ac762

3 files changed, with 17 additions and 3 deletions

docs/source/Customization/自定义数据集.md

Lines changed: 7 additions & 0 deletions

    @@ -67,6 +67,13 @@ query-response format:
     {"messages": [{"role": "system", "content": "You are a useful and harmless math calculator"}, {"role": "user", "content": "What is 1+1?"}, {"role": "assistant", "content": "It equals 2"}, {"role": "user", "content": "What about adding 1 more?"}, {"role": "assistant", "content": "It equals 3"}], "label": true}
     ```
     
    +### Sequence classification
    +```jsonl
    +{"messages": [{"role": "user", "content": "The weather is really nice today"}], "label": 1}
    +{"messages": [{"role": "user", "content": "Today is really unlucky"}], "label": 0}
    +{"messages": [{"role": "user", "content": "So happy"}], "label": 1}
    +```
    +
     ### Multimodal
     
     For multimodal datasets, the format is the same as for the tasks above. The difference is the additional keys `images`, `videos`, and `audios`, which hold the multimodal resources; the `<image>`, `<video>`, and `<audio>` tags mark where images, videos, and audio are inserted. The four examples below show the data format for pure text and for data that includes images, videos, and audio.
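
The sequence classification records added in this file can be produced and sanity-checked with a few lines of Python. This is a minimal sketch, not part of ms-swift: `to_jsonl` and `check_row` are hypothetical helper names, and the rule that `label` must be an integer class id is an assumption drawn only from the three sample records (they use `0` and `1`).

```python
import json

# Sample rows shaped like the records added in the diff.
records = [
    {"messages": [{"role": "user", "content": "The weather is really nice today"}], "label": 1},
    {"messages": [{"role": "user", "content": "Today is really unlucky"}], "label": 0},
    {"messages": [{"role": "user", "content": "So happy"}], "label": 1},
]


def to_jsonl(rows):
    """Serialize one JSON object per line, keeping non-ASCII text readable."""
    return "\n".join(json.dumps(row, ensure_ascii=False) for row in rows)


def check_row(row):
    """Hypothetical check: non-empty `messages` and an integer `label` (assumed)."""
    messages = row.get("messages")
    if not isinstance(messages, list) or not messages:
        return False
    if not all(isinstance(m, dict) and {"role", "content"} <= m.keys() for m in messages):
        return False
    label = row.get("label")
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(label, int) and not isinstance(label, bool)


text = to_jsonl(records)
parsed = [json.loads(line) for line in text.splitlines()]
print(all(check_row(row) for row in parsed))
```

Writing `text` to a `.jsonl` file gives one record per line, which is the layout the documentation hunk shows.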

docs/source_en/Customization/Custom-dataset.md

Lines changed: 7 additions & 0 deletions

    @@ -66,6 +66,13 @@ The following provides the recommended dataset format for ms-swift, where the sy
     {"messages": [{"role": "system", "content": "You are a useful and harmless math calculator"}, {"role": "user", "content": "What is 1 + 1?"}, {"role": "assistant", "content": "It equals 2"}, {"role": "user", "content": "What about adding 1?"}, {"role": "assistant", "content": "It equals 3"}], "label": true}
     ```
     
    +### Sequence Classification
    +```jsonl
    +{"messages": [{"role": "user", "content": "The weather is really nice today"}], "label": 1}
    +{"messages": [{"role": "user", "content": "Today is really unlucky"}], "label": 0}
    +{"messages": [{"role": "user", "content": "So happy"}], "label": 1}
    +```
    +
     ### Multimodal
     
     For multimodal datasets, the format is the same as the tasks mentioned above. The difference is the addition of several keys: `images`, `videos`, and `audios`, which represent multimodal resources. The tags `<image>`, `<video>`, and `<audio>` indicate the positions where images, videos, and audio are inserted, respectively. The four examples provided below demonstrate the data format for pure text, as well as formats that include image, video, and audio data.
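
The relationship between the placeholder tags and the resource keys described in the context paragraph can be illustrated with a small check. This is only a sketch under a stated assumption: that each `<image>`, `<video>`, or `<audio>` tag corresponds to exactly one entry in `images`, `videos`, or `audios`. The documentation text shown here does not state that rule, and `tag_counts_match` is a hypothetical helper, not an ms-swift function.

```python
# Hypothetical mapping from placeholder tag to the key holding its resources.
TAG_TO_KEY = {"<image>": "images", "<video>": "videos", "<audio>": "audios"}


def tag_counts_match(row):
    """Return True if every tag count equals the length of its resource list.

    Assumes each message `content` is a plain string and that one tag
    consumes one resource (an assumption, not confirmed by the source).
    """
    text = "".join(message["content"] for message in row["messages"])
    return all(
        text.count(tag) == len(row.get(key, []))
        for tag, key in TAG_TO_KEY.items()
    )


image_row = {
    "messages": [
        {"role": "user", "content": "<image>What is in this picture?"},
        {"role": "assistant", "content": "A cat."},
    ],
    "images": ["/path/to/cat.png"],  # placeholder path for illustration
}
text_row = {"messages": [{"role": "user", "content": "So happy"}], "label": 1}

print(tag_counts_match(image_row), tag_counts_match(text_row))
```

A text-only row passes trivially because it has no tags and no resource keys, which matches the statement that the multimodal format is the same as the other tasks plus the extra keys.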

swift/llm/dataset/loader.py

Lines changed: 3 additions & 3 deletions

    @@ -170,11 +170,11 @@ def _load_dataset_path(dataset_meta: DatasetMeta,
     dataset_path = dataset_meta.dataset_path
     
     ext = os.path.splitext(dataset_path)[1].lstrip('.')
    -ext = ext if ext != 'jsonl' else 'json'
    +file_type = {'jsonl': 'json', 'txt': 'text'}.get(ext) or ext
     kwargs = {'split': 'train', 'streaming': streaming, 'num_proc': num_proc}
    -if ext == 'csv':
    +if file_type == 'csv':
         kwargs['na_filter'] = False
    -dataset = hf_load_dataset(ext, data_files=dataset_path, **kwargs)
    +dataset = hf_load_dataset(file_type, data_files=dataset_path, **kwargs)
     
     dataset = dataset_meta.preprocess_func(
         dataset, num_proc=num_proc, strict=strict, load_from_cache_file=load_from_cache_file)
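
The effect of the changed line can be seen in isolation. The sketch below copies the mapping expression from the diff into a standalone function; `resolve_file_type` and `EXT_TO_FILE_TYPE` are hypothetical names introduced for illustration, and the surrounding loader (including `hf_load_dataset`, which is presumably an alias for the Hugging Face `datasets` loader) is deliberately left out.

```python
import os

# Mapping taken from the added line in the diff: extensions whose
# loader name differs from the extension itself.
EXT_TO_FILE_TYPE = {"jsonl": "json", "txt": "text"}


def resolve_file_type(dataset_path: str) -> str:
    """Map a file path to a loader name, falling back to the raw extension."""
    ext = os.path.splitext(dataset_path)[1].lstrip(".")
    # `.get(ext) or ext` returns the mapped name when one exists,
    # otherwise the extension unchanged (for example 'csv' or 'json').
    return EXT_TO_FILE_TYPE.get(ext) or ext


for path in ("train.jsonl", "notes.txt", "table.csv", "data.json"):
    print(path, "->", resolve_file_type(path))
```

Before this commit only `jsonl` was rewritten, so a `.txt` path would have been passed on as `txt`. The new mapping sends it on as `text` while leaving every other extension as it was.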
