Description
The error message is as follows:
Traceback (most recent call last):
  File "/root/lmms-engine-main/src/lmms_engine/launch/cli.py", line 120, in main
    task.build()
  File "/root/lmms-engine-main/src/lmms_engine/train/runner.py", line 60, in build
    self.train_dataset = self._build_train_dataset()
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/lmms-engine-main/src/lmms_engine/train/runner.py", line 120, in _build_train_dataset
    dataset.build()
  File "/root/lmms-engine-main/src/lmms_engine/datasets/iterable/base_iterable_dataset.py", line 49, in build
    self._build_from_config()
  File "/root/lmms-engine-main/src/lmms_engine/datasets/iterable/multimodal_iterable_dataset.py", line 88, in _build_from_config
    self.data_list, self.data_folder = DataUtilities.load_inline_datasets(self.config.datasets)
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/lmms-engine-main/src/lmms_engine/utils/data_utils.py", line 226, in load_inline_datasets
    data_list = concatenate_datasets(data_list)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/datasets/combine.py", line 221, in concatenate_datasets
    return _concatenate_map_style_datasets(dsets, info=info, split=split, axis=axis)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/datasets/arrow_dataset.py", line 6507, in _concatenate_map_style_datasets
    _check_if_features_can_be_aligned([dset.features for dset in dsets])
  File "/usr/local/lib/python3.12/dist-packages/datasets/features/features.py", line 2319, in _check_if_features_can_be_aligned
    raise ValueError(
ValueError: The features can't be aligned because the key messages of features {'id': Value('string'), 'messages': List({'content': List({'image_url': Value('null'), 'text': Value('string'), 'type': Value('string'), 'video_url': {'url': Value('string')}}), 'role': Value('string')})} has unexpected type - List({'content': List({'image_url': Value('null'), 'text': Value('string'), 'type': Value('string'), 'video_url': {'url': Value('string')}}), 'role': Value('string')}) (expected either List({'content': List({'image_url': {'url': Value('string')}, 'text': Value('string'), 'type': Value('string'), 'video_url': {'url': Value('string')}}), 'role': Value('string')}) or Value("null").
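The two `messages` types in the error above differ in only one nested field. The following is a minimal sketch, not part of lmms-engine or the datasets library, that pinpoints the difference. It assumes the two types are transcribed by hand into plain Python objects (a dict for a struct, a one-element list for `List(...)`, a string for `Value(...)`); the helper name `schema_diff` is hypothetical.

```python
def schema_diff(a, b, path=""):
    """Return (path, type_in_a, type_in_b) for every place two schemas disagree."""
    if isinstance(a, dict) and isinstance(b, dict):
        diffs = []
        for key in sorted(set(a) | set(b)):
            child = f"{path}.{key}" if path else key
            diffs += schema_diff(a.get(key, "<missing>"), b.get(key, "<missing>"), child)
        return diffs
    if isinstance(a, list) and isinstance(b, list):
        # A one-element list stands for List(<item type>).
        return schema_diff(a[0], b[0], path + "[]")
    return [] if a == b else [(path, a, b)]


# Hand-transcribed from the error message (assumed encoding, see above).
video_url = {"url": "string"}
found = {"messages": [{"content": [{"image_url": "null", "text": "string",
                                    "type": "string", "video_url": video_url}],
                       "role": "string"}]}
expected = {"messages": [{"content": [{"image_url": {"url": "string"}, "text": "string",
                                       "type": "string", "video_url": video_url}],
                          "role": "string"}]}

print(schema_diff(found, expected))
```

The only mismatch is `image_url`, which is typed as null in one dataset and as a struct with a `url` string in the other. According to the error message, the check accepts `Value("null")` only as the type of the whole `messages` key, so a null nested inside it is rejected.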
In short, the SFT stage uses eight datasets in total, but their schemas are not consistent with one another.
The key frame is:
File "/root/lmms-engine-main/src/lmms_engine/utils/data_utils.py", line 226, in load_inline_datasets
    data_list = concatenate_datasets(data_list)
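concatenate_datasets is a function from the datasets library.
Testing shows that the datasets fall into three groups that are compatible internally:
geminicot, longvideoreflection, and tvg are compatible with each other.
llavacot, openvlthinker, and wemath are compatible with each other.
video_r1 and longvideoreason are compatible with each other.
Combining datasets from any two of these groups raises an error, and the error message is essentially the same each time.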
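The grouping above was found by trial and error. It could also be derived by bucketing datasets on their schema. This is a minimal sketch under stated assumptions: the helper `group_by_schema`, the names `set_a` to `set_f`, and the toy schemas are all hypothetical and do not reflect the real differences between the three groups.

```python
import json
from collections import defaultdict


def group_by_schema(schemas):
    """Group dataset names whose schemas are exactly identical."""
    groups = defaultdict(list)
    for name, schema in schemas.items():
        # A canonical JSON dump serves as a hashable schema signature.
        groups[json.dumps(schema, sort_keys=True)].append(name)
    return sorted(sorted(names) for names in groups.values())


# Toy schemas (hypothetical): only the nullness of two fields varies.
image_only = {"image_url": {"url": "string"}, "video_url": "null"}
video_only = {"image_url": "null", "video_url": {"url": "string"}}
both = {"image_url": {"url": "string"}, "video_url": {"url": "string"}}

schemas = {
    "set_a": image_only, "set_b": image_only,
    "set_c": video_only, "set_d": video_only,
    "set_e": both, "set_f": both,
}
print(group_by_schema(schemas))
```

In practice the schema of each Parquet file could be read without loading the data, for example with `pyarrow.parquet.read_schema`, and its string form used as the signature.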
In other words, the datasets section of longvt_7b_sft.yaml cannot list all eight datasets at once. It can only hold one of the three groups above, for example:
datasets:
  - path: longvideotool/LongVT-Parquet/longvt_sft_llavacot_54k5.parquet
    data_folder: ""
    data_type: parquet
  - path: longvideotool/LongVT-Parquet/longvt_sft_openvlthinker_2k8.parquet
    data_folder: ""
    data_type: parquet
  - path: longvideotool/LongVT-Parquet/longvt_sft_wemath_602.parquet
    data_folder: ""
    data_type: parquet
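One possible direction for a fix, offered only as a sketch, is to compute a single unified schema in which a null-typed field takes the concrete type seen in another dataset. The helper `merge_schema` and the plain-Python schema encoding (a dict for a struct, a one-element list for a list, a string for a leaf type) are assumptions for illustration, not lmms-engine or datasets code.

```python
def merge_schema(a, b):
    """Merge two schemas, letting a "null" leaf take the other side's type."""
    if a == "null":
        return b
    if b == "null":
        return a
    if isinstance(a, dict) and isinstance(b, dict):
        if set(a) != set(b):
            raise ValueError(f"field sets differ: {sorted(a)} vs {sorted(b)}")
        return {key: merge_schema(a[key], b[key]) for key in a}
    if isinstance(a, list) and isinstance(b, list):
        # A one-element list stands for a list of that item type.
        return [merge_schema(a[0], b[0])]
    if a != b:
        raise ValueError(f"incompatible types: {a!r} vs {b!r}")
    return a


# The two content-item types from the error message, transcribed by hand.
item_null = {"image_url": "null", "text": "string", "type": "string",
             "video_url": {"url": "string"}}
item_full = {"image_url": {"url": "string"}, "text": "string", "type": "string",
             "video_url": {"url": "string"}}

unified = merge_schema([item_null], [item_full])
print(unified)
```

With real datasets, the analogous step would be to build one `Features` object and cast every dataset to it with `Dataset.cast` before calling `concatenate_datasets`. Whether that cast accepts a nested null column in the installed datasets version has not been verified here.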
Looking forward to a fix 😀. @mwxely