Commit a690057

[megatron] Support ovis2.5 (#5719)
1 parent 5e1de0f commit a690057

14 files changed: +139 -35 lines changed


docs/source/Instruction/命令行参数.md

Lines changed: 0 additions & 1 deletion
@@ -171,7 +171,6 @@
 - enable_dft_loss: Whether to use [DFT](https://arxiv.org/abs/2508.05629) (Dynamic Fine-Tuning) loss during SFT training. Defaults to False.
 - enable_channel_loss: Enable channel loss. Defaults to `False`. You need to prepare a "channel" field in the dataset; ms-swift groups the loss statistics by this field. For the dataset format, see [channel loss](../Customization/自定义数据集.md#channel-loss). Channel loss is compatible with techniques such as packing, padding_free, and loss_scale.
 - Note: this parameter is new in "ms-swift>=3.8"; to use channel loss with "ms-swift<3.8", please refer to the v3.7 documentation.
-- Note: this feature is currently not compatible with sequence parallelism; a fix is pending.
 - logging_dir: Path for TensorBoard logs. Defaults to None, i.e. `f'{self.output_dir}/runs'`.
 - predict_with_generate: Whether to use generation during validation. Defaults to False.
 - metric_for_best_model: Defaults to None, i.e. 'loss' when `predict_with_generate` is False and 'rouge-l' otherwise (no default is set for PPO training; GRPO training uses 'reward').

docs/source/Instruction/支持的模型和数据集.md

Lines changed: 2 additions & 2 deletions
@@ -701,8 +701,8 @@
 |[AIDC-AI/Ovis2-8B](https://modelscope.cn/models/AIDC-AI/Ovis2-8B)|ovis2|ovis2|transformers>=4.46.2, moviepy<2|&#x2718;|vision|[AIDC-AI/Ovis2-8B](https://huggingface.co/AIDC-AI/Ovis2-8B)|
 |[AIDC-AI/Ovis2-16B](https://modelscope.cn/models/AIDC-AI/Ovis2-16B)|ovis2|ovis2|transformers>=4.46.2, moviepy<2|&#x2718;|vision|[AIDC-AI/Ovis2-16B](https://huggingface.co/AIDC-AI/Ovis2-16B)|
 |[AIDC-AI/Ovis2-34B](https://modelscope.cn/models/AIDC-AI/Ovis2-34B)|ovis2|ovis2|transformers>=4.46.2, moviepy<2|&#x2718;|vision|[AIDC-AI/Ovis2-34B](https://huggingface.co/AIDC-AI/Ovis2-34B)|
-|[AIDC-AI/Ovis2.5-2B](https://modelscope.cn/models/AIDC-AI/Ovis2.5-2B)|ovis2_5|ovis2_5|transformers>=4.46.2, moviepy<2|&#x2718;|vision|[AIDC-AI/Ovis2.5-2B](https://huggingface.co/AIDC-AI/Ovis2.5-2B)|
-|[AIDC-AI/Ovis2.5-9B](https://modelscope.cn/models/AIDC-AI/Ovis2.5-9B)|ovis2_5|ovis2_5|transformers>=4.46.2, moviepy<2|&#x2718;|vision|[AIDC-AI/Ovis2.5-9B](https://huggingface.co/AIDC-AI/Ovis2.5-9B)|
+|[AIDC-AI/Ovis2.5-2B](https://modelscope.cn/models/AIDC-AI/Ovis2.5-2B)|ovis2_5|ovis2_5|transformers>=4.46.2, moviepy<2|&#x2714;|vision|[AIDC-AI/Ovis2.5-2B](https://huggingface.co/AIDC-AI/Ovis2.5-2B)|
+|[AIDC-AI/Ovis2.5-9B](https://modelscope.cn/models/AIDC-AI/Ovis2.5-9B)|ovis2_5|ovis2_5|transformers>=4.46.2, moviepy<2|&#x2714;|vision|[AIDC-AI/Ovis2.5-9B](https://huggingface.co/AIDC-AI/Ovis2.5-9B)|
 |[XiaomiMiMo/MiMo-VL-7B-SFT](https://modelscope.cn/models/XiaomiMiMo/MiMo-VL-7B-SFT)|mimo_vl|mimo_vl|transformers>=4.49, qwen_vl_utils>=0.0.6, decord|&#x2718;|vision, video|[XiaomiMiMo/MiMo-VL-7B-SFT](https://huggingface.co/XiaomiMiMo/MiMo-VL-7B-SFT)|
 |[XiaomiMiMo/MiMo-VL-7B-RL](https://modelscope.cn/models/XiaomiMiMo/MiMo-VL-7B-RL)|mimo_vl|mimo_vl|transformers>=4.49, qwen_vl_utils>=0.0.6, decord|&#x2718;|vision, video|[XiaomiMiMo/MiMo-VL-7B-RL](https://huggingface.co/XiaomiMiMo/MiMo-VL-7B-RL)|
 |[mispeech/midashenglm-7b](https://modelscope.cn/models/mispeech/midashenglm-7b)|midashenglm|midashenglm|transformers>=4.52, soundfile|&#x2718;|audio|[mispeech/midashenglm-7b](https://huggingface.co/mispeech/midashenglm-7b)|

docs/source_en/Instruction/Command-line-parameters.md

Lines changed: 0 additions & 1 deletion
@@ -174,7 +174,6 @@ This parameter list inherits from transformers `Seq2SeqTrainingArguments`, with
 - enable_dft_loss: Whether to use [DFT](https://arxiv.org/abs/2508.05629) (Dynamic Fine-Tuning) loss in SFT training, default is False.
 - enable_channel_loss: Enable channel loss, default is `False`. You need to prepare a "channel" field in your dataset; ms-swift will compute and aggregate the loss grouped by this field. For the dataset format, please refer to [channel loss](../Customization/Custom-dataset.md#channel-loss). Channel loss is compatible with techniques such as packing, padding-free, and loss scaling.
 - Note: This parameter is newly added in "ms-swift>=3.8". If you want to use channel loss in "ms-swift<3.8", please refer to the v3.7 documentation.
-- Note: This feature is currently not compatible with sequence parallelism and will be fixed later.
 - logging_dir: The path for TensorBoard logs. Defaults to None, which means it is set to `f'{self.output_dir}/runs'`.
 - predict_with_generate: Whether to use generation during validation, default is False.
 - metric_for_best_model: Default is None, which means that when predict_with_generate is set to False, it is set to 'loss'; otherwise, it is set to 'rouge-l' (during PPO training, the default value is not set; in GRPO training, it is set to 'reward').
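
The enable_channel_loss entry above expects each training sample to carry a "channel" field. As a quick illustration, here is a minimal, hypothetical JSONL record in the standard ms-swift messages format; the channel name "math" and the file name train.jsonl are invented for this example, and the linked channel-loss documentation remains the authoritative reference.

import json

# One training record: a normal conversational sample plus a "channel" label.
record = {
    'messages': [
        {'role': 'user', 'content': 'What is 17 * 24?'},
        {'role': 'assistant', 'content': '17 * 24 = 408.'},
    ],
    'channel': 'math',  # ms-swift groups and reports the loss per distinct channel value
}

with open('train.jsonl', 'a', encoding='utf-8') as f:
    f.write(json.dumps(record, ensure_ascii=False) + '\n')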

docs/source_en/Instruction/Supported-models-and-datasets.md

Lines changed: 2 additions & 2 deletions
@@ -701,8 +701,8 @@ The table below introduces the models integrated with ms-swift:
 |[AIDC-AI/Ovis2-8B](https://modelscope.cn/models/AIDC-AI/Ovis2-8B)|ovis2|ovis2|transformers>=4.46.2, moviepy<2|&#x2718;|vision|[AIDC-AI/Ovis2-8B](https://huggingface.co/AIDC-AI/Ovis2-8B)|
 |[AIDC-AI/Ovis2-16B](https://modelscope.cn/models/AIDC-AI/Ovis2-16B)|ovis2|ovis2|transformers>=4.46.2, moviepy<2|&#x2718;|vision|[AIDC-AI/Ovis2-16B](https://huggingface.co/AIDC-AI/Ovis2-16B)|
 |[AIDC-AI/Ovis2-34B](https://modelscope.cn/models/AIDC-AI/Ovis2-34B)|ovis2|ovis2|transformers>=4.46.2, moviepy<2|&#x2718;|vision|[AIDC-AI/Ovis2-34B](https://huggingface.co/AIDC-AI/Ovis2-34B)|
-|[AIDC-AI/Ovis2.5-2B](https://modelscope.cn/models/AIDC-AI/Ovis2.5-2B)|ovis2_5|ovis2_5|transformers>=4.46.2, moviepy<2|&#x2718;|vision|[AIDC-AI/Ovis2.5-2B](https://huggingface.co/AIDC-AI/Ovis2.5-2B)|
-|[AIDC-AI/Ovis2.5-9B](https://modelscope.cn/models/AIDC-AI/Ovis2.5-9B)|ovis2_5|ovis2_5|transformers>=4.46.2, moviepy<2|&#x2718;|vision|[AIDC-AI/Ovis2.5-9B](https://huggingface.co/AIDC-AI/Ovis2.5-9B)|
+|[AIDC-AI/Ovis2.5-2B](https://modelscope.cn/models/AIDC-AI/Ovis2.5-2B)|ovis2_5|ovis2_5|transformers>=4.46.2, moviepy<2|&#x2714;|vision|[AIDC-AI/Ovis2.5-2B](https://huggingface.co/AIDC-AI/Ovis2.5-2B)|
+|[AIDC-AI/Ovis2.5-9B](https://modelscope.cn/models/AIDC-AI/Ovis2.5-9B)|ovis2_5|ovis2_5|transformers>=4.46.2, moviepy<2|&#x2714;|vision|[AIDC-AI/Ovis2.5-9B](https://huggingface.co/AIDC-AI/Ovis2.5-9B)|
 |[XiaomiMiMo/MiMo-VL-7B-SFT](https://modelscope.cn/models/XiaomiMiMo/MiMo-VL-7B-SFT)|mimo_vl|mimo_vl|transformers>=4.49, qwen_vl_utils>=0.0.6, decord|&#x2718;|vision, video|[XiaomiMiMo/MiMo-VL-7B-SFT](https://huggingface.co/XiaomiMiMo/MiMo-VL-7B-SFT)|
 |[XiaomiMiMo/MiMo-VL-7B-RL](https://modelscope.cn/models/XiaomiMiMo/MiMo-VL-7B-RL)|mimo_vl|mimo_vl|transformers>=4.49, qwen_vl_utils>=0.0.6, decord|&#x2718;|vision, video|[XiaomiMiMo/MiMo-VL-7B-RL](https://huggingface.co/XiaomiMiMo/MiMo-VL-7B-RL)|
 |[mispeech/midashenglm-7b](https://modelscope.cn/models/mispeech/midashenglm-7b)|midashenglm|midashenglm|transformers>=4.52, soundfile|&#x2718;|audio|[mispeech/midashenglm-7b](https://huggingface.co/mispeech/midashenglm-7b)|

swift/llm/argument/train_args.py

Lines changed: 1 addition & 1 deletion
@@ -133,7 +133,7 @@ def _check_padding_free(self):
         if self.padding_free or self.packing:
             if self.packing:
                 feature = 'packing'
-                self.padding_free = False
+                self.padding_free = True
             else:
                 feature = 'padding_free'
             if self.attn_impl not in {'flash_attn', 'flash_attention_2', 'flash_attention_3'}:

swift/llm/dataset/utils.py

Lines changed: 2 additions & 2 deletions
@@ -141,7 +141,7 @@ def __init__(
             **kwargs,
     ):
         template.packing = True
-        template.padding_free = True
+        template.padding_free = True  # TODO: remove
         self.template = template
         self.dataset = dataset
         self.num_proc = num_proc
@@ -200,7 +200,7 @@ def __init__(
             **kwargs,
     ):
         template.packing = True
-        template.padding_free = True
+        template.padding_free = True  # TODO: remove
         self.template = template
         self.dataset = dataset
         self.num_proc = num_proc

swift/llm/model/register.py

Lines changed: 1 addition & 1 deletion
@@ -226,7 +226,7 @@ def get_model_tokenizer_from_local(model_dir: str,
         model_config.keys_to_ignore_at_inference.append('past_key_values')

     torch_dtype = model_info.torch_dtype
-    model_config.torch_dtype = torch_dtype
+    HfConfigFactory.set_config_attr(model_config, 'torch_dtype', torch_dtype, include_vit=True)
     HfConfigFactory.compat_zero3(model_config)
     rope_scaling = kwargs.get('rope_scaling')
     max_model_len = kwargs.get('max_model_len')

swift/llm/model/utils.py

Lines changed: 14 additions & 8 deletions
@@ -44,9 +44,9 @@ def update_attn_impl(config: PretrainedConfig,
             attn_impl_keys = [attn_impl_keys]
         attn_impl_keys = attn_impl_keys or AttnImpl.attn_impl_keys
         for key in attn_impl_keys:
-            HfConfigFactory.set_config_attr(config, key, attn_impl, ensure_set=False)
+            HfConfigFactory.set_config_attr(config, key, attn_impl, include_vit=True, ensure_set=False)
         for key in AttnImpl.use_flash_attn_keys:
-            HfConfigFactory.set_config_attr(config, key, use_flash_attn, ensure_set=False)
+            HfConfigFactory.set_config_attr(config, key, use_flash_attn, include_vit=True, ensure_set=False)


 @dataclass
@@ -88,6 +88,7 @@ def get_torch_dtype(config: Union[PretrainedConfig, Dict[str, Any]],
     @staticmethod
     def _get_config_attrs(config: Union[PretrainedConfig, Dict[str, Any]],
                           attr_name: str,
+                          include_vit: bool = False,
                           parent_key: Optional[str] = None) -> List[Tuple[PretrainedConfig, Any]]:
         res = []
         if isinstance(config, dict):
@@ -96,8 +97,10 @@ def _get_config_attrs(config: Union[PretrainedConfig, Dict[str, Any]],
             keys = dir(config)
         else:
             return []
-
-        if attr_name in keys and parent_key in [None, 'language_config', 'llm_config', 'text_config']:
+        config_keys = [None, 'language_config', 'llm_config', 'text_config']
+        if include_vit:
+            config_keys += ['vit_config', 'vision_config', 'audio_config']
+        if attr_name in keys and parent_key in config_keys:
             res.append((config, deep_getattr(config, attr_name)))

         for k in keys:
@@ -106,7 +109,7 @@ def _get_config_attrs(config: Union[PretrainedConfig, Dict[str, Any]],
                 v = config[k]
             else:
                 v = getattr(config, k)
-            res += HfConfigFactory._get_config_attrs(v, attr_name, k)
+            res += HfConfigFactory._get_config_attrs(v, attr_name, include_vit, k)
         return res

     @staticmethod
@@ -119,9 +122,11 @@ def is_moe_model(config) -> bool:
         return False

     @staticmethod
-    def get_config_attr(config: Union[PretrainedConfig, Dict[str, Any]], attr_name: str) -> Optional[Any]:
+    def get_config_attr(config: Union[PretrainedConfig, Dict[str, Any]],
+                        attr_name: str,
+                        include_vit: bool = False) -> Optional[Any]:
         """Get the value of the attribute named attr_name."""
-        attrs = HfConfigFactory._get_config_attrs(config, attr_name)
+        attrs = HfConfigFactory._get_config_attrs(config, attr_name, include_vit)
         if len(attrs) == 0:
             return None
         else:
@@ -131,9 +136,10 @@ def get_config_attr(config: Union[PretrainedConfig, Dict[str, Any]], attr_name:
     def set_config_attr(config: Union[PretrainedConfig, Dict[str, Any]],
                         attr_name: str,
                         value: Any,
+                        include_vit: bool = False,
                         ensure_set: bool = True) -> int:
         """Set all the attr_name attributes to value."""
-        attrs = HfConfigFactory._get_config_attrs(config, attr_name)
+        attrs = HfConfigFactory._get_config_attrs(config, attr_name, include_vit)
         if ensure_set and len(attrs) == 0:
             attrs.append((config, None))
         for config, _ in attrs:
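
For context on the include_vit flag threaded through the helpers above: with include_vit=True the recursive config walk also matches attributes found under ViT/vision/audio sub-configs, which is what lets register.py push torch_dtype (and update_attn_impl push the attention implementation) into the vision tower config as well. Below is a minimal, standalone sketch of that lookup rule for plain dict configs; the helper name get_attrs and the toy config are invented for illustration, so this is not the library code itself.

from typing import Any, Dict, List, Optional, Tuple

TEXT_KEYS = [None, 'language_config', 'llm_config', 'text_config']
VIT_KEYS = ['vit_config', 'vision_config', 'audio_config']


def get_attrs(config: Dict[str, Any], attr_name: str, include_vit: bool = False,
              parent_key: Optional[str] = None) -> List[Tuple[Dict[str, Any], Any]]:
    """Collect (sub_config, value) pairs for attr_name, mirroring the traversal rule above."""
    res = []
    allowed = TEXT_KEYS + (VIT_KEYS if include_vit else [])
    if attr_name in config and parent_key in allowed:
        res.append((config, config[attr_name]))
    for k, v in config.items():
        if isinstance(v, dict):
            res += get_attrs(v, attr_name, include_vit, k)
    return res


config = {'torch_dtype': 'float32', 'vision_config': {'torch_dtype': 'float32'}}
print(len(get_attrs(config, 'torch_dtype')))                    # 1 -> only the top-level config matches
print(len(get_attrs(config, 'torch_dtype', include_vit=True)))  # 2 -> the vision sub-config matches too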

swift/llm/template/template/qwen.py

Lines changed: 38 additions & 13 deletions
@@ -8,10 +8,11 @@
 import torch.nn.functional as F
 import transformers
 from packaging import version
+from PIL import Image
 from torch import nn
 from transformers.integrations import is_deepspeed_zero3_enabled

-from swift.llm import get_packed_seq_params, to_float_dtype
+from swift.llm import get_packed_seq_params, to_device, to_float_dtype
 from swift.utils import get_env_args, is_deepspeed_enabled
 from ..base import Template
 from ..constant import LLMTemplateType, MLLMTemplateType
@@ -717,11 +718,17 @@ def replace_tag(self, media_type: Literal['image', 'video', 'audio'], index: int


 class Ovis2_5Template(ThinkingTemplate):
-    num_frames = 8
     use_model = True
     skip_prompt = False
     support_padding_free = True

+    def init_processor(self, processor) -> None:
+        super().init_processor(processor)
+        self.min_pixels = get_env_args('min_pixels', int, 448 * 448)
+        self.max_pixels = get_env_args('max_pixels', int, 1344 * 1792)
+        self.video_max_pixels = get_env_args('video_max_pixels', int, 896 * 896)
+        self.num_frames = get_env_args('num_frames', int, 8)
+
     def replace_tag(self, media_type: Literal['image', 'video', 'audio'], index: int,
                     inputs: StdTemplateInputs) -> List[Context]:
@@ -733,14 +740,10 @@ def replace_tag(self, media_type: Literal['image', 'video', 'audio'], index: int
             if self.mode == 'vllm':
                 return ['<video>']
             else:
-                num_frames = get_env_args('num_frames', int, self.num_frames)
-                inputs.images = load_video_ovis2_5(inputs.videos[index], num_frames)
+                inputs.images = load_video_ovis2_5(inputs.videos[index], self.num_frames)
                 return [[-200], '\n']

     def _encode(self, inputs: StdTemplateInputs) -> Dict[str, Any]:
-        min_pixels = get_env_args('min_pixels', int, 448 * 448)
-        max_pixels = get_env_args('max_pixels', int, 1344 * 1792)
-        video_max_pixels = get_env_args('video_max_pixels', int, 896 * 896)
         encoded = super()._encode(inputs)
         images = inputs.images
         input_ids = encoded['input_ids']
@@ -749,7 +752,7 @@ def _encode(self, inputs: StdTemplateInputs) -> Dict[str, Any]:
         if inputs.videos:
             assert len(inputs.videos) == 1, 'only support single video'
             encoded['pixel_values'], encoded['grid_thws'] = visual_tokenizer.preprocess(
-                video=inputs.images, min_pixels=min_pixels, max_pixels=video_max_pixels)
+                video=inputs.images, min_pixels=self.min_pixels, max_pixels=self.video_max_pixels)
             num_video_tokens = encoded['grid_thws'].prod(dim=-1)
             num_video_tokens //= visual_tokenizer.vit.config.hidden_stride**2
             num_video_tokens //= visual_tokenizer.vit.config.temporal_patch_size
@@ -762,7 +765,7 @@ def _get_new_tokens(i):
                 input_ids, encoded['labels'], encoded['loss_scale'], idx_list, _get_new_tokens)
         elif images:
             pixel_values, grid_thws = zip(
-                *(visual_tokenizer.preprocess(image=image, min_pixels=min_pixels, max_pixels=max_pixels)
+                *(visual_tokenizer.preprocess(image=image, min_pixels=self.min_pixels, max_pixels=self.max_pixels)
                   for image in images))
             encoded['pixel_values'] = torch.cat(pixel_values, dim=0)
             encoded['grid_thws'] = torch.cat(grid_thws, dim=0)
@@ -782,10 +785,32 @@ def _get_new_tokens(i):
         return encoded

     def _post_encode(self, model: nn.Module, inputs: Dict[str, Any]) -> Dict[str, Any]:
-        inputs_embeds = model.merge_multimodal(
-            input_ids=inputs['input_ids'],
-            pixel_values=inputs.pop('pixel_values', None),
-            grid_thws=inputs.pop('grid_thws', None))
+        input_ids = inputs['input_ids']
+        pixel_values = inputs.get('pixel_values', None)
+        grid_thws = inputs.get('grid_thws')
+        INDICATOR_IDS = [-301, -302, -303, -304]
+        VISUAL_ATOM_ID = -300
+        placeholder_token_mask = torch.lt(input_ids, 0)
+        inputs_embeds = model.get_wte()(torch.masked_fill(input_ids, placeholder_token_mask, 0))
+
+        if pixel_values is not None or is_deepspeed_enabled():
+            visual_indicator_embeds = model.vte(model.indicator_token_indices).to(
+                dtype=inputs_embeds.dtype, device=inputs_embeds.device)
+            for i, indicator_id in enumerate(INDICATOR_IDS):
+                inputs_embeds[input_ids == indicator_id] = visual_indicator_embeds[i]
+        if pixel_values is not None:
+            visual_tokens = model.visual_tokenizer(pixel_values, grid_thws)
+            visual_embeds = model.vte(visual_tokens).to(dtype=inputs_embeds.dtype, device=inputs_embeds.device)
+            inputs_embeds[input_ids == VISUAL_ATOM_ID] = visual_embeds
+        elif is_deepspeed_enabled():
+            media_inputs = model.visual_tokenizer.preprocess(
+                Image.new('RGB', (32, 32), (0, 0, 0)), min_pixels=self.min_pixels, max_pixels=self.max_pixels)
+            media_inputs = to_device(media_inputs, input_ids.device)
+            pixel_values = media_inputs['pixel_values'].type(inputs_embeds.dtype)
+            visual_tokens = model.visual_tokenizer(pixel_values, media_inputs['grid_thws'])
+            visual_embeds = model.vte(visual_tokens).to(dtype=inputs_embeds.dtype, device=inputs_embeds.device)
+            inputs_embeds = inputs_embeds + visual_embeds.mean() * 0.
+
         return {'inputs_embeds': inputs_embeds}

     def _data_collator_mm_data(self, batch: List[Dict[str, Any]]) -> Dict[str, Any]:
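
To make the new _post_encode easier to follow, here is a self-contained sketch of the placeholder-scatter step it performs: negative ids mark multimodal positions, those positions are masked to 0 for the word-embedding lookup, and the resulting slots are then overwritten with visual features. The tensor sizes, toy embedding table, and random "visual" features below are invented for illustration and do not come from the Ovis2.5 model.

import torch
import torch.nn as nn

VISUAL_ATOM_ID = -300
vocab_size, hidden = 100, 16
wte = nn.Embedding(vocab_size, hidden)  # stand-in for model.get_wte()

# Two text tokens, three visual-atom placeholders, one text token.
input_ids = torch.tensor([[5, 7, VISUAL_ATOM_ID, VISUAL_ATOM_ID, VISUAL_ATOM_ID, 9]])

placeholder_mask = torch.lt(input_ids, 0)                           # True at multimodal slots
inputs_embeds = wte(torch.masked_fill(input_ids, placeholder_mask, 0))

visual_embeds = torch.randn(int(placeholder_mask.sum()), hidden)    # stand-in for the ViT/vte output
inputs_embeds[input_ids == VISUAL_ATOM_ID] = visual_embeds          # scatter into the sequence

print(inputs_embeds.shape)  # torch.Size([1, 6, 16])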

swift/megatron/model/constant.py

Lines changed: 1 addition & 0 deletions
@@ -5,6 +5,7 @@ class MegatronModelType:
     qwen2_vl = 'qwen2_vl'
     qwen2_5_vl = 'qwen2_5_vl'
     qwen2_5_omni = 'qwen2_5_omni'
+    ovis2_5 = 'ovis2_5'

     internvl3 = 'internvl3'
     glm4_5v = 'glm4_5v'
