
Commit a7d2158

[model] support ovis2.5 (#5426)

1 parent c6fb911 commit a7d2158

12 files changed: +176 -12 lines changed

docs/source/Instruction/命令行参数.md

Lines changed: 7 additions & 0 deletions
@@ -722,6 +722,13 @@ In addition to the model-specific parameters of qwen2_5_vl and qwen2_audio, qwen2_5_omni also
 ### ovis1_6, ovis2
 - MAX_PARTITION: Default is 9, refer to [here](https://github.com/AIDC-AI/Ovis/blob/d248e34d755a95d24315c40e2489750a869c5dbc/ovis/model/modeling_ovis.py#L312)

+### ovis2_5
+The meanings of the following parameters can be found in the example code [here](https://modelscope.cn/models/AIDC-AI/Ovis2.5-2B).
+- MIN_PIXELS: int type, default is `448 * 448`.
+- MAX_PIXELS: int type, default is `1344 * 1792`. If OOM occurs, you can reduce this value.
+- VIDEO_MAX_PIXELS: int type, default is `896 * 896`.
+- NUM_FRAMES: default is 8. Used for video frame sampling.
+
 ### mplug_owl3, mplug_owl3_241101
 - MAX_NUM_FRAMES: Default is 16, refer to [here](https://modelscope.cn/models/iic/mPLUG-Owl3-7B-240728)

docs/source/Instruction/支持的模型和数据集.md

Lines changed: 2 additions & 0 deletions
@@ -691,6 +691,8 @@
 |[AIDC-AI/Ovis2-8B](https://modelscope.cn/models/AIDC-AI/Ovis2-8B)|ovis2|ovis2|transformers>=4.46.2, moviepy<2|&#x2718;|vision|[AIDC-AI/Ovis2-8B](https://huggingface.co/AIDC-AI/Ovis2-8B)|
 |[AIDC-AI/Ovis2-16B](https://modelscope.cn/models/AIDC-AI/Ovis2-16B)|ovis2|ovis2|transformers>=4.46.2, moviepy<2|&#x2718;|vision|[AIDC-AI/Ovis2-16B](https://huggingface.co/AIDC-AI/Ovis2-16B)|
 |[AIDC-AI/Ovis2-34B](https://modelscope.cn/models/AIDC-AI/Ovis2-34B)|ovis2|ovis2|transformers>=4.46.2, moviepy<2|&#x2718;|vision|[AIDC-AI/Ovis2-34B](https://huggingface.co/AIDC-AI/Ovis2-34B)|
+|[AIDC-AI/Ovis2.5-2B](https://modelscope.cn/models/AIDC-AI/Ovis2.5-2B)|ovis2_5|ovis2_5|transformers>=4.46.2, moviepy<2|&#x2718;|vision|[AIDC-AI/Ovis2.5-2B](https://huggingface.co/AIDC-AI/Ovis2.5-2B)|
+|[AIDC-AI/Ovis2.5-9B](https://modelscope.cn/models/AIDC-AI/Ovis2.5-9B)|ovis2_5|ovis2_5|transformers>=4.46.2, moviepy<2|&#x2718;|vision|[AIDC-AI/Ovis2.5-9B](https://huggingface.co/AIDC-AI/Ovis2.5-9B)|
 |[XiaomiMiMo/MiMo-VL-7B-SFT](https://modelscope.cn/models/XiaomiMiMo/MiMo-VL-7B-SFT)|mimo_vl|mimo_vl|transformers>=4.49, qwen_vl_utils>=0.0.6, decord|&#x2718;|vision, video|[XiaomiMiMo/MiMo-VL-7B-SFT](https://huggingface.co/XiaomiMiMo/MiMo-VL-7B-SFT)|
 |[XiaomiMiMo/MiMo-VL-7B-RL](https://modelscope.cn/models/XiaomiMiMo/MiMo-VL-7B-RL)|mimo_vl|mimo_vl|transformers>=4.49, qwen_vl_utils>=0.0.6, decord|&#x2718;|vision, video|[XiaomiMiMo/MiMo-VL-7B-RL](https://huggingface.co/XiaomiMiMo/MiMo-VL-7B-RL)|
 |[mispeech/midashenglm-7b](https://modelscope.cn/models/mispeech/midashenglm-7b)|midashenglm|midashenglm|transformers>=4.52, soundfile|&#x2718;|audio|[mispeech/midashenglm-7b](https://huggingface.co/mispeech/midashenglm-7b)|

docs/source_en/Instruction/Command-line-parameters.md

Lines changed: 9 additions & 0 deletions
@@ -738,6 +738,15 @@ For the meaning of the arguments, please refer to [here](https://modelscope.cn/m
 ### ovis1_6, ovis2
 - MAX_PARTITION: Default is 9, refer to [here](https://github.com/AIDC-AI/Ovis/blob/d248e34d755a95d24315c40e2489750a869c5dbc/ovis/model/modeling_ovis.py#L312)

+### ovis2_5
+
+The meanings of the following parameters can be found in the example code [here](https://modelscope.cn/models/AIDC-AI/Ovis2.5-2B).
+
+- MIN_PIXELS: int type, default is `448 * 448`.
+- MAX_PIXELS: int type, default is `1344 * 1792`. If OOM (out of memory) occurs, you can reduce this value.
+- VIDEO_MAX_PIXELS: int type, default is `896 * 896`.
+- NUM_FRAMES: default is 8. Used for video frame sampling.
+
 ### mplug_owl3, mplug_owl3_241101
 - MAX_NUM_FRAMES: Default is 16, refer to [here](https://modelscope.cn/models/iic/mPLUG-Owl3-7B-240728)
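
For illustration only, a minimal Python sketch of how these environment variables end up as integer defaults at encode time. The helper `_env_int` below is a simplified stand-in for swift's `get_env_args`; it is not the library's implementation.

```python
# Simplified stand-in for swift.utils.get_env_args: read an int-valued
# environment variable and fall back to the documented default.
import os


def _env_int(name: str, default: int) -> int:
    value = os.environ.get(name)
    return int(value) if value is not None else default


# Defaults mirror the ovis2_5 parameters documented above.
min_pixels = _env_int('MIN_PIXELS', 448 * 448)
max_pixels = _env_int('MAX_PIXELS', 1344 * 1792)            # lower this value on OOM
video_max_pixels = _env_int('VIDEO_MAX_PIXELS', 896 * 896)
num_frames = _env_int('NUM_FRAMES', 8)
print(min_pixels, max_pixels, video_max_pixels, num_frames)
```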

docs/source_en/Instruction/Supported-models-and-datasets.md

Lines changed: 2 additions & 0 deletions
@@ -691,6 +691,8 @@ The table below introduces the models integrated with ms-swift:
 |[AIDC-AI/Ovis2-8B](https://modelscope.cn/models/AIDC-AI/Ovis2-8B)|ovis2|ovis2|transformers>=4.46.2, moviepy<2|&#x2718;|vision|[AIDC-AI/Ovis2-8B](https://huggingface.co/AIDC-AI/Ovis2-8B)|
 |[AIDC-AI/Ovis2-16B](https://modelscope.cn/models/AIDC-AI/Ovis2-16B)|ovis2|ovis2|transformers>=4.46.2, moviepy<2|&#x2718;|vision|[AIDC-AI/Ovis2-16B](https://huggingface.co/AIDC-AI/Ovis2-16B)|
 |[AIDC-AI/Ovis2-34B](https://modelscope.cn/models/AIDC-AI/Ovis2-34B)|ovis2|ovis2|transformers>=4.46.2, moviepy<2|&#x2718;|vision|[AIDC-AI/Ovis2-34B](https://huggingface.co/AIDC-AI/Ovis2-34B)|
+|[AIDC-AI/Ovis2.5-2B](https://modelscope.cn/models/AIDC-AI/Ovis2.5-2B)|ovis2_5|ovis2_5|transformers>=4.46.2, moviepy<2|&#x2718;|vision|[AIDC-AI/Ovis2.5-2B](https://huggingface.co/AIDC-AI/Ovis2.5-2B)|
+|[AIDC-AI/Ovis2.5-9B](https://modelscope.cn/models/AIDC-AI/Ovis2.5-9B)|ovis2_5|ovis2_5|transformers>=4.46.2, moviepy<2|&#x2718;|vision|[AIDC-AI/Ovis2.5-9B](https://huggingface.co/AIDC-AI/Ovis2.5-9B)|
 |[XiaomiMiMo/MiMo-VL-7B-SFT](https://modelscope.cn/models/XiaomiMiMo/MiMo-VL-7B-SFT)|mimo_vl|mimo_vl|transformers>=4.49, qwen_vl_utils>=0.0.6, decord|&#x2718;|vision, video|[XiaomiMiMo/MiMo-VL-7B-SFT](https://huggingface.co/XiaomiMiMo/MiMo-VL-7B-SFT)|
 |[XiaomiMiMo/MiMo-VL-7B-RL](https://modelscope.cn/models/XiaomiMiMo/MiMo-VL-7B-RL)|mimo_vl|mimo_vl|transformers>=4.49, qwen_vl_utils>=0.0.6, decord|&#x2718;|vision, video|[XiaomiMiMo/MiMo-VL-7B-RL](https://huggingface.co/XiaomiMiMo/MiMo-VL-7B-RL)|
 |[mispeech/midashenglm-7b](https://modelscope.cn/models/mispeech/midashenglm-7b)|midashenglm|midashenglm|transformers>=4.52, soundfile|&#x2718;|audio|[mispeech/midashenglm-7b](https://huggingface.co/mispeech/midashenglm-7b)|

swift/llm/model/constant.py

Lines changed: 1 addition & 0 deletions
@@ -157,6 +157,7 @@ class MLLMModelType:
     ovis1_6 = 'ovis1_6'
     ovis1_6_llama3 = 'ovis1_6_llama3'
     ovis2 = 'ovis2'
+    ovis2_5 = 'ovis2_5'
     mimo_vl = 'mimo_vl'
     midashenglm = 'midashenglm'

swift/llm/model/model/qwen.py

Lines changed: 36 additions & 3 deletions
@@ -926,7 +926,7 @@ def update(self, key_states: torch.Tensor, value_states: torch.Tensor, layer_idx
         ],
         TemplateType.ovis1_6,
         get_model_tokenizer_ovis,
-        model_arch=ModelArch.ovis1_6,
+        model_arch=ModelArch.ovis,
         architectures=['Ovis'],
         tags=['vision'],
         requires=['transformers>=4.42'],
@@ -942,7 +942,7 @@ def update(self, key_states: torch.Tensor, value_states: torch.Tensor, layer_idx
         ],
         TemplateType.ovis1_6_llama3,
         get_model_tokenizer_ovis,
-        model_arch=ModelArch.ovis1_6,
+        model_arch=ModelArch.ovis,
         architectures=['Ovis'],
         tags=['vision'],
     ))
@@ -962,7 +962,40 @@ def update(self, key_states: torch.Tensor, value_states: torch.Tensor, layer_idx
         ],
         TemplateType.ovis2,
         get_model_tokenizer_ovis,
-        model_arch=ModelArch.ovis1_6,
+        model_arch=ModelArch.ovis,
+        architectures=['Ovis'],
+        tags=['vision'],
+        requires=['transformers>=4.46.2', 'moviepy<2'],
+    ))
+
+
+def get_model_tokenizer_ovis2_5(*args, **kwargs):
+    model, tokenizer = get_model_tokenizer_with_flash_attn(*args, **kwargs)
+    if model is not None:
+        model.visual_tokenizer.to(model.dtype)
+        model.vte.to(model.dtype)
+
+        func_list = ['generate', 'forward', 'get_input_embeddings']
+        use_submodel_func(model, 'llm', func_list)
+        embedding = model.get_input_embeddings()
+        patch_output_clone(embedding)
+        patch_get_input_embeddings(model.visual_tokenizer, 'vit.vision_model.embeddings.patch_embedding')
+
+    return model, tokenizer
+
+
+register_model(
+    ModelMeta(
+        MLLMModelType.ovis2_5,
+        [
+            ModelGroup([
+                Model('AIDC-AI/Ovis2.5-2B', 'AIDC-AI/Ovis2.5-2B'),
+                Model('AIDC-AI/Ovis2.5-9B', 'AIDC-AI/Ovis2.5-9B'),
+            ]),
+        ],
+        TemplateType.ovis2_5,
+        get_model_tokenizer_ovis2_5,
+        model_arch=ModelArch.ovis,
         architectures=['Ovis'],
         tags=['vision'],
         requires=['transformers>=4.46.2', 'moviepy<2'],
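
As a usage sketch (not part of the commit): once the registration above is in place, the new checkpoints should resolve to the `ovis2_5` model type. The snippet below assumes ms-swift's public `get_model_tokenizer` entry point and a ModelScope-reachable checkpoint; it is illustrative only.

```python
# Hedged sketch: load one of the newly registered Ovis2.5 checkpoints.
# Assumes the public swift.llm.get_model_tokenizer entry point and that
# 'AIDC-AI/Ovis2.5-2B' can be downloaded from ModelScope.
from swift.llm import get_model_tokenizer

model, processor = get_model_tokenizer('AIDC-AI/Ovis2.5-2B')
# The model type is inferred from the ModelMeta registered above.
print(model.model_meta.model_type)  # expected: 'ovis2_5'
```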

swift/llm/model/model_arch.py

Lines changed: 7 additions & 6 deletions
@@ -68,7 +68,7 @@ class MLLMModelArch:
     got_ocr2 = 'got_ocr2'
     dots_ocr = 'dots_ocr'

-    ovis1_6 = 'ovis1_6'
+    ovis = 'ovis'
     molmo = 'molmo'
     emu3_chat = 'emu3_chat'
     megrez_omni = 'megrez_omni'
@@ -593,11 +593,12 @@ def register_model_arch(model_arch: ModelKeys, *, exist_ok: bool = False) -> Non
     vision_tower='vision_model',
 ))

-register_model_arch(MultiModelKeys(
-    MLLMModelArch.ovis1_6,
-    language_model='llm',
-    vision_tower='visual_tokenizer',
-))
+register_model_arch(
+    MultiModelKeys(
+        MLLMModelArch.ovis,
+        language_model='llm',
+        vision_tower=['visual_tokenizer', 'vte'],
+    ))

 register_model_arch(
     MultiModelKeys(
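
The `vision_tower=['visual_tokenizer', 'vte']` keys matter wherever the model is split into LLM and vision parts, for example when freezing the vision side. A minimal illustration of how such prefixes can be used, not the library's actual code path:

```python
# Illustration only: freeze parameters under the registered vision_tower
# prefixes ('visual_tokenizer' and 'vte') while leaving the 'llm' part trainable.
from torch import nn


def freeze_vision_tower(model: nn.Module, prefixes=('visual_tokenizer', 'vte')) -> None:
    for name, param in model.named_parameters():
        if any(name.startswith(p) for p in prefixes):
            param.requires_grad_(False)
```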

swift/llm/template/constant.py

Lines changed: 1 addition & 0 deletions
@@ -116,6 +116,7 @@ class MLLMTemplateType:
     ovis1_6 = 'ovis1_6'
     ovis1_6_llama3 = 'ovis1_6_llama3'
     ovis2 = 'ovis2'
+    ovis2_5 = 'ovis2_5'
     mimo_vl = 'mimo_vl'
     midashenglm = 'midashenglm'

swift/llm/template/template/qwen.py

Lines changed: 82 additions & 1 deletion
@@ -8,6 +8,7 @@
 import torch.nn.functional as F
 import transformers
 from packaging import version
+from torch import nn

 from swift.llm import get_packed_seq_params, to_device, to_float_dtype
 from swift.utils import get_env_args, is_deepspeed_enabled
@@ -17,7 +18,7 @@
 from ..template_inputs import StdTemplateInputs
 from ..template_meta import TemplateMeta
 from ..utils import Context, Word, findall
-from ..vision_utils import load_audio, load_batch, load_video_ovis2
+from ..vision_utils import load_audio, load_batch, load_video_ovis2, load_video_ovis2_5
 from .llama import Llama3TemplateMeta
 from .utils import DEFAULT_SYSTEM, ChatmlTemplateMeta, ThinkingTemplate

@@ -736,6 +737,86 @@ def replace_tag(self, media_type: Literal['image', 'video', 'audio'], index: int
 ))


+class Ovis2_5Template(ThinkingTemplate):
+    num_frames = 8
+    use_model = True
+    skip_prompt = False
+
+    def replace_tag(self, media_type: Literal['image', 'video', 'audio'], index: int,
+                    inputs: StdTemplateInputs) -> List[Context]:
+        if media_type == 'image':
+            return [[-200], '\n']
+        elif media_type == 'video':
+            num_frames = get_env_args('num_frames', int, self.num_frames)
+            inputs.images = load_video_ovis2_5(inputs.videos[index], num_frames)
+            return [[-200], '\n']
+
+    def _encode(self, inputs: StdTemplateInputs) -> Dict[str, Any]:
+        min_pixels = get_env_args('min_pixels', int, 448 * 448)
+        max_pixels = get_env_args('max_pixels', int, 1344 * 1792)
+        video_max_pixels = get_env_args('video_max_pixels', int, 896 * 896)
+        encoded = super()._encode(inputs)
+        images = inputs.images
+        input_ids = encoded['input_ids']
+        visual_tokenizer = self.model.visual_tokenizer
+        idx_list = findall(input_ids, [-200])
+        if inputs.videos:
+            assert len(inputs.videos) == 1, 'only support single video'
+            encoded['pixel_values'], encoded['grid_thws'] = visual_tokenizer.preprocess(
+                video=inputs.images, min_pixels=min_pixels, max_pixels=video_max_pixels)
+            num_video_tokens = encoded['grid_thws'].prod(dim=-1)
+            num_video_tokens //= visual_tokenizer.vit.config.hidden_stride**2
+            num_video_tokens //= visual_tokenizer.vit.config.temporal_patch_size
+
+            def _get_new_tokens(i):
+                token_len = num_video_tokens[i].item()
+                return [-303] + [-300] * token_len + [-304]
+
+            input_ids, encoded['labels'], encoded['loss_scale'] = self._extend_tokens(
+                input_ids, encoded['labels'], encoded['loss_scale'], idx_list, _get_new_tokens)
+        elif images:
+            pixel_values, grid_thws = zip(
+                *(visual_tokenizer.preprocess(image=image, min_pixels=min_pixels, max_pixels=max_pixels)
+                  for image in images))
+            encoded['pixel_values'] = torch.cat(pixel_values, dim=0)
+            encoded['grid_thws'] = torch.cat(grid_thws, dim=0)
+
+            num_image_atoms = encoded['grid_thws'].prod(dim=-1)
+            num_image_atoms //= visual_tokenizer.vit.config.hidden_stride**2
+            num_image_atoms //= visual_tokenizer.vit.config.temporal_patch_size
+
+            def _get_new_tokens(i):
+                token_len = num_image_atoms[i].item()
+                return [-301] + [-300] * token_len + [-302]
+
+            input_ids, encoded['labels'], encoded['loss_scale'] = self._extend_tokens(
+                input_ids, encoded['labels'], encoded['loss_scale'], idx_list, _get_new_tokens)
+
+        encoded['input_ids'] = input_ids
+        return encoded
+
+    def _post_encode(self, model: nn.Module, inputs: Dict[str, Any]) -> Dict[str, Any]:
+        inputs_embeds = model.merge_multimodal(
+            input_ids=inputs['input_ids'],
+            pixel_values=inputs.pop('pixel_values', None),
+            grid_thws=inputs.pop('grid_thws', None))
+        return {'inputs_embeds': inputs_embeds}
+
+    def _data_collator_mm_data(self, batch: List[Dict[str, Any]]) -> Dict[str, Any]:
+        res = super()._data_collator_mm_data(batch)
+        grid_thws = self.concat_tensor(batch, 'grid_thws', 0)
+        if grid_thws is not None:
+            res['grid_thws'] = grid_thws
+        return res
+
+
+register_template(QwenTemplateMeta(
+    MLLMTemplateType.ovis2_5,
+    template_cls=Ovis2_5Template,
+    default_system=None,
+))
+
+
 @dataclass
 class MarcoO1TemplateMeta(QwenTemplateMeta):
     default_system: Optional[str] = """
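
A small worked example of the placeholder-token arithmetic in `_encode` above: the number of `[-300]` atoms per image is `grid_thws.prod(-1)` divided by `hidden_stride ** 2` and `temporal_patch_size`. The grid and config values below are made-up assumptions, not taken from the actual Ovis2.5 config.

```python
# Worked example (assumed values) of the visual-token count used by
# Ovis2_5Template._encode for a single image.
import torch

grid_thws = torch.tensor([[1, 32, 56]])  # assumed (t, h, w) patch grid for one image
hidden_stride = 2                        # assumed visual_tokenizer.vit.config.hidden_stride
temporal_patch_size = 1                  # assumed visual_tokenizer.vit.config.temporal_patch_size

num_image_atoms = grid_thws.prod(dim=-1) // hidden_stride**2 // temporal_patch_size
# Each image placeholder [-200] is then expanded to
# [-301] + [-300] * num_image_atoms + [-302] before merge_multimodal runs.
print(num_image_atoms.item())  # 448 with the assumed values above
```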

swift/llm/template/vision_utils.py

Lines changed: 9 additions & 0 deletions
@@ -273,3 +273,12 @@ def load_video_ovis2(video_path, num_frames):
     frames = [clip.get_frame(index / clip.fps) for index in sampled_indices]
     frames = [Image.fromarray(frame, mode='RGB') for frame in frames]
     return frames
+
+
+def load_video_ovis2_5(video_path, num_frames):
+    from moviepy.editor import VideoFileClip
+    with VideoFileClip(video_path) as clip:
+        total_frames = int(clip.fps * clip.duration)
+        indices = [int(i * total_frames / num_frames) for i in range(num_frames)]
+        frames = [Image.fromarray(clip.get_frame(t)) for t in (idx / clip.fps for idx in indices)]
+    return frames
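
A hypothetical call for reference; `demo.mp4` is a placeholder path, and `moviepy<2` must be installed to match the model's `requires` entry.

```python
# Hypothetical usage of the new frame-sampling helper; 'demo.mp4' is a placeholder path.
from swift.llm.template.vision_utils import load_video_ovis2_5

frames = load_video_ovis2_5('demo.mp4', num_frames=8)
print(len(frames), frames[0].size)  # 8 PIL.Image frames sampled uniformly over the clip
```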
