
Commit 75ec804

[model] support ovis2.5 (#5426)
1 parent ae71bb7 · commit 75ec804

12 files changed: +177 additions, -13 deletions


docs/source/Instruction/命令行参数.md

Lines changed: 7 additions & 0 deletions
@@ -728,6 +728,13 @@ In addition to the model-specific parameters of qwen2_5_vl and qwen2_audio, qwen2_5_omni also
 ### ovis1_6, ovis2
 - MAX_PARTITION: Default is 9, refer to [here](https://github.com/AIDC-AI/Ovis/blob/d248e34d755a95d24315c40e2489750a869c5dbc/ovis/model/modeling_ovis.py#L312)

+### ovis2_5
+The meanings of the following parameters can be found in the example code [here](https://modelscope.cn/models/AIDC-AI/Ovis2.5-2B).
+- MIN_PIXELS: int type, default is `448 * 448`.
+- MAX_PIXELS: int type, default is `1344 * 1792`. If OOM occurs, you can reduce this value.
+- VIDEO_MAX_PIXELS: int type, default is `896 * 896`.
+- NUM_FRAMES: default is 8. Used for video frame sampling.
+
 ### mplug_owl3, mplug_owl3_241101
 - MAX_NUM_FRAMES: Default is 16, refer to [here](https://modelscope.cn/models/iic/mPLUG-Owl3-7B-240728)
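The ovis2_5 settings above are plain environment-variable overrides read inside the template. A minimal standalone sketch of how such values are typically resolved to the documented defaults (this mimics, but is not, `swift.utils.get_env_args`):

```python
import os

# Resolve an integer setting from an upper-cased environment variable,
# falling back to the documented default. Sketch only; not ms-swift internals.
def env_int(name: str, default: int) -> int:
    value = os.environ.get(name.upper())
    return int(value) if value is not None else default

min_pixels = env_int('min_pixels', 448 * 448)              # MIN_PIXELS
max_pixels = env_int('max_pixels', 1344 * 1792)            # MAX_PIXELS, lower this on OOM
video_max_pixels = env_int('video_max_pixels', 896 * 896)  # VIDEO_MAX_PIXELS
num_frames = env_int('num_frames', 8)                      # NUM_FRAMES
print(min_pixels, max_pixels, video_max_pixels, num_frames)
```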

docs/source/Instruction/支持的模型和数据集.md

Lines changed: 2 additions & 0 deletions
@@ -691,6 +691,8 @@
 |[AIDC-AI/Ovis2-8B](https://modelscope.cn/models/AIDC-AI/Ovis2-8B)|ovis2|ovis2|transformers>=4.46.2, moviepy<2|&#x2718;|vision|[AIDC-AI/Ovis2-8B](https://huggingface.co/AIDC-AI/Ovis2-8B)|
 |[AIDC-AI/Ovis2-16B](https://modelscope.cn/models/AIDC-AI/Ovis2-16B)|ovis2|ovis2|transformers>=4.46.2, moviepy<2|&#x2718;|vision|[AIDC-AI/Ovis2-16B](https://huggingface.co/AIDC-AI/Ovis2-16B)|
 |[AIDC-AI/Ovis2-34B](https://modelscope.cn/models/AIDC-AI/Ovis2-34B)|ovis2|ovis2|transformers>=4.46.2, moviepy<2|&#x2718;|vision|[AIDC-AI/Ovis2-34B](https://huggingface.co/AIDC-AI/Ovis2-34B)|
+|[AIDC-AI/Ovis2.5-2B](https://modelscope.cn/models/AIDC-AI/Ovis2.5-2B)|ovis2_5|ovis2_5|transformers>=4.46.2, moviepy<2|&#x2718;|vision|[AIDC-AI/Ovis2.5-2B](https://huggingface.co/AIDC-AI/Ovis2.5-2B)|
+|[AIDC-AI/Ovis2.5-9B](https://modelscope.cn/models/AIDC-AI/Ovis2.5-9B)|ovis2_5|ovis2_5|transformers>=4.46.2, moviepy<2|&#x2718;|vision|[AIDC-AI/Ovis2.5-9B](https://huggingface.co/AIDC-AI/Ovis2.5-9B)|
 |[XiaomiMiMo/MiMo-VL-7B-SFT](https://modelscope.cn/models/XiaomiMiMo/MiMo-VL-7B-SFT)|mimo_vl|mimo_vl|transformers>=4.49, qwen_vl_utils>=0.0.6, decord|&#x2718;|vision, video|[XiaomiMiMo/MiMo-VL-7B-SFT](https://huggingface.co/XiaomiMiMo/MiMo-VL-7B-SFT)|
 |[XiaomiMiMo/MiMo-VL-7B-RL](https://modelscope.cn/models/XiaomiMiMo/MiMo-VL-7B-RL)|mimo_vl|mimo_vl|transformers>=4.49, qwen_vl_utils>=0.0.6, decord|&#x2718;|vision, video|[XiaomiMiMo/MiMo-VL-7B-RL](https://huggingface.co/XiaomiMiMo/MiMo-VL-7B-RL)|
 |[mispeech/midashenglm-7b](https://modelscope.cn/models/mispeech/midashenglm-7b)|midashenglm|midashenglm|transformers>=4.52, soundfile|&#x2718;|audio|[mispeech/midashenglm-7b](https://huggingface.co/mispeech/midashenglm-7b)|

docs/source_en/Instruction/Command-line-parameters.md

Lines changed: 9 additions & 0 deletions
@@ -745,6 +745,15 @@ For the meaning of the arguments, please refer to [here](https://modelscope.cn/m
 ### ovis1_6, ovis2
 - MAX_PARTITION: Default is 9, refer to [here](https://github.com/AIDC-AI/Ovis/blob/d248e34d755a95d24315c40e2489750a869c5dbc/ovis/model/modeling_ovis.py#L312)

+### ovis2_5
+
+The meanings of the following parameters can be found in the example code [here](https://modelscope.cn/models/AIDC-AI/Ovis2.5-2B).
+
+- MIN_PIXELS: int type, default is `448 * 448`.
+- MAX_PIXELS: int type, default is `1344 * 1792`. If OOM (out of memory) occurs, you can reduce this value.
+- VIDEO_MAX_PIXELS: int type, default is `896 * 896`.
+- NUM_FRAMES: default is 8. Used for video frame sampling.
+
 ### mplug_owl3, mplug_owl3_241101
 - MAX_NUM_FRAMES: Default is 16, refer to [here](https://modelscope.cn/models/iic/mPLUG-Owl3-7B-240728)
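Because these settings are read from the environment, they can be exported before a run. A hypothetical launch sketch (the `swift sft` arguments below are illustrative, not a verified command line):

```python
import os
import subprocess

# Hypothetical example: shrink MAX_PIXELS to mitigate OOM and sample more
# video frames, then start a fine-tuning run. The dataset path is a placeholder.
os.environ['MAX_PIXELS'] = str(896 * 896)
os.environ['NUM_FRAMES'] = '16'
subprocess.run(['swift', 'sft',
                '--model', 'AIDC-AI/Ovis2.5-2B',
                '--dataset', 'my_vqa.jsonl'], check=True)
```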

docs/source_en/Instruction/Supported-models-and-datasets.md

Lines changed: 2 additions & 0 deletions
@@ -691,6 +691,8 @@ The table below introduces the models integrated with ms-swift:
 |[AIDC-AI/Ovis2-8B](https://modelscope.cn/models/AIDC-AI/Ovis2-8B)|ovis2|ovis2|transformers>=4.46.2, moviepy<2|&#x2718;|vision|[AIDC-AI/Ovis2-8B](https://huggingface.co/AIDC-AI/Ovis2-8B)|
 |[AIDC-AI/Ovis2-16B](https://modelscope.cn/models/AIDC-AI/Ovis2-16B)|ovis2|ovis2|transformers>=4.46.2, moviepy<2|&#x2718;|vision|[AIDC-AI/Ovis2-16B](https://huggingface.co/AIDC-AI/Ovis2-16B)|
 |[AIDC-AI/Ovis2-34B](https://modelscope.cn/models/AIDC-AI/Ovis2-34B)|ovis2|ovis2|transformers>=4.46.2, moviepy<2|&#x2718;|vision|[AIDC-AI/Ovis2-34B](https://huggingface.co/AIDC-AI/Ovis2-34B)|
+|[AIDC-AI/Ovis2.5-2B](https://modelscope.cn/models/AIDC-AI/Ovis2.5-2B)|ovis2_5|ovis2_5|transformers>=4.46.2, moviepy<2|&#x2718;|vision|[AIDC-AI/Ovis2.5-2B](https://huggingface.co/AIDC-AI/Ovis2.5-2B)|
+|[AIDC-AI/Ovis2.5-9B](https://modelscope.cn/models/AIDC-AI/Ovis2.5-9B)|ovis2_5|ovis2_5|transformers>=4.46.2, moviepy<2|&#x2718;|vision|[AIDC-AI/Ovis2.5-9B](https://huggingface.co/AIDC-AI/Ovis2.5-9B)|
 |[XiaomiMiMo/MiMo-VL-7B-SFT](https://modelscope.cn/models/XiaomiMiMo/MiMo-VL-7B-SFT)|mimo_vl|mimo_vl|transformers>=4.49, qwen_vl_utils>=0.0.6, decord|&#x2718;|vision, video|[XiaomiMiMo/MiMo-VL-7B-SFT](https://huggingface.co/XiaomiMiMo/MiMo-VL-7B-SFT)|
 |[XiaomiMiMo/MiMo-VL-7B-RL](https://modelscope.cn/models/XiaomiMiMo/MiMo-VL-7B-RL)|mimo_vl|mimo_vl|transformers>=4.49, qwen_vl_utils>=0.0.6, decord|&#x2718;|vision, video|[XiaomiMiMo/MiMo-VL-7B-RL](https://huggingface.co/XiaomiMiMo/MiMo-VL-7B-RL)|
 |[mispeech/midashenglm-7b](https://modelscope.cn/models/mispeech/midashenglm-7b)|midashenglm|midashenglm|transformers>=4.52, soundfile|&#x2718;|audio|[mispeech/midashenglm-7b](https://huggingface.co/mispeech/midashenglm-7b)|

swift/llm/model/constant.py

Lines changed: 1 addition & 0 deletions
@@ -157,6 +157,7 @@ class MLLMModelType:
     ovis1_6 = 'ovis1_6'
     ovis1_6_llama3 = 'ovis1_6_llama3'
     ovis2 = 'ovis2'
+    ovis2_5 = 'ovis2_5'
     mimo_vl = 'mimo_vl'
     midashenglm = 'midashenglm'

swift/llm/model/model/qwen.py

Lines changed: 36 additions & 3 deletions
@@ -926,7 +926,7 @@ def update(self, key_states: torch.Tensor, value_states: torch.Tensor, layer_idx
         ],
         TemplateType.ovis1_6,
         get_model_tokenizer_ovis,
-        model_arch=ModelArch.ovis1_6,
+        model_arch=ModelArch.ovis,
         architectures=['Ovis'],
         tags=['vision'],
         requires=['transformers>=4.42'],
@@ -942,7 +942,7 @@ def update(self, key_states: torch.Tensor, value_states: torch.Tensor, layer_idx
         ],
         TemplateType.ovis1_6_llama3,
         get_model_tokenizer_ovis,
-        model_arch=ModelArch.ovis1_6,
+        model_arch=ModelArch.ovis,
         architectures=['Ovis'],
         tags=['vision'],
     ))
@@ -962,7 +962,40 @@ def update(self, key_states: torch.Tensor, value_states: torch.Tensor, layer_idx
         ],
         TemplateType.ovis2,
         get_model_tokenizer_ovis,
-        model_arch=ModelArch.ovis1_6,
+        model_arch=ModelArch.ovis,
+        architectures=['Ovis'],
+        tags=['vision'],
+        requires=['transformers>=4.46.2', 'moviepy<2'],
+    ))
+
+
+def get_model_tokenizer_ovis2_5(*args, **kwargs):
+    model, tokenizer = get_model_tokenizer_with_flash_attn(*args, **kwargs)
+    if model is not None:
+        model.visual_tokenizer.to(model.dtype)
+        model.vte.to(model.dtype)
+
+        func_list = ['generate', 'forward', 'get_input_embeddings']
+        use_submodel_func(model, 'llm', func_list)
+        embedding = model.get_input_embeddings()
+        patch_output_clone(embedding)
+        patch_get_input_embeddings(model.visual_tokenizer, 'vit.vision_model.embeddings.patch_embedding')
+
+    return model, tokenizer
+
+
+register_model(
+    ModelMeta(
+        MLLMModelType.ovis2_5,
+        [
+            ModelGroup([
+                Model('AIDC-AI/Ovis2.5-2B', 'AIDC-AI/Ovis2.5-2B'),
+                Model('AIDC-AI/Ovis2.5-9B', 'AIDC-AI/Ovis2.5-9B'),
+            ]),
+        ],
+        TemplateType.ovis2_5,
+        get_model_tokenizer_ovis2_5,
+        model_arch=ModelArch.ovis,
         architectures=['Ovis'],
         tags=['vision'],
         requires=['transformers>=4.46.2', 'moviepy<2'],
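With the registration above, the new model type is reachable through ms-swift's usual loading path. A minimal usage sketch (assuming the public `get_model_tokenizer` helper and that the checkpoint is available locally or via ModelScope):

```python
from swift.llm import get_model_tokenizer

# Loads AIDC-AI/Ovis2.5-2B; per the registration above, get_model_tokenizer_ovis2_5
# casts visual_tokenizer/vte to the model dtype and delegates generate/forward/
# get_input_embeddings to the wrapped `llm` submodule.
model, tokenizer = get_model_tokenizer('AIDC-AI/Ovis2.5-2B')
print(type(model).__name__)
```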

swift/llm/model/model_arch.py

Lines changed: 7 additions & 6 deletions
@@ -68,7 +68,7 @@ class MLLMModelArch:
     got_ocr2 = 'got_ocr2'
     dots_ocr = 'dots_ocr'

-    ovis1_6 = 'ovis1_6'
+    ovis = 'ovis'
     molmo = 'molmo'
     emu3_chat = 'emu3_chat'
     megrez_omni = 'megrez_omni'
@@ -593,11 +593,12 @@ def register_model_arch(model_arch: ModelKeys, *, exist_ok: bool = False) -> Non
     vision_tower='vision_model',
 ))

-register_model_arch(MultiModelKeys(
-    MLLMModelArch.ovis1_6,
-    language_model='llm',
-    vision_tower='visual_tokenizer',
-))
+register_model_arch(
+    MultiModelKeys(
+        MLLMModelArch.ovis,
+        language_model='llm',
+        vision_tower=['visual_tokenizer', 'vte'],
+    ))

 register_model_arch(
     MultiModelKeys(
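Listing both `visual_tokenizer` and `vte` under `vision_tower` matters because tuner and freeze logic selects parameters by submodule-name prefix. A toy, self-contained sketch of that prefix-based selection (not ms-swift internals):

```python
import torch.nn as nn

# Toy stand-in for an Ovis-style model with an LLM, a visual tokenizer and a
# visual token embedding table (vte).
class ToyOvis(nn.Module):
    def __init__(self):
        super().__init__()
        self.llm = nn.Linear(8, 8)
        self.visual_tokenizer = nn.Linear(8, 8)
        self.vte = nn.Embedding(16, 8)

model = ToyOvis()
vision_prefixes = ['visual_tokenizer', 'vte']  # mirrors the vision_tower entry
for name, param in model.named_parameters():
    if any(name.startswith(prefix) for prefix in vision_prefixes):
        param.requires_grad = False  # roughly what a freeze-vit style option would do

print([name for name, p in model.named_parameters() if p.requires_grad])
# ['llm.weight', 'llm.bias'] -- only the language model stays trainable
```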

swift/llm/template/constant.py

Lines changed: 1 addition & 0 deletions
@@ -116,6 +116,7 @@ class MLLMTemplateType:
     ovis1_6 = 'ovis1_6'
     ovis1_6_llama3 = 'ovis1_6_llama3'
     ovis2 = 'ovis2'
+    ovis2_5 = 'ovis2_5'
     mimo_vl = 'mimo_vl'
     midashenglm = 'midashenglm'

swift/llm/template/template/qwen.py

Lines changed: 82 additions & 1 deletion
@@ -8,6 +8,7 @@
 import torch.nn.functional as F
 import transformers
 from packaging import version
+from torch import nn

 from swift.llm import get_packed_seq_params, to_device, to_float_dtype
 from swift.utils import get_env_args, is_deepspeed_enabled
@@ -17,7 +18,7 @@
 from ..template_inputs import StdTemplateInputs
 from ..template_meta import TemplateMeta
 from ..utils import Context, Word, findall
-from ..vision_utils import load_audio, load_batch, load_video_ovis2
+from ..vision_utils import load_audio, load_batch, load_video_ovis2, load_video_ovis2_5
 from .llama import Llama3TemplateMeta
 from .utils import DEFAULT_SYSTEM, ChatmlTemplateMeta, ThinkingTemplate
@@ -731,6 +732,86 @@ def replace_tag(self, media_type: Literal['image', 'video', 'audio'], index: int
 ))


+class Ovis2_5Template(ThinkingTemplate):
+    num_frames = 8
+    use_model = True
+    skip_prompt = False
+
+    def replace_tag(self, media_type: Literal['image', 'video', 'audio'], index: int,
+                    inputs: StdTemplateInputs) -> List[Context]:
+        if media_type == 'image':
+            return [[-200], '\n']
+        elif media_type == 'video':
+            num_frames = get_env_args('num_frames', int, self.num_frames)
+            inputs.images = load_video_ovis2_5(inputs.videos[index], num_frames)
+            return [[-200], '\n']
+
+    def _encode(self, inputs: StdTemplateInputs) -> Dict[str, Any]:
+        min_pixels = get_env_args('min_pixels', int, 448 * 448)
+        max_pixels = get_env_args('max_pixels', int, 1344 * 1792)
+        video_max_pixels = get_env_args('video_max_pixels', int, 896 * 896)
+        encoded = super()._encode(inputs)
+        images = inputs.images
+        input_ids = encoded['input_ids']
+        visual_tokenizer = self.model.visual_tokenizer
+        idx_list = findall(input_ids, [-200])
+        if inputs.videos:
+            assert len(inputs.videos) == 1, 'only support single video'
+            encoded['pixel_values'], encoded['grid_thws'] = visual_tokenizer.preprocess(
+                video=inputs.images, min_pixels=min_pixels, max_pixels=video_max_pixels)
+            num_video_tokens = encoded['grid_thws'].prod(dim=-1)
+            num_video_tokens //= visual_tokenizer.vit.config.hidden_stride**2
+            num_video_tokens //= visual_tokenizer.vit.config.temporal_patch_size
+
+            def _get_new_tokens(i):
+                token_len = num_video_tokens[i].item()
+                return [-303] + [-300] * token_len + [-304]
+
+            input_ids, encoded['labels'], encoded['loss_scale'] = self._extend_tokens(
+                input_ids, encoded['labels'], encoded['loss_scale'], idx_list, _get_new_tokens)
+        elif images:
+            pixel_values, grid_thws = zip(
+                *(visual_tokenizer.preprocess(image=image, min_pixels=min_pixels, max_pixels=max_pixels)
+                  for image in images))
+            encoded['pixel_values'] = torch.cat(pixel_values, dim=0)
+            encoded['grid_thws'] = torch.cat(grid_thws, dim=0)
+
+            num_image_atoms = encoded['grid_thws'].prod(dim=-1)
+            num_image_atoms //= visual_tokenizer.vit.config.hidden_stride**2
+            num_image_atoms //= visual_tokenizer.vit.config.temporal_patch_size
+
+            def _get_new_tokens(i):
+                token_len = num_image_atoms[i].item()
+                return [-301] + [-300] * token_len + [-302]
+
+            input_ids, encoded['labels'], encoded['loss_scale'] = self._extend_tokens(
+                input_ids, encoded['labels'], encoded['loss_scale'], idx_list, _get_new_tokens)
+
+        encoded['input_ids'] = input_ids
+        return encoded
+
+    def _post_encode(self, model: nn.Module, inputs: Dict[str, Any]) -> Dict[str, Any]:
+        inputs_embeds = model.merge_multimodal(
+            input_ids=inputs['input_ids'],
+            pixel_values=inputs.pop('pixel_values', None),
+            grid_thws=inputs.pop('grid_thws', None))
+        return {'inputs_embeds': inputs_embeds}
+
+    def _data_collator_mm_data(self, batch: List[Dict[str, Any]]) -> Dict[str, Any]:
+        res = super()._data_collator_mm_data(batch)
+        grid_thws = self.concat_tensor(batch, 'grid_thws', 0)
+        if grid_thws is not None:
+            res['grid_thws'] = grid_thws
+        return res
+
+
+register_template(QwenTemplateMeta(
+    MLLMTemplateType.ovis2_5,
+    template_cls=Ovis2_5Template,
+    default_system=None,
+))
+
+
 @dataclass
 class MarcoO1TemplateMeta(QwenTemplateMeta):
     default_system: Optional[str] = """

swift/llm/template/vision_utils.py

Lines changed: 9 additions & 0 deletions
@@ -295,3 +295,12 @@ def load_video_ovis2(video_path, num_frames):
     frames = [clip.get_frame(index / clip.fps) for index in sampled_indices]
     frames = [Image.fromarray(frame, mode='RGB') for frame in frames]
     return frames
+
+
+def load_video_ovis2_5(video_path, num_frames):
+    from moviepy.editor import VideoFileClip
+    with VideoFileClip(video_path) as clip:
+        total_frames = int(clip.fps * clip.duration)
+        indices = [int(i * total_frames / num_frames) for i in range(num_frames)]
+        frames = [Image.fromarray(clip.get_frame(t)) for t in (idx / clip.fps for idx in indices)]
+    return frames
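`load_video_ovis2_5` samples `num_frames` frames uniformly by frame index and converts each to a PIL image. The index arithmetic, worked through on a hypothetical 10-second clip at 30 fps:

```python
# Pure arithmetic, no video decoding: which timestamps would be grabbed for a
# hypothetical clip (fps and duration are made-up example values).
fps, duration, num_frames = 30.0, 10.0, 8
total_frames = int(fps * duration)                                   # 300
indices = [int(i * total_frames / num_frames) for i in range(num_frames)]
timestamps = [idx / fps for idx in indices]                          # seconds passed to clip.get_frame
print(indices)                            # [0, 37, 75, 112, 150, 187, 225, 262]
print([round(t, 2) for t in timestamps])  # [0.0, 1.23, 2.5, 3.73, 5.0, 6.23, 7.5, 8.73]
```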
