Commit 53efa65

[model] Add support for Keye-VL-1_5-8B (#5815)

Authored by hellopahe, committed by Jintao-Huang
1 parent 9279844, commit 53efa65

File tree

8 files changed: +51 −12 lines

docs/source/Instruction/命令行参数.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -691,7 +691,7 @@ App参数继承于[部署参数](#部署参数), [Web-UI参数](#Web-UI参数)
 
 以下参数的含义可以在对应模型官方repo或者其推理代码中找到相应含义。
 
-### qwen2_vl, qvq, qwen2_5_vl, mimo_vl, keye_vl
+### qwen2_vl, qvq, qwen2_5_vl, mimo_vl, keye_vl, keye_vl_1_5
 
 参数含义同`qwen_vl_utils`或者`qwen_omni_utils`库,可以查看[这里](https://github.com/QwenLM/Qwen2.5-VL/blob/main/qwen-vl-utils/src/qwen_vl_utils/vision_process.py#L24)
 
 - IMAGE_FACTOR: 默认为28。
```

docs/source/Instruction/支持的模型和数据集.md

Lines changed: 1 addition & 0 deletions

```diff
@@ -905,6 +905,7 @@
 |[moonshotai/Kimi-VL-A3B-Thinking](https://modelscope.cn/models/moonshotai/Kimi-VL-A3B-Thinking)|kimi_vl|kimi_vl|transformers<4.49|&#x2718;|-|[moonshotai/Kimi-VL-A3B-Thinking](https://huggingface.co/moonshotai/Kimi-VL-A3B-Thinking)|
 |[moonshotai/Kimi-VL-A3B-Thinking-2506](https://modelscope.cn/models/moonshotai/Kimi-VL-A3B-Thinking-2506)|kimi_vl|kimi_vl|transformers<4.49|&#x2718;|-|[moonshotai/Kimi-VL-A3B-Thinking-2506](https://huggingface.co/moonshotai/Kimi-VL-A3B-Thinking-2506)|
 |[Kwai-Keye/Keye-VL-8B-Preview](https://modelscope.cn/models/Kwai-Keye/Keye-VL-8B-Preview)|keye_vl|keye_vl|keye_vl_utils|&#x2718;|vision|[Kwai-Keye/Keye-VL-8B-Preview](https://huggingface.co/Kwai-Keye/Keye-VL-8B-Preview)|
+|[Kwai-Keye/Keye-VL-1_5-8B](https://modelscope.cn/models/Kwai-Keye/Keye-VL-1_5-8B)|keye_vl_1_5|keye_vl|keye_vl_utils>=1.5.2|&#x2718;|vision|[Kwai-Keye/Keye-VL-1_5-8B](https://huggingface.co/Kwai-Keye/Keye-VL-1_5-8B)|
 |[rednote-hilab/dots.ocr](https://modelscope.cn/models/rednote-hilab/dots.ocr)|dots_ocr|dots_ocr|transformers>=4.51.0|&#x2718;|-|[rednote-hilab/dots.ocr](https://huggingface.co/rednote-hilab/dots.ocr)|
 |[LLM-Research/Phi-3-vision-128k-instruct](https://modelscope.cn/models/LLM-Research/Phi-3-vision-128k-instruct)|phi3_vision|phi3_vision|transformers>=4.36|&#x2718;|vision|[microsoft/Phi-3-vision-128k-instruct](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct)|
 |[LLM-Research/Phi-3.5-vision-instruct](https://modelscope.cn/models/LLM-Research/Phi-3.5-vision-instruct)|phi3_vision|phi3_vision|transformers>=4.36|&#x2718;|vision|[microsoft/Phi-3.5-vision-instruct](https://huggingface.co/microsoft/Phi-3.5-vision-instruct)|
```

docs/source_en/Instruction/Command-line-parameters.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -709,7 +709,7 @@ Specific model arguments can be set using `--model_kwargs` or environment variab
 
 The definitions of the parameters listed below can be found in each model’s official repository or in its inference code.
 
-### qwen2_vl, qvq, qwen2_5_vl, mimo_vl, keye_vl
+### qwen2_vl, qvq, qwen2_5_vl, mimo_vl, keye_vl, keye_vl_1_5
 
 The parameter meanings are the same as in the `qwen_vl_utils` or `qwen_omni_utils` library. You can refer to [here](https://github.com/QwenLM/Qwen2.5-VL/blob/main/qwen-vl-utils/src/qwen_vl_utils/vision_process.py#L24)
 
 - IMAGE_FACTOR: Default is 28
```
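These parameters are read from environment variables at load time, falling back to the library defaults. A minimal sketch of that pattern (`get_env_args` here is a simplified, hypothetical stand-in, not ms-swift's actual implementation):

```python
import os


def get_env_args(key: str, type_func, default):
    # Read KEY from the environment (upper-cased name); fall back to the
    # library default when the variable is unset.
    raw = os.environ.get(key.upper())
    return default if raw is None else type_func(raw)


# Falls back to 28 when IMAGE_FACTOR is not set in the environment.
image_factor = get_env_args('image_factor', int, 28)
```

This is why `IMAGE_FACTOR=56 swift infer ...` style overrides work: the value is parsed with the key's type function (`float` for `fps`, `int` otherwise) before being written back onto the vision-processing module.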

docs/source_en/Instruction/Supported-models-and-datasets.md

Lines changed: 1 addition & 0 deletions

```diff
@@ -905,6 +905,7 @@ The table below introduces the models integrated with ms-swift:
 |[moonshotai/Kimi-VL-A3B-Thinking](https://modelscope.cn/models/moonshotai/Kimi-VL-A3B-Thinking)|kimi_vl|kimi_vl|transformers<4.49|&#x2718;|-|[moonshotai/Kimi-VL-A3B-Thinking](https://huggingface.co/moonshotai/Kimi-VL-A3B-Thinking)|
 |[moonshotai/Kimi-VL-A3B-Thinking-2506](https://modelscope.cn/models/moonshotai/Kimi-VL-A3B-Thinking-2506)|kimi_vl|kimi_vl|transformers<4.49|&#x2718;|-|[moonshotai/Kimi-VL-A3B-Thinking-2506](https://huggingface.co/moonshotai/Kimi-VL-A3B-Thinking-2506)|
 |[Kwai-Keye/Keye-VL-8B-Preview](https://modelscope.cn/models/Kwai-Keye/Keye-VL-8B-Preview)|keye_vl|keye_vl|keye_vl_utils|&#x2718;|vision|[Kwai-Keye/Keye-VL-8B-Preview](https://huggingface.co/Kwai-Keye/Keye-VL-8B-Preview)|
+|[Kwai-Keye/Keye-VL-1_5-8B](https://modelscope.cn/models/Kwai-Keye/Keye-VL-1_5-8B)|keye_vl_1_5|keye_vl|keye_vl_utils>=1.5.2|&#x2718;|vision|[Kwai-Keye/Keye-VL-1_5-8B](https://huggingface.co/Kwai-Keye/Keye-VL-1_5-8B)|
 |[rednote-hilab/dots.ocr](https://modelscope.cn/models/rednote-hilab/dots.ocr)|dots_ocr|dots_ocr|transformers>=4.51.0|&#x2718;|-|[rednote-hilab/dots.ocr](https://huggingface.co/rednote-hilab/dots.ocr)|
 |[LLM-Research/Phi-3-vision-128k-instruct](https://modelscope.cn/models/LLM-Research/Phi-3-vision-128k-instruct)|phi3_vision|phi3_vision|transformers>=4.36|&#x2718;|vision|[microsoft/Phi-3-vision-128k-instruct](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct)|
 |[LLM-Research/Phi-3.5-vision-instruct](https://modelscope.cn/models/LLM-Research/Phi-3.5-vision-instruct)|phi3_vision|phi3_vision|transformers>=4.36|&#x2718;|vision|[microsoft/Phi-3.5-vision-instruct](https://huggingface.co/microsoft/Phi-3.5-vision-instruct)|
```

swift/llm/model/constant.py

Lines changed: 1 addition & 0 deletions

```diff
@@ -243,6 +243,7 @@ class MLLMModelType:
     step_audio = 'step_audio'
     kimi_vl = 'kimi_vl'
     keye_vl = 'keye_vl'
+    keye_vl_1_5 = 'keye_vl_1_5'
     dots_ocr = 'dots_ocr'
 
     phi3_vision = 'phi3_vision'
```

swift/llm/model/model/mllm.py

Lines changed: 16 additions & 0 deletions

```diff
@@ -205,6 +205,22 @@ def get_model_tokenizer_keye_vl(model_dir: str, *args, **kwargs):
         requires=['keye_vl_utils'],
     ))
 
+register_model(
+    ModelMeta(
+        MLLMModelType.keye_vl_1_5,
+        [
+            ModelGroup([
+                Model('Kwai-Keye/Keye-VL-1_5-8B', 'Kwai-Keye/Keye-VL-1_5-8B'),
+            ]),
+        ],
+        TemplateType.keye_vl,
+        get_model_tokenizer_keye_vl,
+        model_arch=ModelArch.keye_vl,
+        architectures=['KeyeVL1_5ForConditionalGeneration'],
+        tags=['vision'],
+        requires=['keye_vl_utils>=1.5.2'],
+    ))
+
 
 def get_model_tokenizer_dots_ocr(model_dir, *args, **kwargs):
     model_cls = get_class_from_dynamic_module('modeling_dots_vision.DotsVisionTransformer', model_dir)
```
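The `register_model`/`ModelMeta` call above follows a plain registry pattern: each model type maps to one metadata record. A minimal self-contained sketch (this simplified `ModelMeta` and `MODEL_MAPPING` are illustrative only; ms-swift's real classes carry many more fields):

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ModelMeta:
    model_type: str        # key used by --model_type on the CLI
    model_ids: List[str]   # hub IDs belonging to this type
    template: str          # chat template to pair with the model
    get_function: Callable  # loader building model + tokenizer/processor
    requires: List[str] = field(default_factory=list)  # pip requirements


MODEL_MAPPING: Dict[str, ModelMeta] = {}


def register_model(meta: ModelMeta) -> None:
    # Registering the same type twice would silently shadow a model; fail loudly.
    assert meta.model_type not in MODEL_MAPPING
    MODEL_MAPPING[meta.model_type] = meta


register_model(
    ModelMeta(
        'keye_vl_1_5', ['Kwai-Keye/Keye-VL-1_5-8B'], 'keye_vl',
        get_function=lambda model_dir: None,  # placeholder loader for the sketch
        requires=['keye_vl_utils>=1.5.2']))
```

Note how Keye-VL-1.5 reuses the existing `keye_vl` template and loader while getting its own model type, architecture string, and stricter `keye_vl_utils>=1.5.2` requirement.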

swift/llm/model/model/qwen.py

Lines changed: 16 additions & 10 deletions

```diff
@@ -666,19 +666,25 @@ def patch_qwen_vl_utils(vision_process):
             'fps_max_frames',
     ]:
         type_func = float if key == 'fps' else int
-        if not hasattr(vision_process, key.upper()):
+        default_value = getattr(vision_process, key.upper(), None)
+        if default_value is None:
+            # Skip keys not supported by the specific vision_process implementation
             continue
-        val = get_env_args(key, type_func, getattr(vision_process, key.upper()))
+        val = get_env_args(key, type_func, default_value)
         setattr(vision_process, key.upper(), val)
         res[key] = val
-    _read_video_decord = vision_process._read_video_decord
-
-    def _new_read_video_decord(ele: dict):
-        from swift.llm import load_file
-        ele['video'] = load_file(ele['video'])
-        return _read_video_decord(ele)
-
-    vision_process.VIDEO_READER_BACKENDS['decord'] = _new_read_video_decord
+    # Patch decord video reader if available
+    _read_video_decord = getattr(vision_process, '_read_video_decord', None)
+    if _read_video_decord is not None:
+
+        def _new_read_video_decord(ele: dict):
+            from swift.llm import load_file
+            ele['video'] = load_file(ele['video'])
+            return _read_video_decord(ele)
+
+        backends = getattr(vision_process, 'VIDEO_READER_BACKENDS', None)
+        if isinstance(backends, dict):
+            backends['decord'] = _new_read_video_decord
     vision_process._patch = True
     return res
 
```
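The core idea of this refactor is defensive attribute access: `hasattr`/direct access is replaced with `getattr(..., None)` so the same patcher works across `qwen_vl_utils` and `keye_vl_utils` versions that expose different constants and backends. Isolated as a small sketch (the `patch_defaults` helper is hypothetical, for illustration only):

```python
from types import SimpleNamespace


def patch_defaults(vision_process, keys):
    # Collect only the constants this vision_process module actually defines;
    # unsupported keys are skipped instead of raising AttributeError.
    res = {}
    for key in keys:
        default_value = getattr(vision_process, key.upper(), None)
        if default_value is None:
            continue  # key unsupported by this implementation
        res[key] = default_value
    return res


# Stand-in for a vision_process module that only defines IMAGE_FACTOR.
fake_module = SimpleNamespace(IMAGE_FACTOR=28)
collected = patch_defaults(fake_module, ['image_factor', 'fps'])
```

The same reasoning motivates guarding `_read_video_decord` and `VIDEO_READER_BACKENDS` above: an implementation without a decord backend simply skips the video-reader patch rather than crashing at import.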

tests/test_align/test_template/test_vision.py

Lines changed: 14 additions & 0 deletions

```diff
@@ -616,6 +616,20 @@ def test_keye_vl():
     assert response == response2
 
 
+def test_keye_vl_1_5():
+    pt_engine = PtEngine('Kwai-Keye/Keye-VL-1_5-8B')
+    messages = [{'role': 'user', 'content': '<image><image>What is the difference between the two images?'}]
+    images = [
+        'http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png',
+        'http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png'
+    ]
+    pt_engine.default_template.template_backend = 'swift'
+    response = _infer_model(pt_engine, messages=messages, images=images)
+    pt_engine.default_template.template_backend = 'jinja'
+    response2 = _infer_model(pt_engine, messages=messages, images=images)
+    assert response == response2
+
+
 def test_dots_ocr():
     # https://github.com/modelscope/ms-swift/issues/2122
     pt_engine = PtEngine('rednote-hilab/dots.ocr')
```
