
Conversation

@openvino-dev-samples (Contributor) commented Aug 7, 2025

Depends on PR

As the LLM of minicpmv4 switched to Llama:
https://huggingface.co/openbmb/MiniCPM-V-4/blob/main/modeling_minicpmv.py#L26

What does this PR do?

Conversion command line for openbmb/MiniCPM-V-4 or MiniCPM-V-4_5:

optimum-cli export openvino --model openbmb/MiniCPM-V-4_5 MiniCPM-V-4_5-ov --trust-remote-code --weight-format fp16 --task image-text-to-text

Inference of MiniCPM-V-4_5 using the OpenVINO backend:

from optimum.intel.openvino import OVModelForVisualCausalLM
from transformers import AutoProcessor
from PIL import Image
import requests

model_id = "openbmb/MiniCPM-V-4_5"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

prompt = "<|im_start|>user\n(<image>./</image>)\nWhat is unusual on this image?<|im_end|>\n<|im_start|>assistant\n"
image = Image.open(requests.get("https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/d5fbbd1a-d484-415c-88cb-9986625b7b11", stream=True).raw).convert('RGB')

model = OVModelForVisualCausalLM.from_pretrained("MiniCPM-V-4_5-ov", trust_remote_code=True)  # directory produced by the export command above

inputs = processor([prompt], [image], return_tensors="pt")

result = model.generate(**inputs, max_new_tokens=20)

print(processor.tokenizer.batch_decode(result[:, inputs["input_ids"].shape[1]:]))

Before submitting

  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@openvino-dev-samples openvino-dev-samples changed the title add support for minicpm4v [OpenVINO]add support for minicpm4v Aug 7, 2025
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@openvino-dev-samples (Contributor Author) commented Aug 8, 2025

@IlyasMoutawwakil could you help take a look?

@IlyasMoutawwakil (Member)

Thanks for the fix! Let's create a tiny random model with Llama as the decoder to test this 🤗 Tell me if you need help with that!

@openvino-dev-samples (Contributor Author)

Thanks for the fix! Let's create a tiny random model with Llama as the decoder to test this 🤗 Tell me if you need help with that!

But I guess we need to merge this PR first? Otherwise the test case will not work.

@IlyasMoutawwakil (Member)

@openvino-dev-samples no need to merge it now, you can simply pin that PR in setup.py so that the tests run with it 🤗
We will merge both PRs once everything works together.
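For reference, a minimal sketch of what pinning a dependency PR in setup.py can look like; the repository and PR number below are placeholders, not the actual dependency mentioned above:

# Hypothetical pin in setup.py: pip can install directly from a GitHub
# pull-request ref (placeholder repository and PR number).
INSTALL_REQUIRE = [
    "transformers @ git+https://github.com/huggingface/transformers.git@refs/pull/12345/head",
]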

@openvino-dev-samples (Contributor Author)

@openvino-dev-samples no need to merge it now, you can simply pin that PR in setup.py so that the tests run with it 🤗 We will merge both PRs once everything works together.

Hi, since minicpmv4 and minicpmv share the same model type but use different LLMs, is it possible to add both of them in utils_tests.py?

@IlyasMoutawwakil (Member)

@openvino-dev-samples yes, you can name it minicpmv4 in utils_tests.py

@IlyasMoutawwakil (Member) commented Aug 18, 2025

Hi @openvino-dev-samples, it would be faster if you made sure the minicpmv4 tests pass locally; the CI is slow and shouldn't be used as a testing mechanism, only for validation once local tests are already passing.

@openvino-dev-samples (Contributor Author)

Hi @openvino-dev-samples, it would be faster if you made sure the minicpmv4 tests pass locally; the CI is slow and shouldn't be used as a testing mechanism, only for validation once local tests are already passing.

Sorry for that, and I fully understand, but I always hit connection issues in local test runs, e.g.

huggingface_hub.errors.HfHubHTTPError: 429 Client Error: Too Many Requests for url: https://huggingface.co/api/models/katuni4ka/tiny-random-qwen2.5-vl/tree/main?recursive=True&expand=False

@IlyasMoutawwakil (Member)

but I always hit connection issues in local test runs

you can target the minicpmv tests specifically to avoid this issue with pytest -k "minicpmv"

@openvino-dev-samples openvino-dev-samples changed the title [OpenVINO]add support for minicpm4v [OpenVINO]add support for minicpmv4/4_5 Aug 27, 2025
"minicpm3": "katuni4ka/tiny-random-minicpm3",
"minicpmv": "katuni4ka/tiny-random-minicpmv-2_6",
"minicpmv4": "snake7gun/minicpm-v-4-tiny",
"minicpmv4_5": "snake7gun/tiny-minicpmv-4_5",
Collaborator

158M model size; it makes sense to try to reduce the size.

@rkazants (Collaborator) left a comment

Please add inference tests that exercise the generate() method and compare the results with transformers.

@openvino-dev-samples (Contributor Author)

@IlyasMoutawwakil could you help trigger the CI? Thanks.

if isinstance(behavior, str) and not isinstance(behavior, MiniCPMVConfigBehavior):
    behavior = MiniCPMVConfigBehavior(behavior)

model_mapping = {2.6: "llama", 4.0: "qwen2", 4.5: "qwen3"}
Member
should use str for versions

Contributor Author
May I ask why? The version in the model's config is a number:
https://huggingface.co/openbmb/MiniCPM-V-4_5/blob/main/config.json#L3

@IlyasMoutawwakil (Member) Oct 24, 2025

Ah okay, I see! Thanks for the clarification.
(It's generally a bad idea to use numbers for versions: 4.0 becomes 4, and 4.10 and 4.1 are the same version 😅)
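A quick illustration of that pitfall (standard Python float behavior, not code from this PR):

# Distinct version strings collapse once parsed as floats.
assert 4.10 == 4.1        # the trailing zero is lost, so both are one version
assert "4.10" != "4.1"    # the string forms stay distinct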

if isinstance(behavior, str) and not isinstance(behavior, MiniCPMVConfigBehavior):
    behavior = MiniCPMVConfigBehavior(behavior)

model_mapping = {2.6: "llama", 4.0: "qwen2", 4.5: "qwen3"}
Collaborator

I think it is a bad idea to make decisions about the architecture based on the model version in general.
You should instead inspect the model object and use isinstance checks on its inner objects to make the decision.

Contributor Author

Yes, it's a better approach in this case, but I don't know if we can access the modeling file at this stage.
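For illustration, a hypothetical sketch of the isinstance-based detection the reviewer suggests, assuming the loaded remote-code model exposes its language model as model.llm (the attribute name, and whether the model object is even available at this stage, are both assumptions):

# Qwen3ForCausalLM requires transformers >= 4.51.
from transformers import LlamaForCausalLM, Qwen2ForCausalLM, Qwen3ForCausalLM

def detect_decoder_type(model) -> str:
    # Inspect the inner LLM class instead of the numeric "version" field.
    if isinstance(model.llm, Qwen3ForCausalLM):
        return "qwen3"
    if isinstance(model.llm, Qwen2ForCausalLM):
        return "qwen2"
    if isinstance(model.llm, LlamaForCausalLM):
        return "llama"
    raise ValueError(f"Unsupported decoder: {type(model.llm).__name__}")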

fix CI
@nikita-savelyevv (Collaborator)

@openvino-dev-samples Please fix the failing tests

@openvino-dev-samples (Contributor Author)

@openvino-dev-samples Please fix the failing tests

done

@rkazants (Collaborator) left a comment

Please do an additional patch for temporal_ids as we discussed. Without this, the functionality is limited.

@nikita-savelyevv (Collaborator) left a comment

Please also update the PR description according to this comment #1491 (comment)

max_size = self.config.vision_config.image_size // self.config.vision_config.patch_size
self._pos_embeds = torch.from_numpy(self._get_2d_sincos_pos_embed(self.embed_dim, max_size)).float()
self.max_size = (max_size, max_size)
self.max_temporal_size = 72000
Collaborator

Why 72000? Should this value be loaded from the config?
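A minimal sketch of what loading it from the config could look like, assuming a max_temporal_size field existed in the vision config (the attribute name is hypothetical):

# Hypothetical: prefer a config-provided value, keep 72000 only as a fallback.
self.max_temporal_size = getattr(self.config.vision_config, "max_temporal_size", 72000)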


Comment on lines +1945 to +1949
all_temporal_ids = None
if temporal_ids is not None:
    all_temporal_ids = []
    for t in temporal_ids:
        all_temporal_ids.extend(t)
Collaborator

Suggested change
-all_temporal_ids = None
-if temporal_ids is not None:
-    all_temporal_ids = []
-    for t in temporal_ids:
-        all_temporal_ids.extend(t)
+all_temporal_ids = [t for seq_t in temporal_ids for t in seq_t] if temporal_ids is not None else None
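A quick check that the suggested one-liner matches the loop, using the example shape from the code below ([[-1], [-1], [2, 6, 9]]):

# Both forms flatten exactly one level of nesting.
temporal_ids = [[-1], [-1], [2, 6, 9]]
assert [t for seq_t in temporal_ids for t in seq_t] == [-1, -1, 2, 6, 9]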


Comment on lines +2015 to +2016
# example: [[-1], [-1], [2, 6, 9]]
temporal_ids_flatten = list(chain.from_iterable(temporal_ids))
Collaborator

Do we actually need an additional flattening pass here? As I understand it, all_temporal_ids is already prepared flattened inside get_vision_embeddings(). If it's not needed there, I'd remove the flattening logic from get_vision_embeddings() and keep it only here.

if max_temporal_size > -1:
    temporal_pos_emb = True
    if max_temporal_size > self.max_temporal_size:
        self._adjust_temporal_pos_cache(max_temporal_size, "cpu")
Collaborator

I don't see a definition of self._adjust_temporal_pos_cache(). Since the tests pass, this means the code does not reach this point in any of the existing tests. Please clarify this. Ideally, every scenario should be tested.

Contributor Author

Updated to align with the original model.

if temporal_ids_flatten[i] == -1:
    pos_embed_temporal.append(torch.zeros(self.embed_dim, dtype=torch.float32, device="cpu"))
else:
    pos_embed_temporal.append(self.temporal_pos_embed[temporal_ids_flatten[i]].to(torch.float32))  # D
Collaborator

Where is self.temporal_pos_embed defined?

Contributor Author

Fixed


-def resampling(self, x, tgt_sizes):
+def resampling(self, x, tgt_sizes, temporal_ids=None):
+    from itertools import chain
Collaborator

Should be imported at the top of the file.

Contributor Author

This import is used by minicpmv only, so I think it can be left here, e.g. https://github.com/huggingface/optimum-intel/blob/main/optimum/intel/openvino/modeling_visual_language.py#L1229


self._adjust_pos_cache(tgt_sizes)

temporal_pos_emb = False
Collaborator

For me these names are a bit confusing: temporal_pos_emb, pos_embed_temporal, self.temporal_pos_embed, temporal_embed. I would suggest renaming these variables to something more meaningful, for example use_temporal_pos_embed instead of temporal_pos_emb.

Contributor Author

I only created temporal_embed; the others come from the original modeling file directly.

    1, 0, 2
)  # BLD => L * B * D
res = torch.from_numpy(self.resampler(image_feature=x, pos_embed=pos_embed, key_padding_mask=key_padding_mask))
if temporal_pos_emb:
Collaborator

Suggested change
-if temporal_pos_emb:
+if len(pos_embed_temporal) > 0:


Comment on lines 767 to +774
if is_transformers_version("<", "4.49"):
-    expected = {"llama4", "qwen2_5_vl", "phi4mm"}
+    expected = {"llama4", "qwen2_5_vl", "phi4mm", "minicpmv4", "minicpmv4_5"}
elif is_transformers_version("<", "4.51"):
    expected = {"llama4", "phi4mm"}
elif is_transformers_version("<", "4.52"):
    expected = set()
else:
-    expected = {"llava-qwen2", "phi3_v", "phi4mm", "minicpmo"}
+    expected = {"llava-qwen2", "phi3_v", "phi4mm", "minicpmo", "minicpmv4", "minicpmv4_5"}
Collaborator

From this, I get the understanding that minicpmv4/minicpmv4_5 are supported for transformers 4.49 .. 4.51. Is this correct? If so, please set MIN_TRANSFORMERS_VERSION = "4.49.0" and MAX_TRANSFORMERS_VERSION = "4.51.3" for MiniCPMVOpenVINOConfig.

Contributor Author

I don't see any limitation on these 2 models. They can share the same transformers version range as minicpm-v-2.6.
