Conversation

@LiTingyu1997 (Contributor) commented Dec 11, 2025

What does this PR do?

Add

  • models.transformers.Phi4

Usage

Phi4MultimodalForCausalLM

from transformers import AutoTokenizer
from mindone.transformers import Phi4MultimodalForCausalLM
import mindspore as ms
model = Phi4MultimodalForCausalLM.from_pretrained("TBA")
tokenizer = AutoTokenizer.from_pretrained("TBA")
prompt = "This is an example script ."
inputs = tokenizer(prompt, return_tensors="np")
# Generate
generate_ids = model.generate(ms.tensor(inputs.input_ids), max_length=30)
tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
        'This is an example script .\n Certainly! Below is a sample script that demonstrates a simple task, such as calculating the sum'

Problem

Since the original transformers repository code also cannot load weights for testing, this PR only aligns the code with the original repository; forward accuracy has been verified.

Fixes # (issue)

Adds # (feature)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline?
  • Did you make sure to update the documentation with your changes? E.g. record bug fixes or new features in What's New. Here are the documentation guidelines.
  • Did you build and run the code without any errors?
  • Did you report the running environment (NPU type / MindSpore version) and performance in the doc? (It is best to record this for data loading, model inference, or training tasks.)
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@xxx

@gemini-code-assist (Contributor)

Summary of Changes

Hello @LiTingyu1997, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the mindone.transformers library by incorporating the Phi4 multimodal model. The changes introduce the necessary components for handling both image and audio inputs alongside text, enabling advanced multimodal AI applications. This integration includes the model's architecture, specialized feature extraction for audio, fast image processing, and updates to the auto-configuration system, all supported by new comprehensive test suites.

Highlights

  • New Model Integration: Introduces the Phi4 multimodal model into the mindone.transformers library, including its core architecture, feature extractors, and processors.
  • Multimodal Capabilities: Adds support for both image and audio processing within the Phi4 model, enabling multimodal inputs for causal language modeling.
  • Auto-Configuration Updates: Integrates Phi4Multimodal configurations, models, image processors, and general processors into the automatic configuration and instantiation system.
  • Performance Benchmarks: Includes performance metrics for MusicgenForConditionalGeneration on MindSpore 2.6.0, showcasing weight load times and steps per second across different precisions (fp16, fp32, bf16).
  • Comprehensive Testing: Adds dedicated unit tests for the Phi4Multimodal feature extractor, image processor, and the main model to ensure correctness and stability.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces the Phi4Multimodal model. While the overall structure for adding the new model is in place, the implementation files (modeling_phi4_multimodal.py, feature_extraction_phi4_multimodal.py, image_processing_phi4_multimodal_fast.py) contain a significant amount of PyTorch-specific code that is not compatible with MindSpore. This includes using methods like .view(), .permute(), .contiguous(), .item(), and various in-place operations (*_()) which will cause AttributeErrors at runtime. The accompanying test files also incorrectly use PyTorch tensors and assertions, failing to validate the MindSpore implementation. The code requires a thorough review and porting to the MindSpore API, likely using mindspore.ops and mindspore.mint where appropriate, and removing PyTorch-specific patterns. I've added several critical comments to highlight these API incompatibilities.
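
For orientation, here is a rough mapping of the PyTorch idioms flagged below to their MindSpore counterparts (a sketch based on this review's suggestions; verify exact signatures against the MindSpore docs):

# torch pattern            ->  MindSpore counterpart (sketch)
# x.view(*shape)           ->  x.reshape(shape)
# x.permute(*dims)         ->  x.transpose(dims)  # dims as a tuple
# x.contiguous()           ->  drop it; not needed in MindSpore
# x.clone()                ->  x.copy()
# x.masked_fill_(mask, v)  ->  mindspore.ops.masked_fill(x, mask, v)
# x.item()                 ->  int(x.asnumpy()) on a 0-d tensor
# x.to(device, dtype)      ->  x.astype(dtype); device comes from ms.set_context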

Comment on lines 233 to 250
frames = frames.clone()
# concerned batch indices
to_mask_batch_idxs = mindspore.mint.arange(batch_size)[audio_lengths != audio_lengths.max()]
if to_mask_batch_idxs.numel() > 0:
    batch_idxs_down = (audio_lengths[to_mask_batch_idxs] - self.win_length) // self.hop_length + 1
    batch_idxs_up = (audio_lengths[to_mask_batch_idxs] // self.hop_length) - 1
    offset_idx = batch_idxs_down.min()
    max_idx = batch_idxs_up.max()

    mask = mindspore.mint.arange(
        max_idx - offset_idx,
    ).expand(to_mask_batch_idxs.shape[0], -1)
    mask = ((batch_idxs_down - offset_idx).unsqueeze(1) <= mask) & (
        mask < (batch_idxs_up - offset_idx).unsqueeze(1)
    )
    mask = mask.unsqueeze(-1).expand(-1, -1, self.win_length)
    masked_frames = frames[to_mask_batch_idxs, offset_idx:max_idx].masked_fill_(mask, 0)
    frames[to_mask_batch_idxs, offset_idx:max_idx] = masked_frames

critical

This block of code uses PyTorch-specific tensor methods like .clone() and .masked_fill_() which are not available on standard mindspore.Tensor objects. This will cause an AttributeError at runtime.

To fix this, you should use the equivalent functions from mindspore.ops or mindspore.mint and adapt the logic to be compatible with MindSpore's execution model. For example, tensor.clone() can be replaced with tensor.copy(), and masked_fill_() can be replaced with the functional version mindspore.mint.masked_fill.
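
A minimal, self-contained sketch of that substitution (the shapes are dummies; the real frames and mask come from the quoted block):

import numpy as np
import mindspore as ms
from mindspore import ops

# Stand-ins for the real tensors in the quoted block (shapes are illustrative).
frames = ms.Tensor(np.ones((2, 5, 4), np.float32))
mask = ms.Tensor(np.arange(5) >= 3).reshape(5, 1)  # boolean mask over the frame axis

frames = frames.copy()  # out-of-place substitute for torch's .clone()
# ops.masked_fill is the functional counterpart of the in-place .masked_fill_()
frames[0] = ops.masked_fill(frames[0], mask, 0.0)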

spec_power = spec**2

# apply triangular mel filter bank
mel_filters = mindspore.Tensor.from_numpy(self.mel_filters).to(device, mindspore.float32)

critical

The .to(device, dtype) method signature is not valid for a mindspore.Tensor. Device placement in MindSpore is typically handled by the context (ms.set_context) rather than on a per-tensor basis like in PyTorch. The device parameter seems to be a leftover from a PyTorch implementation and is not used correctly here. This will raise an AttributeError.

Suggested change
mel_filters = mindspore.Tensor.from_numpy(self.mel_filters).to(device, mindspore.float32)
mel_filters = mindspore.Tensor(self.mel_filters, dtype=mindspore.float32)

Comment on lines 199 to 205
hd_image_reshape = hd_image_reshape.permute(0, 2, 4, 1, 3, 5)
hd_image_reshape = hd_image_reshape.reshape(-1, 3, size.height, size.width).contiguous()

critical

The methods .permute() and .contiguous() are from the PyTorch API and are not available on mindspore.Tensor objects. This will cause an AttributeError.

To fix this, you can use mindspore.Tensor.transpose with a permutation tuple for .permute(), and remove .contiguous() as it's not necessary in MindSpore.

Suggested change
hd_image_reshape = hd_image_reshape.permute(0, 2, 4, 1, 3, 5)
hd_image_reshape = hd_image_reshape.reshape(-1, 3, size.height, size.width).contiguous()
hd_image_reshape = hd_image_reshape.transpose((0, 2, 4, 1, 3, 5))
hd_image_reshape = hd_image_reshape.reshape(-1, 3, size.height, size.width)

Comment on lines 224 to 230
+ int(downsample_attention_mask.sum().item())
+ int(downsample_attention_mask[:, 0].sum().item())

critical

The .item() method is not available on mindspore.Tensor objects. To get a scalar value from a 0-dimensional tensor in MindSpore, you should use .asnumpy(). This will cause an AttributeError at runtime.

Suggested change
+ int(downsample_attention_mask.sum().item())
+ int(downsample_attention_mask[:, 0].sum().item())
+ int(downsample_attention_mask.sum().asnumpy())
+ int(downsample_attention_mask[:, 0].sum().asnumpy())

Comment on lines +122 to +125
query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
key_states = self.k_proj(hidden_states).view(hidden_shape).transpose(1, 2)
value_states = self.v_proj(hidden_states).view(hidden_shape).transpose(1, 2)

critical

The .view() method is a PyTorch API and is not available on mindspore.Tensor objects. The MindSpore equivalent is .reshape(). This will cause an AttributeError.

Suggested change
query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
key_states = self.k_proj(hidden_states).view(hidden_shape).transpose(1, 2)
value_states = self.v_proj(hidden_states).view(hidden_shape).transpose(1, 2)
query_states = self.q_proj(hidden_states).reshape(hidden_shape).transpose(1, 2)
key_states = self.k_proj(hidden_states).reshape(hidden_shape).transpose(1, 2)
value_states = self.v_proj(hidden_states).reshape(hidden_shape).transpose(1, 2)

bucket_coords_w = mindspore.ops.bucketize(fractional_coords_w, boundaries, right=True)

pos_ids = (bucket_coords_h[:, None] * self.num_patches_per_side + bucket_coords_w).flatten()
position_ids[batch_idx][p_attn_mask.view(-1).cpu()] = pos_ids

critical

The methods .view() and .cpu() are from the PyTorch API and are not available on mindspore.Tensor objects. This will cause an AttributeError.

  • Use .reshape(-1) instead of .view(-1).
  • The .cpu() call should be removed. Device placement in MindSpore is handled differently, typically through context settings.
Suggested change
position_ids[batch_idx][p_attn_mask.view(-1).cpu()] = pos_ids
position_ids[batch_idx][p_attn_mask.reshape(-1)] = pos_ids


# Temporarily disable autocast to avoid issue on bf16 tensors
# Ref: https://github.com/pytorch/pytorch/issues/132715
image_embeds = inputs_embeds.index_put(indices=positions_tuple, values=merged_img_set_tensor, accumulate=False)

critical

The .index_put() method is a PyTorch API for in-place indexed assignment. This method does not exist for mindspore.Tensor objects and will cause an AttributeError. You might need to use mindspore.ops.tensor_scatter_elements or another approach to achieve a similar result in a way that is compatible with MindSpore.
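
For example, a minimal sketch using mindspore.ops.tensor_scatter_update (the shapes and index values below are hypothetical):

import numpy as np
import mindspore as ms
from mindspore import ops

inputs_embeds = ms.Tensor(np.zeros((2, 6, 4), np.float32))      # (batch, seq, hidden)
merged_img_set_tensor = ms.Tensor(np.ones((3, 4), np.float32))  # one row per target slot
batch_idx = ms.Tensor(np.array([0, 0, 1]), ms.int32)
seq_idx = ms.Tensor(np.array([1, 2, 0]), ms.int32)

# Stack the per-axis indices into an (N, 2) coordinate tensor, then scatter the
# image embeddings into those (batch, seq) slots -- the accumulate=False case.
indices = ops.stack((batch_idx, seq_idx), axis=-1)
image_embeds = ops.tensor_scatter_update(inputs_embeds, indices, merged_img_set_tensor)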

Comment on lines 156 to 189
def test_call(self):
    # Tests that all call wrap to encode_plus and batch_encode_plus
    feature_extractor = self.feature_extraction_class(**self.feat_extract_tester.prepare_feat_extract_dict())
    # create three inputs of length 800, 1000, and 1200
    speech_inputs = [floats_list((1, x))[0] for x in range(800, 1400, 200)]
    np_speech_inputs = [np.asarray(speech_input) for speech_input in speech_inputs]
    pt_speech_inputs = [torch.tensor(speech_input) for speech_input in speech_inputs]

    # Test feature size
    input_features = feature_extractor(np_speech_inputs, return_tensors="np").audio_input_features
    max_audio_len = (1200 - feature_extractor.win_length) // feature_extractor.hop_length + 1
    self.assertTrue(input_features.ndim == 3)
    self.assertTrue(input_features.shape[-1] == feature_extractor.feature_size)
    self.assertTrue(input_features.shape[-2] == max_audio_len)

    # Test not batched input
    encoded_sequences_1 = feature_extractor(pt_speech_inputs[0], return_tensors="np").audio_input_features
    encoded_sequences_2 = feature_extractor(np_speech_inputs[0], return_tensors="np").audio_input_features
    self.assertTrue(np.allclose(encoded_sequences_1, encoded_sequences_2, atol=1e-3))

    # Test batched
    encoded_sequences_1 = feature_extractor(pt_speech_inputs, return_tensors="np").audio_input_features
    encoded_sequences_2 = feature_extractor(np_speech_inputs, return_tensors="np").audio_input_features
    for enc_seq_1, enc_seq_2 in zip(encoded_sequences_1, encoded_sequences_2):
        self.assertTrue(np.allclose(enc_seq_1, enc_seq_2, atol=1e-3))

    # Test 2-D numpy arrays are batched.
    speech_inputs = [floats_list((1, x))[0] for x in (800, 800, 800)]
    np_speech_inputs = np.asarray(speech_inputs)
    pt_speech_inputs = torch.tensor(speech_inputs)
    encoded_sequences_1 = feature_extractor(pt_speech_inputs, return_tensors="np").audio_input_features
    encoded_sequences_2 = feature_extractor(np_speech_inputs, return_tensors="np").audio_input_features
    for enc_seq_1, enc_seq_2 in zip(encoded_sequences_1, encoded_sequences_2):
        self.assertTrue(np.allclose(enc_seq_1, enc_seq_2, atol=1e-3))

critical

This test case appears to be incorrectly using PyTorch tensors (pt_speech_inputs) as input to the MindSpore feature extractor. The feature extractor expects MindSpore tensors or NumPy arrays. This will lead to incorrect test behavior or errors. The tests should be adapted to use MindSpore-native data types to properly validate the implementation.
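
One possible MindSpore-native variant of the unbatched check (a sketch; feature_extractor and floats_list come from the quoted test):

import numpy as np
import mindspore as ms

speech_inputs = [floats_list((1, x))[0] for x in range(800, 1400, 200)]
np_speech_inputs = [np.asarray(s) for s in speech_inputs]
ms_speech_inputs = [ms.tensor(np.asarray(s)) for s in speech_inputs]

# Compare the MindSpore-tensor path against the NumPy path, as the torch
# variant did, with the same tolerance.
enc_1 = feature_extractor(ms_speech_inputs[0], return_tensors="np").audio_input_features
enc_2 = feature_extractor(np_speech_inputs[0], return_tensors="np").audio_input_features
assert np.allclose(enc_1, enc_2, atol=1e-3)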

Comment on lines 153 to 185
def test_cast_dtype_device(self):
    for image_processing_class in self.image_processor_list:
        if self.test_cast_dtype is not None:
            # Initialize image_processor
            image_processor = image_processing_class(**self.image_processor_dict)

            # create random PyTorch tensors
            image_inputs = self.image_processor_tester.prepare_image_inputs(equal_resolution=False, torchify=True)

            encoding = image_processor(image_inputs, return_tensors="pt")
            # for layoutLM compatibility
            self.assertEqual(encoding.image_pixel_values.device, torch.device("cpu"))
            self.assertEqual(encoding.image_pixel_values.dtype, torch.float32)

            encoding = image_processor(image_inputs, return_tensors="pt").to(torch.float16)
            self.assertEqual(encoding.image_pixel_values.device, torch.device("cpu"))
            self.assertEqual(encoding.image_pixel_values.dtype, torch.float16)

            encoding = image_processor(image_inputs, return_tensors="pt").to("cpu", torch.bfloat16)
            self.assertEqual(encoding.image_pixel_values.device, torch.device("cpu"))
            self.assertEqual(encoding.image_pixel_values.dtype, torch.bfloat16)

            with self.assertRaises(TypeError):
                _ = image_processor(image_inputs, return_tensors="pt").to(torch.bfloat16, "cpu")

            # Try with text + image feature
            encoding = image_processor(image_inputs, return_tensors="pt")
            encoding.update({"input_ids": torch.LongTensor([[1, 2, 3], [4, 5, 6]])})
            encoding = encoding.to(torch.float16)

            self.assertEqual(encoding.image_pixel_values.device, torch.device("cpu"))
            self.assertEqual(encoding.image_pixel_values.dtype, torch.float16)
            self.assertEqual(encoding.input_ids.dtype, torch.long)

critical

This test case is using PyTorch tensors and checks for PyTorch-specific attributes like .device and .dtype == torch.float16. For a MindSpore library, tests should use MindSpore tensors and MindSpore-specific assertions to correctly validate the implementation. This test needs to be rewritten for MindSpore.
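
A rough MindSpore-flavored rewrite of the core assertions (a sketch only; it assumes mindone's BatchFeature supports return_tensors="ms" and dtype casting via .to(), which should be verified against the actual API):

import mindspore as ms

# image_processor and image_inputs come from the surrounding test setup.
encoding = image_processor(image_inputs, return_tensors="ms")
assert encoding.image_pixel_values.dtype == ms.float32

encoding = image_processor(image_inputs, return_tensors="ms").to(ms.float16)
assert encoding.image_pixel_values.dtype == ms.float16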

Comment on lines 15 to 18
from .feature_extraction_phi4_multimodal import *
from .image_processing_phi4_multimodal_fast import *
from .modeling_phi4_multimodal import *
from .processing_phi4_multimodal import *

medium

Using import * is generally discouraged as it can pollute the namespace and make it unclear which names are being imported. It's better to explicitly import the required classes. This also helps static analysis tools and improves code readability.

Suggested change
from .feature_extraction_phi4_multimodal import *
from .image_processing_phi4_multimodal_fast import *
from .modeling_phi4_multimodal import *
from .processing_phi4_multimodal import *
from .feature_extraction_phi4_multimodal import Phi4MultimodalFeatureExtractor
from .image_processing_phi4_multimodal_fast import Phi4MultimodalImageProcessorFast
from .modeling_phi4_multimodal import (
    Phi4MultimodalAudioModel,
    Phi4MultimodalAudioPreTrainedModel,
    Phi4MultimodalForCausalLM,
    Phi4MultimodalModel,
    Phi4MultimodalPreTrainedModel,
    Phi4MultimodalVisionModel,
    Phi4MultimodalVisionPreTrainedModel,
)
from .processing_phi4_multimodal import Phi4MultimodalProcessor
