Conversation

@LiTingyu1997 (Contributor) commented Dec 11, 2025

What does this PR do?

Add

  • models.transformers.Phi4

Usage

Phi4MultimodalForCausalLM

from transformers import AutoTokenizer
from mindone.transformers import Phi4MultimodalForCausalLM
import mindspore as ms
model = Phi4MultimodalForCausalLM.from_pretrained("TBA")
tokenizer = AutoTokenizer.from_pretrained("TBA")
prompt = "This is an example script ."
inputs = tokenizer(prompt, return_tensors="np")
# Generate
generate_ids = model.generate(ms.tensor(inputs.input_ids), max_length=30)
tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
        'This is an example script .\n Certainly! Below is a sample script that demonstrates a simple task, such as calculating the sum'

Problem

Since the original transformers repository code also cannot load weights for testing, this PR only aligns the code with the original repository; forward accuracy has been verified.

Fixes # (issue)

Adds # (feature)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline?
  • Did you make sure to update the documentation with your changes? E.g. record bug fixes or new features in What's New. Here are the documentation guidelines.
  • Did you build and run the code without any errors?
  • Did you report the running environment (NPU type / MindSpore version) and performance in the doc? (It is best to record this for data loading, model inference, or training tasks.)
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@xxx

@gemini-code-assist (Contributor)

Summary of Changes

Hello @LiTingyu1997, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the mindone.transformers library by incorporating the Phi4 multimodal model. The changes introduce the necessary components for handling both image and audio inputs alongside text, enabling advanced multimodal AI applications. This integration includes the model's architecture, specialized feature extraction for audio, fast image processing, and updates to the auto-configuration system, all supported by new comprehensive test suites.

Highlights

  • New Model Integration: Introduces the Phi4 multimodal model into the mindone.transformers library, including its core architecture, feature extractors, and processors.
  • Multimodal Capabilities: Adds support for both image and audio processing within the Phi4 model, enabling multimodal inputs for causal language modeling.
  • Auto-Configuration Updates: Integrates Phi4Multimodal configurations, models, image processors, and general processors into the automatic configuration and instantiation system.
  • Performance Benchmarks: Includes performance metrics for MusicgenForConditionalGeneration on MindSpore 2.6.0, showcasing weight load times and steps per second across different precisions (fp16, fp32, bf16).
  • Comprehensive Testing: Adds dedicated unit tests for the Phi4Multimodal feature extractor, image processor, and the main model to ensure correctness and stability.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces the Phi4Multimodal model. While the overall structure for adding the new model is in place, the implementation files (modeling_phi4_multimodal.py, feature_extraction_phi4_multimodal.py, image_processing_phi4_multimodal_fast.py) contain a significant amount of PyTorch-specific code that is not compatible with MindSpore. This includes using methods like .view(), .permute(), .contiguous(), .item(), and various in-place operations (*_()) which will cause AttributeErrors at runtime. The accompanying test files also incorrectly use PyTorch tensors and assertions, failing to validate the MindSpore implementation. The code requires a thorough review and porting to the MindSpore API, likely using mindspore.ops and mindspore.mint where appropriate, and removing PyTorch-specific patterns. I've added several critical comments to highlight these API incompatibilities.
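
For orientation, here is a rough mapping of the PyTorch idioms flagged below to their MindSpore counterparts (a sketch based on this review's suggestions; verify exact signatures against the MindSpore docs):

# torch pattern            ->  MindSpore counterpart (sketch)
# x.view(*shape)           ->  x.reshape(shape)
# x.permute(*dims)         ->  x.transpose(dims)  # dims as a tuple
# x.contiguous()           ->  drop it; not needed in MindSpore
# x.clone()                ->  x.copy()
# x.masked_fill_(mask, v)  ->  mindspore.ops.masked_fill(x, mask, v)
# x.item()                 ->  int(x.asnumpy()) on a 0-d tensor
# x.to(device, dtype)      ->  x.astype(dtype); device comes from ms.set_context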

Comment on lines 233 to 250
frames = frames.clone()
# concerned batch indices
to_mask_batch_idxs = mindspore.mint.arange(batch_size)[audio_lengths != audio_lengths.max()]
if to_mask_batch_idxs.numel() > 0:
    batch_idxs_down = (audio_lengths[to_mask_batch_idxs] - self.win_length) // self.hop_length + 1
    batch_idxs_up = (audio_lengths[to_mask_batch_idxs] // self.hop_length) - 1
    offset_idx = batch_idxs_down.min()
    max_idx = batch_idxs_up.max()

    mask = mindspore.mint.arange(
        max_idx - offset_idx,
    ).expand(to_mask_batch_idxs.shape[0], -1)
    mask = ((batch_idxs_down - offset_idx).unsqueeze(1) <= mask) & (
        mask < (batch_idxs_up - offset_idx).unsqueeze(1)
    )
    mask = mask.unsqueeze(-1).expand(-1, -1, self.win_length)
    masked_frames = frames[to_mask_batch_idxs, offset_idx:max_idx].masked_fill_(mask, 0)
    frames[to_mask_batch_idxs, offset_idx:max_idx] = masked_frames

critical

This block of code uses PyTorch-specific tensor methods like .clone() and .masked_fill_() which are not available on standard mindspore.Tensor objects. This will cause an AttributeError at runtime.

To fix this, you should use the equivalent functions from mindspore.ops or mindspore.mint and adapt the logic to be compatible with MindSpore's execution model. For example, tensor.clone() can be replaced with tensor.copy(), and masked_fill_() can be replaced with the functional version mindspore.mint.masked_fill.
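
A minimal, self-contained sketch of that substitution (the shapes are dummies; the real frames and mask come from the quoted block):

import numpy as np
import mindspore as ms
from mindspore import ops

# Stand-ins for the real tensors in the quoted block (shapes are illustrative).
frames = ms.Tensor(np.ones((2, 5, 4), np.float32))
mask = ms.Tensor(np.arange(5) >= 3).reshape(5, 1)  # boolean mask over the frame axis

frames = frames.copy()  # out-of-place substitute for torch's .clone()
# ops.masked_fill is the functional counterpart of the in-place .masked_fill_()
frames[0] = ops.masked_fill(frames[0], mask, 0.0)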

spec_power = spec**2

# apply triangular mel filter bank
mel_filters = mindspore.Tensor.from_numpy(self.mel_filters).to(device, mindspore.float32)

critical

The .to(device, dtype) method signature is not valid for a mindspore.Tensor. Device placement in MindSpore is typically handled by the context (ms.set_context) rather than on a per-tensor basis like in PyTorch. The device parameter seems to be a leftover from a PyTorch implementation and is not used correctly here. This will raise an AttributeError.

Suggested change
mel_filters = mindspore.Tensor.from_numpy(self.mel_filters).to(device, mindspore.float32)
mel_filters = mindspore.Tensor(self.mel_filters, dtype=mindspore.float32)

Comment on lines 199 to 205
hd_image_reshape = hd_image_reshape.permute(0, 2, 4, 1, 3, 5)
hd_image_reshape = hd_image_reshape.reshape(-1, 3, size.height, size.width).contiguous()

critical

The methods .permute() and .contiguous() are from the PyTorch API and are not available on mindspore.Tensor objects. This will cause an AttributeError.

To fix this, you can use mindspore.Tensor.transpose with a permutation tuple for .permute(), and remove .contiguous() as it's not necessary in MindSpore.

Suggested change
hd_image_reshape = hd_image_reshape.permute(0, 2, 4, 1, 3, 5)
hd_image_reshape = hd_image_reshape.reshape(-1, 3, size.height, size.width).contiguous()
hd_image_reshape = hd_image_reshape.transpose((0, 2, 4, 1, 3, 5))
hd_image_reshape = hd_image_reshape.reshape(-1, 3, size.height, size.width)

Comment on lines 224 to 230
+ int(downsample_attention_mask.sum().item())
+ int(downsample_attention_mask[:, 0].sum().item())

critical

The .item() method is not available on mindspore.Tensor objects. To get a scalar value from a 0-dimensional tensor in MindSpore, you should use .asnumpy(). This will cause an AttributeError at runtime.

Suggested change
+ int(downsample_attention_mask.sum().item())
+ int(downsample_attention_mask[:, 0].sum().item())
+ int(downsample_attention_mask.sum().asnumpy())
+ int(downsample_attention_mask[:, 0].sum().asnumpy())

Comment on lines +122 to +125
query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
key_states = self.k_proj(hidden_states).view(hidden_shape).transpose(1, 2)
value_states = self.v_proj(hidden_states).view(hidden_shape).transpose(1, 2)

critical

The .view() method is a PyTorch API and is not available on mindspore.Tensor objects. The MindSpore equivalent is .reshape(). This will cause an AttributeError.

Suggested change
query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
key_states = self.k_proj(hidden_states).view(hidden_shape).transpose(1, 2)
value_states = self.v_proj(hidden_states).view(hidden_shape).transpose(1, 2)
query_states = self.q_proj(hidden_states).reshape(hidden_shape).transpose(1, 2)
key_states = self.k_proj(hidden_states).reshape(hidden_shape).transpose(1, 2)
value_states = self.v_proj(hidden_states).reshape(hidden_shape).transpose(1, 2)

bucket_coords_w = mindspore.ops.bucketize(fractional_coords_w, boundaries, right=True)

pos_ids = (bucket_coords_h[:, None] * self.num_patches_per_side + bucket_coords_w).flatten()
position_ids[batch_idx][p_attn_mask.view(-1).cpu()] = pos_ids

critical

The methods .view() and .cpu() are from the PyTorch API and are not available on mindspore.Tensor objects. This will cause an AttributeError.

  • Use .reshape(-1) instead of .view(-1).
  • The .cpu() call should be removed. Device placement in MindSpore is handled differently, typically through context settings.
Suggested change
position_ids[batch_idx][p_attn_mask.view(-1).cpu()] = pos_ids
position_ids[batch_idx][p_attn_mask.reshape(-1)] = pos_ids


# Temporarily disable autocast to avoid issue on bf16 tensors
# Ref: https://github.com/pytorch/pytorch/issues/132715
image_embeds = inputs_embeds.index_put(indices=positions_tuple, values=merged_img_set_tensor, accumulate=False)

critical

The .index_put() method is a PyTorch API for in-place indexed assignment. This method does not exist for mindspore.Tensor objects and will cause an AttributeError. You might need to use mindspore.ops.tensor_scatter_elements or another approach to achieve a similar result in a way that is compatible with MindSpore.
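
For example, a minimal sketch using mindspore.ops.tensor_scatter_update (the shapes and index values below are hypothetical):

import numpy as np
import mindspore as ms
from mindspore import ops

inputs_embeds = ms.Tensor(np.zeros((2, 6, 4), np.float32))      # (batch, seq, hidden)
merged_img_set_tensor = ms.Tensor(np.ones((3, 4), np.float32))  # one row per target slot
batch_idx = ms.Tensor(np.array([0, 0, 1]), ms.int32)
seq_idx = ms.Tensor(np.array([1, 2, 0]), ms.int32)

# Stack the per-axis indices into an (N, 2) coordinate tensor, then scatter the
# image embeddings into those (batch, seq) slots -- the accumulate=False case.
indices = ops.stack((batch_idx, seq_idx), axis=-1)
image_embeds = ops.tensor_scatter_update(inputs_embeds, indices, merged_img_set_tensor)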

Comment on lines 156 to 189
def test_call(self):
    # Tests that all call wrap to encode_plus and batch_encode_plus
    feature_extractor = self.feature_extraction_class(**self.feat_extract_tester.prepare_feat_extract_dict())
    # create three inputs of length 800, 1000, and 1200
    speech_inputs = [floats_list((1, x))[0] for x in range(800, 1400, 200)]
    np_speech_inputs = [np.asarray(speech_input) for speech_input in speech_inputs]
    pt_speech_inputs = [torch.tensor(speech_input) for speech_input in speech_inputs]

    # Test feature size
    input_features = feature_extractor(np_speech_inputs, return_tensors="np").audio_input_features
    max_audio_len = (1200 - feature_extractor.win_length) // feature_extractor.hop_length + 1
    self.assertTrue(input_features.ndim == 3)
    self.assertTrue(input_features.shape[-1] == feature_extractor.feature_size)
    self.assertTrue(input_features.shape[-2] == max_audio_len)

    # Test not batched input
    encoded_sequences_1 = feature_extractor(pt_speech_inputs[0], return_tensors="np").audio_input_features
    encoded_sequences_2 = feature_extractor(np_speech_inputs[0], return_tensors="np").audio_input_features
    self.assertTrue(np.allclose(encoded_sequences_1, encoded_sequences_2, atol=1e-3))

    # Test batched
    encoded_sequences_1 = feature_extractor(pt_speech_inputs, return_tensors="np").audio_input_features
    encoded_sequences_2 = feature_extractor(np_speech_inputs, return_tensors="np").audio_input_features
    for enc_seq_1, enc_seq_2 in zip(encoded_sequences_1, encoded_sequences_2):
        self.assertTrue(np.allclose(enc_seq_1, enc_seq_2, atol=1e-3))

    # Test 2-D numpy arrays are batched.
    speech_inputs = [floats_list((1, x))[0] for x in (800, 800, 800)]
    np_speech_inputs = np.asarray(speech_inputs)
    pt_speech_inputs = torch.tensor(speech_inputs)
    encoded_sequences_1 = feature_extractor(pt_speech_inputs, return_tensors="np").audio_input_features
    encoded_sequences_2 = feature_extractor(np_speech_inputs, return_tensors="np").audio_input_features
    for enc_seq_1, enc_seq_2 in zip(encoded_sequences_1, encoded_sequences_2):
        self.assertTrue(np.allclose(enc_seq_1, enc_seq_2, atol=1e-3))

critical

This test case appears to be incorrectly using PyTorch tensors (pt_speech_inputs) as input to the MindSpore feature extractor. The feature extractor expects MindSpore tensors or NumPy arrays. This will lead to incorrect test behavior or errors. The tests should be adapted to use MindSpore-native data types to properly validate the implementation.
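
One possible MindSpore-native variant of the unbatched check (a sketch; feature_extractor and floats_list come from the quoted test):

import numpy as np
import mindspore as ms

speech_inputs = [floats_list((1, x))[0] for x in range(800, 1400, 200)]
np_speech_inputs = [np.asarray(s) for s in speech_inputs]
ms_speech_inputs = [ms.tensor(np.asarray(s)) for s in speech_inputs]

# Compare the MindSpore-tensor path against the NumPy path, as the torch
# variant did, with the same tolerance.
enc_1 = feature_extractor(ms_speech_inputs[0], return_tensors="np").audio_input_features
enc_2 = feature_extractor(np_speech_inputs[0], return_tensors="np").audio_input_features
assert np.allclose(enc_1, enc_2, atol=1e-3)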

Comment on lines 153 to 185
def test_cast_dtype_device(self):
    for image_processing_class in self.image_processor_list:
        if self.test_cast_dtype is not None:
            # Initialize image_processor
            image_processor = image_processing_class(**self.image_processor_dict)

            # create random PyTorch tensors
            image_inputs = self.image_processor_tester.prepare_image_inputs(equal_resolution=False, torchify=True)

            encoding = image_processor(image_inputs, return_tensors="pt")
            # for layoutLM compatibility
            self.assertEqual(encoding.image_pixel_values.device, torch.device("cpu"))
            self.assertEqual(encoding.image_pixel_values.dtype, torch.float32)

            encoding = image_processor(image_inputs, return_tensors="pt").to(torch.float16)
            self.assertEqual(encoding.image_pixel_values.device, torch.device("cpu"))
            self.assertEqual(encoding.image_pixel_values.dtype, torch.float16)

            encoding = image_processor(image_inputs, return_tensors="pt").to("cpu", torch.bfloat16)
            self.assertEqual(encoding.image_pixel_values.device, torch.device("cpu"))
            self.assertEqual(encoding.image_pixel_values.dtype, torch.bfloat16)

            with self.assertRaises(TypeError):
                _ = image_processor(image_inputs, return_tensors="pt").to(torch.bfloat16, "cpu")

            # Try with text + image feature
            encoding = image_processor(image_inputs, return_tensors="pt")
            encoding.update({"input_ids": torch.LongTensor([[1, 2, 3], [4, 5, 6]])})
            encoding = encoding.to(torch.float16)

            self.assertEqual(encoding.image_pixel_values.device, torch.device("cpu"))
            self.assertEqual(encoding.image_pixel_values.dtype, torch.float16)
            self.assertEqual(encoding.input_ids.dtype, torch.long)

critical

This test case is using PyTorch tensors and checks for PyTorch-specific attributes like .device and .dtype == torch.float16. For a MindSpore library, tests should use MindSpore tensors and MindSpore-specific assertions to correctly validate the implementation. This test needs to be rewritten for MindSpore.
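
A rough MindSpore-flavored rewrite of the core assertions (a sketch only; it assumes mindone's BatchFeature supports return_tensors="ms" and dtype casting via .to(), which should be verified against the actual API):

import mindspore as ms

# image_processor and image_inputs come from the surrounding test setup.
encoding = image_processor(image_inputs, return_tensors="ms")
assert encoding.image_pixel_values.dtype == ms.float32

encoding = image_processor(image_inputs, return_tensors="ms").to(ms.float16)
assert encoding.image_pixel_values.dtype == ms.float16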

Comment on lines 15 to 18
from .feature_extraction_phi4_multimodal import *
from .image_processing_phi4_multimodal_fast import *
from .modeling_phi4_multimodal import *
from .processing_phi4_multimodal import *

medium

Using import * is generally discouraged as it can pollute the namespace and make it unclear which names are being imported. It's better to explicitly import the required classes. This also helps static analysis tools and improves code readability.

Suggested change
from .feature_extraction_phi4_multimodal import *
from .image_processing_phi4_multimodal_fast import *
from .modeling_phi4_multimodal import *
from .processing_phi4_multimodal import *
from .feature_extraction_phi4_multimodal import Phi4MultimodalFeatureExtractor
from .image_processing_phi4_multimodal_fast import Phi4MultimodalImageProcessorFast
from .modeling_phi4_multimodal import (
    Phi4MultimodalAudioModel,
    Phi4MultimodalAudioPreTrainedModel,
    Phi4MultimodalForCausalLM,
    Phi4MultimodalModel,
    Phi4MultimodalPreTrainedModel,
    Phi4MultimodalVisionModel,
    Phi4MultimodalVisionPreTrainedModel,
)
from .processing_phi4_multimodal import Phi4MultimodalProcessor
