feat(transformers): add Phi4 #1468
base: master
Conversation
Summary of Changes
Hello @LiTingyu1997, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly expands the set of supported models by adding Phi4Multimodal.
Code Review
This pull request introduces the Phi4Multimodal model. While the overall structure for adding the new model is in place, the implementation files (modeling_phi4_multimodal.py, feature_extraction_phi4_multimodal.py, image_processing_phi4_multimodal_fast.py) contain a significant amount of PyTorch-specific code that is not compatible with MindSpore. This includes using methods like .view(), .permute(), .contiguous(), .item(), and various in-place operations (*_()) which will cause AttributeErrors at runtime. The accompanying test files also incorrectly use PyTorch tensors and assertions, failing to validate the MindSpore implementation. The code requires a thorough review and porting to the MindSpore API, likely using mindspore.ops and mindspore.mint where appropriate, and removing PyTorch-specific patterns. I've added several critical comments to highlight these API incompatibilities.
```python
frames = frames.clone()
# concerned batch indices
to_mask_batch_idxs = mindspore.mint.arange(batch_size)[audio_lengths != audio_lengths.max()]
if to_mask_batch_idxs.numel() > 0:
    batch_idxs_down = (audio_lengths[to_mask_batch_idxs] - self.win_length) // self.hop_length + 1
    batch_idxs_up = (audio_lengths[to_mask_batch_idxs] // self.hop_length) - 1
    offset_idx = batch_idxs_down.min()
    max_idx = batch_idxs_up.max()

    mask = mindspore.mint.arange(
        max_idx - offset_idx,
    ).expand(to_mask_batch_idxs.shape[0], -1)
    mask = ((batch_idxs_down - offset_idx).unsqueeze(1) <= mask) & (
        mask < (batch_idxs_up - offset_idx).unsqueeze(1)
    )
    mask = mask.unsqueeze(-1).expand(-1, -1, self.win_length)
    masked_frames = frames[to_mask_batch_idxs, offset_idx:max_idx].masked_fill_(mask, 0)
    frames[to_mask_batch_idxs, offset_idx:max_idx] = masked_frames
```
This block of code uses PyTorch-specific tensor methods like .clone() and .masked_fill_() which are not available on standard mindspore.Tensor objects. This will cause an AttributeError at runtime.
To fix this, you should use the equivalent functions from mindspore.ops or mindspore.mint and adapt the logic to be compatible with MindSpore's execution model. For example, tensor.clone() can be replaced with tensor.copy(), and masked_fill_() can be replaced with the functional version mindspore.mint.masked_fill.
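A minimal sketch of that rewrite, applied to the two offending lines above (assuming mindspore.mint.masked_fill is available and that Tensor.copy() returns an independent copy, as the comment suggests):

```python
# out-of-place copy instead of the PyTorch-only .clone()
frames = frames.copy()

# functional masked_fill instead of the in-place .masked_fill_()
masked_frames = mindspore.mint.masked_fill(
    frames[to_mask_batch_idxs, offset_idx:max_idx], mask, 0.0
)
frames[to_mask_batch_idxs, offset_idx:max_idx] = masked_frames
```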
```python
spec_power = spec**2

# apply triangular mel filter bank
mel_filters = mindspore.Tensor.from_numpy(self.mel_filters).to(device, mindspore.float32)
```
The .to(device, dtype) method signature is not valid for a mindspore.Tensor. Device placement in MindSpore is typically handled by the context (ms.set_context) rather than on a per-tensor basis like in PyTorch. The device parameter seems to be a leftover from a PyTorch implementation and is not used correctly here. This will raise an AttributeError.
Suggested change:
```diff
- mel_filters = mindspore.Tensor.from_numpy(self.mel_filters).to(device, mindspore.float32)
+ mel_filters = mindspore.Tensor(self.mel_filters, dtype=mindspore.float32)
```
```python
hd_image_reshape = hd_image_reshape.permute(0, 2, 4, 1, 3, 5)
hd_image_reshape = hd_image_reshape.reshape(-1, 3, size.height, size.width).contiguous()
```
The methods .permute() and .contiguous() are from the PyTorch API and are not available on mindspore.Tensor objects. This will cause an AttributeError.
To fix this, you can use mindspore.Tensor.transpose with a permutation tuple for .permute(), and remove .contiguous() as it's not necessary in MindSpore.
Suggested change:
```diff
- hd_image_reshape = hd_image_reshape.permute(0, 2, 4, 1, 3, 5)
- hd_image_reshape = hd_image_reshape.reshape(-1, 3, size.height, size.width).contiguous()
+ hd_image_reshape = hd_image_reshape.transpose((0, 2, 4, 1, 3, 5))
+ hd_image_reshape = hd_image_reshape.reshape(-1, 3, size.height, size.width)
```
```python
+ int(downsample_attention_mask.sum().item())
+ int(downsample_attention_mask[:, 0].sum().item())
```
The .item() method is not available on mindspore.Tensor objects. To get a scalar value from a 0-dimensional tensor in MindSpore, you should use .asnumpy(). This will cause an AttributeError at runtime.
Suggested change (the leading + signs are part of the expression, not diff markers):
```diff
- + int(downsample_attention_mask.sum().item())
- + int(downsample_attention_mask[:, 0].sum().item())
+ + int(downsample_attention_mask.sum().asnumpy())
+ + int(downsample_attention_mask[:, 0].sum().asnumpy())
```
```python
query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
key_states = self.k_proj(hidden_states).view(hidden_shape).transpose(1, 2)
value_states = self.v_proj(hidden_states).view(hidden_shape).transpose(1, 2)
```
The .view() method is a PyTorch API and is not available on mindspore.Tensor objects. The MindSpore equivalent is .reshape(). This will cause an AttributeError.
Suggested change:
```diff
- query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
- key_states = self.k_proj(hidden_states).view(hidden_shape).transpose(1, 2)
- value_states = self.v_proj(hidden_states).view(hidden_shape).transpose(1, 2)
+ query_states = self.q_proj(hidden_states).reshape(hidden_shape).transpose(1, 2)
+ key_states = self.k_proj(hidden_states).reshape(hidden_shape).transpose(1, 2)
+ value_states = self.v_proj(hidden_states).reshape(hidden_shape).transpose(1, 2)
```
```python
bucket_coords_w = mindspore.ops.bucketize(fractional_coords_w, boundaries, right=True)

pos_ids = (bucket_coords_h[:, None] * self.num_patches_per_side + bucket_coords_w).flatten()
position_ids[batch_idx][p_attn_mask.view(-1).cpu()] = pos_ids
```
The methods .view() and .cpu() are from the PyTorch API and are not available on mindspore.Tensor objects. This will cause an AttributeError.
- Use .reshape(-1) instead of .view(-1).
- The .cpu() call should be removed. Device placement in MindSpore is handled differently, typically through context settings.
Suggested change:
```diff
- position_ids[batch_idx][p_attn_mask.view(-1).cpu()] = pos_ids
+ position_ids[batch_idx][p_attn_mask.reshape(-1)] = pos_ids
```
```python
# Temporarily disable autocast to avoid issue on bf16 tensors
# Ref: https://github.com/pytorch/pytorch/issues/132715
image_embeds = inputs_embeds.index_put(indices=positions_tuple, values=merged_img_set_tensor, accumulate=False)
```
The .index_put() method is a PyTorch API for in-place indexed assignment. This method does not exist for mindspore.Tensor objects and will cause an AttributeError. You might need to use mindspore.ops.tensor_scatter_elements or another approach to achieve a similar result in a way that is compatible with MindSpore.
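One possible replacement, sketched here with mindspore.ops.tensor_scatter_update (a close relative of the tensor_scatter_elements mentioned above); the assumption that positions_tuple holds per-dimension index vectors, and the stacking step, are illustrative rather than taken from the PR:

```python
from mindspore import ops

# stack the per-dimension index vectors into an (N, ndim) index matrix,
# the layout that mindspore.ops.tensor_scatter_update expects
indices = ops.stack(positions_tuple, axis=-1)

# out-of-place scatter: returns a new tensor with merged_img_set_tensor
# written at the given positions (matching accumulate=False semantics)
image_embeds = ops.tensor_scatter_update(inputs_embeds, indices, merged_img_set_tensor)
```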
```python
def test_call(self):
    # Tests that all call wrap to encode_plus and batch_encode_plus
    feature_extractor = self.feature_extraction_class(**self.feat_extract_tester.prepare_feat_extract_dict())
    # create three inputs of length 800, 1000, and 1200
    speech_inputs = [floats_list((1, x))[0] for x in range(800, 1400, 200)]
    np_speech_inputs = [np.asarray(speech_input) for speech_input in speech_inputs]
    pt_speech_inputs = [torch.tensor(speech_input) for speech_input in speech_inputs]

    # Test feature size
    input_features = feature_extractor(np_speech_inputs, return_tensors="np").audio_input_features
    max_audio_len = (1200 - feature_extractor.win_length) // feature_extractor.hop_length + 1
    self.assertTrue(input_features.ndim == 3)
    self.assertTrue(input_features.shape[-1] == feature_extractor.feature_size)
    self.assertTrue(input_features.shape[-2] == max_audio_len)

    # Test not batched input
    encoded_sequences_1 = feature_extractor(pt_speech_inputs[0], return_tensors="np").audio_input_features
    encoded_sequences_2 = feature_extractor(np_speech_inputs[0], return_tensors="np").audio_input_features
    self.assertTrue(np.allclose(encoded_sequences_1, encoded_sequences_2, atol=1e-3))

    # Test batched
    encoded_sequences_1 = feature_extractor(pt_speech_inputs, return_tensors="np").audio_input_features
    encoded_sequences_2 = feature_extractor(np_speech_inputs, return_tensors="np").audio_input_features
    for enc_seq_1, enc_seq_2 in zip(encoded_sequences_1, encoded_sequences_2):
        self.assertTrue(np.allclose(enc_seq_1, enc_seq_2, atol=1e-3))

    # Test 2-D numpy arrays are batched.
    speech_inputs = [floats_list((1, x))[0] for x in (800, 800, 800)]
    np_speech_inputs = np.asarray(speech_inputs)
    pt_speech_inputs = torch.tensor(speech_inputs)
    encoded_sequences_1 = feature_extractor(pt_speech_inputs, return_tensors="np").audio_input_features
    encoded_sequences_2 = feature_extractor(np_speech_inputs, return_tensors="np").audio_input_features
    for enc_seq_1, enc_seq_2 in zip(encoded_sequences_1, encoded_sequences_2):
        self.assertTrue(np.allclose(enc_seq_1, enc_seq_2, atol=1e-3))
```
This test case appears to be incorrectly using PyTorch tensors (pt_speech_inputs) as input to the MindSpore feature extractor. The feature extractor expects MindSpore tensors or NumPy arrays. This will lead to incorrect test behavior or errors. The tests should be adapted to use MindSpore-native data types to properly validate the implementation.
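A minimal sketch of that adaptation, assuming the feature extractor accepts mindspore.Tensor inputs in place of torch tensors (variable names follow the test above):

```python
import mindspore
import numpy as np

# replace torch.tensor(...) with MindSpore-native tensors
ms_speech_inputs = [mindspore.Tensor(speech_input, dtype=mindspore.float32) for speech_input in speech_inputs]

# the not-batched comparison then exercises the MindSpore input path
encoded_sequences_1 = feature_extractor(ms_speech_inputs[0], return_tensors="np").audio_input_features
encoded_sequences_2 = feature_extractor(np_speech_inputs[0], return_tensors="np").audio_input_features
self.assertTrue(np.allclose(encoded_sequences_1, encoded_sequences_2, atol=1e-3))
```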
```python
def test_cast_dtype_device(self):
    for image_processing_class in self.image_processor_list:
        if self.test_cast_dtype is not None:
            # Initialize image_processor
            image_processor = image_processing_class(**self.image_processor_dict)

            # create random PyTorch tensors
            image_inputs = self.image_processor_tester.prepare_image_inputs(equal_resolution=False, torchify=True)

            encoding = image_processor(image_inputs, return_tensors="pt")
            # for layoutLM compatibility
            self.assertEqual(encoding.image_pixel_values.device, torch.device("cpu"))
            self.assertEqual(encoding.image_pixel_values.dtype, torch.float32)

            encoding = image_processor(image_inputs, return_tensors="pt").to(torch.float16)
            self.assertEqual(encoding.image_pixel_values.device, torch.device("cpu"))
            self.assertEqual(encoding.image_pixel_values.dtype, torch.float16)

            encoding = image_processor(image_inputs, return_tensors="pt").to("cpu", torch.bfloat16)
            self.assertEqual(encoding.image_pixel_values.device, torch.device("cpu"))
            self.assertEqual(encoding.image_pixel_values.dtype, torch.bfloat16)

            with self.assertRaises(TypeError):
                _ = image_processor(image_inputs, return_tensors="pt").to(torch.bfloat16, "cpu")

            # Try with text + image feature
            encoding = image_processor(image_inputs, return_tensors="pt")
            encoding.update({"input_ids": torch.LongTensor([[1, 2, 3], [4, 5, 6]])})
            encoding = encoding.to(torch.float16)

            self.assertEqual(encoding.image_pixel_values.device, torch.device("cpu"))
            self.assertEqual(encoding.image_pixel_values.dtype, torch.float16)
            self.assertEqual(encoding.input_ids.dtype, torch.long)
```
This test case is using PyTorch tensors and checks for PyTorch-specific attributes like .device and .dtype == torch.float16. For a MindSpore library, tests should use MindSpore tensors and MindSpore-specific assertions to correctly validate the implementation. This test needs to be rewritten for MindSpore.
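A hedged sketch of the dtype portion rewritten for MindSpore (the torchify=False call, the "np" return type, and the cast via Tensor.astype are assumptions; MindSpore tensors have no per-tensor device attribute, so the device assertions are dropped):

```python
import mindspore

# MindSpore-native inputs instead of random PyTorch tensors
image_inputs = self.image_processor_tester.prepare_image_inputs(equal_resolution=False)

encoding = image_processor(image_inputs, return_tensors="np")
pixel_values = mindspore.Tensor(encoding.image_pixel_values)
self.assertEqual(pixel_values.dtype, mindspore.float32)

# casting uses Tensor.astype rather than torch's .to(...)
self.assertEqual(pixel_values.astype(mindspore.float16).dtype, mindspore.float16)
```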
```python
from .feature_extraction_phi4_multimodal import *
from .image_processing_phi4_multimodal_fast import *
from .modeling_phi4_multimodal import *
from .processing_phi4_multimodal import *
```
Using import * is generally discouraged as it can pollute the namespace and make it unclear which names are being imported. It's better to explicitly import the required classes. This also helps static analysis tools and improves code readability.
Suggested change:
```diff
- from .feature_extraction_phi4_multimodal import *
- from .image_processing_phi4_multimodal_fast import *
- from .modeling_phi4_multimodal import *
- from .processing_phi4_multimodal import *
+ from .feature_extraction_phi4_multimodal import Phi4MultimodalFeatureExtractor
+ from .image_processing_phi4_multimodal_fast import Phi4MultimodalImageProcessorFast
+ from .modeling_phi4_multimodal import (
+     Phi4MultimodalAudioModel,
+     Phi4MultimodalAudioPreTrainedModel,
+     Phi4MultimodalForCausalLM,
+     Phi4MultimodalModel,
+     Phi4MultimodalPreTrainedModel,
+     Phi4MultimodalVisionModel,
+     Phi4MultimodalVisionPreTrainedModel,
+ )
+ from .processing_phi4_multimodal import Phi4MultimodalProcessor
```
What does this PR do?
Add the Phi4Multimodal model.
Usage
Phi4MultimodalForCausalLM
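The original usage snippet did not survive extraction; a minimal sketch of what loading the model might look like (the import path, checkpoint id, and from_pretrained keyword are assumptions, not confirmed by this PR):

```python
import mindspore
from mindone.transformers import Phi4MultimodalForCausalLM  # assumed import path

# assumed checkpoint id; note the PR states weight loading could not be verified
model = Phi4MultimodalForCausalLM.from_pretrained(
    "microsoft/Phi-4-multimodal-instruct",
    mindspore_dtype=mindspore.float16,  # assumed keyword for dtype selection
)
```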
Problem
Since the original transformers repository code also cannot load the weights for testing, this PR only aligns the code with the original repository; forward accuracy has been verified.
Fixes # (issue)
Adds # (feature)
Before submitting
Did you make sure to update the documentation with your changes? E.g. record bug fixes or new features in What's New. Here are the documentation guidelines.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@xxx