Implement Llama 3.2 Vision (11B/90B) Architecture #2521

Vivek1106-04 wants to merge 18 commits into keras-team:master from
Conversation
Summary of Changes

Hello @Vivek1106-04, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces the full Llama 3.2 Vision architecture, enabling multimodal capabilities within the Keras ecosystem. It integrates a two-stage vision encoder with the Llama 3 text model through a Gated Cross-Attention mechanism, allowing for dynamic, context-aware fusion of visual and textual information. The implementation is modular, providing distinct Keras layers for each component, and includes utilities for weight conversion from HuggingFace and for efficient fine-tuning.

Highlights
Code Review
This PR introduces a comprehensive implementation of the Llama 3.2 Vision architecture. The code is well-structured and follows the repository's conventions for modularity and testing. I've identified a few areas for improvement:

- There's a critical issue with the `Llama3VisionProjector` implementation, which uses a single linear layer instead of the required 2-layer MLP. This affects the model architecture, the weight conversion script, and the tests.
- For consistency with the repository's style guide, the image input to the backbone should be named `pixel_values` instead of `images`. This requires minor updates to the backbone and preprocessor.

I've provided detailed suggestions to address these points. Once these are resolved, the implementation will be solid.
```python
class Llama3VisionProjector(keras.layers.Layer):
    """Vision projector for the Llama 3.2 Vision model.

    This layer projects vision encoder features into the text embedding space
    using a single linear projection, enabling vision-language fusion.

    Args:
        input_dim: int. The dimension of the vision encoder output
            (vision_output_dim from HuggingFace config).
        output_dim: int. The dimension of the text decoder embeddings.
        dtype: string or `keras.mixed_precision.DTypePolicy`. The dtype to use
            for model computations and weights.
    """

    def __init__(
        self,
        input_dim,
        output_dim,
        dtype=None,
        **kwargs,
    ):
        super().__init__(dtype=dtype, **kwargs)

        # === Config ===
        self.input_dim = input_dim
        self.output_dim = output_dim

        # === Layers ===
        # Single linear projection matching HuggingFace architecture
        self.projection = layers.Dense(
            self.output_dim,
            use_bias=True,
            name="projection",
        )

    def build(self, input_shape):
        self.projection.build(input_shape)
        super().build(input_shape)

    def call(self, inputs):
        return self.projection(inputs)

    def get_config(self):
        config = super().get_config()
        config.update(
            {
                "input_dim": self.input_dim,
                "output_dim": self.output_dim,
            }
        )
        return config
```
The Llama 3.2 Vision model uses a 2-layer MLP for its vision projector, but this implementation uses only a single Dense layer. This is incorrect and should be updated to a 2-layer MLP to match the original architecture. This will require adding an intermediate_dim parameter and an activation function (typically GELU).
```python
class Llama3VisionProjector(keras.layers.Layer):
    """Vision projector for the Llama 3.2 Vision model.

    This layer projects vision encoder features into the text embedding space
    using a 2-layer MLP, enabling vision-language fusion.

    Args:
        input_dim: int. The dimension of the vision encoder output.
        output_dim: int. The dimension of the text decoder embeddings.
        intermediate_dim: int. The intermediate dimension of the MLP.
        dtype: string or `keras.mixed_precision.DTypePolicy`. The dtype to use
            for model computations and weights.
    """

    def __init__(
        self,
        input_dim,
        output_dim,
        intermediate_dim,
        dtype=None,
        **kwargs,
    ):
        super().__init__(dtype=dtype, **kwargs)

        # === Config ===
        self.input_dim = input_dim
        self.output_dim = output_dim
        self.intermediate_dim = intermediate_dim

        # === Layers ===
        # 2-layer MLP matching HuggingFace architecture
        self.dense_1 = layers.Dense(
            self.intermediate_dim,
            activation="gelu",
            name="dense_1",
        )
        self.dense_2 = layers.Dense(
            self.output_dim,
            name="dense_2",
        )

    def build(self, input_shape):
        self.dense_1.build(input_shape)
        dense_1_output_shape = list(input_shape)
        dense_1_output_shape[-1] = self.intermediate_dim
        self.dense_2.build(tuple(dense_1_output_shape))
        super().build(input_shape)

    def call(self, inputs):
        x = self.dense_1(inputs)
        return self.dense_2(x)

    def get_config(self):
        config = super().get_config()
        config.update(
            {
                "input_dim": self.input_dim,
                "output_dim": self.output_dim,
                "intermediate_dim": self.intermediate_dim,
            }
        )
        return config
```

```python
    def test_projector_basics(self):
        self.run_layer_test(
            cls=Llama3VisionProjector,
            init_kwargs={
                "hidden_dim": 128,  # Vision Encoder output
                "output_dim": 256,  # Text Model input
                "intermediate_dim": 512,  # Internal MLP size
            },
            input_data=np.random.uniform(size=(2, 10, 128)).astype("float32"),
            expected_output_shape=(2, 10, 256),  # Should match output_dim
            # dense_1 (kernel, bias) + dense_2 (kernel, bias)
            expected_num_trainable_weights=4,
            run_precision_checks=False,
        )

    def test_defaults(self):
        # Test that intermediate_dim defaults to output_dim if not set
        projector = Llama3VisionProjector(hidden_dim=32, output_dim=64)
        images = np.random.uniform(size=(2, 5, 32)).astype("float32")
        outputs = projector(images)
        self.assertEqual(outputs.shape, (2, 5, 64))
```
The tests for `Llama3VisionProjector` are incorrect and will fail. They seem to be based on a different version of the implementation.

- `test_projector_basics`: The `init_kwargs` are incorrect. `hidden_dim` should be `input_dim`. With the corrected 2-layer MLP projector, `intermediate_dim` is a required argument.
- `test_defaults`: This test is also incorrect. It's missing the required `intermediate_dim` argument and uses `hidden_dim` instead of `input_dim`.
```python
    def test_projector_basics(self):
        self.run_layer_test(
            cls=Llama3VisionProjector,
            init_kwargs={
                "input_dim": 128,  # Vision Encoder output
                "output_dim": 256,  # Text Model input
                "intermediate_dim": 512,  # Internal MLP size
            },
            input_data=np.random.uniform(size=(2, 10, 128)).astype("float32"),
            expected_output_shape=(2, 10, 256),  # Should match output_dim
            # dense_1 (kernel, bias) + dense_2 (kernel, bias)
            expected_num_trainable_weights=4,
            run_precision_checks=False,
        )

    def test_projector_with_defaults(self):
        # Test with minimal arguments.
        projector = Llama3VisionProjector(
            input_dim=32, output_dim=64, intermediate_dim=128
        )
        images = np.random.uniform(size=(2, 5, 32)).astype("float32")
        outputs = projector(images)
        self.assertEqual(outputs.shape, (2, 5, 64))
```

```python
def convert_vision_projector_weights(keras_projector, hf_model):
    """Convert vision projector weights."""
    # HF uses a single nn.Linear, Keras now uses single Dense layer
    hf_proj = hf_model.model.multi_modal_projector

    # Single Linear layer
    keras_projector.projection.kernel.assign(
        hf_proj.weight.T.detach().cpu().float().numpy()
    )
    keras_projector.projection.bias.assign(
        hf_proj.bias.detach().cpu().float().numpy()
    )
```
This conversion logic for the vision projector is incorrect. It assumes a single linear layer, but the Llama 3.2 Vision projector is a 2-layer MLP. This needs to be updated to convert weights for both dense_1 and dense_2 layers, similar to the logic in keras_hub/src/utils/transformers/convert_llama3_vision.py.
```diff
 def convert_vision_projector_weights(keras_projector, hf_model):
     """Convert vision projector weights."""
-    # HF uses a single nn.Linear, Keras now uses single Dense layer
     hf_proj = hf_model.model.multi_modal_projector
-
-    # Single Linear layer
-    keras_projector.projection.kernel.assign(
-        hf_proj.weight.T.detach().cpu().float().numpy()
-    )
-    keras_projector.projection.bias.assign(
-        hf_proj.bias.detach().cpu().float().numpy()
-    )
+
+    # Dense 1 (input projection)
+    keras_projector.dense_1.kernel.assign(
+        hf_proj.linear_1.weight.T.detach().cpu().float().numpy()
+    )
+    keras_projector.dense_1.bias.assign(
+        hf_proj.linear_1.bias.detach().cpu().float().numpy()
+    )
+    # Dense 2 (output projection)
+    keras_projector.dense_2.kernel.assign(
+        hf_proj.linear_2.weight.T.detach().cpu().float().numpy()
+    )
+    keras_projector.dense_2.bias.assign(
+        hf_proj.linear_2.bias.detach().cpu().float().numpy()
+    )
```
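As a follow-up to the conversion fix, here is a hedged sketch of a numerical parity check one might run after converting the projector weights. It reuses the `keras_projector` / `hf_model` names and the `multi_modal_projector` path from the function above, but the helper name, the assumption that the HF module sits on CPU in float32, and the tolerance are illustrative choices, not part of the PR.

```python
# Hypothetical parity check for the converted projector weights (not PR code).
# Assumes `keras_projector` and `hf_model` are loaded as in the conversion
# function above, with the HF projector module on CPU in float32.
import keras
import numpy as np
import torch


def check_projector_parity(keras_projector, hf_model, input_dim, atol=1e-4):
    hf_proj = hf_model.model.multi_modal_projector
    x = np.random.uniform(size=(1, 4, input_dim)).astype("float32")

    keras_out = keras.ops.convert_to_numpy(keras_projector(x))
    with torch.no_grad():
        hf_out = hf_proj(torch.from_numpy(x)).float().numpy()

    # Converted weights should reproduce the HF projector numerically.
    np.testing.assert_allclose(keras_out, hf_out, atol=atol)
    print(f"Projector outputs match within atol={atol}")
```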
```python
        # === Functional Model ===
        image_input = keras.Input(
            shape=(vision_image_size, vision_image_size, vision_num_channels),
            name="images",
```
For consistency with other vision models in the repository, the input name for image data should be pixel_values instead of images. This aligns with the repository's style guide (line 67). You'll also need to update the inputs dictionary in the super().__init__ call on line 225, the example in the docstring on line 74, and the corresponding test file keras_hub/src/models/llama3/llama3_vision_backbone_test.py.
| name="images", | |
| name="pixel_values", |
References

- The style guide specifies standardized names for model input arguments to ensure interoperability. For image models, `pixel_values` is the recommended name for the input tensor containing image data. (link)
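For illustration, a minimal sketch of how the renamed input and the `inputs` dictionary might look on the backbone side. The config values and the text input names (`token_ids`, `padding_mask`) are assumptions made for the sketch; only the `pixel_values` name comes from the suggestion above.

```python
# Illustrative sketch only; shape values and text input names are assumed.
import keras

vision_image_size = 448
vision_num_channels = 3

image_input = keras.Input(
    shape=(vision_image_size, vision_image_size, vision_num_channels),
    name="pixel_values",  # renamed from "images" per the style guide
)
token_id_input = keras.Input(shape=(None,), dtype="int32", name="token_ids")
padding_mask_input = keras.Input(shape=(None,), dtype="int32", name="padding_mask")

# The same keys should appear in the `inputs` dict passed to
# `super().__init__`, so preprocessor outputs map onto backbone inputs by name.
inputs = {
    "pixel_values": image_input,
    "token_ids": token_id_input,
    "padding_mask": padding_mask_input,
}
```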
```python
        if images is not None and self.image_converter is not None:
            images = self.image_converter(images)
            output["images"] = images
```
The output key for preprocessed images should be pixel_values to match the expected input name of the Llama3VisionBackbone. This ensures consistency and aligns with the repository's style guide (line 67). This change will also require updating the tests in keras_hub/src/models/llama3/llama3_vision_preprocessor_test.py to check for the pixel_values key in the output dictionary.
| output["images"] = images | |
| output["pixel_values"] = images |
References

- The style guide specifies standardized names for model input arguments to ensure interoperability. For image models, `pixel_values` is the recommended name for the input tensor containing image data. (link)
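To show why the key rename matters, here is a small self-contained toy (not the real classes): a functional model whose input is named `pixel_values` can be fed a preprocessor-style dictionary directly, matched by key.

```python
# Toy demonstration of feeding a dict to a functional model by input name.
# The tiny model here is illustrative and unrelated to Llama3VisionBackbone.
import keras
import numpy as np

pixel_values = keras.Input(shape=(8, 8, 3), name="pixel_values")
pooled = keras.layers.GlobalAveragePooling2D()(pixel_values)
toy_model = keras.Model(inputs={"pixel_values": pixel_values}, outputs=pooled)

# A preprocessor-style output dict; the key must match the model's input name.
preprocessed = {
    "pixel_values": np.random.uniform(size=(2, 8, 8, 3)).astype("float32"),
}
print(toy_model(preprocessed).shape)  # (2, 3)
```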
Refactor: Vision Projector

I have updated the `Llama3VisionProjector` as suggested in the review. This change ensures architectural parity with the official Hugging Face implementation.
Description
This PR implements the full Llama 3.2 Vision architecture, supporting the 11B and 90B multimodal variants. Unlike the early-fusion Llama 3 text models, this implementation uses Gated Cross-Attention (Late Fusion) to inject vision features into specific text decoder layers.
Key Features Implemented:
- `Llama3VisionBackbone` rewritten to support Cross-Attention layers (instead of linear fusion).
- `Llama3VisionCrossAttention` with gated query-key-value injection at specific intervals (e.g., layers 3, 8, 13...); a minimal illustrative sketch of this gating pattern follows below.
- `convert_llama3_vision.py` to port official Meta/HuggingFace weights.
- `freeze_for_vision_adapter_training()` to support efficient fine-tuning (freezing text backbone + local encoder).

Fixes Implement Llama 3.2 Vision (Multimodal) Architecture #2470
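To make the late-fusion idea concrete, below is a minimal, illustrative sketch of a tanh-gated cross-attention block in Keras. It is not the PR's `Llama3VisionCrossAttention` implementation: the layer structure, head configuration, and zero-initialized scalar gate are assumptions based on the description above and on how gated cross-attention is commonly described for Llama 3.2 Vision.

```python
import keras
import numpy as np
from keras import layers, ops


class GatedCrossAttentionSketch(keras.layers.Layer):
    """Illustrative tanh-gated cross-attention block (not the PR's layer).

    Text hidden states attend over vision features; the attention output is
    scaled by tanh(gate) with the gate initialized to zero, so the block
    starts as an identity mapping and gradually admits vision information.
    """

    def __init__(self, hidden_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.attention = layers.MultiHeadAttention(
            num_heads=num_heads,
            key_dim=hidden_dim // num_heads,
            name="cross_attention",
        )
        self.norm = layers.LayerNormalization(epsilon=1e-5, name="norm")
        # Scalar gate, zero-initialized: tanh(0) == 0, so the residual path
        # dominates at the start of training.
        self.attn_gate = self.add_weight(
            name="attn_gate", shape=(1,), initializer="zeros", trainable=True
        )

    def call(self, text_hidden_states, vision_features):
        residual = text_hidden_states
        x = self.norm(text_hidden_states)
        attn_out = self.attention(
            query=x, value=vision_features, key=vision_features
        )
        return residual + ops.tanh(self.attn_gate) * attn_out


# Usage: at initialization the gate is zero, so the block is a no-op.
text = np.random.normal(size=(2, 16, 512)).astype("float32")
vision = np.random.normal(size=(2, 64, 512)).astype("float32")
block = GatedCrossAttentionSketch(hidden_dim=512, num_heads=8)
out = block(text, vision)  # shape (2, 16, 512), equal to `text` before training

# Cross-attention is injected only at specific decoder layers, e.g. every
# 5th layer starting at index 3 ("layers 3, 8, 13..." in the description).
num_decoder_layers = 40
cross_attention_layer_ids = list(range(3, num_decoder_layers, 5))
```

Zero-initializing the gate is the usual rationale for this design: the pretrained text model's behavior is unchanged at the start of adapter training, and the gate learns how much vision signal to admit.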