
Implement Llama 3.2 Vision (11B/90B) Architecture#2521

Open
Vivek1106-04 wants to merge 18 commits into keras-team:master from Vivek1106-04:llama3-vision

Conversation

@Vivek1106-04

Description

This PR implements the full Llama 3.2 Vision architecture, supporting the 11B and 90B multimodal variants. Unlike the early-fusion Llama 3 text models, this implementation uses Gated Cross-Attention (Late Fusion) to inject vision features into specific text decoder layers.

Key Features Implemented:

  • Core Architecture: Llama3VisionBackbone rewritten to support Cross-Attention layers (instead of linear fusion).
  • Vision Encoder: Full Two-Stage Encoder support (SigLIP-based local layers + Global transformer layers).
  • Cross-Attention: Implemented Llama3VisionCrossAttention with gated injection of vision keys/values into specific decoder layers (e.g., layers 3, 8, 13, ...); see the sketch after this list.
  • Weight Support: Included convert_llama3_vision.py to port official Meta/HuggingFace weights.
  • Training Utils: Added freeze_for_vision_adapter_training() to support efficient fine-tuning (freezing text backbone + local encoder).
Fixes #2470: Implement Llama 3.2 Vision (Multimodal) Architecture
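
For illustration, here is a minimal sketch of the gated cross-attention idea from the list above (names, shapes, and the gating details are assumptions for exposition, not the PR's exact implementation):

from keras import layers, ops


class GatedCrossAttentionSketch(layers.Layer):
    """Illustrative gated cross-attention block (not the PR's exact code).

    Text hidden states attend over vision features, and the result is scaled
    by a learnable tanh gate initialized at zero, so at initialization the
    block is an identity on the text stream.
    """

    def __init__(self, num_heads, hidden_dim, **kwargs):
        super().__init__(**kwargs)
        self.attention = layers.MultiHeadAttention(
            num_heads=num_heads,
            key_dim=hidden_dim // num_heads,
        )
        self.gate = self.add_weight(
            name="attention_gate", shape=(1,), initializer="zeros"
        )

    def call(self, text_hidden_states, vision_features):
        attention_output = self.attention(
            query=text_hidden_states,
            key=vision_features,
            value=vision_features,
        )
        # Zero-initialized tanh gate: output equals the text stream at init.
        return text_hidden_states + ops.tanh(self.gate) * attention_output


# Cross-attention blocks are injected at fixed intervals, e.g. every fifth
# decoder layer starting at index 3 (assuming a 40-layer text decoder).
cross_attention_layer_indices = list(range(3, 40, 5))  # [3, 8, 13, ..., 38]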

@gemini-code-assist
Contributor

Summary of Changes

Hello @Vivek1106-04, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces the full Llama 3.2 Vision architecture, enabling advanced multimodal capabilities within the Keras ecosystem. It achieves this by integrating a sophisticated two-stage vision encoder with the Llama 3 text model through a novel Gated Cross-Attention mechanism, allowing for dynamic and context-aware fusion of visual and textual information. The implementation is modular, providing distinct Keras layers for each component, and includes essential utilities for seamless weight conversion from HuggingFace and efficient fine-tuning strategies.

Highlights

  • Full Llama 3.2 Vision Architecture: Implements the complete Llama 3.2 Vision architecture, supporting 11B and 90B multimodal variants.
  • Gated Cross-Attention (Late Fusion): Utilizes a Gated Cross-Attention mechanism for late fusion, injecting vision features into specific text decoder layers, a departure from early-fusion Llama 3 text models.
  • Two-Stage Vision Encoder: Includes a comprehensive two-stage vision encoder, combining SigLIP-based local layers with global transformer layers for robust visual feature extraction.
  • Modular Component Implementation: Introduces dedicated Keras layers for Llama3VisionBackbone, Llama3VisionCrossAttention, Llama3VisionEncoder, Llama3VisionProjector, Llama3VisionImageConverter, and Llama3VisionPreprocessor.
  • Weight Conversion and Fine-tuning Utilities: Provides convert_llama3_vision.py to port official Meta/HuggingFace weights and freeze_for_vision_adapter_training() for efficient fine-tuning by freezing the text backbone and local encoder (see the sketch after this list).
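
For illustration, the freezing strategy mentioned in the last bullet could look roughly like the following (a hedged sketch; the layer-name matching is a hypothetical convention, not the PR's actual freeze_for_vision_adapter_training() implementation):

def freeze_for_vision_adapter_training_sketch(backbone):
    """Illustrative freezing strategy (not the PR's actual code).

    Freezes the pretrained text decoder and local vision encoder, leaving
    only the fusion components (cross-attention layers, projector, and
    global vision layers) trainable.
    """
    for layer in backbone.layers:
        # Hypothetical naming convention for the trainable adapter pieces.
        is_adapter = any(
            key in layer.name
            for key in ("cross_attention", "projector", "global")
        )
        layer.trainable = is_adapter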



Contributor

@gemini-code-assist bot left a comment


Code Review

This PR introduces a comprehensive implementation of the Llama 3.2 Vision architecture. The code is well-structured and follows the repository's conventions for modularity and testing. I've identified a few areas for improvement:

  • There's a critical issue with the Llama3VisionProjector implementation, which uses a single linear layer instead of the required 2-layer MLP. This affects the model architecture, weight conversion script, and tests.
  • For consistency with the repository's style guide, the image input to the backbone should be named pixel_values instead of images. This requires minor updates to the backbone and preprocessor.

I've provided detailed suggestions to address these points. Once these are resolved, the implementation will be solid.

Comment on lines 8 to 58
class Llama3VisionProjector(keras.layers.Layer):
    """Vision projector for the Llama 3.2 Vision model.

    This layer projects vision encoder features into the text embedding space
    using a single linear projection, enabling vision-language fusion.

    Args:
        input_dim: int. The dimension of the vision encoder output
            (vision_output_dim from HuggingFace config).
        output_dim: int. The dimension of the text decoder embeddings.
        dtype: string or `keras.mixed_precision.DTypePolicy`. The dtype to use
            for model computations and weights.
    """

    def __init__(
        self,
        input_dim,
        output_dim,
        dtype=None,
        **kwargs,
    ):
        super().__init__(dtype=dtype, **kwargs)

        # === Config ===
        self.input_dim = input_dim
        self.output_dim = output_dim

        # === Layers ===
        # Single linear projection matching HuggingFace architecture
        self.projection = layers.Dense(
            self.output_dim,
            use_bias=True,
            name="projection",
        )

    def build(self, input_shape):
        self.projection.build(input_shape)
        super().build(input_shape)

    def call(self, inputs):
        return self.projection(inputs)

    def get_config(self):
        config = super().get_config()
        config.update(
            {
                "input_dim": self.input_dim,
                "output_dim": self.output_dim,
            }
        )
        return config
Contributor


critical

The Llama 3.2 Vision model uses a 2-layer MLP for its vision projector, but this implementation uses only a single Dense layer. This is incorrect and should be updated to a 2-layer MLP to match the original architecture. This will require adding an intermediate_dim parameter and an activation function (typically GELU).

class Llama3VisionProjector(keras.layers.Layer):
    """Vision projector for the Llama 3.2 Vision model.

    This layer projects vision encoder features into the text embedding space
    using a 2-layer MLP, enabling vision-language fusion.

    Args:
        input_dim: int. The dimension of the vision encoder output.
        output_dim: int. The dimension of the text decoder embeddings.
        intermediate_dim: int. The intermediate dimension of the MLP.
        dtype: string or `keras.mixed_precision.DTypePolicy`. The dtype to use
            for model computations and weights.
    """

    def __init__(
        self,
        input_dim,
        output_dim,
        intermediate_dim,
        dtype=None,
        **kwargs,
    ):
        super().__init__(dtype=dtype, **kwargs)

        # === Config ===
        self.input_dim = input_dim
        self.output_dim = output_dim
        self.intermediate_dim = intermediate_dim

        # === Layers ===
        # 2-layer MLP matching HuggingFace architecture
        self.dense_1 = layers.Dense(
            self.intermediate_dim,
            activation="gelu",
            name="dense_1",
        )
        self.dense_2 = layers.Dense(
            self.output_dim,
            name="dense_2",
        )

    def build(self, input_shape):
        self.dense_1.build(input_shape)
        dense_1_output_shape = list(input_shape)
        dense_1_output_shape[-1] = self.intermediate_dim
        self.dense_2.build(tuple(dense_1_output_shape))
        super().build(input_shape)

    def call(self, inputs):
        x = self.dense_1(inputs)
        return self.dense_2(x)

    def get_config(self):
        config = super().get_config()
        config.update(
            {
                "input_dim": self.input_dim,
                "output_dim": self.output_dim,
                "intermediate_dim": self.intermediate_dim,
            }
        )
        return config

Comment on lines 10 to 30
def test_projector_basics(self):
    self.run_layer_test(
        cls=Llama3VisionProjector,
        init_kwargs={
            "hidden_dim": 128,  # Vision Encoder output
            "output_dim": 256,  # Text Model input
            "intermediate_dim": 512,  # Internal MLP size
        },
        input_data=np.random.uniform(size=(2, 10, 128)).astype("float32"),
        expected_output_shape=(2, 10, 256),  # Should match output_dim
        # dense_1 (kernel, bias) + dense_2 (kernel, bias)
        expected_num_trainable_weights=4,
        run_precision_checks=False,
    )

def test_defaults(self):
    # Test that intermediate_dim defaults to output_dim if not set
    projector = Llama3VisionProjector(hidden_dim=32, output_dim=64)
    images = np.random.uniform(size=(2, 5, 32)).astype("float32")
    outputs = projector(images)
    self.assertEqual(outputs.shape, (2, 5, 64))
Contributor


critical

The tests for Llama3VisionProjector are incorrect and will fail. They seem to be based on a different version of the implementation.

  • test_projector_basics: The init_kwargs are incorrect. hidden_dim should be input_dim. With the corrected 2-layer MLP projector, intermediate_dim is a required argument.
  • test_defaults: This test is also incorrect. It's missing the required intermediate_dim argument and uses hidden_dim instead of input_dim.
    def test_projector_basics(self):
        self.run_layer_test(
            cls=Llama3VisionProjector,
            init_kwargs={
                "input_dim": 128,  # Vision Encoder output
                "output_dim": 256,  # Text Model input
                "intermediate_dim": 512,  # Internal MLP size
            },
            input_data=np.random.uniform(size=(2, 10, 128)).astype("float32"),
            expected_output_shape=(2, 10, 256),  # Should match output_dim
            # dense_1 (kernel, bias) + dense_2 (kernel, bias)
            expected_num_trainable_weights=4,
            run_precision_checks=False,
        )

    def test_projector_with_defaults(self):
        # Test with minimal arguments.
        projector = Llama3VisionProjector(
            input_dim=32, output_dim=64, intermediate_dim=128
        )
        images = np.random.uniform(size=(2, 5, 32)).astype("float32")
        outputs = projector(images)
        self.assertEqual(outputs.shape, (2, 5, 64))

Comment on lines 208 to 219
def convert_vision_projector_weights(keras_projector, hf_model):
    """Convert vision projector weights."""
    # HF uses a single nn.Linear, Keras now uses single Dense layer
    hf_proj = hf_model.model.multi_modal_projector

    # Single Linear layer
    keras_projector.projection.kernel.assign(
        hf_proj.weight.T.detach().cpu().float().numpy()
    )
    keras_projector.projection.bias.assign(
        hf_proj.bias.detach().cpu().float().numpy()
    )
Contributor


critical

This conversion logic for the vision projector is incorrect. It assumes a single linear layer, but the Llama 3.2 Vision projector is a 2-layer MLP. This needs to be updated to convert weights for both dense_1 and dense_2 layers, similar to the logic in keras_hub/src/utils/transformers/convert_llama3_vision.py.

Suggested change
def convert_vision_projector_weights(keras_projector, hf_model):
    """Convert vision projector weights."""
    # HF uses a single nn.Linear, Keras now uses single Dense layer
    hf_proj = hf_model.model.multi_modal_projector
    # Single Linear layer
    keras_projector.projection.kernel.assign(
        hf_proj.weight.T.detach().cpu().float().numpy()
    )
    keras_projector.projection.bias.assign(
        hf_proj.bias.detach().cpu().float().numpy()
    )

def convert_vision_projector_weights(keras_projector, hf_model):
    """Convert vision projector weights."""
    hf_proj = hf_model.model.multi_modal_projector
    # Dense 1 (input projection)
    keras_projector.dense_1.kernel.assign(
        hf_proj.linear_1.weight.T.detach().cpu().float().numpy()
    )
    keras_projector.dense_1.bias.assign(
        hf_proj.linear_1.bias.detach().cpu().float().numpy()
    )
    # Dense 2 (output projection)
    keras_projector.dense_2.kernel.assign(
        hf_proj.linear_2.weight.T.detach().cpu().float().numpy()
    )
    keras_projector.dense_2.bias.assign(
        hf_proj.linear_2.bias.detach().cpu().float().numpy()
    )
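
As a sanity check after conversion (a hedged sketch, assuming keras_projector and hf_model are already instantiated as in the function above and both are loaded in float32), one could compare the two projectors numerically:

import keras
import numpy as np
import torch

# Push identical random features through both projectors and compare.
features = np.random.uniform(
    size=(1, 10, keras_projector.input_dim)
).astype("float32")
keras_output = keras.ops.convert_to_numpy(keras_projector(features))
with torch.no_grad():
    hf_output = hf_model.model.multi_modal_projector(
        torch.from_numpy(features)
    )
np.testing.assert_allclose(
    keras_output, hf_output.float().numpy(), atol=1e-4
)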

# === Functional Model ===
image_input = keras.Input(
    shape=(vision_image_size, vision_image_size, vision_num_channels),
    name="images",
Contributor


medium

For consistency with other vision models in the repository, the input name for image data should be pixel_values instead of images. This aligns with the repository's style guide (line 67). You'll also need to update the inputs dictionary in the super().__init__ call on line 225, the example in the docstring on line 74, and the corresponding test file keras_hub/src/models/llama3/llama3_vision_backbone_test.py.

Suggested change
name="images",
name="pixel_values",
References
  1. The style guide specifies standardized names for model input arguments to ensure interoperability. For image models, pixel_values is the recommended name for the input tensor containing image data. (link)
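
To illustrate the effect of the rename, a call to the backbone would then key the image tensor as pixel_values. This is a hedged sketch: the other input names follow typical keras-hub conventions and the shapes are placeholders, not the model's actual config.

import numpy as np

# Hypothetical forward pass after the rename; sizes are illustrative only.
outputs = backbone(
    {
        "token_ids": np.ones((1, 16), dtype="int32"),
        "padding_mask": np.ones((1, 16), dtype="int32"),
        "pixel_values": np.zeros((1, 560, 560, 3), dtype="float32"),
    }
)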


if images is not None and self.image_converter is not None:
    images = self.image_converter(images)
output["images"] = images
Contributor


medium

The output key for preprocessed images should be pixel_values to match the expected input name of the Llama3VisionBackbone. This ensures consistency and aligns with the repository's style guide (line 67). This change will also require updating the tests in keras_hub/src/models/llama3/llama3_vision_preprocessor_test.py to check for the pixel_values key in the output dictionary.

Suggested change
output["images"] = images
output["pixel_values"] = images
References
  1. The style guide specifies standardized names for model input arguments to ensure interoperability. For image models, pixel_values is the recommended name for the input tensor containing image data. (link)

@sachinprasadhs self-requested a review on January 14, 2026 at 05:17
@Vivek1106-04
Author

Refactor: Vision Projector

I have updated the Llama3VisionProjector to use a single Dense layer instead of an MLP.

This change ensures architectural parity with the official Hugging Face implementation (MllamaMultiModalProjector), which uses a single nn.Linear layer:

self.multi_modal_projector = nn.Linear(config.vision_config.vision_output_dim, config.text_config.hidden_size, bias=True)

@sachinprasadhs added the "new model" label (for PRs that contribute a new model to the Keras Hub registry) on February 9, 2026
