
Implement Llama 3.2 Vision (11B/90B) Architecture#2521

Open
Vivek1106-04 wants to merge 18 commits into keras-team:master from Vivek1106-04:llama3-vision

Conversation

@Vivek1106-04

Description

This PR implements the full Llama 3.2 Vision architecture, supporting the 11B and 90B multimodal variants. Unlike the early-fusion Llama 3 text models, this implementation uses Gated Cross-Attention (Late Fusion) to inject vision features into specific text decoder layers.

Key Features Implemented:

  • Core Architecture: Llama3VisionBackbone rewritten to support Cross-Attention layers (instead of linear fusion).
  • Vision Encoder: Full Two-Stage Encoder support (SigLIP-based local layers + Global transformer layers).
  • Cross-Attention: Implemented Llama3VisionCrossAttention with gated injection of vision keys/values into specific decoder layers (e.g., layers 3, 8, 13, ...); see the sketch after this list.
  • Weight Support: Included convert_llama3_vision.py to port official Meta/HuggingFace weights.
  • Training Utils: Added freeze_for_vision_adapter_training() to support efficient fine-tuning (freezing text backbone + local encoder).
Fixes #2470: Implement Llama 3.2 Vision (Multimodal) Architecture
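
For illustration, here is a minimal sketch of the gated cross-attention idea from the list above (names, shapes, and the gating details are assumptions for exposition, not the PR's exact implementation):

from keras import layers, ops


class GatedCrossAttentionSketch(layers.Layer):
    """Illustrative gated cross-attention block (not the PR's exact code).

    Text hidden states attend over vision features, and the result is scaled
    by a learnable tanh gate initialized at zero, so at initialization the
    block is an identity on the text stream.
    """

    def __init__(self, num_heads, hidden_dim, **kwargs):
        super().__init__(**kwargs)
        self.attention = layers.MultiHeadAttention(
            num_heads=num_heads,
            key_dim=hidden_dim // num_heads,
        )
        self.gate = self.add_weight(
            name="attention_gate", shape=(1,), initializer="zeros"
        )

    def call(self, text_hidden_states, vision_features):
        attention_output = self.attention(
            query=text_hidden_states,
            key=vision_features,
            value=vision_features,
        )
        # Zero-initialized tanh gate: output equals the text stream at init.
        return text_hidden_states + ops.tanh(self.gate) * attention_output


# Cross-attention blocks are injected at fixed intervals, e.g. every fifth
# decoder layer starting at index 3 (assuming a 40-layer text decoder).
cross_attention_layer_indices = list(range(3, 40, 5))  # [3, 8, 13, ..., 38]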

@gemini-code-assist
Contributor

Summary of Changes

Hello @Vivek1106-04, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces the full Llama 3.2 Vision architecture, enabling advanced multimodal capabilities within the Keras ecosystem. It achieves this by integrating a sophisticated two-stage vision encoder with the Llama 3 text model through a novel Gated Cross-Attention mechanism, allowing for dynamic and context-aware fusion of visual and textual information. The implementation is modular, providing distinct Keras layers for each component, and includes essential utilities for seamless weight conversion from HuggingFace and efficient fine-tuning strategies.

Highlights

  • Full Llama 3.2 Vision Architecture: Implements the complete Llama 3.2 Vision architecture, supporting 11B and 90B multimodal variants.
  • Gated Cross-Attention (Late Fusion): Utilizes a Gated Cross-Attention mechanism for late fusion, injecting vision features into specific text decoder layers, a departure from early-fusion Llama 3 text models.
  • Two-Stage Vision Encoder: Includes a comprehensive two-stage vision encoder, combining SigLIP-based local layers with global transformer layers for robust visual feature extraction.
  • Modular Component Implementation: Introduces dedicated Keras layers for Llama3VisionBackbone, Llama3VisionCrossAttention, Llama3VisionEncoder, Llama3VisionProjector, Llama3VisionImageConverter, and Llama3VisionPreprocessor.
  • Weight Conversion and Fine-tuning Utilities: Provides convert_llama3_vision.py to port official Meta/HuggingFace weights and freeze_for_vision_adapter_training() for efficient fine-tuning by freezing the text backbone and local encoder (see the sketch after this list).
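
For illustration, the freezing strategy mentioned in the last bullet could look roughly like the following (a hedged sketch; the layer-name matching is a hypothetical convention, not the PR's actual freeze_for_vision_adapter_training() implementation):

def freeze_for_vision_adapter_training_sketch(backbone):
    """Illustrative freezing strategy (not the PR's actual code).

    Freezes the pretrained text decoder and local vision encoder, leaving
    only the fusion components (cross-attention layers, projector, and
    global vision layers) trainable.
    """
    for layer in backbone.layers:
        # Hypothetical naming convention for the trainable adapter pieces.
        is_adapter = any(
            key in layer.name
            for key in ("cross_attention", "projector", "global")
        )
        layer.trainable = is_adapter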



Contributor

@gemini-code-assist bot left a comment


Code Review

This PR introduces a comprehensive implementation of the Llama 3.2 Vision architecture. The code is well-structured and follows the repository's conventions for modularity and testing. I've identified a few areas for improvement:

  • There's a critical issue with the Llama3VisionProjector implementation, which uses a single linear layer instead of the required 2-layer MLP. This affects the model architecture, weight conversion script, and tests.
  • For consistency with the repository's style guide, the image input to the backbone should be named pixel_values instead of images. This requires minor updates to the backbone and preprocessor.

I've provided detailed suggestions to address these points. Once these are resolved, the implementation will be solid.

Comment on lines 8 to 58
class Llama3VisionProjector(keras.layers.Layer):
    """Vision projector for the Llama 3.2 Vision model.

    This layer projects vision encoder features into the text embedding space
    using a single linear projection, enabling vision-language fusion.

    Args:
        input_dim: int. The dimension of the vision encoder output
            (vision_output_dim from HuggingFace config).
        output_dim: int. The dimension of the text decoder embeddings.
        dtype: string or `keras.mixed_precision.DTypePolicy`. The dtype to use
            for model computations and weights.
    """

    def __init__(
        self,
        input_dim,
        output_dim,
        dtype=None,
        **kwargs,
    ):
        super().__init__(dtype=dtype, **kwargs)

        # === Config ===
        self.input_dim = input_dim
        self.output_dim = output_dim

        # === Layers ===
        # Single linear projection matching HuggingFace architecture
        self.projection = layers.Dense(
            self.output_dim,
            use_bias=True,
            name="projection",
        )

    def build(self, input_shape):
        self.projection.build(input_shape)
        super().build(input_shape)

    def call(self, inputs):
        return self.projection(inputs)

    def get_config(self):
        config = super().get_config()
        config.update(
            {
                "input_dim": self.input_dim,
                "output_dim": self.output_dim,
            }
        )
        return config
Contributor


critical

The Llama 3.2 Vision model uses a 2-layer MLP for its vision projector, but this implementation uses only a single Dense layer. This is incorrect and should be updated to a 2-layer MLP to match the original architecture. This will require adding an intermediate_dim parameter and an activation function (typically GELU).

class Llama3VisionProjector(keras.layers.Layer):
    """Vision projector for the Llama 3.2 Vision model.

    This layer projects vision encoder features into the text embedding space
    using a 2-layer MLP, enabling vision-language fusion.

    Args:
        input_dim: int. The dimension of the vision encoder output.
        output_dim: int. The dimension of the text decoder embeddings.
        intermediate_dim: int. The intermediate dimension of the MLP.
        dtype: string or `keras.mixed_precision.DTypePolicy`. The dtype to use
            for model computations and weights.
    """

    def __init__(
        self,
        input_dim,
        output_dim,
        intermediate_dim,
        dtype=None,
        **kwargs,
    ):
        super().__init__(dtype=dtype, **kwargs)

        # === Config ===
        self.input_dim = input_dim
        self.output_dim = output_dim
        self.intermediate_dim = intermediate_dim

        # === Layers ===
        # 2-layer MLP matching HuggingFace architecture
        self.dense_1 = layers.Dense(
            self.intermediate_dim,
            activation="gelu",
            name="dense_1",
        )
        self.dense_2 = layers.Dense(
            self.output_dim,
            name="dense_2",
        )

    def build(self, input_shape):
        self.dense_1.build(input_shape)
        dense_1_output_shape = list(input_shape)
        dense_1_output_shape[-1] = self.intermediate_dim
        self.dense_2.build(tuple(dense_1_output_shape))
        super().build(input_shape)

    def call(self, inputs):
        x = self.dense_1(inputs)
        return self.dense_2(x)

    def get_config(self):
        config = super().get_config()
        config.update(
            {
                "input_dim": self.input_dim,
                "output_dim": self.output_dim,
                "intermediate_dim": self.intermediate_dim,
            }
        )
        return config

Comment on lines 10 to 30
def test_projector_basics(self):
    self.run_layer_test(
        cls=Llama3VisionProjector,
        init_kwargs={
            "hidden_dim": 128,  # Vision Encoder output
            "output_dim": 256,  # Text Model input
            "intermediate_dim": 512,  # Internal MLP size
        },
        input_data=np.random.uniform(size=(2, 10, 128)).astype("float32"),
        expected_output_shape=(2, 10, 256),  # Should match output_dim
        # dense_1 (kernel, bias) + dense_2 (kernel, bias)
        expected_num_trainable_weights=4,
        run_precision_checks=False,
    )

def test_defaults(self):
    # Test that intermediate_dim defaults to output_dim if not set
    projector = Llama3VisionProjector(hidden_dim=32, output_dim=64)
    images = np.random.uniform(size=(2, 5, 32)).astype("float32")
    outputs = projector(images)
    self.assertEqual(outputs.shape, (2, 5, 64))
Contributor


critical

The tests for Llama3VisionProjector are incorrect and will fail. They seem to be based on a different version of the implementation.

  • test_projector_basics: The init_kwargs are incorrect. hidden_dim should be input_dim. With the corrected 2-layer MLP projector, intermediate_dim is a required argument.
  • test_defaults: This test is also incorrect. It's missing the required intermediate_dim argument and uses hidden_dim instead of input_dim.
    def test_projector_basics(self):
        self.run_layer_test(
            cls=Llama3VisionProjector,
            init_kwargs={
                "input_dim": 128,  # Vision Encoder output
                "output_dim": 256,  # Text Model input
                "intermediate_dim": 512,  # Internal MLP size
            },
            input_data=np.random.uniform(size=(2, 10, 128)).astype("float32"),
            expected_output_shape=(2, 10, 256),  # Should match output_dim
            # dense_1 (kernel, bias) + dense_2 (kernel, bias)
            expected_num_trainable_weights=4,
            run_precision_checks=False,
        )

    def test_projector_with_defaults(self):
        # Test with minimal arguments.
        projector = Llama3VisionProjector(
            input_dim=32, output_dim=64, intermediate_dim=128
        )
        images = np.random.uniform(size=(2, 5, 32)).astype("float32")
        outputs = projector(images)
        self.assertEqual(outputs.shape, (2, 5, 64))

Comment on lines 208 to 219
def convert_vision_projector_weights(keras_projector, hf_model):
    """Convert vision projector weights."""
    # HF uses a single nn.Linear, Keras now uses single Dense layer
    hf_proj = hf_model.model.multi_modal_projector

    # Single Linear layer
    keras_projector.projection.kernel.assign(
        hf_proj.weight.T.detach().cpu().float().numpy()
    )
    keras_projector.projection.bias.assign(
        hf_proj.bias.detach().cpu().float().numpy()
    )
Contributor


critical

This conversion logic for the vision projector is incorrect. It assumes a single linear layer, but the Llama 3.2 Vision projector is a 2-layer MLP. This needs to be updated to convert weights for both dense_1 and dense_2 layers, similar to the logic in keras_hub/src/utils/transformers/convert_llama3_vision.py.

Suggested change
def convert_vision_projector_weights(keras_projector, hf_model):
    """Convert vision projector weights."""
    # HF uses a single nn.Linear, Keras now uses single Dense layer
    hf_proj = hf_model.model.multi_modal_projector
    # Single Linear layer
    keras_projector.projection.kernel.assign(
        hf_proj.weight.T.detach().cpu().float().numpy()
    )
    keras_projector.projection.bias.assign(
        hf_proj.bias.detach().cpu().float().numpy()
    )

def convert_vision_projector_weights(keras_projector, hf_model):
    """Convert vision projector weights."""
    hf_proj = hf_model.model.multi_modal_projector
    # Dense 1 (input projection)
    keras_projector.dense_1.kernel.assign(
        hf_proj.linear_1.weight.T.detach().cpu().float().numpy()
    )
    keras_projector.dense_1.bias.assign(
        hf_proj.linear_1.bias.detach().cpu().float().numpy()
    )
    # Dense 2 (output projection)
    keras_projector.dense_2.kernel.assign(
        hf_proj.linear_2.weight.T.detach().cpu().float().numpy()
    )
    keras_projector.dense_2.bias.assign(
        hf_proj.linear_2.bias.detach().cpu().float().numpy()
    )
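
As a sanity check after conversion (a hedged sketch, assuming keras_projector and hf_model are already instantiated as in the function above and both are loaded in float32), one could compare the two projectors numerically:

import keras
import numpy as np
import torch

# Push identical random features through both projectors and compare.
features = np.random.uniform(
    size=(1, 10, keras_projector.input_dim)
).astype("float32")
keras_output = keras.ops.convert_to_numpy(keras_projector(features))
with torch.no_grad():
    hf_output = hf_model.model.multi_modal_projector(
        torch.from_numpy(features)
    )
np.testing.assert_allclose(
    keras_output, hf_output.float().numpy(), atol=1e-4
)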

# === Functional Model ===
image_input = keras.Input(
    shape=(vision_image_size, vision_image_size, vision_num_channels),
    name="images",
Contributor


medium

For consistency with other vision models in the repository, the input name for image data should be pixel_values instead of images. This aligns with the repository's style guide (line 67). You'll also need to update the inputs dictionary in the super().__init__ call on line 225, the example in the docstring on line 74, and the corresponding test file keras_hub/src/models/llama3/llama3_vision_backbone_test.py.

Suggested change
name="images",
name="pixel_values",
References
  1. The style guide specifies standardized names for model input arguments to ensure interoperability. For image models, pixel_values is the recommended name for the input tensor containing image data. (link)
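
To illustrate the effect of the rename, a call to the backbone would then key the image tensor as pixel_values. This is a hedged sketch: the other input names follow typical keras-hub conventions and the shapes are placeholders, not the model's actual config.

import numpy as np

# Hypothetical forward pass after the rename; sizes are illustrative only.
outputs = backbone(
    {
        "token_ids": np.ones((1, 16), dtype="int32"),
        "padding_mask": np.ones((1, 16), dtype="int32"),
        "pixel_values": np.zeros((1, 560, 560, 3), dtype="float32"),
    }
)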


if images is not None and self.image_converter is not None:
    images = self.image_converter(images)
output["images"] = images
Contributor


medium

The output key for preprocessed images should be pixel_values to match the expected input name of the Llama3VisionBackbone. This ensures consistency and aligns with the repository's style guide (line 67). This change will also require updating the tests in keras_hub/src/models/llama3/llama3_vision_preprocessor_test.py to check for the pixel_values key in the output dictionary.

Suggested change
output["images"] = images
output["pixel_values"] = images
References
  1. The style guide specifies standardized names for model input arguments to ensure interoperability. For image models, pixel_values is the recommended name for the input tensor containing image data. (link)

@sachinprasadhs self-requested a review on January 14, 2026 at 05:17
@Vivek1106-04
Author

Refactor: Vision Projector

I have updated the Llama3VisionProjector to use a single Dense layer instead of an MLP.

This change ensures architectural parity with the official Hugging Face implementation (MllamaMultiModalProjector), which uses a single nn.Linear layer:

self.multi_modal_projector = nn.Linear(config.vision_config.vision_output_dim, config.text_config.hidden_size, bias=True)

@sachinprasadhs added the "new model" label (for PRs that contribute a new model to the Keras Hub registry) on February 9, 2026
