
Adding ModernBert model#2477

Closed
maitry63 wants to merge 4 commits into keras-team:master from maitry63:add-modernbert

Conversation

@maitry63

Added ModernBert Implementation Files

Reference

Refer to the previously closed PR.

  • I have added all the necessary unit tests for my change.
  • I have verified that my change does not break existing code and works with all backends (TensorFlow, JAX, and PyTorch).
  • My PR is based on the latest changes of the main branch (if unsure, rebase the code).
  • I have followed the Keras Hub Model contribution guidelines in making these changes.
  • I have followed the Keras Hub API design guidelines in making these changes.
  • I have signed the Contributor License Agreement.

@gemini-code-assist
Contributor

Summary of Changes

Hello @maitry63, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands Keras Hub's model offerings by introducing the ModernBert model. This new model is designed to leverage recent advancements in transformer architectures, providing a more efficient and performant alternative to traditional BERT implementations. The changes encompass the full model lifecycle, from core architectural components and a dedicated preprocessor to a Masked Language Model task for pre-training. Additionally, the PR includes updates to existing GPT-OSS model configurations, adding new safeguard variants and their checkpoint conversion support.

Highlights

  • ModernBert Model Integration: The core ModernBert model, a modernized BERT variant, has been added to Keras Hub, providing a new option for natural language processing tasks.
  • Architectural Enhancements: ModernBert incorporates advanced features such as Rotary Positional Embeddings (RoPE) and RMSNorm for improved efficiency and performance compared to traditional BERT architectures.
  • Comprehensive Components: The implementation includes the ModernBertBackbone, specialized ModernBertEncoderLayer, ModernBertAttention, ModernBertMLP, a ModernBertMaskedLM task for pre-training, and a ModernBertPreprocessor for data handling.
  • Unit Test Coverage: New unit tests have been added for the ModernBertBackbone and ModernBertMaskedLM to ensure the correctness, stability, and proper functionality of the new model components.
  • GPT-OSS Preset Expansion: Existing GPT-OSS model presets have been updated to include new 20B and 120B 'safeguard' versions, along with their corresponding checkpoint conversion mappings.



@gemini-code-assist gemini-code-assist bot left a comment
Code Review

This pull request introduces the ModernBert model, including its backbone, layers, a masked language modeling task, and a preprocessor. The implementation follows the repository's modular structure well. However, there are several critical issues that need to be addressed before merging.

Key issues include:

  • A critical bug in the ModernBertAttention layer invocation that will cause a TypeError.
  • Incorrect get_config implementation in ModernBertMaskedLM that breaks model serialization.
  • An unused dropout layer and redundant parameter assignments.
  • Several violations of the repository's contribution guidelines, such as missing test files for modernbert_layers.py and modernbert_preprocessor.py, and the absence of validation Colab notebooks.

I've provided specific comments and suggestions to fix these issues. Please also ensure that test coverage is added for the new layers and preprocessor, and that validation notebooks are provided to demonstrate numerical equivalence, as per the contribution guidelines.

    activation=gelu_approximate, layer_norm_epsilon=1e-5, **kwargs):
    super().__init__(**kwargs)
    self.attn_norm = layers.LayerNormalization(epsilon=layer_norm_epsilon, rms_scaling=True)
    self.attn = ModernBertAttention(hidden_size, num_heads, rotary_embedding)
critical

There is a critical issue in how ModernBertAttention is instantiated. The __init__ signature is (self, hidden_size, num_heads, head_dim, rotary_embedding=None, **kwargs), but it's being called with rotary_embedding passed to the head_dim argument. This will cause a TypeError. You need to calculate head_dim and pass it explicitly. Using keyword arguments is recommended to prevent such errors.

Suggested change
    self.attn = ModernBertAttention(hidden_size, num_heads, rotary_embedding)
    head_dim = hidden_size // num_heads
    self.attn = ModernBertAttention(
        hidden_size,
        num_heads,
        head_dim,
        rotary_embedding=rotary_embedding,
    )

Comment on lines 51 to 57
    def get_config(self):
        config = super().get_config()
        config.update({
            "backbone": self.backbone,
            "preprocessor": self.preprocessor,
        })
        return config
critical

This get_config implementation is incorrect and will break model serialization. It calls super().get_config(), which correctly returns a serializable config with backbone and preprocessor, but then it overwrites these with the raw, non-serializable layer objects. Since this class does not have any additional configuration parameters, you can simply remove this method and inherit get_config from the Task base class.

References
  1. Write a get_config() which chains to super. (link)
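A minimal sketch of the pattern that reference describes, using an illustrative layer (the class and its `hidden_dim`/`dropout` parameters are hypothetical, not from this PR): chain to `super().get_config()` and add only plain, serializable values, never raw layer objects.

```python
from keras import layers

class ExampleLayer(layers.Layer):
    """Illustrative layer showing the get_config pattern the review describes."""

    def __init__(self, hidden_dim=64, dropout=0.1, **kwargs):
        super().__init__(**kwargs)
        self.hidden_dim = hidden_dim
        self.dropout = dropout

    def get_config(self):
        # Chain to super, then add only plain, serializable values.
        config = super().get_config()
        config.update({
            "hidden_dim": self.hidden_dim,
            "dropout": self.dropout,
        })
        return config

config = ExampleLayer(hidden_dim=128).get_config()
restored = ExampleLayer.from_config(config)  # round-trips cleanly
```

A class with no extra constructor arguments, like the `ModernBertMaskedLM` here, needs no override at all; inheriting `get_config` from the base `Task` class is enough.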

Comment on lines 92 to 97
    def call(self, x, padding_mask=None):
        # Attention residual path
        x = x + self.attn(self.attn_norm(x), padding_mask=padding_mask)
        # MLP residual path
        x = x + self.mlp(self.mlp_norm(x))
        return x
high

The dropout_layer is initialized in __init__ but is not used here. To prevent overfitting, dropout should be applied to the outputs of the attention and MLP layers within the residual connections.

    def call(self, x, padding_mask=None):
        # Attention Residual path
        x = x + self.dropout_layer(self.attn(self.attn_norm(x), padding_mask=padding_mask))
        # MLP Residual path
        x = x + self.dropout_layer(self.mlp(self.mlp_norm(x)))
        return x

Comment on lines 20 to 21
return_tensors="tf",
return_tensors= "ops",
high

The return_tensors argument is specified twice. The second value ("ops") will overwrite the first ("tf"). Please remove the duplicate argument. Using "ops" is correct for backend-agnostic code.

Suggested change
return_tensors="tf",
return_tensors= "ops",
return_tensors="ops",

    rotary_max_wavelength=160000,
    layer_norm_epsilon=1e-5,
    dtype=None,
    **kwargs,) -> None:
medium

The repository style guide (line 373) specifies that type hints should not be used in function signatures. Please remove the -> None from the __init__ method signature.

Suggested change
**kwargs,) -> None:
**kwargs):
References
  1. KerasHub does not use type hints in function signatures or __init__ methods. Default values are okay(example: None, "string_value", etc). (link)

Comment on lines 134 to 141
self.vocabulary_size = vocabulary_size
self.hidden_dim = hidden_dim
self.intermediate_dim = intermediate_dim
self.num_layers = num_layers
self.num_heads = num_heads
self.dropout = dropout
self.rotary_max_wavelength = rotary_max_wavelength
self.layer_norm_epsilon = layer_norm_epsilon
medium

These attributes are already assigned at the beginning of the __init__ method (lines 63-70). This block of code is redundant and can be removed to improve maintainability.

Comment on lines 8 to 84

def __init__(self, hidden_size, intermediate_size, activation=gelu_approximate, **kwargs):
    super().__init__(**kwargs)
    self.wi_0 = layers.Dense(intermediate_size, name="wi_0")
    self.wi_1 = layers.Dense(intermediate_size, name="wi_1")
    self.wo = layers.Dense(hidden_size, name="wo")
    self.activation = activation

def call(self, x):
    return self.wo(self.activation(self.wi_0(x)) * self.wi_1(x))

class ModernBertAttention(layers.Layer):
    """ModernBERT Attention with Rotary Positional Embeddings (RoPE)."""

    def __init__(self, hidden_size, num_heads, head_dim, rotary_embedding=None, **kwargs):
        super().__init__(**kwargs)
        self.num_heads = num_heads
        self.hidden_size = hidden_size
        self.head_dim = head_dim
        self.key_dim = head_dim
        self.rotary_embedding = rotary_embedding
        self.qkv = layers.Dense(hidden_size * 3, name="qkv")
        self.out_dense = layers.Dense(hidden_size, name="out_dense")

    def call(self, x, padding_mask=None):
        batch_size = ops.shape(x)[0]
        seq_len = ops.shape(x)[1]

        qkv = self.qkv(x)  # (batch, seq, hidden * 3)
        # Reshape to (batch, seq, 3, heads, head_dim).
        qkv = ops.reshape(qkv, (batch_size, seq_len, 3, self.num_heads, self.head_dim))
        q, k, v = ops.unstack(qkv, axis=2)

        if self.rotary_embedding is not None:
            q, k = self.rotary_embedding(q), self.rotary_embedding(k)

        q = ops.transpose(q, (0, 2, 1, 3))
        k = ops.transpose(k, (0, 2, 3, 1))  # Ready for matmul.
        v = ops.transpose(v, (0, 2, 1, 3))

        scale = ops.cast(ops.sqrt(ops.cast(self.key_dim, x.dtype)), x.dtype)
        scores = ops.matmul(q, k) / scale

        if padding_mask is not None:
            m = ops.cast(padding_mask[:, None, None, :], scores.dtype)
            scores = scores + (1.0 - m) * -1e9

        attn = ops.softmax(scores, axis=-1)
        out = ops.matmul(attn, v)
        out = ops.transpose(out, (0, 2, 1, 3))
        # Flatten heads and head_dim back into hidden_dim.
        out = ops.reshape(out, (batch_size, seq_len, self.hidden_size))

        return self.out_dense(out)

class ModernBertEncoderLayer(layers.Layer):
    """ModernBERT Encoder Layer implementation.

    This layer implements a modernized Transformer block featuring:
    1. Pre-normalization (norm-first architecture).
    2. RMSNorm scaling (LayerNorm without additive bias).
    3. Gated Linear Unit (GeGLU) activation in the MLP.

    Args:
        hidden_size: int. Dimensionality of the encoder layer.
        intermediate_size: int. Dimensionality of the MLP intermediate layer.
        num_heads: int. Number of attention heads.
        rotary_embedding: `RotaryEmbedding` layer. Optional rotary positional encoding.
        activation: function. Activation function for the MLP.
        layer_norm_epsilon: float. Epsilon for the LayerNorm layers.
    """

    def __init__(self, hidden_size, intermediate_size, num_heads,
                 rotary_embedding=None,
                 dropout=0.0,
                 activation=gelu_approximate, layer_norm_epsilon=1e-5, **kwargs):
medium

For consistency with ModernBertBackbone and the repository's style guide examples (e.g., line 93), the argument names hidden_size and intermediate_size should be hidden_dim and intermediate_dim respectively across all classes in this file (ModernBertMLP, ModernBertAttention, ModernBertEncoderLayer). Please update the __init__ signatures and the class bodies accordingly.

References
  1. Use standardized names for arg names that should be consistent with other models in the repository (link)
  2. All class names must use CapWords (PascalCase). (link)

Comment on lines 27 to 32
probability_matrix = ops.random.uniform(ops.shape(input_ids))
mask = (probability_matrix < self.mlm_probability) & (input_ids != self.tokenizer.pad_token_id)

labels = ops.where(mask, labels, -100)
mask_token_mask = ops.random.uniform(ops.shape(input_ids)) < 0.8
input_ids = ops.where(mask & mask_token_mask, self.tokenizer.mask_token_id, input_ids)
medium

This implementation of Masked Language Modeling (MLM) masking is a simplified version of the original BERT strategy. Currently, it replaces 80% of the selected tokens with [MASK] and leaves the other 20% unchanged. The standard BERT approach is to replace 80% with [MASK], 10% with a random token, and keep 10% as is. While this simplification might be intentional, it deviates from the standard and could impact pre-training. If the goal is to follow the original BERT pre-training strategy, consider implementing the full 80-10-10 rule.

@maitry63
Author

Closing this in favor of a new, cleaner PR with the updated dynamic architecture and refactored codebase. Please follow the new PR here: [https://github.com//pull/2518]

@maitry63 maitry63 closed this Jan 12, 2026