
Adding ModernBert model#2477

Closed
maitry63 wants to merge 4 commits into keras-team:master from maitry63:add-modernbert

Conversation

@maitry63

Added ModernBert Implementation Files

Reference

Refer to the previously closed PR.

  • I have added all the necessary unit tests for my change.
  • I have verified that my change does not break existing code and works with all backends (TensorFlow, JAX, and PyTorch).
  • My PR is based on the latest changes of the main branch (if unsure, rebase the code).
  • I have followed the Keras Hub Model contribution guidelines in making these changes.
  • I have followed the Keras Hub API design guidelines in making these changes.
  • I have signed the Contributor License Agreement.

@gemini-code-assist
Contributor

Summary of Changes

Hello @maitry63, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands Keras Hub's model offerings by introducing the ModernBert model. This new model is designed to leverage recent advancements in transformer architectures, providing a more efficient and performant alternative to traditional BERT implementations. The changes encompass the full model lifecycle, from core architectural components and a dedicated preprocessor to a Masked Language Model task for pre-training. Additionally, the PR includes updates to existing GPT-OSS model configurations, adding new safeguard variants and their checkpoint conversion support.

Highlights

  • ModernBert Model Integration: The core ModernBert model, a modernized BERT variant, has been added to Keras Hub, providing a new option for natural language processing tasks.
  • Architectural Enhancements: ModernBert incorporates advanced features such as Rotary Positional Embeddings (RoPE) and RMSNorm for improved efficiency and performance compared to traditional BERT architectures.
  • Comprehensive Components: The implementation includes the ModernBertBackbone, specialized ModernBertEncoderLayer, ModernBertAttention, ModernBertMLP, a ModernBertMaskedLM task for pre-training, and a ModernBertPreprocessor for data handling.
  • Unit Test Coverage: New unit tests have been added for the ModernBertBackbone and ModernBertMaskedLM to ensure the correctness, stability, and proper functionality of the new model components.
  • GPT-OSS Preset Expansion: Existing GPT-OSS model presets have been updated to include new 20B and 120B 'safeguard' versions, along with their corresponding checkpoint conversion mappings.



@gemini-code-assist gemini-code-assist bot left a comment
Code Review

This pull request introduces the ModernBert model, including its backbone, layers, a masked language modeling task, and a preprocessor. The implementation follows the repository's modular structure well. However, there are several critical issues that need to be addressed before merging.

Key issues include:

  • A critical bug in the ModernBertAttention layer invocation that will cause a TypeError.
  • Incorrect get_config implementation in ModernBertMaskedLM that breaks model serialization.
  • An unused dropout layer and redundant parameter assignments.
  • Several violations of the repository's contribution guidelines, such as missing test files for modernbert_layers.py and modernbert_preprocessor.py, and the absence of validation Colab notebooks.

I've provided specific comments and suggestions to fix these issues. Please also ensure that test coverage is added for the new layers and preprocessor, and that validation notebooks are provided to demonstrate numerical equivalence, as per the contribution guidelines.

    activation=gelu_approximate, layer_norm_epsilon=1e-5, **kwargs):
    super().__init__(**kwargs)
    self.attn_norm = layers.LayerNormalization(epsilon=layer_norm_epsilon, rms_scaling=True)
    self.attn = ModernBertAttention(hidden_size, num_heads, rotary_embedding)
critical

There is a critical issue in how ModernBertAttention is instantiated. The __init__ signature is (self, hidden_size, num_heads, head_dim, rotary_embedding=None, **kwargs), but it's being called with rotary_embedding passed to the head_dim argument. This will cause a TypeError. You need to calculate head_dim and pass it explicitly. Using keyword arguments is recommended to prevent such errors.

Suggested change
    self.attn = ModernBertAttention(hidden_size, num_heads, rotary_embedding)
    head_dim = hidden_size // num_heads
    self.attn = ModernBertAttention(
        hidden_size,
        num_heads,
        head_dim,
        rotary_embedding=rotary_embedding,
    )

Comment on lines 51 to 57
    def get_config(self):
        config = super().get_config()
        config.update({
            "backbone": self.backbone,
            "preprocessor": self.preprocessor,
        })
        return config
critical

This get_config implementation is incorrect and will break model serialization. It calls super().get_config(), which correctly returns a serializable config with backbone and preprocessor, but then it overwrites these with the raw, non-serializable layer objects. Since this class does not have any additional configuration parameters, you can simply remove this method and inherit get_config from the Task base class.

References
  1. Write a get_config() which chains to super. (link)
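A minimal sketch of the pattern that reference describes, using an illustrative layer (the class and its `hidden_dim`/`dropout` parameters are hypothetical, not from this PR): chain to `super().get_config()` and add only plain, serializable values, never raw layer objects.

```python
from keras import layers

class ExampleLayer(layers.Layer):
    """Illustrative layer showing the get_config pattern the review describes."""

    def __init__(self, hidden_dim=64, dropout=0.1, **kwargs):
        super().__init__(**kwargs)
        self.hidden_dim = hidden_dim
        self.dropout = dropout

    def get_config(self):
        # Chain to super, then add only plain, serializable values.
        config = super().get_config()
        config.update({
            "hidden_dim": self.hidden_dim,
            "dropout": self.dropout,
        })
        return config

config = ExampleLayer(hidden_dim=128).get_config()
restored = ExampleLayer.from_config(config)  # round-trips cleanly
```

A class with no extra constructor arguments, like the `ModernBertMaskedLM` here, needs no override at all; inheriting `get_config` from the base `Task` class is enough.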

Comment on lines 92 to 97
    def call(self, x, padding_mask=None):
        # Attention residual path
        x = x + self.attn(self.attn_norm(x), padding_mask=padding_mask)
        # MLP residual path
        x = x + self.mlp(self.mlp_norm(x))
        return x
high

The dropout_layer is initialized in __init__ but is not used here. To prevent overfitting, dropout should be applied to the outputs of the attention and MLP layers within the residual connections.

    def call(self, x, padding_mask=None):
        # Attention Residual path
        x = x + self.dropout_layer(self.attn(self.attn_norm(x), padding_mask=padding_mask))
        # MLP Residual path
        x = x + self.dropout_layer(self.mlp(self.mlp_norm(x)))
        return x

Comment on lines 20 to 21
return_tensors="tf",
return_tensors= "ops",
high

The return_tensors argument is specified twice. The second value ("ops") will overwrite the first ("tf"). Please remove the duplicate argument. Using "ops" is correct for backend-agnostic code.

Suggested change
return_tensors="tf",
return_tensors= "ops",
return_tensors="ops",

    rotary_max_wavelength=160000,
    layer_norm_epsilon=1e-5,
    dtype=None,
    **kwargs,) -> None:
medium

The repository style guide (line 373) specifies that type hints should not be used in function signatures. Please remove the -> None from the __init__ method signature.

Suggested change
**kwargs,) -> None:
**kwargs):
References
  1. KerasHub does not use type hints in function signatures or __init__ methods. Default values are okay(example: None, "string_value", etc). (link)

Comment on lines 134 to 141
self.vocabulary_size = vocabulary_size
self.hidden_dim = hidden_dim
self.intermediate_dim = intermediate_dim
self.num_layers = num_layers
self.num_heads = num_heads
self.dropout = dropout
self.rotary_max_wavelength = rotary_max_wavelength
self.layer_norm_epsilon = layer_norm_epsilon
medium

These attributes are already assigned at the beginning of the __init__ method (lines 63-70). This block of code is redundant and can be removed to improve maintainability.

Comment on lines 8 to 84

def __init__(self, hidden_size, intermediate_size, activation=gelu_approximate, **kwargs):
    super().__init__(**kwargs)
    self.wi_0 = layers.Dense(intermediate_size, name="wi_0")
    self.wi_1 = layers.Dense(intermediate_size, name="wi_1")
    self.wo = layers.Dense(hidden_size, name="wo")
    self.activation = activation

def call(self, x):
    return self.wo(self.activation(self.wi_0(x)) * self.wi_1(x))

class ModernBertAttention(layers.Layer):
    """ModernBERT Attention with Rotary Positional Embeddings (RoPE)."""

    def __init__(self, hidden_size, num_heads, head_dim, rotary_embedding=None, **kwargs):
        super().__init__(**kwargs)
        self.num_heads = num_heads
        self.hidden_size = hidden_size
        self.head_dim = head_dim
        self.key_dim = head_dim
        self.rotary_embedding = rotary_embedding
        self.qkv = layers.Dense(hidden_size * 3, name="qkv")
        self.out_dense = layers.Dense(hidden_size, name="out_dense")

    def call(self, x, padding_mask=None):
        batch_size = ops.shape(x)[0]
        seq_len = ops.shape(x)[1]

        qkv = self.qkv(x)  # (batch, seq, hidden * 3)
        # Reshape to (batch, seq, 3, heads, head_dim).
        qkv = ops.reshape(qkv, (batch_size, seq_len, 3, self.num_heads, self.head_dim))
        q, k, v = ops.unstack(qkv, axis=2)

        if self.rotary_embedding is not None:
            q, k = self.rotary_embedding(q), self.rotary_embedding(k)

        q = ops.transpose(q, (0, 2, 1, 3))
        k = ops.transpose(k, (0, 2, 3, 1))  # Ready for matmul.
        v = ops.transpose(v, (0, 2, 1, 3))

        scale = ops.cast(ops.sqrt(ops.cast(self.key_dim, x.dtype)), x.dtype)
        scores = ops.matmul(q, k) / scale

        if padding_mask is not None:
            m = ops.cast(padding_mask[:, None, None, :], scores.dtype)
            scores = scores + (1.0 - m) * -1e9

        attn = ops.softmax(scores, axis=-1)
        out = ops.matmul(attn, v)
        out = ops.transpose(out, (0, 2, 1, 3))
        # Flatten heads and head_dim back into hidden_dim.
        out = ops.reshape(out, (batch_size, seq_len, self.hidden_size))

        return self.out_dense(out)

class ModernBertEncoderLayer(layers.Layer):
    """ModernBERT Encoder Layer implementation.

    This layer implements a modernized Transformer block featuring:
    1. Pre-normalization (norm-first architecture).
    2. RMSNorm scaling (LayerNorm without additive bias).
    3. Gated Linear Unit (GeGLU) activation in the MLP.

    Args:
        hidden_size: int. Dimensionality of the encoder layer.
        intermediate_size: int. Dimensionality of the MLP intermediate layer.
        num_heads: int. Number of attention heads.
        rotary_embedding: `RotaryEmbedding` layer. Optional rotary positional encoding.
        activation: function. Activation function for the MLP.
        layer_norm_epsilon: float. Epsilon for the LayerNorm layers.
    """

    def __init__(self, hidden_size, intermediate_size, num_heads,
                 rotary_embedding=None,
                 dropout=0.0,
                 activation=gelu_approximate, layer_norm_epsilon=1e-5, **kwargs):
medium

For consistency with ModernBertBackbone and the repository's style guide examples (e.g., line 93), the argument names hidden_size and intermediate_size should be hidden_dim and intermediate_dim respectively across all classes in this file (ModernBertMLP, ModernBertAttention, ModernBertEncoderLayer). Please update the __init__ signatures and the class bodies accordingly.

References
  1. Use standardized names for arg names that should be consistent with other models in the repository (link)
  2. All class names must use CapWords (PascalCase). (link)

Comment on lines 27 to 32
probability_matrix = ops.random.uniform(ops.shape(input_ids))
mask = (probability_matrix < self.mlm_probability) & (input_ids != self.tokenizer.pad_token_id)

labels = ops.where(mask, labels, -100)
mask_token_mask = ops.random.uniform(ops.shape(input_ids)) < 0.8
input_ids = ops.where(mask & mask_token_mask, self.tokenizer.mask_token_id, input_ids)
medium

This implementation of Masked Language Modeling (MLM) masking is a simplified version of the original BERT strategy. Currently, it replaces 80% of the selected tokens with [MASK] and leaves the other 20% unchanged. The standard BERT approach is to replace 80% with [MASK], 10% with a random token, and keep 10% as is. While this simplification might be intentional, it deviates from the standard and could impact pre-training. If the goal is to follow the original BERT pre-training strategy, consider implementing the full 80-10-10 rule.

@maitry63
Author

Closing this in favor of a new, cleaner PR with the updated dynamic architecture and refactored codebase. Please follow the new PR here: [https://github.com//pull/2518]

@maitry63 maitry63 closed this Jan 12, 2026