Conversation
Summary of Changes

This pull request significantly expands Keras Hub's model offerings by introducing the ModernBert model. This new model is designed to leverage recent advancements in transformer architectures, providing a more efficient and performant alternative to traditional BERT implementations. The changes encompass the full model lifecycle, from core architectural components and a dedicated preprocessor to a Masked Language Model task for pre-training. Additionally, the PR includes updates to existing GPT-OSS model configurations, adding new safeguard variants and their checkpoint conversion support.
Code Review
This pull request introduces the ModernBert model, including its backbone, layers, a masked language modeling task, and a preprocessor. The implementation follows the repository's modular structure well. However, there are several critical issues that need to be addressed before merging.
Key issues include:

- A critical bug in the `ModernBertAttention` layer invocation that will cause a `TypeError`.
- An incorrect `get_config` implementation in `ModernBertMaskedLM` that breaks model serialization.
- An unused dropout layer and redundant parameter assignments.
- Several violations of the repository's contribution guidelines, such as missing test files for `modernbert_layers.py` and `modernbert_preprocessor.py`, and the absence of validation Colab notebooks.
I've provided specific comments and suggestions to fix these issues. Please also ensure that test coverage is added for the new layers and preprocessor, and that validation notebooks are provided to demonstrate numerical equivalence, as per the contribution guidelines.
```python
                 activation=gelu_approximate, layer_norm_epsilon=1e-5, **kwargs):
        super().__init__(**kwargs)
        self.attn_norm = layers.LayerNormalization(epsilon=layer_norm_epsilon, rms_scaling=True)
        self.attn = ModernBertAttention(hidden_size, num_heads, rotary_embedding)
```
There is a critical issue in how ModernBertAttention is instantiated. The `__init__` signature is `(self, hidden_size, num_heads, head_dim, rotary_embedding=None, **kwargs)`, but it is called with `rotary_embedding` passed positionally into the `head_dim` parameter, so `head_dim` is bound to the rotary embedding layer. This will cause a `TypeError` when the layer is called. You need to calculate `head_dim` and pass it explicitly. Using keyword arguments is recommended to prevent such errors.
Suggested change:

```diff
-        self.attn = ModernBertAttention(hidden_size, num_heads, rotary_embedding)
+        head_dim = hidden_size // num_heads
+        self.attn = ModernBertAttention(
+            hidden_size,
+            num_heads,
+            head_dim,
+            rotary_embedding=rotary_embedding,
+        )
```
```python
    def get_config(self):
        config = super().get_config()
        config.update({
            "backbone": self.backbone,
            "preprocessor": self.preprocessor,
        })
        return config
```
This get_config implementation is incorrect and will break model serialization. It calls super().get_config(), which correctly returns a serializable config with backbone and preprocessor, but then it overwrites these with the raw, non-serializable layer objects. Since this class does not have any additional configuration parameters, you can simply remove this method and inherit get_config from the Task base class.
References

- Write a `get_config()` which chains to super. (link)
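The correct pattern, when a subclass does add its own constructor arguments, is to chain to `super().get_config()` and append only serializable values. Here is a minimal plain-Python sketch of that chaining pattern; the class names and the `dropout` parameter are illustrative, not taken from the PR:

```python
class Base:
    def __init__(self, units=8):
        self.units = units

    def get_config(self):
        # Base class serializes its own parameters.
        return {"units": self.units}


class Child(Base):
    def __init__(self, units=8, dropout=0.1):
        super().__init__(units=units)
        self.dropout = dropout

    def get_config(self):
        # Chain to super, then add only this class's own serializable
        # parameters -- never raw layer objects.
        config = super().get_config()
        config.update({"dropout": self.dropout})
        return config
```

A class that adds no parameters of its own (as in this PR) should simply not override `get_config` at all.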
```python
    def call(self, x, padding_mask=None):
        # Attention Residual path
        x = x + self.attn(self.attn_norm(x), padding_mask=padding_mask)
        # MLP Residual path
        x = x + self.mlp(self.mlp_norm(x))
        return x
```
The dropout_layer is initialized in __init__ but is not used here. To prevent overfitting, dropout should be applied to the outputs of the attention and MLP layers within the residual connections.
```python
    def call(self, x, padding_mask=None):
        # Attention Residual path
        x = x + self.dropout_layer(self.attn(self.attn_norm(x), padding_mask=padding_mask))
        # MLP Residual path
        x = x + self.dropout_layer(self.mlp(self.mlp_norm(x)))
        return x
```

```diff
-        return_tensors="tf",
+        return_tensors="ops",
```
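For intuition on why applying dropout inside the residual branches is harmless at inference time, here is a framework-free NumPy sketch of inverted dropout in a residual connection; the function name and signature are illustrative, not part of the PR:

```python
import numpy as np


def residual_branch(x, sublayer_out, rate, training, rng=None):
    """Compute x + dropout(sublayer_out) with inverted dropout.

    When not training (or rate == 0), this is an exact identity over
    the sublayer output, so inference behavior is unchanged.
    """
    if not training or rate == 0.0:
        return x + sublayer_out
    rng = rng or np.random.default_rng(0)
    # Zero out activations with probability `rate`.
    keep = (rng.random(sublayer_out.shape) >= rate).astype(sublayer_out.dtype)
    # Scale survivors by 1/(1-rate) so the expected activation is unchanged.
    return x + sublayer_out * keep / (1.0 - rate)
```

This mirrors what `keras.layers.Dropout` does internally: a no-op unless called in training mode.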
```python
        rotary_max_wavelength=160000,
        layer_norm_epsilon=1e-5,
        dtype=None,
        **kwargs,) -> None:
```
The repository style guide (line 373) specifies that type hints should not be used in function signatures. Please remove the -> None from the __init__ method signature.
Suggested change:

```diff
-        **kwargs,) -> None:
+        **kwargs):
```
References

- KerasHub does not use type hints in function signatures or `__init__` methods. Default values are okay (example: `None`, `"string_value"`, etc). (link)
```python
        self.vocabulary_size = vocabulary_size
        self.hidden_dim = hidden_dim
        self.intermediate_dim = intermediate_dim
        self.num_layers = num_layers
        self.num_heads = num_heads
        self.dropout = dropout
        self.rotary_max_wavelength = rotary_max_wavelength
        self.layer_norm_epsilon = layer_norm_epsilon
```
```python
    def __init__(self, hidden_size, intermediate_size, activation=gelu_approximate, **kwargs):
        super().__init__(**kwargs)
        self.wi_0 = layers.Dense(intermediate_size, name="wi_0")
        self.wi_1 = layers.Dense(intermediate_size, name="wi_1")
        self.wo = layers.Dense(hidden_size, name="wo")
        self.activation = activation

    def call(self, x):
        return self.wo(self.activation(self.wi_0(x)) * self.wi_1(x))


class ModernBertAttention(layers.Layer):
    """ModernBERT Attention with Rotary Positional Embeddings (RoPE)."""

    def __init__(self, hidden_size, num_heads, head_dim, rotary_embedding=None, **kwargs):
        super().__init__(**kwargs)
        self.num_heads = num_heads
        self.hidden_size = hidden_size
        self.head_dim = head_dim
        self.key_dim = head_dim
        self.rotary_embedding = rotary_embedding
        self.qkv = layers.Dense(hidden_size * 3, name="qkv")
        self.out_dense = layers.Dense(hidden_size, name="out_dense")

    def call(self, x, padding_mask=None):
        batch_size = ops.shape(x)[0]
        seq_len = ops.shape(x)[1]

        qkv = self.qkv(x)  # (batch, seq, hidden*3)
        # Reshape to (batch, seq, 3, heads, head_dim)
        qkv = ops.reshape(qkv, (batch_size, seq_len, 3, self.num_heads, self.head_dim))
        q, k, v = ops.unstack(qkv, axis=2)

        if self.rotary_embedding is not None:
            q, k = self.rotary_embedding(q), self.rotary_embedding(k)

        q = ops.transpose(q, (0, 2, 1, 3))
        k = ops.transpose(k, (0, 2, 3, 1))  # Ready for matmul
        v = ops.transpose(v, (0, 2, 1, 3))

        scale = ops.cast(ops.sqrt(ops.cast(self.key_dim, x.dtype)), x.dtype)
        scores = ops.matmul(q, k) / scale

        # scores = ops.matmul(q, k) / ops.sqrt(ops.cast(self.head_dim, x.dtype))

        if padding_mask is not None:
            m = ops.cast(padding_mask[:, None, None, :], scores.dtype)
            scores = scores + (1.0 - m) * -1e9

        attn = ops.softmax(scores, axis=-1)
        out = ops.matmul(attn, v)
        out = ops.transpose(out, (0, 2, 1, 3))
        # Flatten heads and head_dim back into hidden_dim
        out = ops.reshape(out, (batch_size, seq_len, self.hidden_size))

        return self.out_dense(out)


class ModernBertEncoderLayer(layers.Layer):
    """ModernBERT Encoder Layer implementation.

    This layer implements a modernized Transformer block featuring:
    1. Pre-Normalization (Norm-First architecture).
    2. RMSNorm scaling (LayerNorm without additive bias).
    3. Gated Linear Unit (GeGLU) activation in the MLP.

    Args:
        hidden_size: int. Dimensionality of the encoder layer.
        intermediate_size: int. Dimensionality of the MLP intermediate layer.
        num_heads: int. Number of attention heads.
        rotary_embedding: `RotaryEmbedding` layer. Optional rotary positional encoding.
        activation: function. Activation function for the MLP.
        layer_norm_epsilon: float. Epsilon for the LayerNorm layers.
    """

    def __init__(self, hidden_size, intermediate_size, num_heads,
                 rotary_embedding=None,
                 dropout=0.0,
                 activation=gelu_approximate, layer_norm_epsilon=1e-5, **kwargs):
```
For consistency with ModernBertBackbone and the repository's style guide examples (e.g., line 93), the argument names hidden_size and intermediate_size should be hidden_dim and intermediate_dim respectively across all classes in this file (ModernBertMLP, ModernBertAttention, ModernBertEncoderLayer). Please update the __init__ signatures and the class bodies accordingly.
```python
        probability_matrix = ops.random.uniform(ops.shape(input_ids))
        mask = (probability_matrix < self.mlm_probability) & (input_ids != self.tokenizer.pad_token_id)

        labels = ops.where(mask, labels, -100)
        mask_token_mask = ops.random.uniform(ops.shape(input_ids)) < 0.8
        input_ids = ops.where(mask & mask_token_mask, self.tokenizer.mask_token_id, input_ids)
```
This implementation of Masked Language Modeling (MLM) masking is a simplified version of the original BERT strategy. Currently, it replaces 80% of the selected tokens with [MASK] and leaves the other 20% unchanged. The standard BERT approach is to replace 80% with [MASK], 10% with a random token, and keep 10% as is. While this simplification might be intentional, it deviates from the standard and could impact pre-training. If the goal is to follow the original BERT pre-training strategy, consider implementing the full 80-10-10 rule.
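For reference, the full 80-10-10 rule can be sketched as follows, illustrated with NumPy rather than `keras.ops` for clarity; the function name, defaults, and label convention (`-100` for unpredicted positions) are illustrative:

```python
import numpy as np


def bert_mask(input_ids, mask_token_id, vocab_size, pad_token_id=0,
              mlm_probability=0.15, rng=None):
    """Standard BERT 80-10-10 masking (NumPy sketch).

    Of the tokens selected for prediction: 80% become [MASK], 10% become
    a random vocabulary token, and 10% are left unchanged. Labels are
    -100 everywhere except at selected positions.
    """
    rng = rng or np.random.default_rng(0)
    input_ids = np.asarray(input_ids)

    # Select ~mlm_probability of non-padding tokens for prediction.
    selected = (rng.random(input_ids.shape) < mlm_probability) & (input_ids != pad_token_id)
    labels = np.where(selected, input_ids, -100)

    # 80% of selected tokens become [MASK].
    mask_80 = selected & (rng.random(input_ids.shape) < 0.8)
    out = np.where(mask_80, mask_token_id, input_ids)

    # Half of the remaining 20% (i.e. 10% overall) become a random token.
    random_10 = selected & ~mask_80 & (rng.random(input_ids.shape) < 0.5)
    out = np.where(random_10, rng.integers(0, vocab_size, input_ids.shape), out)

    # The final 10% of selected tokens stay unchanged (but still get labels).
    return out, labels
```

The same three-way split translates directly to `keras.ops.where` on a per-branch basis if backend-native tensors are required.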
Force-pushed 772d959 to 4b6d2a2
Force-pushed 03a2ebf to c5e4500
Closing this in favor of a new, cleaner PR with the updated dynamic architecture and refactored codebase. Please follow the new PR here: https://github.com//pull/2518
Added ModernBert implementation files.

Reference: this PR is already closed; refer to the new PR linked above.