
Commit 0c9a72e

paulpak58, mlabonne, and Cyrilvallez authored
[Model] Lfm2Moe (#41401)
* [new-models] LFM2-MoE
* [docs] add in template lfm2_moe doc files
* [configuration] update configuration class
* [modular][lfm] minor: fix rotary_emb typo
* [modeling] modular/modeling files for Lfm2Moe
* [modeling][lfm2_moe] fix Lfm2Moe modular/modeling
* [configuration][lfm2_moe] update configuration keys with latest config changes
* [misc] make fixup
* [modular][lfm2_moe] address comments: dtype, mlp, buffers
* [configuration][lfm2_moe] add initializer_range
* [modular][lfm2_moe] include init_weights to pass test_initialization
* [tests][causal_lm] include pos_emb as possible rope attribute
* [modeling][lfm2_moe] remove load_balancing_loss_func due to lack of support for hooking expert biases
* [misc] make style
* [modeling][lfm2_moe] MoE refactor PR update in LFM2Moe
* [tests] lfm2_moe: unit tests
* [misc] update LFM2-8B-A1B repo id
* [tests] lfm2: update ModelTests for lfm2
* Update LFM2 documentation: updated the LFM2 documentation to reflect the addition of a new model size and clarified architectural details
* Add Lfm2Moe documentation: add Lfm2Moe model documentation with overview and example usage
* [misc] fix ci
* [docs] remove trust_remote_code
* [misc] ci: fix modular
* reapply modular
* simplify
* remove static address and inplace op
* simplify
* simplify a bit more the modular
* imports

---------

Signed-off-by: Paul Pak <[email protected]>
Co-authored-by: Maxime Labonne <[email protected]>
Co-authored-by: Cyril Vallez <[email protected]>
Co-authored-by: Cyril Vallez <[email protected]>
1 parent b4428d5 commit 0c9a72e

17 files changed: +1635 -22 lines

docs/source/en/_toctree.yml

Lines changed: 2 additions & 0 deletions
@@ -562,6 +562,8 @@
       title: LED
     - local: model_doc/lfm2
       title: LFM2
+    - local: model_doc/lfm2_moe
+      title: LFM2Moe
     - local: model_doc/llama
       title: LLaMA
     - local: model_doc/llama2

docs/source/en/model_doc/lfm2.md

Lines changed: 4 additions & 4 deletions
@@ -23,15 +23,15 @@ rendered properly in your Markdown viewer.

 ## Overview

-[LFM2](https://www.liquid.ai/blog/liquid-foundation-models-v2-our-second-series-of-generative-ai-models) represents a new generation of Liquid Foundation Models developed by [Liquid AI](https://liquid.ai/), specifically designed for edge AI and on-device deployment.
+[LFM2](https://www.liquid.ai/blog/liquid-foundation-models-v2-our-second-series-of-generative-ai-models) represents a new generation of Liquid Foundation Models developed by Liquid AI, specifically designed for edge AI and on-device deployment.

-The models are available in three sizes (350M, 700M, and 1.2B parameters) and are engineered to run efficiently on CPU, GPU, and NPU hardware, making them particularly well-suited for applications requiring low latency, offline operation, and privacy.
+The models are available in four sizes (350M, 700M, 1.2B, and 2.6B parameters) and are engineered to run efficiently on CPU, GPU, and NPU hardware, making them particularly well-suited for applications requiring low latency, offline operation, and privacy.

 ## Architecture

-The architecture consists of 16 blocks total: 10 double-gated short-range convolution blocks and 6 blocks of grouped query attention. This design stems from the concept of dynamical systems, where linear operations are modulated by input-dependent gates, allowing for "liquid" dynamics that can adapt in real-time. The short convolutions are particularly optimized for embedded SoC CPUs, making them ideal for devices that require fast, local inference without relying on cloud connectivity.
+The architecture consists of blocks of gated short convolution blocks and blocks of grouped query attention with QK layernorm. This design stems from the concept of dynamical systems, where linear operations are modulated by input-dependent gates. The short convolutions are particularly optimized for embedded SoC CPUs, making them ideal for devices that require fast, local inference without relying on cloud connectivity.

-The key architectural innovation of LFM2 lies in its systematic approach to balancing quality, latency, and memory efficiency through our STAR neural architecture search engine. Using STAR, Liquid AI optimized the models for real-world performance on embedded hardware, measuring actual peak memory usage and inference speed on Qualcomm Snapdragon processors. This results in models that achieve 2x faster decode and prefill performance compared to similar-sized models, while maintaining superior benchmark performance across knowledge, mathematics, instruction following, and multilingual tasks.
+LFM2 was designed to maximize quality under strict speed and memory constraints. This was accomplished through a systematic architecture search to optimize the models for real-world performance on embedded hardware by measuring actual peak memory usage and inference speed on Qualcomm Snapdragon processors. This results in models that achieve 2x faster decode and prefill performance compared to similar-sized models, while maintaining superior benchmark performance across knowledge, mathematics, instruction following, and multilingual tasks.

 ## Example
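Not part of the commit: a minimal, hypothetical sketch of the gated short-convolution idea described in the updated Architecture section above. The names, shapes, and gating layout below are assumptions for illustration, not the actual `Lfm2` modeling code; the point is simply that a short depthwise causal convolution can be wrapped in input-dependent gates.

```python
# Illustrative sketch only (assumed layout, not the transformers Lfm2 implementation).
import torch
import torch.nn as nn


class GatedShortConvBlock(nn.Module):
    def __init__(self, hidden_size: int, kernel_size: int = 3):
        super().__init__()
        # Linear maps produce the input-dependent gate signals ("liquid" dynamics).
        self.in_proj = nn.Linear(hidden_size, 3 * hidden_size, bias=False)
        # Short depthwise convolution over the sequence dimension.
        self.conv = nn.Conv1d(
            hidden_size, hidden_size, kernel_size,
            groups=hidden_size, padding=kernel_size - 1, bias=False,
        )
        self.out_proj = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq_len, hidden)
        seq_len = x.shape[1]
        b, c, gated = self.in_proj(x).chunk(3, dim=-1)
        h = b * gated                                     # input-dependent gate
        h = self.conv(h.transpose(1, 2))[..., :seq_len]   # trim right padding -> causal short conv
        h = c * h.transpose(1, 2)                         # output gate
        return self.out_proj(h)


# Quick shape check
block = GatedShortConvBlock(hidden_size=64)
print(block(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```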

docs/source/en/model_doc/lfm2_moe.md

Lines changed: 83 additions & 0 deletions
@@ -0,0 +1,83 @@
+<!--Copyright 2025 the HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+
+
+⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be rendered properly in your Markdown viewer.
+
+-->
+
+
+# Lfm2Moe
+
+## Overview
+
+LFM2-MoE is a Mixture-of-Experts (MoE) variant of [LFM2](https://huggingface.co/collections/LiquidAI/lfm2-686d721927015b2ad73eaa38). The LFM2 family is optimized for on-device inference by combining short‑range, input‑aware gated convolutions with grouped‑query attention (GQA) in a layout tuned to maximize quality under strict speed and memory constraints.
+
+LFM2‑MoE keeps this fast backbone and introduces sparse MoE feed‑forward networks to add representational capacity without significantly increasing the active compute path. The first LFM2-MoE release is LFM2-8B-A1B, with 8.3B total parameters and 1.5B active parameters. The model excels in quality (comparable to 3-4B dense models) and speed (faster than other 1.5B class models).
+
+## Example
+
+The following example shows how to generate an answer using the `AutoModelForCausalLM` class.
+
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+# Load model and tokenizer
+model_id = "LiquidAI/LFM2-8B-A1B"
+model = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    device_map="auto",
+    dtype="bfloat16",
+    # attn_implementation="flash_attention_2" <- uncomment on compatible GPU
+)
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+# Generate answer
+prompt = "What is C. elegans?"
+input_ids = tokenizer.apply_chat_template(
+    [{"role": "user", "content": prompt}],
+    add_generation_prompt=True,
+    return_tensors="pt",
+    tokenize=True,
+).to(model.device)
+
+output = model.generate(
+    input_ids,
+    do_sample=True,
+    temperature=0.3,
+    min_p=0.15,
+    repetition_penalty=1.05,
+    max_new_tokens=512,
+)
+
+print(tokenizer.decode(output[0], skip_special_tokens=False))
+```
+
+## Lfm2MoeConfig
+
+[[autodoc]] Lfm2MoeConfig
+
+## Lfm2MoeForCausalLM
+
+[[autodoc]] Lfm2MoeForCausalLM
+
+## Lfm2MoeModel
+
+[[autodoc]] Lfm2MoeModel
+    - forward
+
+## Lfm2MoePreTrainedModel
+
+[[autodoc]] Lfm2MoePreTrainedModel
+    - forward
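Not part of the commit: the overview above describes sparse MoE feed-forward layers where only a few experts are active per token. The sketch below is a hypothetical top-k router written against the configuration fields introduced in this PR (`num_experts`, `num_experts_per_tok`, `use_expert_bias`, `norm_topk_prob`); it is not the `Lfm2Moe` implementation, and the sigmoid scoring and bias-for-selection-only behavior are assumptions.

```python
# Illustrative sketch only (assumed routing, not the transformers Lfm2Moe code).
import torch
import torch.nn as nn


class TinyTopKRouter(nn.Module):
    def __init__(self, hidden_size: int, num_experts: int = 32, top_k: int = 4):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        # Per-expert bias used only for expert selection (cf. use_expert_bias).
        self.expert_bias = nn.Parameter(torch.zeros(num_experts), requires_grad=False)

    def forward(self, hidden: torch.Tensor):
        scores = torch.sigmoid(self.gate(hidden))                   # (num_tokens, num_experts)
        _, expert_idx = (scores + self.expert_bias).topk(self.top_k, dim=-1)
        weights = scores.gather(-1, expert_idx)                     # gating weights of chosen experts
        weights = weights / weights.sum(dim=-1, keepdim=True)       # cf. norm_topk_prob=True
        return expert_idx, weights                                  # only top_k expert MLPs run per token


router = TinyTopKRouter(hidden_size=2048)
idx, w = router(torch.randn(5, 2048))
print(idx.shape, w.shape)  # torch.Size([5, 4]) torch.Size([5, 4])
```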

src/transformers/models/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -186,6 +186,7 @@
 from .led import *
 from .levit import *
 from .lfm2 import *
+from .lfm2_moe import *
 from .lfm2_vl import *
 from .lightglue import *
 from .lilt import *

src/transformers/models/auto/configuration_auto.py

Lines changed: 2 additions & 0 deletions
@@ -226,6 +226,7 @@
         ("led", "LEDConfig"),
         ("levit", "LevitConfig"),
         ("lfm2", "Lfm2Config"),
+        ("lfm2_moe", "Lfm2MoeConfig"),
         ("lfm2_vl", "Lfm2VlConfig"),
         ("lightglue", "LightGlueConfig"),
         ("lilt", "LiltConfig"),
@@ -670,6 +671,7 @@
         ("led", "LED"),
         ("levit", "LeViT"),
         ("lfm2", "Lfm2"),
+        ("lfm2_moe", "Lfm2Moe"),
         ("lfm2_vl", "Lfm2Vl"),
         ("lightglue", "LightGlue"),
         ("lilt", "LiLT"),

src/transformers/models/auto/modeling_auto.py

Lines changed: 2 additions & 0 deletions
@@ -226,6 +226,7 @@ class _BaseModelWithGenerate(PreTrainedModel, GenerationMixin):
         ("led", "LEDModel"),
         ("levit", "LevitModel"),
         ("lfm2", "Lfm2Model"),
+        ("lfm2_moe", "Lfm2MoeModel"),
         ("lfm2_vl", "Lfm2VlModel"),
         ("lightglue", "LightGlueForKeypointMatching"),
         ("lilt", "LiltModel"),
@@ -694,6 +695,7 @@ class _BaseModelWithGenerate(PreTrainedModel, GenerationMixin):
         ("jamba", "JambaForCausalLM"),
         ("jetmoe", "JetMoeForCausalLM"),
         ("lfm2", "Lfm2ForCausalLM"),
+        ("lfm2_moe", "Lfm2MoeForCausalLM"),
         ("llama", "LlamaForCausalLM"),
         ("llama4", "Llama4ForCausalLM"),
         ("llama4_text", "Llama4ForCausalLM"),

src/transformers/models/lfm2/modeling_lfm2.py

Lines changed: 0 additions & 2 deletions
@@ -163,7 +163,6 @@ def __init__(
                 dtype=self._dtype,
                 device=device,
             )
-            torch._dynamo.mark_static_address(conv_state)
             self.conv_cache.append(conv_state)
             self.key_cache.append(torch.tensor([]))
             self.value_cache.append(torch.tensor([]))
@@ -595,7 +594,6 @@ def __init__(self, config: Lfm2Config):
         self.layers = nn.ModuleList(
             [Lfm2DecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
         )
-        self.rotary_emb = Lfm2RotaryEmbedding(config=config)
         self.gradient_checkpointing = False
         self.pos_emb = Lfm2RotaryEmbedding(config)
         self.embedding_norm = Lfm2RMSNorm(config.hidden_size, eps=config.norm_eps)

src/transformers/models/lfm2/modular_lfm2.py

Lines changed: 1 addition & 2 deletions
@@ -121,7 +121,6 @@ def __init__(
                 dtype=self._dtype,
                 device=device,
             )
-            torch._dynamo.mark_static_address(conv_state)
             self.conv_cache.append(conv_state)
             self.key_cache.append(torch.tensor([]))
             self.value_cache.append(torch.tensor([]))
@@ -441,7 +440,7 @@ def __init__(self, config: Lfm2Config):
         self.pos_emb = Lfm2RotaryEmbedding(config)
         self.embedding_norm = Lfm2RMSNorm(config.hidden_size, eps=config.norm_eps)
         del self.norm
-        del self.rotary_emv
+        del self.rotary_emb

     def forward(
         self,
src/transformers/models/lfm2_moe/__init__.py

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
+# coding=utf-8
+# Copyright 2025 the HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from typing import TYPE_CHECKING
+
+from ...utils import _LazyModule
+from ...utils.import_utils import define_import_structure
+
+
+if TYPE_CHECKING:
+    from .configuration_lfm2_moe import *
+    from .modeling_lfm2_moe import *
+else:
+    import sys
+
+    _file = globals()["__file__"]
+    sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)
src/transformers/models/lfm2_moe/configuration_lfm2_moe.py

Lines changed: 169 additions & 0 deletions
@@ -0,0 +1,169 @@
+# Copyright 2025 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from typing import Optional
+
+from ...configuration_utils import PretrainedConfig
+
+
+class Lfm2MoeConfig(PretrainedConfig):
+    r"""
+    This is the configuration class to store the configuration of a [`Lfm2MoeModel`]. It is used to instantiate a LFM2 Moe
+    model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
+    defaults will yield a similar configuration to that of the LFM2-8B-A1B model.
+    e.g. [LiquidAI/LFM2-8B-A1B](https://huggingface.co/LiquidAI/LFM2-8B-A1B)
+
+    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+    documentation from [`PretrainedConfig`] for more information.
+
+
+    Args:
+        vocab_size (`int`, *optional*, defaults to 65536):
+            Vocabulary size of the LLaMA model. Defines the number of different tokens that can be represented by the
+            `inputs_ids` passed when calling [`Lfm2Model`]
+        hidden_size (`int`, *optional*, defaults to 2048):
+            Dimension of the hidden representations.
+        intermediate_size (`int`, *optional*, defaults to 7168):
+            Dimension of the MLP representations.
+        moe_intermediate_size (`int`, *optional*, defaults to 1792):
+            Intermediate size of the routed expert.
+        num_hidden_layers (`int`, *optional*, defaults to 32):
+            Number of hidden layers in the Transformer decoder.
+        pad_token_id (`int`, *optional*, defaults to 0):
+            Padding token id.
+        bos_token_id (`int`, *optional*, defaults to 1):
+            Beginning of stream token id.
+        eos_token_id (`int`, *optional*, defaults to 2):
+            End of stream token id.
+        tie_word_embeddings (`bool`, *optional*, defaults to `True`):
+            Whether to tie weight embeddings
+        rope_theta (`float`, *optional*, defaults to 1000000.0):
+            The base period of the RoPE embeddings.
+        max_position_embeddings (`int`, *optional*, defaults to 128000):
+            The maximum sequence length that this model might ever be used with.
+        use_cache (`bool`, *optional*, defaults to `True`):
+            Whether or not the model should return the last key/values attentions (not used by all models). Only
+            relevant if `config.is_decoder=True`.
+        norm_eps (`float`, *optional*, defaults to 1e-05):
+            The epsilon used by the rms normalization layers.
+        num_attention_heads (`int`, *optional*, defaults to 32):
+            Number of attention heads for each attention layer in the Transformer decoder.
+        num_key_value_heads (`int`, *optional*, defaults to 8):
+            This is the number of key_value heads that should be used to implement Grouped Query Attention. If
+            `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
+            `num_key_value_heads=1` the model will use Multi Query Attention (MQA) otherwise GQA is used. When
+            converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
+            by meanpooling all the original heads within that group. For more details, check out [this
+            paper](https://huggingface.co/papers/2305.13245). If it is not specified, will default to
+            `num_attention_heads`.
+        conv_bias (`bool`, *optional*, defaults to `False`):
+            Whether to use bias in the conv layers.
+        conv_L_cache (`int`, *optional*, defaults to 3):
+            L_cache dim in the conv layers.
+        num_dense_layers (`int`, *optional*, defaults to 2):
+            Number of dense Lfm2MoeMLP layers in shallow layers(embed->dense->dense->...->dense->moe->moe...->lm_head).
+        num_experts_per_tok (`int`, *optional*, defaults to 4):
+            Number of selected experts.
+        num_experts (`int`, *optional*, defaults to 32):
+            Number of routed experts.
+        use_expert_bias (`bool`, *optional*, defaults to `True`):
+            Whether to use the expert bias on the routing weights.
+        routed_scaling_factor (`float`, *optional*, defaults to 1.0):
+            Scaling factor for routed experts in MoE models.
+        norm_topk_prob (`bool`, *optional*, defaults to `True`):
+            Whether to normalize the topk probabilities.
+        layer_types (`Optional`, *optional*):
+            Type of each layers.
+
+    ```python
+    >>> from transformers import Lfm2MoeModel, Lfm2MoeConfig
+
+    >>> # Initializing a LFM2 Moe model
+    >>> configuration = Lfm2MoeConfig()
+
+    >>> # Initializing a model from the LFM2-8B-A1B style configuration
+    >>> model = Lfm2MoeModel(configuration)
+
+    >>> # Accessing the model configuration
+    >>> configuration = model.config
+    ```"""
+
+    model_type = "lfm2_moe"
+    keys_to_ignore_at_inference = ["past_key_values"]
+
+    def __init__(
+        self,
+        vocab_size: int = 65536,
+        hidden_size: int = 2048,
+        intermediate_size: int = 7168,
+        moe_intermediate_size: int = 1792,
+        num_hidden_layers: int = 32,
+        pad_token_id: int = 0,
+        bos_token_id: int = 1,
+        eos_token_id: int = 2,
+        tie_word_embeddings: bool = True,
+        rope_theta: float = 1000000.0,
+        max_position_embeddings: int = 128_000,
+        use_cache: bool = True,
+        norm_eps: float = 0.00001,
+        num_attention_heads: int = 32,
+        num_key_value_heads: int = 8,
+        conv_bias: bool = False,
+        conv_L_cache: int = 3,
+        num_dense_layers: int = 2,
+        num_experts_per_tok: int = 4,
+        num_experts: int = 32,
+        use_expert_bias: bool = True,
+        routed_scaling_factor: float = 1.0,
+        norm_topk_prob: bool = True,
+        layer_types: Optional[list[str]] = None,
+        **kwargs,
+    ):
+        self.vocab_size = vocab_size
+        self.hidden_size = hidden_size
+        self.intermediate_size = intermediate_size
+        self.num_hidden_layers = num_hidden_layers
+        self.rope_theta = rope_theta
+        self.max_position_embeddings = max_position_embeddings
+        self.use_cache = use_cache
+        self.norm_eps = norm_eps
+
+        # attn operator config
+        self.num_attention_heads = num_attention_heads
+        self.num_key_value_heads = num_key_value_heads
+
+        # custom operator config
+        self.conv_bias = conv_bias
+        self.conv_L_cache = conv_L_cache
+
+        # moe config
+        self.num_dense_layers = num_dense_layers
+        self.moe_intermediate_size = moe_intermediate_size
+        self.num_experts_per_tok = num_experts_per_tok
+        self.num_experts = num_experts
+        self.use_expert_bias = use_expert_bias
+        self.routed_scaling_factor = routed_scaling_factor
+        self.norm_topk_prob = norm_topk_prob
+        self.layer_types = layer_types
+
+        tie_word_embeddings = kwargs.get("tie_embedding", tie_word_embeddings)  # to fit original config keys
+        super().__init__(
+            pad_token_id=pad_token_id,
+            bos_token_id=bos_token_id,
+            eos_token_id=eos_token_id,
+            tie_word_embeddings=tie_word_embeddings,
+            **kwargs,
+        )
+
+
+__all__ = ["Lfm2MoeConfig"]
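Not part of the commit: a quick usage check of the configuration class above, assuming a `transformers` build that includes this PR. The printed defaults come straight from the documented argument values.

```python
# Sanity-check the documented defaults and override a few MoE hyperparameters.
from transformers import Lfm2MoeConfig

config = Lfm2MoeConfig()
print(config.num_experts, config.num_experts_per_tok)          # 32 4
print(config.num_dense_layers, config.moe_intermediate_size)   # 2 1792

# MoE hyperparameters can be overridden like any other PretrainedConfig kwargs.
small = Lfm2MoeConfig(num_experts=8, num_experts_per_tok=2, moe_intermediate_size=256)
print(small.num_experts, small.num_experts_per_tok, small.moe_intermediate_size)  # 8 2 256
```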
