Commit f8c6422
feat(models): Add initial implementation of GraniteMoeHybrid generated by Claude Code
This commit was entirely generated using Claude Code and the following
prompt:
---
I've got an in-depth feature request for you: add support for the GraniteMoeHybrid architecture to the `mlx-lm` project. The task is to extend the existing set of model architecture implementations in `mlx_lm/models` by adding a new module named `granitemoehybrid.py`. Here are a few key pointers on this model architecture:
* It is a hybrid-recurrent model that uses `mamba2` for some layers (recurrent) and `granitemoe` for some layers (attention)
* It is very similar to the `nemotron_h` architecture implemented in `mlx_lm/models/nemotron_h.py`, but with a few key differences
  * In `GraniteMoeHybrid`, each layer has either a `mamba2` block or a `granitemoe` attention block, followed by a MoE block, whereas in `nemotron_h`, each "layer" is a single block that is either `mamba2`, `attention` (llama), or `ffn` (not MoE).
* The config for `GraniteMoeHybrid` uses the `layer_types` field to determine whether to use `mamba2` or `granitemoe` attention for each layer
* The `transformers` implementation can be found at https://github.com/huggingface/transformers/blob/main/src/transformers/models/granitemoehybrid/modeling_granitemoehybrid.py
* The config can be found at https://github.com/huggingface/transformers/blob/main/src/transformers/models/granitemoehybrid/configuration_granitemoehybrid.py
* The PR adding support in `llama.cpp` is: ggml-org/llama.cpp#13550
  * NOTE: In `llama.cpp`, I made the architecture slightly more flexible such that each layer could use either a MoE block OR a fully-connected FFN block after the recurrent/attention block
* For the `granitemoe` attention, the architecture is very similar to standard `llama` attention, but it includes 4 additional scalar multipliers that are pulled from config (see the rough sketch after this list):
  * `embedding_multiplier`:
    * Multiply the input embeddings by this scalar before the first layer
    * Used here in `transformers`: https://github.com/huggingface/transformers/blob/main/src/transformers/models/granitemoehybrid/modeling_granitemoehybrid.py#L1347
  * `attention_multiplier`:
    * Used as the scaling factor in standard attention in place of the default 1/sqrt(n_embed_head)
    * Used here in `transformers`: https://github.com/huggingface/transformers/blob/main/src/transformers/models/granitemoehybrid/modeling_granitemoehybrid.py#L217
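
To make the last two points concrete, here is a rough, self-contained sketch of what I mean by the `layer_types` dispatch and the two multipliers. Only the config field names quoted above are real; the class names, the `"attention"` string value, and the stubbed mamba2 block are illustrative placeholders, not the implementation I expect:

```python
import mlx.core as mx
import mlx.nn as nn


class ScaledAttention(nn.Module):
    """Llama-style attention, except the softmax scale comes from config."""

    def __init__(self, dims: int, num_heads: int, attention_multiplier: float):
        super().__init__()
        self.num_heads = num_heads
        # `attention_multiplier` is used in place of the default head_dim ** -0.5
        self.scale = attention_multiplier
        self.q_proj = nn.Linear(dims, dims, bias=False)
        self.k_proj = nn.Linear(dims, dims, bias=False)
        self.v_proj = nn.Linear(dims, dims, bias=False)
        self.o_proj = nn.Linear(dims, dims, bias=False)

    def __call__(self, x: mx.array, mask=None) -> mx.array:
        B, L, D = x.shape
        q = self.q_proj(x).reshape(B, L, self.num_heads, -1).transpose(0, 2, 1, 3)
        k = self.k_proj(x).reshape(B, L, self.num_heads, -1).transpose(0, 2, 1, 3)
        v = self.v_proj(x).reshape(B, L, self.num_heads, -1).transpose(0, 2, 1, 3)
        o = mx.fast.scaled_dot_product_attention(q, k, v, scale=self.scale, mask=mask)
        return self.o_proj(o.transpose(0, 2, 1, 3).reshape(B, L, D))


class Mamba2Stub(nn.Module):
    """Stand-in for the real mamba2 mixer so this sketch stays runnable."""

    def __call__(self, x: mx.array) -> mx.array:
        return x


class HybridSkeleton(nn.Module):
    """Per-layer dispatch on `layer_types` plus the `embedding_multiplier`."""

    def __init__(self, vocab_size, dims, num_heads, layer_types,
                 embedding_multiplier, attention_multiplier):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dims)
        self.embedding_multiplier = embedding_multiplier
        # One entry in `layer_types` per layer: attention layers get the
        # granitemoe-style attention, everything else gets the mamba2 block.
        # In the real model every layer is also followed by a MoE block.
        self.layers = [
            ScaledAttention(dims, num_heads, attention_multiplier)
            if layer_type == "attention"
            else Mamba2Stub()
            for layer_type in layer_types
        ]

    def __call__(self, tokens: mx.array) -> mx.array:
        # Multiply the input embeddings by `embedding_multiplier` before layer 0
        h = self.embed(tokens) * self.embedding_multiplier
        for layer in self.layers:
            h = h + layer(h)  # residuals only; norms and the MoE blocks omitted
        return h
```

In the real `granitemoehybrid.py`, the stub would be the existing `mamba2` mixer and each layer would also apply its MoE block after the mixer, as described above.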
The goal of this project is to create a fully working local implementation of the model in `mlx_lm`. You can find a local model to test with at /Users/ghart/models/granite-4.0-tiny-preview/. You can find a version of the `nemotron_h` model to test with at /Users/ghart/models/nvidia/NVIDIA-Nemotron-Nano-9B-v2/. To accomplish this project, you'll need to take the following steps:
1. Get a development environment working (you can use `uv` to manage your virtual env) and install the necessary dependencies
2. Run a sample inference with a model that is already known to work (e.g. `/Users/ghart/models/nvidia/NVIDIA-Nemotron-Nano-9B-v2/`; see the sketch after these steps)
3. Create the new module at `mlx_lm/models/granitemoehybrid.py`
4. Implement the model architecture, test, and iterate until you've got things working locally
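
For step 2, something like this quick sanity check is what I have in mind (assuming the standard `mlx_lm` Python API of `load`/`generate`; the prompt text and token budget are arbitrary):

```python
from mlx_lm import load, generate

# Load a model that is already known to work to confirm the environment is sane.
model, tokenizer = load("/Users/ghart/models/nvidia/NVIDIA-Nemotron-Nano-9B-v2/")

messages = [{"role": "user", "content": "Say hello in one short sentence."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# verbose=True streams the output and prints generation stats when it finishes.
text = generate(model, tokenizer, prompt=prompt, max_tokens=64, verbose=True)
```

Once that works, running the same script against `/Users/ghart/models/granite-4.0-tiny-preview/` is the end-to-end test for the new module.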
Once you've got it working, let me know and I'll review and commit
---
Branch: GraniteHybrid
Signed-off-by: Gabe Goodhart <[email protected]>
1 file changed, 521 insertions(+), 0 deletions(-)