Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
I would like to request support for Microsoft's Phi-4-mini-flash-reasoning, a hybrid Transformer/Mamba model built on the SambaY architecture. It combines sliding window attention (SWA) with Mamba1 layers and gated memory units (GMU), which allows arbitrarily long exchanges while the sliding window limits attention to the most recent tokens.

I would also like to request support for Nvidia's Nemotron-nano-9b-v2, another hybrid Transformer/Mamba model, which pairs more conventional attention with Mamba2 layers (making it particularly similar to IBM's Granite 4.0 models, which use the same types of layers). Nemotron-nano-9b-v2 has been a very good small (under 10B) coding model with fast inference, which is rare nowadays.

I am very inclined to try adding support for these myself, but I am unfamiliar with the Llama.cpp codebase, so I would gladly accept pointers from people more experienced with it on where and what to focus my efforts.
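For context, my reading of the SambaY paper is that a GMU is essentially a cheap element-wise gating of a memory state shared from an earlier Mamba layer, used in place of a heavier layer. Below is only a minimal standalone sketch of that math under my assumptions; the names, dimensions, and layout are invented for illustration and are not taken from llama.cpp or the model's reference code:

```cpp
#include <cmath>
#include <vector>

// Sketch of a gated memory unit (GMU) as I understand it from the SambaY
// paper (arXiv:2507.06607): the memory state m, shared from an earlier Mamba
// layer, is gated element-wise by a SiLU-activated projection of the current
// hidden state x, then projected back to the model dimension.
// All names and dimensions here are hypothetical, for illustration only.

static float silu(float v) { return v / (1.0f + std::exp(-v)); }

// y = W_out * (SiLU(W_in * x) ⊙ m)
std::vector<float> gmu(const std::vector<float> & x,                     // [d_model] current hidden state
                       const std::vector<float> & m,                     // [d_inner] shared memory state
                       const std::vector<std::vector<float>> & W_in,     // [d_inner][d_model]
                       const std::vector<std::vector<float>> & W_out) {  // [d_model][d_inner]
    const size_t d_inner = W_in.size();
    const size_t d_model = W_out.size();

    // gate the shared memory: SiLU(W_in * x) ⊙ m
    std::vector<float> gated(d_inner, 0.0f);
    for (size_t i = 0; i < d_inner; ++i) {
        float acc = 0.0f;
        for (size_t j = 0; j < x.size(); ++j) {
            acc += W_in[i][j] * x[j];
        }
        gated[i] = silu(acc) * m[i];
    }

    // project back to the model dimension
    std::vector<float> y(d_model, 0.0f);
    for (size_t i = 0; i < d_model; ++i) {
        for (size_t j = 0; j < d_inner; ++j) {
            y[i] += W_out[i][j] * gated[j];
        }
    }
    return y;
}
```

If I am reading the paper correctly, in a compute graph this reduces to two matrix multiplications, a SiLU, and an element-wise multiply, which is what makes the GMU so much cheaper than the layers it replaces.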
Motivation
With the incorporation of hybrid-model support into Llama.cpp for IBM's Granite 4.0 models, support for other hybrid models can now be built on top of the code already implemented for Granite. Adding more hybrid models would allow better comparisons of different hybrid strategies under the same inference engine, an "apples to apples" comparison, so to speak, of resource usage versus model performance.

It is also very interesting that in the Phi-4-mini-flash-reasoning paper (https://arxiv.org/abs/2507.06607) Microsoft's team found that a Mamba1 layer worked better than a Mamba2 layer for the SambaY architecture. That is intriguing, since other hybrid models seem to simply adopt Mamba2 for their Mamba layers (possibly assuming that Mamba2 is better than Mamba1 in every respect). Making both models widely accessible through Llama.cpp support could encourage more people to investigate when Mamba2 is better than Mamba1 and vice versa. These models currently require several additional packages that are finicky to get working together, so a streamlined implementation would let more people test these small LLMs.
Possible Implementation
For Nvidia's Nemotron-nano-9b-v2, a good amount of the code implemented for IBM's Granite 4.0 models could be reused, since both are Transformer/Mamba2 hybrids, which should make the implementation fairly straightforward and worthwhile. Microsoft's Phi-4-mini-flash-reasoning could be a little more demanding: it uses Mamba1, which I don't think is currently explicitly supported by Llama.cpp (though the existing Mamba2 support certainly reduces the work needed for Mamba1), and gated memory units (GMU) have not yet been implemented either. Sliding window attention (SWA), on the other hand, is already implemented for the Gemma-3 models, which use that attention variant.
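In case it helps whoever picks this up (possibly me), here is a minimal sketch of the per-token Mamba1 (S6) selective-scan recurrence as I understand it from the Mamba paper and its reference implementation. The names, the flat state layout, and the single-token formulation are my own choices for illustration and do not reflect how llama.cpp structures its SSM code:

```cpp
#include <cmath>
#include <vector>

// Per-token Mamba1 selective-scan update, following the recurrence used in the
// Mamba reference code: state update h <- exp(dt*A)*h + dt*B*x, output
// y = C*h + D*x. This is an illustrative sketch only; a real implementation
// would go through ggml tensors and handle batching, sequences, and the
// surrounding convolution/gating of the Mamba block.
void mamba1_step(std::vector<float> & h,        // [d_inner * d_state] running SSM state, updated in place
                 std::vector<float> & y,        // [d_inner] output for this token
                 const std::vector<float> & x,  // [d_inner] input for this token
                 const std::vector<float> & dt, // [d_inner] per-channel step size (after softplus)
                 const std::vector<float> & A,  // [d_inner * d_state] state matrix (negative values)
                 const std::vector<float> & B,  // [d_state] input-dependent input projection
                 const std::vector<float> & C,  // [d_state] input-dependent output projection
                 const std::vector<float> & D,  // [d_inner] skip connection
                 int d_inner, int d_state) {
    for (int d = 0; d < d_inner; ++d) {
        float acc = 0.0f;
        for (int n = 0; n < d_state; ++n) {
            float & hn = h[d * d_state + n];
            // discretized state update for channel d, state n
            hn = std::exp(dt[d] * A[d * d_state + n]) * hn + dt[d] * B[n] * x[d];
            acc += C[n] * hn;
        }
        y[d] = acc + D[d] * x[d];
    }
}
```

As far as I understand, the key difference from Mamba2 is that Mamba1 keeps a full per-channel, per-state A with this fine-grained recurrence, whereas Mamba2 restricts A to a scalar per head, so the existing Mamba2/Granite plumbing (state cache, short convolution, gating) should largely carry over, with mainly the scan step itself differing.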