Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
I would like to request support for Microsoft's Phi-4-mini-flash-reasoning, a hybrid Transformer/Mamba model built on the SambaY architecture. It combines sliding window attention (SWA) with Mamba1 layers and gated memory units (GMU), which allows arbitrarily long exchanges while the sliding window limits attention to the most recent tokens.

I would also like to request support for Nvidia's Nemotron-nano-9b-v2, another hybrid Transformer/Mamba model, which pairs more conventional attention with Mamba2 layers (making it particularly similar to IBM's Granite 4.0 models, which use the same types of layers). Nemotron-nano-9b-v2 has been a very good small (under 10B) coding model with fast inference, which is rare nowadays.

I am very inclined to try adding support for these myself, but I am unfamiliar with the Llama.cpp codebase, so I would gladly accept pointers from people more experienced with it on where and what to focus my efforts.
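For context, my reading of the SambaY paper is that a GMU is essentially a cheap element-wise gating of a memory state shared from an earlier Mamba layer, used in place of a heavier layer. Below is only a minimal standalone sketch of that math under my assumptions; the names, dimensions, and layout are invented for illustration and are not taken from llama.cpp or the model's reference code:

```cpp
#include <cmath>
#include <vector>

// Sketch of a gated memory unit (GMU) as I understand it from the SambaY
// paper (arXiv:2507.06607): the memory state m, shared from an earlier Mamba
// layer, is gated element-wise by a SiLU-activated projection of the current
// hidden state x, then projected back to the model dimension.
// All names and dimensions here are hypothetical, for illustration only.

static float silu(float v) { return v / (1.0f + std::exp(-v)); }

// y = W_out * (SiLU(W_in * x) ⊙ m)
std::vector<float> gmu(const std::vector<float> & x,                     // [d_model] current hidden state
                       const std::vector<float> & m,                     // [d_inner] shared memory state
                       const std::vector<std::vector<float>> & W_in,     // [d_inner][d_model]
                       const std::vector<std::vector<float>> & W_out) {  // [d_model][d_inner]
    const size_t d_inner = W_in.size();
    const size_t d_model = W_out.size();

    // gate the shared memory: SiLU(W_in * x) ⊙ m
    std::vector<float> gated(d_inner, 0.0f);
    for (size_t i = 0; i < d_inner; ++i) {
        float acc = 0.0f;
        for (size_t j = 0; j < x.size(); ++j) {
            acc += W_in[i][j] * x[j];
        }
        gated[i] = silu(acc) * m[i];
    }

    // project back to the model dimension
    std::vector<float> y(d_model, 0.0f);
    for (size_t i = 0; i < d_model; ++i) {
        for (size_t j = 0; j < d_inner; ++j) {
            y[i] += W_out[i][j] * gated[j];
        }
    }
    return y;
}
```

If I am reading the paper correctly, in a compute graph this reduces to two matrix multiplications, a SiLU, and an element-wise multiply, which is what makes the GMU so much cheaper than the layers it replaces.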
Motivation
With the incorporation of hybrid-model support into Llama.cpp for IBM's Granite 4.0 models, support for other hybrid models can now be built on top of the code already implemented for Granite. Adding more hybrid models would allow better comparisons of different hybrid strategies under the same inference engine, an "apples to apples" comparison, so to speak, of resource usage versus model performance.

It is also very interesting that in the Phi-4-mini-flash-reasoning paper (https://arxiv.org/abs/2507.06607) Microsoft's team found that a Mamba1 layer worked better than a Mamba2 layer for the SambaY architecture. That is intriguing, since other hybrid models seem to simply adopt Mamba2 for their Mamba layers (possibly assuming that Mamba2 is better than Mamba1 in every respect). Making both models widely accessible through Llama.cpp support could encourage more people to investigate when Mamba2 is better than Mamba1 and vice versa. These models currently require several additional packages that are finicky to get working together, so a streamlined implementation would let more people test these small LLMs.
Possible Implementation
For Nvidia's Nemotron-nano-9b-v2, a good amount of the code implemented for IBM's Granite 4.0 models could be reused, since both are Transformer/Mamba2 hybrids, which should make the implementation fairly straightforward and worthwhile. Microsoft's Phi-4-mini-flash-reasoning could be a little more demanding: it uses Mamba1, which I don't think is currently explicitly supported by Llama.cpp (though the existing Mamba2 support certainly reduces the work needed for Mamba1), and gated memory units (GMU) have not yet been implemented either. Sliding window attention (SWA), on the other hand, is already implemented for the Gemma-3 models, which use that attention variant.
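In case it helps whoever picks this up (possibly me), here is a minimal sketch of the per-token Mamba1 (S6) selective-scan recurrence as I understand it from the Mamba paper and its reference implementation. The names, the flat state layout, and the single-token formulation are my own choices for illustration and do not reflect how llama.cpp structures its SSM code:

```cpp
#include <cmath>
#include <vector>

// Per-token Mamba1 selective-scan update, following the recurrence used in the
// Mamba reference code: state update h <- exp(dt*A)*h + dt*B*x, output
// y = C*h + D*x. This is an illustrative sketch only; a real implementation
// would go through ggml tensors and handle batching, sequences, and the
// surrounding convolution/gating of the Mamba block.
void mamba1_step(std::vector<float> & h,        // [d_inner * d_state] running SSM state, updated in place
                 std::vector<float> & y,        // [d_inner] output for this token
                 const std::vector<float> & x,  // [d_inner] input for this token
                 const std::vector<float> & dt, // [d_inner] per-channel step size (after softplus)
                 const std::vector<float> & A,  // [d_inner * d_state] state matrix (negative values)
                 const std::vector<float> & B,  // [d_state] input-dependent input projection
                 const std::vector<float> & C,  // [d_state] input-dependent output projection
                 const std::vector<float> & D,  // [d_inner] skip connection
                 int d_inner, int d_state) {
    for (int d = 0; d < d_inner; ++d) {
        float acc = 0.0f;
        for (int n = 0; n < d_state; ++n) {
            float & hn = h[d * d_state + n];
            // discretized state update for channel d, state n
            hn = std::exp(dt[d] * A[d * d_state + n]) * hn + dt[d] * B[n] * x[d];
            acc += C[n] * hn;
        }
        y[d] = acc + D[d] * x[d];
    }
}
```

As far as I understand, the key difference from Mamba2 is that Mamba1 keeps a full per-channel, per-state A with this fine-grained recurrence, whereas Mamba2 restricts A to a scalar per head, so the existing Mamba2/Granite plumbing (state cache, short convolution, gating) should largely carry over, with mainly the scan step itself differing.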