Hybrid Mamba-2 + Transformer 2.94B LLM (Nemotron-H style) — Korean 3B model pretrained from scratch on 7× NVIDIA B200 GPUs with SFT + DPO alignment
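The SFT + DPO alignment mentioned in this description reduces to a simple preference loss over chosen/rejected response pairs. Below is a minimal PyTorch sketch of that loss; the function name and tensor layout are assumptions for illustration, not this repository's code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss from per-sequence log-probabilities.

    Each argument is a 1-D tensor of summed token log-probs for the chosen /
    rejected responses under the trained policy and the frozen reference model.
    beta controls how far the policy is allowed to drift from the reference.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # -log sigmoid(margin): pushes the chosen response above the rejected one
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```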
Rust-native MoE inference runtime with custom CUDA kernels for Blackwell GPUs. Includes DFlash speculative decoding, multi-tier Engram memory, and entropy-adaptive routing. Targets Qwen3.5-35B-A3B on a single RTX 5060 Ti 16GB.
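The runtime itself is Rust with custom CUDA kernels; the sketch below uses Python only to illustrate one plausible reading of "entropy-adaptive routing" in an MoE layer, namely widening top-k when the router distribution is high-entropy. The function name, thresholds, and the interpretation are all hypothetical, not the repository's actual algorithm.

```python
import torch
import torch.nn.functional as F

def route_tokens(router_logits, k_min=1, k_max=4, entropy_threshold=1.0):
    """Hypothetical entropy-adaptive top-k routing.

    router_logits: (tokens, num_experts). Tokens whose routing distribution is
    high-entropy (the router is unsure) are sent to k_max experts; confident
    tokens use only k_min experts.
    """
    probs = F.softmax(router_logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)      # (tokens,)
    k = (entropy > entropy_threshold).long() * (k_max - k_min) + k_min  # per-token k
    topk_probs, topk_idx = probs.topk(k_max, dim=-1)
    # mask out experts beyond each token's chosen k, then renormalize the weights
    keep = torch.arange(k_max, device=probs.device) < k.unsqueeze(-1)
    weights = (topk_probs * keep) / (topk_probs * keep).sum(-1, keepdim=True)
    return topk_idx, weights
```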
Hybrid SSM-Attention language model on Apple Silicon with MLX — interleaving Mamba-2 and Transformer for efficient inference
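The interleaving pattern in hybrid SSM-attention stacks (Nemotron-H style: mostly state-space blocks with a few attention layers) can be sketched in a few lines. The example below uses plain PyTorch rather than MLX, and a toy diagonal linear recurrence in place of a real Mamba-2 mixer, purely to show the layer layout; the class names and pattern string are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SSMBlock(nn.Module):
    """Toy diagonal linear-recurrence mixer standing in for a real Mamba-2 block."""
    def __init__(self, dim):
        super().__init__()
        self.in_proj, self.out_proj = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.log_decay = nn.Parameter(torch.zeros(dim))

    def forward(self, x):                          # x: (batch, seq, dim)
        u, a = self.in_proj(x), torch.sigmoid(self.log_decay)
        h, ys = torch.zeros_like(u[:, 0]), []
        for t in range(u.shape[1]):                # O(1) state per token, no KV cache
            h = a * h + (1 - a) * u[:, t]
            ys.append(h)
        return x + self.out_proj(torch.stack(ys, dim=1))

class AttnBlock(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        return x + self.attn(x, x, x, need_weights=False)[0]

def build_hybrid(dim=512, pattern="MMMAMMMA"):
    """'M' -> SSM block, 'A' -> attention block; only a few layers use attention."""
    return nn.Sequential(*[SSMBlock(dim) if c == "M" else AttnBlock(dim)
                           for c in pattern])

# usage: out = build_hybrid()(torch.randn(2, 16, 512))
```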
llama.cpp fork with additional SOTA quants and improved performance
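The quantization schemes in llama.cpp are blockwise: each small block of weights shares a scale. The sketch below shows the basic idea with a simplified symmetric 4-bit quantizer; it is not any specific llama.cpp format (Q4_0, Q4_K, etc. pack bits and store scales differently), and it assumes the weight count is a multiple of the block size.

```python
import numpy as np

def quantize_4bit_blocks(weights, block_size=32):
    """Simplified symmetric 4-bit blockwise quantization (illustrative only)."""
    w = np.asarray(weights, dtype=np.float32).reshape(-1, block_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0         # one scale per block
    scale[scale == 0] = 1.0                                    # avoid divide-by-zero
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)    # 4-bit range [-8, 7]
    return q, scale

def dequantize(q, scale):
    return (q.astype(np.float32) * scale).reshape(-1)
```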