---
title: "SGLang Adds Day-0 Support for the Highly Efficient, Open Nemotron 3 Nano Hybrid MoE Model"
author: "NVIDIA Nemotron Team"
date: "December 15, 2025"
previewImg: /images/blog/nemotron-3-nano/benchmark.png
---

We are excited to announce that SGLang supports the latest highly efficient NVIDIA Nemotron 3 Nano model on Day 0!

Nemotron 3 Nano, part of the newly announced open [Nemotron 3 family](https://developer.nvidia.com/blog/inside-nvidia-nemotron-3-techniques-tools-and-data-that-make-it-efficient-and-accurate/), is a compact MoE language model that offers industry-leading compute efficiency and accuracy, enabling developers to build specialized agentic AI systems.

Nemotron 3 Nano is fully open, with open weights, datasets, and recipes, so developers can easily customize, optimize, and deploy the model on their own infrastructure for maximum privacy and security. The chart below shows that Nemotron 3 Nano sits in the most attractive quadrant of the Artificial Analysis Openness vs. Intelligence Index.

![figure1](/images/blog/nemotron-3-nano/artificial_analysis.png)<small><center>NVIDIA Nemotron 3 Sets a New Standard for Open Source AI</center></small>

## TL;DR
- Architecture:
  - Mixture of Experts (MoE) with a hybrid Transformer-Mamba architecture
  - Supports a thinking budget that delivers optimal accuracy with minimal reasoning-token generation
- Accuracy:
  - Leading accuracy on coding, scientific reasoning, math, and instruction following
- Model size: 30B total parameters, 3.6B active parameters
- Context length: 1M tokens
- Model input: Text
- Model output: Text
- Supported GPUs: NVIDIA RTX Pro 6000, DGX Spark, H100, B200
- Get started:
  - Download model weights from Hugging Face: [BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16), [FP8](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8)
  - [Run with SGLang for inference](https://cookbook.sglang.io/docs/NVIDIA/Nemotron3-Nano)
  - Read the [technical report](http://research.nvidia.com/labs/nemotron) to build custom, optimized models with Nemotron techniques

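As a quick sanity check on the parameter counts above (our arithmetic, assuming the 30B-total / 3.6B-active figures from the TL;DR), only a small fraction of the model's parameters is exercised per token:

```python
# Back-of-the-envelope: fraction of parameters active per token for
# Nemotron 3 Nano (30B total, 3.6B active, per the TL;DR above).
total_params = 30e9
active_params = 3.6e9

active_fraction = active_params / total_params
print(f"Active fraction per token: {active_fraction:.0%}")  # Active fraction per token: 12%
```

This is the core appeal of MoE at this scale: per-token compute tracks the active parameters, not the total.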
## Installation and Quick Start

For an easier setup with SGLang, refer to our getting-started cookbook, available [here](https://cookbook.sglang.io/docs/NVIDIA/Nemotron3-Nano) or through the NVIDIA Brev [launchable](https://brev.nvidia.com/launchable/deploy?launchableID=env-36ikQZX0ZDTSCGE7YkqxiOKwKsj).

Run the command below to install dependencies:

```bash
pip install sglang==0.5.6.post2.dev7852+g8102e36b5 --extra-index-url https://sgl-project.github.io/whl/nightly/
```

We can then serve the model:

```bash
# BF16
python3 -m sglang.launch_server --model-path nvidia/NVIDIA-Nemotron-Nano-3-30B-A3B-BF16 --trust-remote-code --reasoning-parser nano_v3 --tool-call-parser qwen3_coder

# Swap out the model name for FP8
python3 -m sglang.launch_server --model-path nvidia/NVIDIA-Nemotron-Nano-3-30B-A3B-FP8 --trust-remote-code --reasoning-parser nano_v3 --tool-call-parser qwen3_coder
```

Once the server is up and running, you can prompt the model using the following code snippet:

```python
from openai import OpenAI

# The model name we used when launching the server.
SERVED_MODEL_NAME = "nvidia/NVIDIA-Nemotron-Nano-3-30B-A3B-BF16"

BASE_URL = "http://localhost:30000/v1"
API_KEY = "EMPTY"  # The SGLang server doesn't require an API key by default.

client = OpenAI(base_url=BASE_URL, api_key=API_KEY)

resp = client.chat.completions.create(
    model=SERVED_MODEL_NAME,
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Give me 3 bullet points about SGLang."},
    ],
    temperature=0.6,
    max_tokens=512,
)
print(resp.choices[0].message.reasoning_content, resp.choices[0].message.content)
```

## Nemotron 3 Nano provides the highest efficiency with leading accuracy for building AI agents

Nemotron 3 Nano builds on the hybrid Mamba-Transformer architecture by replacing standard feed-forward network (FFN) layers with MoE layers and most of the attention layers with Mamba-2. This enables higher accuracy while activating only a fraction of the model's parameters per token. By leveraging MoE, Nemotron 3 Nano reduces compute demands and satisfies the tight latency constraints required for real-world deployment.
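
The MoE idea above can be pictured with a toy top-k router. This is a conceptual sketch only: the expert count, k, dimensions, and random weights below are made up for illustration and do not reflect Nemotron 3 Nano's actual configuration or implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_experts, top_k = 16, 8, 2

# Toy "experts": one weight matrix each (real MoE experts are full FFN blocks).
experts = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(n_experts)]
router_w = rng.standard_normal((d_model, n_experts)) * 0.1

def moe_layer(x):
    """Route a single token vector x to its top-k experts and mix their outputs."""
    logits = x @ router_w                    # router scores, shape (n_experts,)
    top = np.argsort(logits)[-top_k:]        # indices of the k highest-scoring experts
    gate = np.exp(logits[top])
    gate /= gate.sum()                       # softmax over the selected experts only
    # Only top_k of n_experts run, so active compute is top_k / n_experts
    # of what a dense layer of the same total size would cost.
    return sum(g * (x @ experts[i]) for g, i in zip(gate, top))

x = rng.standard_normal(d_model)
y = moe_layer(x)
print(y.shape, f"active experts: {top_k}/{n_experts}")  # (16,) active experts: 2/8
```

The output has the same shape as a dense FFN's would, but only 2 of the 8 experts ran, which is the same total-vs-active trade-off the 30B/3.6B figures describe at full scale.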
Nemotron 3 Nano’s hybrid Mamba-Transformer architecture boosts token throughput by up to 4x, allowing the model to reason more quickly while delivering higher accuracy. Its “thinking budget” feature helps avoid unnecessary computation, reducing overthinking and ensuring lower, more predictable inference costs.
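
The thinking-budget behavior can be pictured as a cap on reasoning tokens. The helper below is a hypothetical, conceptual sketch of that idea, not SGLang's or Nemotron's actual API; in practice the serving stack enforces the budget during generation rather than after the fact.

```python
def apply_thinking_budget(reasoning_tokens, budget):
    """Conceptual sketch: stop keeping reasoning tokens once the budget is spent.

    Hypothetical helper for illustration only -- a real implementation caps
    generation itself, so no tokens beyond the budget are ever produced.
    """
    if len(reasoning_tokens) <= budget:
        return reasoning_tokens, False       # finished within budget
    return reasoning_tokens[:budget], True   # cut off at the budget

tokens = [f"t{i}" for i in range(1000)]      # a long, "overthinking" reasoning trace
kept, truncated = apply_thinking_budget(tokens, budget=256)
print(len(kept), truncated)  # 256 True
```

Capping reasoning this way is what makes inference cost predictable: the worst-case reasoning length is the budget, not whatever the model happens to generate.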
![figure2](/images/blog/nemotron-3-nano/speed.png)<small><center>Nemotron 3 Nano delivers higher throughput with leading accuracy among open reasoning models</center></small>

Trained on NVIDIA-curated, high-quality data, Nemotron 3 Nano leads on benchmarks such as SWE-bench Verified, GPQA Diamond, AIME 2025, Arena-Hard v2, and IFBench, delivering top-tier accuracy in coding, [reasoning](https://www.nvidia.com/en-us/glossary/ai-reasoning/), math, and instruction following. This makes it ideal for building AI agents for enterprise use cases including finance, cybersecurity, software development, and retail.
![figure3](/images/blog/nemotron-3-nano/benchmark.png)<small><center>Nemotron 3 Nano provides leading accuracy on various popular academic benchmarks among open small reasoning models</center></small>
## Get Started

- Download Nemotron 3 Nano model weights from Hugging Face: [BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16), [FP8](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8)
- Run with SGLang for inference using the [cookbook](https://cookbook.sglang.io/docs/NVIDIA/Nemotron3-Nano) or through the NVIDIA Brev [launchable](https://brev.nvidia.com/launchable/deploy?launchableID=env-36ikQZX0ZDTSCGE7YkqxiOKwKsj).

## Further Reading
- [Share your ideas](http://nemotron.ideas.nvidia.com/?ncid=so-othe-692335) and vote on what matters to help shape the future of Nemotron.
- Stay up to date on [NVIDIA Nemotron](https://developer.nvidia.com/nemotron) by subscribing to NVIDIA news and following NVIDIA AI on [LinkedIn](https://www.linkedin.com/showcase/nvidia-ai/posts/?feedView=all), [X](https://x.com/NVIDIAAIDev), [YouTube](https://www.youtube.com/@NVIDIADeveloper), and the [Nemotron channel](https://discord.com/channels/1019361803752456192/1407781691698708682) on [Discord](https://discord.com/invite/nvidiadeveloper).

## Acknowledgement

We thank all contributors for their efforts in developing and integrating Nemotron 3 Nano into SGLang.

**NVIDIA Team**: Roi Koren, Max Xu, Netanel Haber, Tomer Bar Natan, Daniel Afrimi, Nirmal Kumar Juluru, Ann Guan, and many more

**SGLang Team and community**: Baizhou Zhang, Jiajun Li, Ke Bao, Mingyi Lu, Richard Chen