---
title: "Optimizing GPT-OSS on NVIDIA DGX Spark: Getting the Most Out of Your Spark"
author: "Jerry Zhou"
date: "November 3, 2025"
previewImg: /images/blog/gpt_oss_on_nvidia_dgx_spark/preview.jpg
---

We’ve got some exciting updates about the **NVIDIA DGX Spark**! In the week following the official launch, we collaborated closely with NVIDIA and brought **GPT-OSS 20B** and **GPT-OSS 120B** support to **SGLang** on the DGX Spark. The results are impressive: around **70 tokens/s** on GPT-OSS 20B and **50 tokens/s** on GPT-OSS 120B, which is state-of-the-art so far and makes running a **local coding agent** on the DGX Spark fully viable.

![](/images/blog/gpt_oss_on_nvidia_dgx_spark/demo_1.png)

> We’ve updated our detailed benchmark results <a href="https://docs.google.com/spreadsheets/d/1SF1u0J2vJ-ou-R_Ry1JZQ0iscOZL8UKHpdVFr85tNLU/edit?usp=sharing" target="_blank">here</a>; check out our demo video <a href="https://youtu.be/ApIVoTuWIss" target="_blank">here</a>.

In this post, you’ll learn how to:

* Run GPT-OSS 20B or 120B with SGLang on the DGX Spark
* Benchmark performance locally
* Hook it up to **Open WebUI** for chatting
* Even run **Claude Code** entirely locally via **LMRouter**

## 1. Preparing the Environment

Before launching SGLang, make sure you have the proper **tiktoken encodings** for OpenAI Harmony:

```bash
mkdir -p ~/tiktoken_encodings
wget -O ~/tiktoken_encodings/o200k_base.tiktoken "https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken"
wget -O ~/tiktoken_encodings/cl100k_base.tiktoken "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken"
```

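As an optional sanity check, you can confirm that both encoding files were downloaded before moving on:

```bash
# Both o200k_base.tiktoken and cl100k_base.tiktoken should be listed
ls -lh ~/tiktoken_encodings
```
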
## 2. Launching SGLang with Docker

Now, launch the SGLang server with the following command:

```bash
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v ~/tiktoken_encodings:/tiktoken_encodings \
  --env "HF_TOKEN=<secret>" \
  --env "TIKTOKEN_ENCODINGS_BASE=/tiktoken_encodings" \
  --ipc=host \
  lmsysorg/sglang:spark \
  python3 -m sglang.launch_server --model-path openai/gpt-oss-20b --host 0.0.0.0 --port 30000 --reasoning-parser gpt-oss --tool-call-parser gpt-oss
```

Replace `<secret>` with your **Hugging Face access token**. If you’d like to run **GPT-OSS 120B**, simply change the model path to `openai/gpt-oss-120b`. This model is roughly 6× larger than the 20B version, so it will take a bit longer to load. For best performance and stability, consider enabling **swap memory** on your DGX Spark.

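If you haven’t set up swap yet, here is a minimal sketch of a standard Linux swapfile setup; the 64 GB size and the `/swapfile` path are placeholder choices, so adjust them to your disk space and workload:

```bash
# Create and enable a 64 GB swapfile (size is a placeholder; pick what fits your disk)
sudo fallocate -l 64G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Verify that the swap space is active
swapon --show
```

To keep the swapfile across reboots, add a corresponding entry to `/etc/fstab`.
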
## 3. Testing the Server

Once SGLang is running, you can send OpenAI-compatible requests directly:

```bash
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "How many letters are there in the word SGLang?"
      }
    ]
  }'
```

![](/images/blog/gpt_oss_on_nvidia_dgx_spark/demo_2.jpg)

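If you only want the reply text rather than the full JSON response, you can pipe the output through `jq` (assuming `jq` is installed; the field names follow the standard OpenAI chat completions schema):

```bash
# Extract just the assistant's message from the response
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "How many letters are there in the word SGLang?"}]}' \
  | jq -r '.choices[0].message.content'
```
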
## 4. Benchmarking Performance

A quick way to benchmark throughput is to request a long output, such as:

```bash
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Generate a long story. The only requirement is long."
      }
    ]
  }'
```

You should see around **70 tokens per second** with GPT-OSS 20B under typical conditions.

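For a more systematic measurement, SGLang also ships a serving benchmark you can point at the same endpoint. Below is a minimal sketch that runs it from inside the running container; the container name and request count are placeholders, and the available flags vary across SGLang versions, so check `python3 -m sglang.bench_serving --help` for the options your image supports:

```bash
# Open a shell in the running SGLang container (find <container> via `docker ps`)
# and fire a small batch of benchmark requests at the local server
docker exec -it <container> \
  python3 -m sglang.bench_serving \
    --backend sglang \
    --host 127.0.0.1 --port 30000 \
    --num-prompts 16
```
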
## 5. Running a Local Chatbot (Open WebUI)

To set up a friendly local chat interface, you can install **Open WebUI** on your DGX Spark and point it to your running SGLang backend: `http://localhost:30000/v1`. Follow the <a href="https://github.com/open-webui/open-webui?tab=readme-ov-file#how-to-install-" target="_blank">Open WebUI installation instructions</a> to get it up and running. Once connected, you’ll be able to chat seamlessly with your local GPT-OSS instance. No internet required.

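As one possible setup, you can run Open WebUI in Docker and point it at SGLang via environment variables. This is a sketch based on Open WebUI’s Docker instructions; the `host.docker.internal` mapping, the port choice, and the dummy API key are assumptions you may need to adjust:

```bash
docker run -d \
  -p 8080:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OPENAI_API_BASE_URL=http://host.docker.internal:30000/v1 \
  -e OPENAI_API_KEY=sk-local \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
```

Then open `http://localhost:8080` in your browser and start chatting.
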
![](/images/blog/gpt_oss_on_nvidia_dgx_spark/demo_3.jpg)

## 6. Running Claude Code Entirely Locally

With a local GPT-OSS model running, you can even connect **Claude Code** through <a href="https://github.com/LMRouter/lmrouter" target="_blank">**LMRouter**</a>, which translates Anthropic-style requests into OpenAI-compatible ones.

### Step 1: Create the LMRouter Config

Save <a href="https://gist.github.com/yvbbrjdr/0514a32124682f97370dda9c09c3349c" target="_blank">this file</a> as `lmrouter-sglang.yaml`.

### Step 2: Launch LMRouter

Install <a href="https://pnpm.io/installation" target="_blank">**pnpm**</a> (if not already installed), then run:

```bash
pnpx @lmrouter/cli lmrouter-sglang.yaml
```

### Step 3: Start Claude Code

Install **Claude Code** following its <a href="https://www.claude.com/product/claude-code" target="_blank">setup guide</a>, then launch it as follows:

```bash
ANTHROPIC_BASE_URL=http://localhost:3000/anthropic \
ANTHROPIC_AUTH_TOKEN=sk-sglang claude
```

That’s it! You can now use **Claude Code locally**, powered entirely by **GPT-OSS 20B or 120B on your DGX Spark**.

![](/images/blog/gpt_oss_on_nvidia_dgx_spark/demo_4.jpg)

## 7. Conclusion

With these steps, you can fully unlock the potential of the **DGX Spark**, turning it into a local AI powerhouse capable of running models with tens of billions of parameters interactively.