---
title: "Optimizing GPT-OSS on NVIDIA DGX Spark: Getting the Most Out of Your Spark"
author: "Jerry Zhou"
date: "November 3, 2025"
previewImg: /images/blog/gpt_oss_on_nvidia_dgx_spark/preview.jpg
---

We’ve got some exciting updates about the **NVIDIA DGX Spark**! In the week following the official launch, we collaborated closely with NVIDIA and brought **GPT-OSS 20B** and **GPT-OSS 120B** support to **SGLang** on the DGX Spark. The results are impressive: around **70 tokens/s** on GPT-OSS 20B and **50 tokens/s** on GPT-OSS 120B, state-of-the-art so far, which makes running a **local coding agent** on the DGX Spark fully viable.

> We’ve updated our detailed benchmark results <a href="https://docs.google.com/spreadsheets/d/1SF1u0J2vJ-ou-R_Ry1JZQ0iscOZL8UKHpdVFr85tNLU/edit?usp=sharing" target="_blank">here</a>, and check out our demo video <a href="https://youtu.be/ApIVoTuWIss" target="_blank">here</a>.

In this post, you’ll learn how to:

* Run GPT-OSS 20B or 120B with SGLang on the DGX Spark
* Benchmark performance locally
* Hook it up to **Open WebUI** for chatting
* Even run **Claude Code** entirely locally via **LMRouter**

## 1. Preparing the Environment

Before launching SGLang, make sure you have the proper **tiktoken encodings** for OpenAI Harmony:

```bash
mkdir -p ~/tiktoken_encodings
wget -O ~/tiktoken_encodings/o200k_base.tiktoken "https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken"
wget -O ~/tiktoken_encodings/cl100k_base.tiktoken "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken"
```
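
A quick sanity check that both encoding files made it to disk:

```bash
# Both o200k_base.tiktoken and cl100k_base.tiktoken should be listed
ls -lh ~/tiktoken_encodings
```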

## 2. Launching SGLang with Docker

Now, launch the SGLang server with the following command:

```bash
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface -v ~/tiktoken_encodings:/tiktoken_encodings \
    --env "HF_TOKEN=<secret>" --env "TIKTOKEN_ENCODINGS_BASE=/tiktoken_encodings" \
    --ipc=host \
    lmsysorg/sglang:spark \
    python3 -m sglang.launch_server --model-path openai/gpt-oss-20b --host 0.0.0.0 --port 30000 --reasoning-parser gpt-oss --tool-call-parser gpt-oss
```

Replace `<secret>` with your **Hugging Face access token**. If you’d like to run **GPT-OSS 120B**, simply change the model path to `openai/gpt-oss-120b`. This model is roughly 6× larger than the 20B version, so it will take a bit longer to load. For best performance and stability, consider enabling **swap memory** on your DGX Spark; one way to set that up is sketched below.
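
If you haven’t configured swap before, a common approach on Ubuntu-based systems is a swap file. The 64 GB size below is purely illustrative; size it to your free disk space.

```bash
# Create and enable a swap file (size is illustrative; adjust as needed)
sudo fallocate -l 64G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Optionally make it persist across reboots
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
```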

## 3. Testing the Server

Once SGLang is running, you can send OpenAI-compatible requests directly:

```bash
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "How many letters are there in the word SGLang?"
      }
    ]
  }'
```
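
The same endpoint also accepts the standard OpenAI `stream` option if you’d rather watch tokens arrive as they are generated. A minimal sketch (responses come back as OpenAI-style server-sent events):

```bash
# -N disables curl's output buffering so chunks print as they stream in
curl -N http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "stream": true,
    "messages": [
      {"role": "user", "content": "How many letters are there in the word SGLang?"}
    ]
  }'
```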

## 4. Benchmarking Performance

A quick way to benchmark throughput is to request a long output, such as:

```bash
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Generate a long story. The only requirement is long."
      }
    ]
  }'
```

You should see around **70 tokens per second** with GPT-OSS 20B under typical conditions.
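
If you want an actual number rather than eyeballing the output, one rough approach is to time a single long, non-streamed request and divide the completion token count reported in the response’s `usage` field by the elapsed wall-clock time. This is only an approximation (it folds prefill time into the average) and assumes `jq` and `bc` are installed:

```bash
# Rough decode-throughput estimate from one long request
START=$(date +%s.%N)
RESPONSE=$(curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Generate a long story. The only requirement is long."}
    ]
  }')
END=$(date +%s.%N)

# completion_tokens comes from the OpenAI-style usage field in the response
TOKENS=$(echo "$RESPONSE" | jq '.usage.completion_tokens')
echo "~$(echo "scale=1; $TOKENS / ($END - $START)" | bc) tokens/s"
```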

## 5. Running a Local Chatbot (Open WebUI)

To set up a friendly local chat interface, you can install **Open WebUI** on your DGX Spark and point it at your running SGLang backend at `http://localhost:30000/v1`. Follow the <a href="https://github.com/open-webui/open-webui?tab=readme-ov-file#how-to-install-" target="_blank">Open WebUI installation instructions</a> to get it up and running; one possible Docker invocation is sketched below. Once connected, you’ll be able to chat seamlessly with your local GPT-OSS instance. No internet required.
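
As a concrete starting point, Open WebUI can be launched with Docker and pointed at the SGLang endpoint through its `OPENAI_API_BASE_URL` environment variable. The flags below follow the Open WebUI README at the time of writing, so double-check them against the current docs; the UI is mapped to host port 8080 here so it won’t collide with LMRouter’s default port 3000 used in the next section.

```bash
# Open WebUI in Docker, talking to SGLang running on the host
docker run -d -p 8080:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OPENAI_API_BASE_URL=http://host.docker.internal:30000/v1 \
  -e OPENAI_API_KEY=sk-sglang \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
```

The `OPENAI_API_KEY` value is just a placeholder since the local SGLang server doesn’t check it. Then open `http://localhost:8080` on the Spark and select the GPT-OSS model.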

## 6. Running Claude Code Entirely Locally

With a local GPT-OSS model running, you can even connect **Claude Code** through <a href="https://github.com/LMRouter/lmrouter" target="_blank">**LMRouter**</a>, which translates Anthropic-style requests into OpenAI-compatible ones.

### Step 1: Create the LMRouter Config

Save <a href="https://gist.github.com/yvbbrjdr/0514a32124682f97370dda9c09c3349c" target="_blank">this file</a> as `lmrouter-sglang.yaml`.

### Step 2: Launch LMRouter

Install <a href="https://pnpm.io/installation" target="_blank">**pnpm**</a> (if not already installed), then run:

```bash
pnpx @lmrouter/cli lmrouter-sglang.yaml
```

### Step 3: Start Claude Code

Install **Claude Code** following its <a href="https://www.claude.com/product/claude-code" target="_blank">setup guide</a>, then launch it as follows:

```bash
ANTHROPIC_BASE_URL=http://localhost:3000/anthropic \
ANTHROPIC_AUTH_TOKEN=sk-sglang claude
```

That’s it! You can now use **Claude Code locally**, powered entirely by **GPT-OSS 20B or 120B on your DGX Spark**.

## 7. Conclusion

With these steps, you can fully unlock the potential of the **DGX Spark**, turning it into a local AI powerhouse that runs models with tens of billions of parameters interactively.