We currently support Eagle2 and Eagle3. Next, we plan to further optimize the kernel implementation and to add support for different attention backends at the various MTP stages.
## TPU Performance
After all the optimizations, SGLang-Jax matches or outperforms other TPU inference solutions.
### Setup
To highlight SGLang-Jax’s efficiency on TPUs, we benchmarked it against vLLM-TPU using the `Qwen/Qwen3-32B` model on a TPU v6e-4 host. We tested SGLang-Jax (commit: `main-af32f095880ff676ed23eec19bc79584b5e20717`) and vLLM-TPU (version `vllm-tpu==0.11.1`).
We also compared TPU v6e-4 against H100 GPUs, running SGLang-Jax on the TPU and SGLang on the GPUs.

Full benchmark instructions are available [here](https://github.com/sgl-project/sglang-jax/issues/270).
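The results below are reported in terms of inter-token latency (ITL) and input/output throughput, alongside time to first token (TTFT). As a reading aid, here is a minimal sketch of how these metrics are commonly computed from per-request timestamps. The `RequestTrace` record and `summarize` function are hypothetical, and the exact definitions used by the benchmark scripts in the linked instructions may differ (for example, ITL is pooled over all token gaps here rather than averaged per request first).

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RequestTrace:
    """Hypothetical per-request record from a benchmark run (times in seconds)."""
    arrival_s: float            # when the request was sent
    token_times_s: list[float]  # when each output token was received, in order (assumed non-empty)
    prompt_tokens: int          # number of input (prompt) tokens


def summarize(traces: list[RequestTrace]) -> dict[str, float]:
    """Compute common serving metrics over a finished benchmark run."""
    start = min(t.arrival_s for t in traces)
    end = max(t.token_times_s[-1] for t in traces)
    duration = end - start  # wall-clock span of the whole run

    # TTFT: delay between sending a request and receiving its first token.
    ttfts = [t.token_times_s[0] - t.arrival_s for t in traces]
    # ITL: gaps between consecutive output tokens, pooled across all requests.
    gaps = [b - a for t in traces for a, b in zip(t.token_times_s, t.token_times_s[1:])]

    return {
        "mean_ttft_s": mean(ttfts),
        "mean_itl_s": mean(gaps) if gaps else 0.0,
        # Throughput: prompt tokens processed / output tokens generated per second of run time.
        "input_tok_per_s": sum(t.prompt_tokens for t in traces) / duration,
        "output_tok_per_s": sum(len(t.token_times_s) for t in traces) / duration,
    }


if __name__ == "__main__":
    # Two illustrative requests; timestamps are made up, not benchmark data.
    run = [
        RequestTrace(arrival_s=0.0, token_times_s=[0.5, 0.75, 1.0], prompt_tokens=100),
        RequestTrace(arrival_s=0.25, token_times_s=[1.0, 2.0], prompt_tokens=300),
    ]
    print(summarize(run))
```

Under these definitions, prefill speed shows up mainly in TTFT and input throughput, while per-step decode overhead shows up in ITL and output throughput.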
### Results
On Qwen3-32B, SGLang-Jax and vLLM-TPU deliver nearly identical prefill performance, while SGLang-Jax pulls slightly ahead during decoding. Both use similar kernels, which explains their comparable input throughput. SGLang-Jax, however, also supports an overlap scheduler, which reduces inter-token latency (ITL) and boosts output throughput.

<p style="color:gray; text-align: center;">SGLang-Jax vs. vLLM-TPU on TPU v6e</p>

For the TPU-versus-GPU comparison, we pitted four v6e chips against two H100s, a configuration that is roughly comparable in price, HBM capacity, and peak bf16 TFLOPS. The TPU setup consistently achieves higher input throughput and also delivers higher output throughput in several scenarios.
<p style="color:gray; text-align: center;">SGLang-Jax on TPU vs. SGLang on GPU</p>
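The overlap scheduler credited for SGLang-Jax's decode advantage over vLLM-TPU can be illustrated with a toy timing model. This is a simplified sketch under stated assumptions, not SGLang-Jax's actual scheduler: it assumes every decode step has a fixed host-side cost (`cpu_ms`, e.g. scheduling and building the next batch) and a fixed device cost (`device_ms`), and it ignores the dependency of the next batch on freshly sampled tokens, which a real overlap scheduler has to handle. `StepCosts`, `serial_timeline`, and `overlapped_timeline` are hypothetical names, and the numbers are illustrative, not measurements.

```python
from dataclasses import dataclass


@dataclass
class StepCosts:
    """Hypothetical fixed per-step costs for one decode iteration (milliseconds)."""
    cpu_ms: float     # host work: scheduling, building the next batch, bookkeeping
    device_ms: float  # accelerator work: the forward pass for the batch


def serial_timeline(costs: StepCosts, steps: int) -> float:
    """Total time when the host prepares step i+1 only after step i finishes on device."""
    return steps * (costs.cpu_ms + costs.device_ms)


def overlapped_timeline(costs: StepCosts, steps: int) -> float:
    """Total time when host work for step i+1 runs while step i executes on device.

    Modeled as a two-stage pipeline: the first step pays both stages in full,
    and every later step adds only the slower of the two stages.
    """
    if steps == 0:
        return 0.0
    return costs.cpu_ms + costs.device_ms + (steps - 1) * max(costs.cpu_ms, costs.device_ms)


if __name__ == "__main__":
    costs = StepCosts(cpu_ms=2.0, device_ms=10.0)  # illustrative numbers, not measurements
    steps = 100
    for name, total in [("serial", serial_timeline(costs, steps)),
                        ("overlapped", overlapped_timeline(costs, steps))]:
        # Average time per decode step, a rough stand-in for ITL in this toy model.
        print(f"{name:>10}: {total:.1f} ms total, {total / steps:.2f} ms per step")
```

With these made-up costs the model gives 12.00 ms per step serially versus 10.02 ms when overlapped, because the 2 ms of host work hides behind the 10 ms device step. In general the model predicts per-step time falling from `cpu_ms + device_ms` to roughly `max(cpu_ms, device_ms)`, so the relative gain is largest when the two costs are of similar size.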