blog/2025-05-05-large-scale-ep.md (2 additions, 2 deletions)

@@ -53,7 +53,7 @@ DeepSeek employs **Multi-head Latent Attention (MLA)** to effectively model comp
Despite using only three dense FFN layers, DeepSeek-V3's computation can significantly increase peak memory usage, potentially leading to system crashes if not carefully managed. To address this, we adopt **Data Parallelism (DP)** over tensor parallelism (TP), leveraging the following advantages:

- **Enhanced Scalability**: With an intermediate dimension of 18,432, high TP degrees (e.g., TP32) result in inefficient fragmentation into small-unit segments (e.g., 576 units), which are not divisible by 128, a common alignment boundary for modern GPUs such as H100. This misalignment hampers computational efficiency and memory utilization. DP provides a more scalable solution by avoiding fragmentation, ensuring balanced workload distribution across devices.
- **Optimized Memory Efficiency**: Traditionally, TP reduces memory usage as worker size increases, but this advantage diminishes under DP attention. In a pure TP setup, memory demand for a single-layer Transformer model scales with DP size as: $$\text{Memory}=\frac{N_{\text{param}}}{\text{TP}}+(1+k)N_{\text{hidden\_state}}\cdot \text{DP}\notag$$ Here, $N_{\text{hidden\_state}}=n_\text{token}\times n_\text{hidden\_size}$ is the size of the hidden state on each device (DP rank), $N_{\text{param}}=n_\text{intermediate\_size}\times n_\text{hidden\_size}$ is the number of model parameters, and $k$ is a coefficient representing extra memory overhead from CUDA Graph duplication. By assuming $\text{DP}=\text{TP}$ and setting the derivative with respect to $\text{TP}$ to zero, this memory usage function is minimized when $\text{TP}=\sqrt{\frac{N_{\text{param}}}{(1+k)N_{\text{hidden\_state}}}}$. DeepSeek-V3 uses an intermediate size of 18,432. During the prefill phase, CUDA Graph is typically disabled, so $k = 0$. However, the number of tokens per device can easily exceed 2,048, resulting in an optimal TP size of 3 or less. In the decode phase, a practical configuration might use 128 tokens per device and set $k = 3$. In this case, the memory-optimal TP size is 6. In both phases, a lower TP degree minimizes memory usage per device. As a result, DP may offer a more memory-efficient approach for scaling compared to relying solely on TP. A short numeric sketch after this list works through both cases.
- **Minimized Communication Overhead**: In pure TP, each FFN necessitates two all-reduce operations, resulting in substantial communication overhead. By leveraging DP, we optimize this process to a single reduce-scatter following the prior attention layer and an all-gather before the next, reducing communication costs by 50%. Furthermore, when attention is also computed under pure DP, inter-device communication is entirely eliminated, significantly enhancing overall efficiency.
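
To make the fragmentation and memory arguments above concrete, here is a minimal back-of-the-envelope sketch, not SGLang code: only the 18,432 intermediate dimension, the per-device token counts, and the values of $k$ come from this post, while the variable names and the simplified form $\text{TP}_\text{opt}=\sqrt{n_\text{intermediate\_size}/\bigl((1+k)\,n_\text{token}\bigr)}$ (the hidden size cancels out of the formula above) are illustrative.

```python
import math

# Illustrative back-of-the-envelope checks for the bullets above (not SGLang code).
INTERMEDIATE_SIZE = 18_432  # DeepSeek-V3 dense FFN intermediate dimension

# (1) Fragmentation under a high TP degree: each rank holds a slice of the
#     intermediate dimension, which ideally aligns to a 128-unit boundary.
shard = INTERMEDIATE_SIZE // 32   # TP32 -> 576 units per rank
print(shard, shard % 128)         # 576 64 -> not a multiple of 128

# (2) Memory-optimal TP from Memory = N_param / TP + (1 + k) * N_hidden_state * DP
#     with DP = TP. Since N_param = n_hidden * n_intermediate and
#     N_hidden_state = n_token * n_hidden, the hidden size cancels out.
def optimal_tp(tokens_per_device: int, k: float) -> float:
    return math.sqrt(INTERMEDIATE_SIZE / ((1 + k) * tokens_per_device))

print(optimal_tp(tokens_per_device=2048, k=0))  # 3.0 -> prefill (CUDA Graph off)
print(optimal_tp(tokens_per_device=128, k=3))   # 6.0 -> decode
```

Both outputs match the figures quoted above: an optimal TP of about 3 for prefill and 6 for decode.
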
The integration of DP dense FFN with DP attention is illustrated in the left figure below. Users can enable this feature by setting `--moe-dense-tp-size=1`.
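
As a usage illustration, a launch command might look like the sketch below. Only the `--moe-dense-tp-size=1` flag comes from this post; the entry point, model path, and remaining flags are assumptions that should be checked against the SGLang server arguments for your version.

```bash
# Sketch only: every flag except --moe-dense-tp-size is an assumption here.
python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --trust-remote-code \
  --tp 8 --dp 8 --enable-dp-attention \
  --moe-dense-tp-size 1
```
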
@@ -441,7 +441,7 @@ While our implementation of SGLang for DeepSeek-V3 inference demonstrates signif

2. **Sequence Length Constraints**: Limited to shorter sequences due to the use of 96 GPUs. Expanding GPU resources would support longer sequences, essential for specific applications.
3. **Multi-Token Prediction (MTP) Integration**: SGLang supports MTP but lacks full integration with DP attention, reducing efficiency in mixed parallelism configurations.
4. **EPLB Distribution**: The experiments in this blog use in-distribution data for the Expert Parallelism Load Balancer (EPLB), which may not reflect real-world variability. Future work should evaluate performance under distribution shift.
5. **Flexible Tensor Parallelism (TP) Sizes**: For DeepSeek-V3, memory-optimal TP sizes for dense FFNs are small but larger than 1. Currently, SGLang only supports pure TP or DP, leading to suboptimal memory use. Flexible TP options are needed.
6. **Blackwell Support**: Currently, our implementation supports only the NVIDIA Hopper architecture. We are actively working to extend compatibility to the next-generation Blackwell architecture. If you are interested in supporting or sponsoring this development, you are welcome to contact [[email protected]](mailto:[email protected]).