Commit 7b8e63e

break word (#143)
* break word
* no message
* Add files via upload
* minor
* no message

Co-authored-by: Cheng Wan <[email protected]>
1 parent ed9cc54 commit 7b8e63e

File tree

4 files changed: +6 -2 lines


blog/2025-05-05-large-scale-ep.md

Lines changed: 2 additions & 2 deletions
@@ -53,7 +53,7 @@ DeepSeek employs **Multi-head Latent Attention (MLA)** to effectively model comp
 Despite using only three dense FFN layers, DeepSeek-V3's computation can significantly increase peak memory usage, potentially leading to system crashes if not carefully managed. To address this, we adopt **Data Parallelism (DP)** over tensor parallelism (TP), leveraging the following advantages:

 - **Enhanced Scalability**: With an intermediate dimension of 18,432, high TP degrees (e.g., TP32) result in inefficient fragmentation into small-unit segments (e.g., 576 units), which are not divisible by 128—a common alignment boundary for modern GPUs such as H100. This misalignment hampers computational efficiency and memory utilization. DP provides a more scalable solution by avoiding fragmentation, ensuring balanced workload distribution across devices.
-- **Optimized Memory Efficiency**: Traditionally, TP reduces memory usage as worker size increases, but this advantage diminishes under DP attention. In a pure TP setup, memory demand for a single-layer Transformer model scales with DP size as: $$\text{Memory}=\frac{N_{\text{param}}}{\text{TP}}+(1+k)N_{\text{hidden\_state}}\cdot \text{DP}\notag$$ Here, $N_{\text{hidden\_state}}=n_\text{token}\times n_\text{intermediate\_size}$ is the size of the hidden state on each device (DP rank), $N_{\text{param}}=n_\text{hidden\_size}\times n_\text{intermediate\_size}$ is the number of model parameters, and $k$ is a coefficient representing extra memory overhead from CUDA Graph duplication. By assuming $\text{DP}=\text{TP}$, this memory usage function is minimized when $\text{TP}=\sqrt{\frac{N_{\text{param}}}{(1+k)N_{\text{hidden\_state}}}}$. DeepSeek-V3 uses an intermediate size of 18,432. During the prefill phase, CUDA Graph is typically disabled, so $k = 0$. However, the token size per device can easily exceed 2,048, resulting in an optimal TP size of 3 or less. In the decode phase, a practical configuration might use 128 tokens per device and set $k = 3$. In this case, the memory-optimal TP size is 6. In both phases, a lower TP degree minimizes memory usage per device. As a result, DP may offer a more memory-efficient approach for scaling compared to relying solely on TP.
+- **Optimized Memory Efficiency**: Traditionally, TP reduces memory usage as worker size increases, but this advantage diminishes under DP attention. In a pure TP setup, memory demand for a single-layer Transformer model scales with DP size as: $$\text{Memory}=\frac{N_{\text{param}}}{\text{TP}}+(1+k)N_{\text{hidden\_state}}\cdot \text{DP}\notag$$ Here, $N_{\text{hidden\_state}}=n_\text{token}\times n_\text{hidden\_size}$ is the size of the hidden state on each device (DP rank), $N_{\text{param}}=n_\text{intermediate\_size}\times n_\text{hidden\_size}$ is the number of model parameters, and $k$ is a coefficient representing extra memory overhead from CUDA Graph duplication. By assuming $\text{DP}=\text{TP}$, this memory usage function is minimized when $\text{TP}=\sqrt{\frac{N_{\text{param}}}{(1+k)N_{\text{hidden\_state}}}}$. DeepSeek-V3 uses an intermediate size of 18,432. During the prefill phase, CUDA Graph is typically disabled, so $k = 0$. However, the token size per device can easily exceed 2,048, resulting in an optimal TP size of 3 or less. In the decode phase, a practical configuration might use 128 tokens per device and set $k = 3$. In this case, the memory-optimal TP size is 6. In both phases, a lower TP degree minimizes memory usage per device. As a result, DP may offer a more memory-efficient approach for scaling compared to relying solely on TP.
 - **Minimized Communication Overhead**: In pure TP, each FFN necessitates two all-reduce operations, resulting in substantial communication overhead. By leveraging DP, we optimize this process to a single reduce-scatter following the prior attention layer and an all-gather before the next, reducing communication costs by 50%. Furthermore, when attention is also computed under pure DP, inter-device communication is entirely eliminated, significantly enhancing overall efficiency.

 The integration of DP dense FFN with DP attention is illustrated in the left figure below. Users can enable this feature by setting `--moe-dense-tp-size=1`.
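The TP numbers quoted in the "Optimized Memory Efficiency" bullet above follow directly from the formula in this hunk: with $\text{DP}=\text{TP}$, minimizing $\frac{N_{\text{param}}}{\text{TP}}+(1+k)N_{\text{hidden\_state}}\cdot\text{TP}$ gives $\text{TP}=\sqrt{\frac{N_{\text{param}}}{(1+k)N_{\text{hidden\_state}}}}$, and because both $N_{\text{param}}$ and $N_{\text{hidden\_state}}$ contain a factor of $n_\text{hidden\_size}$, that factor cancels. The small Python sketch below (a hypothetical helper, not part of SGLang) reproduces the "3 or less" prefill and "6" decode figures under exactly those assumptions:

```python
import math

def optimal_tp(intermediate_size: int, tokens_per_device: int, k: float) -> float:
    """Memory-optimal TP degree for a single layer, assuming DP == TP.

    Memory(TP) = N_param / TP + (1 + k) * N_hidden_state * TP is minimized at
    TP = sqrt(N_param / ((1 + k) * N_hidden_state)); the hidden size cancels,
    leaving sqrt(intermediate_size / ((1 + k) * tokens_per_device)).
    """
    return math.sqrt(intermediate_size / ((1 + k) * tokens_per_device))

# Prefill: CUDA Graph typically disabled (k = 0), >= 2,048 tokens per device.
print(optimal_tp(18432, 2048, k=0))  # 3.0 -> "optimal TP size of 3 or less"

# Decode: 128 tokens per device, k = 3 for CUDA Graph duplication.
print(optimal_tp(18432, 128, k=3))   # 6.0 -> "memory-optimal TP size is 6"
```

For larger prefill batches the optimum drops below 3, which is why low-TP or pure-DP dense FFNs are treated here as the memory-friendly choice.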
@@ -441,7 +441,7 @@ While our implementation of SGLang for DeepSeek-V3 inference demonstrates signif
 2. **Sequence Length Constraints**: Limited to shorter sequences due to the use of 96 GPUs. Expanding GPU resources would support longer sequences, essential for specific applications.
 3. **Multi-Token Prediction (MTP) Integration**: SGLang supports MTP but lacks full integration with DP attention, reducing efficiency in mixed parallelism configurations.
 4. **EPLB Distribution**: The experiments in this blog utilize in-distribution data for the Expert Parallelism Load Balancer (EPLB), which may not reflect real-world variability. Future work should evaluate performance under distribution shifts.
-5. **Flexible Tensor Parallelism (TP) Sizes**: For DeepSeek-V3, memory-optimal TP sizes are small but larger than 1. Currently, SGLang only supports pure TP or DP, leading to suboptimal memory use. Flexible TP options are needed.
+5. **Flexible Tensor Parallelism (TP) Sizes**: For DeepSeek-V3, memory-optimal TP sizes for dense FFNs are small but larger than 1. Currently, SGLang only supports pure TP or DP, leading to suboptimal memory use. Flexible TP options are needed.
 6. **Blackwell Support**: Currently, our implementation supports only the NVIDIA Hopper architecture. We are actively working to extend compatibility to the next-generation Blackwell architecture. If you are interested in supporting or sponsoring this development, you are welcome to contact [[email protected]](mailto:[email protected]).


[Two binary files changed (previews not rendered): 800 KB; -40 Bytes.]

src/styles/globals.css

Lines changed: 4 additions & 0 deletions
@@ -267,3 +267,7 @@ a:hover {
 a:hover {
   opacity: 75%;
 }
+
+table {
+  word-wrap: break-word;
+}
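The added `table { word-wrap: break-word; }` rule uses `word-wrap`, the legacy alias of the CSS `overflow-wrap` property; it lets long unbroken strings inside table cells break and wrap rather than overflow their column, in line with the commit title "break word".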
