Commit d5f837e: apply reviews

1 parent: ea0fca2

2 files changed: +12 −12 lines


docs/sphinx_doc/source/tutorial/trinity_trainer_configs.md

Lines changed: 6 additions & 6 deletions
@@ -1,7 +1,7 @@
 # Trainer Parameter Configuration Guide
 
 This document provides recommended training configurations for Qwen3 series models on **NVIDIA A100 80GB** and **H20 96GB** GPUs.
-Based on model size (0.6B ~ 14B) and context length (`max_model_len`), we present feasible Trainer module setups across varying numbers of GPUs.
+Based on model size (0.6B ~ 14B) and context length (`model.max_model_len`), we present feasible Trainer module setups across varying numbers of GPUs.
 
 > 💡 **Terminology**
 >
@@ -12,8 +12,8 @@ Based on model size (0.6B ~ 14B) and context length (`max_model_len`), we presen
 > ```
 > - **Offload**: Enable **FSDP v2 + CPU Offload** to reduce GPU memory usage.
 > - **SP=N**: Use **Sequence Parallelism** with parallelism degree N (typically N ≤ number of GPUs).
-> - **Combined entries (e.g., `Env SP=2`)**: All listed conditions must be satisfied simultaneously.
-> - **“-”**: The combination of current hardware and configuration **cannot support training** for this model + sequence length.
+> - **Combined entries (e.g., `Env + SP=2`)**: All listed conditions must be satisfied simultaneously.
+> - **“-”**: The combination of current hardware and configuration **cannot support training** for this model under the given sequence length.
 
 ---
 
@@ -37,7 +37,7 @@ model:
 
 ---
 
-## 🖥️ A100 80GB GPU Configuration Recommendations
+## A100 80GB GPU Configuration Recommendations
 
 > ⚠️ **Single-GPU Limitation**: Training models ≥4B or with context lengths >20K on a single A100 GPU places extreme pressure on VRAM. **Multi-GPU setups are strongly recommended**.
@@ -138,7 +138,7 @@ model:
 
 ---
 
-## 🧊 H20 96GB GPU Configuration Recommendations
+## H20 96GB GPU Configuration Recommendations
 
 The H20 has larger VRAM (96GB) but lower compute performance compared to the A100.
 
@@ -253,5 +253,5 @@ The H20 has larger VRAM (96GB) but lower compute performance compared to the A10
 - Step 1: Set `export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`
 - Step 2: Increase **Sequence Parallelism (SP)**
 - Step 3: Enable **FSDP v2 + CPU Offload**
-4. **Choosing SP parallelism degree**: Prefer values that are **common divisors of both GPU count and attention head count** (e.g., 2, 4) to avoid communication bottlenecks.
+4. **Choosing SP parallelism degree**: Prefer values that are **common divisors of both GPU count and attention head count** (e.g., 2, 4).
 5. **Prefer multi-GPU over single-GPU**: Even when VRAM appears sufficient, multi-GPU setups improve training efficiency and stability through parallelization.
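The "common divisors" rule in tip 4 can be sketched in a few lines of Python. `sp_candidates` is a hypothetical helper written for illustration, not a function from the project:

```python
# Hypothetical helper illustrating tip 4: valid SP (Sequence Parallelism)
# degrees are values that divide both the GPU count and the model's
# attention head count.

def sp_candidates(num_gpus: int, num_heads: int) -> list[int]:
    """Return the common divisors of num_gpus and num_heads, ascending."""
    return [
        d
        for d in range(1, min(num_gpus, num_heads) + 1)
        if num_gpus % d == 0 and num_heads % d == 0
    ]

# Illustrative numbers: 8 GPUs, a model with 16 attention heads.
print(sp_candidates(8, 16))  # → [1, 2, 4, 8]
```

Per the guide's note that typically N ≤ number of GPUs, any of these values would be admissible; the tables above then narrow the choice further by VRAM pressure.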

docs/sphinx_doc/source_zh/tutorial/trinity_trainer_configs.md

Lines changed: 6 additions & 6 deletions
@@ -1,7 +1,7 @@
 # Trainer Parameter Configuration Guide
 
 This document provides recommended training configurations for Qwen3 series models on **NVIDIA A100 80GB** and **H20 96GB** GPUs.
-Based on model size (0.6B ~ 14B) and context length (`max_model_len`), we give feasible Trainer module setups across different numbers of GPUs.
+Based on model size (0.6B ~ 14B) and context length (`model.max_model_len`), we give feasible Trainer module setups across different numbers of GPUs.
 
 > 💡 **Terminology**
 >
@@ -12,8 +12,8 @@
 > ```
 > - **Offload**: Enable **FSDP v2 + CPU Offload** to save GPU memory.
 > - **SP=N**: Use **Sequence Parallelism** with parallelism degree N (typically N ≤ number of GPUs).
-> - **Combined entries (e.g., `Env SP=2`)**: All listed conditions must be satisfied simultaneously.
-> - **“-”**: Under the current hardware and configuration, **training this model + sequence length is not supported**.
+> - **Combined entries (e.g., `Env + SP=2`)**: All listed conditions must be satisfied simultaneously.
+> - **“-”**: Under the current hardware and configuration, training this model at this sequence length is not supported.
 
 ---
 
@@ -37,7 +37,7 @@ model:
 
 ---
 
-## 🖥️ A100 80GB GPU Configuration Recommendations
+## A100 80GB GPU Configuration Recommendations
 
 > ⚠️ **Single-GPU Limitation**: Training a ≥4B model or >20K context on a single A100 puts extreme pressure on VRAM; **multi-GPU setups are strongly recommended**.
@@ -138,7 +138,7 @@ model:
 
 ---
 
-## 🧊 H20 96GB GPU Configuration Recommendations
+## H20 96GB GPU Configuration Recommendations
 
 The H20 has larger VRAM (96GB) but weaker compute than the A100.
 
@@ -253,5 +253,5 @@ The H20 has larger VRAM (96GB) but weaker compute than the A100.
 - Step 1: Set `export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`
 - Step 2: Increase **Sequence Parallelism (SP)**
 - Step 3: Enable **FSDP v2 + CPU Offload**
-4. **Choosing the SP parallelism degree**: Prefer **common divisors of the GPU count and attention head count** (e.g., 2, 4) to avoid communication bottlenecks.
+4. **Choosing the SP parallelism degree**: Prefer **common divisors of the GPU count and attention head count** (e.g., 2, 4).
 5. **Prefer multi-GPU over single-GPU**: Even with sufficient VRAM, multi-GPU setups improve training efficiency and stability through parallelism.
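For orientation, the options the diffs refer to might sit in a trainer config roughly as follows. Only `model.max_model_len` appears in the changed docs; the offload and sequence-parallel key names below are illustrative placeholders, not confirmed Trinity option names:

```yaml
model:
  max_model_len: 20480        # context length discussed throughout the guide
trainer:
  # Placeholder keys -- consult the project's trainer reference for real names.
  fsdp_version: 2             # the "Offload" entries assume FSDP v2
  cpu_offload: true           # plus CPU offload to reduce VRAM pressure
  sequence_parallel_size: 2   # corresponds to "SP=2" entries in the tables
```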
