Commit d5f837e: apply reviews

1 parent: ea0fca2

2 files changed: +12 −12 lines


docs/sphinx_doc/source/tutorial/trinity_trainer_configs.md

Lines changed: 6 additions & 6 deletions
@@ -1,7 +1,7 @@
 # Trainer Parameter Configuration Guide
 
 This document provides recommended training configurations for Qwen3 series models on **NVIDIA A100 80GB** and **H20 96GB** GPUs.
-Based on model size (0.6B ~ 14B) and context length (`max_model_len`), we present feasible Trainer module setups across varying numbers of GPUs.
+Based on model size (0.6B ~ 14B) and context length (`model.max_model_len`), we present feasible Trainer module setups across varying numbers of GPUs.
 
 > 💡 **Terminology**
 >
@@ -12,8 +12,8 @@ Based on model size (0.6B ~ 14B) and context length (`max_model_len`), we presen
 > ```
 > - **Offload**: Enable **FSDP v2 + CPU Offload** to reduce GPU memory usage.
 > - **SP=N**: Use **Sequence Parallelism** with parallelism degree N (typically N ≤ number of GPUs).
-> - **Combined entries (e.g., `Env SP=2`)**: All listed conditions must be satisfied simultaneously.
-> - **“-”**: The combination of current hardware and configuration **cannot support training** for this model + sequence length.
+> - **Combined entries (e.g., `Env + SP=2`)**: All listed conditions must be satisfied simultaneously.
+> - **“-”**: The combination of current hardware and configuration **cannot support training** for this model under the given sequence length.
 
 ---
 
@@ -37,7 +37,7 @@ model:
 
 ---
 
-## 🖥️ A100 80GB GPU Configuration Recommendations
+## A100 80GB GPU Configuration Recommendations
 
 > ⚠️ **Single-GPU Limitation**: Training models ≥4B or with context lengths >20K on a single A100 GPU places extreme pressure on VRAM. **Multi-GPU setups are strongly recommended**.
@@ -138,7 +138,7 @@ model:
 
 ---
 
-## 🧊 H20 96GB GPU Configuration Recommendations
+## H20 96GB GPU Configuration Recommendations
 
 The H20 has larger VRAM (96GB) but lower compute performance compared to the A100.
 
@@ -253,5 +253,5 @@ The H20 has larger VRAM (96GB) but lower compute performance compared to the A10
 - Step 1: Set `export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`
 - Step 2: Increase **Sequence Parallelism (SP)**
 - Step 3: Enable **FSDP v2 + CPU Offload**
-4. **Choosing SP parallelism degree**: Prefer values that are **common divisors of both GPU count and attention head count** (e.g., 2, 4) to avoid communication bottlenecks.
+4. **Choosing SP parallelism degree**: Prefer values that are **common divisors of both GPU count and attention head count** (e.g., 2, 4).
 5. **Prefer multi-GPU over single-GPU**: Even when VRAM appears sufficient, multi-GPU setups improve training efficiency and stability through parallelization.
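The "common divisors" rule in tip 4 can be sketched in a few lines of Python. `sp_candidates` is a hypothetical helper written for illustration, not a function from the project:

```python
# Hypothetical helper illustrating tip 4: valid SP (Sequence Parallelism)
# degrees are values that divide both the GPU count and the model's
# attention head count.

def sp_candidates(num_gpus: int, num_heads: int) -> list[int]:
    """Return the common divisors of num_gpus and num_heads, ascending."""
    return [
        d
        for d in range(1, min(num_gpus, num_heads) + 1)
        if num_gpus % d == 0 and num_heads % d == 0
    ]

# Illustrative numbers: 8 GPUs, a model with 16 attention heads.
print(sp_candidates(8, 16))  # → [1, 2, 4, 8]
```

Per the guide's note that typically N ≤ number of GPUs, any of these values would be admissible; the tables above then narrow the choice further by VRAM pressure.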

docs/sphinx_doc/source_zh/tutorial/trinity_trainer_configs.md

Lines changed: 6 additions & 6 deletions
@@ -1,7 +1,7 @@
 # Trainer Parameter Configuration Guide
 
 This document provides recommended training configurations for Qwen3 series models on **NVIDIA A100 80GB** and **H20 96GB** GPUs.
-Based on model size (0.6B ~ 14B) and context length (`max_model_len`), we give feasible Trainer module setups across different numbers of GPUs.
+Based on model size (0.6B ~ 14B) and context length (`model.max_model_len`), we give feasible Trainer module setups across different numbers of GPUs.
 
 > 💡 **Terminology**
 >
@@ -12,8 +12,8 @@
 > ```
 > - **Offload**: Enable **FSDP v2 + CPU Offload** to save GPU memory.
 > - **SP=N**: Use **Sequence Parallelism** with parallelism degree N (typically N ≤ number of GPUs).
-> - **Combined entries (e.g., `Env SP=2`)**: All listed conditions must be satisfied simultaneously.
-> - **“-”**: Under the current hardware and configuration, **training this model + sequence length is not supported**.
+> - **Combined entries (e.g., `Env + SP=2`)**: All listed conditions must be satisfied simultaneously.
+> - **“-”**: Under the current hardware and configuration, training this model at this sequence length is not supported.
 
 ---
 
@@ -37,7 +37,7 @@ model:
 
 ---
 
-## 🖥️ A100 80GB GPU Configuration Recommendations
+## A100 80GB GPU Configuration Recommendations
 
 > ⚠️ **Single-GPU Limitation**: Training a ≥4B model or >20K context on a single A100 puts extreme pressure on VRAM; **multi-GPU setups are strongly recommended**.
@@ -138,7 +138,7 @@ model:
 
 ---
 
-## 🧊 H20 96GB GPU Configuration Recommendations
+## H20 96GB GPU Configuration Recommendations
 
 The H20 has larger VRAM (96GB) but weaker compute than the A100.
 
@@ -253,5 +253,5 @@ The H20 has larger VRAM (96GB) but weaker compute than the A100.
 - Step 1: Set `export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`
 - Step 2: Increase **Sequence Parallelism (SP)**
 - Step 3: Enable **FSDP v2 + CPU Offload**
-4. **Choosing the SP parallelism degree**: Prefer **common divisors of the GPU count and attention head count** (e.g., 2, 4) to avoid communication bottlenecks.
+4. **Choosing the SP parallelism degree**: Prefer **common divisors of the GPU count and attention head count** (e.g., 2, 4).
 5. **Prefer multi-GPU over single-GPU**: Even with sufficient VRAM, multi-GPU setups improve training efficiency and stability through parallelism.
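For orientation, the options the diffs refer to might sit in a trainer config roughly as follows. Only `model.max_model_len` appears in the changed docs; the offload and sequence-parallel key names below are illustrative placeholders, not confirmed Trinity option names:

```yaml
model:
  max_model_len: 20480        # context length discussed throughout the guide
trainer:
  # Placeholder keys -- consult the project's trainer reference for real names.
  fsdp_version: 2             # the "Offload" entries assume FSDP v2
  cpu_offload: true           # plus CPU offload to reduce VRAM pressure
  sequence_parallel_size: 2   # corresponds to "SP=2" entries in the tables
```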
