
Commit 45fde49

[feat] support megatron gkd (#7216)
1 parent b6eb9e8 commit 45fde49

18 files changed

Lines changed: 1572 additions & 329 deletions


docs/source/Instruction/GKD.md

Lines changed: 4 additions & 3 deletions
@@ -103,7 +103,7 @@ elif seq_kd:
     y = teacher.generate(x)
     source = "teacher"
 else:
-    # Mode 3: Off-policy learning, use the output sequence from the dataset
+    # Mode 3: use the output sequence from the dataset
     y = y_ground_truth
     source = "dataset"

@@ -128,7 +128,7 @@ loss = D_JSD(P_teacher(·|x,y), P_student(·|x,y))
 
 **Data source**: $y \sim P_{\text{teacher}}(\cdot | x)$
 
-### Mode 3: Off-policy learning (other cases)
+### Mode 3: Offline learning (other cases)
 
 **Data source**: $y = y^* \sim \text{Dataset}$

@@ -143,9 +143,10 @@ loss = D_JSD(P_teacher(·|x,y), P_student(·|x,y))
 |------|------|--------|---------|------|
 | `--teacher_model` | str | Required | - | Teacher model path or model ID |
 | `--beta` | float | 0.5 | [0.0, 1.0] | Divergence interpolation coefficient<br>• 0.0: Forward KL<br>• 0.5: JSD (balanced)<br>• 1.0: Reverse KL |
-| `--lmbda` | float | 0.5 | [0.0, 1.0] | On-policy learning trigger probability<br>• 0.0: Pure off-policy<br>• 0.5: Mixed strategy<br>• 1.0: Pure on-policy |
+| `--lmbda` | float | 0.5 | [0.0, 1.0] | On-policy learning trigger probability<br>• 0.0: Offline learning<br>• 0.5: Mixed strategy<br>• 1.0: Pure on-policy |
 | `--seq_kd` | bool | False | True/False | Whether to use teacher-generated sequences<br>• False: use the dataset when not on-policy<br>• True: use teacher generation when not on-policy |
 | `--temperature` | float | 0.9 | > 0 | Sampling temperature for generation; controls randomness |
+| `--sft_alpha` | float | 0 | >= 0 | Mixes in a proportion of SFT loss; applies to completions not generated by the student |
 | `--max_completion_length` | int | 512 | > 0 | Maximum number of tokens during generation |
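Editor's note: the `--beta` interpolation in the table follows the generalized JSD of the GKD paper. The sketch below is illustrative pure Python over small discrete distributions, not ms-swift's implementation (which works on logits); the endpoint special-casing to forward/reverse KL mirrors common implementations, and all function names here are my own.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def generalized_jsd(p_teacher, p_student, beta):
    """beta=0 -> forward KL(teacher||student); beta=1 -> reverse KL(student||teacher).
    Interior values interpolate through the mixture M = beta*teacher + (1-beta)*student
    (GKD-paper convention); beta=0.5 gives the symmetric JSD."""
    if beta == 0.0:
        return kl(p_teacher, p_student)
    if beta == 1.0:
        return kl(p_student, p_teacher)
    m = [beta * t + (1 - beta) * s for t, s in zip(p_teacher, p_student)]
    return beta * kl(p_teacher, m) + (1 - beta) * kl(p_student, m)

teacher = [0.7, 0.2, 0.1]
student = [0.5, 0.3, 0.2]
loss = generalized_jsd(teacher, student, beta=0.5)  # symmetric JSD term
```

At `beta=0.5` the result is symmetric in teacher and student, which is why the docs call it "balanced".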
 
 ## Sampling Acceleration

docs/source/Megatron-SWIFT/Command-line-parameters.md

Lines changed: 16 additions & 1 deletion
@@ -326,7 +326,7 @@ Megatron training parameters inherit from Megatron parameters and basic parameters (**shared with ms-swift
 
 ## RLHF Parameters
 In addition to inheriting the training parameters, the following parameters are also supported:
-- 🔥rlhf_type: Defaults to 'dpo'. Currently 'dpo', 'grpo', 'kto', and 'rm' are available.
+- 🔥rlhf_type: Defaults to 'dpo'. Currently 'dpo', 'grpo', 'kto', 'rm', and 'gkd' are available.
 - loss_scale: Overrides the loss_scale in [basic parameters](../Instruction/Command-line-parameters.md). Defaults to 'last_round'.
 - calculate_per_token_loss: Overrides the Megatron parameter. Defaults to False.

@@ -406,6 +406,21 @@ Megatron training parameters inherit from Megatron parameters and basic parameters (**shared with ms-swift
 
 Built-in reward function parameters: see the [documentation](../Instruction/Command-line-parameters.md#奖励函数参数)
 
+### GKD Parameters
+- teacher_model: Path or model ID of the teacher model. Required.
+- teacher_model_type: Teacher model type. Defaults to None (auto-detected).
+- teacher_model_revision: Teacher model revision. Defaults to None.
+- beta: JSD divergence interpolation coefficient. 0.0 means forward KL, 0.5 means symmetric JSD, 1.0 means reverse KL. Defaults to 0.5.
+- lmbda: On-policy learning trigger probability. 0.0 means pure off-policy, 1.0 means pure on-policy. Defaults to 0.5.
+- seq_kd: Whether to use teacher-generated responses (Sequential KD); currently not supported. Defaults to False.
+- temperature: Temperature used for sampling and loss computation. Defaults to 0.9.
+- offload_teacher_model: Whether to offload the teacher model to CPU to save GPU memory. Defaults to False.
+- sft_alpha: Mixing coefficient for the SFT loss, `loss = jsd_loss + sft_alpha * sft_loss`. Takes effect when dataset responses are used (off-policy). Defaults to 0.
+- max_completion_length: Maximum number of tokens during generation. Defaults to 512.
+- vllm_mode: Same as the GRPO parameter; used for on-policy generation. In colocate mode, vLLM is deployed inside the training process.
+- Note: on-policy generation requires vLLM (`--use_vllm true --vllm_mode colocate/server`).
+- When `lmbda > 0` but vLLM is not enabled, training automatically falls back to off-policy mode.
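Editor's note: the `sft_alpha` combination `loss = jsd_loss + sft_alpha * sft_loss`, applied only to responses the student did not generate, can be sketched as follows (hypothetical helper with scalar losses, not ms-swift's actual code):

```python
def gkd_total_loss(jsd_loss, sft_loss, sft_alpha, source):
    """Combine the distillation (JSD) loss with an optional SFT term.

    Per the parameter description, sft_alpha only takes effect when the
    response was NOT generated by the student (e.g. source == "dataset").
    """
    if source == "student":
        return jsd_loss
    return jsd_loss + sft_alpha * sft_loss

# Off-policy sample with sft_alpha = 0.2: 1.5 + 0.2 * 2.0 = 1.9
total = gkd_total_loss(1.5, 2.0, sft_alpha=0.2, source="dataset")
```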
 ## Export Parameters
 This section describes the parameters of `megatron export` (requires "ms-swift>=3.10"). To export with the `swift export` command, see the [ms-swift command-line parameters documentation](../Instruction/Command-line-parameters.md#导出参数). Compared with `swift export`, `megatron export` supports distributed and multi-node export. Megatron export parameters inherit from Megatron parameters and basic parameters.
 - 🔥to_mcore: Convert HF-format weights to Megatron format. Defaults to False.

docs/source/Megatron-SWIFT/GKD.md

Lines changed: 73 additions & 0 deletions
@@ -0,0 +1,73 @@
# GKD

**Version requirement**: ms-swift >= 3.12

If you are new to GKD, first read the [GKD documentation](../Instruction/GKD.md).

GKD (Generalized Knowledge Distillation) is a training method that transfers a teacher model's knowledge to a student model; distillation is realized by computing a Jensen-Shannon divergence (JSD) loss between the two models' output distributions.

## Supported Features

Megatron GKD currently supports the following features:

- **Training modes**: full-parameter training and LoRA fine-tuning
- **Parallelism strategies**: context parallelism (CP), pipeline parallelism (PP), tensor parallelism (TP), and expert parallelism (EP)
- **Model support**: compatible with the LLMs and MLLMs in Megatron-SWIFT
- **Teacher offload**: the teacher model can be offloaded to CPU to save GPU memory
- **Online generation**: on-policy generation for the student model via vLLM

### Current Limitations

- **Teacher online generation** (`seq_kd=True`): teacher generation in Sequential KD mode is not yet supported
- **Non-vLLM generation**: on-policy generation currently supports vLLM only
- **Teacher parallel settings different from the student's**: will be supported in a future version

⚠️ Notes:
- **On-policy generation**: requires vLLM (`--use_vllm true --vllm_mode colocate/server`)
- When `lmbda > 0` but vLLM is not enabled, training automatically falls back to offline learning (dataset responses)
- When `seq_kd=True`, since teacher generation is not yet supported, training automatically falls back to offline learning; if you need it, run inference over the dataset in advance with [swift infer](../Instruction/Inference-and-deployment.md)

## Parameters

### GKD-specific Parameters

| Parameter | Type | Default | Description |
|------|------|--------|------|
| `--teacher_model` | str | Required | Teacher model path or model ID |
| `--beta` | float | 0.5 | JSD interpolation coefficient:<br>• 0.0: Forward KL<br>• 0.5: Symmetric JSD<br>• 1.0: Reverse KL |
| `--lmbda` | float | 0.5 | On-policy learning trigger probability:<br>• 0.0: Pure off-policy<br>• 1.0: Pure on-policy |
| `--seq_kd` | bool | False | Whether to use teacher-generated responses (not yet supported) |
| `--temperature` | float | 0.9 | Temperature used for sampling and loss computation |
| `--sft_alpha` | float | 0 | Mixes in a proportion of SFT loss; applies to completions not generated by the student |
| `--max_completion_length` | int | 512 | Maximum number of tokens during generation |

### Batch-related Parameters

As in Megatron SFT, use the following parameters to control batch size:

| Parameter | Description |
|------|------|
| `--micro_batch_size` | Per-GPU training batch size |
| `--global_batch_size` | Global batch size: `micro_batch_size × dp_size × gradient_accumulation_steps` |

## Three Training Modes

GKD supports three training modes, controlled by the `lmbda` and `seq_kd` parameters:

### Mode 1: On-policy learning
- Trigger: `random() < lmbda` and `use_vllm=True`
- Data source: responses generated by the student model

### Mode 2: Sequential KD (not yet supported)
- Trigger: `random() >= lmbda` and `seq_kd=True`
- Data source: responses generated by the teacher model

### Mode 3: Off-policy learning
- Trigger: all other cases
- Data source: annotated responses from the dataset
## Reference

For more parameters, see the [command-line documentation](./Command-line-parameters.md).

For training scripts, see the [Megatron GKD scripts](https://github.com/modelscope/ms-swift/blob/main/examples/megatron/rlhf/gkd).

docs/source/Megatron-SWIFT/Quick-start.md

Lines changed: 1 addition & 0 deletions
@@ -9,6 +9,7 @@ ms-swift introduces Megatron's parallelism techniques to accelerate large-model training, including data
 | Pre-training ||||||
 | [Supervised Fine-Tuning](https://github.com/modelscope/ms-swift/tree/main/examples/megatron) ||||||
 | [GRPO](https://github.com/modelscope/ms-swift/tree/main/examples/megatron/grpo) ||||||
+| [GKD](https://github.com/modelscope/ms-swift/tree/main/examples/megatron/rlhf/gkd) ||||||
 | [DPO](https://github.com/modelscope/ms-swift/tree/main/examples/megatron/rlhf/dpo) ||||||
 | [KTO](https://github.com/modelscope/ms-swift/tree/main/examples/megatron/rlhf/kto) ||||||
 | [RM](https://github.com/modelscope/ms-swift/tree/main/examples/megatron/rlhf/rm) ||||||

docs/source_en/Instruction/GKD.md

Lines changed: 5 additions & 4 deletions
@@ -1,6 +1,6 @@
 # GKD
 
-GKD (Generalized Knowledge Distillation) training algorithm is proposed in the paper [On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes](https://arxiv.org/pdf/2306.13649). This algorithm transfers knowledge from the teacher model to the student model by combining off-policy and on-policy learning strategies.
+The GKD (Generalized Knowledge Distillation) training algorithm was proposed in the paper [On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes](https://arxiv.org/pdf/2306.13649). It transfers knowledge from the teacher model to the student model by combining offline and on-policy learning strategies.
 
 ## Loss Function
 
@@ -103,7 +103,7 @@ elif seq_kd:
     y = teacher.generate(x)
     source = "teacher"
 else:
-    # Mode 3: Off-Policy learning, use output sequence from dataset
+    # Mode 3: Offline learning, use output sequence from dataset
     y = y_ground_truth
     source = "dataset"
 
@@ -128,7 +128,7 @@ Set parameter `seq_kd=True`, when on-policy is not triggered, use teacher model
 
 **Data Source**: $y \sim P_{\text{teacher}}(\cdot | x)$
 
-### Mode 3: Off-Policy Learning (other cases)
+### Mode 3: Offline Learning (other cases)
 
 **Data Source**: $y = y^* \sim \text{Dataset}$
 
@@ -143,9 +143,10 @@ We can perform GKD training by setting the following parameters:
 |------|------|--------|---------|------|
 | `--teacher_model` | str | Required | - | Teacher model path or model ID |
 | `--beta` | float | 0.5 | [0.0, 1.0] | Divergence interpolation coefficient<br>• 0.0: Forward KL<br>• 0.5: JSD (balanced)<br>• 1.0: Reverse KL |
-| `--lmbda` | float | 0.5 | [0.0, 1.0] | On-Policy learning trigger probability<br>• 0.0: Pure Off-Policy<br>• 0.5: Mixed strategy (**recommended**)<br>• 1.0: Pure On-Policy |
+| `--lmbda` | float | 0.5 | [0.0, 1.0] | On-Policy learning trigger probability<br>• 0.0: Pure Offline<br>• 0.5: Mixed strategy (**recommended**)<br>• 1.0: Pure On-Policy |
 | `--seq_kd` | bool | False | True/False | Whether to use teacher-generated sequences<br>• False: Use dataset when not on-policy<br>• True: Use teacher generation when not on-policy |
 | `--temperature` | float | 0.9 | > 0 | Generation sampling temperature, controls randomness |
+| `--sft_alpha` | float | 0 | >= 0 | Mix in a proportion of SFT loss; applied to non-student-generated completions |
 | `--max_completion_length` | int | 512 | > 0 | Maximum number of tokens during generation |
 
 ## Sampling Acceleration

docs/source_en/Megatron-SWIFT/Command-line-parameters.md

Lines changed: 17 additions & 1 deletion
@@ -347,7 +347,7 @@ Megatron training parameters are inherited from Megatron parameters and basic pa
 
 In addition to inheriting the training parameters, the following parameters are also supported:
 
-- 🔥rlhf_type: Default is 'dpo'. Currently, 'dpo', 'grpo', 'kto', and 'rm' are available.
+- 🔥rlhf_type: Default is 'dpo'. Currently, 'dpo', 'grpo', 'kto', 'rm', and 'gkd' are available.
 - loss_scale: Overrides the `loss_scale` in [basic parameters](../Instruction/Command-line-parameters.md). Default is 'last_round'.
 - calculate_per_token_loss: Overrides the Megatron parameter. Default is False.
 
@@ -430,6 +430,22 @@ In addition to inheriting the training parameters, the following parameters are
 
 Built-in reward function parameters refer to the [documentation](../Instruction/Command-line-parameters.md#reward-function-parameters).
 
+### GKD Parameters
+
+- teacher_model: Path or model ID of the teacher model. Required.
+- teacher_model_type: Teacher model type. Default is None, auto-detected.
+- teacher_model_revision: Teacher model revision. Default is None.
+- beta: JSD divergence interpolation coefficient. 0.0 means Forward KL, 0.5 means symmetric JSD, 1.0 means Reverse KL. Default is 0.5.
+- lmbda: On-Policy learning probability. 0.0 means pure Off-Policy, 1.0 means pure On-Policy. Default is 0.5.
+- seq_kd: Whether to use teacher-generated responses (Sequential KD), not yet supported. Default is False.
+- temperature: Temperature for sampling and loss computation. Default is 0.9.
+- offload_teacher_model: Whether to offload the teacher model to CPU to save GPU memory. Default is False.
+- sft_alpha: Mixing coefficient for SFT loss, `loss = jsd_loss + sft_alpha * sft_loss`. Takes effect when using dataset responses (Off-Policy). Default is 0.
+- max_completion_length: Maximum tokens for generation. Default is 512.
+- vllm_mode: Same as the GRPO parameter, used for On-Policy generation. Colocate mode deploys vLLM within the program.
+- Note: On-Policy generation requires vLLM (`--use_vllm true --vllm_mode colocate/server`).
+- When `lmbda > 0` but vLLM is not enabled, it will automatically fall back to Off-Policy mode.
 
 ## Export Parameters
 
 This section introduces the parameters for `megatron export` (requires "ms-swift>=3.10"). To use the `swift export` command for exporting, please refer to the [ms-swift Command Line Parameters Documentation](../Instruction/Command-line-parameters.md#export-arguments). Compared to `swift export`, `megatron export` supports distributed and multi-node exporting. Megatron export parameters inherit from Megatron parameters and basic parameters.

docs/source_en/Megatron-SWIFT/GKD.md
Lines changed: 73 additions & 0 deletions
@@ -0,0 +1,73 @@
# GKD

**Version Requirement**: ms-swift >= 3.12

If you are new to GKD, please refer to the [GKD Documentation](../Instruction/GKD.md) first.

GKD (Generalized Knowledge Distillation) is a training method that transfers knowledge from a teacher model to a student model by computing the Jensen-Shannon Divergence (JSD) loss between their output distributions.

## Feature Support

Megatron GKD currently supports the following features:

- **Training Modes**: Full-parameter training and LoRA fine-tuning
- **Parallelism Strategies**: Context Parallel (CP), Pipeline Parallel (PP), Tensor Parallel (TP), and Expert Parallel (EP)
- **Model Support**: Compatible with the LLMs and MLLMs in Megatron-SWIFT
- **Teacher Offload**: Supports offloading the teacher model to CPU to save GPU memory
- **Online Generation**: Supports on-policy generation with vLLM for the student model

### Current Limitations

- **Teacher Model Online Generation** (`seq_kd=True`): Teacher model generation in Sequential KD mode is not yet supported
- **Non-vLLM Generation**: On-policy generation currently only supports vLLM
- **Teacher model with parallel settings different from the student's**: Will be supported in future versions

⚠️ Notes:

- **On-policy Generation**: Requires vLLM (`--use_vllm true --vllm_mode colocate/server`)
- When `lmbda > 0` but vLLM is not enabled, training automatically falls back to off-policy mode (using dataset responses)
- When `seq_kd=True`, since teacher generation is not yet supported, training automatically falls back to off-policy mode. If you need it, use [swift infer](../Instruction/Inference-and-deployment.md) to pre-generate responses for the dataset

## Parameters

### GKD-specific Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `--teacher_model` | str | Required | Path or model ID of the teacher model |
| `--beta` | float | 0.5 | JSD divergence interpolation coefficient:<br>• 0.0: Forward KL<br>• 0.5: Symmetric JSD<br>• 1.0: Reverse KL |
| `--lmbda` | float | 0.5 | On-policy learning probability:<br>• 0.0: Pure off-policy<br>• 1.0: Pure on-policy |
| `--seq_kd` | bool | False | Use teacher-generated responses (not yet supported) |
| `--temperature` | float | 0.9 | Temperature for sampling and loss computation |
| `--sft_alpha` | float | 0 | Mix in a proportion of SFT loss; applied to non-student-generated completions |
| `--max_completion_length` | int | 512 | Maximum tokens for generation |

### Batch-related Parameters

As in Megatron SFT, use the following parameters to control batch size:

| Parameter | Description |
|-----------|-------------|
| `--micro_batch_size` | Training batch size per GPU |
| `--global_batch_size` | Global batch size: `micro_batch_size × dp_size × gradient_accumulation_steps` |
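Editor's note: as a worked check of the global-batch-size formula, using the values from this commit's 8-GPU example script and assuming the usual Megatron relation `dp_size = world_size / (TP × PP × CP)`:

```python
# Values from the 8-GPU example script in this commit
world_size = 8
tp, pp, cp = 2, 2, 2                      # tensor / pipeline / context parallel
micro_batch_size, global_batch_size = 2, 16

dp_size = world_size // (tp * pp * cp)    # data-parallel size: 8 / 8 = 1
grad_accum = global_batch_size // (micro_batch_size * dp_size)  # 16 / 2 = 8
```

So each optimizer step accumulates 8 micro-batches of 2 samples on the single data-parallel replica.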
## Three Training Modes

GKD supports three training modes, controlled by the `lmbda` and `seq_kd` parameters:

### Mode 1: On-Policy Learning
- Trigger: `random() < lmbda` and `use_vllm=True`
- Data source: Responses generated by the student model

### Mode 2: Sequential KD (Not Yet Supported)
- Trigger: `random() >= lmbda` and `seq_kd=True`
- Data source: Responses generated by the teacher model

### Mode 3: Off-Policy Learning
- Trigger: Other cases
- Data source: Labeled responses from the dataset

## Reference

For more parameters, please refer to the [Command-line Parameters](./Command-line-parameters.md).

For training scripts, please refer to the [Megatron GKD scripts](https://github.com/modelscope/ms-swift/blob/main/examples/megatron/rlhf/gkd).

docs/source_en/Megatron-SWIFT/Quick-start.md

Lines changed: 1 addition & 0 deletions
@@ -8,6 +8,7 @@ ms-swift incorporates Megatron's parallelization techniques to accelerate the tr
 | Pre-training ||||||
 | [Supervised Fine-Tuning](https://github.com/modelscope/ms-swift/tree/main/examples/megatron) ||||||
 | [GRPO](https://github.com/modelscope/ms-swift/tree/main/examples/megatron/grpo) ||||||
+| [GKD](https://github.com/modelscope/ms-swift/tree/main/examples/megatron/rlhf/gkd) ||||||
 | [DPO](https://github.com/modelscope/ms-swift/tree/main/examples/megatron/rlhf/dpo) ||||||
 | [KTO](https://github.com/modelscope/ms-swift/tree/main/examples/megatron/rlhf/kto) ||||||
 | [RM](https://github.com/modelscope/ms-swift/tree/main/examples/megatron/rlhf/rm) ||||||
Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NPROC_PER_NODE=8 \
PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
megatron rlhf \
    --rlhf_type gkd \
    --model Qwen/Qwen3-8B-Base \
    --teacher_model Qwen/Qwen3-32B \
    --train_type lora \
    --dataset AI-ModelScope/alpaca-gpt4-data-en#2000 AI-ModelScope/alpaca-gpt4-data-zh#2000 \
    --tensor_model_parallel_size 2 \
    --expert_model_parallel_size 1 \
    --pipeline_model_parallel_size 2 \
    --context_parallel_size 2 \
    --seq_kd false \
    --lmbda 1 \
    --beta 1 \
    --torch_dtype bfloat16 \
    --micro_batch_size 2 \
    --global_batch_size 16 \
    --max_epochs 1 \
    --lr 5e-6 \
    --log_interval 1 \
    --max_length 8192 \
    --max_completion_length 8192 \
    --attention_backend flash \
    --use_vllm true \
    --vllm_mode colocate \
    --vllm_gpu_memory_utilization 0.5 \
    --vllm_tensor_parallel_size 1 \
    --vllm_max_model_len 16384 \
    --sleep_level 1 \
    --offload_teacher_model true \
    --recompute_granularity selective \
    --finetune \
    --no_save_optim \
    --no_save_rng \
    --temperature 1.0 \
    --padding_free true \
    --sequence_parallel true
