
Commit 97d646b

hjh0119 authored and Jintao-Huang committed

[grpo] support offloading reference model (#4554)

* offload ref_model
* argument
* rm comment
* doc and fix
* refactor
* clean scripts
* rm unused dict
* rm offload_ref_model argument
* doc clean
* doc

1 parent c9e4f33

File tree: 10 files changed, +28 −51 lines

README.md

Lines changed: 1 addition & 1 deletion
@@ -119,7 +119,7 @@ Running Environment:
 | transformers | >=4.33 | 4.51 | |
 | modelscope | >=1.23 | | |
 | peft | >=0.11,<0.16 | ||
-| trl | >=0.13,<0.18 | 0.17 |RLHF|
+| trl | >=0.13,<0.19 | 0.18 |RLHF|
 | deepspeed | >=0.14 | 0.14.5 | Training |
 | vllm | >=0.5.1 | 0.8 | Inference/Deployment/Evaluation |
 | lmdeploy | >=0.5 | 0.8 | Inference/Deployment/Evaluation |

README_CN.md

Lines changed: 1 addition & 1 deletion
@@ -115,7 +115,7 @@ pip install -e .
 | transformers | >=4.33 | 4.51 ||
 | modelscope | >=1.23 | ||
 | peft | >=0.11,<0.16 | ||
-| trl | >=0.13,<0.18 | 0.17 |RLHF|
+| trl | >=0.13,<0.19 | 0.18 |RLHF|
 | deepspeed | >=0.14 | 0.14.5 |训练|
 | vllm | >=0.5.1 | 0.8 |推理/部署/评测|
 | lmdeploy | >=0.5 | 0.8 |推理/部署/评测|

docs/source/GetStarted/SWIFT安装.md

Lines changed: 1 addition & 1 deletion
@@ -75,7 +75,7 @@ modelscope-registry.us-west-1.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu2
 | transformers | >=4.33 | 4.51 ||
 | modelscope | >=1.23 | ||
 | peft | >=0.11,<0.16 | ||
-| trl | >=0.13,<0.18 | 0.17 |RLHF|
+| trl | >=0.13,<0.19 | 0.18 |RLHF|
 | deepspeed | >=0.14 | 0.14.5 |训练|
 | vllm | >=0.5.1 | 0.8 |推理/部署/评测|
 | lmdeploy | >=0.5 | 0.8 |推理/部署/评测|

docs/source/Instruction/GRPO.md

Lines changed: 2 additions & 2 deletions
@@ -46,7 +46,7 @@ GRPO 训练框架支持集成高性能推理引擎(如 vLLM)来加速采样
 --sleep_level 1
 ```
 
-2. 在vLLM 推理阶段,释放训练模型和优化器占用的显存
+2. 在vLLM 推理阶段,释放模型和优化器占用的显存
 
 ```bash
 --offload_optimizer true \
@@ -222,7 +222,7 @@ A conversation between User and Assistant. The user asks a question, and the Ass
 - vllm_enable_prefix_caching: vllm透传参数,默认为True.
 - sleep_level: 训练时释放 vLLM 显存,可选项为[0, 1], 默认为0,不释放.
 - offload_optimizer: 是否在vLLM推理时offload optimizer参数,默认为False。
-- offload_model: 是否在vLLM推理时offload 模型本身,默认为False。
+- offload_model: 是否在vLLM推理时 offload 模型,默认为False。
 - gc_collect_after_offload: 是否在offload结束时进行gc(python gc和GPU gc),默认为False。
 - completion_length_limit_scope: 在多轮对话中,`max_completion_length` 的限制范围。
   `total`限制所有对话轮次的总输出长度不超过`max_completion_length`, `per_round`限制每一轮的输出长度。

docs/source/Instruction/命令行参数.md

Lines changed: 1 addition & 1 deletion
@@ -446,7 +446,7 @@ reward模型参数将在PPO、GRPO中使用。
 - vllm_enable_prefix_caching: vllm透传参数,默认为True。
 - sleep_level: 训练时释放 vLLM 显存,可选项为[0, 1], 默认为0,不释放
 - offload_optimizer: 是否在vLLM推理时offload optimizer参数,默认为False。
-- offload_model: 是否在vLLM推理时offload 模型本身,默认为False。
+- offload_model: 是否在vLLM推理时 offload 模型,默认为False。
 - gc_collect_after_offload: 是否在offload结束时进行gc(python gc和GPU gc),默认为False。
 - completion_length_limit_scope: 在多轮对话中,`max_completion_length` 的限制范围。
   `total`限制所有对话轮次的总输出长度不超过`max_completion_length`, `per_round`限制每一轮的输出长度。

docs/source_en/GetStarted/SWIFT-installation.md

Lines changed: 1 addition & 1 deletion
@@ -76,7 +76,7 @@ More images can be found [here](https://modelscope.cn/docs/intro/environment-set
 | transformers | >=4.33 | 4.51 | |
 | modelscope | >=1.23 | | |
 | peft | >=0.11,<0.16 | | |
-| trl | >=0.13,<0.18 | 0.17 | RLHF |
+| trl | >=0.13,<0.19 | 0.18 | RLHF |
 | deepspeed | >=0.14 | 0.14.5 | Training |
 | vllm | >=0.5.1 | 0.8 | Inference/Deployment/Evaluation |
 | lmdeploy | >=0.5 | 0.8 | Inference/Deployment/Evaluation |

docs/source_en/Instruction/Command-line-parameters.md

Lines changed: 2 additions & 2 deletions
@@ -457,8 +457,8 @@ The meanings of the following parameters can be referenced [here](https://huggin
 - vllm_limit_mm_per_prompt: vLLM passthrough parameter, default is None.
 - vllm_tensor_parallel_size: the tensor parallel size of vLLM engine, default is 1.
 - sleep_level: make vllm sleep when model is training. Options are 0 or 1, default is 0, no sleep
-- offload_optimizer: Whether to offload optimizer parameters during inference with vLLM/LMDeploy. The default is `False`.
-- offload_model: Whether to offload the model itself during inference with vLLM/LMDeploy. The default is `False`.
+- offload_optimizer: Whether to offload optimizer parameters during inference with vLLM. The default is `False`.
+- offload_model: Whether to offload the model during inference with vLLM. The default is `False`.
 - gc_collect_after_offload: Whether to perform garbage collection (both Python GC and GPU GC) after offloading. The default is `False`.
 - completion_length_limit_scope: Specifies the scope of the `max_completion_length` limit in multi-turn conversations.
   When set to `total`, the total output length across all turns must not exceed `max_completion_length`.

docs/source_en/Instruction/GRPO.md

Lines changed: 2 additions & 2 deletions
@@ -53,7 +53,7 @@ When running in Colocate Mode , out-of-memory (OOM) errors are common due to sim
 --sleep_level 1
 ```
 
-2. Offload training model and optimizer memory during vLLM inference:
+2. Offload model and optimizer memory during vLLM inference:
 
 ```bash
 --offload_optimizer true \
@@ -232,7 +232,7 @@ Arguments
 - vllm_tensor_parallel_size: the tensor parallel size of vLLM engine, default is 1.
 - sleep_level: make vllm sleep when model is training. Options are 0 or 1, default is 0, no sleep
 - offload_optimizer: Whether to offload optimizer parameters during inference with vLLM. The default is `False`.
-- offload_model: Whether to offload the model itself during inference with vLLM. The default is `False`.
+- offload_model: Whether to offload the model during inference with vLLM. The default is `False`.
 - gc_collect_after_offload: Whether to perform garbage collection (both Python GC and GPU GC) after offloading. The default is `False`.
 - completion_length_limit_scope: Specifies the scope of the `max_completion_length` limit in multi-turn conversations.
   When set to `total`, the total output length across all turns must not exceed `max_completion_length`.
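
Taken together, these switches govern the memory hand-off between the training side and the colocated vLLM engine. The sketch below is a hedged Python outline of the order of operations this commit implements in `GRPOTrainer._fast_infer` (offload, wake vLLM, generate, sleep vLLM, reload); `engine.wake_up`/`engine.sleep` and the `rollout_fn` callable are illustrative stand-ins rather than the trainer's exact internals, while the trainer methods and `args` fields shown come from this commit's diff.

```python
# Hedged outline of one colocate rollout step; names marked "stand-in" are not
# part of the ms-swift API.
from swift.utils import gc_collect


def colocate_rollout_step(trainer, engine, inputs, rollout_fn):
    args = trainer.args
    if args.sleep_level > 0:
        if args.offload_model:
            trainer.offload_model(trainer.accelerator.unwrap_model(trainer.model))
            if trainer.ref_model:  # new in this commit: also offload the reference model
                trainer.offload_model(trainer.ref_model)
        if args.offload_optimizer:
            trainer.offload_optimizer()
        if args.gc_collect_after_offload:
            gc_collect()  # Python GC plus GPU cache release
        engine.wake_up()  # stand-in: wake the sleeping vLLM engine

    outputs = rollout_fn(inputs)  # stand-in: vLLM generation for the batch

    if args.sleep_level > 0:
        engine.sleep(level=args.sleep_level)  # stand-in: release vLLM memory again
        if args.gc_collect_after_offload:
            gc_collect()
        if args.offload_model:
            trainer.load_model(trainer.accelerator.unwrap_model(trainer.model))
            if trainer.ref_model:
                trainer.load_model(trainer.ref_model)
        if args.offload_optimizer:
            trainer.load_optimizer()
    return inputs, outputs
```

The functional change in this commit is the two `ref_model` branches: when a reference model is kept for the KL term, its weights are now moved off the GPU for the duration of the rollout as well.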

swift/trainers/arguments.py

Lines changed: 0 additions & 1 deletion
@@ -234,7 +234,6 @@ class GRPOArgumentsMixin:
     # Dr. GRPO, https://arxiv.org/abs/2503.20783
     scale_rewards: bool = True
 
-    # compatible with trl main branch(0.17.0.dev0)
     wandb_log_unique_prompts: Optional[bool] = None
     generation_batch_size: Optional[int] = None
     steps_per_generation: Optional[int] = None

swift/trainers/rlhf_trainer/grpo_trainer.py

Lines changed: 17 additions & 39 deletions
@@ -37,8 +37,8 @@
 from swift.llm.model.utils import get_llm_model
 from swift.llm.template.template_inputs import StdTemplateInputs
 from swift.plugin import loss_scale_map, multi_turns, orms, rm_plugins
-from swift.utils import (JsonlWriter, gc_collect, get_device, get_logger, is_vllm_available, is_wandb_available,
-                         seed_worker)
+from swift.utils import (JsonlWriter, gc_collect, get_current_device, get_device, get_logger, is_vllm_available,
+                         is_wandb_available, seed_worker)
 from ..mixin import SwiftMixin
 from .rlhf_mixin import RLHFTrainerMixin
 from .utils import _ForwardRedirection, patch_lora_merge, patch_lora_unmerge, unwrap_model_for_generation
@@ -93,10 +93,6 @@ def __init__(self,
 
         self.processing_class = kwargs.get('template').tokenizer
 
-        # for offload model/optimizer
-        self.offload_modules = {}
-        self.offload_states = {}
-
         if not isinstance(reward_funcs, list):
             reward_funcs = [reward_funcs]
 
@@ -759,7 +755,9 @@ def _prefetch(self, dataloader: DataLoader):
     def _fast_infer(self, inputs: InputsType) -> Tuple[InputsType, OutputsType]:
         if self.vllm_mode == 'colocate' and self.args.sleep_level > 0:
             if self.args.offload_model:
-                self.offload_model()
+                self.offload_model(self.accelerator.unwrap_model(self.model))
+                if self.ref_model:
+                    self.offload_model(self.ref_model)
             if self.args.offload_optimizer:
                 self.offload_optimizer()
             if self.args.gc_collect_after_offload:
@@ -797,7 +795,9 @@ def _fast_infer(self, inputs: InputsType) -> Tuple[InputsType, OutputsType]:
             if self.args.gc_collect_after_offload:
                 gc_collect()
             if self.args.offload_model:
-                self.load_model()
+                self.load_model(self.accelerator.unwrap_model(self.model))
+                if self.ref_model:
+                    self.load_model(self.ref_model)
             if self.args.offload_optimizer:
                 self.load_optimizer()
         return inputs, outputs
@@ -1387,60 +1387,38 @@ def _queue(self):
         return self.train_queue
 
     @torch.no_grad()
-    def offload_model(self):
-        if len(self.offload_modules) > 0:
-            return
-        unwrapped_model = self.accelerator.unwrap_model(self.model)
-        for name, module in unwrapped_model.named_modules():
-            if isinstance(module, torch.nn.Embedding):
-                self.offload_modules[name] = module.weight.device
-                module.to('cpu')
-            elif not hasattr(module, 'device'):
-                pass
-            elif module.device.type != 'cpu':
-                self.offload_modules[name] = module.device
-                module.to('cpu')
+    def offload_model(self, model):
+        for param in model.parameters():
+            param.data = param.data.to(torch.device('cpu'), non_blocking=True)
 
     @torch.no_grad()
-    def load_model(self):
-        if len(self.offload_modules) == 0:
-            return
-        unwrapped_model = self.accelerator.unwrap_model(self.model)
-        for name, device in self.offload_modules.items():
-            module = unwrapped_model.get_submodule(name)
-            if isinstance(module, torch.nn.Embedding):
-                module.weight.to(device)
-            else:
-                module.to(device)
-        self.offload_modules.clear()
+    def load_model(self, model):
+        device = get_current_device()
+        for param in model.parameters():
+            param.data = param.data.to(device, non_blocking=True)
 
     @torch.no_grad()
     def offload_optimizer(self):
-        if len(self.offload_states) > 0:
-            return
         if not self.optimizer.state:
             return
         for param_group in self.optimizer.param_groups:
            for param in param_group['params']:
                state = self.optimizer.state[param]
                for key, value in state.items():
                    if isinstance(value, torch.Tensor):
-                        self.offload_states[key] = value.device
                        state[key] = value.to('cpu', non_blocking=True)
 
     @torch.no_grad()
     def load_optimizer(self):
-        if len(self.offload_states) == 0:
-            return
+        device = get_current_device()
         if not self.optimizer.state:
            return
        for param_group in self.optimizer.param_groups:
            for param in param_group['params']:
                state = self.optimizer.state[param]
                for key, value in state.items():
                    if isinstance(value, torch.Tensor):
-                        state[key] = value.to(self.offload_states[key], non_blocking=True)
-        self.offload_states.clear()
+                        state[key] = value.to(device, non_blocking=True)
 
     @contextmanager
     def multi_turn_completion_length_context(self):
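
For readers who want the pattern outside the trainer, here is a minimal self-contained sketch of what the new `offload_model`/`load_model` and `offload_optimizer`/`load_optimizer` methods do: rebind each parameter's storage (and each tensor-valued optimizer-state entry) to CPU, then back to the accelerator device. The free-function names below are illustrative, not part of ms-swift.

```python
# Minimal standalone sketch of the offload pattern adopted in this commit.
# Function names are illustrative; the trainer implements them as methods.
import torch


@torch.no_grad()
def move_params(module: torch.nn.Module, device: torch.device) -> None:
    # Rebinding .data relocates the storage while keeping the Parameter objects
    # (and any optimizer references to them) intact.
    for param in module.parameters():
        param.data = param.data.to(device, non_blocking=True)


@torch.no_grad()
def move_optimizer_state(optimizer: torch.optim.Optimizer, device: torch.device) -> None:
    # Adam-style moments are the other large block of training memory; move every
    # tensor-valued state entry to the target device.
    if not optimizer.state:
        return
    for group in optimizer.param_groups:
        for param in group['params']:
            state = optimizer.state[param]
            for key, value in state.items():
                if isinstance(value, torch.Tensor):
                    state[key] = value.to(device, non_blocking=True)


if __name__ == '__main__':
    # Tiny usage example: populate optimizer state, then move everything to CPU.
    model = torch.nn.Linear(8, 8)
    optim = torch.optim.AdamW(model.parameters())
    model(torch.randn(2, 8)).sum().backward()
    optim.step()  # creates exp_avg / exp_avg_sq tensors in optimizer.state
    move_params(model, torch.device('cpu'))
    move_optimizer_state(optim, torch.device('cpu'))
    # In the trainer, the reverse move targets get_current_device() after the rollout.
```

Compared with the old module-by-module bookkeeping (`self.offload_modules` / `self.offload_states`), this parameter-level move needs no saved device map, since everything is restored to `get_current_device()`, and it extends naturally to the reference model that `_fast_infer` now passes in.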
