Commit 4ddb7fa

[grpo] pass trainer state to reward funcs(#4779)
1 parent 3356b67 commit 4ddb7fa

4 files changed: +36 −15 lines changed


docs/source/Instruction/GRPO/DeveloperGuide/奖励函数.md

Lines changed: 13 additions & 5 deletions

````diff
@@ -1,6 +1,10 @@
 # Reward Function
 ## Custom Reward Function
-The reward function takes the model-generated completions and other dataset columns as arguments (kwargs) and scores the generated text. Below is an example showing how to implement a simple length reward function: it gives a reward of 1.0 when the generated text is longer than 1024, and 0.0 otherwise.
+The reward function takes the model-generated completions, other dataset columns, and the trainer state as arguments (kwargs) and scores the generated text. The [trainer state](https://huggingface.co/docs/transformers/main/main_classes/callback#transformers.TrainerState) contains information such as the current training step.
+
+Note: the columns related to the model input (such as query and response) are folded into the messages key, and the original assistant response in the dataset is discarded; keep it in an extra column if you need it.
+
+Below is an example showing how to implement a simple length reward function: it gives a reward of 1.0 when the generated text is longer than 1024, and 0.0 otherwise.

 ```python
 from swift.plugin import ORM, orms
@@ -13,23 +17,27 @@ orms['dummy']= DummyLengthRewardFunction

 **Accessing other columns in the dataset**

-For example, if the reward function needs the dataset's `solution` column for auxiliary computation, there are two ways to obtain it:
+For example, if the reward function needs the dataset's `solution` column, the current training step, and the total number of steps for auxiliary computation, there are two ways to obtain them:

-First: declare the solution column name explicitly in the __call__ signature
+First: declare the column names explicitly in the __call__ signature
 ```python
-def __call__(completions, solution, **kwargs):
+def __call__(completions, solution, trainer_state, **kwargs):
     print(solution)
+    global_step = trainer_state.global_step
+    max_steps = trainer_state.max_steps
     ...
 ```

 Second: read them from kwargs
 ```python
 def __call__(completions, **kwargs):
     solution = kwargs.get('solution')
+    trainer_state = kwargs.get('trainer_state')
+    global_step = trainer_state.global_step
+    max_steps = trainer_state.max_steps
     ...
 ```

-Note: columns related to messages (such as query and response) are processed, and the original assistant response in the dataset is discarded; keep it in an extra column if you need it.

 **Using a custom reward function**
````
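
The hunk above omits the body of `DummyLengthRewardFunction`. A minimal sketch of what the documented behaviour implies (reward 1.0 when the completion is longer than 1024 characters, 0.0 otherwise) is shown below; the class body and the optional read of the injected `trainer_state` are reconstructions for illustration, not the exact code in the repository.

```python
from typing import List

from swift.plugin import ORM, orms


class DummyLengthRewardFunction(ORM):
    """Reward 1.0 when a completion is longer than 1024 characters, else 0.0."""

    def __call__(self, completions, **kwargs) -> List[float]:
        # trainer_state is injected by the GRPO trainer as of this commit; it is not
        # needed for this dummy reward but is read here to show the access pattern.
        trainer_state = kwargs.get('trainer_state')
        if trainer_state is not None:
            _ = trainer_state.global_step  # e.g. for step-dependent schedules
        return [1.0 if len(completion) > 1024 else 0.0 for completion in completions]


orms['dummy'] = DummyLengthRewardFunction
```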

docs/source_en/Instruction/GRPO/DeveloperGuide/reward_function.md

Lines changed: 13 additions & 6 deletions

````diff
@@ -1,6 +1,10 @@
 # Reward Function
 ## Custom Reward Function
-The reward function takes the model-generated text `completions` and other columns from the dataset as parameters (`kwargs`) and scores the model-generated text. Below is an example demonstrating how to implement a simple length-based reward function. This function assigns a reward signal of 1.0 if the length of the model-generated text exceeds 1024; otherwise, the reward signal is 0.0.
+The reward function takes as arguments (via kwargs) the model-generated completions, other columns from the dataset, and the training state, and calculates a reward score. The [trainer state](https://huggingface.co/docs/transformers/main/main_classes/callback#transformers.TrainerState) includes information such as the current training step.
+
+Note: The columns related to model input (such as query and response) are converted to the messages key. The original assistant response in the dataset will be discarded, so please use extra columns if you wish to retain it.
+
+Below is an example illustrating how to implement a simple length-based reward function. This function assigns a reward of 1.0 if the length of the generated completion exceeds 1024, and 0.0 otherwise.

 ```python
 from swift.plugin import ORM, orms
@@ -12,25 +16,28 @@ orms['dummy']= DummyLengthRewardFunction
 ```

 **Accessing Other Columns in the Dataset**
+For example, if the reward function needs to access the solution column from the dataset, as well as the current training step and the total number of steps for calculation, there are two ways to retrieve these values:

-For example, if the reward function needs to access the solution column from the dataset for auxiliary calculations, here are two ways to achieve this:

-Explicitly define the solution column name in the __call__ parameters:
+Explicitly define the column name in the __call__ parameters:
 ```python
-def __call__(completions, solution, **kwargs):
+def __call__(completions, solution, trainer_state, **kwargs):
     print(solution)
+    global_step = trainer_state.global_step
+    max_steps = trainer_state.max_steps
     ...
 ```

 Retrieve it from kwargs:
 ```python
 def __call__(completions, **kwargs):
     solution = kwargs.get('solution')
+    trainer_state = kwargs.get('trainer_state')
+    global_step = trainer_state.global_step
+    max_steps = trainer_state.max_steps
     ...
 ```

-Note: Columns related to messages (e.g., query, response) will be processed, and the original assistant responses in the dataset will be discarded. Use additional columns to retain such information.
-
 **Using Custom Reward Functions**

 You can add the reward function in [plugin program](https://github.com/modelscope/ms-swift/blob/main/examples/train/grpo/plugin/plugin.py), register it using the parameter `--external_plugins examples/train/grpo/plugin/plugin.py`, and specify it via the `reward_funcs` parameter.
````
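
To make the new `trainer_state` kwarg concrete, here is a hedged, self-contained sketch of a reward function that scales its score with training progress. The class name `StepAwareAccuracyReward`, the registry key `step_aware_accuracy`, the exact-match check against a `solution` column, and the 0.5-to-1.0 annealing schedule are illustrative assumptions, not part of this commit.

```python
from typing import List

from swift.plugin import ORM, orms


class StepAwareAccuracyReward(ORM):
    """Illustrative reward that weights an exact-match check by training progress."""

    def __call__(self, completions, solution, **kwargs) -> List[float]:
        # The GRPO trainer now forwards its TrainerState under 'trainer_state'.
        trainer_state = kwargs.get('trainer_state')
        # Fraction of training completed; max() guards against division by zero.
        progress = trainer_state.global_step / max(trainer_state.max_steps, 1)
        rewards = []
        for completion, sol in zip(completions, solution):
            correct = 1.0 if sol.strip() and sol.strip() in completion else 0.0
            # Ramp the reward weight from 0.5 to 1.0 over the course of training.
            rewards.append(correct * (0.5 + 0.5 * progress))
        return rewards


orms['step_aware_accuracy'] = StepAwareAccuracyReward
```

If placed in the plugin file referenced above, such a function could then be selected with `--external_plugins examples/train/grpo/plugin/plugin.py --reward_funcs step_aware_accuracy`, following the documentation's own registration example.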

examples/train/grpo/plugin/plugin.py

Lines changed: 9 additions & 3 deletions

````diff
@@ -462,7 +462,9 @@ def __init__(self):
         self.format_max_possible = 1.0
         self.format_min_possible = 0.0

-    def __call__(self, completions, solution, global_step, **kwargs) -> List[float]:
+    def __call__(self, completions, solution, **kwargs) -> List[float]:
+        trainer_state = kwargs.get('trainer_state')
+        global_step = trainer_state.global_step
         max_possible_reward = self.format_max_possible
         min_possible_reward = self.format_min_possible
         # Two stage (Coarse) Setting, divide training into two phases. Format Reward in [0,0.5] if step < 30 else [0,1]
@@ -521,9 +523,11 @@ def __init__(self):
         self.length_min_possible = 0.0

     # customized reward functions: length
-    def __call__(self, completions, solution, global_step, **kwargs):
+    def __call__(self, completions, solution, **kwargs):
         max_possible_reward = self.length_max_possible
         min_possible_reward = self.length_min_possible
+        trainer_state = kwargs.get('trainer_state')
+        global_step = trainer_state.global_step
         # SCHEDULELENGTH: enable Dynamic Length Reward
         if os.getenv('SCHEDULELENGTH', 0) == '1':
             max_reward_len = (640 - 384) * global_step / 105 + 384
@@ -639,7 +643,9 @@ def compute_tool_call_reward(self, gt_tools, pd_tools, max_possible_reward, min_
         return (max_possible_reward - min_possible_reward) * score / local_max_possible + min_possible_reward

     # custoimzed reward functions: tool call correctness
-    def __call__(self, completions, solution, global_step, **kwargs):
+    def __call__(self, completions, solution, **kwargs):
+        trainer_state = kwargs.get('trainer_state')
+        global_step = trainer_state.global_step
         max_possible_reward = self.tool_max_possible
         min_possible_reward = self.tool_min_possible
         # two stage (Coarse) Setting, divide training into two phases.
````
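
All three hunks above apply the same mechanical change: drop the explicit `global_step` parameter and recover it from `kwargs['trainer_state']`. For third-party plugins that may run against both older and newer swift versions, a hedged compatibility sketch is shown below; the fallback helper, the placeholder format check, and the two-stage threshold of 30 steps (mirroring the comment in the first hunk) are assumptions for illustration, not code from this commit.

```python
from typing import List


def _resolve_global_step(kwargs: dict) -> int:
    """Return the current training step from whichever kwarg the trainer provides.

    Newer swift versions pass a full TrainerState under 'trainer_state' (this commit);
    older versions passed an integer 'global_step' directly.
    """
    trainer_state = kwargs.get('trainer_state')
    if trainer_state is not None:
        return trainer_state.global_step
    return kwargs.get('global_step', 0)


class TwoStageFormatRewardSketch:
    """Illustrative reward whose ceiling rises from 0.5 to 1.0 after step 30."""

    def __call__(self, completions, solution, **kwargs) -> List[float]:
        global_step = _resolve_global_step(kwargs)
        max_possible_reward = 0.5 if global_step < 30 else 1.0
        # Placeholder format check; the real plugin inspects the completion format in detail.
        return [max_possible_reward if c.strip() else 0.0 for c in completions]
```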

swift/trainers/rlhf_trainer/grpo_trainer.py

Lines changed: 1 addition & 1 deletion

````diff
@@ -907,7 +907,7 @@ def _score_completions(self, inputs: InputsType) -> Tuple[torch.Tensor, torch.Te
         else:
             # Repeat all input columns (but "messages" and "completion") to match the number of generations
             reward_kwargs = RowPreprocessor.rows_to_batched(inputs)
-            reward_kwargs['global_step'] = self.state.global_step
+            reward_kwargs['trainer_state'] = self.state
             output_reward_func = reward_func(completions, **reward_kwargs)
             output_reward_func = [reward if reward is not None else torch.nan for reward in output_reward_func]
             rewards_per_func[:, i] = torch.tensor(output_reward_func, dtype=torch.float32, device=device)
````
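
The trainer-side change is a one-liner: the whole `TrainerState` object is placed into the reward kwargs instead of just its `global_step` field. A toy, standalone sketch of that dispatch (using a stand-in dataclass rather than the real `transformers.TrainerState`, and a hypothetical `dummy_reward`) illustrates what reward functions now receive:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class FakeTrainerState:
    """Stand-in for transformers.TrainerState with the fields reward funcs typically read."""
    global_step: int
    max_steps: int


def score_completions(reward_func: Callable[..., List[float]], completions: List[str],
                      reward_kwargs: Dict, state: FakeTrainerState) -> List[float]:
    # Mirrors the changed line: forward the whole state, not just state.global_step.
    reward_kwargs['trainer_state'] = state
    return reward_func(completions, **reward_kwargs)


def dummy_reward(completions, **kwargs):
    state = kwargs['trainer_state']
    scale = state.global_step / max(state.max_steps, 1)
    return [scale * min(len(c) / 1024, 1.0) for c in completions]


print(score_completions(dummy_reward, ['hello world'], {}, FakeTrainerState(10, 100)))
# -> [0.1 * 11/1024], roughly [0.00107]
```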
