diff --git a/README.md b/README.md index e2a2ae9c45..ba687f1dcb 100644 --- a/README.md +++ b/README.md @@ -73,7 +73,7 @@ Trinity-RFT provides functionalities for users with different backgrounds and ob | *Multi-step agentic RL* | + [Concatenated multi-turn workflow](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_multi_turn.html)
+ [General multi-step workflow](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_step_wise.html)
+ [ReAct workflow with an agent framework](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_react.html)
+ [Example: train a web-search agent](https://github.com/modelscope/Trinity-RFT/tree/main/examples/agentscope_websearch) | | *Full-lifecycle data pipelines* | + [Rollout task mixing and selection](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/develop_selector.html)
+ [Online task curriculum](https://github.com/modelscope/Trinity-RFT/tree/main/examples/bots) (📝 [paper](https://arxiv.org/pdf/2510.26374))
+ [Research project: learn-to-ask](https://github.com/modelscope/Trinity-RFT/tree/main/examples/learn_to_ask) (📝 [paper](https://arxiv.org/pdf/2510.25441))
+ [Experience replay with prioritization](https://github.com/modelscope/Trinity-RFT/tree/main/examples/ppo_countdown_exp_replay)
+ [Advanced data processing & human-in-the-loop](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_data_functionalities.html) | | *Algorithm development* | + [RL algorithm development with Trinity-RFT](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_mix_algo.html) (📝 [paper](https://arxiv.org/pdf/2508.11408))
+ [Research project: group-relative REINFORCE](https://github.com/modelscope/Trinity-RFT/tree/main/examples/rec_gsm8k) (📝 [paper](https://arxiv.org/abs/2509.24203))
+ Non-verifiable domains: [RULER](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_ruler), [trainable RULER](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_trainable_ruler), [rubric-as-reward](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_rubric_as_reward) | -| *Going deeper into Trinity-RFT* | + [Full configurations](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/trinity_configs.html)
+ [Benchmark toolkit for quick verification and experimentation](./benchmark/README.md)
+ [GPU Resource and Training Configuration Guide](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/trinity_gpu_configs.html)
+ [Understand the coordination between explorer and trainer](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/synchronizer.html) | +| *Going deeper into Trinity-RFT* | + [Full configurations](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/trinity_configs.html)
+ [Benchmark toolkit for quick verification and experimentation](./benchmark/README.md)
+ [GPU Resource and Training Configuration Guide](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/trinity_gpu_configs.html)
+ [Understand the coordination between explorer and trainer](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/synchronizer.html)
+ [How to align configuration with veRL](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/align_with_verl.html) | > [!NOTE] diff --git a/README_zh.md b/README_zh.md index 761e5c848f..ace996d51e 100644 --- a/README_zh.md +++ b/README_zh.md @@ -73,7 +73,7 @@ Trinity-RFT 面向不同背景和目标的用户提供相应功能: | *多轮智能体强化学习* | + [拼接多轮任务](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_multi_turn.html)
+ [通用多轮任务](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_step_wise.html)
+ [调用智能体框架中的 ReAct 工作流](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_react.html)
+ [例子:训练一个网络搜索智能体](https://github.com/modelscope/Trinity-RFT/tree/main/examples/agentscope_websearch) | | *全生命周期的数据流水线* | + [Rollout 任务混合与选取](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/develop_selector.html)
+ [在线任务选择](https://github.com/modelscope/Trinity-RFT/tree/main/examples/bots) (📝 [论文](https://arxiv.org/pdf/2510.26374))
+ [研究项目:learn-to-ask](https://github.com/modelscope/Trinity-RFT/tree/main/examples/learn_to_ask) (📝 [论文](https://arxiv.org/pdf/2510.25441))
+ [经验回放机制](https://github.com/modelscope/Trinity-RFT/tree/main/examples/ppo_countdown_exp_replay)
+ [高级数据处理能力 & Human-in-the-loop](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_data_functionalities.html) | | *强化学习算法开发* | + [使用 Trinity-RFT 进行 RL 算法开发](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_mix_algo.html) (📝 [论文](https://arxiv.org/pdf/2508.11408))
+ [研究项目: group-relative REINFORCE](https://github.com/modelscope/Trinity-RFT/tree/main/examples/rec_gsm8k) (📝 [论文](https://arxiv.org/abs/2509.24203))
+ 不可验证的领域: [RULER](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_ruler), [可训练 RULER](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_trainable_ruler), [rubric-as-reward](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_rubric_as_reward) | -| *深入认识 Trinity-RFT* | + [完整配置指南](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/trinity_configs.html)
+ [用于快速验证和实验的 Benchmark 工具](./benchmark/README.md)
+ [GPU 资源与训练配置对应指南](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/trinity_gpu_configs.html)
+ [理解 explorer-trainer 同步逻辑](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/synchronizer.html) | +| *深入认识 Trinity-RFT* | + [完整配置指南](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/trinity_configs.html)
+ [用于快速验证和实验的 Benchmark 工具](./benchmark/README.md)
+ [GPU 资源与训练配置对应指南](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/trinity_gpu_configs.html)
+ [理解 explorer-trainer 同步逻辑](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/synchronizer.html)
+ [如何与 verl 对齐配置](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/align_with_verl.html) | > [!NOTE] diff --git a/docs/sphinx_doc/source/index.rst b/docs/sphinx_doc/source/index.rst index 815223f135..0f013ce7ab 100644 --- a/docs/sphinx_doc/source/index.rst +++ b/docs/sphinx_doc/source/index.rst @@ -25,6 +25,7 @@ Welcome to Trinity-RFT's documentation! tutorial/trinity_configs.md tutorial/trinity_gpu_configs.md tutorial/synchronizer.md + tutorial/align_with_verl.md .. toctree:: diff --git a/docs/sphinx_doc/source/main.md b/docs/sphinx_doc/source/main.md index 53903c675a..773a5666fc 100644 --- a/docs/sphinx_doc/source/main.md +++ b/docs/sphinx_doc/source/main.md @@ -31,7 +31,7 @@ Trinity-RFT provides functionalities for users with different backgrounds and ob | *Multi-step agentic RL* | + [Concatenated multi-turn workflow](/tutorial/example_multi_turn.md)
+ [General multi-step workflow](/tutorial/example_step_wise.md)
+ [ReAct workflow with an agent framework](/tutorial/example_react.md)
+ [Example: train a web-search agent](https://github.com/modelscope/Trinity-RFT/tree/main/examples/agentscope_websearch) | | *Full-lifecycle data pipelines* | + [Rollout task mixing and selection](/tutorial/develop_selector.md)
+ [Online task curriculum](https://github.com/modelscope/Trinity-RFT/tree/main/examples/bots) (📝 [paper](https://arxiv.org/pdf/2510.26374))
+ [Research project: learn-to-ask](https://github.com/modelscope/Trinity-RFT/tree/main/examples/learn_to_ask) (📝 [paper](https://arxiv.org/pdf/2510.25441))
+ [Experience replay with prioritization](https://github.com/modelscope/Trinity-RFT/tree/main/examples/ppo_countdown_exp_replay)
+ [Advanced data processing & human-in-the-loop](/tutorial/example_data_functionalities.md) | | *Algorithm development* | + [RL algorithm development with Trinity-RFT](/tutorial/example_mix_algo.md) (📝 [paper](https://arxiv.org/pdf/2508.11408))
+ [Research project: group-relative REINFORCE](https://github.com/modelscope/Trinity-RFT/tree/main/examples/rec_gsm8k) (📝 [paper](https://arxiv.org/abs/2509.24203))
+ Non-verifiable domains: [RULER](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_ruler), [trainable RULER](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_trainable_ruler), [rubric-as-reward](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_rubric_as_reward) | -| *Going deeper into Trinity-RFT* | + [Full configurations](/tutorial/trinity_configs.md)
+ [Benchmark toolkit for quick verification and experimentation](https://github.com/modelscope/Trinity-RFT/tree/main/benchmark/README.md)
+ [GPU Resource and Training Configuration Guide](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/trinity_gpu_configs.html)
+ [Understand the coordination between explorer and trainer](/tutorial/synchronizer.md) | +| *Going deeper into Trinity-RFT* | + [Full configurations](/tutorial/trinity_configs.md)
+ [Benchmark toolkit for quick verification and experimentation](https://github.com/modelscope/Trinity-RFT/tree/main/benchmark/README.md)
+ [GPU Resource and Training Configuration Guide](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/trinity_gpu_configs.html)
+ [Understand the coordination between explorer and trainer](/tutorial/synchronizer.md)
+ [How to align configuration with veRL](/tutorial/align_with_verl.md) |
diff --git a/docs/sphinx_doc/source/tutorial/align_with_verl.md b/docs/sphinx_doc/source/tutorial/align_with_verl.md
new file mode 100644
index 0000000000..0fad0d2884
--- /dev/null
+++ b/docs/sphinx_doc/source/tutorial/align_with_verl.md
@@ -0,0 +1,519 @@
# How to align configuration with veRL

This guide helps users familiar with [veRL](https://github.com/volcengine/verl) map the parameters and metrics of Trinity-RFT to their counterparts in veRL.

Trinity-RFT uses [veRL](https://github.com/volcengine/verl) as the training backend (`trainer`), including the actor, reference, and critic models. The `explorer` module in Trinity-RFT is implemented on top of [vllm](https://github.com/vllm-project/vllm) and replaces veRL's native rollout engine. In addition, Trinity-RFT introduces a new `buffer` module to support RFT's full-lifecycle data pipeline; it can be viewed as a further enhancement of veRL's RL dataset and DataProto.


## Parameter Mapping

The core parameters in veRL fall into these categories: `algorithm`, `data`, `actor_rollout_ref`, `critic`, `reward_model`, and `trainer`.
Trinity-RFT organizes the many parameters of reinforcement fine-tuning into groups according to their functions, e.g., `algorithm`, `model`, `buffer`, `explorer`, `trainer`, `monitor`, `synchronizer`, and `cluster`.

Roughly speaking, the parameters in veRL map to the following modules in Trinity-RFT:

| Configuration | veRL | Trinity-RFT |
|:----------|:-----|:-----|
| Algorithm, e.g., advantage function | `algorithm` | `algorithm` |
| Training and evaluation tasksets | `data` | `buffer.explorer_input` |
| Batch size (💡 explained later) | `data.train_batch_size` and `actor_rollout_ref.actor.ppo_mini_batch_size` | `buffer.batch_size` and `buffer.train_batch_size` |
| Actor | `actor_rollout_ref.actor` | `model` and `trainer` |
| Rollout | `actor_rollout_ref.rollout` | `explorer.rollout_model` |
| Critic | `critic` | `trainer.trainer_config.critic` |
| Reward model | `reward_model` | `explorer.auxiliary_models` |
| Some global configurations | `trainer` | `monitor`, `synchronizer`, `cluster`, etc. |

In the following, we show how to map the parameters in veRL to the ones in Trinity-RFT. Please refer to the [documentation](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/trinity_configs.html) for the detailed parameter configuration of Trinity-RFT.

```{note}
To match the default training setup of veRL, we set `synchronizer.sync_style=fixed` and `synchronizer.sync_offset=0` in Trinity-RFT.
```

### Algorithm

| veRL | Trinity-RFT | Note |
|:-----|:-----|:-----|
| `algorithm.adv_estimator` | `algorithm.advantage_fn` | Pass parameters with `algorithm.advantage_fn_args` |
| `algorithm.gamma` | `algorithm.advantage_fn_args.gamma` | Along with `algorithm.advantage_fn: ppo/reinforceplusplus` |
| `algorithm.lam` | `algorithm.advantage_fn_args.lam` | Along with `algorithm.advantage_fn: ppo` |
| `algorithm.use_kl_in_reward` | `algorithm.kl_penalty_fn` | Disable KL in the reward by setting `algorithm.kl_penalty_fn=none` |
| `algorithm.kl_penalty` | `algorithm.kl_penalty_fn` | Choose from `k2`, `low_var_kl`, etc. |
| `algorithm.kl_ctrl.kl_coef` | `algorithm.kl_penalty_fn_args.kl_coef` | - |

💡 Detailed explanation:

* Before passing arguments to an advantage function or policy loss function (e.g., via `algorithm.advantage_fn_args` or `algorithm.kl_penalty_fn_args`), it is good practice to check the source code to make sure the corresponding function actually accepts these parameters.


### Data

| veRL | Trinity-RFT | Note |
|:-----|:-----|:-----|
| `data.train_files` | `buffer.explorer_input.taskset.path` or `buffer.explorer_input.tasksets[i].path` | - |
| `data.val_files` | `buffer.explorer_input.eval_tasksets[i].path` | - |
| `data.prompt_key` | `buffer.explorer_input.taskset.format.prompt_key` | Taskset-specific |
| `data.response_key` | `buffer.explorer_input.taskset.format.response_key` | Taskset-specific |
| `data.train_batch_size` | `buffer.batch_size` * `synchronizer.sync_interval` | The number of tasks to be explored |
| `data.val_batch_size` | `buffer.batch_size` | Deprecated in veRL |
| `data.max_prompt_length` | `model.max_prompt_tokens` | - |
| `data.max_response_length` | `model.max_response_tokens` | - |
| `data.filter_overlong_prompts` | `model.enable_prompt_truncation` | Explained later |
| `data.truncation` | - | Equivalent to `right` |
| `data.shuffle` | `buffer.explorer_input.taskset.task_selector.selector_type:random` | Taskset-specific |

💡 Detailed explanation:

* The note `Taskset-specific` means you can set different parameters for each training or evaluation task in `buffer.explorer_input.tasksets[i]` or `buffer.explorer_input.eval_tasksets[i]`.

* For the batch-size-related parameters, Trinity-RFT uses `buffer.batch_size` to control the number of tasks to be explored in each exploration step, and `buffer.train_batch_size` to control the number of experiences used in each gradient descent step. In most cases, setting the following parameters ensures the same effect as veRL:
  - `buffer.batch_size` in Trinity-RFT = `actor_rollout_ref.actor.ppo_mini_batch_size` in veRL
  - `buffer.train_batch_size` in Trinity-RFT (set automatically) = `actor_rollout_ref.rollout.n` * `actor_rollout_ref.actor.ppo_mini_batch_size` in veRL
  - `synchronizer.sync_interval` in Trinity-RFT = `data.train_batch_size` / `actor_rollout_ref.actor.ppo_mini_batch_size` in veRL
  - Do not set `ppo_mini_batch_size` yourself; it is set automatically to match the effect of veRL, although the value may not be identical.

* If you want to filter out overlong prompts, set `model.enable_prompt_truncation=True` in Trinity-RFT. In this case, the corresponding experiences are excluded from the loss computation, so the `truncation` side no longer matters.


### Actor, Rollout, and Critic

This section covers the parameters for the actor, the rollout, and the critic.
For easy understanding, you may think of the actor in veRL (`actor_rollout_ref.actor`) as the trainer in Trinity-RFT (`trainer`), and of the rollout (`actor_rollout_ref.rollout`) as the explorer (`explorer.rollout_model`).

```{note}
Parameters under `actor_rollout_ref.rollout` have no effect in Trinity-RFT; please set them in the appropriate fields instead.
```

Advanced veRL training options can be set under the `trainer.trainer_config` field. For example, `actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu` in veRL is equivalent to `trainer.trainer_config.actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu` in Trinity-RFT. If you want to set parameters in the `trainer.trainer_config` dictionary, please read the source code in `trinity/common/verl_config.py` carefully!


| veRL | Trinity-RFT | Note |
|:-----|:-----|:-----|
| `actor_rollout_ref.model.path` | `model.model_path` | - |
| `actor_rollout_ref.actor.optim` | `algorithm.optimizer` | Such as `lr` and `weight_decay` |
| `actor_rollout_ref.rollout.n` | `algorithm.repeat_times` | Eval taskset-specific: `eval_tasksets[i].repeat_times` |
| `actor_rollout_ref.actor.ppo_mini_batch_size` | `buffer.batch_size` | The number of tasks to be explored in each exploration step |
| `actor_rollout_ref.actor.use_dynamic_bsz` | `trainer.use_dynamic_bsz` | - |
| `actor_rollout_ref.actor.ppo_max_token_len_per_gpu` | `trainer.max_token_len_per_gpu` | - |
| `actor_rollout_ref.actor.ulysses_sequence_parallel_size` | `trainer.ulysses_sequence_parallel_size` | The sequence parallel size for the actor |
| `actor_rollout_ref.actor.grad_clip` | `trainer.grad_clip` | The gradient clip value for the actor |
| `actor_rollout_ref.actor.use_kl_loss` | `algorithm.kl_loss_fn` | If set to `none`, the KL divergence loss will not be computed |
| `actor_rollout_ref.rollout.gpu_memory_utilization` | `explorer.rollout_model.gpu_memory_utilization` | - |
| `actor_rollout_ref.rollout.temperature` | `model.temperature` | Can be taskset-specific, e.g., `buffer.explorer_input.taskset.rollout_args.temperature` |
| `actor_rollout_ref.rollout.top_p` | `model.top_p` | Can be taskset-specific |
| `actor_rollout_ref.rollout.top_k` | `model.top_k` | Can be taskset-specific |
| `actor_rollout_ref.rollout.tensor_model_parallel_size` | `explorer.rollout_model.tensor_parallel_size` | - |
| `actor_rollout_ref.rollout.val_kwargs` | `buffer.explorer_input.eval_tasksets[i]` | Taskset-specific |
| `critic.model.path` | `model.critic_model_path` | Defaults to `model.model_path` |

💡 Detailed explanation:

* The note `Can be taskset-specific` (taking `temperature` as an example) means you can set `model.temperature` for all tasksets, or set a different value for each taskset in `buffer.explorer_input.taskset.rollout_args.temperature` or `buffer.explorer_input.eval_tasksets[i].rollout_args.temperature`. A concrete example is as follows:
```yaml
buffer:
  explorer_input:
    eval_tasksets:
      - name: AIME2024
        storage_type: file
        path: HuggingFaceH4/aime_2024
        split: 'train'
        repeat_times: 32
        format:
          prompt_key: 'question'
          response_key: 'answer'
        rollout_args:
          temperature: 1.0
          top_p: 0.7
```

### Reward Model

Trinity-RFT supports taskset-specific reward functions as well as reward models. For custom reward functions, set `buffer.explorer_input.default_reward_fn_type` to select the corresponding reward function; you can also configure reward models under `explorer.auxiliary_models` and use them within your workflow.
For example,
```yaml
buffer:
  explorer_input:
    default_reward_fn_type: 'custom_reward'
explorer:
  auxiliary_models:
    - model_path: Qwen/Qwen3-30B-A3B-Instruct-2507
      engine_num: 1
      tensor_parallel_size: 2
      enable_thinking: false
      max_prompt_tokens: 19456
      max_response_tokens: 1024
      max_model_len: 20480
```
Please refer to the LLM-as-a-judge [configuration](https://github.com/modelscope/Trinity-RFT/blob/main/examples/grpo_rubric_as_reward/rubric.yaml) and [workflow](https://github.com/modelscope/Trinity-RFT/blob/main/trinity/common/workflows/rubric_judge_workflow.py) for more details.


### Trainer

| veRL | Trinity-RFT | Note |
|:-----|:-----|:-----|
| `trainer.logger` | `monitor.monitor_type` | Set the chosen backend (e.g., `wandb`); `console` is enabled by default and need not be set |
| `trainer.project_name` | `project` | - |
| `trainer.experiment_name` | `name` | - |
| `trainer.default_local_dir` | `checkpoint_root_dir` | Checkpoints are saved in `<checkpoint_root_dir>/<project>/<name>` |
| `trainer.n_gpus_per_node` | `cluster.gpu_per_node` | - |
| `trainer.nnodes` | `cluster.node_num` | - |
| `trainer.save_freq` | `trainer.save_interval` | - |
| `trainer.test_freq` | `explorer.eval_interval` | - |
| `trainer.total_epochs` | `buffer.total_epochs` | - |
| `trainer.total_training_steps` | `buffer.total_steps` and `trainer.total_steps` | If not None, `buffer.total_epochs` will be ignored |
| `trainer.critic_warmup` | `trainer.trainer_config.trainer.critic_warmup` | - |
| `trainer.val_before_train` | `explorer.eval_on_startup` | - |
| `trainer.resume_mode` | `continue_from_checkpoint` | Explained later |
| `trainer.resume_from_path` | - | Explained later |

💡 Detailed explanation:

* If you want to resume training from a checkpoint, set `continue_from_checkpoint` to `True`; training will then resume from the latest checkpoint under `<checkpoint_root_dir>/<project>/<name>` (if any).


## GPU Resource Allocation

In Trinity-RFT, GPU resources are allocated manually among the `explorer`, the auxiliary models (if any), and the `trainer`.

* There are `cluster.node_num` nodes in total, and each node has `cluster.gpu_per_node` GPUs.
* The number of GPUs for the `explorer` is `explorer.rollout_model.engine_num` * `explorer.rollout_model.tensor_parallel_size`.
* The number of GPUs for auxiliary models is the sum of `explorer.auxiliary_models[i].engine_num` * `explorer.auxiliary_models[i].tensor_parallel_size`.
* The remaining GPUs are used by the `trainer`.


## Metrics Mapping

### Why do we see two runs for each experiment?

In Trinity-RFT, the explorer is responsible for the rollout process, while the trainer is responsible for the training process. Metrics from these two processes are computed independently and uploaded to the monitor as separate runs. This is why you will see two runs for each experiment, distinguished by the `_explorer` or `_trainer` suffix.


### Why are some metrics different from veRL?

Trinity-RFT uses [vllm](https://github.com/vllm-project/vllm) as the rollout engine and veRL as the training backend. Due to precision differences between these frameworks, the log probabilities computed on the same tokens may differ. As a result, some metrics (e.g., `actor/ppo_kl` and `actor/pg_clipfrac`) may differ from those observed in veRL. However, when using the same parameters as veRL, these differences are expected to be small.


## Example: PPO Training

We transfer a PPO training example `run_qwen2-7b_rm.sh` from veRL to Trinity-RFT.
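As a quick sanity check, the sketch below applies the mapping rules above to this example: the batch-size, synchronization, and GPU numbers in the Trinity-RFT config further below follow from the values in the veRL script (assuming the veRL default `actor_rollout_ref.rollout.n=1`, since the script does not set it). The split of the 8 GPUs among the rollout engine, the reward model, and the trainer is a choice made for this example, not a value taken from the veRL script.

```yaml
# Derived values only, not a complete config:
buffer:
  batch_size: 256        # = actor_rollout_ref.actor.ppo_mini_batch_size
  train_batch_size: 256  # = ppo_mini_batch_size * rollout.n = 256 * 1
synchronizer:
  sync_interval: 4       # = data.train_batch_size / ppo_mini_batch_size = 1024 / 256
# GPU budget on 1 node * 8 GPUs:
#   rollout model: engine_num 2 * tensor_parallel_size 1 = 2 GPUs
#   reward model:  engine_num 2 * tensor_parallel_size 1 = 2 GPUs
#   trainer:       the remaining 4 GPUs
```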
+ +The configuration file of veRL is as follows: +```bash +gsm8k_train_path=$HOME/data/gsm8k/train.parquet +gsm8k_test_path=$HOME/data/gsm8k/test.parquet +math_train_path=$HOME/data/math/train.parquet +math_test_path=$HOME/data/math/test.parquet + +train_files="['$gsm8k_train_path', '$math_train_path']" +test_files="['$gsm8k_test_path', '$math_test_path']" + +# prepare model ckpt +huggingface-cli download Qwen/Qwen2-7B-Instruct --local-dir $HOME/models/Qwen2-7B-Instruct & +huggingface-cli download sfairXC/FsfairX-LLaMA3-RM-v0.1 --local-dir $HOME/models/FsfairX-LLaMA3-RM-v0.1 & +wait + +python3 -m verl.trainer.main_ppo \ + algorithm.adv_estimator=gae \ + data.train_files="$train_files" \ + data.val_files="$test_files" \ + data.train_batch_size=1024 \ + data.max_prompt_length=1024 \ + data.max_response_length=512 \ + data.filter_overlong_prompts=True \ + data.truncation='error' \ + data.return_raw_chat=True \ + actor_rollout_ref.model.path="$HOME/models/Qwen2-7B-Instruct" \ + actor_rollout_ref.actor.optim.lr=1e-6 \ + actor_rollout_ref.model.use_remove_padding=True \ + actor_rollout_ref.actor.optim.lr_warmup_steps_ratio=0.1 \ + actor_rollout_ref.actor.ppo_mini_batch_size=256 \ + actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=16 \ + actor_rollout_ref.actor.use_kl_loss=False \ + actor_rollout_ref.model.enable_gradient_checkpointing=True \ + actor_rollout_ref.actor.fsdp_config.param_offload=False \ + actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \ + actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=16 \ + actor_rollout_ref.rollout.tensor_model_parallel_size=1 \ + actor_rollout_ref.rollout.name=vllm \ + actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \ + critic.optim.lr=1e-5 \ + critic.model.use_remove_padding=True \ + critic.optim.lr_warmup_steps_ratio=0.05 \ + critic.model.path="$HOME/models/Qwen2-7B-Instruct" \ + critic.model.enable_gradient_checkpointing=True \ + critic.ppo_micro_batch_size_per_gpu=32 \ + critic.model.fsdp_config.param_offload=False \ + critic.model.fsdp_config.optimizer_offload=False \ + reward_model.enable=True \ + reward_model.model.path="$HOME/models/FsfairX-LLaMA3-RM-v0.1" \ + reward_model.model.use_remove_padding=True \ + reward_model.model.fsdp_config.param_offload=True \ + reward_model.micro_batch_size_per_gpu=32 \ + algorithm.use_kl_in_reward=False \ + trainer.critic_warmup=0 \ + trainer.logger='["console","wandb"]' \ + trainer.project_name='verl_example' \ + trainer.val_before_train=False \ + trainer.experiment_name='Qwen2-7B-Instruct_hybrid_rm' \ + trainer.n_gpus_per_node=8 \ + trainer.nnodes=1 \ + trainer.save_freq=20 \ + trainer.test_freq=5 \ + trainer.total_epochs=15 $@ +``` + +The corresponding configuration of Trinity-RFT (ppo_example.yaml) is as follows: +```yaml +project: verl_example +name: Qwen2-7B-Instruct_hybrid_rm +checkpoint_root_dir: ./checkpoints +algorithm: + algorithm_type: ppo + repeat_times: 1 + optimizer: + lr: 1e-6 + lr_warmup_steps_ratio: 0.1 # actor_rollout_ref.actor.optim.lr_warmup_steps_ratio + advantage_fn: ppo # algorithm.adv_estimator=gae + kl_penalty_fn: none # algorithm.use_kl_in_reward=False + kl_loss_fn: none # actor_rollout_ref.actor.use_kl_loss=False + +model: + model_path: ${oc.env:HOME}/models/Qwen2-7B-Instruct + critic_model_path: ${oc.env:HOME}/models/Qwen2-7B-Instruct # critic.model.path + max_prompt_tokens: 1024 # data.max_prompt_length + max_response_tokens: 512 # data.max_response_length + enable_prompt_truncation: true # data.filter_overlong_prompts=True + +cluster: + node_num: 1 # 
trainer.nnodes + gpu_per_node: 8 # trainer.n_gpus_per_node + +buffer: + total_epochs: 15 # trainer.total_epochs + batch_size: 256 # actor_rollout_ref.actor.ppo_mini_batch_size + train_batch_size: 256 # actor_rollout_ref.actor.ppo_mini_batch_size * actor_rollout_ref.rollout.n=256*1=256 + explorer_input: + tasksets: + - name: gsm8k + storage_type: file + path: ${oc.env:HOME}/data/gsm8k + split: train + format: + prompt_key: prompt # Check the dataset format + response_key: answer # Check the dataset format + - name: math + storage_type: file + path: ${oc.env:HOME}/data/math + split: train + format: + prompt_key: prompt # Check the dataset format + response_key: answer # Check the dataset format + rollout_args: + temperature: 1.0 + eval_tasksets: + - name: gsm8k_eval + storage_type: file + path: ${oc.env:HOME}/data/gsm8k + split: test + format: + prompt_key: prompt # Check the dataset format + response_key: answer # Check the dataset format + - name: math_eval + storage_type: file + path: ${oc.env:HOME}/data/math + split: test + format: + prompt_key: prompt # Check the dataset format + response_key: answer # Check the dataset format + +explorer: + eval_interval: 5 # trainer.test_freq + eval_on_startup: false # trainer.val_before_train=False + rollout_model: + engine_num: 2 # The number of GPUs for the rollout model + tensor_parallel_size: 1 # actor_rollout_ref.rollout.tensor_model_parallel_size + gpu_memory_utilization: 0.6 # actor_rollout_ref.rollout.gpu_memory_utilization + auxiliary_models: # reward_model configuration + - model_path: ${oc.env:HOME}/models/FsfairX-LLaMA3-RM-v0.1 + engine_num: 2 # The number of GPUs for the reward model + tensor_parallel_size: 1 + +synchronizer: + sync_style: fixed + sync_offset: 1 + sync_interval: 4 # sync_interval = data.train_batch_size / actor_rollout_ref.actor.ppo_mini_batch_size + sync_timeout: 1200 + +trainer: + save_interval: 20 # trainer.save_freq + trainer_config: + actor_rollout_ref: + model: + use_remove_padding: true # actor_rollout_ref.model.use_remove_padding + enable_gradient_checkpointing: true # actor_rollout_ref.model.enable_gradient_checkpointing + actor: + ppo_micro_batch_size_per_gpu: 16 # actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu + fsdp_config: + param_offload: false # actor_rollout_ref.actor.fsdp_config.param_offload + optimizer_offload: false # actor_rollout_ref.actor.fsdp_config.optimizer_offload + rollout: + log_prob_micro_batch_size_per_gpu: 16 # actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu + critic: + model: + use_remove_padding: true # critic.model.use_remove_padding + enable_gradient_checkpointing: true # critic.model.enable_gradient_checkpointing + fsdp_config: + param_offload: false # critic.model.fsdp_config.param_offload + optimizer_offload: false # critic.model.fsdp_config.optimizer_offload + optim: + lr: 1e-5 # critic.optim.lr + lr_warmup_steps_ratio: 0.05 # critic.optim.lr_warmup_steps_ratio + ppo_micro_batch_size_per_gpu: 32 # critic.ppo_micro_batch_size_per_gpu + trainer: + critic_warmup: 0 # trainer.critic_warmup + +monitor: + monitor_type: wandb # trainer.logger='["console","wandb"]' - wandb is the set value, console is default +``` + +The command to run this example is: +```bash +trinity run --config ppo_example.yaml +``` + + +## Example: GRPO Training + +We transfer a GRPO training example `run_deepseek7b_llm_seq_balance.sh` from veRL to Trinity-RFT. 
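As with the PPO example, the sketch below shows how the key values in the Trinity-RFT config further below are derived from the veRL script via the mapping rules above. The rollout engine layout (one engine with tensor parallel size 2) is a choice made for this example rather than a value taken from the veRL script.

```yaml
# Derived values only, not a complete config:
algorithm:
  repeat_times: 8         # = actor_rollout_ref.rollout.n
buffer:
  batch_size: 256         # = actor_rollout_ref.actor.ppo_mini_batch_size
  train_batch_size: 2048  # = ppo_mini_batch_size * rollout.n = 256 * 8
synchronizer:
  sync_interval: 4        # = data.train_batch_size / ppo_mini_batch_size = 1024 / 256
# GPU budget on 1 node * 8 GPUs:
#   rollout model: engine_num 1 * tensor_parallel_size 2 = 2 GPUs
#   trainer:       the remaining 6 GPUs
```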
+ +The configuration file of veRL is as follows: +```bash +set -x + +python3 -m verl.trainer.main_ppo \ + algorithm.adv_estimator=grpo \ + data.train_files=$HOME/data/gsm8k/train.parquet \ + data.val_files=$HOME/data/gsm8k/test.parquet \ + data.train_batch_size=1024 \ + data.max_prompt_length=512 \ + data.max_response_length=512 \ + data.filter_overlong_prompts=True \ + data.truncation='error' \ + actor_rollout_ref.model.path=deepseek-ai/deepseek-llm-7b-chat \ + actor_rollout_ref.actor.optim.lr=1e-6 \ + actor_rollout_ref.model.use_remove_padding=True \ + actor_rollout_ref.actor.ppo_mini_batch_size=256 \ + actor_rollout_ref.actor.use_dynamic_bsz=True \ + actor_rollout_ref.actor.ppo_max_token_len_per_gpu=24000 \ + actor_rollout_ref.actor.use_kl_loss=True \ + actor_rollout_ref.actor.kl_loss_coef=0.001 \ + actor_rollout_ref.actor.kl_loss_type=low_var_kl \ + actor_rollout_ref.actor.entropy_coeff=0 \ + actor_rollout_ref.model.enable_gradient_checkpointing=True \ + actor_rollout_ref.actor.fsdp_config.param_offload=False \ + actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \ + actor_rollout_ref.rollout.tensor_model_parallel_size=2 \ + actor_rollout_ref.rollout.name=vllm \ + actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \ + actor_rollout_ref.rollout.n=8 \ + actor_rollout_ref.ref.fsdp_config.param_offload=True \ + algorithm.use_kl_in_reward=False \ + trainer.critic_warmup=0 \ + trainer.logger='["console","wandb"]' \ + trainer.project_name='verl_grpo_example_gsm8k' \ + trainer.experiment_name='deepseek_llm_7b_function_rm_seq_packing' \ + trainer.n_gpus_per_node=8 \ + trainer.nnodes=1 \ + trainer.save_freq=20 \ + trainer.test_freq=5 \ + trainer.total_epochs=15 $@ +``` + +The corresponding configuration of Trinity-RFT (grpo_example.yaml) is as follows: +```yaml +project: verl_grpo_example_gsm8k +name: deepseek_llm_7b_function_rm_seq_packing +checkpoint_root_dir: ./checkpoints +algorithm: + algorithm_type: grpo + repeat_times: 8 # actor_rollout_ref.rollout.n=8 + optimizer: + lr: 1e-6 # actor_rollout_ref.actor.optim.lr + advantage_fn: grpo # algorithm.adv_estimator=grpo + kl_penalty_fn: none # algorithm.use_kl_in_reward=False + kl_loss_fn: low_var_kl # actor_rollout_ref.actor.kl_loss_type=low_var_kl + kl_loss_fn_args: + kl_coef: 0.001 # actor_rollout_ref.actor.kl_loss_coef + entropy_loss_fn_args: + entropy_coef: 0 # actor_rollout_ref.actor.entropy_coeff=0 + +model: + model_path: deepseek-ai/deepseek-llm-7b-chat # actor_rollout_ref.model.path + max_prompt_tokens: 512 # data.max_prompt_length + max_response_tokens: 512 # data.max_response_length + enable_prompt_truncation: true # data.filter_overlong_prompts=True + +cluster: + node_num: 1 # trainer.nnodes + gpu_per_node: 8 # trainer.n_gpus_per_node + +buffer: + total_epochs: 15 # trainer.total_epochs + batch_size: 256 # actor_rollout_ref.actor.ppo_mini_batch_size + train_batch_size: 2048 # actor_rollout_ref.actor.ppo_mini_batch_size * actor_rollout_ref.rollout.n=256*8=2048 + explorer_input: + tasksets: + - name: gsm8k + storage_type: file + path: ${oc.env:HOME}/data/gsm8k + split: train + format: + prompt_key: prompt # Check the dataset format + response_key: answer # Check the dataset format + eval_tasksets: + - name: gsm8k_eval + storage_type: file + path: ${oc.env:HOME}/data/gsm8k + split: test + format: + prompt_key: prompt # Check the dataset format + response_key: answer # Check the dataset format + +explorer: + eval_interval: 5 # trainer.test_freq + rollout_model: + engine_num: 1 + tensor_parallel_size: 2 # 
actor_rollout_ref.rollout.tensor_model_parallel_size + gpu_memory_utilization: 0.6 # actor_rollout_ref.rollout.gpu_memory_utilization + +synchronizer: + sync_style: fixed + sync_offset: 1 + sync_interval: 4 # data.train_batch_size / actor_rollout_ref.actor.ppo_mini_batch_size in veRL + sync_timeout: 1200 + +trainer: + save_interval: 20 # trainer.save_freq + use_dynamic_bsz: true # actor_rollout_ref.actor.use_dynamic_bsz=True + max_token_len_per_gpu: 24000 # actor_rollout_ref.actor.ppo_max_token_len_per_gpu + trainer_config: + actor_rollout_ref: + model: + use_remove_padding: true # actor_rollout_ref.model.use_remove_padding=True + enable_gradient_checkpointing: true # actor_rollout_ref.model.enable_gradient_checkpointing=True + actor: + fsdp_config: + param_offload: false # actor_rollout_ref.actor.fsdp_config.param_offload=False + optimizer_offload: false # actor_rollout_ref.actor.fsdp_config.optimizer_offload=False + ref: + fsdp_config: + param_offload: true # actor_rollout_ref.ref.fsdp_config.param_offload=True + trainer: + critic_warmup: 0 # trainer.critic_warmup=0 + +monitor: + monitor_type: wandb # trainer.logger='["console","wandb"]' - wandb is extracted, console is default +``` + +The command to run this example is: +```bash +trinity run --config grpo_example.yaml +``` diff --git a/docs/sphinx_doc/source_zh/index.rst b/docs/sphinx_doc/source_zh/index.rst index 378f2dcc91..dcafd7d147 100644 --- a/docs/sphinx_doc/source_zh/index.rst +++ b/docs/sphinx_doc/source_zh/index.rst @@ -24,6 +24,7 @@ tutorial/trinity_configs.md tutorial/trinity_gpu_configs.md tutorial/synchronizer.md + tutorial/align_with_verl.md .. toctree:: :maxdepth: 1 diff --git a/docs/sphinx_doc/source_zh/main.md b/docs/sphinx_doc/source_zh/main.md index 6812e45a4a..ff084e01ae 100644 --- a/docs/sphinx_doc/source_zh/main.md +++ b/docs/sphinx_doc/source_zh/main.md @@ -30,7 +30,7 @@ Trinity-RFT 面向不同背景和目标的用户提供相应功能: | *多轮智能体强化学习* | + [拼接多轮任务](/tutorial/example_multi_turn.md)
+ [通用多轮任务](/tutorial/example_step_wise.md)
+ [调用智能体框架中的 ReAct 工作流](/tutorial/example_react.md)
+ [例子:训练一个网络搜索智能体](https://github.com/modelscope/Trinity-RFT/tree/main/examples/agentscope_websearch) | | *全生命周期的数据流水线* | + [Rollout 任务混合与选取](/tutorial/develop_selector.md)
+ [在线任务选择](https://github.com/modelscope/Trinity-RFT/tree/main/examples/bots) (📝 [论文](https://arxiv.org/pdf/2510.26374))
+ [研究项目:learn-to-ask](https://github.com/modelscope/Trinity-RFT/tree/main/examples/learn_to_ask) (📝 [论文](https://arxiv.org/pdf/2510.25441))
+ [经验回放机制](https://github.com/modelscope/Trinity-RFT/tree/main/examples/ppo_countdown_exp_replay)
+ [高级数据处理能力 & Human-in-the-loop](/tutorial/example_data_functionalities.md) | | *强化学习算法开发* | + [使用 Trinity-RFT 进行 RL 算法开发](/tutorial/example_mix_algo.md) (📝 [论文](https://arxiv.org/pdf/2508.11408))
+ [研究项目: group-relative REINFORCE](https://github.com/modelscope/Trinity-RFT/tree/main/examples/rec_gsm8k) (📝 [论文](https://arxiv.org/abs/2509.24203))
+ 不可验证的领域: [RULER](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_ruler), [可训练 RULER](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_trainable_ruler), [rubric-as-reward](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_rubric_as_reward) | -| *深入认识 Trinity-RFT* | + [完整配置指南](/tutorial/trinity_configs.md)
+ [用于快速验证和实验的 Benchmark 工具](https://github.com/modelscope/Trinity-RFT/tree/main/benchmark/README.md)
+ [GPU 资源与训练配置对应指南](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/trinity_gpu_configs.html)
+ [理解 explorer-trainer 同步逻辑](/tutorial/synchronizer.md) | +| *深入认识 Trinity-RFT* | + [完整配置指南](/tutorial/trinity_configs.md)
+ [用于快速验证和实验的 Benchmark 工具](https://github.com/modelscope/Trinity-RFT/tree/main/benchmark/README.md)
+ [GPU 资源与训练配置对应指南](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/trinity_gpu_configs.html)
+ [理解 explorer-trainer 同步逻辑](/tutorial/synchronizer.md)
+ [如何和 veRL 对齐配置](/tutorial/align_with_verl.md) | diff --git a/docs/sphinx_doc/source_zh/tutorial/align_with_verl.md b/docs/sphinx_doc/source_zh/tutorial/align_with_verl.md new file mode 100644 index 0000000000..4825ad69d8 --- /dev/null +++ b/docs/sphinx_doc/source_zh/tutorial/align_with_verl.md @@ -0,0 +1,518 @@ +# 如何和 veRL 对齐配置 + +本指南为熟悉 [veRL](https://github.com/volcengine/verl) 的用户提供了将 Trinity-RFT 与 veRL 的参数和指标对齐的方法。 + +Trinity-RFT 使用 [veRL](https://github.com/volcengine/verl) 作为训练后端(`trainer`),包括 actor、reference 和 critic 模型。Trinity-RFT 中的 `explorer` 模块基于 [vllm](https://github.com/vllm-project/vllm) 实现,取代了 veRL 原生的 rollout 引擎。此外,Trinity-RFT 引入了新模块 `buffer` 来增强 RFT 的全生命周期数据管理,可以理解为对 veRL 的 RL dataset 和 DataProto 的进一步强化。 + +## 参数映射 + +veRL 中的核心参数分为以下几类:`algorithm`、`data`、`actor_rollout_ref`、`critic`、`reward_model` 和 `trainer`。 +Trinity-RFT 根据功能将强化微调的大量参数分为几个部分,例如 `algorithm`、`model`、`buffer`、`explorer`、`trainer`、`monitor`、`synchronizer` 和 `cluster`。 + +大致来说,veRL 中的参数可以按照下面的方式映射到 Trinity-RFT 中: + +| 配置 | veRL | Trinity-RFT | +|:----------|:-----|:-----| +| 算法,例如 Advantage 函数 | `algorithm` | `algorithm` | +| 训练和评估任务集 | `data` | `buffer.explorer_input` | +| 批次大小(💡 稍后说明) | `data.train_batch_size` 和 `actor_rollout_ref.actor.ppo_mini_batch_size` | `buffer.batch_size` 和 `buffer.train_batch_size` | +| Actor | `actor_rollout_ref.actor` | `model` 和 `trainer` | +| Rollout | `actor_rollout_ref.rollout` | `explorer.rollout_model` | +| Critic | `critic` | `trainer.trainer_config.critic` | +| 奖励模型 | `reward_model` | `explorer.auxiliary_models` | +| 一些全局配置 | `trainer` | `monitor`、`synchronizer`、`cluster` 等 | + + +在以下内容中,我们将展示如何将 veRL 中的参数映射到 Trinity-RFT 中的参数。有关 Trinity-RFT 的详细参数配置,请参考[文档](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/trinity_configs.html)。 + + +```{note} +为了匹配 veRL 的默认训练设置,我们在 Trinity-RFT 中设置 `synchronizer.sync_style=fixed` 和 `synchronizer.sync_offset=0`。 +``` + +### Algorithm + +| veRL | Trinity-RFT | 说明 | +|:-----|:-----|:-----| +| `algorithm.adv_estimator` | `algorithm.advantage_fn` | 通过 `algorithm.advantage_fn_args` 传递参数 | +| `algorithm.gamma` | `algorithm.advantage_fn_args.gamma` | 与 `algorithm.advantage_fn: ppo/reinforceplusplus` 一起使用 | +| `algorithm.lam` | `algorithm.advantage_fn_args.lam` | 与 `algorithm.advantage_fn: ppo` 一起使用 | +| `algorithm.use_kl_in_reward` | `algorithm.kl_penalty_fn` | 通过设置 `algorithm.kl_penalty_fn=none` 禁用奖励中的 KL | +| `algorithm.kl_penalty` | `algorithm.kl_penalty_fn` | 从 `k2`、`low_var_kl` 等中选择 | +| `algorithm.kl_ctrl.kl_coef` | `algorithm.kl_penalty_fn_args.kl_coef` | - | + +💡 详细说明: + +* 在使用优势函数或策略损失函数的参数(例如 `algorithm.advantage_fn_args`)之前,建议检查源代码以确保这些参数能够被相应函数正确处理。 + + +### Data + +| veRL | Trinity-RFT | 说明 | +|:-----|:-----|:-----| +| `data.train_files` | `buffer.explorer_input.taskset.path` 或 `buffer.explorer_input.tasksets[i].path` | - | +| `data.val_files` | `buffer.explorer_input.eval_tasksets[i].path` | - | +| `data.prompt_key` | `buffer.explorer_input.taskset.format.prompt_key`| Taskset-specific | +| `data.response_key` | `buffer.explorer_input.taskset.format.response_key`| Taskset-specific | +| `data.train_batch_size` | `buffer.batch_size` * `synchronizer.sync_interval` | 要探索的任务数量 | +| `data.val_batch_size` | `buffer.batch_size` | 在 veRL 中已弃用 | +| `data.max_prompt_length` | `model.max_prompt_tokens` | - | +| `data.max_response_length` | `model.max_response_tokens` | - | +| `data.filter_overlong_prompts` | `model.enable_prompt_truncation` | 稍后说明 | +| `data.truncation` | - | 等同于 `right` | +| `data.shuffle` | 
`buffer.explorer_input.taskset.task_selector.selector_type:random` | Taskset-specific | + +💡 详细说明: + +* 注释 `taskset-specific` 意味着您可以在 `buffer.explorer_input.tasksets[i]` 或 `buffer.explorer_input.eval_tasksets[i]` 中为每个训练或评估任务设置不同的参数。 + +* 对于与 `batch size` 相关的参数,Trinity-RFT 使用 `buffer.batch_size` 来控制每个探索步骤中要探索的任务数量,使用 `buffer.train_batch_size` 来控制每个梯度下降步骤中使用的任务数量。在大多数情况下,控制以下参数可以确保与 veRL 相同的效果: + - Trinity-RFT 中的 `buffer.batch_size` = veRL 中的 `actor_rollout_ref.actor.ppo_mini_batch_size` + - Trinity-RFT 中的 `buffer.train_batch_size`(自动)= veRL 中的 `actor_rollout_ref.rollout.n` * `actor_rollout_ref.actor.ppo_mini_batch_size` + - Trinity-RFT 中的 `synchronizer.sync_interval` = veRL 中的 `data.train_batch_size` / `actor_rollout_ref.actor.ppo_mini_batch_size` + - 不要设置 `ppo_mini_batch_size`,它会自动设置以匹配 veRL 的效果,尽管值可能不同。 + +* 如果您想过滤过长的提示,可以在 Trinity-RFT 中设置 `model.enable_prompt_truncation=True`。在这种情况下,相应的经验将不计入损失计算,因此 `truncation` 的方向不再重要。 + + +### Actor、Rollout 和 Critic + +本节包括 actor 和 rollout 的参数。为了便于理解,您可以将 veRL 中的 actor(`actor_rollout_ref.actor`)视为 Trinity-RFT 中的 trainer(`trainer`),将 rollout(`actor_rollout_ref.rollout`)视为 explorer(`explorer.rollout_model`)。 + +```{note} +Trinity-RFT 中 `actor_rollout_ref.rollout` 的任何参数都无效;请在其他字段中正确设置它们。 +``` + +对于 veRL 的高级训练配置,您可以在 `trainer.trainer_config` 字段中设置这些参数。例如,veRL 中的 `actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu` 等同于 Trinity-RFT 中的 `trainer.trainer_config.actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu`。如果您想在 `trainer.trainer_config` 字典中设置参数,请仔细阅读 `trinity/common/verl_config.py` 中的源代码! + + +| veRL | Trinity-RFT | 说明 | +|:-----|:-----|:-----| +| `actor_rollout_ref.model.path` | `model.model_path` | - | +| `actor_rollout_ref.actor.optim` | `algorithm.optimizer` | 例如 `lr` 和 `weight_decay` | +| `actor_rollout_ref.rollout.n` | `algorithm.repeat_times` | Eval taskset-specific:`eval_tasksets[i].repeat_times` | +| `actor_rollout_ref.actor.ppo_mini_batch_size` | `buffer.batch_size` | 每个探索步骤中要探索的任务数量 | +| `actor_rollout_ref.actor.use_dynamic_bsz` | `trainer.use_dynamic_bsz` | - | +| `actor_rollout_ref.actor.ppo_max_token_len_per_gpu` | `trainer.max_token_len_per_gpu` | - | +| `actor_rollout_ref.actor.ulysses_sequence_parallel_size` | `trainer.ulysses_sequence_parallel_size` | actor 的序列并行大小 | +| `actor_rollout_ref.actor.grad_clip` | `trainer.grad_clip` | actor 的梯度裁剪值 | +| `actor_rollout_ref.actor.use_kl_loss` | `algorithm.kl_loss_fn` | 如果设置为 `none`,将不计算 KL 散度损失 | +| `actor_rollout_ref.rollout.gpu_memory_utilization` | `explorer.rollout_model.gpu_memory_utilization` | - | +| `actor_rollout_ref.rollout.temperature` | `model.temperature` | 可以是taskset-specific,例如 `buffer.explorer_input.taskset.rollout_args.temperature` | +| `actor_rollout_ref.rollout.top_p` | `model.top_p` | 可以是taskset-specific | +| `actor_rollout_ref.rollout.top_k` | `model.top_k` | 可以是taskset-specific | +| `actor_rollout_ref.rollout.tensor_model_parallel_size` | `explorer.rollout_model.tensor_parallel_size` | - | +| `actor_rollout_ref.rollout.val_kwargs` | `buffer.explorer_input.eval_tasksets[i]` | Taskset-specific | +| `critic.model.path` | `model.critic_model_path` | 默认为 `model.model_path` | + +💡 详细说明: + +* 注释 `可以是taskset-specific`(以 `temperature` 为例)意味着您可以为所有任务集设置 `model.temperature`,或者在 `buffer.explorer_input.taskset.rollout_args.temperature` 或 `buffer.explorer_input.eval_tasksets[i].rollout_args.temperature` 中为每个任务设置不同的值。具体示例如下: +```yaml +buffer: + explorer_input: + eval_tasksets: + - name: AIME2024 + storage_type: file + path: HuggingFaceH4/aime_2024 + split: 'train' + repeat_times: 32 + 
format: + prompt_key: 'question' + response_key: 'answer' + rollout_args: + temperature: 1.0 + top_p: 0.7 +``` + +### Reward Model + +Trinity-RFT 支持针对任务集定制的奖励函数以及奖励模型。对于自定义奖励函数,你可以通过设置 `buffer.explorer_input.default_reward_fn_type` 来选择对应的奖励函数;另外您可以设置 `explorer.auxiliary_models` 作为 reward model 并在工作流中使用它们。例如, +```yaml +buffer: + explorer_input: + default_reward_fn_type: 'custom_reward' +explorer: + auxiliary_models: + - model_path: Qwen/Qwen3-30B-A3B-Instruct-2507 + engine_num: 1 + tensor_parallel_size: 2 + enable_thinking: false + max_prompt_tokens: 19456 + max_response_tokens: 1024 + max_model_len: 20480 +``` +请参考使用 LLM-as-a-judge 的[配置](https://github.com/modelscope/Trinity-RFT/blob/main/examples/grpo_rubric_as_reward/rubric.yaml)和[工作流](https://github.com/modelscope/Trinity-RFT/blob/main/trinity/common/workflows/rubric_judge_workflow.py)了解更多详情。 + + +### Trainer + +| veRL | Trinity-RFT | 说明 | +|:-----|:-----|:-----| +| `trainer.logger` | `monitor.monitor_type` | 支持选择的类型和(无需设置)`console` | +| `trainer.project_name` | `project` | - | +| `trainer.experiment_name` | `name` | - | +| `trainer.default_local_dir` | `checkpoint_root_dir` | 检查点保存在 `///` | +| `trainer.n_gpus_per_node` | `cluster.gpu_per_node` | - | +| `trainer.nnodes` | `cluster.node_num` | - | +| `trainer.save_freq` | `trainer.save_interval` | - | +| `trainer.test_freq` | `explorer.eval_interval` | - | +| `trainer.total_epochs` | `buffer.total_epochs` | - | +| `trainer.total_training_steps` | `buffer.total_steps` 和 `trainer.total_steps` | 如果不为 None,将忽略 `buffer.total_epochs` | +| `trainer.critic_warmup` | `trainer.trainer_config.trainer.critic_warmup` | - | +| `trainer.val_before_train` | `explorer.eval_on_startup` | - | +| `trainer.resume_mode` | `continue_from_checkpoint` | 稍后说明 | +| `trainer.resume_from_path` | - | 稍后说明 | + +💡 详细说明: + +* 如果您想从检查点恢复训练,可以将 `continue_from_checkpoint` 设置为 `True`,训练将从检查点路径 `///` 中的最新检查点开始(如果有的话)。 + + +## GPU 资源分配 + +在 Trinity-RFT 中,GPU 资源需要手动分配给 `explorer`、`auxiliary models`(如果有)和 `trainer`。 + +* 总共有 `cluster.node_num` 个节点,每个节点有 `cluster.gpu_per_node` 个 GPU。 +* `explorer` 使用的 GPU 数量为 `explorer.rollout_model.engine_num` * `explorer.rollout_model.tensor_parallel_size`。 +* 辅助模型的 GPU 数量为 `explorer.auxiliary_models[i].engine_num` * `explorer.auxiliary_models[i].tensor_parallel_size`。 +* 剩余的 GPU 用于 `trainer`。 + + +## 指标映射 + +### 为什么每个实验会看到两个运行记录? + +在 Trinity-RFT 中,explorer 负责 rollout 过程,而 trainer 负责训练过程。这两个过程的指标是独立计算的,并作为单独的运行上传到 monitor。这就是为什么您会看到每个实验会对应两个“run”,通过 "_explorer" 或 "_trainer" 后缀来区分。 + + +### 为什么某些指标与 veRL 不同? 
+ +Trinity-RFT 使用 [vllm](https://github.com/vllm-project/vllm) 作为 rollout 引擎,使用 veRL 作为训练后端。由于这些框架之间的精度差异,在给定 token 上计算的对数概率可能不同。因此,某些指标(例如 `actor/ppo_kl` 和 `actor/pg_clipfrac`)可能与 veRL 中观察到的不同。但是,当使用与 veRL 相同的参数时,这些差异预计会很小。 + + +## 示例:PPO 训练 + +我们将一个 PPO 训练示例 `run_qwen2-7b_rm.sh` 从 veRL 的配置转换为 Trinity-RFT 的配置。 + +veRL 的配置如下: +```bash +gsm8k_train_path=$HOME/data/gsm8k/train.parquet +gsm8k_test_path=$HOME/data/gsm8k/test.parquet +math_train_path=$HOME/data/math/train.parquet +math_test_path=$HOME/data/math/test.parquet + +train_files="['$gsm8k_train_path', '$math_train_path']" +test_files="['$gsm8k_test_path', '$math_test_path']" + +# prepare model ckpt +huggingface-cli download Qwen/Qwen2-7B-Instruct --local-dir $HOME/models/Qwen2-7B-Instruct & +huggingface-cli download sfairXC/FsfairX-LLaMA3-RM-v0.1 --local-dir $HOME/models/FsfairX-LLaMA3-RM-v0.1 & +wait + +python3 -m verl.trainer.main_ppo \ + algorithm.adv_estimator=gae \ + data.train_files="$train_files" \ + data.val_files="$test_files" \ + data.train_batch_size=1024 \ + data.max_prompt_length=1024 \ + data.max_response_length=512 \ + data.filter_overlong_prompts=True \ + data.truncation='error' \ + data.return_raw_chat=True \ + actor_rollout_ref.model.path="$HOME/models/Qwen2-7B-Instruct" \ + actor_rollout_ref.actor.optim.lr=1e-6 \ + actor_rollout_ref.model.use_remove_padding=True \ + actor_rollout_ref.actor.optim.lr_warmup_steps_ratio=0.1 \ + actor_rollout_ref.actor.ppo_mini_batch_size=256 \ + actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=16 \ + actor_rollout_ref.actor.use_kl_loss=False \ + actor_rollout_ref.model.enable_gradient_checkpointing=True \ + actor_rollout_ref.actor.fsdp_config.param_offload=False \ + actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \ + actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=16 \ + actor_rollout_ref.rollout.tensor_model_parallel_size=1 \ + actor_rollout_ref.rollout.name=vllm \ + actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \ + critic.optim.lr=1e-5 \ + critic.model.use_remove_padding=True \ + critic.optim.lr_warmup_steps_ratio=0.05 \ + critic.model.path="$HOME/models/Qwen2-7B-Instruct" \ + critic.model.enable_gradient_checkpointing=True \ + critic.ppo_micro_batch_size_per_gpu=32 \ + critic.model.fsdp_config.param_offload=False \ + critic.model.fsdp_config.optimizer_offload=False \ + reward_model.enable=True \ + reward_model.model.path="$HOME/models/FsfairX-LLaMA3-RM-v0.1" \ + reward_model.model.use_remove_padding=True \ + reward_model.model.fsdp_config.param_offload=True \ + reward_model.micro_batch_size_per_gpu=32 \ + algorithm.use_kl_in_reward=False \ + trainer.critic_warmup=0 \ + trainer.logger='["console","wandb"]' \ + trainer.project_name='verl_example' \ + trainer.val_before_train=False \ + trainer.experiment_name='Qwen2-7B-Instruct_hybrid_rm' \ + trainer.n_gpus_per_node=8 \ + trainer.nnodes=1 \ + trainer.save_freq=20 \ + trainer.test_freq=5 \ + trainer.total_epochs=15 $@ +``` + +Trinity-RFT 的相应配置(ppo_example.yaml)如下: +```yaml +project: verl_example +name: Qwen2-7B-Instruct_hybrid_rm +checkpoint_root_dir: ./checkpoints +algorithm: + algorithm_type: ppo + repeat_times: 1 + optimizer: + lr: 1e-6 + lr_warmup_steps_ratio: 0.1 # actor_rollout_ref.actor.optim.lr_warmup_steps_ratio + advantage_fn: ppo # algorithm.adv_estimator=gae + kl_penalty_fn: none # algorithm.use_kl_in_reward=False + kl_loss_fn: none # actor_rollout_ref.actor.use_kl_loss=False + +model: + model_path: ${oc.env:HOME}/models/Qwen2-7B-Instruct + critic_model_path: 
${oc.env:HOME}/models/Qwen2-7B-Instruct # critic.model.path + max_prompt_tokens: 1024 # data.max_prompt_length + max_response_tokens: 512 # data.max_response_length + enable_prompt_truncation: true # data.filter_overlong_prompts=True + +cluster: + node_num: 1 # trainer.nnodes + gpu_per_node: 8 # trainer.n_gpus_per_node + +buffer: + total_epochs: 15 # trainer.total_epochs + batch_size: 256 # actor_rollout_ref.actor.ppo_mini_batch_size + train_batch_size: 256 # actor_rollout_ref.actor.ppo_mini_batch_size * actor_rollout_ref.rollout.n=256*1=256 + explorer_input: + tasksets: + - name: gsm8k + storage_type: file + path: ${oc.env:HOME}/data/gsm8k + split: train + format: + prompt_key: prompt # 检查数据集格式 + response_key: answer # 检查数据集格式 + - name: math + storage_type: file + path: ${oc.env:HOME}/data/math + split: train + format: + prompt_key: prompt # 检查数据集格式 + response_key: answer # 检查数据集格式 + rollout_args: + temperature: 1.0 + eval_tasksets: + - name: gsm8k_eval + storage_type: file + path: ${oc.env:HOME}/data/gsm8k + split: test + format: + prompt_key: prompt # 检查数据集格式 + response_key: answer # 检查数据集格式 + - name: math_eval + storage_type: file + path: ${oc.env:HOME}/data/math + split: test + format: + prompt_key: prompt # 检查数据集格式 + response_key: answer # 检查数据集格式 + +explorer: + eval_interval: 5 # trainer.test_freq + eval_on_startup: false # trainer.val_before_train=False + rollout_model: + engine_num: 2 # rollout 模型的 GPU 数量 + tensor_parallel_size: 1 # actor_rollout_ref.rollout.tensor_model_parallel_size + gpu_memory_utilization: 0.6 # actor_rollout_ref.rollout.gpu_memory_utilization + auxiliary_models: # reward_model 配置 + - model_path: ${oc.env:HOME}/models/FsfairX-LLaMA3-RM-v0.1 + engine_num: 2 # 奖励模型的 GPU 数量 + tensor_parallel_size: 1 + +synchronizer: + sync_style: fixed + sync_offset: 1 + sync_interval: 4 # sync_interval = data.train_batch_size / actor_rollout_ref.actor.ppo_mini_batch_size + sync_timeout: 1200 + +trainer: + save_interval: 20 # trainer.save_freq + trainer_config: + actor_rollout_ref: + model: + use_remove_padding: true # actor_rollout_ref.model.use_remove_padding + enable_gradient_checkpointing: true # actor_rollout_ref.model.enable_gradient_checkpointing + actor: + ppo_micro_batch_size_per_gpu: 16 # actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu + fsdp_config: + param_offload: false # actor_rollout_ref.actor.fsdp_config.param_offload + optimizer_offload: false # actor_rollout_ref.actor.fsdp_config.optimizer_offload + rollout: + log_prob_micro_batch_size_per_gpu: 16 # actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu + critic: + model: + use_remove_padding: true # critic.model.use_remove_padding + enable_gradient_checkpointing: true # critic.model.enable_gradient_checkpointing + fsdp_config: + param_offload: false # critic.model.fsdp_config.param_offload + optimizer_offload: false # critic.model.fsdp_config.optimizer_offload + optim: + lr: 1e-5 # critic.optim.lr + lr_warmup_steps_ratio: 0.05 # critic.optim.lr_warmup_steps_ratio + ppo_micro_batch_size_per_gpu: 32 # critic.ppo_micro_batch_size_per_gpu + trainer: + critic_warmup: 0 # trainer.critic_warmup + +monitor: + monitor_type: wandb # trainer.logger='["console","wandb"]' - wandb 是设定值,console 是默认值 +``` + +运行命令为: +```bash +trinity run --config ppo_example.yaml +``` + +## 示例:GRPO 训练 + +我们将一个 GRPO 训练示例 `run_deepseek7b_llm_seq_balance.sh` 从 veRL 的配置转换为 Trinity-RFT 的配置。 + +veRL 的配置如下: +```bash +set -x + +python3 -m verl.trainer.main_ppo \ + algorithm.adv_estimator=grpo \ + 
data.train_files=$HOME/data/gsm8k/train.parquet \ + data.val_files=$HOME/data/gsm8k/test.parquet \ + data.train_batch_size=1024 \ + data.max_prompt_length=512 \ + data.max_response_length=512 \ + data.filter_overlong_prompts=True \ + data.truncation='error' \ + actor_rollout_ref.model.path=deepseek-ai/deepseek-llm-7b-chat \ + actor_rollout_ref.actor.optim.lr=1e-6 \ + actor_rollout_ref.model.use_remove_padding=True \ + actor_rollout_ref.actor.ppo_mini_batch_size=256 \ + actor_rollout_ref.actor.use_dynamic_bsz=True \ + actor_rollout_ref.actor.ppo_max_token_len_per_gpu=24000 \ + actor_rollout_ref.actor.use_kl_loss=True \ + actor_rollout_ref.actor.kl_loss_coef=0.001 \ + actor_rollout_ref.actor.kl_loss_type=low_var_kl \ + actor_rollout_ref.actor.entropy_coeff=0 \ + actor_rollout_ref.model.enable_gradient_checkpointing=True \ + actor_rollout_ref.actor.fsdp_config.param_offload=False \ + actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \ + actor_rollout_ref.rollout.tensor_model_parallel_size=2 \ + actor_rollout_ref.rollout.name=vllm \ + actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \ + actor_rollout_ref.rollout.n=8 \ + actor_rollout_ref.ref.fsdp_config.param_offload=True \ + algorithm.use_kl_in_reward=False \ + trainer.critic_warmup=0 \ + trainer.logger='["console","wandb"]' \ + trainer.project_name='verl_grpo_example_gsm8k' \ + trainer.experiment_name='deepseek_llm_7b_function_rm_seq_packing' \ + trainer.n_gpus_per_node=8 \ + trainer.nnodes=1 \ + trainer.save_freq=20 \ + trainer.test_freq=5 \ + trainer.total_epochs=15 $@ +``` + +Trinity-RFT 的相应配置(grpo_example.yaml)如下: +```yaml +project: verl_grpo_example_gsm8k +name: deepseek_llm_7b_function_rm_seq_packing +checkpoint_root_dir: ./checkpoints +algorithm: + algorithm_type: grpo + repeat_times: 8 # actor_rollout_ref.rollout.n=8 + optimizer: + lr: 1e-6 # actor_rollout_ref.actor.optim.lr + advantage_fn: grpo # algorithm.adv_estimator=grpo + kl_penalty_fn: none # algorithm.use_kl_in_reward=False + kl_loss_fn: low_var_kl # actor_rollout_ref.actor.kl_loss_type=low_var_kl + kl_loss_fn_args: + kl_coef: 0.001 # actor_rollout_ref.actor.kl_loss_coef + entropy_loss_fn_args: + entropy_coef: 0 # actor_rollout_ref.actor.entropy_coeff=0 + +model: + model_path: deepseek-ai/deepseek-llm-7b-chat # actor_rollout_ref.model.path + max_prompt_tokens: 512 # data.max_prompt_length + max_response_tokens: 512 # data.max_response_length + enable_prompt_truncation: true # data.filter_overlong_prompts=True + +cluster: + node_num: 1 # trainer.nnodes + gpu_per_node: 8 # trainer.n_gpus_per_node + +buffer: + total_epochs: 15 # trainer.total_epochs + batch_size: 256 # actor_rollout_ref.actor.ppo_mini_batch_size + train_batch_size: 2048 # actor_rollout_ref.actor.ppo_mini_batch_size * actor_rollout_ref.rollout.n=256*8=2048 + explorer_input: + tasksets: + - name: gsm8k + storage_type: file + path: ${oc.env:HOME}/data/gsm8k + split: train + format: + prompt_key: prompt # 检查数据集格式 + response_key: answer # 检查数据集格式 + eval_tasksets: + - name: gsm8k_eval + storage_type: file + path: ${oc.env:HOME}/data/gsm8k + split: test + format: + prompt_key: prompt # 检查数据集格式 + response_key: answer # 检查数据集格式 + +explorer: + eval_interval: 5 # trainer.test_freq + rollout_model: + engine_num: 1 + tensor_parallel_size: 2 # actor_rollout_ref.rollout.tensor_model_parallel_size + gpu_memory_utilization: 0.6 # actor_rollout_ref.rollout.gpu_memory_utilization + +synchronizer: + sync_style: fixed + sync_offset: 1 + sync_interval: 4 # veRL 中的 data.train_batch_size / 
actor_rollout_ref.actor.ppo_mini_batch_size + sync_timeout: 1200 + +trainer: + save_interval: 20 # trainer.save_freq + use_dynamic_bsz: true # actor_rollout_ref.actor.use_dynamic_bsz=True + max_token_len_per_gpu: 24000 # actor_rollout_ref.actor.ppo_max_token_len_per_gpu + trainer_config: + actor_rollout_ref: + model: + use_remove_padding: true # actor_rollout_ref.model.use_remove_padding=True + enable_gradient_checkpointing: true # actor_rollout_ref.model.enable_gradient_checkpointing=True + actor: + fsdp_config: + param_offload: false # actor_rollout_ref.actor.fsdp_config.param_offload=False + optimizer_offload: false # actor_rollout_ref.actor.fsdp_config.optimizer_offload=False + ref: + fsdp_config: + param_offload: true # actor_rollout_ref.ref.fsdp_config.param_offload=True + trainer: + critic_warmup: 0 # trainer.critic_warmup=0 + +monitor: + monitor_type: wandb # trainer.logger='["console","wandb"]' - wandb 是设定值,console 是默认值 +``` + +运行命令为: +```bash +trinity run --config grpo_example.yaml +```