
Commit 0ed548e

Merge verl-related config into default config (#256)
1 parent f79e203 commit 0ed548e


104 files changed (+671, -1739 lines)


benchmark/config/countdown-template.yaml

Lines changed: 1 addition & 2 deletions
@@ -47,11 +47,10 @@ buffer:
   decay: 0.1
   sft_warmup_steps: 0
 explorer:
-  runner_num: 32
+  runner_per_model: 8
   max_timeout: 900
   max_retry_times: 2
   rollout_model:
-    engine_type: vllm_async
     engine_num: 2
     tensor_parallel_size: 1
     use_v1: true

benchmark/config/gsm8k-template.yaml

Lines changed: 0 additions & 1 deletion
@@ -56,7 +56,6 @@ explorer:
   max_timeout: 900
   max_retry_times: 2
   rollout_model:
-    engine_type: vllm_async
     engine_num: 2
     tensor_parallel_size: 1
     use_v1: true

docs/sphinx_doc/source/tutorial/example_async_mode.md

Lines changed: 14 additions & 3 deletions
@@ -52,8 +52,6 @@ explorer:
 synchronizer:
   sync_method: 'checkpoint'
   sync_interval: 10
-trainer:
-  trainer_config_path: examples/async_gsm8k/verl_config.yaml
 ```

 Key configurations in `trainer.yaml` are as follows:
@@ -95,7 +93,20 @@ synchronizer:
   sync_method: 'checkpoint'
   sync_interval: 10
 trainer:
-  trainer_config_path: examples/async_gsm8k/verl_config.yaml
+  trainer_config:
+    actor_rollout_ref:
+      model:
+        use_remove_padding: true
+      actor:
+        use_dynamic_bsz: true
+        ppo_max_token_len_per_gpu: 16384
+        ulysses_sequence_parallel_size: 1
+        optim:
+          lr: 1e-6
+      ref:
+        log_prob_use_dynamic_bsz: ${trainer.trainer_config.actor_rollout_ref.actor.use_dynamic_bsz}
+        log_prob_max_token_len_per_gpu: ${trainer.trainer_config.actor_rollout_ref.actor.ppo_max_token_len_per_gpu}
+        ulysses_sequence_parallel_size: ${trainer.trainer_config.actor_rollout_ref.actor.ulysses_sequence_parallel_size} # sp size
 ```

 You can run this example with the following command:

docs/sphinx_doc/source/tutorial/example_dpo.md

Lines changed: 5 additions & 3 deletions
@@ -50,7 +50,7 @@ For SFT, we download the `open-r1/Mixture-of-Thoughts` dataset to the local dire

 ### Configuration for DPO

-We use the configurations in [`dpo.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/dpo_humanlike/dpo.yaml) and [`train_dpo.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/dpo_humanlike/train_dpo.yaml) for this experiment. Some important setups are listed in the following:
+We use the configurations in [`dpo.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/dpo_humanlike/dpo.yaml) for this experiment. Some important setups are listed in the following:

 We run the experiment in train mode, as there is no Explorer. To enable this mode, we set `mode` to `train` and pass the data path to the trainer.

@@ -83,8 +83,9 @@ buffer:
       chosen_key: chosen
       rejected_key: rejected
 trainer:
-  trainer_config_path: 'examples/dpo_humanlike/train_dpo.yaml'
   save_interval: 30
+  trainer_config:
+    ... # omitted here for simplicity
 ```

 `buffer.trainer_input.experience_buffer` specifies the dataset to be used for training, including its name, storage type, path, and format.
@@ -129,8 +130,9 @@ buffer:
       prompt_type: messages
       messages_key: messages
 trainer:
-  trainer_config_path: /PATH/TO/TRAIN_CONFIG_YAML/
   save_interval: 50
+  trainer_config:
+    ... # omitted here for simplicity
 ```

 Here we set `buffer.trainer_input.experience_buffer.format.prompt_type` to `messages` because the source data is in message format. We also set `buffer.trainer_input.experience_buffer.format.messages_key` to `messages` to specify the key in the dataset that contains the messages.

docs/sphinx_doc/source/tutorial/example_mix_algo.md

Lines changed: 1 addition & 1 deletion
@@ -254,7 +254,7 @@ class MIXPolicyLossFn(PolicyLossFn):

 With the above newly-defined classes and functions, we can run the experiments without modifying other processes.
 An example showing some important configurations is shown below, including the weighting factor $\mu$ as `algorithm.policy_loss_fn_args['mu']` and the batch size of expert experiences $B'$, calculated as the product of `buffer.batch_size`, `algorithm.sample_strategy_args['expert_data_ratio']` and `algorithm.repeat_times`.
-For the full configuration, please refer to [`mix_math.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/mix_math/mix_math.yaml) and [`train_mix_math.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/mix_math/train_mix_math.yaml).
+For the full configuration, please refer to [`mix_math.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/mix_math/mix_math.yaml).

 ```yaml
 algorithm:
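As a quick numeric illustration of the product that determines $B'$ in the hunk above (the specific values below are hypothetical, chosen only to make the arithmetic concrete):

```yaml
buffer:
  batch_size: 96
algorithm:
  repeat_times: 8
  sample_strategy_args:
    expert_data_ratio: 0.25
# B' = 96 * 0.25 * 8 = 192 expert experiences per training step
```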

docs/sphinx_doc/source/tutorial/example_reasoning_advanced.md

Lines changed: 1 addition & 2 deletions
@@ -12,15 +12,14 @@ Let's continue with the [previous GSM8k example](./example_reasoning_basic.md) a

 As an experimental feature of Trinity-RFT, we develop an embarrassingly simple off-policy RL algorithm, termed OPMD (Online Policy Mirror Descent, inspired by [Kimi k1.5](https://arxiv.org/abs/2501.12599)).
 The algorithm design and analysis can be found in this [technical report](../../assets/opmd.pdf).
-The config files are [`opmd_gsm8k.yaml`](https://github.com/modelscope/Trinity-RFT/blob/main/examples/opmd_gsm8k/opmd_gsm8k.yaml) and [`train_opmd_gsm8k.yaml`](https://github.com/modelscope/Trinity-RFT/blob/main/examples/opmd_gsm8k/train_opmd_gsm8k.yaml).
+The config file is [`opmd_gsm8k.yaml`](https://github.com/modelscope/Trinity-RFT/blob/main/examples/opmd_gsm8k/opmd_gsm8k.yaml).

 To try out the OPMD algorithm:
 ```shell
 trinity run --config examples/opmd_gsm8k/opmd_gsm8k.yaml
 ```

 Note that in this config file, `sync_interval` is set to 10, i.e., the model weights of the explorer and trainer are synchronized only once every 10 training steps, which leads to a challenging off-policy scenario (potentially with abrupt distribution shift during the RFT process).
-Other configurations of particular interest are explained at the beginning of [`train_opmd_gsm8k.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/opmd_gsm8k/train_opmd_gsm8k.yaml).


docs/sphinx_doc/source/tutorial/example_reasoning_basic.md

Lines changed: 6 additions & 3 deletions
@@ -100,7 +100,7 @@ We run the experiment in a synchronous mode where the Explorer and Trainer opera

 ### Use GRPO Algorithm

-We use the configurations in [`gsm8k.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k/gsm8k.yaml) and [`train_gsm8k.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k/train_gsm8k.yaml) for this experiment. Some important setups of `gsm8k.yaml` are listed in the following:
+We use the configurations in [`gsm8k.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k/gsm8k.yaml) for this experiment. Some important setups of `gsm8k.yaml` are listed in the following:


 ```yaml
@@ -155,9 +155,12 @@ synchronizer:
   sync_method: 'nccl'
   sync_interval: 1
 trainer:
-  trainer_config_path: 'examples/grpo_gsm8k/train_gsm8k.yaml'
   save_interval: 100
-
+  trainer_config:
+    actor_rollout_ref:
+      actor:
+        optim:
+          lr: 1e-5
 ```


docs/sphinx_doc/source/tutorial/example_search_email.md

Lines changed: 1 addition & 1 deletion
@@ -34,7 +34,7 @@ If you want to choose a new database path, you can modify the `DEFAULT_DB_PATH`

 ### Step 2: Run the Workflow

-The config files are located in [`email_search.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_email_search/email_search.yaml) and [`train_email_search.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_email_search/train_email_search.yaml).
+The config file is located in [`email_search.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_email_search/email_search.yaml).
 To run this example, you can run the following command:

 ```bash

docs/sphinx_doc/source/tutorial/example_step_wise.md

Lines changed: 14 additions & 2 deletions
@@ -148,9 +148,21 @@ synchronizer:
   sync_interval: 2
   sync_timeout: 3600
 trainer:
-  trainer_type: 'verl'
-  trainer_config_path: 'examples/grpo_alfworld_general_multi_step/train_alfworld.yaml'
   save_interval: 50
+  trainer_config:
+    actor_rollout_ref:
+      model:
+        use_remove_padding: true
+      actor:
+        use_dynamic_bsz: true
+        ppo_max_token_len_per_gpu: 16384
+        ulysses_sequence_parallel_size: 1
+        optim:
+          lr: 5e-6
+      ref:
+        log_prob_use_dynamic_bsz: ${trainer.trainer_config.actor_rollout_ref.actor.use_dynamic_bsz}
+        log_prob_max_token_len_per_gpu: ${trainer.trainer_config.actor_rollout_ref.actor.ppo_max_token_len_per_gpu}
+        ulysses_sequence_parallel_size: ${trainer.trainer_config.actor_rollout_ref.actor.ulysses_sequence_parallel_size} # sp size
 ```


docs/sphinx_doc/source/tutorial/faq.md

Lines changed: 7 additions & 9 deletions
@@ -1,23 +1,21 @@
 # FAQ

 ## Part 1: Configurations
-**Q:** Why do most examples have two configuration YAML files, e.g., `gsm8k.yaml` and `train_gsm8k.yaml` in the `examples/grpo_gsm8k` directory?
+**Q:** How do I configure the parameters?

-**A:** Trinity-RFT uses [veRL](https://github.com/volcengine/verl) as the training backend, and the auxiliary YAML file starting with `train_` is used for configuring veRL, referred to [veRL documentation](https://verl.readthedocs.io/en/latest/examples/config.html).
-If you specify the path to `train_gsm8k.yaml` in `trainer.trainer_config_path`, Trinity-RFT will automatically pass the parameters to veRL.
+**A:** You can use the config manager to configure the parameters by running `trinity studio --port 8080`. This approach provides a convenient way to configure the parameters.

-We provide an alternative way to configure the veRL trainer. You may also directly specify the parameters in the `trainer.trainer_config` dictionary. This approach is mutually exclusive with using `trainer.trainer_config_path`.
-
-Note that some parameters are not listed in the auxiliary configuration file (e.g., `train_gsm8k.yaml`), as they will be overridden by the parameters in the trinity configuration file (e.g., `gsm8k.yaml`). Please refer to `./trinity_configs.md` for more details.
-For users' convenience, future versions will gradually reduce parameters in `trainer.trainer_config` and `trainer.trainer_config_path` until it's fully deprecated.
+Advanced users can also edit the config file directly.
+Trinity-RFT uses [veRL](https://github.com/volcengine/verl) as the training backend, which can have a massive number of parameters; refer to the [veRL documentation](https://verl.readthedocs.io/en/latest/examples/config.html). You may specify these parameters in two ways: (1) specify them directly in the `trainer.trainer_config` dictionary; (2) specify them in an auxiliary YAML file starting with `train_` and pass its path (e.g., to `train_gsm8k.yaml`) in `trainer.trainer_config_path`. These two ways are mutually exclusive.

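For illustration, here is a minimal sketch of the two mutually exclusive options described above; the concrete values (`save_interval: 100`, `lr: 1e-5`, the `train_gsm8k.yaml` path) are borrowed from the GRPO GSM8k example elsewhere in this commit and are examples, not defaults:

```yaml
# Option 1: specify veRL parameters inline under trainer.trainer_config
trainer:
  save_interval: 100
  trainer_config:
    actor_rollout_ref:
      actor:
        optim:
          lr: 1e-5  # example value

# Option 2: keep veRL parameters in an auxiliary YAML file instead
# (mutually exclusive with trainer_config above)
# trainer:
#   save_interval: 100
#   trainer_config_path: 'examples/grpo_gsm8k/train_gsm8k.yaml'
```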
---
1512

16-
**Q:** What's the relationship between `buffer.batch_size`, `actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu` and other batch sizes?
13+
**Q:** What's the relationship between `buffer.batch_size`, `buffer.train_batch_size`, `actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu` and other batch sizes?
1714

1815
**A:** The following parameters are closely related:
1916

20-
- `buffer.batch_size`: The number of tasks in a batch, effective for both the explorer and the trainer.
17+
- `buffer.batch_size`: The number of tasks in a batch, effective for the explorer.
18+
- `buffer.train_batch_size`: The number of experiences in a mini-batch, effective for the trainer. If not specified, it defaults to `buffer.batch_size` * `algorithm.repeat_times`.
2119
- `actor_rollout_ref.actor.ppo_mini_batch_size`: The number of experiences in a mini-batch, overridden by `buffer.train_batch_size`; but in the `update_policy` function, its value becomes the number of experiences in a mini-batch per GPU, i.e., `buffer.train_batch_size (/ ngpus_trainer)`. The expression of dividing `ngpus_trainer` is caused by implict data allocation to GPUs, but this do not affects the result after gradient accumulation.
2220
- `actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu`: The number of experiences in a micro-batch per GPU.
2321
