From 0b5df71cbab2fcf46aad5d728be97c9abab535d4 Mon Sep 17 00:00:00 2001
From: hiyuchang
Date: Tue, 1 Jul 2025 19:26:19 +0800
Subject: [PATCH 1/9] add faq

---
 docs/sphinx_doc/source/index.rst              |   6 +
 docs/sphinx_doc/source/tutorial/faq.md        | 160 ++++++++++++++++++
 .../source/tutorial/trinity_configs.md        |   2 +-
 3 files changed, 167 insertions(+), 1 deletion(-)
 create mode 100644 docs/sphinx_doc/source/tutorial/faq.md

diff --git a/docs/sphinx_doc/source/index.rst b/docs/sphinx_doc/source/index.rst
index 4b4cab2aa9..fc085215b0 100644
--- a/docs/sphinx_doc/source/index.rst
+++ b/docs/sphinx_doc/source/index.rst
@@ -33,6 +33,12 @@ Welcome to Trinity-RFT's documentation!
    tutorial/trinity_configs.md
    tutorial/example_mix_algo.md

+.. toctree::
+   :maxdepth: 2
+   :caption: FAQ
+
+   tutorial/faq.md
+
 .. toctree::
    :maxdepth: 1
    :glob:
diff --git a/docs/sphinx_doc/source/tutorial/faq.md b/docs/sphinx_doc/source/tutorial/faq.md
new file mode 100644
index 0000000000..99caafc457
--- /dev/null
+++ b/docs/sphinx_doc/source/tutorial/faq.md
@@ -0,0 +1,160 @@
+# FAQ
+
## Part 1: Configurations
**Q:** Why do most examples have two configuration YAML files, e.g., `gsm8k.yaml` and `train_gsm8k.yaml` in the `examples/grpo_gsm8k` directory?

**A:** Trinity-RFT uses [veRL](https://github.com/volcengine/verl) as the training backend, and the auxiliary YAML file starting with `train_` is used for configuring veRL, as described in the [veRL documentation](https://github.com/volcengine/verl/blob/v0.4.0/docs/examples/config.rst).
If you specify the path to `train_gsm8k.yaml` in `trainer.trainer_config_path`, Trinity-RFT will automatically pass the parameters to veRL.

We provide an alternative way to configure the veRL trainer. You may also directly specify the parameters in the `trainer.trainer_config` dictionary. This approach is mutually exclusive with using `trainer.trainer_config_path`.

Note that some parameters are not listed in the auxiliary configuration file (e.g., `train_gsm8k.yaml`), as they will be overridden by the parameters in the Trinity configuration file (e.g., `gsm8k.yaml`). Please refer to `./trinity_configs.md` for more details.
Future versions will gradually reduce parameters in `trainer.trainer_config` and `trainer.trainer_config_path` until they are fully deprecated.

---

**Q:** What's the relationship between `buffer.batch_size`, `actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu`, and other batch sizes?

**A:** The following parameters are closely related:

- `buffer.batch_size`: The number of tasks in a batch, effective for both the explorer and the trainer.
- `actor_rollout_ref.actor.ppo_mini_batch_size`: In the configuration, this value represents the number of tasks in a mini-batch, overridden by `buffer.batch_size`; but in the `update_policy` function, its value becomes the number of experiences in a mini-batch per GPU, i.e., `buffer.batch_size * algorithm.repeat_times / ngpus_trainer`.
- `actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu`: The number of experiences in a micro-batch per GPU.
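To make the arithmetic concrete, here is a small sketch with purely hypothetical values (`buffer.batch_size=32`, `algorithm.repeat_times=8`, a trainer with 4 GPUs, and `ppo_micro_batch_size_per_gpu=4`; none of these are defaults):

```python
# All values below are hypothetical, chosen only to illustrate the arithmetic.
buffer_batch_size = 32  # buffer.batch_size: tasks per batch
repeat_times = 8        # algorithm.repeat_times: rollouts generated per task
ngpus_trainer = 4       # number of GPUs used by the trainer
micro_batch_size = 4    # ppo_micro_batch_size_per_gpu: experiences per micro-batch

experiences_per_batch = buffer_batch_size * repeat_times     # 256 experiences in total
mini_batch_per_gpu = experiences_per_batch // ngpus_trainer  # 64 experiences per GPU
grad_accum_steps = mini_batch_per_gpu // micro_batch_size    # 16 micro-batches per optimizer step

print(experiences_per_batch, mini_batch_per_gpu, grad_accum_steps)  # 256 64 16
```

Under these assumptions, each optimizer step accumulates gradients over 16 micro-batches of 4 experiences on every trainer GPU.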
A minimal example showing their usage is as follows:

```python
def update_policy(batch):
    dataloader = batch.split(ppo_mini_batch_size)
    for _ in range(ppo_epochs):
        for batch_idx, data in enumerate(dataloader):
            # Split data
            mini_batch = data
            if actor_rollout_ref.actor.use_dynamic_bsz:
                micro_batches, _ = rearrange_micro_batches(
                    batch=mini_batch, max_token_len=max_token_len
                )
            else:
                micro_batches = mini_batch.split(actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu)

            # Computing gradient
            for data in micro_batches:
                entropy, log_prob = self._forward_micro_batch(
                    micro_batch=data, ...
                )
                pg_loss, pg_clipfrac, ppo_kl, pg_clipfrac_lower = compute_policy_loss(
                    log_prob=log_prob, **data
                )
                policy_loss = pg_loss
                ...
                # Scale the loss so gradients accumulated over the
                # micro-batches average to the mini-batch gradient
                loss = policy_loss / self.gradient_accumulation
                loss.backward()

            # Optimizer step
            grad_norm = self._optimizer_step()
            self.actor_optimizer.zero_grad()
```
Please refer to `trinity/trainer/verl/dp_actor.py` for the detailed implementation. veRL also provides an explanation in its [FAQ](https://verl.readthedocs.io/en/latest/faq/faq.html#what-is-the-meaning-of-train-batch-size-mini-batch-size-and-micro-batch-size).


## Part 2: Common Errors

**Error:**
```bash
File ".../flash_attn/flash_attn_interface.py", line 15, in <module>
    import flash_attn_2_cuda as flash_attn_gpu
ImportError: ...
```

**A:** The `flash-attn` module is not properly installed. Try to fix it by running `MAX_JOBS=128 pip install flash-attn`.

---

**Error:**
```bash
UsageError: api_key not configured (no-tty). call wandb.login(key=[your_api_key]) ...
```

**A:** Try to log in to WandB before running the experiment. One way to do this is to run the command `export WANDB_API_KEY=[your_api_key]`.

---

**Error:**
```bash
ValueError: Failed to look up actor with name 'explorer' ...
```

**A:** Try to restart Ray before running the experiment:

```bash
ray stop
ray start --head
```

---

**Error:** Out-of-Memory (OOM) error

**A:** The following parameters may be helpful:

- For the trainer, adjust `actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu` when `actor_rollout_ref.actor.use_dynamic_bsz=false`; adjust `actor_rollout_ref.actor.ppo_max_token_len_per_gpu` and `actor_rollout_ref.actor.ulysses_sequence_parallel_size` when `actor_rollout_ref.actor.use_dynamic_bsz=true`.
- For exploere, adjust `explorer.rollout_model.tensor_parallel_size`.


## Part 3: Debugging Methods [Coming Soon]
To see the full logs of all processes and save them to `debug.log`:
```bash
export RAY_DEDUP_LOGS=0
trinity run --config grpo_gsm8k/gsm8k.yaml 2>&1 | tee debug.log
```


## Part 4: Other Questions
**Q:** What's the purpose of `buffer.trainer_input.experience_buffer.path`?

**A:** This specifies the path to the SQLite database that stores the generated experiences, e.g., a database URL of the form `sqlite:///path/to/experiences.db` (the path here is illustrative). You may comment out this line if you don't want to use the SQLite database.
To see the experiences in the database, you can use the following Python script:

```python
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

from trinity.common.schema import ExperienceModel

# Set this to the value of `buffer.trainer_input.experience_buffer.path`;
# the URL below is illustrative
db_url = "sqlite:///path/to/experiences.db"
engine = create_engine(db_url)
Session = sessionmaker(bind=engine)
session = Session()

# Fetch the first few experiences
MAX_EXPERIENCES = 4
experiences = (
    session.query(ExperienceModel)
    .limit(MAX_EXPERIENCES)
    .all()
)

# Convert the database rows back into Experience objects
exp_list = [ExperienceModel.to_experience(exp) for exp in experiences]

# Print the experiences
for exp in exp_list:
    print(f"{exp.prompt_text=}", f"{exp.response_text=}")
```

---

**Q:** How to load the checkpoints outside of the Trinity-RFT framework?

**A:** You need to specify `model.model_path` and `checkpoint_root_dir`. The following code snippet gives an example using the `transformers` library.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from trinity.common.models.utils import load_state_dict_from_verl_checkpoint

model = AutoModelForCausalLM.from_pretrained(model.model_path)
# Assume we need the checkpoint at step 780
ckp_path = checkpoint_root_dir + "global_step_780/actor/"
model.load_state_dict(load_state_dict_from_verl_checkpoint(ckp_path))
```
diff --git a/docs/sphinx_doc/source/tutorial/trinity_configs.md b/docs/sphinx_doc/source/tutorial/trinity_configs.md
index 88d925f786..6c2497c5d5 100644
--- a/docs/sphinx_doc/source/tutorial/trinity_configs.md
+++ b/docs/sphinx_doc/source/tutorial/trinity_configs.md
@@ -399,7 +399,7 @@ data_processor:

 For advanced users working with the `verl` trainer backend. This includes fine-grained settings for actor/critic models, optimizer parameters, and training loops.

-> For full parameter meanings, refer to the [veRL documentation](https://github.com/volcengine/verl/blob/v0.3.0.post1/docs/examples/config.rst).
+> For full parameter meanings, refer to the [veRL documentation](https://verl.readthedocs.io/en/latest/examples/config.html).

 ```yaml

From 0f7c58a44672706f8f88c399f7f3ea7a4b9be883 Mon Sep 17 00:00:00 2001
From: hiyuchang
Date: Thu, 3 Jul 2025 14:17:02 +0800
Subject: [PATCH 2/9] update faq

---
 docs/sphinx_doc/source/tutorial/faq.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/docs/sphinx_doc/source/tutorial/faq.md b/docs/sphinx_doc/source/tutorial/faq.md
index 99caafc457..5f30588d93 100644
--- a/docs/sphinx_doc/source/tutorial/faq.md
+++ b/docs/sphinx_doc/source/tutorial/faq.md
@@ -3,13 +3,13 @@
 ## Part 1: Configurations
 **Q:** Why do most examples have two configuration YAML files, e.g., `gsm8k.yaml` and `train_gsm8k.yaml` in the `examples/grpo_gsm8k` directory?

-**A:** Trinity-RFT uses [veRL](https://github.com/volcengine/verl) as the training backend, and the auxiliary YAML file starting with `train_` is used for configuring veRL, as described in the [veRL documentation](https://github.com/volcengine/verl/blob/v0.4.0/docs/examples/config.rst).
+**A:** Trinity-RFT uses [veRL](https://github.com/volcengine/verl) as the training backend, and the auxiliary YAML file starting with `train_` is used for configuring veRL, as described in the [veRL documentation](https://verl.readthedocs.io/en/latest/examples/config.html).
 If you specify the path to `train_gsm8k.yaml` in `trainer.trainer_config_path`, Trinity-RFT will automatically pass the parameters to veRL.
 We provide an alternative way to configure the veRL trainer. You may also directly specify the parameters in the `trainer.trainer_config` dictionary. This approach is mutually exclusive with using `trainer.trainer_config_path`.

 Note that some parameters are not listed in the auxiliary configuration file (e.g., `train_gsm8k.yaml`), as they will be overridden by the parameters in the Trinity configuration file (e.g., `gsm8k.yaml`). Please refer to `./trinity_configs.md` for more details.
-Future versions will gradually reduce parameters in `trainer.trainer_config` and `trainer.trainer_config_path` until they are fully deprecated.
+For users' convenience, future versions will gradually reduce parameters in `trainer.trainer_config` and `trainer.trainer_config_path` until they are fully deprecated.

 ---

@@ -18,14 +18,14 @@ Future versions will gradually reduce parameters in `trainer.trainer_config` and
 **A:** The following parameters are closely related:

 - `buffer.batch_size`: The number of tasks in a batch, effective for both the explorer and the trainer.
-- `actor_rollout_ref.actor.ppo_mini_batch_size`: In the configuration, this value represents the number of tasks in a mini-batch, overridden by `buffer.batch_size`; but in the `update_policy` function, its value becomes the number of experiences in a mini-batch per GPU, i.e., `buffer.batch_size * algorithm.repeat_times / ngpus_trainer`.
+- `actor_rollout_ref.actor.ppo_mini_batch_size`: In the configuration, this value represents the number of tasks in a mini-batch, overridden by `buffer.batch_size`; but in the `update_policy` function, its value becomes the number of experiences in a mini-batch per GPU, i.e., `buffer.batch_size * algorithm.repeat_times (/ ngpus_trainer)`. The division by `ngpus_trainer` comes from the implicit allocation of data across GPUs, but it does not affect the result after gradient accumulation.
 - `actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu`: The number of experiences in a micro-batch per GPU.

 A minimal example showing their usage is as follows:

 ```python
-def update_policy(batch):
-    dataloader = batch.split(ppo_mini_batch_size)
+def update_policy(batch_exps):
+    dataloader = batch_exps.split(ppo_mini_batch_size)  # here `ppo_mini_batch_size` is in terms of experiences
     for _ in range(ppo_epochs):
         for batch_idx, data in enumerate(dataloader):
             # Split data

From a7414d66dde48e15da05586f592aaaf49f6c1b1a Mon Sep 17 00:00:00 2001
From: hiyuchang
Date: Thu, 3 Jul 2025 14:50:01 +0800
Subject: [PATCH 3/9] add faq to readme

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index c8c50999f7..6631937d6c 100644
--- a/README.md
+++ b/README.md
@@ -283,7 +283,7 @@ For more detailed examples about how to use Trinity-RFT, please refer to the fol
 + [Advanced data processing / human-in-the-loop](./docs/sphinx_doc/source/tutorial/example_data_functionalities.md)

-
+For answers to some frequently asked questions, see the [FAQ](./docs/sphinx_doc/source/tutorial/faq.md).
 ## Advanced usage and full configurations

From e4f0e906b48972901c3bc536ee14bc4ca9f858c0 Mon Sep 17 00:00:00 2001
From: hiyuchang
Date: Thu, 3 Jul 2025 15:33:45 +0800
Subject: [PATCH 4/9] fix typo

---
 docs/sphinx_doc/source/tutorial/faq.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/sphinx_doc/source/tutorial/faq.md b/docs/sphinx_doc/source/tutorial/faq.md
index 5f30588d93..0562a29848 100644
--- a/docs/sphinx_doc/source/tutorial/faq.md
+++ b/docs/sphinx_doc/source/tutorial/faq.md
@@ -97,7 +97,7 @@
 **A:** The following parameters may be helpful:

 - For the trainer, adjust `actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu` when `actor_rollout_ref.actor.use_dynamic_bsz=false`; adjust `actor_rollout_ref.actor.ppo_max_token_len_per_gpu` and `actor_rollout_ref.actor.ulysses_sequence_parallel_size` when `actor_rollout_ref.actor.use_dynamic_bsz=true`.
-- For exploere, adjust `explorer.rollout_model.tensor_parallel_size`.
+- For explorer, adjust `explorer.rollout_model.tensor_parallel_size`.

From 45689433c45fd127b52089ac5e4680ccae3e8179 Mon Sep 17 00:00:00 2001
From: hiyuchang
Date: Thu, 3 Jul 2025 16:47:02 +0800
Subject: [PATCH 5/9] rm a verl param

---
 examples/async_gsm8k/verl_config.yaml                    | 1 -
 examples/dpo_humanlike/train_dpo.yaml                    | 1 -
 examples/grpo_alfworld/alfworld.yaml                     | 2 +-
 examples/grpo_alfworld/train_alfworld.yaml               | 1 -
 examples/grpo_gsm8k/train_gsm8k.yaml                     | 1 -
 examples/grpo_gsm8k_experience_pipeline/train_gsm8k.yaml | 1 -
 examples/grpo_gsm8k_task_pipeline/train_gsm8k.yaml       | 1 -
 examples/grpo_math/train_math.yaml                       | 1 -
 examples/grpo_sciworld/train_sciworld.yaml               | 1 -
 examples/grpo_webshop/train_webshop.yaml                 | 1 -
 examples/mix_math/train_mix_math.yaml                    | 1 -
 examples/opmd_gsm8k/train_opmd_gsm8k.yaml                | 1 -
 examples/ppo_countdown/train_countdown.yaml              | 2 --
 13 files changed, 1 insertion(+), 14 deletions(-)

diff --git a/examples/async_gsm8k/verl_config.yaml b/examples/async_gsm8k/verl_config.yaml
index fc44fdad94..f773f9a0ae 100644
--- a/examples/async_gsm8k/verl_config.yaml
+++ b/examples/async_gsm8k/verl_config.yaml
@@ -7,7 +7,6 @@ actor_rollout_ref:
     use_remove_padding: True # False
   actor:
     strategy: fsdp # This is for backward-compatibility
-    ppo_mini_batch_size: 128
     ppo_micro_batch_size_per_gpu: 4
     use_dynamic_bsz: True # False
     ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length}
diff --git a/examples/dpo_humanlike/train_dpo.yaml b/examples/dpo_humanlike/train_dpo.yaml
index d5074848b0..28c687322c 100644
--- a/examples/dpo_humanlike/train_dpo.yaml
+++ b/examples/dpo_humanlike/train_dpo.yaml
@@ -7,7 +7,6 @@ actor_rollout_ref:
     use_remove_padding: False
   actor:
     strategy: fsdp # This is for backward-compatibility
-    ppo_mini_batch_size: 32
     ppo_micro_batch_size_per_gpu: 2 # NOTE
     use_dynamic_bsz: False
     ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length}
diff --git a/examples/grpo_alfworld/alfworld.yaml b/examples/grpo_alfworld/alfworld.yaml
index 8323ef8591..281008ae46 100644
--- a/examples/grpo_alfworld/alfworld.yaml
+++ b/examples/grpo_alfworld/alfworld.yaml
@@ -13,7 +13,7 @@ cluster:
   gpu_per_node: 8
 buffer:
   total_epochs: 20
-  batch_size: 4
+  batch_size: 32
   max_retry_times: 3
   max_retry_interval: 1
 explorer_input:
diff --git a/examples/grpo_alfworld/train_alfworld.yaml b/examples/grpo_alfworld/train_alfworld.yaml
index 5b73ec7403..063abd768a 100644
--- a/examples/grpo_alfworld/train_alfworld.yaml
+++ 
b/examples/grpo_alfworld/train_alfworld.yaml @@ -7,7 +7,6 @@ actor_rollout_ref: use_remove_padding: False actor: strategy: fsdp # This is for backward-compatibility - ppo_mini_batch_size: 1536 ppo_micro_batch_size_per_gpu: 1 use_dynamic_bsz: False ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length} diff --git a/examples/grpo_gsm8k/train_gsm8k.yaml b/examples/grpo_gsm8k/train_gsm8k.yaml index fc44fdad94..f773f9a0ae 100644 --- a/examples/grpo_gsm8k/train_gsm8k.yaml +++ b/examples/grpo_gsm8k/train_gsm8k.yaml @@ -7,7 +7,6 @@ actor_rollout_ref: use_remove_padding: True # False actor: strategy: fsdp # This is for backward-compatibility - ppo_mini_batch_size: 128 ppo_micro_batch_size_per_gpu: 4 use_dynamic_bsz: True # False ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length} diff --git a/examples/grpo_gsm8k_experience_pipeline/train_gsm8k.yaml b/examples/grpo_gsm8k_experience_pipeline/train_gsm8k.yaml index fc44fdad94..f773f9a0ae 100644 --- a/examples/grpo_gsm8k_experience_pipeline/train_gsm8k.yaml +++ b/examples/grpo_gsm8k_experience_pipeline/train_gsm8k.yaml @@ -7,7 +7,6 @@ actor_rollout_ref: use_remove_padding: True # False actor: strategy: fsdp # This is for backward-compatibility - ppo_mini_batch_size: 128 ppo_micro_batch_size_per_gpu: 4 use_dynamic_bsz: True # False ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length} diff --git a/examples/grpo_gsm8k_task_pipeline/train_gsm8k.yaml b/examples/grpo_gsm8k_task_pipeline/train_gsm8k.yaml index fc44fdad94..f773f9a0ae 100644 --- a/examples/grpo_gsm8k_task_pipeline/train_gsm8k.yaml +++ b/examples/grpo_gsm8k_task_pipeline/train_gsm8k.yaml @@ -7,7 +7,6 @@ actor_rollout_ref: use_remove_padding: True # False actor: strategy: fsdp # This is for backward-compatibility - ppo_mini_batch_size: 128 ppo_micro_batch_size_per_gpu: 4 use_dynamic_bsz: True # False ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length} diff --git a/examples/grpo_math/train_math.yaml b/examples/grpo_math/train_math.yaml index 0a46bd1788..ee94163eed 100644 --- a/examples/grpo_math/train_math.yaml +++ b/examples/grpo_math/train_math.yaml @@ -7,7 +7,6 @@ actor_rollout_ref: use_remove_padding: True # False actor: strategy: fsdp # This is for backward-compatibility - ppo_mini_batch_size: 128 ppo_micro_batch_size_per_gpu: 4 use_dynamic_bsz: True # False ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length} diff --git a/examples/grpo_sciworld/train_sciworld.yaml b/examples/grpo_sciworld/train_sciworld.yaml index 5b73ec7403..063abd768a 100644 --- a/examples/grpo_sciworld/train_sciworld.yaml +++ b/examples/grpo_sciworld/train_sciworld.yaml @@ -7,7 +7,6 @@ actor_rollout_ref: use_remove_padding: False actor: strategy: fsdp # This is for backward-compatibility - ppo_mini_batch_size: 1536 ppo_micro_batch_size_per_gpu: 1 use_dynamic_bsz: False ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length} diff --git a/examples/grpo_webshop/train_webshop.yaml b/examples/grpo_webshop/train_webshop.yaml index 5b73ec7403..063abd768a 100644 --- a/examples/grpo_webshop/train_webshop.yaml +++ b/examples/grpo_webshop/train_webshop.yaml @@ -7,7 +7,6 @@ actor_rollout_ref: use_remove_padding: False actor: strategy: fsdp # This is for backward-compatibility - ppo_mini_batch_size: 1536 ppo_micro_batch_size_per_gpu: 1 use_dynamic_bsz: False ppo_max_token_len_per_gpu: 16384 
# n * ${data.max_prompt_length} + ${data.max_response_length} diff --git a/examples/mix_math/train_mix_math.yaml b/examples/mix_math/train_mix_math.yaml index ca072b78f6..7d32c1d756 100644 --- a/examples/mix_math/train_mix_math.yaml +++ b/examples/mix_math/train_mix_math.yaml @@ -7,7 +7,6 @@ actor_rollout_ref: use_remove_padding: True # False actor: strategy: fsdp # This is for backward-compatibility - ppo_mini_batch_size: 128 ppo_micro_batch_size_per_gpu: 4 use_dynamic_bsz: True # False ppo_max_token_len_per_gpu: 25600 # n * ${data.max_prompt_length} + ${data.max_response_length} diff --git a/examples/opmd_gsm8k/train_opmd_gsm8k.yaml b/examples/opmd_gsm8k/train_opmd_gsm8k.yaml index 5ddd5124ee..cf2f06cf70 100644 --- a/examples/opmd_gsm8k/train_opmd_gsm8k.yaml +++ b/examples/opmd_gsm8k/train_opmd_gsm8k.yaml @@ -31,7 +31,6 @@ actor_rollout_ref: use_remove_padding: True actor: strategy: fsdp # This is for backward-compatibility - ppo_mini_batch_size: 128 ppo_micro_batch_size_per_gpu: 4 use_dynamic_bsz: True ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length} diff --git a/examples/ppo_countdown/train_countdown.yaml b/examples/ppo_countdown/train_countdown.yaml index 191c345b90..7b1ef8eccf 100644 --- a/examples/ppo_countdown/train_countdown.yaml +++ b/examples/ppo_countdown/train_countdown.yaml @@ -7,7 +7,6 @@ actor_rollout_ref: use_remove_padding: True actor: strategy: fsdp # This is for backward-compatibility - ppo_mini_batch_size: 128 ppo_micro_batch_size_per_gpu: 4 use_dynamic_bsz: True ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length} @@ -61,7 +60,6 @@ critic: # transformer_layer_cls_to_wrap: None min_num_params: 0 fsdp_size: -1 - ppo_mini_batch_size: ${actor_rollout_ref.actor.ppo_mini_batch_size} ppo_micro_batch_size_per_gpu: 8 forward_micro_batch_size_per_gpu: ${critic.ppo_micro_batch_size_per_gpu} use_dynamic_bsz: ${actor_rollout_ref.actor.use_dynamic_bsz} From 2dcd722f86048ab1825a6c66c3d9f52f1bed3e50 Mon Sep 17 00:00:00 2001 From: hiyuchang Date: Thu, 3 Jul 2025 17:17:48 +0800 Subject: [PATCH 6/9] fix readme --- examples/grpo_math/README.md | 2 +- examples/ppo_countdown/README.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/examples/grpo_math/README.md b/examples/grpo_math/README.md index 649cc5272f..5b3c2c3ea2 100644 --- a/examples/grpo_math/README.md +++ b/examples/grpo_math/README.md @@ -1,6 +1,6 @@ # Example: PPO on MATH dataset -This example shows the usage of PPO on the MATH dataset. +This example shows the usage of PPO on the MATH dataset, adapted from [simpleRL](https://github.com/hkust-nlp/simpleRL-reason/tree/v0). For more detailed information, please refer to the [documentation](../../docs/sphinx_doc/source/tutorial/example_reasoning_basic.md). diff --git a/examples/ppo_countdown/README.md b/examples/ppo_countdown/README.md index fa08b375a7..04c14d6241 100644 --- a/examples/ppo_countdown/README.md +++ b/examples/ppo_countdown/README.md @@ -1,6 +1,6 @@ # Example: PPO on Countdown dataset -This example shows the usage of PPO on the Countdown dataset. +This example shows the usage of PPO on the Countdown dataset, adapted from [TinyZero](https://github.com/Jiayi-Pan/TinyZero). For more detailed information, please refer to the [documentation](../../docs/sphinx_doc/source/tutorial/example_reasoning_basic.md). 
From 3ca7f986da680f2e6e2ce8ee9984611c655a1b30 Mon Sep 17 00:00:00 2001
From: hiyuchang
Date: Fri, 4 Jul 2025 12:55:20 +0800
Subject: [PATCH 7/9] fix comments

---
 docs/sphinx_doc/source/tutorial/example_mix_algo.md |  4 ++--
 docs/sphinx_doc/source/tutorial/faq.md              | 14 ++++++------
 2 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/docs/sphinx_doc/source/tutorial/example_mix_algo.md b/docs/sphinx_doc/source/tutorial/example_mix_algo.md
index b106293eed..59f7036f46 100644
--- a/docs/sphinx_doc/source/tutorial/example_mix_algo.md
+++ b/docs/sphinx_doc/source/tutorial/example_mix_algo.md
@@ -15,9 +15,9 @@ $$ \left[ \frac{1}{T'_b} \sum_{t=1}^{T'_b} \log \pi_\theta(o'_{b,t} \mid q'_b, o'_{b,
diff --git a/docs/sphinx_doc/source/tutorial/faq.md b/docs/sphinx_doc/source/tutorial/faq.md
index 0562a29848..4a56308525 100644
--- a/docs/sphinx_doc/source/tutorial/faq.md
+++ b/docs/sphinx_doc/source/tutorial/faq.md
@@ -65,7 +65,7 @@ File ".../flash_attn/flash_attn_interface.py", line 15, in <module>
 ImportError: ...
 ```

-**A:** The `flash-attn` module is not properly installed. Try to fix it by running `MAX_JOBS=128 pip install flash-attn`.
+**A:** The `flash-attn` module is not properly installed. Try to fix it by running `pip install flash-attn`.

 ---

@@ -74,7 +74,7 @@ ImportError: ...
 UsageError: api_key not configured (no-tty). call wandb.login(key=[your_api_key]) ...
 ```

-**A:** Try to log in to WandB before running the experiment. One way to do this is to run the command `export WANDB_API_KEY=[your_api_key]`.
+**A:** Try to log in to WandB before starting Ray and running the experiment. One way to do this is to run the command `export WANDB_API_KEY=[your_api_key]`.

 ---

@@ -147,14 +147,16 @@
 **Q:** How to load the checkpoints outside of the Trinity-RFT framework?

-**A:** You need to specify `model.model_path` and `checkpoint_root_dir`. The following code snippet gives an example using the `transformers` library.
+**A:** You need to specify the model path and the checkpoint path. The following code snippet gives an example using the `transformers` library.

 ```python
+import os
 from transformers import AutoTokenizer, AutoModelForCausalLM
 from trinity.common.models.utils import load_state_dict_from_verl_checkpoint

-model = AutoModelForCausalLM.from_pretrained(model.model_path)
-# Assume we need the checkpoint at step 780
-ckp_path = checkpoint_root_dir + "global_step_780/actor/"
+# Assume we need the checkpoint at step 780;
+# model_path, checkpoint_root_dir, project, and name are already defined
+model = AutoModelForCausalLM.from_pretrained(model_path)
+ckp_path = os.path.join(checkpoint_root_dir, project, name, "global_step_780", "actor")
 model.load_state_dict(load_state_dict_from_verl_checkpoint(ckp_path))
 ```

From 4fc18fcec797cffedb251e002336b399dc43ab0f Mon Sep 17 00:00:00 2001
From: hiyuchang
Date: Fri, 4 Jul 2025 14:06:27 +0800
Subject: [PATCH 8/9] fix pre-commit issue

---
 docs/sphinx_doc/source/tutorial/faq.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/sphinx_doc/source/tutorial/faq.md b/docs/sphinx_doc/source/tutorial/faq.md
index 4a56308525..5c47632778 100644
--- a/docs/sphinx_doc/source/tutorial/faq.md
+++ b/docs/sphinx_doc/source/tutorial/faq.md
@@ -154,7 +154,7 @@ import os
 from transformers import AutoTokenizer, AutoModelForCausalLM
 from trinity.common.models.utils import load_state_dict_from_verl_checkpoint

-# Assume we need the checkpoint at step 780; 
+# Assume we need the checkpoint at step 780;
 # model_path, checkpoint_root_dir, project, and name are already defined
 model = AutoModelForCausalLM.from_pretrained(model_path)
 ckp_path = os.path.join(checkpoint_root_dir, project, name, "global_step_780", "actor")

From bab822bc7f15588b1d254e860be2fcd4def16ee1 Mon Sep 17 00:00:00 2001
From: hiyuchang
Date: Fri, 4 Jul 2025 14:09:20 +0800
Subject: [PATCH 9/9] fix comments

---
 docs/sphinx_doc/source/tutorial/faq.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/sphinx_doc/source/tutorial/faq.md b/docs/sphinx_doc/source/tutorial/faq.md
index 5c47632778..b9606734a7 100644
--- a/docs/sphinx_doc/source/tutorial/faq.md
+++ b/docs/sphinx_doc/source/tutorial/faq.md
@@ -65,7 +65,7 @@ File ".../flash_attn/flash_attn_interface.py", line 15, in <module>
 ImportError: ...
 ```

-**A:** The `flash-attn` module is not properly installed. Try to fix it by running `pip install flash-attn`.
+**A:** The `flash-attn` module is not properly installed. Try to fix it by running `pip install flash-attn` or `pip install flash-attn -v --no-build-isolation`.

 ---

@@ -83,7 +83,7 @@ UsageError: api_key not configured (no-tty). call wandb.login(key=[your_api_key]
 ValueError: Failed to look up actor with name 'explorer' ...
 ```

-**A:** Try to restart Ray before running the experiment:
+**A:** Make sure Ray is started before running the experiment. If Ray is already running, you can restart it with the following commands:

 ```bash
 ray stop