diff --git a/README.md b/README.md
index c8c50999f7..6631937d6c 100644
--- a/README.md
+++ b/README.md
@@ -283,7 +283,7 @@ For more detailed examples about how to use Trinity-RFT, please refer to the fol
 + [Advanced data processing / human-in-the-loop](./docs/sphinx_doc/source/tutorial/example_data_functionalities.md)
-
+For frequently asked questions, check the [FAQ](./docs/sphinx_doc/source/tutorial/faq.md) for answers.

 ## Advanced usage and full configurations

diff --git a/docs/sphinx_doc/source/index.rst b/docs/sphinx_doc/source/index.rst
index 4b4cab2aa9..fc085215b0 100644
--- a/docs/sphinx_doc/source/index.rst
+++ b/docs/sphinx_doc/source/index.rst
@@ -33,6 +33,12 @@ Welcome to Trinity-RFT's documentation!
    tutorial/trinity_configs.md
    tutorial/example_mix_algo.md

+.. toctree::
+   :maxdepth: 2
+   :caption: FAQ
+
+   tutorial/faq.md
+
 .. toctree::
    :maxdepth: 1
    :glob:

diff --git a/docs/sphinx_doc/source/tutorial/example_mix_algo.md b/docs/sphinx_doc/source/tutorial/example_mix_algo.md
index b106293eed..59f7036f46 100644
--- a/docs/sphinx_doc/source/tutorial/example_mix_algo.md
+++ b/docs/sphinx_doc/source/tutorial/example_mix_algo.md
@@ -15,9 +15,9 @@
 $$ \left[ \frac{1}{T'_b} \sum_{t=1}^{T'_b} \log \pi_\theta(o'_{b,t} \mid q'_b, o'_{b,<t}) \right] $$

diff --git a/docs/sphinx_doc/source/tutorial/faq.md b/docs/sphinx_doc/source/tutorial/faq.md
new file mode 100644
--- /dev/null
+++ b/docs/sphinx_doc/source/tutorial/faq.md
+**Error:**
+```bash
+  import flash_attn_2_cuda as flash_attn_gpu
+ImportError: ...
+```
+
+**A:** The `flash-attn` package is not properly installed. Try fixing it with `pip install flash-attn` or `pip install flash-attn -v --no-build-isolation`.
+
+---
+
+**Error:**
+```bash
+UsageError: api_key not configured (no-tty). call wandb.login(key=[your_api_key]) ...
+```
+
+**A:** Log in to WandB before starting Ray and running the experiment. One way to do this is to run `export WANDB_API_KEY=[your_api_key]`.
+
+---
+
+**Error:**
+```bash
+ValueError: Failed to look up actor with name 'explorer' ...
+```
+
+**A:** Make sure Ray is started before running the experiment. If Ray is already running, you can restart it with the following commands:
+
+```bash
+ray stop
+ray start --head
+```
+
+---
+
+**Error:** Out-of-Memory (OOM) error
+
+**A:** The following parameters may be helpful (see the sketch after this list):
+
+- For the trainer, adjust `actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu` when `actor_rollout_ref.actor.use_dynamic_bsz=false`; adjust `actor_rollout_ref.actor.ppo_max_token_len_per_gpu` and `actor_rollout_ref.actor.ulysses_sequence_parallel_size` when `actor_rollout_ref.actor.use_dynamic_bsz=true`.
+- For the explorer, adjust `explorer.rollout_model.tensor_parallel_size`.
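+
+A sketch of where these settings live, with illustrative values rather than tuned recommendations. In the verl trainer config (e.g., `train_gsm8k.yaml`):
+
+```yaml
+actor_rollout_ref:
+  actor:
+    use_dynamic_bsz: True
+    # Used when use_dynamic_bsz=false:
+    ppo_micro_batch_size_per_gpu: 2    # lower this to reduce peak memory
+    # Used when use_dynamic_bsz=true:
+    ppo_max_token_len_per_gpu: 8192    # lower this to reduce peak memory
+    ulysses_sequence_parallel_size: 2  # shard long sequences across GPUs
+```
+
+And in the Trinity config, for the explorer:
+
+```yaml
+explorer:
+  rollout_model:
+    tensor_parallel_size: 2  # shard the rollout model across more GPUs
+```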
+
+
+## Part 3: Debugging Methods [Coming Soon]
+
+To see the full logs of all processes and save them to `debug.log`:
+
+```bash
+export RAY_DEDUP_LOGS=0
+trinity run --config grpo_gsm8k/gsm8k.yaml 2>&1 | tee debug.log
+```
+
+
+## Part 4: Other Questions
+
+**Q:** What's the purpose of `buffer.trainer_input.experience_buffer.path`?
+
+**A:** This parameter specifies the path to the SQLite database storing the generated experiences. You may comment out this line if you don't want to use the SQLite database.
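+
+A sketch of how this field might look in a config file (the value below is a hypothetical SQLAlchemy database URL):
+
+```yaml
+buffer:
+  trainer_input:
+    experience_buffer:
+      path: sqlite:///outputs/experiences.db  # hypothetical path
+```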
+
+To see the experiences in the database, you can use the following Python script:
+
+```python
+from sqlalchemy import create_engine
+from sqlalchemy.orm import sessionmaker
+
+from trinity.common.schema import ExperienceModel
+
+# Connect to the database configured as buffer.trainer_input.experience_buffer.path
+engine = create_engine("sqlite:///outputs/experiences.db")
+Session = sessionmaker(bind=engine)
+session = Session()
+
+# Fetch a few rows for inspection
+MAX_EXPERIENCES = 4
+experiences = (
+    session.query(ExperienceModel)
+    .limit(MAX_EXPERIENCES)
+    .all()
+)
+
+# Convert the ORM rows back into Experience objects
+exp_list = [ExperienceModel.to_experience(exp) for exp in experiences]
+
+# Print the experiences
+for exp in exp_list:
+    print(f"{exp.prompt_text=}", f"{exp.response_text=}")
+```
+
+---
+
+**Q:** How can I load checkpoints outside of the Trinity-RFT framework?
+
+**A:** You need to specify the model path and the checkpoint path. The following code snippet gives an example with `transformers`.
+
+```python
+import os
+
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+from trinity.common.models.utils import load_state_dict_from_verl_checkpoint
+
+# Assume we need the checkpoint at step 780;
+# model_path, checkpoint_root_dir, project, and name are already defined
+model = AutoModelForCausalLM.from_pretrained(model_path)
+ckp_path = os.path.join(checkpoint_root_dir, project, name, "global_step_780", "actor")
+model.load_state_dict(load_state_dict_from_verl_checkpoint(ckp_path))
+```
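+
+After loading, you can export the merged model with the standard `transformers` API so it can be used without Trinity-RFT (the output directory name is illustrative):
+
+```python
+# Save the merged weights and the tokenizer to a standalone directory
+model.save_pretrained("./merged_model")
+tokenizer = AutoTokenizer.from_pretrained(model_path)
+tokenizer.save_pretrained("./merged_model")
+```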

diff --git a/docs/sphinx_doc/source/tutorial/trinity_configs.md b/docs/sphinx_doc/source/tutorial/trinity_configs.md
index 88d925f786..6c2497c5d5 100644
--- a/docs/sphinx_doc/source/tutorial/trinity_configs.md
+++ b/docs/sphinx_doc/source/tutorial/trinity_configs.md
@@ -399,7 +399,7 @@ data_processor:
 For advanced users working with the `verl` trainer backend. This includes fine-grained settings for actor/critic models, optimizer parameters, and training loops.

-> For full parameter meanings, refer to the [veRL documentation](https://github.com/volcengine/verl/blob/v0.3.0.post1/docs/examples/config.rst).
+> For full parameter meanings, refer to the [veRL documentation](https://verl.readthedocs.io/en/latest/examples/config.html).

 ```yaml

diff --git a/examples/async_gsm8k/verl_config.yaml b/examples/async_gsm8k/verl_config.yaml
index fc44fdad94..f773f9a0ae 100644
--- a/examples/async_gsm8k/verl_config.yaml
+++ b/examples/async_gsm8k/verl_config.yaml
@@ -7,7 +7,6 @@ actor_rollout_ref:
     use_remove_padding: True # False
   actor:
     strategy: fsdp # This is for backward-compatibility
-    ppo_mini_batch_size: 128
     ppo_micro_batch_size_per_gpu: 4
     use_dynamic_bsz: True # False
     ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length}

diff --git a/examples/dpo_humanlike/train_dpo.yaml b/examples/dpo_humanlike/train_dpo.yaml
index d5074848b0..28c687322c 100644
--- a/examples/dpo_humanlike/train_dpo.yaml
+++ b/examples/dpo_humanlike/train_dpo.yaml
@@ -7,7 +7,6 @@ actor_rollout_ref:
     use_remove_padding: False
   actor:
     strategy: fsdp # This is for backward-compatibility
-    ppo_mini_batch_size: 32
     ppo_micro_batch_size_per_gpu: 2 # NOTE
     use_dynamic_bsz: False
     ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length}

diff --git a/examples/grpo_alfworld/alfworld.yaml b/examples/grpo_alfworld/alfworld.yaml
index 8323ef8591..281008ae46 100644
--- a/examples/grpo_alfworld/alfworld.yaml
+++ b/examples/grpo_alfworld/alfworld.yaml
@@ -13,7 +13,7 @@ cluster:
   gpu_per_node: 8
 buffer:
   total_epochs: 20
-  batch_size: 4
+  batch_size: 32
   max_retry_times: 3
   max_retry_interval: 1
   explorer_input:

diff --git a/examples/grpo_alfworld/train_alfworld.yaml b/examples/grpo_alfworld/train_alfworld.yaml
index 5b73ec7403..063abd768a 100644
--- a/examples/grpo_alfworld/train_alfworld.yaml
+++ b/examples/grpo_alfworld/train_alfworld.yaml
@@ -7,7 +7,6 @@ actor_rollout_ref:
     use_remove_padding: False
   actor:
     strategy: fsdp # This is for backward-compatibility
-    ppo_mini_batch_size: 1536
     ppo_micro_batch_size_per_gpu: 1
     use_dynamic_bsz: False
     ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length}

diff --git a/examples/grpo_gsm8k/train_gsm8k.yaml b/examples/grpo_gsm8k/train_gsm8k.yaml
index fc44fdad94..f773f9a0ae 100644
--- a/examples/grpo_gsm8k/train_gsm8k.yaml
+++ b/examples/grpo_gsm8k/train_gsm8k.yaml
@@ -7,7 +7,6 @@ actor_rollout_ref:
     use_remove_padding: True # False
   actor:
     strategy: fsdp # This is for backward-compatibility
-    ppo_mini_batch_size: 128
     ppo_micro_batch_size_per_gpu: 4
     use_dynamic_bsz: True # False
     ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length}

diff --git a/examples/grpo_gsm8k_experience_pipeline/train_gsm8k.yaml b/examples/grpo_gsm8k_experience_pipeline/train_gsm8k.yaml
index fc44fdad94..f773f9a0ae 100644
--- a/examples/grpo_gsm8k_experience_pipeline/train_gsm8k.yaml
+++ b/examples/grpo_gsm8k_experience_pipeline/train_gsm8k.yaml
@@ -7,7 +7,6 @@ actor_rollout_ref:
     use_remove_padding: True # False
   actor:
     strategy: fsdp # This is for backward-compatibility
-    ppo_mini_batch_size: 128
     ppo_micro_batch_size_per_gpu: 4
     use_dynamic_bsz: True # False
     ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length}

diff --git a/examples/grpo_gsm8k_task_pipeline/train_gsm8k.yaml b/examples/grpo_gsm8k_task_pipeline/train_gsm8k.yaml
index fc44fdad94..f773f9a0ae 100644
--- a/examples/grpo_gsm8k_task_pipeline/train_gsm8k.yaml
+++ b/examples/grpo_gsm8k_task_pipeline/train_gsm8k.yaml
@@ -7,7 +7,6 @@ actor_rollout_ref:
     use_remove_padding: True # False
   actor:
     strategy: fsdp # This is for backward-compatibility
-    ppo_mini_batch_size: 128
     ppo_micro_batch_size_per_gpu: 4
     use_dynamic_bsz: True # False
     ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length}

diff --git a/examples/grpo_math/README.md b/examples/grpo_math/README.md
index 649cc5272f..5b3c2c3ea2 100644
--- a/examples/grpo_math/README.md
+++ b/examples/grpo_math/README.md
@@ -1,6 +1,6 @@
 # Example: PPO on MATH dataset

-This example shows the usage of PPO on the MATH dataset.
+This example shows the usage of PPO on the MATH dataset, adapted from [simpleRL](https://github.com/hkust-nlp/simpleRL-reason/tree/v0).

 For more detailed information, please refer to the [documentation](../../docs/sphinx_doc/source/tutorial/example_reasoning_basic.md).

diff --git a/examples/grpo_math/train_math.yaml b/examples/grpo_math/train_math.yaml
index 0a46bd1788..ee94163eed 100644
--- a/examples/grpo_math/train_math.yaml
+++ b/examples/grpo_math/train_math.yaml
@@ -7,7 +7,6 @@ actor_rollout_ref:
     use_remove_padding: True # False
   actor:
     strategy: fsdp # This is for backward-compatibility
-    ppo_mini_batch_size: 128
     ppo_micro_batch_size_per_gpu: 4
     use_dynamic_bsz: True # False
     ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length}

diff --git a/examples/grpo_sciworld/train_sciworld.yaml b/examples/grpo_sciworld/train_sciworld.yaml
index 5b73ec7403..063abd768a 100644
--- a/examples/grpo_sciworld/train_sciworld.yaml
+++ b/examples/grpo_sciworld/train_sciworld.yaml
@@ -7,7 +7,6 @@ actor_rollout_ref:
     use_remove_padding: False
   actor:
     strategy: fsdp # This is for backward-compatibility
-    ppo_mini_batch_size: 1536
     ppo_micro_batch_size_per_gpu: 1
     use_dynamic_bsz: False
     ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length}

diff --git a/examples/grpo_webshop/train_webshop.yaml b/examples/grpo_webshop/train_webshop.yaml
index 5b73ec7403..063abd768a 100644
--- a/examples/grpo_webshop/train_webshop.yaml
+++ b/examples/grpo_webshop/train_webshop.yaml
@@ -7,7 +7,6 @@ actor_rollout_ref:
     use_remove_padding: False
   actor:
     strategy: fsdp # This is for backward-compatibility
-    ppo_mini_batch_size: 1536
     ppo_micro_batch_size_per_gpu: 1
     use_dynamic_bsz: False
     ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length}

diff --git a/examples/mix_math/train_mix_math.yaml b/examples/mix_math/train_mix_math.yaml
index ca072b78f6..7d32c1d756 100644
--- a/examples/mix_math/train_mix_math.yaml
+++ b/examples/mix_math/train_mix_math.yaml
@@ -7,7 +7,6 @@ actor_rollout_ref:
     use_remove_padding: True # False
   actor:
     strategy: fsdp # This is for backward-compatibility
-    ppo_mini_batch_size: 128
     ppo_micro_batch_size_per_gpu: 4
     use_dynamic_bsz: True # False
     ppo_max_token_len_per_gpu: 25600 # n * ${data.max_prompt_length} + ${data.max_response_length}

diff --git a/examples/opmd_gsm8k/train_opmd_gsm8k.yaml b/examples/opmd_gsm8k/train_opmd_gsm8k.yaml
index 5ddd5124ee..cf2f06cf70 100644
--- a/examples/opmd_gsm8k/train_opmd_gsm8k.yaml
+++ b/examples/opmd_gsm8k/train_opmd_gsm8k.yaml
@@ -31,7 +31,6 @@ actor_rollout_ref:
     use_remove_padding: True
   actor:
     strategy: fsdp # This is for backward-compatibility
-    ppo_mini_batch_size: 128
     ppo_micro_batch_size_per_gpu: 4
     use_dynamic_bsz: True
     ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length}

diff --git a/examples/ppo_countdown/README.md b/examples/ppo_countdown/README.md
index fa08b375a7..04c14d6241 100644
--- a/examples/ppo_countdown/README.md
+++ b/examples/ppo_countdown/README.md
@@ -1,6 +1,6 @@
 # Example: PPO on Countdown dataset

-This example shows the usage of PPO on the Countdown dataset.
+This example shows the usage of PPO on the Countdown dataset, adapted from [TinyZero](https://github.com/Jiayi-Pan/TinyZero).

 For more detailed information, please refer to the [documentation](../../docs/sphinx_doc/source/tutorial/example_reasoning_basic.md).

diff --git a/examples/ppo_countdown/train_countdown.yaml b/examples/ppo_countdown/train_countdown.yaml
index 191c345b90..7b1ef8eccf 100644
--- a/examples/ppo_countdown/train_countdown.yaml
+++ b/examples/ppo_countdown/train_countdown.yaml
@@ -7,7 +7,6 @@ actor_rollout_ref:
     use_remove_padding: True
   actor:
     strategy: fsdp # This is for backward-compatibility
-    ppo_mini_batch_size: 128
     ppo_micro_batch_size_per_gpu: 4
     use_dynamic_bsz: True
     ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length}
@@ -61,7 +60,6 @@ critic:
         # transformer_layer_cls_to_wrap: None
         min_num_params: 0
       fsdp_size: -1
-  ppo_mini_batch_size: ${actor_rollout_ref.actor.ppo_mini_batch_size}
   ppo_micro_batch_size_per_gpu: 8
   forward_micro_batch_size_per_gpu: ${critic.ppo_micro_batch_size_per_gpu}
   use_dynamic_bsz: ${actor_rollout_ref.actor.use_dynamic_bsz}