2 changes: 1 addition & 1 deletion README.md
@@ -283,7 +283,7 @@ For more detailed examples about how to use Trinity-RFT, please refer to the fol
+ [Advanced data processing / human-in-the-loop](./docs/sphinx_doc/source/tutorial/example_data_functionalities.md)



For some frequently asked questions, check [FAQ](./docs/sphinx_doc/source/tutorial/faq.md) for answers.


## Advanced usage and full configurations
6 changes: 6 additions & 0 deletions docs/sphinx_doc/source/index.rst
@@ -33,6 +33,12 @@ Welcome to Trinity-RFT's documentation!
tutorial/trinity_configs.md
tutorial/example_mix_algo.md

.. toctree::
:maxdepth: 2
:caption: FAQ

tutorial/faq.md

.. toctree::
:maxdepth: 1
:glob:
4 changes: 2 additions & 2 deletions docs/sphinx_doc/source/tutorial/example_mix_algo.md
@@ -15,9 +15,9 @@ $$
\left[
\frac{1}{T'_b} \sum_{t=1}^{T'_b}
\log \pi_\theta(o'_{b,t} \mid q'_b, o'_{b,<t})
\right]}_{\text{Auxiliary Loss on Expert Data}}.
\right]}_{\text{Auxiliary objective on expert data}}.
$$
The first term corresponds to the standard GRPO objective, which aims to maximize the expected reward. The last term is an auxiliary loss defined on expert data, encouraging the policy to imitate expert behavior. $\mu$ is a weighting factor that controls the relative importance of the two terms.
The first term corresponds to the standard GRPO objective, which aims to maximize the expected reward. The last term is an auxiliary objective defined on expert data, encouraging the policy to imitate expert behavior. $\mu$ is a weighting factor that controls the relative importance of the two terms.
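In code, the combined objective can be sketched as follows (a rough sketch with stand-in tensors, not the actual Trinity-RFT implementation; all names here are hypothetical):

```python
import torch

# Stand-in values: a precomputed (negative) GRPO objective and per-token
# log-probabilities of expert responses under the current policy.
grpo_loss = torch.tensor(0.25)
expert_log_probs = torch.randn(4, 16)  # (batch, tokens)

mu = 0.1                               # weighting factor between the two terms
expert_nll = -expert_log_probs.mean()  # auxiliary objective: imitate expert data
total_loss = grpo_loss + mu * expert_nll
```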


## Step 0: Prepare the Expert Data
162 changes: 162 additions & 0 deletions docs/sphinx_doc/source/tutorial/faq.md
@@ -0,0 +1,162 @@
# FAQ

## Part 1: Configurations
**Q:** Why do most examples have two configuration YAML files, e.g., `gsm8k.yaml` and `train_gsm8k.yaml` in the `examples/grpo_gsm8k` directory?

**A:** Trinity-RFT uses [veRL](https://github.com/volcengine/verl) as the training backend, and the auxiliary YAML file starting with `train_` configures veRL; see the [veRL documentation](https://verl.readthedocs.io/en/latest/examples/config.html) for the parameter meanings.
If you specify the path to `train_gsm8k.yaml` in `trainer.trainer_config_path`, Trinity-RFT automatically passes these parameters to veRL.

Alternatively, you may specify the parameters directly in the `trainer.trainer_config` dictionary. This approach is mutually exclusive with using `trainer.trainer_config_path`.

Note that some parameters are not listed in the auxiliary configuration file (e.g., `train_gsm8k.yaml`), as they are overridden by the parameters in the Trinity configuration file (e.g., `gsm8k.yaml`). Please refer to `./trinity_configs.md` for more details.
For users' convenience, future versions will gradually reduce the parameters exposed through `trainer.trainer_config` and `trainer.trainer_config_path` until they are fully deprecated.
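As a minimal sketch of how these options appear in practice (assuming PyYAML is installed and the `examples/grpo_gsm8k/gsm8k.yaml` layout described above), you can inspect which of the two a given config uses:

```python
import yaml  # assumes PyYAML is installed

with open("examples/grpo_gsm8k/gsm8k.yaml") as f:
    cfg = yaml.safe_load(f)

trainer_cfg = cfg.get("trainer", {})
# At most one of the two should be set (they are mutually exclusive):
print("trainer_config_path:", trainer_cfg.get("trainer_config_path"))
print("trainer_config:", trainer_cfg.get("trainer_config"))
```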

---

**Q:** What's the relationship between `buffer.batch_size`, `actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu` and other batch sizes?

**A:** The following parameters are closely related:

- `buffer.batch_size`: The number of tasks in a batch, effective for both the explorer and the trainer.
- `actor_rollout_ref.actor.ppo_mini_batch_size`: In the configuration, this value represents the number of tasks in a mini-batch and is overridden by `buffer.batch_size`; inside the `update_policy` function, however, it becomes the number of experiences in a mini-batch per GPU, i.e., `buffer.batch_size * algorithm.repeat_times (/ ngpus_trainer)`. The division by `ngpus_trainer` comes from the implicit allocation of data across GPUs, but it does not affect the result after gradient accumulation.
- `actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu`: The number of experiences in a micro-batch per GPU.

A minimal example showing their usage is as follows:

```python
def update_policy(self, batch_exps):
    # Here `ppo_mini_batch_size` is in terms of experiences per GPU
    dataloader = batch_exps.split(ppo_mini_batch_size)
    for _ in range(ppo_epochs):
        for batch_idx, data in enumerate(dataloader):
            # Split the mini-batch into micro-batches
            mini_batch = data
            if actor_rollout_ref.actor.use_dynamic_bsz:
                micro_batches, _ = rearrange_micro_batches(
                    batch=mini_batch, max_token_len=max_token_len
                )
            else:
                micro_batches = mini_batch.split(
                    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu
                )

            # Compute gradients, accumulated over micro-batches
            for data in micro_batches:
                entropy, log_prob = self._forward_micro_batch(
                    micro_batch=data, ...
                )
                pg_loss, pg_clipfrac, ppo_kl, pg_clipfrac_lower = compute_policy_loss(
                    log_prob=log_prob, **data
                )
                policy_loss = pg_loss + ...
                loss = policy_loss / self.gradient_accumulation
                loss.backward()

            # Optimizer step per mini-batch
            grad_norm = self._optimizer_step()
            self.actor_optimizer.zero_grad()
```
Please refer to `trinity/trainer/verl/dp_actor.py` for the detailed implementation. veRL also provides an explanation in its [FAQ](https://verl.readthedocs.io/en/latest/faq/faq.html#what-is-the-meaning-of-train-batch-size-mini-batch-size-and-micro-batch-size).
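To make these relationships concrete, here is a small worked example with hypothetical numbers (assuming `use_dynamic_bsz=false`; none of these values come from a particular config):

```python
# Hypothetical values, chosen only to illustrate the arithmetic
buffer_batch_size = 96   # buffer.batch_size: tasks per batch
repeat_times = 8         # algorithm.repeat_times: rollouts per task
ngpus_trainer = 4        # number of GPUs used by the trainer
micro_batch_per_gpu = 4  # actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu

experiences_per_batch = buffer_batch_size * repeat_times      # 768 experiences in total
mini_batch_per_gpu = experiences_per_batch // ngpus_trainer   # 192 experiences per GPU
grad_accum_steps = mini_batch_per_gpu // micro_batch_per_gpu  # 48 micro-batches per optimizer step
print(experiences_per_batch, mini_batch_per_gpu, grad_accum_steps)
```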


## Part 2: Common Errors

**Error:**
```bash
File ".../flash_attn/flash_attn_interface.py", line 15, in ‹module>
import flash_attn_2_cuda as flash_attn_gpu
ImportError: ...
```

**A:** The `flash-attn` package is not properly installed. Try fixing it by running `pip install flash-attn` or `pip install flash-attn -v --no-build-isolation`.

---

**Error:**
```bash
UsageError: api_key not configured (no-tty). call wandb.login(key=[your_api_key]) ...
```

**A:** Log in to WandB before starting Ray and running the experiment. One way to do this is to run `export WANDB_API_KEY=[your_api_key]`.

---

**Error:**
```bash
ValueError: Failed to look up actor with name 'explorer' ...
```

**A:** Make sure Ray is started before running the experiment. If Ray is already running, you can restart it with the following commands:

```bash
ray stop
ray start --head
```

---

**Error:** Out-of-Memory (OOM) error

**A:** The following parameters may be helpful:

- For the trainer, adjust `actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu` when `actor_rollout_ref.actor.use_dynamic_bsz=false`; adjust `actor_rollout_ref.actor.ppo_max_token_len_per_gpu` and `actor_rollout_ref.actor.ulysses_sequence_parallel_size` when `actor_rollout_ref.actor.use_dynamic_bsz=true`.
- For the explorer, adjust `explorer.rollout_model.tensor_parallel_size`.


## Part 3: Debugging Methods [Coming Soon]
To see the full logs of all processes and save them to `debug.log`:
```bash
export RAY_DEDUP_LOGS=0
trinity run --config grpo_gsm8k/gsm8k.yaml 2>&1 | tee debug.log
```


## Part 4: Other Questions
**Q:** What's the purpose of `buffer.trainer_input.experience_buffer.path`?

**A:** It specifies the path to the SQLite database that stores the generated experiences. You may comment out this line if you don't want to use the SQLite database.

To see the experiences in the database, you can use the following Python script:

```python
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

from trinity.common.schema import ExperienceModel

# Use the same value as `buffer.trainer_input.experience_buffer.path` in your config
engine = create_engine("sqlite:///path/to/experience_buffer.db")
Session = sessionmaker(bind=engine)
sess = Session()

MAX_EXPERIENCES = 4
experiences = (
    sess.query(ExperienceModel)
    .with_for_update()
    .limit(MAX_EXPERIENCES)
    .all()
)

exp_list = [ExperienceModel.to_experience(exp) for exp in experiences]

# Print the experiences
for exp in exp_list:
    print(f"{exp.prompt_text=}", f"{exp.response_text=}")
```

---

**Q:** How can I load checkpoints outside of the Trinity-RFT framework?

**A:** You need to specify the model path and the checkpoint path. The following code snippet gives an example using `transformers`.

```python
import os

from transformers import AutoModelForCausalLM, AutoTokenizer

from trinity.common.models.utils import load_state_dict_from_verl_checkpoint

# Assume we need the checkpoint at step 780;
# model_path, checkpoint_root_dir, project, and name are already defined
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)
ckp_path = os.path.join(checkpoint_root_dir, project, name, "global_step_780", "actor")
model.load_state_dict(load_state_dict_from_verl_checkpoint(ckp_path))
```
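Continuing from the snippet above, a quick sanity check (with an arbitrary, hypothetical prompt) that the restored weights generate sensible output:

```python
# Hypothetical prompt, only to verify that the loaded checkpoint responds sensibly
inputs = tokenizer("Question: What is 1 + 1?\nAnswer:", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```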
2 changes: 1 addition & 1 deletion docs/sphinx_doc/source/tutorial/trinity_configs.md
@@ -399,7 +399,7 @@ data_processor:

For advanced users working with the `verl` trainer backend. This includes fine-grained settings for actor/critic models, optimizer parameters, and training loops.

> For full parameter meanings, refer to the [veRL documentation](https://github.com/volcengine/verl/blob/v0.3.0.post1/docs/examples/config.rst).
> For full parameter meanings, refer to the [veRL documentation](https://verl.readthedocs.io/en/latest/examples/config.html).


```yaml
1 change: 0 additions & 1 deletion examples/async_gsm8k/verl_config.yaml
@@ -7,7 +7,6 @@ actor_rollout_ref:
use_remove_padding: True # False
actor:
strategy: fsdp # This is for backward-compatibility
ppo_mini_batch_size: 128
ppo_micro_batch_size_per_gpu: 4
use_dynamic_bsz: True # False
ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length}
1 change: 0 additions & 1 deletion examples/dpo_humanlike/train_dpo.yaml
@@ -7,7 +7,6 @@ actor_rollout_ref:
use_remove_padding: False
actor:
strategy: fsdp # This is for backward-compatibility
ppo_mini_batch_size: 32
ppo_micro_batch_size_per_gpu: 2 # NOTE
use_dynamic_bsz: False
ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length}
2 changes: 1 addition & 1 deletion examples/grpo_alfworld/alfworld.yaml
@@ -13,7 +13,7 @@ cluster:
gpu_per_node: 8
buffer:
total_epochs: 20
batch_size: 4
batch_size: 32
max_retry_times: 3
max_retry_interval: 1
explorer_input:
1 change: 0 additions & 1 deletion examples/grpo_alfworld/train_alfworld.yaml
@@ -7,7 +7,6 @@ actor_rollout_ref:
use_remove_padding: False
actor:
strategy: fsdp # This is for backward-compatibility
ppo_mini_batch_size: 1536
ppo_micro_batch_size_per_gpu: 1
use_dynamic_bsz: False
ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length}
1 change: 0 additions & 1 deletion examples/grpo_gsm8k/train_gsm8k.yaml
@@ -7,7 +7,6 @@ actor_rollout_ref:
use_remove_padding: True # False
actor:
strategy: fsdp # This is for backward-compatibility
ppo_mini_batch_size: 128
ppo_micro_batch_size_per_gpu: 4
use_dynamic_bsz: True # False
ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length}
1 change: 0 additions & 1 deletion examples/grpo_gsm8k_experience_pipeline/train_gsm8k.yaml
@@ -7,7 +7,6 @@ actor_rollout_ref:
use_remove_padding: True # False
actor:
strategy: fsdp # This is for backward-compatibility
ppo_mini_batch_size: 128
ppo_micro_batch_size_per_gpu: 4
use_dynamic_bsz: True # False
ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length}
1 change: 0 additions & 1 deletion examples/grpo_gsm8k_task_pipeline/train_gsm8k.yaml
@@ -7,7 +7,6 @@ actor_rollout_ref:
use_remove_padding: True # False
actor:
strategy: fsdp # This is for backward-compatibility
ppo_mini_batch_size: 128
ppo_micro_batch_size_per_gpu: 4
use_dynamic_bsz: True # False
ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length}
2 changes: 1 addition & 1 deletion examples/grpo_math/README.md
@@ -1,6 +1,6 @@
# Example: PPO on MATH dataset

This example shows the usage of PPO on the MATH dataset.
This example shows the usage of PPO on the MATH dataset, adapted from [simpleRL](https://github.com/hkust-nlp/simpleRL-reason/tree/v0).

For more detailed information, please refer to the [documentation](../../docs/sphinx_doc/source/tutorial/example_reasoning_basic.md).

1 change: 0 additions & 1 deletion examples/grpo_math/train_math.yaml
@@ -7,7 +7,6 @@ actor_rollout_ref:
use_remove_padding: True # False
actor:
strategy: fsdp # This is for backward-compatibility
ppo_mini_batch_size: 128
ppo_micro_batch_size_per_gpu: 4
use_dynamic_bsz: True # False
ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length}
1 change: 0 additions & 1 deletion examples/grpo_sciworld/train_sciworld.yaml
@@ -7,7 +7,6 @@ actor_rollout_ref:
use_remove_padding: False
actor:
strategy: fsdp # This is for backward-compatibility
ppo_mini_batch_size: 1536
ppo_micro_batch_size_per_gpu: 1
use_dynamic_bsz: False
ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length}
1 change: 0 additions & 1 deletion examples/grpo_webshop/train_webshop.yaml
@@ -7,7 +7,6 @@ actor_rollout_ref:
use_remove_padding: False
actor:
strategy: fsdp # This is for backward-compatibility
ppo_mini_batch_size: 1536
ppo_micro_batch_size_per_gpu: 1
use_dynamic_bsz: False
ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length}
1 change: 0 additions & 1 deletion examples/mix_math/train_mix_math.yaml
@@ -7,7 +7,6 @@ actor_rollout_ref:
use_remove_padding: True # False
actor:
strategy: fsdp # This is for backward-compatibility
ppo_mini_batch_size: 128
ppo_micro_batch_size_per_gpu: 4
use_dynamic_bsz: True # False
ppo_max_token_len_per_gpu: 25600 # n * ${data.max_prompt_length} + ${data.max_response_length}
1 change: 0 additions & 1 deletion examples/opmd_gsm8k/train_opmd_gsm8k.yaml
@@ -31,7 +31,6 @@ actor_rollout_ref:
use_remove_padding: True
actor:
strategy: fsdp # This is for backward-compatibility
ppo_mini_batch_size: 128
ppo_micro_batch_size_per_gpu: 4
use_dynamic_bsz: True
ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length}
2 changes: 1 addition & 1 deletion examples/ppo_countdown/README.md
@@ -1,6 +1,6 @@
# Example: PPO on Countdown dataset

This example shows the usage of PPO on the Countdown dataset.
This example shows the usage of PPO on the Countdown dataset, adapted from [TinyZero](https://github.com/Jiayi-Pan/TinyZero).

For more detailed information, please refer to the [documentation](../../docs/sphinx_doc/source/tutorial/example_reasoning_basic.md).

2 changes: 0 additions & 2 deletions examples/ppo_countdown/train_countdown.yaml
@@ -7,7 +7,6 @@ actor_rollout_ref:
use_remove_padding: True
actor:
strategy: fsdp # This is for backward-compatibility
ppo_mini_batch_size: 128
ppo_micro_batch_size_per_gpu: 4
use_dynamic_bsz: True
ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length}
@@ -61,7 +60,6 @@ critic:
# transformer_layer_cls_to_wrap: None
min_num_params: 0
fsdp_size: -1
ppo_mini_batch_size: ${actor_rollout_ref.actor.ppo_mini_batch_size}
ppo_micro_batch_size_per_gpu: 8
forward_micro_batch_size_per_gpu: ${critic.ppo_micro_batch_size_per_gpu}
use_dynamic_bsz: ${actor_rollout_ref.actor.use_dynamic_bsz}