
Commit 4370578

Add FAQ in docs (#109)
1 parent 531f38c commit 4370578

File tree: 20 files changed (+175, -20 lines)


README.md

Lines changed: 1 addition & 1 deletion
@@ -283,7 +283,7 @@ For more detailed examples about how to use Trinity-RFT, please refer to the fol
 + [Advanced data processing / human-in-the-loop](./docs/sphinx_doc/source/tutorial/example_data_functionalities.md)

-
+For some frequently asked questions, check [FAQ](./docs/sphinx_doc/source/tutorial/faq.md) for answers.

 ## Advanced usage and full configurations

docs/sphinx_doc/source/index.rst

Lines changed: 6 additions & 0 deletions
@@ -33,6 +33,12 @@ Welcome to Trinity-RFT's documentation!
    tutorial/trinity_configs.md
    tutorial/example_mix_algo.md

+.. toctree::
+   :maxdepth: 2
+   :caption: FAQ
+
+   tutorial/faq.md
+
 .. toctree::
    :maxdepth: 1
    :glob:

docs/sphinx_doc/source/tutorial/example_mix_algo.md

Lines changed: 2 additions & 2 deletions
@@ -15,9 +15,9 @@ $$
 \left[
 \frac{1}{T'_b} \sum_{t=1}^{T'_b}
 \log \pi_\theta(o'_{b,t} \mid q'_b, o'_{b,<t})
-\right]}_{\text{Auxiliary Loss on Expert Data}}.
+\right]}_{\text{Auxiliary objective on expert data}}.
 $$
-The first term corresponds to the standard GRPO objective, which aims to maximize the expected reward. The last term is an auxiliary loss defined on expert data, encouraging the policy to imitate expert behavior. $\mu$ is a weighting factor that controls the relative importance of the two terms.
+The first term corresponds to the standard GRPO objective, which aims to maximize the expected reward. The last term is an auxiliary objective defined on expert data, encouraging the policy to imitate expert behavior. $\mu$ is a weighting factor that controls the relative importance of the two terms.


 ## Step 0: Prepare the Expert Data

docs/sphinx_doc/source/tutorial/faq.md

Lines changed: 162 additions & 0 deletions
@@ -0,0 +1,162 @@
# FAQ

## Part 1: Configurations

**Q:** Why do most examples have two configuration YAML files, e.g., `gsm8k.yaml` and `train_gsm8k.yaml` in the `examples/grpo_gsm8k` directory?

**A:** Trinity-RFT uses [veRL](https://github.com/volcengine/verl) as the training backend, and the auxiliary YAML file starting with `train_` is used for configuring veRL, as described in the [veRL documentation](https://verl.readthedocs.io/en/latest/examples/config.html).
If you specify the path to `train_gsm8k.yaml` in `trainer.trainer_config_path`, Trinity-RFT will automatically pass its parameters to veRL.

We also provide an alternative way to configure the veRL trainer: you may directly specify the parameters in the `trainer.trainer_config` dictionary. This approach is mutually exclusive with using `trainer.trainer_config_path`.
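
For concreteness, a minimal sketch of the two (mutually exclusive) styles might look as follows; the file path and the nested veRL key are illustrative placeholders rather than required values:

```yaml
trainer:
  # Option 1: point to an auxiliary veRL configuration file
  trainer_config_path: examples/grpo_gsm8k/train_gsm8k.yaml

  # Option 2: inline the veRL parameters instead (do not use together with Option 1)
  # trainer_config:
  #   actor_rollout_ref:
  #     actor:
  #       ppo_micro_batch_size_per_gpu: 4
```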

Note that some parameters are not listed in the auxiliary configuration file (e.g., `train_gsm8k.yaml`), as they are overridden by the parameters in the Trinity configuration file (e.g., `gsm8k.yaml`). Please refer to `./trinity_configs.md` for more details.
For users' convenience, future versions will gradually reduce the parameters in `trainer.trainer_config` and `trainer.trainer_config_path` until they are fully deprecated.

---

**Q:** What's the relationship between `buffer.batch_size`, `actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu`, and other batch sizes?

**A:** The following parameters are closely related:

- `buffer.batch_size`: The number of tasks in a batch, effective for both the explorer and the trainer.
- `actor_rollout_ref.actor.ppo_mini_batch_size`: In the configuration, this value represents the number of tasks in a mini-batch and is overridden by `buffer.batch_size`; but inside the `update_policy` function, its value becomes the number of experiences in a mini-batch per GPU, i.e., `buffer.batch_size * algorithm.repeat_times (/ ngpus_trainer)`. The division by `ngpus_trainer` comes from the implicit allocation of data across GPUs, but it does not affect the result after gradient accumulation.
- `actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu`: The number of experiences in a micro-batch per GPU.

A minimal example showing their usage is as follows:

```python
def update_policy(batch_exps):
    dataloader = batch_exps.split(ppo_mini_batch_size)  # here `ppo_mini_batch_size` is in terms of experiences
    for _ in range(ppo_epochs):
        for batch_idx, data in enumerate(dataloader):
            # Split data
            mini_batch = data
            if actor_rollout_ref.actor.use_dynamic_bsz:
                micro_batches, _ = rearrange_micro_batches(
                    batch=mini_batch, max_token_len=max_token_len
                )
            else:
                micro_batches = mini_batch.split(actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu)

            # Compute gradients, accumulated over micro-batches
            for data in micro_batches:
                entropy, log_prob = self._forward_micro_batch(
                    micro_batch=data, ...
                )
                pg_loss, pg_clipfrac, ppo_kl, pg_clipfrac_lower = compute_policy_loss(
                    log_prob=log_prob, **data
                )
                policy_loss = pg_loss + ...
                loss = policy_loss / self.gradient_accumulation
                loss.backward()

            # Optimizer step (once per mini-batch)
            grad_norm = self._optimizer_step()
            self.actor_optimizer.zero_grad()
```
Please refer to `trinity/trainer/verl/dp_actor.py` for the detailed implementation. veRL also provides an explanation in its [FAQ](https://verl.readthedocs.io/en/latest/faq/faq.html#what-is-the-meaning-of-train-batch-size-mini-batch-size-and-micro-batch-size).
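
As a concrete illustration with purely hypothetical numbers: suppose `buffer.batch_size = 32`, `algorithm.repeat_times = 8`, 4 trainer GPUs, and `ppo_micro_batch_size_per_gpu = 4`. Each step then consumes 32 tasks, i.e., 32 × 8 = 256 experiences; inside `update_policy`, each GPU handles 256 / 4 = 64 experiences per mini-batch, which are split into 64 / 4 = 16 micro-batches, so gradients are accumulated over 16 backward passes before each optimizer step. These numbers are illustrative only and do not come from any particular example configuration.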

## Part 2: Common Errors

**Error:**
```bash
File ".../flash_attn/flash_attn_interface.py", line 15, in <module>
    import flash_attn_2_cuda as flash_attn_gpu
ImportError: ...
```

**A:** The `flash-attn` module is not properly installed. Try to fix it by running `pip install flash-attn` or `pip install flash-attn -v --no-build-isolation`.

---

**Error:**
```bash
UsageError: api_key not configured (no-tty). call wandb.login(key=[your_api_key]) ...
```

**A:** Try to log in to WandB before starting Ray and running the experiment. One way to do this is to run the command `export WANDB_API_KEY=[your_api_key]`.

---

**Error:**
```bash
ValueError: Failed to look up actor with name 'explorer' ...
```

**A:** Make sure Ray is started before running the experiment. If Ray is already running, you can restart it with the following commands:

```bash
ray stop
ray start --head
```
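
If you are unsure whether a Ray cluster is already running, `ray status` prints the state of the current cluster; this is standard Ray CLI behavior rather than anything specific to Trinity-RFT.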

---

**Error:** Out-of-Memory (OOM) error

**A:** The following parameters may be helpful (a configuration sketch follows the list):

- For the trainer, adjust `actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu` when `actor_rollout_ref.actor.use_dynamic_bsz=false`; adjust `actor_rollout_ref.actor.ppo_max_token_len_per_gpu` and `actor_rollout_ref.actor.ulysses_sequence_parallel_size` when `actor_rollout_ref.actor.use_dynamic_bsz=true`.
- For the explorer, adjust `explorer.rollout_model.tensor_parallel_size`.
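
A rough sketch of where these keys live, with purely illustrative values (check the exact nesting against your own configuration files):

```yaml
# Auxiliary veRL config (e.g., train_gsm8k.yaml), trainer side
actor_rollout_ref:
  actor:
    use_dynamic_bsz: True
    ppo_max_token_len_per_gpu: 16384
    ulysses_sequence_parallel_size: 1
    # With use_dynamic_bsz: False, tune this instead:
    # ppo_micro_batch_size_per_gpu: 4

# Trinity config (e.g., gsm8k.yaml), explorer side
explorer:
  rollout_model:
    tensor_parallel_size: 2
```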

## Part 3: Debugging Methods [Coming Soon]

To see the full logs of all processes and save them to `debug.log`:

```bash
export RAY_DEDUP_LOGS=0
trinity run --config grpo_gsm8k/gsm8k.yaml 2>&1 | tee debug.log
```

## Part 4: Other Questions

**Q:** What's the purpose of `buffer.trainer_input.experience_buffer.path`?

**A:** This parameter specifies the path to the SQLite database that stores the generated experiences. You may comment out this line if you don't want to use the SQLite database.

To see the experiences in the database, you can use the following Python script:

```python
from sqlalchemy import create_engine
from sqlalchemy.exc import OperationalError
from sqlalchemy.orm import sessionmaker
from sqlalchemy.pool import NullPool

from trinity.common.schema import ExperienceModel

# Fill in the value configured at `buffer.trainer_input.experience_buffer.path`
db_url = "..."
engine = create_engine(db_url)
session = sessionmaker(bind=engine)
sess = session()

MAX_EXPERIENCES = 4
experiences = (
    sess.query(ExperienceModel)
    .with_for_update()
    .limit(MAX_EXPERIENCES)
    .all()
)

exp_list = []
for exp in experiences:
    exp_list.append(ExperienceModel.to_experience(exp))

# Print the experiences
for exp in exp_list:
    print(f"{exp.prompt_text=}", f"{exp.response_text=}")
```

---

**Q:** How can I load checkpoints outside of the Trinity-RFT framework?

**A:** You need to specify the model path and the checkpoint path. The following code snippet gives an example with `transformers`.

```python
import os
from transformers import AutoTokenizer, AutoModelForCausalLM
from trinity.common.models.utils import load_state_dict_from_verl_checkpoint

# Assume we need the checkpoint at step 780;
# model_path, checkpoint_root_dir, project, and name are already defined
model = AutoModelForCausalLM.from_pretrained(model_path)
ckp_path = os.path.join(checkpoint_root_dir, project, name, "global_step_780", "actor")
model.load_state_dict(load_state_dict_from_verl_checkpoint(ckp_path))
```
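
Once the state dict is loaded, the model behaves like any regular `transformers` model: for instance, you could load the tokenizer with `AutoTokenizer.from_pretrained(model_path)` and call `model.save_pretrained(...)` to export a standalone checkpoint. This is a usage suggestion based on standard `transformers` APIs, not a Trinity-RFT-specific step.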

docs/sphinx_doc/source/tutorial/trinity_configs.md

Lines changed: 1 addition & 1 deletion
@@ -409,7 +409,7 @@ data_processor:
 For advanced users working with the `verl` trainer backend. This includes fine-grained settings for actor/critic models, optimizer parameters, and training loops.

-> For full parameter meanings, refer to the [veRL documentation](https://github.com/volcengine/verl/blob/v0.3.0.post1/docs/examples/config.rst).
+> For full parameter meanings, refer to the [veRL documentation](https://verl.readthedocs.io/en/latest/examples/config.html).

 ```yaml

examples/async_gsm8k/verl_config.yaml

Lines changed: 0 additions & 1 deletion
@@ -7,7 +7,6 @@ actor_rollout_ref:
   use_remove_padding: True # False
 actor:
   strategy: fsdp # This is for backward-compatibility
-  ppo_mini_batch_size: 128
   ppo_micro_batch_size_per_gpu: 4
   use_dynamic_bsz: True # False
   ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length}

examples/dpo_humanlike/train_dpo.yaml

Lines changed: 0 additions & 1 deletion
@@ -7,7 +7,6 @@ actor_rollout_ref:
   use_remove_padding: False
 actor:
   strategy: fsdp # This is for backward-compatibility
-  ppo_mini_batch_size: 32
   ppo_micro_batch_size_per_gpu: 2 # NOTE
   use_dynamic_bsz: False
   ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length}

examples/grpo_alfworld/alfworld.yaml

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ cluster:
   gpu_per_node: 8
 buffer:
   total_epochs: 20
-  batch_size: 4
+  batch_size: 32
   max_retry_times: 3
   max_retry_interval: 1
   explorer_input:

examples/grpo_alfworld/train_alfworld.yaml

Lines changed: 0 additions & 1 deletion
@@ -7,7 +7,6 @@ actor_rollout_ref:
   use_remove_padding: False
 actor:
   strategy: fsdp # This is for backward-compatibility
-  ppo_mini_batch_size: 1536
   ppo_micro_batch_size_per_gpu: 1
   use_dynamic_bsz: False
   ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length}

examples/grpo_gsm8k/train_gsm8k.yaml

Lines changed: 0 additions & 1 deletion
@@ -7,7 +7,6 @@ actor_rollout_ref:
   use_remove_padding: True # False
 actor:
   strategy: fsdp # This is for backward-compatibility
-  ppo_mini_batch_size: 128
   ppo_micro_batch_size_per_gpu: 4
   use_dynamic_bsz: True # False
   ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length}
