From 0b5df71cbab2fcf46aad5d728be97c9abab535d4 Mon Sep 17 00:00:00 2001
From: hiyuchang
Date: Tue, 1 Jul 2025 19:26:19 +0800
Subject: [PATCH 1/9] add faq

---
 docs/sphinx_doc/source/index.rst              |   6 +
 docs/sphinx_doc/source/tutorial/faq.md        | 160 ++++++++++++++++++
 .../source/tutorial/trinity_configs.md        |   2 +-
 3 files changed, 167 insertions(+), 1 deletion(-)
 create mode 100644 docs/sphinx_doc/source/tutorial/faq.md

diff --git a/docs/sphinx_doc/source/index.rst b/docs/sphinx_doc/source/index.rst
index 4b4cab2aa9..fc085215b0 100644
--- a/docs/sphinx_doc/source/index.rst
+++ b/docs/sphinx_doc/source/index.rst
@@ -33,6 +33,12 @@ Welcome to Trinity-RFT's documentation!
    tutorial/trinity_configs.md
    tutorial/example_mix_algo.md

+.. toctree::
+   :maxdepth: 2
+   :caption: FAQ
+
+   tutorial/faq.md
+
 .. toctree::
    :maxdepth: 1
    :glob:
diff --git a/docs/sphinx_doc/source/tutorial/faq.md b/docs/sphinx_doc/source/tutorial/faq.md
new file mode 100644
index 0000000000..99caafc457
--- /dev/null
+++ b/docs/sphinx_doc/source/tutorial/faq.md
@@ -0,0 +1,160 @@
+# FAQ
+
## Part 1: Configurations
**Q:** Why do most examples have two configuration YAML files, e.g., `gsm8k.yaml` and `train_gsm8k.yaml` in the `examples/grpo_gsm8k` directory?

**A:** Trinity-RFT uses [veRL](https://github.com/volcengine/verl) as the training backend, and the auxiliary YAML file starting with `train_` is used for configuring veRL, as described in the [veRL documentation](https://github.com/volcengine/verl/blob/v0.4.0/docs/examples/config.rst).
If you specify the path to `train_gsm8k.yaml` in `trainer.trainer_config_path`, Trinity-RFT will automatically pass the parameters to veRL.

We provide an alternative way to configure the veRL trainer. You may also directly specify the parameters in the `trainer.trainer_config` dictionary. This approach is mutually exclusive with using `trainer.trainer_config_path`.

Note that some parameters are not listed in the auxiliary configuration file (e.g., `train_gsm8k.yaml`), as they will be overridden by the parameters in the Trinity configuration file (e.g., `gsm8k.yaml`). Please refer to `./trinity_configs.md` for more details.
Future versions will gradually reduce parameters in `trainer.trainer_config` and `trainer.trainer_config_path` until they are fully deprecated.

---

**Q:** What's the relationship between `buffer.batch_size`, `actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu`, and other batch sizes?

**A:** The following parameters are closely related:

- `buffer.batch_size`: The number of tasks in a batch, effective for both the explorer and the trainer.
- `actor_rollout_ref.actor.ppo_mini_batch_size`: In the configuration, this value represents the number of tasks in a mini-batch, overridden by `buffer.batch_size`; but in the `update_policy` function, its value becomes the number of experiences in a mini-batch per GPU, i.e., `buffer.batch_size * algorithm.repeat_times / ngpus_trainer`.
- `actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu`: The number of experiences in a micro-batch per GPU.
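To make the arithmetic concrete, here is a small sketch with purely hypothetical values (`buffer.batch_size=32`, `algorithm.repeat_times=8`, a trainer with 4 GPUs, and `ppo_micro_batch_size_per_gpu=4`; none of these are defaults):

```python
# All values below are hypothetical, chosen only to illustrate the arithmetic.
buffer_batch_size = 32  # buffer.batch_size: tasks per batch
repeat_times = 8        # algorithm.repeat_times: rollouts generated per task
ngpus_trainer = 4       # number of GPUs used by the trainer
micro_batch_size = 4    # ppo_micro_batch_size_per_gpu: experiences per micro-batch

experiences_per_batch = buffer_batch_size * repeat_times     # 256 experiences in total
mini_batch_per_gpu = experiences_per_batch // ngpus_trainer  # 64 experiences per GPU
grad_accum_steps = mini_batch_per_gpu // micro_batch_size    # 16 micro-batches per optimizer step

print(experiences_per_batch, mini_batch_per_gpu, grad_accum_steps)  # 256 64 16
```

Under these assumptions, each optimizer step accumulates gradients over 16 micro-batches of 4 experiences on every trainer GPU.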
A minimal example showing their usage is as follows:

```python
def update_policy(batch):
    dataloader = batch.split(ppo_mini_batch_size)
    for _ in range(ppo_epochs):
        for batch_idx, data in enumerate(dataloader):
            # Split data
            mini_batch = data
            if actor_rollout_ref.actor.use_dynamic_bsz:
                micro_batches, _ = rearrange_micro_batches(
                    batch=mini_batch, max_token_len=max_token_len
                )
            else:
                micro_batches = mini_batch.split(actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu)

            # Computing gradient
            for data in micro_batches:
                entropy, log_prob = self._forward_micro_batch(
                    micro_batch=data, ...
                )
                pg_loss, pg_clipfrac, ppo_kl, pg_clipfrac_lower = compute_policy_loss(
                    log_prob=log_prob, **data
                )
                policy_loss = pg_loss
                ...
                # Scale the loss so gradients accumulated over the
                # micro-batches average to the mini-batch gradient
                loss = policy_loss / self.gradient_accumulation
                loss.backward()

            # Optimizer step
            grad_norm = self._optimizer_step()
            self.actor_optimizer.zero_grad()
```
Please refer to `trinity/trainer/verl/dp_actor.py` for the detailed implementation. veRL also provides an explanation in its [FAQ](https://verl.readthedocs.io/en/latest/faq/faq.html#what-is-the-meaning-of-train-batch-size-mini-batch-size-and-micro-batch-size).


## Part 2: Common Errors

**Error:**
```bash
File ".../flash_attn/flash_attn_interface.py", line 15, in <module>
    import flash_attn_2_cuda as flash_attn_gpu
ImportError: ...
```

**A:** The `flash-attn` module is not properly installed. Try to fix it by running `MAX_JOBS=128 pip install flash-attn`.

---

**Error:**
```bash
UsageError: api_key not configured (no-tty). call wandb.login(key=[your_api_key]) ...
```

**A:** Try to log in to WandB before running the experiment. One way to do this is to run the command `export WANDB_API_KEY=[your_api_key]`.

---

**Error:**
```bash
ValueError: Failed to look up actor with name 'explorer' ...
```

**A:** Try to restart Ray before running the experiment:

```bash
ray stop
ray start --head
```

---

**Error:** Out-of-Memory (OOM) error

**A:** The following parameters may be helpful:

- For the trainer, adjust `actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu` when `actor_rollout_ref.actor.use_dynamic_bsz=false`; adjust `actor_rollout_ref.actor.ppo_max_token_len_per_gpu` and `actor_rollout_ref.actor.ulysses_sequence_parallel_size` when `actor_rollout_ref.actor.use_dynamic_bsz=true`.
- For exploere, adjust `explorer.rollout_model.tensor_parallel_size`.


## Part 3: Debugging Methods [Coming Soon]
To see the full logs of all processes and save them to `debug.log`:
```bash
export RAY_DEDUP_LOGS=0
trinity run --config grpo_gsm8k/gsm8k.yaml 2>&1 | tee debug.log
```


## Part 4: Other Questions
**Q:** What's the purpose of `buffer.trainer_input.experience_buffer.path`?

**A:** This specifies the path to the SQLite database that stores the generated experiences, e.g., a database URL of the form `sqlite:///path/to/experiences.db` (the path here is illustrative). You may comment out this line if you don't want to use the SQLite database.
To see the experiences in the database, you can use the following Python script:

```python
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

from trinity.common.schema import ExperienceModel

# Set this to the value of `buffer.trainer_input.experience_buffer.path`;
# the URL below is illustrative
db_url = "sqlite:///path/to/experiences.db"
engine = create_engine(db_url)
Session = sessionmaker(bind=engine)
session = Session()

# Fetch the first few experiences
MAX_EXPERIENCES = 4
experiences = (
    session.query(ExperienceModel)
    .limit(MAX_EXPERIENCES)
    .all()
)

# Convert the database rows back into Experience objects
exp_list = [ExperienceModel.to_experience(exp) for exp in experiences]

# Print the experiences
for exp in exp_list:
    print(f"{exp.prompt_text=}", f"{exp.response_text=}")
```

---

**Q:** How to load the checkpoints outside of the Trinity-RFT framework?

**A:** You need to specify `model.model_path` and `checkpoint_root_dir`. The following code snippet gives an example using the `transformers` library.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from trinity.common.models.utils import load_state_dict_from_verl_checkpoint

model = AutoModelForCausalLM.from_pretrained(model.model_path)
# Assume we need the checkpoint at step 780
ckp_path = checkpoint_root_dir + "global_step_780/actor/"
model.load_state_dict(load_state_dict_from_verl_checkpoint(ckp_path))
```
diff --git a/docs/sphinx_doc/source/tutorial/trinity_configs.md b/docs/sphinx_doc/source/tutorial/trinity_configs.md
index 88d925f786..6c2497c5d5 100644
--- a/docs/sphinx_doc/source/tutorial/trinity_configs.md
+++ b/docs/sphinx_doc/source/tutorial/trinity_configs.md
@@ -399,7 +399,7 @@ data_processor:

 For advanced users working with the `verl` trainer backend. This includes fine-grained settings for actor/critic models, optimizer parameters, and training loops.

-> For full parameter meanings, refer to the [veRL documentation](https://github.com/volcengine/verl/blob/v0.3.0.post1/docs/examples/config.rst).
+> For full parameter meanings, refer to the [veRL documentation](https://verl.readthedocs.io/en/latest/examples/config.html).

 ```yaml

From 0f7c58a44672706f8f88c399f7f3ea7a4b9be883 Mon Sep 17 00:00:00 2001
From: hiyuchang
Date: Thu, 3 Jul 2025 14:17:02 +0800
Subject: [PATCH 2/9] update faq

---
 docs/sphinx_doc/source/tutorial/faq.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/docs/sphinx_doc/source/tutorial/faq.md b/docs/sphinx_doc/source/tutorial/faq.md
index 99caafc457..5f30588d93 100644
--- a/docs/sphinx_doc/source/tutorial/faq.md
+++ b/docs/sphinx_doc/source/tutorial/faq.md
@@ -3,13 +3,13 @@
 ## Part 1: Configurations
 **Q:** Why do most examples have two configuration YAML files, e.g., `gsm8k.yaml` and `train_gsm8k.yaml` in the `examples/grpo_gsm8k` directory?

-**A:** Trinity-RFT uses [veRL](https://github.com/volcengine/verl) as the training backend, and the auxiliary YAML file starting with `train_` is used for configuring veRL, as described in the [veRL documentation](https://github.com/volcengine/verl/blob/v0.4.0/docs/examples/config.rst).
+**A:** Trinity-RFT uses [veRL](https://github.com/volcengine/verl) as the training backend, and the auxiliary YAML file starting with `train_` is used for configuring veRL, as described in the [veRL documentation](https://verl.readthedocs.io/en/latest/examples/config.html).
 If you specify the path to `train_gsm8k.yaml` in `trainer.trainer_config_path`, Trinity-RFT will automatically pass the parameters to veRL.
 We provide an alternative way to configure the veRL trainer. You may also directly specify the parameters in the `trainer.trainer_config` dictionary. This approach is mutually exclusive with using `trainer.trainer_config_path`.

 Note that some parameters are not listed in the auxiliary configuration file (e.g., `train_gsm8k.yaml`), as they will be overridden by the parameters in the Trinity configuration file (e.g., `gsm8k.yaml`). Please refer to `./trinity_configs.md` for more details.
-Future versions will gradually reduce parameters in `trainer.trainer_config` and `trainer.trainer_config_path` until they are fully deprecated.
+For users' convenience, future versions will gradually reduce parameters in `trainer.trainer_config` and `trainer.trainer_config_path` until they are fully deprecated.

 ---

@@ -18,14 +18,14 @@ Future versions will gradually reduce parameters in `trainer.trainer_config` and
 **A:** The following parameters are closely related:

 - `buffer.batch_size`: The number of tasks in a batch, effective for both the explorer and the trainer.
-- `actor_rollout_ref.actor.ppo_mini_batch_size`: In the configuration, this value represents the number of tasks in a mini-batch, overridden by `buffer.batch_size`; but in the `update_policy` function, its value becomes the number of experiences in a mini-batch per GPU, i.e., `buffer.batch_size * algorithm.repeat_times / ngpus_trainer`.
+- `actor_rollout_ref.actor.ppo_mini_batch_size`: In the configuration, this value represents the number of tasks in a mini-batch, overridden by `buffer.batch_size`; but in the `update_policy` function, its value becomes the number of experiences in a mini-batch per GPU, i.e., `buffer.batch_size * algorithm.repeat_times (/ ngpus_trainer)`. The division by `ngpus_trainer` comes from the implicit allocation of data across GPUs, but it does not affect the result after gradient accumulation.
 - `actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu`: The number of experiences in a micro-batch per GPU.

 A minimal example showing their usage is as follows:

 ```python
-def update_policy(batch):
-    dataloader = batch.split(ppo_mini_batch_size)
+def update_policy(batch_exps):
+    dataloader = batch_exps.split(ppo_mini_batch_size)  # here `ppo_mini_batch_size` is in terms of experiences
     for _ in range(ppo_epochs):
         for batch_idx, data in enumerate(dataloader):
             # Split data

From a7414d66dde48e15da05586f592aaaf49f6c1b1a Mon Sep 17 00:00:00 2001
From: hiyuchang
Date: Thu, 3 Jul 2025 14:50:01 +0800
Subject: [PATCH 3/9] add faq to readme

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index c8c50999f7..6631937d6c 100644
--- a/README.md
+++ b/README.md
@@ -283,7 +283,7 @@ For more detailed examples about how to use Trinity-RFT, please refer to the fol
 + [Advanced data processing / human-in-the-loop](./docs/sphinx_doc/source/tutorial/example_data_functionalities.md)

-
+For answers to some frequently asked questions, see the [FAQ](./docs/sphinx_doc/source/tutorial/faq.md).
 ## Advanced usage and full configurations

From e4f0e906b48972901c3bc536ee14bc4ca9f858c0 Mon Sep 17 00:00:00 2001
From: hiyuchang
Date: Thu, 3 Jul 2025 15:33:45 +0800
Subject: [PATCH 4/9] fix typo

---
 docs/sphinx_doc/source/tutorial/faq.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/sphinx_doc/source/tutorial/faq.md b/docs/sphinx_doc/source/tutorial/faq.md
index 5f30588d93..0562a29848 100644
--- a/docs/sphinx_doc/source/tutorial/faq.md
+++ b/docs/sphinx_doc/source/tutorial/faq.md
@@ -97,7 +97,7 @@
 **A:** The following parameters may be helpful:

 - For the trainer, adjust `actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu` when `actor_rollout_ref.actor.use_dynamic_bsz=false`; adjust `actor_rollout_ref.actor.ppo_max_token_len_per_gpu` and `actor_rollout_ref.actor.ulysses_sequence_parallel_size` when `actor_rollout_ref.actor.use_dynamic_bsz=true`.
-- For exploere, adjust `explorer.rollout_model.tensor_parallel_size`.
+- For explorer, adjust `explorer.rollout_model.tensor_parallel_size`.

From 45689433c45fd127b52089ac5e4680ccae3e8179 Mon Sep 17 00:00:00 2001
From: hiyuchang
Date: Thu, 3 Jul 2025 16:47:02 +0800
Subject: [PATCH 5/9] rm a verl param

---
 examples/async_gsm8k/verl_config.yaml                    | 1 -
 examples/dpo_humanlike/train_dpo.yaml                    | 1 -
 examples/grpo_alfworld/alfworld.yaml                     | 2 +-
 examples/grpo_alfworld/train_alfworld.yaml               | 1 -
 examples/grpo_gsm8k/train_gsm8k.yaml                     | 1 -
 examples/grpo_gsm8k_experience_pipeline/train_gsm8k.yaml | 1 -
 examples/grpo_gsm8k_task_pipeline/train_gsm8k.yaml       | 1 -
 examples/grpo_math/train_math.yaml                       | 1 -
 examples/grpo_sciworld/train_sciworld.yaml               | 1 -
 examples/grpo_webshop/train_webshop.yaml                 | 1 -
 examples/mix_math/train_mix_math.yaml                    | 1 -
 examples/opmd_gsm8k/train_opmd_gsm8k.yaml                | 1 -
 examples/ppo_countdown/train_countdown.yaml              | 2 --
 13 files changed, 1 insertion(+), 14 deletions(-)

diff --git a/examples/async_gsm8k/verl_config.yaml b/examples/async_gsm8k/verl_config.yaml
index fc44fdad94..f773f9a0ae 100644
--- a/examples/async_gsm8k/verl_config.yaml
+++ b/examples/async_gsm8k/verl_config.yaml
@@ -7,7 +7,6 @@ actor_rollout_ref:
     use_remove_padding: True # False
   actor:
     strategy: fsdp # This is for backward-compatibility
-    ppo_mini_batch_size: 128
     ppo_micro_batch_size_per_gpu: 4
     use_dynamic_bsz: True # False
     ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length}
diff --git a/examples/dpo_humanlike/train_dpo.yaml b/examples/dpo_humanlike/train_dpo.yaml
index d5074848b0..28c687322c 100644
--- a/examples/dpo_humanlike/train_dpo.yaml
+++ b/examples/dpo_humanlike/train_dpo.yaml
@@ -7,7 +7,6 @@ actor_rollout_ref:
     use_remove_padding: False
   actor:
     strategy: fsdp # This is for backward-compatibility
-    ppo_mini_batch_size: 32
     ppo_micro_batch_size_per_gpu: 2 # NOTE
     use_dynamic_bsz: False
     ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length}
diff --git a/examples/grpo_alfworld/alfworld.yaml b/examples/grpo_alfworld/alfworld.yaml
index 8323ef8591..281008ae46 100644
--- a/examples/grpo_alfworld/alfworld.yaml
+++ b/examples/grpo_alfworld/alfworld.yaml
@@ -13,7 +13,7 @@ cluster:
   gpu_per_node: 8
 buffer:
   total_epochs: 20
-  batch_size: 4
+  batch_size: 32
   max_retry_times: 3
   max_retry_interval: 1
 explorer_input:
diff --git a/examples/grpo_alfworld/train_alfworld.yaml b/examples/grpo_alfworld/train_alfworld.yaml
index 5b73ec7403..063abd768a 100644
--- a/examples/grpo_alfworld/train_alfworld.yaml
+++ 
b/examples/grpo_alfworld/train_alfworld.yaml @@ -7,7 +7,6 @@ actor_rollout_ref: use_remove_padding: False actor: strategy: fsdp # This is for backward-compatibility - ppo_mini_batch_size: 1536 ppo_micro_batch_size_per_gpu: 1 use_dynamic_bsz: False ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length} diff --git a/examples/grpo_gsm8k/train_gsm8k.yaml b/examples/grpo_gsm8k/train_gsm8k.yaml index fc44fdad94..f773f9a0ae 100644 --- a/examples/grpo_gsm8k/train_gsm8k.yaml +++ b/examples/grpo_gsm8k/train_gsm8k.yaml @@ -7,7 +7,6 @@ actor_rollout_ref: use_remove_padding: True # False actor: strategy: fsdp # This is for backward-compatibility - ppo_mini_batch_size: 128 ppo_micro_batch_size_per_gpu: 4 use_dynamic_bsz: True # False ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length} diff --git a/examples/grpo_gsm8k_experience_pipeline/train_gsm8k.yaml b/examples/grpo_gsm8k_experience_pipeline/train_gsm8k.yaml index fc44fdad94..f773f9a0ae 100644 --- a/examples/grpo_gsm8k_experience_pipeline/train_gsm8k.yaml +++ b/examples/grpo_gsm8k_experience_pipeline/train_gsm8k.yaml @@ -7,7 +7,6 @@ actor_rollout_ref: use_remove_padding: True # False actor: strategy: fsdp # This is for backward-compatibility - ppo_mini_batch_size: 128 ppo_micro_batch_size_per_gpu: 4 use_dynamic_bsz: True # False ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length} diff --git a/examples/grpo_gsm8k_task_pipeline/train_gsm8k.yaml b/examples/grpo_gsm8k_task_pipeline/train_gsm8k.yaml index fc44fdad94..f773f9a0ae 100644 --- a/examples/grpo_gsm8k_task_pipeline/train_gsm8k.yaml +++ b/examples/grpo_gsm8k_task_pipeline/train_gsm8k.yaml @@ -7,7 +7,6 @@ actor_rollout_ref: use_remove_padding: True # False actor: strategy: fsdp # This is for backward-compatibility - ppo_mini_batch_size: 128 ppo_micro_batch_size_per_gpu: 4 use_dynamic_bsz: True # False ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length} diff --git a/examples/grpo_math/train_math.yaml b/examples/grpo_math/train_math.yaml index 0a46bd1788..ee94163eed 100644 --- a/examples/grpo_math/train_math.yaml +++ b/examples/grpo_math/train_math.yaml @@ -7,7 +7,6 @@ actor_rollout_ref: use_remove_padding: True # False actor: strategy: fsdp # This is for backward-compatibility - ppo_mini_batch_size: 128 ppo_micro_batch_size_per_gpu: 4 use_dynamic_bsz: True # False ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length} diff --git a/examples/grpo_sciworld/train_sciworld.yaml b/examples/grpo_sciworld/train_sciworld.yaml index 5b73ec7403..063abd768a 100644 --- a/examples/grpo_sciworld/train_sciworld.yaml +++ b/examples/grpo_sciworld/train_sciworld.yaml @@ -7,7 +7,6 @@ actor_rollout_ref: use_remove_padding: False actor: strategy: fsdp # This is for backward-compatibility - ppo_mini_batch_size: 1536 ppo_micro_batch_size_per_gpu: 1 use_dynamic_bsz: False ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length} diff --git a/examples/grpo_webshop/train_webshop.yaml b/examples/grpo_webshop/train_webshop.yaml index 5b73ec7403..063abd768a 100644 --- a/examples/grpo_webshop/train_webshop.yaml +++ b/examples/grpo_webshop/train_webshop.yaml @@ -7,7 +7,6 @@ actor_rollout_ref: use_remove_padding: False actor: strategy: fsdp # This is for backward-compatibility - ppo_mini_batch_size: 1536 ppo_micro_batch_size_per_gpu: 1 use_dynamic_bsz: False ppo_max_token_len_per_gpu: 16384 
# n * ${data.max_prompt_length} + ${data.max_response_length} diff --git a/examples/mix_math/train_mix_math.yaml b/examples/mix_math/train_mix_math.yaml index ca072b78f6..7d32c1d756 100644 --- a/examples/mix_math/train_mix_math.yaml +++ b/examples/mix_math/train_mix_math.yaml @@ -7,7 +7,6 @@ actor_rollout_ref: use_remove_padding: True # False actor: strategy: fsdp # This is for backward-compatibility - ppo_mini_batch_size: 128 ppo_micro_batch_size_per_gpu: 4 use_dynamic_bsz: True # False ppo_max_token_len_per_gpu: 25600 # n * ${data.max_prompt_length} + ${data.max_response_length} diff --git a/examples/opmd_gsm8k/train_opmd_gsm8k.yaml b/examples/opmd_gsm8k/train_opmd_gsm8k.yaml index 5ddd5124ee..cf2f06cf70 100644 --- a/examples/opmd_gsm8k/train_opmd_gsm8k.yaml +++ b/examples/opmd_gsm8k/train_opmd_gsm8k.yaml @@ -31,7 +31,6 @@ actor_rollout_ref: use_remove_padding: True actor: strategy: fsdp # This is for backward-compatibility - ppo_mini_batch_size: 128 ppo_micro_batch_size_per_gpu: 4 use_dynamic_bsz: True ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length} diff --git a/examples/ppo_countdown/train_countdown.yaml b/examples/ppo_countdown/train_countdown.yaml index 191c345b90..7b1ef8eccf 100644 --- a/examples/ppo_countdown/train_countdown.yaml +++ b/examples/ppo_countdown/train_countdown.yaml @@ -7,7 +7,6 @@ actor_rollout_ref: use_remove_padding: True actor: strategy: fsdp # This is for backward-compatibility - ppo_mini_batch_size: 128 ppo_micro_batch_size_per_gpu: 4 use_dynamic_bsz: True ppo_max_token_len_per_gpu: 16384 # n * ${data.max_prompt_length} + ${data.max_response_length} @@ -61,7 +60,6 @@ critic: # transformer_layer_cls_to_wrap: None min_num_params: 0 fsdp_size: -1 - ppo_mini_batch_size: ${actor_rollout_ref.actor.ppo_mini_batch_size} ppo_micro_batch_size_per_gpu: 8 forward_micro_batch_size_per_gpu: ${critic.ppo_micro_batch_size_per_gpu} use_dynamic_bsz: ${actor_rollout_ref.actor.use_dynamic_bsz} From 2dcd722f86048ab1825a6c66c3d9f52f1bed3e50 Mon Sep 17 00:00:00 2001 From: hiyuchang Date: Thu, 3 Jul 2025 17:17:48 +0800 Subject: [PATCH 6/9] fix readme --- examples/grpo_math/README.md | 2 +- examples/ppo_countdown/README.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/examples/grpo_math/README.md b/examples/grpo_math/README.md index 649cc5272f..5b3c2c3ea2 100644 --- a/examples/grpo_math/README.md +++ b/examples/grpo_math/README.md @@ -1,6 +1,6 @@ # Example: PPO on MATH dataset -This example shows the usage of PPO on the MATH dataset. +This example shows the usage of PPO on the MATH dataset, adapted from [simpleRL](https://github.com/hkust-nlp/simpleRL-reason/tree/v0). For more detailed information, please refer to the [documentation](../../docs/sphinx_doc/source/tutorial/example_reasoning_basic.md). diff --git a/examples/ppo_countdown/README.md b/examples/ppo_countdown/README.md index fa08b375a7..04c14d6241 100644 --- a/examples/ppo_countdown/README.md +++ b/examples/ppo_countdown/README.md @@ -1,6 +1,6 @@ # Example: PPO on Countdown dataset -This example shows the usage of PPO on the Countdown dataset. +This example shows the usage of PPO on the Countdown dataset, adapted from [TinyZero](https://github.com/Jiayi-Pan/TinyZero). For more detailed information, please refer to the [documentation](../../docs/sphinx_doc/source/tutorial/example_reasoning_basic.md). 
From 3ca7f986da680f2e6e2ce8ee9984611c655a1b30 Mon Sep 17 00:00:00 2001
From: hiyuchang
Date: Fri, 4 Jul 2025 12:55:20 +0800
Subject: [PATCH 7/9] fix comments

---
 docs/sphinx_doc/source/tutorial/example_mix_algo.md |  4 ++--
 docs/sphinx_doc/source/tutorial/faq.md              | 14 ++++++------
 2 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/docs/sphinx_doc/source/tutorial/example_mix_algo.md b/docs/sphinx_doc/source/tutorial/example_mix_algo.md
index b106293eed..59f7036f46 100644
--- a/docs/sphinx_doc/source/tutorial/example_mix_algo.md
+++ b/docs/sphinx_doc/source/tutorial/example_mix_algo.md
@@ -15,9 +15,9 @@ $$ \left[ \frac{1}{T'_b} \sum_{t=1}^{T'_b} \log \pi_\theta(o'_{b,t} \mid q'_b, o'_{b,
diff --git a/docs/sphinx_doc/source/tutorial/faq.md b/docs/sphinx_doc/source/tutorial/faq.md
index 0562a29848..4a56308525 100644
--- a/docs/sphinx_doc/source/tutorial/faq.md
+++ b/docs/sphinx_doc/source/tutorial/faq.md
@@ -65,7 +65,7 @@ File ".../flash_attn/flash_attn_interface.py", line 15, in <module>
 ImportError: ...
 ```

-**A:** The `flash-attn` module is not properly installed. Try to fix it by running `MAX_JOBS=128 pip install flash-attn`.
+**A:** The `flash-attn` module is not properly installed. Try to fix it by running `pip install flash-attn`.

 ---

@@ -74,7 +74,7 @@ ImportError: ...
 UsageError: api_key not configured (no-tty). call wandb.login(key=[your_api_key]) ...
 ```

-**A:** Try to log in to WandB before running the experiment. One way to do this is to run the command `export WANDB_API_KEY=[your_api_key]`.
+**A:** Try to log in to WandB before starting Ray and running the experiment. One way to do this is to run the command `export WANDB_API_KEY=[your_api_key]`.

 ---

@@ -147,14 +147,16 @@
 **Q:** How to load the checkpoints outside of the Trinity-RFT framework?

-**A:** You need to specify `model.model_path` and `checkpoint_root_dir`. The following code snippet gives an example using the `transformers` library.
+**A:** You need to specify the model path and the checkpoint path. The following code snippet gives an example using the `transformers` library.

 ```python
+import os
 from transformers import AutoTokenizer, AutoModelForCausalLM
 from trinity.common.models.utils import load_state_dict_from_verl_checkpoint

-model = AutoModelForCausalLM.from_pretrained(model.model_path)
-# Assume we need the checkpoint at step 780
-ckp_path = checkpoint_root_dir + "global_step_780/actor/"
+# Assume we need the checkpoint at step 780;
+# model_path, checkpoint_root_dir, project, and name are already defined
+model = AutoModelForCausalLM.from_pretrained(model_path)
+ckp_path = os.path.join(checkpoint_root_dir, project, name, "global_step_780", "actor")
 model.load_state_dict(load_state_dict_from_verl_checkpoint(ckp_path))
 ```

From 4fc18fcec797cffedb251e002336b399dc43ab0f Mon Sep 17 00:00:00 2001
From: hiyuchang
Date: Fri, 4 Jul 2025 14:06:27 +0800
Subject: [PATCH 8/9] fix pre-commit issue

---
 docs/sphinx_doc/source/tutorial/faq.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/sphinx_doc/source/tutorial/faq.md b/docs/sphinx_doc/source/tutorial/faq.md
index 4a56308525..5c47632778 100644
--- a/docs/sphinx_doc/source/tutorial/faq.md
+++ b/docs/sphinx_doc/source/tutorial/faq.md
@@ -154,7 +154,7 @@ import os
 from transformers import AutoTokenizer, AutoModelForCausalLM
 from trinity.common.models.utils import load_state_dict_from_verl_checkpoint

-# Assume we need the checkpoint at step 780; 
+# Assume we need the checkpoint at step 780;
 # model_path, checkpoint_root_dir, project, and name are already defined
 model = AutoModelForCausalLM.from_pretrained(model_path)
 ckp_path = os.path.join(checkpoint_root_dir, project, name, "global_step_780", "actor")

From bab822bc7f15588b1d254e860be2fcd4def16ee1 Mon Sep 17 00:00:00 2001
From: hiyuchang
Date: Fri, 4 Jul 2025 14:09:20 +0800
Subject: [PATCH 9/9] fix comments

---
 docs/sphinx_doc/source/tutorial/faq.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/sphinx_doc/source/tutorial/faq.md b/docs/sphinx_doc/source/tutorial/faq.md
index 5c47632778..b9606734a7 100644
--- a/docs/sphinx_doc/source/tutorial/faq.md
+++ b/docs/sphinx_doc/source/tutorial/faq.md
@@ -65,7 +65,7 @@ File ".../flash_attn/flash_attn_interface.py", line 15, in <module>
 ImportError: ...
 ```

-**A:** The `flash-attn` module is not properly installed. Try to fix it by running `pip install flash-attn`.
+**A:** The `flash-attn` module is not properly installed. Try to fix it by running `pip install flash-attn` or `pip install flash-attn -v --no-build-isolation`.

 ---

@@ -83,7 +83,7 @@ UsageError: api_key not configured (no-tty). call wandb.login(key=[your_api_key]
 ValueError: Failed to look up actor with name 'explorer' ...
 ```

-**A:** Try to restart Ray before running the experiment:
+**A:** Make sure Ray is started before running the experiment. If Ray is already running, you can restart it with the following commands:

 ```bash
 ray stop