[ALFWorld](https://github.com/alfworld/alfworld) is a text-based interactive environment in which an agent completes household tasks in a simulated home by issuing natural-language commands.
The environment is configured as follows (a minimal interaction sketch follows the list):
* Environment: Text-based interactive environment built on TextWorld
* Action Space: Commands such as `pick`, `go to`, `place`, etc.
* Reward Structure: +1 for successfully completing the task, -0.1 otherwise
* Maximum Steps: 30 (configurable via `max_env_steps`)
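
The reward scheme and step limit above can be summarized as a thin episode loop. The sketch below is purely illustrative and assumes a minimal `TextEnv` interface with `reset()` and `step()`; it is not the actual Trinity-RFT workflow code.

```python
from typing import Protocol, Tuple


class TextEnv(Protocol):
    """Minimal interface assumed for a text-based ALFWorld-style environment (illustrative)."""

    def reset(self) -> str: ...                             # returns the initial observation
    def step(self, command: str) -> Tuple[str, bool]: ...   # returns (observation, done)


def run_episode(env: TextEnv, agent, max_env_steps: int = 30) -> float:
    """Roll out one episode and return the reward described above:
    +1 if the task is completed within the step budget, -0.1 otherwise."""
    observation = env.reset()
    for _ in range(max_env_steps):
        command = agent(observation)          # e.g. "go to cabinet 1", "pick apple"
        observation, done = env.step(command)
        if done:
            return 1.0
    return -0.1
```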
See the [documentation](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_multi_turn.html) for data preparation.
## 2. Experimental Settings
We evaluate the performance of the following methods in the Trinity-RFT framework, version [0.3.3](https://github.com/modelscope/Trinity-RFT/releases/tag/v0.3.3) (verl==0.5.0, vllm==0.11.0), and compare against the latest release of rLLM as of Nov. 6, 2025, commit [ef6451f](https://github.com/rllm-org/rllm/commit/ef6451fbd7eba224c4a87e3fd944d7c0e2bcc0ea) (verl==0.5.0).
Since rLLM does not yet support the ALFWorld environment, we implement this task in rLLM ourselves for comparison.
We evaluate the GRPO algorithm on this task in both Trinity-RFT and rLLM.
We fine-tune a `Qwen2.5-3B-Instruct` model, which has already been trained on an SFT dataset, on the training tasks with GRPO in both frameworks. For all methods, we fix the key parameters to `batch_size=32`, `repeat_times=8`, `lr=1e-6`, and `kl_coef=0.001`.
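
For reference, the core of GRPO is the group-relative advantage: the `repeat_times=8` rollouts generated for the same task are normalized against each other. The snippet below is a simplified sketch of that computation, not the verl/Trinity-RFT implementation.

```python
import numpy as np


def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantages: `rewards` has shape (num_tasks, repeat_times),
    and each rollout's reward is normalized by the mean/std of its own group."""
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)


# Example: 8 rollouts of a single ALFWorld task, half of them successful.
print(grpo_advantages(np.array([[1.0, -0.1, 1.0, -0.1, 1.0, -0.1, 1.0, -0.1]])))
```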
For better efficiency, we use 64 rollout workers in rLLM, and set `explorer.engine_num` to 4 and `explorer.runner_per_model` to 8 in Trinity-RFT.
## 3. Results and Analysis
We compare the sample efficiency of different methods by plotting the reward and test score vs. training steps. As shown in the following figures, Trinity-RFT and rLLM reach similar training and test results at the same step.
We further compare the wall-clock efficiency of the two frameworks on the ALFWorld task.
The following table details the wall-clock time required for each method to reach the specific performance thresholds, i.e., reward = 0.8 and test score = 0.6.
| Method | Training Reward | Time to Reach Target (Hours) | Speedup |
|--------|-----------------|------------------------------|---------|
The results show that Trinity-RFT achieves a noticeable speedup on the ALFWorld task, as also illustrated in the following figures.
The primary reason for the efficiency gap lies in the rollout mechanisms of Trinity-RFT and rLLM. Trinity-RFT uses multiprocessing during rollout, whereas rLLM employs multithreading, which restricts the parallelism of the rollout process in the ALFWorld environment because the environment is not thread-safe (see [this issue](https://github.com/alfworld/alfworld/issues/71)).
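
The effect can be reproduced with a toy benchmark: when the environment is not thread-safe, thread-based rollouts have to serialize environment access behind a lock, while process-based rollouts give each worker its own environment. The sketch below is illustrative only and does not use the actual Trinity-RFT or rLLM worker code.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from threading import Lock

# A non-thread-safe environment forces a thread pool to serialize env access
# with a lock; a process pool gives every worker a private env instance instead.
_env_lock = Lock()


def rollout_with_lock(task_id: int) -> int:
    """Thread-based rollout: env steps must run under a shared lock."""
    with _env_lock:
        time.sleep(0.1)  # stand-in for roughly one env step of simulation work
    return task_id


def rollout_in_process(task_id: int) -> int:
    """Process-based rollout: each worker owns its own env, no lock needed."""
    time.sleep(0.1)
    return task_id


if __name__ == "__main__":
    tasks = range(32)

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(rollout_with_lock, tasks))
    print(f"threads + lock: {time.perf_counter() - start:.2f}s")  # ~3.2s (serialized)

    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=8) as pool:
        list(pool.map(rollout_in_process, tasks))
    print(f"processes:      {time.perf_counter() - start:.2f}s")  # ~0.4s (parallel)
```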
## 1. Environment: FrozenLake

[Frozen lake](https://gymnasium.farama.org/environments/toy_text/frozen_lake/) involves walking over a frozen lake from Start (S) to Goal (G) without falling into any Holes (H). We formulate this task as a multi-step workflow, where the agent interacts with the environment for multiple steps to reach the goal.
The environment is configured as follows (a minimal Gymnasium setup sketch follows the list):
* Map Size: From 2x2 to 5x5, randomly generated.
* Mode: Non-Slippery
* Action Space: Up, Down, Left, Right
* Reward Structure: +1 for reaching the goal, 0 otherwise.
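
This setup corresponds to Gymnasium's `FrozenLake-v1` in non-slippery mode; the snippet below is a minimal illustration, not the workflow code used in the experiments.

```python
import gymnasium as gym

# A 4x4 map with Start (S), Frozen (F), Holes (H), and Goal (G);
# is_slippery=False gives the deterministic (non-slippery) mode used here.
desc = ["SFFF",
        "FHFH",
        "FFFH",
        "HFFG"]
env = gym.make("FrozenLake-v1", desc=desc, is_slippery=False)

obs, info = env.reset(seed=0)
# Actions: 0 = Left, 1 = Down, 2 = Right, 3 = Up.
for action in [1, 1, 2, 2, 1, 2]:  # one shortest path on this map
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        break
print(reward)  # 1.0 when the goal is reached, 0.0 otherwise
```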
The training and test data are generated by a data-generation script that produces 10000 training tasks and 100 test tasks.
To filter out unsolvable tasks, we require each game map to have a valid path of at most `env_map_steps=8` steps. Moreover, the agent may take at most `agent_max_steps=10` steps to reach the goal.
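
A possible way to implement this generation-and-filtering step is sketched below, using Gymnasium's `generate_random_map` and a breadth-first search over non-hole tiles; it illustrates the `env_map_steps` rule and is not the actual data-generation script.

```python
import random
from collections import deque

from gymnasium.envs.toy_text.frozen_lake import generate_random_map


def shortest_path_length(desc: list[str]) -> int | None:
    """BFS from S to G over non-hole tiles; return the number of steps, or None."""
    n = len(desc)
    start = next((r, c) for r in range(n) for c in range(n) if desc[r][c] == "S")
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        (r, c), dist = queue.popleft()
        if desc[r][c] == "G":
            return dist
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < n and 0 <= nc < n and (nr, nc) not in seen and desc[nr][nc] != "H":
                seen.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return None


def sample_task(env_map_steps: int = 8) -> list[str]:
    """Sample a random map (2x2 to 5x5) whose shortest path fits the step budget."""
    while True:
        desc = generate_random_map(size=random.randint(2, 5))
        steps = shortest_path_length(desc)
        if steps is not None and steps <= env_map_steps:
            return desc


print(sample_task())
```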
## 2. Experimental Settings
We evaluate the performance of the following methods in the Trinity-RFT framework, version [0.3.3](https://github.com/modelscope/Trinity-RFT/releases/tag/v0.3.3) (verl==0.5.0, vllm==0.11.0), and compare against the latest release of rLLM as of Nov. 6, 2025, commit [ef6451f](https://github.com/rllm-org/rllm/commit/ef6451fbd7eba224c4a87e3fd944d7c0e2bcc0ea) (verl==0.5.0).
We fine-tune a `Qwen2.5-3B-Instruct` model on the training tasks with GRPO. For all experiments, we fix the key parameters to `batch_size=64`, `repeat_times=8`, and `lr=1e-6`. We run each experiment three times and report the averaged results.
For a fair comparison, we tune the efficiency-related configurations of both frameworks. For rLLM, we adopt the default configuration in `examples/frozenlake/train_frozenlake_agent.sh`, except that we increase the batch size to 64 for stability and set the number of rollout workers to 64 for efficiency. For Trinity-RFT, we set `explorer.engine_num` to 4 for efficiency.
## 3. Results and Analysis
We compare the sample efficiency of different methods by plotting the reward and test score in the following figures. At the same step, Trinity-RFT and rLLM achieve similar rewards and test scores, verifying the training correctness.
The following table details the wall-clock time required for each method to reach a specific performance threshold. Trinity-RFT requires less time to reach each target, i.e., reward=0.6, reward=0.8, and test score=0.8.
| Method | Training Reward | Time to Reach Target (Hours) | Speedup |
|--------|-----------------|------------------------------|---------|