Commit 25b8e11

kokolerk and hiyuchang authored
Alfworld Concatenated Multi-turn RFT SFT format AND settings. (#442)
Co-authored-by: Yuchang Sun <[email protected]>
1 parent 39378e2 commit 25b8e11

2 files changed: +52 -3 lines changed

examples/grpo_alfworld/README.md

Lines changed: 48 additions & 0 deletions
@@ -5,3 +5,51 @@ This example shows the usage of GRPO on the ALFWorld dataset.
 For more detailed information, please refer to the [documentation](../../docs/sphinx_doc/source/tutorial/example_multi_turn.md).
 
 The config file is located in [`alfworld.yaml`](alfworld.yaml).
+
+NOTE: For the Concatenated Multi-Turn RFT setup in the Qwen-2.5 series, the model may not follow the `<think></think><action></action>` format. You may need to perform SFT first, then GRPO.
+
+The SFT data should be named as `<TRINITY_SFT_DATASET_PATH>/data.json`, following the format:
+
+```
+[
+    {
+        "messages": [
+            {
+                "role": "system", # fixed, align with the grpo workflow: alfworld_workflow.
+                "content": "\nYou are an agent interacting with a virtual test-based environments.\n\n## Notes:\nAt each step, you should first think then perform action to fulfill the instruction. You should ALWAYS wrap your thinking with the <think> </think> tag and wrap your action with the <action> </action> tag.\nYou should ALWAYS take one action each step. \nYou should finish the task and buy the item within 15 steps.\nDONOT try to interact with the user at anytime. Finish the task and buy the item by yourself.\n\n## Action Format:\nBelow are the available commands you can use:\n look: look around your current location\n inventory: check your current inventory(you can only have 1 item in your inventory)\n go to (receptacle): move to a receptacle\n open (receptacle): open a receptacle\n close (receptacle): close a receptacle\n take (object) from (receptacle): take an object from a receptacle\n move (object) to (receptacle): place an object in or on a receptacle\n examine (something): examine a receptacle or an object\n use (object): use an object\n heat (object) with (receptacle): heat an object using a receptacle\n clean (object) with (receptacle): clean an object using a receptacle\n cool (object) with (receptacle): cool an object using a receptacle\n slice (object) with (object): slice an object using a sharp object\n\nFor example your output should be like this:\n<think> To solve the task, I need first to ... </think><action>go to cabinet 1</action>\n"
+            },
+            {
+                "role": "user",
+                "content": "Observation: {observation by alfworld}"
+            },
+            {
+                "role": "assistant",
+                "content": "<think>think process</think><action>action</action>"
+            },
+            {
+                "role": "user",
+                "content": "Observation: {observation by alfworld}"
+            },
+            {
+                "role": "assistant",
+                "content": "<think>think process</think><action>action</action>"
+            },
+            .....
+        ],
+    },
+    {
+        "messages": [
+            {
+                ......
+            },
+        ]
+    },
+    {
+        "messages": [
+            {
+                .......
+            },
+        ]
+    },
+]
+```
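
The data layout added to the README can be sketched in Python as follows. This is a minimal illustration, not part of the commit: the helper names (`follows_format`, `build_sample`, `validate_sample`, `write_dataset`) are hypothetical, and the system prompt passed in is assumed to be the exact prompt used by `alfworld_workflow`, as the README comment requires. The sketch builds one `{"messages": [...]}` record per trajectory, checks that assistant turns follow the `<think></think><action></action>` format, and writes the result to `data.json` in a dataset directory.

```python
import json
import re
from pathlib import Path

# Assistant turns are expected to look like <think>...</think><action>...</action>.
FORMAT_RE = re.compile(r"^<think>.*?</think>\s*<action>.*?</action>$", re.DOTALL)


def follows_format(text):
    """Return True if an assistant reply matches the think/action layout."""
    return FORMAT_RE.match(text.strip()) is not None


def build_sample(system_prompt, turns):
    """Build one record from (observation, thought, action) tuples.

    Hypothetical helper: system_prompt should be the prompt used by the
    GRPO workflow so that SFT and RFT see the same instructions.
    """
    messages = [{"role": "system", "content": system_prompt}]
    for observation, thought, action in turns:
        messages.append({"role": "user", "content": f"Observation: {observation}"})
        messages.append(
            {
                "role": "assistant",
                "content": f"<think>{thought}</think><action>{action}</action>",
            }
        )
    return {"messages": messages}


def validate_sample(sample):
    """Check role order (system, then user/assistant pairs) and reply format."""
    messages = sample["messages"]
    if not messages or messages[0]["role"] != "system":
        return False
    rest = messages[1:]
    if not rest or len(rest) % 2 != 0:
        return False
    for i, msg in enumerate(rest):
        expected = "user" if i % 2 == 0 else "assistant"
        if msg["role"] != expected:
            return False
        if expected == "assistant" and not follows_format(msg["content"]):
            return False
    return True


def write_dataset(samples, dataset_dir):
    """Write all samples as one JSON array to <dataset_dir>/data.json."""
    path = Path(dataset_dir) / "data.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(samples, ensure_ascii=False, indent=2), encoding="utf-8")
    return path
```

In use, `dataset_dir` would be the directory that `TRINITY_SFT_DATASET_PATH` points to. Running `validate_sample` over every record before writing is a cheap way to catch trajectories whose assistant turns are missing a tag.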

examples/grpo_alfworld/alfworld.yaml

Lines changed: 4 additions & 3 deletions
@@ -7,9 +7,9 @@ algorithm:
   optimizer:
     lr: 1e-6
 model:
-  model_path: ${oc.env:TRINITY_MODEL_PATH,Qwen/Qwen2.5-7B-Instruct}
-  max_response_tokens: 16384
-  max_model_len: 20480
+  model_path: ${oc.env:TRINITY_MODEL_PATH,Qwen/Qwen2.5-3B-Instruct}
+  max_prompt_tokens: 10240 # input max tokens every turn
+  max_response_tokens: 4096 # output max tokens every turn
 cluster:
   node_num: 1
   gpu_per_node: 8
@@ -77,4 +77,5 @@ trainer:
 # format:
 # prompt_type: messages
 # messages_key: 'messages'
+#. enable_concatenated_multi_turn: true # Enable concatenated multi-turn SFT data preprocess, default is false
 # - stage_name: rft
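
The two per-turn limits introduced in `alfworld.yaml` can be reasoned about with a small budget check. The sketch below is an assumption-laden illustration rather than Trinity-RFT code: `Turn` and `first_overflowing_turn` are hypothetical names, token counts are supplied by the caller, and it assumes that in the concatenated multi-turn setup the input of a turn is the system prompt plus all earlier observations and responses plus the current observation.

```python
from dataclasses import dataclass

MAX_PROMPT_TOKENS = 10240   # from alfworld.yaml: input max tokens every turn
MAX_RESPONSE_TOKENS = 4096  # from alfworld.yaml: output max tokens every turn


@dataclass
class Turn:
    observation_tokens: int  # tokens in the environment observation (user message)
    response_tokens: int     # tokens in the model reply (assistant message)


def first_overflowing_turn(system_tokens, turns,
                           max_prompt=MAX_PROMPT_TOKENS,
                           max_response=MAX_RESPONSE_TOKENS):
    """Return the 1-based index of the first turn that exceeds a limit, or None.

    Assumption: the prompt of turn k is the whole concatenated history,
    i.e. system prompt + earlier observations and responses + observation k.
    """
    history = system_tokens
    for index, turn in enumerate(turns, start=1):
        prompt_tokens = history + turn.observation_tokens
        if prompt_tokens > max_prompt or turn.response_tokens > max_response:
            return index
        # The reply becomes part of the history seen by the next turn.
        history = prompt_tokens + turn.response_tokens
    return None
```

Under these assumptions, a 500-token system prompt with 15 turns of 200 observation tokens and 300 response tokens each stays within both limits, while raising the responses to 600 tokens overflows the prompt limit at turn 13.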
