Commit f859525

fix(embodied): fix libero oft, fix grpo reset, fix oft do sample (RLinf#438)
Signed-off-by: guozhen1997 <2997871698@qq.com>
1 parent 5e02e21 commit f859525

109 files changed: +224 -182 lines changed

docs/source-en/rst_source/tutorials/user/yaml.rst

Lines changed: 2 additions & 2 deletions
@@ -137,7 +137,7 @@ algorithm
 use_valid_token_scale: False

 sampling_params:
-  use_greedy: False
+  do_sample: True
   temperature: 1.0
   top_k: 1000000
   top_p: 1.0
@@ -174,7 +174,7 @@ algorithm

 **sampling_params:**

-``algorithm.sampling_params.use_greedy``: Deterministic decoding if True.
+``algorithm.sampling_params.do_sample``: Deterministic decoding if False.

 ``algorithm.sampling_params.temperature``: Softmax temperature during sampling.
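The documented semantics (greedy decoding when `do_sample` is False, otherwise sampling shaped by `temperature`, `top_k`, and `top_p`) can be illustrated with a minimal sketch. This is not RLinf's implementation: the function name `select_token` and its signature are hypothetical, and the filtering order (top-k first, then top-p) is an assumption chosen for illustration.

```python
import math
import random


def select_token(logits, do_sample=True, temperature=1.0,
                 top_k=1000000, top_p=1.0, rng=None):
    """Pick a token index from raw logits (hypothetical helper, not RLinf code)."""
    if not do_sample:
        # Deterministic (greedy) decoding: always take the arg-max logit.
        return max(range(len(logits)), key=lambda i: logits[i])

    rng = rng or random.Random()
    # Temperature-scaled softmax, shifted by the max for numerical stability.
    # Assumes temperature > 0.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Keep the top_k most likely tokens, then the smallest prefix of them
    # whose cumulative probability reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    ranked = ranked[:top_k]
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # random.choices treats the weights as relative, so no renormalization
    # of the kept probabilities is needed.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
```

With `top_k: 1000000` and `top_p: 1.0`, as in the defaults shown in this diff, neither filter removes any token for a typical vocabulary, so sampling is controlled by `temperature` alone.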

docs/source-zh/rst_source/tutorials/user/yaml.rst

Lines changed: 2 additions & 2 deletions
@@ -133,7 +133,7 @@ algorithm
 use_valid_token_scale: False

 sampling_params:
-  use_greedy: False
+  do_sample: True
   temperature: 1.0
   top_k: 1000000
   top_p: 1.0
@@ -169,7 +169,7 @@ algorithm

 **sampling_params:**

-``algorithm.sampling_params.use_greedy``: Use greedy decoding when True.
+``algorithm.sampling_params.do_sample``: Use greedy decoding when False.

 ``algorithm.sampling_params.temperature``: Sampling temperature.

examples/coding_online_rl/config/qwen2.5-1.5b-grpo-llm_judge.yaml

Lines changed: 1 addition & 1 deletion
@@ -69,7 +69,7 @@ algorithm:

 # params for rollout
 sampling_params:
-  use_greedy: False
+  do_sample: True
   temperature: 1.0
   top_k: 1000000
   top_p: 1.0
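Because the rename also inverts the flag's meaning (`use_greedy: False` becomes `do_sample: True`), a plain key rename would silently flip behavior. The sketch below shows one way a local config could be migrated. The helper name `migrate_sampling_params` is hypothetical and not part of RLinf, and it assumes the sampling parameters have already been loaded into a plain dict.

```python
def migrate_sampling_params(params):
    """Return a copy of params with legacy use_greedy replaced by do_sample."""
    migrated = dict(params)
    if "use_greedy" in migrated:
        use_greedy = migrated.pop("use_greedy")
        # Inverted meaning: use_greedy=False corresponds to do_sample=True.
        # An explicit do_sample already present in the config wins.
        migrated.setdefault("do_sample", not use_greedy)
    return migrated
```

For example, `migrate_sampling_params({"use_greedy": False, "temperature": 1.0})` yields a dict with `do_sample` set to `True` and `temperature` unchanged.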

examples/coding_online_rl/config/qwen2.5-1.5b-ppo.yaml

Lines changed: 1 addition & 1 deletion
@@ -86,7 +86,7 @@ algorithm:

 # params for rollout
 sampling_params:
-  use_greedy: False
+  do_sample: True
   temperature: 0.1
   top_k: 1000000
   top_p: 1.0

examples/embodiment/config/behavior_openvlaoft_eval.yaml

Lines changed: 2 additions & 2 deletions
@@ -65,7 +65,7 @@ algorithm:

 # params for rollout
 sampling_params:
-  use_greedy: False
+  do_sample: True
   temperature_train: 1.0
   temperature_eval: 0.6
   top_k: 50
@@ -87,7 +87,7 @@ env:
 queue_size: 0
 enable_offload: False

-# Override the default values in env/train or env/eval
+# Override the default values in env/behavior_r1pro
 eval:
   total_num_envs: 2
   max_episode_steps: 2000 # max episode steps for truncation

examples/embodiment/config/behavior_ppo_openvlaoft.yaml

Lines changed: 2 additions & 2 deletions
@@ -67,7 +67,7 @@ algorithm:

 # params for rollout
 sampling_params:
-  use_greedy: False
+  do_sample: True
   temperature_train: 1.0
   temperature_eval: 0.6
   top_k: 50
@@ -89,7 +89,7 @@ env:
 queue_size: 0
 enable_offload: False

-# Override the default values in env/train or env/eval
+# Override the default values in env/behavior_r1pro
 train:
   total_num_envs: 2
   max_episode_steps: 2000 # max episode steps for truncation

examples/embodiment/config/calvin_d_d_ppo_openpi.yaml

Lines changed: 1 addition & 1 deletion
@@ -67,7 +67,7 @@ algorithm:
 rewards_upper_bound: 0.9
 # params for generation
 sampling_params:
-  use_greedy: False
+  do_sample: True
   temperature_train: 1.0
   temperature_eval: 0.6
   top_k: 50

examples/embodiment/config/calvin_d_d_ppo_openpi_pi05.yaml

Lines changed: 1 addition & 1 deletion
@@ -67,7 +67,7 @@ algorithm:
 rewards_upper_bound: 0.9
 # params for generation
 sampling_params:
-  use_greedy: False
+  do_sample: True
   temperature_train: 1.0
   temperature_eval: 0.6
   top_k: 50

examples/embodiment/config/env/libero_10.yaml

Lines changed: 3 additions & 0 deletions
@@ -12,6 +12,9 @@ use_ordered_reset_state_ids: False

 use_rel_reward: True
 reward_coef: 5.0
+
+# RLinf LiberoEnv specific settings
+reset_gripper_open: True
 is_eval: False

 seed: 0

examples/embodiment/config/env/libero_130.yaml

Lines changed: 3 additions & 0 deletions
@@ -10,6 +10,9 @@ max_episode_steps: 512 # max episode steps for truncation

 use_rel_reward: True
 reward_coef: 1.0
+
+# RLinf LiberoEnv specific settings
+reset_gripper_open: True
 is_eval: False

 seed: 0
