feature(xjy): Refine PriorZero Implementation #441
Open
xiongjyu wants to merge 78 commits into opendilab:dev-multitask-balance-clean-rft from xiongjyu:dev-multitask-balance-clean-rft
Commits
a3a2d69
Fix game_segment/weighted_total_loss bugs and refine prompts, compute…
xiongjyu 959a558
Fixed the accumulate_steps bug and added cprofile functionality.
xiongjyu ecedc5f
Refine the code and fix the bug in data collection.
xiongjyu 2d53d22
Add REINFORCE-style losses and store old_logprob in the buffer.
xiongjyu c608600
Fix the get_llm_prior bug so that every action receives a logprob
xiongjyu 15e39f6
fixed the history bug in the build_llm_prompt and logs in forward_learn
xiongjyu 7c9acd9
rename advantage_tensor on rft
xiongjyu 738f300
Fixed the action out-of-bounds bug and added a record for forward_col…
xiongjyu 0a166f6
Fixed the misalignment between old_log_prob and log_prob, and correct…
xiongjyu 4f3668e
add some logs for analysis
xiongjyu 2985e60
Polish the code and standardize the format.
xiongjyu ff98006
Add KL divergence in rft and llm_prior_entropy in collect
xiongjyu 7e43e45
polish config and format
xiongjyu d6555e5
delete unused files
xiongjyu b7d42ee
Decouple the training of world_model and LLM.
xiongjyu 95e2347
add cache in the jericho
xiongjyu 9682486
Separate sync and async entry points to simplify the program.
xiongjyu 0a38197
Reference OpenRLHF’s implementation to update vLLM weights in real ti…
xiongjyu e361039
delete unused orz files
xiongjyu f957db9
fix a small bug
xiongjyu 628d7d2
Fix action='go' bug; optimize replay buffer with larger capacity; sam…
xiongjyu c16174f
fix a bug
xiongjyu 35cb4f9
Optimized log-probability computation for the CoT setting.
xiongjyu 2c67a8d
polish and format file
xiongjyu b16c3e7
Improve single/multi-process LLM training with DeepSpeed
xiongjyu 97c9843
fix the vllm bug when using torchrun
xiongjyu eb8a4bd
fix the fork bug and polish the vllm about sleep
xiongjyu 9178397
Optimized efficiency and added multiple ways to calculate advantage.
xiongjyu cb6f7cf
limit the output length when using vllm
xiongjyu 19fac8f
polish(pu): add cot-reuse in training, use running-norm in value, pol…
puyuan1996 88f047b
fix(pu): fix some bugs in reuse-collect-cot in training phase
puyuan1996 de4b2c0
polish configs and format
xiongjyu 2069e32
delete unused config
xiongjyu 3ff091e
fix not found go bug
xiongjyu 3888d8e
fix the misalignment bug when reusing cot
xiongjyu da0d0fd
make the prompt more compact
xiongjyu 5f88151
add lr warmup
xiongjyu ed89062
add warmup for training world model before training llm and AdaptiveV…
xiongjyu 96dc250
polish the implementation of profile
xiongjyu 66ac376
fix a small bug
xiongjyu a9593cd
add profile of forward_collect
xiongjyu 5d0f359
add format reward option and fix the cot gradient
xiongjyu e15e66c
rename kl/clip-ratio metrics
xiongjyu 55e66ed
Optimize the use of format rewards
xiongjyu 0cef8b8
polish the format
xiongjyu f88989b
Add a complete DDP program, including the process of collect and worl…
xiongjyu ffcf348
fix some small bugs
xiongjyu 591c420
add the result of valid_actions's prob in vllm_output and samples shu…
xiongjyu 091cc80
add the policy model for reload/offload func
xiongjyu e512707
fix a bug
xiongjyu a6bdc0e
Fixed a bug in the advantage feature and used target_value-pred_value…
xiongjyu 250ba63
Optimize parameter definitions and off-policy's implementation
xiongjyu c3af1c2
fix a small bug
xiongjyu 9d304c6
fix a small bug
xiongjyu 9654f6a
Optimize essential logging; add grad norm metrics; record metrics as …
xiongjyu 1bbcc19
fix a small bug
xiongjyu 335e16c
Fix a bug; refactor LLM prompts into system/user roles; improve CoT o…
xiongjyu acf5d04
add input/response length metrics and envstep-based tb_logger for lea…
xiongjyu 2ff4a90
add llm_prior_tempearture to apply_temperature_scaling for llm_prior
xiongjyu ea32a4a
Fixed bugs in prefix_cot & padding and aligned the input contexts of …
xiongjyu f895ed2
Remove the logits_to_keep parameter and add metrics such as entropy.
xiongjyu 4567190
tmp
xiongjyu 5098395
refine some log
xiongjyu b865dcb
polish some config
xiongjyu 82e1d29
Fixed the bug caused by fmt_weight being 1
xiongjyu 7c9c922
delete unused file
xiongjyu 2dcc4ae
fix score, action_str, and timestep_dict bug, and add 3 modes of Prio…
xiongjyu dbec27c
fix self.history_buffer bug and refine logs
xiongjyu e71eb61
fix some bug when running unizero
xiongjyu e3f7cfd
fix a bug in evaluator and add enable_rft/wm to control what models n…
xiongjyu 7377220
add bash scripts to run priorzero and format files
xiongjyu d5c4923
tmp
xiongjyu 3caf28f
add priorzero README.md
xiongjyu cfc000a
refine cot prefix and add mcts_root_logits_dict
xiongjyu 8bec5f7
tmp
xiongjyu f49c9d9
add user_prompt_dict to control user_prompt about reward/valid_action…
xiongjyu 9a459f8
fix the bug of pretrained_model and some small bugs
xiongjyu 38efe3d
fix the evaluator_env_num when stepping into the _init_eval
xiongjyu