Check in prototype notebook #102
I've skimmed it and would want to take another pass, but at a high level I think it covers a lot of good ground. My one concern is that it leads with a lot of infra concepts that won't be grounded in anything for a reader whose background is in RL. It could be easier to grasp on a first read if it introduces a few high-level concepts (services, multiprocessing, simple orchestration, data movement, etc.) and ties them into RL before going deeper on each part.
"\n",
"This is the role of \"rollouts\" - creating the dataset used to update our policy. Rather than training on a static dataset, RL dynamically generates training data by having the current policy interact with the environment.\n",
"\n",
"Let's build a step-by-step synchronous training loop to see how these services work together. The basic RL cycle is:\n",
I think this loop is too generic. It doesn't make it obvious why off-the-shelf services, like those found in other libraries, couldn't work instead.
- Collect Experience: run tasks, tools, and multi-step workflows
- Compute Rewards: run verifiers and judges to calculate rewards
- Store Experience: add the episode to our replay buffer
- Sample & Train: sample a batch and update the policy in parallel
- Broadcast Policy: overlap the policy broadcast so services can update from it
- Repeat: Continue this cycle to improve the policy
This may go too far into incomplete features, but perhaps something in between would work.
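To make the suggested steps concrete, here is a rough, fully synchronous sketch of the cycle in plain Python. Everything in it is hypothetical and not taken from the notebook: `ToyPolicy`, `Episode`, `collect_experience`, `compute_reward`, `train_step`, `broadcast`, and `run_sync_loop` are illustrative names, the "policy" is just one score per action, and the reward is a trivial exact-match check standing in for verifiers and judges. It does not model the parallel update or the overlapped broadcast mentioned in the list.

```python
import random
from collections import deque
from dataclasses import dataclass


@dataclass
class Episode:
    """One collected interaction (hypothetical structure)."""
    prompt: int
    action: int
    policy_version: int
    reward: float = 0.0


class ToyPolicy:
    """Hypothetical stand-in for a trainable policy: one score per action."""

    def __init__(self, num_actions: int) -> None:
        self.scores = [0.0] * num_actions
        self.version = 0

    def act(self, rng: random.Random, epsilon: float = 0.2) -> int:
        # Epsilon-greedy: usually pick the best-scoring action, sometimes explore.
        if rng.random() < epsilon:
            return rng.randrange(len(self.scores))
        return max(range(len(self.scores)), key=self.scores.__getitem__)


def collect_experience(policy: ToyPolicy, prompt: int, rng: random.Random) -> Episode:
    # Collect Experience: the generator-side policy produces an action.
    return Episode(prompt=prompt, action=policy.act(rng), policy_version=policy.version)


def compute_reward(episode: Episode, target_action: int) -> float:
    # Compute Rewards: assumed exact-match check in place of verifiers and judges.
    return 1.0 if episode.action == target_action else 0.0


def train_step(policy: ToyPolicy, batch: list, lr: float = 0.5) -> None:
    # Sample & Train: move each action's score toward the reward it received.
    for ep in batch:
        policy.scores[ep.action] += lr * (ep.reward - policy.scores[ep.action])
    policy.version += 1


def broadcast(trainer: ToyPolicy, generator: ToyPolicy) -> None:
    # Broadcast Policy: copy the trainer's state to the generator-side copy.
    generator.scores = list(trainer.scores)
    generator.version = trainer.version


def run_sync_loop(steps: int = 50, seed: int = 0):
    rng = random.Random(seed)
    trainer = ToyPolicy(num_actions=3)
    generator = ToyPolicy(num_actions=3)
    replay_buffer = deque(maxlen=64)
    for step in range(steps):  # Repeat
        episode = collect_experience(generator, prompt=step, rng=rng)
        episode.reward = compute_reward(episode, target_action=2)
        replay_buffer.append(episode)  # Store Experience
        batch = rng.sample(list(replay_buffer), k=min(8, len(replay_buffer)))
        train_step(trainer, batch)
        broadcast(trainer, generator)
    return trainer, generator, replay_buffer


if __name__ == "__main__":
    trainer, generator, buffer = run_sync_loop()
    print("policy version:", trainer.version, "episodes stored:", len(buffer))
```

Because each step blocks on the previous one, a sketch like this could be backed by almost any generic service layer. That gap is what the list above is meant to highlight.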
Checks in a notebook for our initial prototype!