@allenwang28 allenwang28 commented Aug 31, 2025

Checks in a notebook for our initial prototype!

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Aug 31, 2025
@allenwang28 allenwang28 changed the title from "[not for land] - service env" to "Check in prototype notebook" Sep 1, 2025
@allenwang28 allenwang28 marked this pull request as ready for review September 1, 2025 19:09

@pbontrager pbontrager left a comment

I've skimmed it and want to take another pass, but at a high level I think it covers a lot of good ground. My one concern is that it leads with a lot of infra concepts that won't be grounded in anything for a reader whose background is in RL. It could be easier to grasp on a first read if it introduced a few high-level concepts (services, multiprocessing, simple orchestration, data movement, etc.), tied them into RL, and only then went deeper on each part.

"\n",
"This is the role of \"rollouts\" - creating the dataset used to update our policy. Rather than training on a static dataset, RL dynamically generates training data by having the current policy interact with the environment.\n",
"\n",
"Let's build a step-by-step synchronous training loop to see how these services work together. The basic RL cycle is:\n",

I think this loop is too generic: it doesn't make it obvious why off-the-shelf services, like those found in other libraries, couldn't work instead.

  1. Collect Experience: run tasks, tools, and multi-step workflows
  2. Compute Rewards: run verifiers and judges to calculate rewards
  3. Store Experience: Add the episode to our replay buffer
  4. Sample & Train: Sample a batch and update the policy in parallel
  5. Broadcast Policy: overlap the policy broadcast with training so services can pick up the update
  6. Repeat: Continue this cycle to improve the policy

This may go too far into incomplete features, but maybe something in between would work.
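
To make the six steps above concrete, here is a minimal single-process sketch in plain Python. Everything in it is a hypothetical stand-in rather than the notebook's or the library's API: `ToyPolicy`, `compute_reward`, `train_step`, and `run_loop` are names invented for illustration, the "environment" is a four-action toy task, and the broadcast is a plain list copy. It runs synchronously, so it does not show the parallel training or overlapped broadcast mentioned in steps 4 and 5.

```python
import random
from collections import deque
from dataclasses import dataclass


@dataclass
class Episode:
    action: int
    reward: float
    policy_version: int  # which policy version generated this episode


class ToyPolicy:
    """Hypothetical stand-in for a generator service: picks one of n actions."""

    def __init__(self, n_actions: int, rng: random.Random):
        self.values = [0.0] * n_actions  # estimated value per action
        self.version = 0
        self.rng = rng

    def act(self, epsilon: float = 0.2) -> int:
        # Epsilon-greedy: mostly exploit the best-known action, sometimes explore.
        if self.rng.random() < epsilon:
            return self.rng.randrange(len(self.values))
        return max(range(len(self.values)), key=self.values.__getitem__)


def compute_reward(action: int, target: int = 2) -> float:
    """Hypothetical verifier: reward 1.0 only for the target action."""
    return 1.0 if action == target else 0.0


def train_step(values, batch, lr: float = 0.5):
    """Hypothetical trainer: move each value estimate toward the observed reward."""
    new_values = list(values)
    for ep in batch:
        new_values[ep.action] += lr * (ep.reward - new_values[ep.action])
    return new_values


def run_loop(steps: int = 1000, seed: int = 0) -> ToyPolicy:
    rng = random.Random(seed)
    policy = ToyPolicy(n_actions=4, rng=rng)
    trainer_values = list(policy.values)  # the trainer's own copy of the weights
    replay_buffer = deque(maxlen=64)

    for _ in range(steps):  # 6. Repeat
        action = policy.act()  # 1. Collect Experience
        reward = compute_reward(action)  # 2. Compute Rewards
        replay_buffer.append(Episode(action, reward, policy.version))  # 3. Store Experience
        batch = rng.sample(list(replay_buffer), k=min(8, len(replay_buffer)))
        trainer_values = train_step(trainer_values, batch)  # 4. Sample & Train
        policy.values = list(trainer_values)  # 5. Broadcast Policy
        policy.version += 1
    return policy
```

Calling `run_loop()` returns the final `ToyPolicy`; its `version` counts the broadcasts, and `values` holds the learned per-action estimates. Keeping `trainer_values` separate from `policy.values` mirrors the trainer/generator split, which is the piece a generic loop tends to hide.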

@allenwang28 allenwang28 merged commit d4011ea into meta-pytorch:main Sep 2, 2025
4 checks passed