Zero to Forge Tutorials #300
Thanks Sanyam, this is awesome! Had a bunch of comments but I do generally like the flow and where we're at with this
## Enter Forge: RL-Native Architecture

Forge solves these problems by treating each RL component as an **independent, scalable service**
by treating each RL component as an independent, scalable service
@init27 - we have actually made pieces like the Trainer a regular actor, since its recovery semantics are unclear, whereas handling a vLLM generator going down is pretty straightforward. We could have torchft-style recovery, but it's unclear how this would affect RL numerics (and we just haven't had time to add it yet)
It's also why you see some `call_one()`, `call()` (actor APIs) mixed in with `route()` and `fanout()`, which are service APIs, in the real RL training step. It's a logical choice, but hard to tell the story at the 101 level. I'm wondering if this is something we should mention at all, or if you have any ideas for how to simplify.
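One possible way to tell this story at the 101 level is a toy model of the two call styles. The sketch below is not Forge code: `ToyServiceEndpoint`, `ToyActorEndpoint`, `make_generator`, and `train_step` are hypothetical names, and the semantics are assumptions (`route()` picks one replica, `fanout()` hits every replica, `call_one()` invokes a single actor with no replicas behind it).

```python
import asyncio
import itertools


class ToyServiceEndpoint:
    """Hypothetical stand-in for a service endpoint backed by several replicas."""

    def __init__(self, replicas):
        self._replicas = replicas
        self._next = itertools.cycle(range(len(replicas)))

    async def route(self, *args):
        # Assumed semantics: send the request to ONE replica (round-robin here).
        return await self._replicas[next(self._next)](*args)

    async def fanout(self, *args):
        # Assumed semantics: send the request to ALL replicas and collect results.
        return await asyncio.gather(*(r(*args) for r in self._replicas))


class ToyActorEndpoint:
    """Hypothetical stand-in for a plain actor endpoint (no replicas, no failover)."""

    def __init__(self, fn):
        self._fn = fn

    async def call_one(self, *args):
        # Assumed semantics: invoke the single actor directly.
        return await self._fn(*args)


def make_generator(replica_id):
    # Each fake generator replica tags its output with its own id.
    async def generate(prompt):
        return f"replica{replica_id}:{prompt}"

    return generate


async def train_step(batch):
    # Fake trainer step: reports a placeholder loss and the batch size.
    return {"loss": 0.0, "batch_size": len(batch)}


async def demo():
    generate = ToyServiceEndpoint([make_generator(i) for i in range(2)])
    trainer = ToyActorEndpoint(train_step)

    first = await generate.route("a")         # handled by one replica
    second = await generate.route("b")        # handled by the next replica
    everyone = await generate.fanout("sync")  # handled by every replica
    metrics = await trainer.call_one([first, second])  # single actor
    return first, second, everyone, metrics


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The point of the toy is only the shape of the call sites: replicated components are addressed through a router, while the trainer is addressed directly because there is nothing to fail over to.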
Thanks, yes, it was quite hard to get this right. For now I've mentioned some high-level details and acknowledged that we cover them properly in part 2.
Open to any other suggestions
This looks great! Just a partial review for now, but will come back for the rest later
Thanks so much Allen! Will wait for Evan's review and then will address all the comments to land this. Thanks for your time both!
Co-authored-by: Allen Wang <[email protected]>
Can we move the files under the https://github.com/meta-pytorch/forge/tree/main/docs/source/tutorial_sources directory and list them in the https://github.com/meta-pytorch/forge/blob/main/docs/source/tutorials.md toctree? Also, if you want to make them executable and have a link to a Google Colab, you could convert them into .py files using this template: https://github.com/meta-pytorch/forge/blob/main/docs/source/tutorial_sources/template_tutorial.py
@allenwang28 @ebsmothers - Sorry for being slow because of the flu. I've just addressed all comments and I think we are good to merge. Once we merge, I will work with @svekars to maybe create excalidraw diagrams for the docs
Tutorials look great. Thanks for working on them! Approving to unblock. Please check the comments I left.
Thanks so much for the detailed review and awesome guidance @allenwang28, @ebsmothers and @felipemello1. I've addressed all comments and am merging now. I'll trouble you for examples soon :)
Co-authored-by: Allen Wang <[email protected]>
Hi team,
As promised, here is the three part tutorial series covering (written in MD):