**Key difference**: Same RL logic, but each component is now a distributed, fault-tolerant, auto-scaling service.
Did you realise we are not writing any infra code here? TorchForge automagically handles the details behind the scenes so you can focus on writing your RL algorithms!
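To make the contrast concrete, here is a minimal runnable sketch of the idea. The `_Stub` class and the lambdas are purely illustrative stand-ins (not the real TorchForge API) - only the `.route()` / `.call_one()` call shape mirrors what the rest of this tutorial uses.

```python
# Minimal illustrative sketch - local stand-ins, not real TorchForge services.
import asyncio

class _Stub:
    """Wraps a plain function so we can mimic service/actor call adverbs."""
    def __init__(self, fn):
        self._fn = fn

    async def route(self, *args):      # "send to any healthy replica"
        return self._fn(*args)

    async def call_one(self, *args):   # "call this specific actor"
        return self._fn(*args)

async def rl_step(policy_service, reward_service, trainer, prompt):
    # The RL logic reads exactly like a single-process loop; only the call
    # sites become awaited, routed calls to remote components.
    response = await policy_service.route(prompt)
    reward = await reward_service.route(prompt, response)
    return await trainer.call_one([(prompt, response, reward)])

result = asyncio.run(rl_step(
    _Stub(lambda p: f"answer to: {p}"),
    _Stub(lambda p, r: 1.0),
    _Stub(lambda batch: f"trained on {len(batch)} episode(s)"),
    "What is 2+2?",
))
print(result)
```

Swapping the stubs for real, distributed replicas would not change the shape of the algorithm at all.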
## Why This Matters: Traditional ML Infrastructure Fails
Each step has different:
- **Latency requirements**: Policy inference needs low latency (each episode waits on it), while training can batch multiple episodes together
- **Scaling patterns**: You need N policy replicas to keep the trainer busy, plus different sharding strategies (tensor parallelism for training vs. replicated inference)
- **Failure modes**: Any component failure cascades to halt the entire pipeline (TorchForge prevents this with automatic failover)
- **Resource utilization**: GPUs for inference and training, CPUs for data processing (see the configuration sketch below)
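To make that heterogeneity concrete, here is a small sketch. The component names, replica counts, and hardware choices below are illustrative assumptions rather than values from a real TorchForge deployment - the point is simply that no single deployment template fits all of these components.

```python
# Illustrative only - made-up numbers, not a real TorchForge configuration.
component_requirements = {
    "policy_inference": {"hardware": "GPU", "replicas": 8, "latency": "low (each episode waits)"},
    "trainer":          {"hardware": "GPU", "replicas": 1, "latency": "tolerant (batches episodes)"},
    "reward_model":     {"hardware": "GPU", "replicas": 2, "latency": "medium"},
    "data_processing":  {"hardware": "CPU", "replicas": 4, "latency": "tolerant"},
}

for name, req in component_requirements.items():
    print(f"{name:>16}: {req['replicas']} x {req['hardware']}, latency budget: {req['latency']}")
```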
### Problem 3: The Coordination Challenge
```python
def naive_rl_step():
    ...
    entire_system_stops()
```
## Enter TorchForge: RL-Native Architecture
TorchForge solves these problems by treating each RL component as an **independent, distributed unit** - some as fault-tolerant services (like Policy inference, where failures are easy to handle), others as actors (like Trainers, where recovery semantics differ).
Let's see how core RL concepts map to TorchForge components (you'll notice a mix of `.route()` for services and `.call_one()` for actors - we cover when to use each in Part 2):
**Quick API Reference:** (covered in detail in Part 2: Service Communication Patterns)
- `.route()` - Send a request to any healthy replica in a service (load balanced)
```python
async def real_rl_training_step(services, step):
    """Single RL step using verified TorchForge APIs"""

    # 1. Environment interaction - Using actual DatasetActor API
    ...
```
`docs/source/tutorial_sources/zero-to-forge/2_Forge_Internals.md`
# Part 2: Peeling Back the Abstraction - What Are Services?
We highly recommend reading [Part 1](./1_RL_and_Forge_Fundamentals) before this; it explains RL concepts and how they land in TorchForge.
Now that you've seen the power of the service abstraction, let's understand what's actually happening under the hood. Grab your chai!
### 1. Real Service Configuration
Here's the actual ServiceConfig from TorchForge source code:
```python
# Configuration pattern from apps/grpo/main.py:
...
```
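As a rough illustration of the kind of knobs such a per-service configuration carries, here is a hedged sketch. The dataclass and its field names are assumptions made up for this example - they are not the real `ServiceConfig` fields, so consult `apps/grpo/main.py` in the repository for the authoritative pattern.

```python
# Hedged sketch - NOT the real ServiceConfig; field names are assumptions.
from dataclasses import dataclass

@dataclass
class ToyServiceConfig:
    num_replicas: int        # how many copies of the actor the service manages
    procs_per_replica: int   # processes backing each replica
    with_gpus: bool          # whether each replica is pinned to a GPU

# Different components want very different settings (see Part 1):
policy_cfg = ToyServiceConfig(num_replicas=4, procs_per_replica=1, with_gpus=True)
trainer_cfg = ToyServiceConfig(num_replicas=1, procs_per_replica=8, with_gpus=True)
print(policy_cfg)
print(trainer_cfg)
```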
### 3. How Services Actually Work
TorchForge services are implemented as ServiceActors that manage collections of your ForgeActor replicas.
When you call `.as_service()`, TorchForge creates a `ServiceInterface` that manages N replicas of your `ForgeActor` class and gives you methods like `.route()`, `.fanout()`, etc.
```python
...
# But TorchForge handles all the complexity of replica management, load balancing, and fault tolerance
```
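As a self-contained toy (not the real TorchForge implementation), the following mimics the behavior just described: one interface object fronting N replicas, with `.route()` hitting exactly one replica and `.fanout()` broadcasting to all of them. The class names are invented for this example.

```python
# Toy illustration - invented classes, not the real ServiceInterface.
import asyncio
import itertools

class ToyReplica:
    def __init__(self, idx: int):
        self.idx = idx

    async def generate(self, prompt: str) -> str:
        return f"replica-{self.idx}: {prompt}"

class ToyServiceInterface:
    def __init__(self, num_replicas: int):
        self.replicas = [ToyReplica(i) for i in range(num_replicas)]
        self._round_robin = itertools.cycle(self.replicas)  # toy "load balancer"

    async def route(self, prompt: str) -> str:
        # Send the request to exactly one replica.
        return await next(self._round_robin).generate(prompt)

    async def fanout(self, prompt: str) -> list[str]:
        # Broadcast to every replica and collect all results.
        return await asyncio.gather(*(r.generate(prompt) for r in self.replicas))

async def demo():
    svc = ToyServiceInterface(num_replicas=3)
    print(await svc.route("What is 2+2?"))  # handled by a single replica
    print(await svc.fanout("ping"))         # handled by all replicas

asyncio.run(demo())
```

The real `ServiceInterface` does far more (replica management, load balancing, fault tolerance, as noted above), but the calling contract you program against looks like this.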
## Communication Patterns: Quick Reference
## Deep Dive: Service Communication Patterns
These communication patterns ("adverbs") determine how your service calls are routed to replicas. Understanding when to use each pattern is key to effective TorchForge usage.
### 1. `.route()` - Load Balanced Single Replica
Behind the scenes:
- **Throughput**: Limited by single replica capacity
- **Fault tolerance**: Automatic failover to other replicas
**Critical insight**: `.route()` is your default choice for stateless operations in TorchForge services.
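The fault-tolerance bullet deserves a closer look, so here is a hedged sketch of the failover idea - not TorchForge's actual routing code, and the classes are invented for this example: if the replica a call lands on has failed, the call is retried on another replica and the caller never sees the failure.

```python
# Hedged sketch of routing with failover - illustrative only.
import asyncio
import random

class FlakyReplica:
    def __init__(self, idx: int, healthy: bool):
        self.idx, self.healthy = idx, healthy

    async def generate(self, prompt: str) -> str:
        if not self.healthy:
            raise RuntimeError(f"replica-{self.idx} is down")
        return f"replica-{self.idx}: {prompt}"

async def route_with_failover(replicas, prompt: str) -> str:
    # Try the replicas in a random order; skip any that fail.
    for replica in random.sample(replicas, k=len(replicas)):
        try:
            return await replica.generate(prompt)
        except RuntimeError:
            continue  # fail over to the next candidate replica
    raise RuntimeError("no healthy replicas available")

replicas = [FlakyReplica(0, healthy=False), FlakyReplica(1, healthy=True)]
print(asyncio.run(route_with_failover(replicas, "What is 2+2?")))
```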
### 2. `.fanout()` - Broadcast with Results Collection
`docs/source/tutorial_sources/zero-to-forge/3_Monarch_101.md`
# Part 3: The TorchForge-Monarch Connection
This is Part 3 of our series. In the previous sections we covered Part 1: [RL Concepts and how they map to TorchForge](./1_RL_and_Forge_Fundamentals) and Part 2: [TorchForge Internals](./2_Forge_Internals).
Now let's peel back the layers. TorchForge services are built on top of **Monarch**, PyTorch's distributed actor framework. Understanding this connection is crucial for optimization and debugging.
## The Complete Hierarchy: Service to Silicon
The hierarchy starts in your RL code with a single call like `await policy_service.generate.route('What is 2+2?')`, which flows into the TorchForge service layer, where a `ServiceInterface` routes the request, load balances across replicas, and performs health checks.
`docs/source/zero-to-forge-intro.md`
# Zero to TorchForge: From RL Theory to Production-Scale Implementation
A comprehensive guide for ML Engineers building distributed RL systems for language models.
This section is currently structured in 3 detailed parts:
1. [RL Fundamentals and Understanding TorchForge Terminology](tutorials/zero-to-forge/1_RL_and_Forge_Fundamentals): Gives a quick refresher of Reinforcement Learning and teaches you the TorchForge fundamentals
2. [TorchForge Internals](tutorials/zero-to-forge/2_Forge_Internals): Goes a layer deeper and explains the internals of TorchForge
3. [Monarch 101](tutorials/zero-to-forge/3_Monarch_101): A 101 on Monarch and how TorchForge talks to Monarch
Each part builds upon the next, and the entire section can be consumed in roughly an hour - grab a chai and enjoy!