Commit bcd86f0

Replace Forge with TorchForge (#432)
1 parent 9afb769 commit bcd86f0

File tree

5 files changed (+54, -54 lines)
Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 Tutorials
 =========
 
-This gallery contains tutorials and examples to help you get started with Forge.
+This gallery contains tutorials and examples to help you get started with TorchForge.
 Each tutorial demonstrates specific features and use cases with practical examples.

docs/source/tutorial_sources/zero-to-forge/1_RL_and_Forge_Fundamentals.md

Lines changed: 20 additions & 20 deletions
@@ -1,8 +1,8 @@
-# Part 1: RL Fundamentals - Using Forge Terminology
+# Part 1: RL Fundamentals - Using TorchForge Terminology
 
-## Core RL Components in Forge
+## Core RL Components in TorchForge
 
-Let's start with a simple math tutoring example to understand RL concepts with the exact names Forge uses:
+Let's start with a simple math tutoring example to understand RL concepts with the exact names TorchForge uses:
 
 ### The Toy Example: Teaching Math

@@ -30,7 +30,7 @@ graph TD
     style Trainer fill:#E91E63
 ```
 
-### RL Components Defined (Forge Names)
+### RL Components Defined (TorchForge Names)
 
 1. **Dataset**: Provides questions/prompts (like "What is 2+2?")
 2. **Policy**: The AI being trained (generates answers like "The answer is 4")
@@ -66,12 +66,12 @@ def conceptual_rl_step():
     if batch is not None:
         trainer.train_step(batch)  # Student gets better!
 
-    # 🔄 See complete working example below with actual Forge service calls
+    # 🔄 See complete working example below with actual TorchForge service calls
 ```
 
-## From Concepts to Forge Services
+## From Concepts to TorchForge Services
 
-Here's the key insight: **Each RL component becomes a Forge service**. The toy example above maps directly to Forge:
+Here's the key insight: **Each RL component becomes a TorchForge service**. The toy example above maps directly to TorchForge:
 
 ```mermaid
 graph LR
@@ -85,7 +85,7 @@ graph LR
         C6["Trainer"]
     end
 
-    subgraph Services["Forge Services (Real Classes)"]
+    subgraph Services["TorchForge Services (Real Classes)"]
         direction TB
         S1["DatasetActor"]
         S2["Policy"]
@@ -108,9 +108,9 @@ graph LR
     style S3 fill:#FF9800
 ```
 
-### RL Step with Forge Services
+### RL Step with TorchForge Services
 
-Let's look at the example from above again, but this time we would use the names from Forge:
+Let's look at the example from above again, but this time using TorchForge's names:
 
 ```python
 # Conceptual Example
@@ -151,7 +151,7 @@ async def conceptual_forge_rl_step(services, step):
 
 **Key difference**: Same RL logic, but each component is now a distributed, fault-tolerant, auto-scaling service.
 
-Did you realise-we are not worrying about any Infra code here! Forge Automagically handles the details behind the scenes and you can focus on writing your RL Algorthms!
+Did you realise we aren't worrying about any infra code here? TorchForge automagically handles the details behind the scenes so you can focus on writing your RL algorithms!
 
 
 ## Why This Matters: Traditional ML Infrastructure Fails
@@ -196,7 +196,7 @@ graph LR
 Each step has different:
 - **Latency requirements**: Policy inference needs low latency (each episode waits), training can batch multiple episodes together
 - **Scaling patterns**: Need N policy replicas to keep trainer busy, plus different sharding strategies (tensor parallel for training vs replicated inference)
-- **Failure modes**: Any component failure cascades to halt the entire pipeline (Forge prevents this with automatic failover)
+- **Failure modes**: Any component failure cascades to halt the entire pipeline (TorchForge prevents this with automatic failover)
 - **Resource utilization**: GPUs for inference/training, CPUs for data processing
 
 ### Problem 3: The Coordination Challenge
@@ -218,11 +218,11 @@ def naive_rl_step():
     entire_system_stops()
 ```
 
-## Enter Forge: RL-Native Architecture
+## Enter TorchForge: RL-Native Architecture
 
-Forge solves these problems by treating each RL component as an **independent, distributed unit** - some as fault-tolerant services (like Policy inference where failures are easy to handle), others as actors (like Trainers where recovery semantics differ)
+TorchForge solves these problems by treating each RL component as an **independent, distributed unit** - some as fault-tolerant services (like Policy inference, where failures are easy to handle), others as actors (like Trainers, where recovery semantics differ).
 
-Let's see how core RL concepts map to Forge components (you'll notice a mix of `.route()` for services and `.call_one()` for actors - we cover when to use each in Part 2):
+Let's see how core RL concepts map to TorchForge components (you'll notice a mix of `.route()` for services and `.call_one()` for actors - we cover when to use each in Part 2):
 
 **Quick API Reference:** (covered in detail in Part 2: Service Communication Patterns)
 - `.route()` - Send request to any healthy replica in a service (load balanced)
@@ -231,7 +231,7 @@ Let's see how core RL concepts map to Forge components (you'll notice a mix of `
 
 ```python
 async def real_rl_training_step(services, step):
-    """Single RL step using verified Forge APIs"""
+    """Single RL step using verified TorchForge APIs"""
 
     # 1. Environment interaction - Using actual DatasetActor API
     sample = await services['dataloader'].sample.call_one()
@@ -280,7 +280,7 @@ responses = await policy.generate.route(prompt=question)
 answer = responses[0].text  # responses is list[Completion]
 ```
 
-Forge handles behind the scenes:
+TorchForge handles behind the scenes:
 - Routing to least loaded replica
 - GPU memory management
 - Batch optimization
@@ -361,9 +361,9 @@ group_size = 1
 )
 ```
 
-**Forge Components: Services vs Actors**
+**TorchForge Components: Services vs Actors**
 
-Forge has two types of distributed components:
+TorchForge has two types of distributed components:
 - **Services**: Multiple replicas with automatic load balancing (like Policy, RewardActor)
 - **Actors**: Single instances that handle their own internal distribution (like RLTrainer, ReplayBuffer)

@@ -378,7 +378,7 @@ We cover this distinction in detail in Part 2, but for now this explains the sca
 # If a policy replica fails:
 responses = await policy.generate.route(prompt=question)
 answer = responses[0].text
-# -> Forge automatically routes to healthy replica
+# -> TorchForge automatically routes to healthy replica
 # -> Failed replica respawns in background
 # -> No impact on training loop
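The failover behavior this hunk describes can be illustrated with a tiny, self-contained toy in plain Python. `Replica`, `ToyService`, and the least-loaded routing rule below are illustrative assumptions, not TorchForge code - just a sketch of the `.route()` semantics:

```python
# Toy sketch of ".route()" semantics: pick the least-loaded healthy replica,
# and transparently fail over when a replica goes down. Not TorchForge code.

class Replica:
    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.load = 0

    def generate(self, prompt):
        if not self.healthy:
            raise RuntimeError(f"{self.name} is down")
        self.load += 1
        return f"{self.name}: answer to {prompt!r}"

class ToyService:
    """Routes each call to the least-loaded healthy replica."""
    def __init__(self, replicas):
        self.replicas = replicas

    def route(self, prompt):
        healthy = [r for r in self.replicas if r.healthy]
        if not healthy:
            raise RuntimeError("no healthy replicas")
        target = min(healthy, key=lambda r: r.load)
        return target.generate(prompt)

service = ToyService([Replica("replica-0"), Replica("replica-1")])
service.replicas[0].healthy = False      # simulate a replica failure
answer = service.route("What is 2+2?")   # transparently lands on replica-1
```

The caller never sees the failure: the service picks another replica, which is exactly the "no impact on training loop" property claimed above.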

docs/source/tutorial_sources/zero-to-forge/2_Forge_Internals.md

Lines changed: 14 additions & 14 deletions
@@ -1,6 +1,6 @@
 # Part 2: Peeling Back the Abstraction - What Are Services?
 
-We highly recommend reading [Part 1](./1_RL_and_Forge_Fundamentals) before this, it explains RL Concepts and how they land in Forge.
+We highly recommend reading [Part 1](./1_RL_and_Forge_Fundamentals) before this; it explains RL concepts and how they land in TorchForge.
 
 Now that you see the power of the service abstraction, let's understand what's actually happening under the hood. Grab your chai!

@@ -49,7 +49,7 @@ graph TD
 
 ### 1. Real Service Configuration
 
-Here's the actual ServiceConfig from Forge source code:
+Here's the actual ServiceConfig from TorchForge source code:
 
 ```python
 # Configuration pattern from apps/grpo/main.py:
@@ -106,14 +106,14 @@ await policy.shutdown()
 
 ### 3. How Services Actually Work
 
-Forge services are implemented as ServiceActors that manage collections of your ForgeActor replicas:
+TorchForge services are implemented as ServiceActors that manage collections of your ForgeActor replicas:
 
-When you call `.as_service()`, Forge creates a `ServiceInterface` that manages N replicas of your `ForgeActor` class and gives you methods like `.route()`, `.fanout()`, etc.
+When you call `.as_service()`, TorchForge creates a `ServiceInterface` that manages N replicas of your `ForgeActor` class and gives you methods like `.route()`, `.fanout()`, etc.
 
 ```python
 # Your code sees this simple interface:
 responses = await policy.generate.route(prompt=prompt)
-# But Forge handles all the complexity of replica management, load balancing, and fault tolerance
+# But TorchForge handles all the complexity of replica management, load balancing, and fault tolerance
 ```
 
 ## Communication Patterns: Quick Reference
@@ -159,7 +159,7 @@ graph LR
 
 ## Deep Dive: Service Communication Patterns
 
-These communication patterns ("adverbs") determine how your service calls are routed to replicas. Understanding when to use each pattern is key to effective Forge usage.
+These communication patterns ("adverbs") determine how your service calls are routed to replicas. Understanding when to use each pattern is key to effective TorchForge usage.
 
 ### 1. `.route()` - Load Balanced Single Replica

@@ -181,7 +181,7 @@ Behind the scenes:
 - **Throughput**: Limited by single replica capacity
 - **Fault tolerance**: Automatic failover to other replicas
 
-**Critical insight**: `.route()` is your default choice for stateless operations in Forge services.
+**Critical insight**: `.route()` is your default choice for stateless operations in TorchForge services.
 
 ### 2. `.fanout()` - Broadcast with Results Collection

@@ -346,10 +346,10 @@ async def optimized_multi_turn():
 
 **The challenge**: Multiple trainers and experience collectors reading/writing concurrently.
 
-**Real Forge approach**: The ReplayBuffer actor handles concurrency internally:
+**Real TorchForge approach**: The ReplayBuffer actor handles concurrency internally:
 
 ```python
-# Forge ReplayBuffer endpoints (verified from source code)
+# TorchForge ReplayBuffer endpoints (verified from source code)
 # Add episodes (thread-safe by actor model)
 await replay_buffer.add.call_one(episode)  # .choose() would work too, but .call_one() clarifies it's a singleton actor not ActorMesh
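The "handles concurrency internally" contract can be sketched with a minimal toy buffer. `ToyReplayBuffer` and its `min_size` gate are assumptions for illustration, not the TorchForge ReplayBuffer; the one behavior borrowed from the text is that sampling returns `None` until enough episodes exist:

```python
# Toy single-owner replay buffer: all access goes through one object, echoing
# the actor model where one mailbox serializes reads and writes. Not TorchForge code.
import asyncio
import random

class ToyReplayBuffer:
    def __init__(self, min_size=4):
        self.episodes = []
        self.min_size = min_size

    async def add(self, episode):
        self.episodes.append(episode)

    async def sample(self, batch_size):
        # Not enough data yet: return None and let the caller retry later
        if len(self.episodes) < self.min_size:
            return None
        return random.sample(self.episodes, batch_size)

async def demo():
    buf = ToyReplayBuffer()
    cold = await buf.sample(2)        # buffer still warming up -> None
    for i in range(4):
        await buf.add({"id": i})
    warm = await buf.sample(2)        # now returns a real batch
    return cold, warm

cold, warm = asyncio.run(demo())
```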

@@ -372,7 +372,7 @@ batch = await replay_buffer.sample.call_one(
 **The challenge**: Trainer updates policy weights, but policy service needs those weights.
 
 ```python
-# Forge weight synchronization pattern from apps/grpo/main.py
+# TorchForge weight synchronization pattern from apps/grpo/main.py
 async def real_weight_sync(trainer, policy, step):
     # Trainer pushes weights to TorchStore with version number
     await trainer.push_weights.call_one(policy_version=step + 1)
@@ -388,11 +388,11 @@ print(f"Current policy version: {current_version}")
 
 ## Deep Dive: Asynchronous Coordination Patterns
 
-**The real challenge**: Different services run at different speeds, but Forge's service abstraction handles the coordination complexity.
+**The real challenge**: Different services run at different speeds, but TorchForge's service abstraction handles the coordination complexity.
 
-### The Forge Approach: Let Services Handle Coordination
+### The TorchForge Approach: Let Services Handle Coordination
 
-Instead of manual coordination, Forge services handle speed mismatches automatically:
+Instead of manual coordination, TorchForge services handle speed mismatches automatically:
 
 ```python
 from apps.grpo.main import Episode, Group
@@ -556,7 +556,7 @@ await reward_actor.shutdown()
 Now let's see how services coordinate in a real training loop:
 
 ```python
-# This is the REAL way production RL systems are built with Forge
+# This is the REAL way production RL systems are built with TorchForge
 
 import asyncio
 import torch
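The versioned weight-sync pattern in the hunks above (trainer pushes weights under a version number, policy swaps to that version) can be mimicked with a small asyncio sketch. `ToyWeightStore` and `ToyPolicy` are hypothetical stand-ins for TorchStore and the policy service, not real APIs:

```python
# Toy versioned weight sync: the trainer publishes weights under a version
# number and the policy atomically swaps to the requested version. Assumed
# names; not TorchForge/TorchStore code.
import asyncio

class ToyWeightStore:
    def __init__(self):
        self._versions = {}

    async def push(self, version, weights):
        self._versions[version] = weights

    async def get(self, version):
        return self._versions[version]

class ToyPolicy:
    def __init__(self, store):
        self.store = store
        self.version = 0
        self.weights = None

    async def update_weights(self, version):
        # Swap weights and version together so readers never see a mix
        self.weights = await self.store.get(version)
        self.version = version

async def weight_sync(store, policy, step):
    await store.push(step + 1, {"step": step + 1})  # trainer side
    await policy.update_weights(step + 1)           # policy side

store = ToyWeightStore()
policy = ToyPolicy(store)
asyncio.run(weight_sync(store, policy, step=0))
```

Keying weights by version is what lets inference replicas keep serving an old version until they explicitly opt into the new one.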

docs/source/tutorial_sources/zero-to-forge/3_Monarch_101.md

Lines changed: 15 additions & 15 deletions
@@ -1,8 +1,8 @@
-# Part 3: The Forge-Monarch Connection
+# Part 3: The TorchForge-Monarch Connection
 
-This is part 3 of our series, in the previous sections: we learned Part 1: [RL Concepts and how they map to Forge](./1_RL_and_Forge_Fundamentals), Part 2: [Forge Internals](./2_Forge_Internals).
+This is Part 3 of our series. In the previous sections we covered Part 1: [RL Concepts and how they map to TorchForge](./1_RL_and_Forge_Fundamentals) and Part 2: [TorchForge Internals](./2_Forge_Internals).
 
-Now let's peel back the layers. Forge services are built on top of **Monarch**, PyTorch's distributed actor framework. Understanding this connection is crucial for optimization and debugging.
+Now let's peel back the layers. TorchForge services are built on top of **Monarch**, PyTorch's distributed actor framework. Understanding this connection is crucial for optimization and debugging.
 
 ## The Complete Hierarchy: Service to Silicon

@@ -12,7 +12,7 @@ graph TD
         Call["await policy_service.generate.route('What is 2+2?')"]
     end
 
-    subgraph ForgeServices["2. Forge Service Layer"]
+    subgraph ForgeServices["2. TorchForge Service Layer"]
         ServiceInterface["ServiceInterface: Routes requests, Load balancing, Health checks"]
         ServiceActor["ServiceActor: Manages replicas, Monitors health, Coordinates failures"]
     end
@@ -92,8 +92,8 @@ actors = cluster_procs.spawn("my_actor", MyActor)
 **The power**: Scale from single host to cluster without changing your actor code - ProcMesh handles all the complexity.
 
 ```python
-# This shows the underlying actor system that powers Forge services
-# NOTE: This is for educational purposes - use ForgeActor and .as_service() in real Forge apps!
+# This shows the underlying actor system that powers TorchForge services
+# NOTE: This is for educational purposes - use ForgeActor and .as_service() in real TorchForge apps!
 
 from monarch.actor import Actor, endpoint, this_proc, Future
 from monarch.actor import ProcMesh, this_host
@@ -146,8 +146,8 @@ await counters.increment.broadcast()  # No return value - just sends to all acto
 # Cleanup
 await procs.stop()
 
-# Remember: This raw Monarch code is for understanding how Forge works internally.
-# In your Forge applications, use ForgeActor, .as_service(), .as_actor() instead!
+# Remember: This raw Monarch code is for understanding how TorchForge works internally.
+# In your TorchForge applications, use ForgeActor, .as_service(), .as_actor() instead!
 ```
 
 ## Actor Meshes: Your Code Running Distributed
@@ -168,9 +168,9 @@ policy_actors = procs.spawn("policy", PolicyActor, model="Qwen/Qwen3-7B")
 # All initialized with the same model parameter
 ```
 
-## How Forge Services Use Monarch
+## How TorchForge Services Use Monarch
 
-Now the key insight: **Forge services are ServiceActors that manage ActorMeshes of your ForgeActor replicas**.
+Now the key insight: **TorchForge services are ServiceActors that manage ActorMeshes of your ForgeActor replicas**.
 
 ### The Service Creation Process

@@ -264,7 +264,7 @@ In real RL systems, you have multiple services that can share or use separate Pr
 ```mermaid
 graph TD
     subgraph Cluster["RL Training Cluster"]
-        subgraph Services["Forge Services"]
+        subgraph Services["TorchForge Services"]
             PS["Policy Service<br/>4 GPU replicas"]
             TS["Trainer Service<br/>2 GPU replicas"]
             RS["Reward Service<br/>4 CPU replicas"]
@@ -317,7 +317,7 @@ graph TD
 2. **Location Transparency**: Actors can be local or remote with identical APIs
 3. **Structured Distribution**: ProcMesh maps directly to hardware topology
 4. **Message Passing**: No shared memory means no race conditions or locks
-5. **Service Abstraction**: Forge hides Monarch complexity while preserving power
+5. **Service Abstraction**: TorchForge hides Monarch complexity while preserving power
 
 Understanding this hierarchy helps you:
 - **Debug performance issues**: Is the bottleneck at service, actor, or hardware level?
@@ -329,9 +329,9 @@ Understanding this hierarchy helps you:
 
 ## What You've Learned
 
-1. **RL Fundamentals**: How RL concepts map to Forge services with REAL, working examples
-2. **Service Abstraction**: How to use Forge services effectively with verified communication patterns
-3. **Monarch Foundation**: How Forge services connect to distributed actors and hardware
+1. **RL Fundamentals**: How RL concepts map to TorchForge services with REAL, working examples
+2. **Service Abstraction**: How to use TorchForge services effectively with verified communication patterns
+3. **Monarch Foundation**: How TorchForge services connect to distributed actors and hardware
 
 ## Key Takeaways
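Takeaway 4 in the hunk above ("no shared memory means no race conditions or locks") is easy to demo with a plain-Python mailbox actor. This is an illustrative sketch only - Monarch's real `Actor`/`endpoint` machinery is far richer - but the core idea is the same: one owner, one mailbox, messages processed one at a time:

```python
# Minimal mailbox actor: the actor owns its state and processes one message
# at a time from a queue, so no locks are needed. Not Monarch code.
import asyncio

class CounterActor:
    def __init__(self):
        self.value = 0
        self.mailbox = asyncio.Queue()

    async def run(self):
        while True:
            msg, reply = await self.mailbox.get()
            if msg == "increment":
                self.value += 1          # safe: only this task touches value
                reply.set_result(self.value)
            elif msg == "stop":
                reply.set_result(None)
                return

    async def send(self, msg):
        # Message passing instead of shared memory: enqueue and await a reply
        reply = asyncio.get_running_loop().create_future()
        await self.mailbox.put((msg, reply))
        return await reply

async def main():
    actor = CounterActor()
    task = asyncio.create_task(actor.run())
    results = [await actor.send("increment") for _ in range(3)]
    await actor.send("stop")
    await task
    return results

print(asyncio.run(main()))  # [1, 2, 3]
```

Because all mutation happens inside the actor's single `run` loop, concurrent senders can never race on `value` - the queue serializes them.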

docs/source/zero-to-forge-intro.md

Lines changed: 4 additions & 4 deletions
@@ -1,4 +1,4 @@
-# Zero to Forge: From RL Theory to Production-Scale Implementation
+# Zero to TorchForge: From RL Theory to Production-Scale Implementation
 
 A comprehensive guide for ML Engineers building distributed RL systems for language models.

@@ -12,9 +12,9 @@ PyTorch tutorial, shoutout to our PyTorch friends that remember!
 
 This section is currently structured in 3 detailed parts:
 
-1. [RL Fundamentals and Understanding Forge Terminology](tutorials/zero-to-forge/1_RL_and_Forge_Fundamentals): This gives a quick refresher of Reinforcement Learning and teaches you Forge Fundamentals
-2. [Forge Internals](tutorials/zero-to-forge/2_Forge_Internals): Goes a layer deeper and explains the internals of Forge
-3. [Monarch 101](tutorials/zero-to-forge/3_Monarch_101): It's a 101 to Monarch and how Forge Talks to Monarch
+1. [RL Fundamentals and Understanding TorchForge Terminology](tutorials/zero-to-forge/1_RL_and_Forge_Fundamentals): A quick refresher on Reinforcement Learning that teaches you TorchForge fundamentals
+2. [TorchForge Internals](tutorials/zero-to-forge/2_Forge_Internals): Goes a layer deeper and explains the internals of TorchForge
+3. [Monarch 101](tutorials/zero-to-forge/3_Monarch_101): A 101 on Monarch and how TorchForge talks to Monarch
 
 Each part builds on the previous, and the entire section can be consumed in roughly an hour - grab a chai and enjoy!
