2 changes: 1 addition & 1 deletion docs/source/tutorial_sources/README.txt
@@ -1,5 +1,5 @@
Tutorials
=========

This gallery contains tutorials and examples to help you get started with Forge.
This gallery contains tutorials and examples to help you get started with TorchForge.
Each tutorial demonstrates specific features and use cases with practical examples.
@@ -1,8 +1,8 @@
# Part 1: RL Fundamentals - Using Forge Terminology
# Part 1: RL Fundamentals - Using TorchForge Terminology

## Core RL Components in Forge
## Core RL Components in TorchForge

Let's start with a simple math tutoring example to understand RL concepts with the exact names Forge uses:
Let's start with a simple math tutoring example to understand RL concepts with the exact names TorchForge uses:

### The Toy Example: Teaching Math

@@ -30,7 +30,7 @@ graph TD
style Trainer fill:#E91E63
```

### RL Components Defined (Forge Names)
### RL Components Defined (TorchForge Names)

1. **Dataset**: Provides questions/prompts (like "What is 2+2?")
2. **Policy**: The AI being trained (generates answers like "The answer is 4")
@@ -66,12 +66,12 @@ def conceptual_rl_step():
if batch is not None:
trainer.train_step(batch) # Student gets better!

# 🔄 See complete working example below with actual Forge service calls
# 🔄 See complete working example below with actual TorchForge service calls
```
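To make the conceptual loop above concrete, here is a self-contained, runnable toy sketch. Every class and method name in it is an illustrative stand-in, not the TorchForge API:

```python
# Toy, runnable sketch of the conceptual RL step above.
# All classes here are illustrative stand-ins, NOT TorchForge APIs.
import random

class Dataset:
    def sample(self):
        return {"question": "What is 2+2?", "answer": "4"}

class Policy:
    def generate(self, question):
        # An untrained "student" guesses; training should raise accuracy.
        return random.choice(["4", "5"])

class RewardModel:
    def score(self, answer, target):
        return 1.0 if answer == target else 0.0

class ReplayBuffer:
    def __init__(self):
        self.episodes = []

    def add(self, episode):
        self.episodes.append(episode)

    def sample_batch(self, size=4):
        # Return a batch only once enough experience has accumulated.
        return self.episodes[-size:] if len(self.episodes) >= size else None

def conceptual_rl_step(dataset, policy, reward_model, buffer):
    item = dataset.sample()                              # 1. Get a question
    answer = policy.generate(item["question"])           # 2. Student answers
    reward = reward_model.score(answer, item["answer"])  # 3. Teacher grades
    buffer.add({"question": item["question"], "answer": answer, "reward": reward})
    return buffer.sample_batch()                         # 4. Batch for the trainer (or None)
```

Running a handful of steps fills the buffer until `sample_batch` starts returning batches, which is exactly the cadence the real loop follows.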

## From Concepts to Forge Services
## From Concepts to TorchForge Services

Here's the key insight: **Each RL component becomes a Forge service**. The toy example above maps directly to Forge:
Here's the key insight: **Each RL component becomes a TorchForge service**. The toy example above maps directly to TorchForge:

```mermaid
graph LR
@@ -85,7 +85,7 @@ graph LR
C6["Trainer"]
end

subgraph Services["Forge Services (Real Classes)"]
subgraph Services["TorchForge Services (Real Classes)"]
direction TB
S1["DatasetActor"]
S2["Policy"]
@@ -108,9 +108,9 @@ graph LR
style S3 fill:#FF9800
```

### RL Step with Forge Services
### RL Step with TorchForge Services

Let's look at the example from above again, but this time we would use the names from Forge:
Let's look at the example from above again, but this time using the names from TorchForge:

```python
# Conceptual Example
@@ -151,7 +151,7 @@ async def conceptual_forge_rl_step(services, step):

**Key difference**: Same RL logic, but each component is now a distributed, fault-tolerant, auto-scaling service.

Did you realise-we are not worrying about any Infra code here! Forge Automagically handles the details behind the scenes and you can focus on writing your RL Algorthms!
Did you realise we are not worrying about any infra code here? TorchForge automagically handles the details behind the scenes so you can focus on writing your RL algorithms!


## Why This Matters: Traditional ML Infrastructure Fails
@@ -196,7 +196,7 @@ graph LR
Each step has different:
- **Latency requirements**: Policy inference needs low latency (each episode waits), training can batch multiple episodes together
- **Scaling patterns**: Need N policy replicas to keep trainer busy, plus different sharding strategies (tensor parallel for training vs replicated inference)
- **Failure modes**: Any component failure cascades to halt the entire pipeline (Forge prevents this with automatic failover)
- **Failure modes**: Any component failure cascades to halt the entire pipeline (TorchForge prevents this with automatic failover)
- **Resource utilization**: GPUs for inference/training, CPUs for data processing

### Problem 3: The Coordination Challenge
@@ -218,11 +218,11 @@ def naive_rl_step():
entire_system_stops()
```
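The text above notes that a single component failure halts the naive pipeline, while automatic failover keeps it running. As a toy illustration of that idea (the function names here are made up for the sketch; TorchForge's real routing lives behind `.route()` and is far more sophisticated):

```python
# Toy failover sketch: try replicas in turn instead of halting the loop.
# Illustrative only - not the TorchForge API.
class ReplicaFailed(Exception):
    pass

def route_with_failover(replicas, prompt):
    """Send the request to the first replica that responds."""
    for replica in replicas:
        try:
            return replica(prompt)
        except ReplicaFailed:
            # Skip the failed replica; a real system would also respawn it
            # in the background with no impact on the training loop.
            continue
    raise RuntimeError("all replicas failed")

def dead_replica(prompt):
    raise ReplicaFailed

def healthy_replica(prompt):
    return f"answer to {prompt!r}"
```

With one dead and one healthy replica, the call still succeeds; in the naive pipeline the same failure would have stopped the entire system.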

## Enter Forge: RL-Native Architecture
## Enter TorchForge: RL-Native Architecture

Forge solves these problems by treating each RL component as an **independent, distributed unit** - some as fault-tolerant services (like Policy inference where failures are easy to handle), others as actors (like Trainers where recovery semantics differ)
TorchForge solves these problems by treating each RL component as an **independent, distributed unit** - some as fault-tolerant services (like Policy inference, where failures are easy to handle), others as actors (like Trainers, where recovery semantics differ).

Let's see how core RL concepts map to Forge components (you'll notice a mix of `.route()` for services and `.call_one()` for actors - we cover when to use each in Part 2):
Let's see how core RL concepts map to TorchForge components (you'll notice a mix of `.route()` for services and `.call_one()` for actors - we cover when to use each in Part 2):

**Quick API Reference:** (covered in detail in Part 2: Service Communication Patterns)
- `.route()` - Send request to any healthy replica in a service (load balanced)
@@ -231,7 +231,7 @@ Let's see how core RL concepts map to Forge components (you'll notice a mix of `

```python
async def real_rl_training_step(services, step):
"""Single RL step using verified Forge APIs"""
"""Single RL step using verified TorchForge APIs"""

# 1. Environment interaction - Using actual DatasetActor API
sample = await services['dataloader'].sample.call_one()
@@ -280,7 +280,7 @@ responses = await policy.generate.route(prompt=question)
answer = responses[0].text # responses is list[Completion]
```

Forge handles behind the scenes:
TorchForge handles behind the scenes:
- Routing to least loaded replica
- GPU memory management
- Batch optimization
@@ -361,9 +361,9 @@ group_size = 1
)
```

**Forge Components: Services vs Actors**
**TorchForge Components: Services vs Actors**

Forge has two types of distributed components:
TorchForge has two types of distributed components:
- **Services**: Multiple replicas with automatic load balancing (like Policy, RewardActor)
- **Actors**: Single instances that handle their own internal distribution (like RLTrainer, ReplayBuffer)

@@ -378,7 +378,7 @@ We cover this distinction in detail in Part 2, but for now this explains the sca
# If a policy replica fails:
responses = await policy.generate.route(prompt=question)
answer = responses[0].text
# -> Forge automatically routes to healthy replica
# -> TorchForge automatically routes to healthy replica
# -> Failed replica respawns in background
# -> No impact on training loop

28 changes: 14 additions & 14 deletions docs/source/tutorial_sources/zero-to-forge/2_Forge_Internals.md
@@ -1,6 +1,6 @@
# Part 2: Peeling Back the Abstraction - What Are Services?

We highly recommend reading [Part 1](./1_RL_and_Forge_Fundamentals) before this, it explains RL Concepts and how they land in Forge.
We highly recommend reading [Part 1](./1_RL_and_Forge_Fundamentals) before this one; it explains RL concepts and how they land in TorchForge.

Now that you see the power of the service abstraction, let's understand what's actually happening under the hood. Grab your chai!

@@ -49,7 +49,7 @@ graph TD

### 1. Real Service Configuration

Here's the actual ServiceConfig from Forge source code:
Here's the actual ServiceConfig from TorchForge source code:

```python
# Configuration pattern from apps/grpo/main.py:
@@ -106,14 +106,14 @@ await policy.shutdown()

### 3. How Services Actually Work

Forge services are implemented as ServiceActors that manage collections of your ForgeActor replicas:
TorchForge services are implemented as ServiceActors that manage collections of your ForgeActor replicas:

When you call `.as_service()`, Forge creates a `ServiceInterface` that manages N replicas of your `ForgeActor` class and gives you methods like `.route()`, `.fanout()`, etc.
When you call `.as_service()`, TorchForge creates a `ServiceInterface` that manages N replicas of your `ForgeActor` class and gives you methods like `.route()`, `.fanout()`, etc.

```python
# Your code sees this simple interface:
responses = await policy.generate.route(prompt=prompt)
# But Forge handles all the complexity of replica management, load balancing, and fault tolerance
# But TorchForge handles all the complexity of replica management, load balancing, and fault tolerance
```
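As a mental model only (not the real implementation), a `ServiceInterface` can be pictured as a small router over N replicas. The class and function names below are invented for this sketch; the real `ServiceActor`/`ServiceInterface` additionally handle health checks, respawning, and batching:

```python
# Minimal mental model of a service routing calls across N replicas.
# Purely illustrative - not TorchForge internals.
import itertools

class ToyServiceInterface:
    def __init__(self, replica_factory, num_replicas):
        # Like .as_service(): N replicas built from the same definition.
        self.replicas = [replica_factory(i) for i in range(num_replicas)]
        self._rr = itertools.cycle(range(num_replicas))

    def route(self, *args, **kwargs):
        """Send the call to one replica (here: simple round-robin)."""
        return self.replicas[next(self._rr)](*args, **kwargs)

    def fanout(self, *args, **kwargs):
        """Send the call to every replica and collect all results."""
        return [replica(*args, **kwargs) for replica in self.replicas]

def make_replica(idx):
    def generate(prompt):
        return f"replica-{idx}: {prompt}"
    return generate
```

Your code only ever touches `route()`/`fanout()`; which replica actually served the request stays hidden, which is the whole point of the abstraction.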

## Communication Patterns: Quick Reference
@@ -159,7 +159,7 @@ graph LR

## Deep Dive: Service Communication Patterns

These communication patterns (\"adverbs\") determine how your service calls are routed to replicas. Understanding when to use each pattern is key to effective Forge usage.
These communication patterns (\"adverbs\") determine how your service calls are routed to replicas. Understanding when to use each pattern is key to effective TorchForge usage.

### 1. `.route()` - Load Balanced Single Replica

@@ -181,7 +181,7 @@ Behind the scenes:
- **Throughput**: Limited by single replica capacity
- **Fault tolerance**: Automatic failover to other replicas

**Critical insight**: `.route()` is your default choice for stateless operations in Forge services.
**Critical insight**: `.route()` is your default choice for stateless operations in TorchForge services.

### 2. `.fanout()` - Broadcast with Results Collection

@@ -346,10 +346,10 @@ async def optimized_multi_turn():

**The challenge**: Multiple trainers and experience collectors reading/writing concurrently.

**Real Forge approach**: The ReplayBuffer actor handles concurrency internally:
**Real TorchForge approach**: The ReplayBuffer actor handles concurrency internally:

```python
# Forge ReplayBuffer endpoints (verified from source code)
# TorchForge ReplayBuffer endpoints (verified from source code)
# Add episodes (thread-safe by actor model)
await replay_buffer.add.call_one(episode) # .choose() would work too, but .call_one() clarifies it's a singleton actor not ActorMesh

@@ -372,7 +372,7 @@ batch = await replay_buffer.sample.call_one(
**The challenge**: Trainer updates policy weights, but policy service needs those weights.

```python
# Forge weight synchronization pattern from apps/grpo/main.py
# TorchForge weight synchronization pattern from apps/grpo/main.py
async def real_weight_sync(trainer, policy, step):
# Trainer pushes weights to TorchStore with version number
await trainer.push_weights.call_one(policy_version=step + 1)
@@ -388,11 +388,11 @@ print(f"Current policy version: {current_version}")

## Deep Dive: Asynchronous Coordination Patterns

**The real challenge**: Different services run at different speeds, but Forge's service abstraction handles the coordination complexity.
**The real challenge**: Different services run at different speeds, but TorchForge's service abstraction handles the coordination complexity.

### The Forge Approach: Let Services Handle Coordination
### The TorchForge Approach: Let Services Handle Coordination

Instead of manual coordination, Forge services handle speed mismatches automatically:
Instead of manual coordination, TorchForge services handle speed mismatches automatically:

```python
from apps.grpo.main import Episode, Group
@@ -556,7 +556,7 @@ await reward_actor.shutdown()
Now let's see how services coordinate in a real training loop:

```python
# This is the REAL way production RL systems are built with Forge
# This is the REAL way production RL systems are built with TorchForge

import asyncio
import torch
30 changes: 15 additions & 15 deletions docs/source/tutorial_sources/zero-to-forge/3_Monarch_101.md
@@ -1,8 +1,8 @@
# Part 3: The Forge-Monarch Connection
# Part 3: The TorchForge-Monarch Connection

This is part 3 of our series, in the previous sections: we learned Part 1: [RL Concepts and how they map to Forge](./1_RL_and_Forge_Fundamentals), Part 2: [Forge Internals](./2_Forge_Internals).
This is Part 3 of our series. In the previous sections we covered Part 1: [RL Concepts and how they map to TorchForge](./1_RL_and_Forge_Fundamentals) and Part 2: [TorchForge Internals](./2_Forge_Internals).

Now let's peel back the layers. Forge services are built on top of **Monarch**, PyTorch's distributed actor framework. Understanding this connection is crucial for optimization and debugging.
Now let's peel back the layers. TorchForge services are built on top of **Monarch**, PyTorch's distributed actor framework. Understanding this connection is crucial for optimization and debugging.

## The Complete Hierarchy: Service to Silicon

@@ -12,7 +12,7 @@ graph TD
Call["await policy_service.generate.route('What is 2+2?')"]
end

subgraph ForgeServices["2. Forge Service Layer"]
subgraph ForgeServices["2. TorchForge Service Layer"]
ServiceInterface["ServiceInterface: Routes requests, Load balancing, Health checks"]
ServiceActor["ServiceActor: Manages replicas, Monitors health, Coordinates failures"]
end
@@ -92,8 +92,8 @@ actors = cluster_procs.spawn("my_actor", MyActor)
**The power**: Scale from single host to cluster without changing your actor code - ProcMesh handles all the complexity.

```python
# This shows the underlying actor system that powers Forge services
# NOTE: This is for educational purposes - use ForgeActor and .as_service() in real Forge apps!
# This shows the underlying actor system that powers TorchForge services
# NOTE: This is for educational purposes - use ForgeActor and .as_service() in real TorchForge apps!

from monarch.actor import Actor, endpoint, this_proc, Future
from monarch.actor import ProcMesh, this_host
@@ -146,8 +146,8 @@ await counters.increment.broadcast()  # No return value - just sends to all actors
# Cleanup
await procs.stop()

# Remember: This raw Monarch code is for understanding how Forge works internally.
# In your Forge applications, use ForgeActor, .as_service(), .as_actor() instead!
# Remember: This raw Monarch code is for understanding how TorchForge works internally.
# In your TorchForge applications, use ForgeActor, .as_service(), .as_actor() instead!
```

## Actor Meshes: Your Code Running Distributed
@@ -168,9 +168,9 @@ policy_actors = procs.spawn("policy", PolicyActor, model="Qwen/Qwen3-7B")
# All initialized with the same model parameter
```
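The spawn-and-initialize semantics above can be mimicked in plain Python as a toy analogy (illustrative only; Monarch actors are real processes, possibly on remote hosts, and `ToyMesh`/`ToyPolicyActor` are invented names for this sketch):

```python
# Toy analogy for ActorMesh spawn semantics: one spawn call fans out to
# N instances sharing the same init args, and a broadcast hits all of them.
# Illustrative only - not the Monarch API.
class ToyMesh:
    def __init__(self, actor_cls, num_procs, **kwargs):
        # Like procs.spawn(...): every instance gets identical init args.
        self.actors = [actor_cls(**kwargs) for _ in range(num_procs)]

    def broadcast(self, method, *args):
        # Invoke the named endpoint on every actor and collect results.
        return [getattr(actor, method)(*args) for actor in self.actors]

class ToyPolicyActor:
    def __init__(self, model):
        self.model = model

    def describe(self):
        return f"serving {self.model}"

mesh = ToyMesh(ToyPolicyActor, num_procs=4, model="Qwen/Qwen3-7B")
```

The key property carried over from the real system: one logical call addresses many identically initialized instances at once.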

## How Forge Services Use Monarch
## How TorchForge Services Use Monarch

Now the key insight: **Forge services are ServiceActors that manage ActorMeshes of your ForgeActor replicas**.
Now the key insight: **TorchForge services are ServiceActors that manage ActorMeshes of your ForgeActor replicas**.

### The Service Creation Process

@@ -264,7 +264,7 @@ In real RL systems, you have multiple services that can share or use separate Pr
```mermaid
graph TD
subgraph Cluster["RL Training Cluster"]
subgraph Services["Forge Services"]
subgraph Services["TorchForge Services"]
PS["Policy Service<br/>4 GPU replicas"]
TS["Trainer Service<br/>2 GPU replicas"]
RS["Reward Service<br/>4 CPU replicas"]
@@ -317,7 +317,7 @@ graph TD
2. **Location Transparency**: Actors can be local or remote with identical APIs
3. **Structured Distribution**: ProcMesh maps directly to hardware topology
4. **Message Passing**: No shared memory means no race conditions or locks
5. **Service Abstraction**: Forge hides Monarch complexity while preserving power
5. **Service Abstraction**: TorchForge hides Monarch complexity while preserving power

Understanding this hierarchy helps you:
- **Debug performance issues**: Is the bottleneck at service, actor, or hardware level?
@@ -329,9 +329,9 @@ Understanding this hierarchy helps you:

## What You've Learned

1. **RL Fundamentals**: How RL concepts map to Forge services with REAL, working examples
2. **Service Abstraction**: How to use Forge services effectively with verified communication patterns
3. **Monarch Foundation**: How Forge services connect to distributed actors and hardware
1. **RL Fundamentals**: How RL concepts map to TorchForge services with REAL, working examples
2. **Service Abstraction**: How to use TorchForge services effectively with verified communication patterns
3. **Monarch Foundation**: How TorchForge services connect to distributed actors and hardware

## Key Takeaways

8 changes: 4 additions & 4 deletions docs/source/zero-to-forge-intro.md
@@ -1,4 +1,4 @@
# Zero to Forge: From RL Theory to Production-Scale Implementation
# Zero to TorchForge: From RL Theory to Production-Scale Implementation

A comprehensive guide for ML Engineers building distributed RL systems for language models.

@@ -12,9 +12,9 @@ PyTorch tutorial, shoutout to our PyTorch friends that remember!

This section currently is structured in 3 detailed parts:

1. [RL Fundamentals and Understanding Forge Terminology](tutorials/zero-to-forge/1_RL_and_Forge_Fundamentals): This gives a quick refresher of Reinforcement Learning and teaches you Forge Fundamentals
2. [Forge Internals](tutorials/zero-to-forge/2_Forge_Internals): Goes a layer deeper and explains the internals of Forge
3. [Monarch 101](tutorials/zero-to-forge/3_Monarch_101): It's a 101 to Monarch and how Forge Talks to Monarch
1. [RL Fundamentals and Understanding TorchForge Terminology](tutorials/zero-to-forge/1_RL_and_Forge_Fundamentals): gives a quick refresher on Reinforcement Learning and teaches you TorchForge fundamentals
2. [TorchForge Internals](tutorials/zero-to-forge/2_Forge_Internals): goes a layer deeper and explains the internals of TorchForge
3. [Monarch 101](tutorials/zero-to-forge/3_Monarch_101): a 101 on Monarch and how TorchForge talks to Monarch

Each part builds on the previous one, and the entire section can be read in roughly an hour. Grab a chai and enjoy!
