From 4c92653bae1e5f8eeb72fd24ba321f41ead7364d Mon Sep 17 00:00:00 2001 From: Sanyam Bhutani Date: Thu, 2 Oct 2025 18:50:19 -0700 Subject: [PATCH 01/22] Create ReadMe.MD --- docs/Tutorials/ReadMe.MD | 11 +++++++++++ 1 file changed, 11 insertions(+) create mode 100644 docs/Tutorials/ReadMe.MD diff --git a/docs/Tutorials/ReadMe.MD b/docs/Tutorials/ReadMe.MD new file mode 100644 index 000000000..6294c8ec8 --- /dev/null +++ b/docs/Tutorials/ReadMe.MD @@ -0,0 +1,11 @@ +Welcome to the Tutorials section! This section is inspired by the A-Z PyTorch tutorial, shoutout to our friends that remember! + +This section currently is structured in 3 detailed parts: + +1. []() +2. []() +3. []() + +Each part builds upon the next and the entire section can be consumed in roughly an hour-Grab a Chai and Enjoy! + +If you're eager, please checkout our SFT Tutorial too (Coming soon!) as well as [App Examples](../../apps/). \ No newline at end of file From 430a45e4af363f6ba3cd265e652435809a3dcad5 Mon Sep 17 00:00:00 2001 From: Sanyam Bhutani Date: Thu, 2 Oct 2025 19:02:51 -0700 Subject: [PATCH 02/22] add part 1 --- docs/Tutorials/1_RL_and_Forge_Fundamentals.MD | 385 ++++++++++++++++++ docs/Tutorials/2_.MD | 0 docs/Tutorials/3_.MD | 0 docs/Tutorials/ReadMe.MD | 12 +- 4 files changed, 395 insertions(+), 2 deletions(-) create mode 100644 docs/Tutorials/1_RL_and_Forge_Fundamentals.MD create mode 100644 docs/Tutorials/2_.MD create mode 100644 docs/Tutorials/3_.MD diff --git a/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD b/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD new file mode 100644 index 000000000..96710b57a --- /dev/null +++ b/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD @@ -0,0 +1,385 @@ +# Part 1: RL Fundamentals - Using Forge Terminology + +## Core RL Components in Forge + +Let's start with a simple math tutoring example to understand RL concepts with the exact names Forge uses: + +### The Toy Example: Teaching Math + +```mermaid +graph TD + subgraph Example["Math Tutoring RL Example"] + Dataset["Dataset
math problems
'What is 2+2?'"] + Policy["Policy
student AI
generates: 'The answer is 4'"] + Reward["Reward Model
teacher
scores: 0.95 (excellent)"] + Reference["Reference Model
original student
baseline comparison"] + ReplayBuffer["Replay Buffer
notebook
stores experiences"] + Trainer["Trainer
tutor
improves student"] + end + + Dataset --> Policy + Policy --> Reward + Policy --> Reference + Reward --> ReplayBuffer + Reference --> ReplayBuffer + ReplayBuffer --> Trainer + Trainer --> Policy + + style Policy fill:#99ff99 + style Reward fill:#ffcc99 + style Trainer fill:#ff99cc +``` + +### RL Components Defined (Forge Names) + +1. **Dataset**: Provides questions/prompts (like "What is 2+2?") +2. **Policy**: The AI being trained (generates answers like "The answer is 4") +3. **Reward Model**: Evaluates answer quality (gives scores like 0.95) +4. **Reference Model**: Original policy copy (prevents drift from baseline) +5. **Replay Buffer**: Stores experiences (question + answer + score) +6. **Trainer**: Updates the policy weights based on experiences + +### The RL Learning Flow + +```python +# CONCEPTUAL EXAMPLE - see apps/grpo/main.py for GRPO Code + +def conceptual_rl_step(): + # 1. Get a math problem + question = dataset.sample() # "What is 2+2?" + + # 2. Student generates answer + answer = policy.generate(question) # "The answer is 4" + + # 3. Teacher grades it + score = reward_model.evaluate(question, answer) # 0.95 + + # 4. Compare to original student + baseline = reference_model.compute_logprobs(question, answer) + + # 5. Store the experience + experience = Episode(question, answer, score, baseline) + replay_buffer.add(experience) + + # 6. When enough experiences collected, improve student + batch = replay_buffer.sample(curr_policy_version=0) + if batch is not None: + trainer.train_step(batch) # Student gets better! + +# 🔄 See complete working example below with actual Forge service calls +``` + +## From Concepts to Forge Services + +Here's the key insight: **Each RL component becomes a Forge service**. The toy example above maps directly to Forge: + +```mermaid +graph LR + subgraph Concepts["RL Concepts"] + C1["Dataset"] + C2["Policy"] + C3["Reward Model"] + C4["Reference Model"] + C5["Replay Buffer"] + C6["Trainer"] + end + + subgraph Services["Forge Services (Real Classes)"] + S1["DatasetActor"] + S2["Policy"] + S3["RewardActor"] + S4["ReferenceModel"] + S5["ReplayBuffer"] + S6["RLTrainer"] + end + + C1 --> S1 + C2 --> S2 + C3 --> S3 + C4 --> S4 + C5 --> S5 + C6 --> S6 + + style C2 fill:#99ff99 + style S2 fill:#99ff99 + style C3 fill:#ffcc99 + style S3 fill:#ffcc99 +``` + +### RL Step with Forge Services + +```python +# Conceptual Example + +async def conceptual_forge_rl_step(services, step): + # 1. Get a math problem - CONCEPTUAL API + sample = await services['dataloader'].get_sample() + question, target = sample["question"], sample["answer"] + + # 2. Student generates answer - CONCEPTUAL API + # Actual method names vary by implementation + responses = await services['policy'].generate(prompt=question) + answer = responses[0].text + + # 3. Teacher grades it - CONCEPTUAL API + # Actual reward evaluation varies by implementation + score = await services['reward_actor'].evaluate( + prompt=question, response=answer, target=target + ) + + # 4. Compare to baseline - CONCEPTUAL API + ref_logprobs = await services['ref_model'].compute_baseline(responses[0].token_ids) + + # 5. Store experience - CONCEPTUAL Episode structure + # Real Episode structure in src/forge/data_models/episode.py + episode = create_episode(responses[0], score, ref_logprobs, step) + await services['replay_buffer'].store(episode) + + # 6. 
Improve student - CONCEPTUAL API + batch = await services['replay_buffer'].get_batch(policy_version=step) + if batch is not None: + loss = await services['trainer'].update_policy(batch) + return loss +``` + +**Key difference**: Same RL logic, but each component is now a distributed, fault-tolerant, auto-scaling service. + + +## Why This Matters: Traditional ML Infrastructure Fails + +### The Infrastructure Challenge + +Our simple RL loop above has complex requirements: + +#### Problem 1: Different Resource Needs + +```mermaid +graph TD + subgraph Components["Each Component Needs Different Resources"] + Policy["Policy (Student AI)
Generates: 'The answer is 4'
Needs: Large GPU memory
Scaling: Multiple replicas for speed"] + + Reward["Reward Model (Teacher)
Scores answers: 0.95
Needs: Moderate compute
Scaling: CPU or small GPU"] + + Trainer["Trainer (Tutor)
Improves student weights
Needs: Massive GPU compute
Scaling: Distributed training"] + + Dataset["Dataset (Question Bank)
Provides: 'What is 2+2?'
Needs: CPU-intensive I/O
Scaling: High memory bandwidth"] + end + + style Policy fill:#99ff99 + style Reward fill:#ffcc99 + style Trainer fill:#ff99cc + style Dataset fill:#ccccff +``` + +### Problem 2: Complex Interdependencies + +```mermaid +graph LR + A["Policy: Student AI
'What is 2+2?' → 'The answer is 4'"] + B["Reward: Teacher
Scores answer: 0.95"] + C["Reference: Original Student
Provides baseline comparison"] + D["Replay Buffer: Notebook
Stores: question + answer + score"] + E["Trainer: Tutor
Improves student using experiences"] + + A --> B + A --> C + B --> D + C --> D + D --> E + E --> A + + style A fill:#99ff99 + style B fill:#ffcc99 + style C fill:#99ccff + style D fill:#ccff99 + style E fill:#ff99cc +``` + +Each step has different: +- **Latency requirements**: Policy inference needs low latency, training can batch +- **Scaling patterns**: Reward evaluation scales with response count, training with model size +- **Failure modes**: Policy failure stops generation, reward failure affects learning quality +- **Resource utilization**: GPUs for inference/training, CPUs for data processing + +### Problem 3: The Coordination Challenge + +Unlike supervised learning where you process independent batches, RL requires coordination: + +```python +# This won't work - creates bottlenecks and resource waste +def naive_rl_step(): + # Policy waits idle while reward model works + response = policy_model.generate(prompt) # GPU busy + reward = reward_model.evaluate(prompt, response) # Policy GPU idle + + # Training waits for single episode + loss = compute_loss(response, reward) # Batch size = 1, inefficient + + # Everything stops if any component fails + if policy_fails or reward_fails or trainer_fails: + entire_system_stops() +``` + +## Enter Forge: RL-Native Architecture + +Forge solves these problems by treating each RL component as an **independent, scalable service** + +Let's see how core RL concepts map to Forge services: + +```python +async def real_rl_training_step(services, step): + """Single RL step using verified Forge APIs""" + + # 1. Environment interaction + sample = await services['dataloader'].__next__.call_one() + prompt, target = sample["question"], sample["answer"] + + responses = await services['policy'].generate.route(prompt=prompt) + + # 2. Reward computation + score = await services['reward_actor'].evaluate_response.route( + prompt=prompt, response=responses[0].text, target=target + ) + + # 3. Get reference logprobs + ref_logprobs = await services['ref_model'].forward.route(responses[0].token_ids) + + # 4. Experience storage - Episode creation pattern + # Note: Actual Episode structure requires token tensors, not text + episode = create_episode_from_response(responses[0], score, ref_logprobs, step) + await services['replay_buffer'].add.call_one(episode) + + # 5. Learning - trainer endpoint + batch = await services['replay_buffer'].sample.call_one( + curr_policy_version=step + ) + if batch is not None: + loss = await services['trainer'].train_step.call_one(batch) + + # 6. Policy synchronization - weight update pattern + await services['trainer'].push_weights.call_one(step + 1) + await services['policy'].update_weights.fanout(step + 1) + + return loss +``` + +**Key insight**: Each line of RL pseudocode becomes a service call. The complexity of distribution, scaling, and fault tolerance is hidden behind these simple interfaces. 
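
The loop above calls a `create_episode_from_response` helper that is not shown. Here is a minimal sketch of what such a helper could look like — the `Episode` and `Group` classes below are illustrative stand-ins only; the real structures live in `src/forge/data_models/episode.py` and `apps/grpo/main.py` and store token tensors rather than text:

```python
# Illustrative sketch - field names are assumptions, not the real Forge Episode.
from dataclasses import dataclass, field


@dataclass
class Group:  # stand-in for the Group used in apps/grpo/main.py
    response: str
    ref_logprobs: list
    reward: float
    advantage: float | None = None


@dataclass
class Episode:  # stand-in for the real Episode (which stores token tensors)
    episode_id: int
    prompt: str
    target: str
    policy_version: int
    groups: list = field(default_factory=list)


def create_episode_from_response(completion, score, ref_logprobs, step, prompt="", target=""):
    """Bundle one rollout into an Episode that the replay buffer can store."""
    # A production loop would also thread the original prompt/target through.
    episode = Episode(
        episode_id=step,
        prompt=prompt,
        target=target,
        policy_version=step,
    )
    episode.groups.append(
        Group(response=completion.text, ref_logprobs=ref_logprobs, reward=score)
    )
    return episode
```
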
+ +## What Makes This Powerful + +### Automatic Resource Management +```python +responses = await policy.generate.route(prompt=question) +answer = responses[0].text # responses is list[Completion] + +# Forge handles behind the scenes: +# - Routing to least loaded replica +# - GPU memory management +# - Batch optimization +# - Failure recovery +# - Auto-scaling based on demand +``` + +### Independent Scaling +```python + +from forge.actors.policy import Policy, PolicyConfig, SamplingOverrides, WorkerConfig +from forge.actors.replay_buffer import ReplayBuffer +from forge.controller.service import shutdown_service +from apps.grpo.main import Trainer, RewardActor, ComputeAdvantages, RefModel, DatasetActor +from forge.data.rewards import MathReward, ThinkingReward +import asyncio + +model = "Qwen/Qwen3-1.7B" +group_size = 1 + +( + dataloader, + policy, + trainer, + replay_buffer, + compute_advantages, + ref_model, + reward_actor, +) = await asyncio.gather( + # Dataset service + spawn_service( + ServiceConfig(procs_per_replica=1, num_replicas=1), + DatasetActor, + path="openai/gsm8k", + config_name="main", + split="train", + streaming=True, + ), + # Policy service with GPU + spawn_service( + ServiceConfig(procs_per_replica=1, with_gpus=True, num_replicas=1), + Policy, + config=PolicyConfig( + worker_params=WorkerConfig(model=model), + sampling_params=SamplingOverrides( + num_samples=group_size, max_tokens=16 + ), + ), + ), + # Trainer service with GPU + spawn_service( + ServiceConfig(procs_per_replica=1, with_gpus=True, num_replicas=1), + Trainer, + learning_rate=1e-5, + beta=0.1, + model_name=model, + ), + # Replay buffer (CPU) + spawn_service( + ServiceConfig(procs_per_replica=1, num_replicas=1), + ReplayBuffer, + batch_size=2, + max_policy_age=1, + ), + # Advantage computation (CPU) + spawn_service( + ServiceConfig(procs_per_replica=1, num_replicas=1), + ComputeAdvantages, + gamma=0.99, + lambda_=0.95, + ), + # Reference model with GPU + spawn_service( + ServiceConfig(procs_per_replica=1, num_replicas=1, with_gpus=True), + RefModel, + model_name=model, + ), + # Reward actor (CPU) + spawn_service( + ServiceConfig(procs_per_replica=1, num_replicas=1), + RewardActor, + reward_functions=[MathReward(), ThinkingReward()], + ) + ) + +# Production scaling - multiply num_replicas: +# Policy: num_replicas=8 for high inference demand +# RewardActor: num_replicas=16 for parallel evaluation +# Trainer: num_replicas=4 for distributed training +``` + +### Fault Tolerance +```python +# If a policy replica fails: +responses = await policy.generate.route(prompt=question) +answer = responses[0].text +# -> Forge automatically routes to healthy replica +# -> Failed replica respawns in background +# -> No impact on training loop + +# If reward service fails: +score = await reward_actor.evaluate_response.route( + prompt=question, response=answer, target=target +) +# -> Retries on different replica automatically +# -> Graceful degradation if all replicas fail +# -> System continues (may need application-level handling) +``` + +This is fundamentally different from monolithic RL implementations where any component failure stops everything. 
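
The note above that the system "may need application-level handling" is worth making concrete. A minimal sketch of such handling, assuming the reward call can still raise once Forge's automatic retries are exhausted (the exception handling shown here is illustrative, not a built-in Forge API):

```python
# Illustrative sketch: degrade gracefully if the reward service stays unavailable.
async def robust_reward(reward_actor, prompt, response, target, default=0.0):
    try:
        return await reward_actor.evaluate_response.route(
            prompt=prompt, response=response, target=target
        )
    except Exception as err:  # broad on purpose: log, fall back, keep training
        print(f"Reward service unavailable ({err}); using default reward {default}")
        return default
```

Whether a neutral default reward is acceptable, or whether the episode should simply be dropped, is exactly the kind of application-level decision referred to above.
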
diff --git a/docs/Tutorials/2_.MD b/docs/Tutorials/2_.MD new file mode 100644 index 000000000..e69de29bb diff --git a/docs/Tutorials/3_.MD b/docs/Tutorials/3_.MD new file mode 100644 index 000000000..e69de29bb diff --git a/docs/Tutorials/ReadMe.MD b/docs/Tutorials/ReadMe.MD index 6294c8ec8..01d750d06 100644 --- a/docs/Tutorials/ReadMe.MD +++ b/docs/Tutorials/ReadMe.MD @@ -1,8 +1,16 @@ -Welcome to the Tutorials section! This section is inspired by the A-Z PyTorch tutorial, shoutout to our friends that remember! +## Zero to Forge: From RL Theory to Production-Scale Implementation + +A comprehensive guide for ML Engineers building distributed RL systems for language models. + +Some of the examples mentioned below will be conceptual in nature for understanding. Please refer to API Docs (Coming Soon!) for more details + +Welcome to the Tutorials section! This section is inspired by the A-Z PyTorch tutorial, shoutout to our PyTorch friends that remember! + +### This section currently is structured in 3 detailed parts: -1. []() +1. [RL Fundamentals and Understanding Forge Terminology](./1_RL_and_Forge_Fundamentals.MD): This gives a quick refresher of Reinforcement Learning and teaches you Forge Fundamentals 2. []() 3. []() From 223b2cab881168ad6c74d7bbf5707cd1f908baa7 Mon Sep 17 00:00:00 2001 From: Sanyam Bhutani Date: Thu, 2 Oct 2025 19:06:35 -0700 Subject: [PATCH 03/22] Update 1_RL_and_Forge_Fundamentals.MD --- docs/Tutorials/1_RL_and_Forge_Fundamentals.MD | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD b/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD index 96710b57a..bcffc733c 100644 --- a/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD +++ b/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD @@ -85,6 +85,7 @@ graph LR end subgraph Services["Forge Services (Real Classes)"] + S1["DatasetActor"] S2["Policy"] S3["RewardActor"] From b9cb2cb0a3e6eb05162a2fc91fddbe1952c16080 Mon Sep 17 00:00:00 2001 From: Sanyam Bhutani Date: Thu, 2 Oct 2025 19:08:03 -0700 Subject: [PATCH 04/22] Update 1_RL_and_Forge_Fundamentals.MD --- docs/Tutorials/1_RL_and_Forge_Fundamentals.MD | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD b/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD index bcffc733c..223a6e152 100644 --- a/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD +++ b/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD @@ -85,7 +85,6 @@ graph LR end subgraph Services["Forge Services (Real Classes)"] - S1["DatasetActor"] S2["Policy"] S3["RewardActor"] @@ -109,6 +108,8 @@ graph LR ### RL Step with Forge Services +Let's look at the example from above again, but this time we would use the names from Forge: + ```python # Conceptual Example @@ -145,6 +146,8 @@ async def conceptual_forge_rl_step(services, step): **Key difference**: Same RL logic, but each component is now a distributed, fault-tolerant, auto-scaling service. +Did you realise-we are not worrying about any Infra code here! Forge Automagically handles the details behind the scenes and you can focus on writing your RL Algorthms! 
+ ## Why This Matters: Traditional ML Infrastructure Fails From 5a0190b4d009c30e4c736d06be024c2bfb07f07a Mon Sep 17 00:00:00 2001 From: Sanyam Bhutani Date: Thu, 2 Oct 2025 19:12:43 -0700 Subject: [PATCH 05/22] part 2 --- docs/Tutorials/1_RL_and_Forge_Fundamentals.MD | 36 +- docs/Tutorials/2_Forge_Internals.MD | 665 ++++++++++++++++++ docs/Tutorials/3_.MD | 0 docs/Tutorials/{2_.MD => 3_Monarch_101.MD} | 0 4 files changed, 685 insertions(+), 16 deletions(-) create mode 100644 docs/Tutorials/2_Forge_Internals.MD delete mode 100644 docs/Tutorials/3_.MD rename docs/Tutorials/{2_.MD => 3_Monarch_101.MD} (100%) diff --git a/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD b/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD index 223a6e152..810ef373f 100644 --- a/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD +++ b/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD @@ -275,15 +275,15 @@ async def real_rl_training_step(services, step): ```python responses = await policy.generate.route(prompt=question) answer = responses[0].text # responses is list[Completion] - -# Forge handles behind the scenes: -# - Routing to least loaded replica -# - GPU memory management -# - Batch optimization -# - Failure recovery -# - Auto-scaling based on demand ``` +Forge handles behind the scenes: +- Routing to least loaded replica +- GPU memory management +- Batch optimization +- Failure recovery +- Auto-scaling based on demand + ### Independent Scaling ```python @@ -361,13 +361,14 @@ group_size = 1 reward_functions=[MathReward(), ThinkingReward()], ) ) - -# Production scaling - multiply num_replicas: -# Policy: num_replicas=8 for high inference demand -# RewardActor: num_replicas=16 for parallel evaluation -# Trainer: num_replicas=4 for distributed training ``` +Production scaling - multiply num_replicas: +- Policy: num_replicas=8 for high inference demand +- RewardActor: num_replicas=16 for parallel evaluation +- Trainer: num_replicas=4 for distributed training + + ### Fault Tolerance ```python # If a policy replica fails: @@ -381,9 +382,12 @@ answer = responses[0].text score = await reward_actor.evaluate_response.route( prompt=question, response=answer, target=target ) -# -> Retries on different replica automatically -# -> Graceful degradation if all replicas fail -# -> System continues (may need application-level handling) ``` -This is fundamentally different from monolithic RL implementations where any component failure stops everything. +- Retries on different replica automatically +- Graceful degradation if all replicas fail +- System continues (may need application-level handling) + +This is fundamentally different from monolithic RL implementations where any component failure stops everything! + +In the next Section, we will go a layer deeper and learn how ForgeServices work. Continue to [Part 2 here](./2_Forge_Internals.MD) \ No newline at end of file diff --git a/docs/Tutorials/2_Forge_Internals.MD b/docs/Tutorials/2_Forge_Internals.MD new file mode 100644 index 000000000..d55eda51a --- /dev/null +++ b/docs/Tutorials/2_Forge_Internals.MD @@ -0,0 +1,665 @@ +# Part 2: Peeling Back the Abstraction - What Are Services? + +We highly recommend reading [Part 1](./1_RL_and_Forge_Fundamentals.MD) before this, it explains RL Concepts and how they land in Forge. + +Now that you see the power of the service abstraction, let's understand what's actually happening under the hood, Grab your chai! 
+ +## Service Anatomy: Beyond the Interface + +When you call `await policy_service.generate(question)`, here's what actually happens: + +```mermaid +graph TD + Call["Your Code:
await policy_service.generate"] + + subgraph ServiceLayer["Service Layer"] + Proxy["Service Proxy
Load balancing
Health checking
Request routing"] + LB["Load Balancer
Replica selection
Circuit breaker
Retry logic"] + end + + subgraph Replicas["Replica Management"] + R1["Replica 1
GPU 0
Healthy"] + R2["Replica 2
GPU 1
Overloaded"] + R3["Replica 3
GPU 2
Failed"] + R4["Replica 4
GPU 3
Healthy"] + end + + subgraph Compute["Actual Computation"] + Actor["Policy Actor
vLLM engine
Model weights
KV cache"] + end + + Call --> Proxy + Proxy --> LB + LB --> R1 + LB -.-> R2 + LB -.-> R3 + LB --> R4 + R1 --> Actor + R4 --> Actor + + style Call fill:#99ff99 + style LB fill:#ffcc99 + style R3 fill:#ff9999 + style Actor fill:#cc99ff +``` + +## Service Components Deep Dive + +### 1. Real Service Configuration + +Here's the actual ServiceConfig from Forge source code: + +```python +# Configuration pattern from apps/grpo/main.py: +Policy.options( + procs=1, # Processes per replica + num_replicas=4, # Number of replicas + with_gpus=True # Allocate GPUs + # Other available options: + # hosts=None +) + +# This is the ACTUAL way services are configured in Forge +``` + +### 2. Real Service Creation + +Services are created using the `spawn_service` function: + +```python +# This is what ACTUALLY works - copied directly from the notebook + +from forge.controller.service import ServiceConfig, spawn_service +from forge.actors.policy import Policy, PolicyConfig, SamplingOverrides, WorkerConfig + +model = "Qwen/Qwen3-1.7B" + +policy = await spawn_service( + ServiceConfig(procs_per_replica=1, with_gpus=True, num_replicas=1), + Policy, + config=PolicyConfig( + worker_params=WorkerConfig(model=model), + sampling_params=SamplingOverrides( + num_samples=1, max_tokens=16 + ), + ), +) + +prompt = "What is 3 + 5?" +responses = await policy.generate.choose(prompt=prompt) +print(f"Response: {responses[0].text}") + +# The spawn_service() function automatically handles: +# - Spawning actor replicas across processes/GPUs +# - Load balancing with .choose() method +# - Health monitoring and failure recovery +# - Message routing and serialization + +# Cleanup when done +await shutdown_service(policy) +``` + +### 3. How Services Actually Work + +Forge services are implemented as ServiceActors that manage collections of your ForgeActor replicas: + +```python +# Forge internals - What happens behind the scenes: +# 1. .as_service() creates a ServiceInterface +# 2. ServiceInterface manages N replicas of your ForgeActor class +# 3. ServiceInterface handles routing between replicas +# 4. You get methods like .route(), .fanout(), etc. + +# Your code sees this: +responses = await policy.generate.route(prompt=prompt) + +# But behind the scenes: +# - ServiceInterface selects healthy replica +# - Routes message to that replica's Policy.generate() endpoint +# - Handles failures and retries automatically +# - Returns list[Completion] from the selected replica +``` + +### 3. Different Service Types and Their Characteristics + +```mermaid +graph TD + subgraph GPU["GPU-Intensive Services"] + PolicySvc["Policy Service
Large model inference
High GPU memory
Batch optimization"] + TrainerSvc["Trainer Service
Distributed training
Gradient sync
Massive compute"] + RefSvc["Reference Service
Frozen model
Baseline computation
Read-only ops"] + end + + subgraph CPU["CPU-Intensive Services"] + RewardSvc["Reward Service
Evaluation logic
Rule-based scoring
High throughput"] + DataSvc["Data Service
Dataset streaming
Preprocessing
I/O optimization"] + end + + subgraph Memory["Memory-Intensive Services"] + BufferSvc["Buffer Service
Experience storage
Efficient sampling
Persistence"] + MetricsSvc["Metrics Service
Logging aggregation
Performance tracking
Analytics"] + end + + style PolicySvc fill:#ff9999 + style TrainerSvc fill:#ff9999 + style RewardSvc fill:#99ff99 + style BufferSvc fill:#9999ff +``` + +## Deep Dive: Service Communication Patterns + +These communication patterns (\"adverbs\") determine how your service calls are routed to replicas. Understanding when to use each pattern is key to effective Forge usage. + +### 1. `.route()` - Load Balanced Single Replica + +**When to use**: Normal request routing where any replica can handle the request. + +```python +responses = await policy.generate.route(prompt=question) +answer = responses[0].text # Extract text from Completion object + +# Behind the scenes: +# 1. Health check eliminates failed replicas +# 2. Load balancer picks least loaded healthy replica +# 3. Request routes to that specific replica +# 4. Automatic retry on different replica if failure +``` + +**Performance characteristics**: +- **Latency**: Lowest (single network hop) +- **Throughput**: Limited by single replica capacity +- **Fault tolerance**: Automatic failover to other replicas + +**Critical insight**: `.route()` is your default choice for stateless operations in Forge services. + +### 2. `.fanout()` - Broadcast with Results Collection + +**When to use**: You need responses from ALL replicas. + +```python +# Get version from all policy replicas +current_versions = await policy.get_version.fanout() +# Returns: [version_replica_1, version_replica_2, ...] + +# Update weights on all replicas +await policy.update_weights.fanout(new_policy_version) +# Broadcasts to all replicas simultaneously +``` + +**Performance characteristics**: +- **Latency**: Slowest replica determines total latency +- **Throughput**: Network bandwidth × number of replicas +- **Fault tolerance**: Fails if ANY replica fails (unless configured otherwise) + +**Critical gotcha**: Don't use `.fanout()` for high-frequency operations - it contacts all replicas. + +### 3. Streaming Operations - Custom Implementation Pattern + +**When to use**: You want to process results as they arrive, not wait for all. + +```python +# 📝 CONCEPTUAL - Streaming requires custom implementation in your training loop +# The basic ReplayBuffer doesn't have built-in streaming methods +# Pattern from apps/grpo/main.py continuous training: + +while training: + # This is the real API call pattern + batch = await replay_buffer.sample.call_one(curr_policy_version=step) + if batch is not None: + # Process batch immediately + loss = await trainer.train_step.call_one(batch) + print(f"Training loss: {loss}") + else: + await asyncio.sleep(0.1) # Wait for more data +``` + +**Performance characteristics**: +- **Latency**: Process first result immediately +- **Throughput**: Pipeline parallelism (much higher than sequential) +- **Fault tolerance**: Continues if some replicas fail + +**Critical insight**: This is essential for high-throughput RL where you can't wait for batches. + +### 4. Fire-and-Forget Operations + +**When to use**: Side effects that don't need responses (notifications, cache updates). 
+ +```python +# 📝 CONCEPTUAL - Fire-and-forget requires custom @endpoint implementations +# The basic services don't have broadcast methods built-in +# You would implement custom endpoints in your ForgeActor: + +class CustomPolicy(Policy): + @endpoint + async def clear_cache(self) -> None: + """Custom endpoint for cache clearing""" + self.policy_worker.clear_kv_cache() + +# Then use it (hypothetical): +# await custom_policy.clear_cache.fanout() # Clear all replica caches +# Note: Actual cache clearing would use existing Policy methods +``` + +**Performance characteristics**: +- **Latency**: Immediately returns (doesn't wait for completion) +- **Throughput**: Network limited, but non-blocking +- **Fault tolerance**: Fire-and-forget (you don't know if it worked) + +**Critical warning**: Only use for non-critical operations - you get no confirmation. + +### 5. Service Sessions for Stateful Operations + +**When to use**: When you need multiple calls to hit the same replica (like KV cache preservation). + +```python +# This Counter example demonstrates the session pattern + +from forge.controller import ForgeActor +from forge.controller.service import ServiceConfig, spawn_service, shutdown_service +from monarch.actor import endpoint + +class ForgeCounter(ForgeActor): + def __init__(self, initial_value: int): + self.value = initial_value + + @endpoint + def increment(self) -> int: + self.value += 1 + return self.value + + @endpoint + def get_value(self) -> int: + return self.value + + @endpoint + async def reset(self): + self.value = 0 + +counter_service = await spawn_service( + ServiceConfig(procs_per_replica=1, num_replicas=4), + ForgeCounter, + initial_value=0 +) + +# Test basic operations +await counter_service.increment.choose() +results = await counter_service.increment.call() +print(f"All replica values: {results}") + +# STICKY SESSIONS +print("\nUsing sticky sessions:") +async with counter_service.session(): + await counter_service.reset.choose() + print(await counter_service.increment.choose()) # 1 + print(await counter_service.increment.choose()) # 2 + print(await counter_service.increment.choose()) # 3 + + final_value = await counter_service.get_value.choose() + print(f"Final value on this replica: {final_value}") # 3 + +# Same pattern works with Policy for multi-turn conversations: +# async with policy.session(): +# response1 = await policy.generate.choose(prompt=turn1) +# full_prompt = turn1 + response1[0].text + turn2 +# response2 = await policy.generate.choose(prompt=full_prompt) +# # Both calls hit same replica, preserving KV cache + +# Cleanup +await shutdown_service(counter_service) +``` + +**Performance impact**: Critical for maintaining KV cache in multi-turn conversations. + +## Deep Dive: State Management Reality + +The most complex challenge in distributed RL is maintaining state consistency while maximizing performance. + +### The KV Cache Problem + +**The challenge**: Policy inference is much faster with KV cache, but cache is tied to specific conversation history. + +```python +# This breaks KV cache optimization: +async def naive_multi_turn(): + # Each call might go to different replica = cache miss + response1 = await policy_service.generate.choose(question1) + response2 = await policy_service.generate.choose(question1 + response1) # Cache miss! + response3 = await policy_service.generate.choose(conversation_so_far) # Cache miss! +``` + +**The solution**: Sticky sessions ensure all calls go to same replica. 
+ +```python +async def optimized_multi_turn(): + async with policy.session(): + # All calls guaranteed to hit same replica = cache hits + response1 = await policy.generate.route(prompt=question1) + full_prompt = question1 + response1[0].text + response2 = await policy.generate.route(prompt=full_prompt) # Cache hit! + conversation = full_prompt + response2[0].text + response3 = await policy.generate.route(prompt=conversation) # Cache hit! + + # Session ends, replica can be garbage collected or reused +``` + +**Performance impact**: Maintaining KV cache across turns avoids recomputing previous tokens. + +### Replay Buffer Consistency + +**The challenge**: Multiple trainers and experience collectors reading/writing concurrently. + +**Real Forge approach**: The ReplayBuffer actor handles concurrency internally: + +```python +# Forge ReplayBuffer endpoints (verified from source code) +# Add episodes (thread-safe by actor model) +await replay_buffer.add.call_one(episode) # Note: .call_one() not .choose() + +# Sample batches for training +batch = await replay_buffer.sample.call_one( + curr_policy_version=step_number, + batch_size=None # Optional parameter, uses default from config +) + +# Additional methods available: +# await replay_buffer.clear.call_one() # Clear buffer +# await replay_buffer.evict.call_one(curr_policy_version) # Remove old episodes +# state = await replay_buffer.state_dict.call_one() # Get state for checkpointing +``` + +**Critical insight**: The actor model provides natural thread safety - each actor processes messages sequentially. + +### Weight Synchronization Strategy + +**The challenge**: Trainer updates policy weights, but policy service needs those weights. + +```python +# Forge weight synchronization pattern from apps/grpo/main.py +async def real_weight_sync(trainer, policy, step): + # Trainer pushes weights to TorchStore with version number + await trainer.push_weights.call_one(policy_version=step + 1) + + # Policy service updates to new version from TorchStore + # Use .fanout() to update ALL policy replicas + await policy.update_weights.fanout(policy_version=step + 1) + +# Check current policy version +current_version = await policy.get_version.route() +print(f"Current policy version: {current_version}") +``` + +## Deep Dive: Asynchronous Coordination Patterns + +**The real challenge**: Different services run at different speeds, but Forge's service abstraction handles the coordination complexity. 
+ +### The Forge Approach: Let Services Handle Coordination + +Instead of manual coordination, Forge services handle speed mismatches automatically: + +```python + +from apps.grpo.main import Episode, Group + +async def simple_rl_step(): + + # ===== Generate a rollout ===== + sample = await dataloader.__next__.choose() + prompt, target = sample["question"], sample["answer"] + + print(f"Prompt: {prompt}") + print(f"Target: {target}") + + actions = await policy.generate.choose(prompt=prompt) + print(f"Policy response: {actions[0].text}") + + ref_logprobs = await ref_model.forward.choose(actions[0].token_ids) + reward = await reward_actor.evaluate_response.choose( + prompt=prompt, + response=actions[0].text, + target=target + ) + print(f"Reward: {reward}") + + episode = Episode( + episode_id=0, + prompt=prompt, + target=target, + policy_version=0, + ) + + episode.add_group(Group( + response=actions[0].text, + ref_logprobs=ref_logprobs, + reward=reward, + )) + + advantages = await compute_advantages.__call__.choose(episode.groups) + episode.groups[0].advantage = advantages[0] + print(f"Advantage: {advantages[0]}") + await replay_buffer.add.choose(episode) + print("Episode stored in replay buffer") + + # ===== Train on the batch ===== + batch = await replay_buffer.sample.choose(curr_policy_version=0) + if batch is not None: + print("Training on batch...") + training_result = await trainer.train_step.choose(batch) + loss = training_result.get("loss", 0.0) + print(f"Training loss: {loss}") + return loss + else: + print("Not enough data in buffer yet") + return None + +for step in range(10): + print(f"\n--- RL Step {step + 1} ---") + loss = await simple_rl_step() + if loss: + print(f"Step {step + 1} complete, loss: {loss:.4f}") + else: + print(f"Step {step + 1} complete, building buffer...") +``` + +### Handling Speed Mismatches with Service Scaling + +**The insight**: Scale services independently based on their bottlenecks. + +```python +# Scale fast services with more replicas +policy = await Policy.options( + procs=1, num_replicas=8, with_gpus=True # Many replicas for high throughput +).as_service( + engine_config=EngineConfig(model=model_name) +) + +# Reward evaluation might be CPU-bound +reward_actor = await RewardActor.options( + procs=1, num_replicas=16, with_gpus=False # More CPU replicas +).as_service( + reward_functions=[MathReward()] +) + +# Training needs fewer but more powerful replicas +trainer = await RLTrainer.options( + procs=1, num_replicas=2, with_gpus=True # Fewer but GPU-heavy +).as_actor( # Trainer typically uses .as_actor() not .as_service() + optimizer=Optimizer(lr=1e-5) +) +``` + +### Natural Backpressure Through Service APIs + +```python +# backpressure pattern - The replay buffer naturally provides backpressure +batch = await replay_buffer.sample.call_one(curr_policy_version=step) +if batch is None: + # Not enough data yet - natural rate limiting + print("Buffer not ready, collecting more experiences...") + continue +else: + # Proceed with training + loss = await trainer.train_step.call_one(batch) + print(f"Training loss: {loss}") +``` + +These patterns address the core technical challenges in distributed RL. The key insight: **Forge services handle coordination complexity automatically, letting you focus on RL algorithm logic**. 
+ +## Service Implementation Example + +Let's see how a reward service is actually implemented: + +```python +# ✅ COMPLETE WORKING EXAMPLE - Exact RewardActor from apps/grpo/main.py + +from forge.controller import ForgeActor +from monarch.actor import endpoint +from forge.data.rewards import MathReward, ThinkingReward +from forge.controller.service import ServiceConfig, spawn_service + +# EXACT class definition from apps/grpo/main.py lines 68-83 +class RewardActor(ForgeActor): + def __init__(self, reward_functions: list): + self.reward_functions = reward_functions + + @endpoint + async def evaluate_response(self, prompt: str, response: str, target: str) -> float: + """Evaluate response quality using multiple reward functions""" + total_reward = 0.0 + + for reward_fn in self.reward_functions: + # Each reward function contributes to total score + reward = reward_fn(prompt, response, target) + total_reward += reward + + # Return average reward across all functions + return total_reward / len(self.reward_functions) if self.reward_functions else 0.0 + +reward_actor = await spawn_service( + ServiceConfig(procs_per_replica=1, num_replicas=1), + RewardActor, + reward_functions=[MathReward(), ThinkingReward()] +) + +prompt = "What is 15% of 240?" +response = "15% of 240 is 36" +target = "36" + +score = await reward_actor.evaluate_response.choose( + prompt=prompt, + response=response, + target=target +) +print(f"Reward score: {score}") # Usually around 1.0 for correct math answers + +# For production scaling - increase num_replicas for parallel evaluation: +# ServiceConfig(procs_per_replica=1, num_replicas=16) # 16 parallel evaluators + +# Cleanup when done +await shutdown_service(reward_actor) +``` + +## Service Orchestration: The Training Loop + +Now let's see how services coordinate in a real training loop: + +```python +# This is the REAL way production RL systems are built with Forge + +import asyncio +from forge.actors.policy import Policy +from forge.actors.reference_model import ReferenceModel +from forge.actors.replay_buffer import ReplayBuffer +from forge.actors.trainer import RLTrainer +from forge.controller.actor import ForgeActor +from forge.data.rewards import MathReward, ThinkingReward +from monarch.actor import endpoint +from omegaconf import DictConfig + +# EXACT service creation from apps/grpo/main.py lines 322-344 +print("Initializing all services...") +( + dataloader, + policy, + trainer, + replay_buffer, + compute_advantages, + ref_model, + reward_actor, +) = await asyncio.gather( + DatasetActor.options(**cfg.actors.dataset).as_actor(**cfg.dataset), + Policy.options(**cfg.services.policy).as_service(**cfg.policy), + RLTrainer.options(**cfg.actors.trainer).as_actor( + **cfg.trainer, loss=simple_grpo_loss + ), + ReplayBuffer.options(**cfg.actors.replay_buffer).as_actor( + **cfg.replay_buffer, collate=collate + ), + ComputeAdvantages.options(**cfg.actors.compute_advantages).as_actor(), + ReferenceModel.options(**cfg.services.ref_model).as_service(**cfg.ref_model), + RewardActor.options(**cfg.services.reward_actor).as_service( + reward_functions=[MathReward(), ThinkingReward()] + ), +) + +print("All services initialized successfully!") + +# EXACT usage patterns from apps/grpo/main.py continuous training loop +async def production_training_loop(): + """Real training loop pattern from apps/grpo/main.py""" + step = 0 + + while True: + # Data generation + sample = await dataloader.sample.call_one() + + # Policy generation service call + responses = await 
policy.generate.route(prompt=sample["question"]) + + # Reference computation service call + ref_logprobs = await ref_model.forward.route(responses[0].token_ids) + + # Reward evaluation service call + reward = await reward_actor.evaluate_response.route( + prompt=sample["question"], + response=responses[0].text, + target=sample["answer"] + ) + + # Experience storage (simplified structure for illustration) + episode = create_episode(sample, responses[0], reward, ref_logprobs, step) + await replay_buffer.add.call_one(episode) + + # Training when ready endpoints + batch = await replay_buffer.sample.call_one(curr_policy_version=step) + if batch is not None: + loss = await trainer.train_step.call_one(batch) + + # Weight synchronization pattern + await trainer.push_weights.call_one(step + 1) + await policy.update_weights.route(step + 1) + + print(f"Step {step}, Loss: {loss:.4f}") + step += 1 + +# EXACT cleanup pattern from apps/grpo/main.py lines 493-504 +print("Shutting down services...") +await asyncio.gather( + DatasetActor.shutdown(dataloader), + policy.shutdown(), + RLTrainer.shutdown(trainer), + ReplayBuffer.shutdown(replay_buffer), + ComputeAdvantages.shutdown(compute_advantages), + ref_model.shutdown(), + reward_actor.shutdown(), +) +print("All services shut down successfully!") +``` + +**Key observations:** +1. **Parallelism**: Independent operations run concurrently +2. **Load balancing**: Each `choose()` call automatically selects optimal replica +3. **Fault tolerance**: Failures automatically retry on different replicas +4. **Resource efficiency**: CPU and GPU services scale independently +5. **Coordination**: Services coordinate through shared state (replay buffer, weight versions) + +This is the power of the service abstraction - complex distributed coordination looks like simple async Python code. diff --git a/docs/Tutorials/3_.MD b/docs/Tutorials/3_.MD deleted file mode 100644 index e69de29bb..000000000 diff --git a/docs/Tutorials/2_.MD b/docs/Tutorials/3_Monarch_101.MD similarity index 100% rename from docs/Tutorials/2_.MD rename to docs/Tutorials/3_Monarch_101.MD From 44f562435f7a2036fb2d1a758c2327dd808cb4a7 Mon Sep 17 00:00:00 2001 From: Sanyam Bhutani Date: Thu, 2 Oct 2025 19:15:10 -0700 Subject: [PATCH 06/22] add part 3 --- docs/Tutorials/3_Monarch_101.MD | 437 ++++++++++++++++++++++++++++++++ docs/Tutorials/ReadMe.MD | 4 +- 2 files changed, 439 insertions(+), 2 deletions(-) diff --git a/docs/Tutorials/3_Monarch_101.MD b/docs/Tutorials/3_Monarch_101.MD index e69de29bb..9369be13a 100644 --- a/docs/Tutorials/3_Monarch_101.MD +++ b/docs/Tutorials/3_Monarch_101.MD @@ -0,0 +1,437 @@ +# Part 3: The Forge-Monarch Connection + +Now let's peel back the layers. Forge services are built on top of **Monarch**, PyTorch's distributed actor framework. Understanding this connection is crucial for optimization and debugging. + +## The Complete Hierarchy: Service to Silicon + +```mermaid +graph TD + subgraph YourCode["1. Your RL Code"] + Call["await policy_service.generate.choose('What is 2+2?')"] + end + + subgraph ForgeServices["2. Forge Service Layer"] + ServiceInterface["ServiceInterface
• Routes .choose() to replica
• Handles load balancing
• Manages health checks"] + ServiceActor["ServiceActor
• Manages replica lifecycle
• Monitors health
• Coordinates failures"] + end + + subgraph MonarchLayer["3. Monarch Actor Layer"] + ActorMesh["ActorMesh[PolicyActor]
• 4 PolicyActor instances
• Each on different GPU
• Message passing interface"] + ProcMesh["ProcMesh
• 4 processes
• GPU topology: [0,1,2,3]
• Network interconnect"] + end + + subgraph Hardware["4. Physical Hardware"] + GPU0["GPU 0
PolicyActor #1
vLLM Engine
Model Weights"] + GPU1["GPU 1
PolicyActor #2
vLLM Engine
Model Weights"] + GPU2["GPU 2
PolicyActor #3
vLLM Engine
Model Weights"] + GPU3["GPU 3
PolicyActor #4
vLLM Engine
Model Weights"] + end + + Call --> ServiceInterface + ServiceInterface --> ServiceActor + ServiceActor --> ActorMesh + ActorMesh --> ProcMesh + ProcMesh --> GPU0 + ProcMesh --> GPU1 + ProcMesh --> GPU2 + ProcMesh --> GPU3 + + style Call fill:#99ff99 + style ServiceActor fill:#ffcc99 + style ActorMesh fill:#cc99ff + style ProcMesh fill:#ccccff +``` + +## Deep Dive: ProcMesh - The Foundation + +**ProcMesh** is Monarch's core abstraction for organizing processes across hardware. Think of it as a multi-dimensional grid that maps directly to your cluster topology. + +### Single Host ProcMesh + +```mermaid +graph TD + subgraph Host["Single Host (8 GPUs)"] + subgraph ProcMesh["ProcMesh: per_host={'gpus': 8}"] + P0["Process 0
GPU 0"] + P1["Process 1
GPU 1"] + P2["Process 2
GPU 2"] + P3["Process 3
GPU 3"] + P4["Process 4
GPU 4"] + P5["Process 5
GPU 5"] + P6["Process 6
GPU 6"] + P7["Process 7
GPU 7"] + end + + P0 -.->|"Network"| P1 + P1 -.->|"Network"| P2 + P2 -.->|"Network"| P3 + P3 -.->|"Network"| P4 + P4 -.->|"Network"| P5 + P5 -.->|"Network"| P6 + P6 -.->|"Network"| P7 + P7 -.->|"Network"| P0 + end + + style P0 fill:#ff9999 + style P1 fill:#ff9999 + style P2 fill:#ff9999 + style P3 fill:#ff9999 + style P4 fill:#ff9999 + style P5 fill:#ff9999 + style P6 fill:#ff9999 + style P7 fill:#ff9999 +``` + +### Multi-Host ProcMesh + +```mermaid +graph TD + subgraph Cluster["Multi-Host Cluster"] + subgraph Host1["Host 1"] + subgraph PM1["ProcMesh Segment 1"] + H1P0["Process 0
GPU 0"] + H1P1["Process 1
GPU 1"] + H1P2["Process 2
GPU 2"] + H1P3["Process 3
GPU 3"] + end + end + + subgraph Host2["Host 2"] + subgraph PM2["ProcMesh Segment 2"] + H2P0["Process 4
GPU 0"] + H2P1["Process 5
GPU 1"] + H2P2["Process 6
GPU 2"] + H2P3["Process 7
GPU 3"] + end + end + + subgraph Host3["Host 3"] + subgraph PM3["ProcMesh Segment 3"] + H3P0["Process 8
GPU 0"] + H3P1["Process 9
GPU 1"] + H3P2["Process 10
GPU 2"] + H3P3["Process 11
GPU 3"] + end + end + end + + H1P0 -.->|"InfiniBand"| H2P0 + H1P1 -.->|"InfiniBand"| H2P1 + H2P0 -.->|"InfiniBand"| H3P0 + H2P1 -.->|"InfiniBand"| H3P1 + + style PM1 fill:#ff9999 + style PM2 fill:#99ff99 + style PM3 fill:#99ccff +``` + +```python +# This shows the underlying actor system that powers Forge services + +from monarch.actor import Actor, endpoint, this_proc, Future +from monarch.actor import ProcMesh, this_host +import asyncio + +# STEP 1: Define a basic actor +class Counter(Actor): + def __init__(self, initial_value: int): + self.value = initial_value + + @endpoint + def increment(self) -> None: + self.value += 1 + + @endpoint + def get_value(self) -> int: + return self.value + +# STEP 2: Single actor in local process +counter: Counter = this_proc().spawn("counter", Counter, initial_value=0) + +# STEP 3: Send messages +fut: Future[int] = counter.get_value.call_one() +value = await fut +print(f"Counter value: {value}") # 0 + +# STEP 4: Multiple actors across processes +procs: ProcMesh = this_host().spawn_procs(per_host={"gpus": 8}) +counters: Counter = procs.spawn("counters", Counter, 0) + +# STEP 5: Broadcast to all actors +await counters.increment.call() + +# STEP 6: Different message patterns +# call_one() - single actor +value = await counters.get_value.call_one() +print(f"One counter: {value}") + +# choose() - random single actor +value = await counters.get_value.choose() +print(f"Random counter: {value}") + +# call() - all actors, collect results +values = await counters.get_value.call() +print(f"All counters: {values}") + +# broadcast() - fire and forget +await counters.increment.broadcast() + +# Cleanup +await procs.stop() +``` + +## Actor Meshes: Your Code Running Distributed + +**ActorMesh** is created when you spawn actors across a ProcMesh. Each process in the ProcMesh gets one instance of your actor. + +```mermaid +graph TD + subgraph Creation["Actor Creation Process"] + Code["mesh.spawn('policy', PolicyActor, model='Qwen/Qwen3-7B')"] + + subgraph ProcMesh["ProcMesh (4 processes)"] + P0["Process 0
GPU 0"] + P1["Process 1
GPU 1"] + P2["Process 2
GPU 2"] + P3["Process 3
GPU 3"] + end + + subgraph ActorMesh["ActorMesh[PolicyActor]"] + A0["PolicyActor
Instance #0
model=Qwen/Qwen3-7B
generation_count=0"] + A1["PolicyActor
Instance #1
model=Qwen/Qwen3-7B
generation_count=0"] + A2["PolicyActor
Instance #2
model=Qwen/Qwen3-7B
generation_count=0"] + A3["PolicyActor
Instance #3
model=Qwen/Qwen3-7B
generation_count=0"] + end + + Code --> ProcMesh + P0 --> A0 + P1 --> A1 + P2 --> A2 + P3 --> A3 + end + + style A0 fill:#99ff99 + style A1 fill:#99ff99 + style A2 fill:#99ff99 + style A3 fill:#99ff99 +``` + +### Message Routing Through ActorMesh + +```mermaid +graph TD + subgraph MessageFlow["Message Flow Patterns"] + Client["await policy_actors.generate.METHOD(prompt)"] + + subgraph Methods["Different Adverbs Route Differently"] + Choose["choose()
→ Routes to ONE actor
→ Load balanced"] + Call["call()
→ Routes to ALL actors
→ Collects all results"] + Broadcast["broadcast()
→ Routes to ALL actors
→ Fire and forget"] + Stream["stream()
→ Routes to ALL actors
→ Iterator of results"] + end + + subgraph ActorInstances["PolicyActor Instances"] + A0["Actor 0
GPU 0
generates response"] + A1["Actor 1
GPU 1
generates response"] + A2["Actor 2
GPU 2
generates response"] + A3["Actor 3
GPU 3
generates response"] + end + + Client --> Choose + Client --> Call + Client --> Broadcast + Client --> Stream + + Choose -.->|"Load balanced"| A1 + Call --> A0 + Call --> A1 + Call --> A2 + Call --> A3 + Broadcast --> A0 + Broadcast --> A1 + Broadcast --> A2 + Broadcast --> A3 + Stream --> A0 + Stream --> A1 + Stream --> A2 + Stream --> A3 + end + + style Choose fill:#99ff99 + style Call fill:#ffcc99 + style Broadcast fill:#ff99cc + style Stream fill:#cc99ff +``` + +## How Forge Services Use Monarch + +Now the key insight: **Forge services are ServiceActors that manage ActorMeshes of your ForgeActor replicas**. + +### The Service Creation Process + +```mermaid +graph TD + subgraph ServiceCreation["spawn_service() Process"] + Call["await spawn_service(ServiceConfig(num_replicas=4), PolicyActor, model='Qwen')"] + + ServiceActor["ServiceActor
• Manages 4 replicas
• Handles health checks
• Routes service calls"] + + subgraph Replicas["4 Independent Replicas"] + subgraph R0["Replica 0"] + PM0["ProcMesh
1 process
GPU 0"] + AM0["ActorMesh
1 PolicyActor"] + end + + subgraph R1["Replica 1"] + PM1["ProcMesh
1 process
GPU 1"] + AM1["ActorMesh
1 PolicyActor"] + end + + subgraph R2["Replica 2"] + PM2["ProcMesh
1 process
GPU 2"] + AM2["ActorMesh
1 PolicyActor"] + end + + subgraph R3["Replica 3"] + PM3["ProcMesh
1 process
GPU 3"] + AM3["ActorMesh
1 PolicyActor"] + end + end + + Call --> ServiceActor + ServiceActor --> R0 + ServiceActor --> R1 + ServiceActor --> R2 + ServiceActor --> R3 + PM0 --> AM0 + PM1 --> AM1 + PM2 --> AM2 + PM3 --> AM3 + end + + style ServiceActor fill:#ffcc99 + style AM0 fill:#99ff99 + style AM1 fill:#99ff99 + style AM2 fill:#99ff99 + style AM3 fill:#99ff99 +``` + +### Service Call to Actor Execution + +```mermaid +graph TD + subgraph CallFlow["Complete Call Flow"] + UserCall["await policy_service.generate.choose('What is 2+2?')"] + + ServiceInterface["ServiceInterface
• Receives .choose() call
• Routes to ServiceActor"] + + ServiceActor["ServiceActor
• Selects healthy replica
• Load balancing logic
• Failure handling"] + + SelectedReplica["Selected Replica #2
• ProcMesh with 1 process
• ActorMesh with 1 PolicyActor"] + + PolicyActor["PolicyActor Instance
• Loads model
• Runs vLLM inference
• Returns 'The answer is 4'"] + + GPU["GPU 2
• vLLM engine
• Model weights
• KV cache
• CUDA kernels"] + + UserCall --> ServiceInterface + ServiceInterface --> ServiceActor + ServiceActor --> SelectedReplica + SelectedReplica --> PolicyActor + PolicyActor --> GPU + + GPU -.->|"Response"| PolicyActor + PolicyActor -.->|"Response"| SelectedReplica + SelectedReplica -.->|"Response"| ServiceActor + ServiceActor -.->|"Response"| ServiceInterface + ServiceInterface -.->|"'The answer is 4'"| UserCall + end + + style UserCall fill:#99ff99 + style ServiceActor fill:#ffcc99 + style PolicyActor fill:#cc99ff + style GPU fill:#ffcccc +``` + +## Multiple Services Sharing Infrastructure + +In real RL systems, you have multiple services that can share or use separate ProcMeshes: + +```mermaid +graph TD + subgraph Cluster["RL Training Cluster"] + subgraph Services["Forge Services"] + PS["Policy Service
4 GPU replicas"] + TS["Trainer Service
2 GPU replicas"] + RS["Reward Service
4 CPU replicas"] + BS["Buffer Service
1 CPU replica"] + end + + subgraph MonarchInfra["Monarch Infrastructure"] + subgraph GPUMesh["GPU ProcMesh (6 processes)"] + G0["Process 0
GPU 0"] + G1["Process 1
GPU 1"] + G2["Process 2
GPU 2"] + G3["Process 3
GPU 3"] + G4["Process 4
GPU 4"] + G5["Process 5
GPU 5"] + end + + subgraph CPUMesh["CPU ProcMesh (5 processes)"] + C0["Process 0
CPU"] + C1["Process 1
CPU"] + C2["Process 2
CPU"] + C3["Process 3
CPU"] + C4["Process 4
CPU"] + end + end + + PS --> G0 + PS --> G1 + PS --> G2 + PS --> G3 + TS --> G4 + TS --> G5 + RS --> C0 + RS --> C1 + RS --> C2 + RS --> C3 + BS --> C4 + end + + style PS fill:#99ff99 + style TS fill:#ff99cc + style RS fill:#ffcc99 + style BS fill:#cc99ff + style GPUMesh fill:#ffe6e6 + style CPUMesh fill:#e6f3ff +``` + +## Key Insights: Why This Architecture Matters + +1. **Process Isolation**: Each actor runs in its own process - failures don't cascade +2. **Location Transparency**: Actors can be local or remote with identical APIs +3. **Structured Distribution**: ProcMesh maps directly to hardware topology +4. **Message Passing**: No shared memory means no race conditions or locks +5. **Service Abstraction**: Forge hides Monarch complexity while preserving power + +Understanding this hierarchy helps you: +- **Debug performance issues**: Is the bottleneck at service, actor, or hardware level? +- **Optimize resource usage**: How many replicas per service? GPU vs CPU processes? +- **Handle failures gracefully**: Which layer failed and how to recover? +- **Scale effectively**: Where to add resources for maximum impact? + +# Conclusion + +## What You've Learned + +1. **RL Fundamentals**: How RL concepts map to Forge services with REAL, working examples +2. **Service Abstraction**: How to use Forge services effectively with verified communication patterns +3. **Monarch Foundation**: How Forge services connect to distributed actors and hardware + +## Key Takeaways + +- **Services hide complexity**: Your RL code looks like simple async functions, but runs on distributed clusters +- **Communication patterns matter**: `.route()`, `.fanout()`, sessions, and `.call_one()` each serve specific purposes +- **Architecture understanding helps**: Knowing the Service → Actor → Process → Hardware hierarchy helps you debug, optimize, and scale +- **Always verify APIs**: This guide is verified, but cross-check with source code for latest changes +- **Real API patterns**: Use `.options().as_service()` not `spawn_service()`, use `.route()` not `.choose()`, etc. diff --git a/docs/Tutorials/ReadMe.MD b/docs/Tutorials/ReadMe.MD index 01d750d06..7798b147d 100644 --- a/docs/Tutorials/ReadMe.MD +++ b/docs/Tutorials/ReadMe.MD @@ -11,8 +11,8 @@ Welcome to the Tutorials section! This section is inspired by the A-Z PyTorch tu This section currently is structured in 3 detailed parts: 1. [RL Fundamentals and Understanding Forge Terminology](./1_RL_and_Forge_Fundamentals.MD): This gives a quick refresher of Reinforcement Learning and teaches you Forge Fundamentals -2. []() -3. []() +2. [Forge Internals](./2_Forge_Internals.MD): Goes a layer deeper and explains the internals of Forge +3. [Monarch 101](./3_Monarch_101.MD): It's a 101 to Monarch and how Forge Talks to Monarch Each part builds upon the next and the entire section can be consumed in roughly an hour-Grab a Chai and Enjoy! 
From 21b0924c466f71793891ded231f569807049b392 Mon Sep 17 00:00:00 2001 From: Sanyam Bhutani Date: Thu, 2 Oct 2025 19:26:45 -0700 Subject: [PATCH 07/22] Update 2_Forge_Internals.MD --- docs/Tutorials/2_Forge_Internals.MD | 42 ++++++++++++++--------------- 1 file changed, 20 insertions(+), 22 deletions(-) diff --git a/docs/Tutorials/2_Forge_Internals.MD b/docs/Tutorials/2_Forge_Internals.MD index d55eda51a..0c810a08e 100644 --- a/docs/Tutorials/2_Forge_Internals.MD +++ b/docs/Tutorials/2_Forge_Internals.MD @@ -8,6 +8,8 @@ Now that you see the power of the service abstraction, let's understand what's a When you call `await policy_service.generate(question)`, here's what actually happens: +(Don't worry, we will understand Services right in the next section!) + ```mermaid graph TD Call["Your Code:
await policy_service.generate"] @@ -58,17 +60,19 @@ Policy.options( # Other available options: # hosts=None ) - -# This is the ACTUAL way services are configured in Forge ``` ### 2. Real Service Creation Services are created using the `spawn_service` function: -```python -# This is what ACTUALLY works - copied directly from the notebook +The spawn_service() function automatically handles: +- Spawning actor replicas across processes/GPUs +- Load balancing with .choose() method +- Health monitoring and failure recovery +- Message routing and serialization +```python from forge.controller.service import ServiceConfig, spawn_service from forge.actors.policy import Policy, PolicyConfig, SamplingOverrides, WorkerConfig @@ -89,12 +93,6 @@ prompt = "What is 3 + 5?" responses = await policy.generate.choose(prompt=prompt) print(f"Response: {responses[0].text}") -# The spawn_service() function automatically handles: -# - Spawning actor replicas across processes/GPUs -# - Load balancing with .choose() method -# - Health monitoring and failure recovery -# - Message routing and serialization - # Cleanup when done await shutdown_service(policy) ``` @@ -103,23 +101,23 @@ await shutdown_service(policy) Forge services are implemented as ServiceActors that manage collections of your ForgeActor replicas: -```python -# Forge internals - What happens behind the scenes: -# 1. .as_service() creates a ServiceInterface -# 2. ServiceInterface manages N replicas of your ForgeActor class -# 3. ServiceInterface handles routing between replicas -# 4. You get methods like .route(), .fanout(), etc. +Forge internals - What happens behind the scenes: +1. `.as_service()` creates a `ServiceInterface` +2. `ServiceInterface` manages N replicas of your `ForgeActor` class +3. `ServiceInterface` handles routing between replicas +4. You get methods like `.route()`, `.fanout()`, etc. +```python # Your code sees this: responses = await policy.generate.route(prompt=prompt) - -# But behind the scenes: -# - ServiceInterface selects healthy replica -# - Routes message to that replica's Policy.generate() endpoint -# - Handles failures and retries automatically -# - Returns list[Completion] from the selected replica ``` +But behind the scenes: +- `ServiceInterface` selects healthy replica +- Routes message to that replica's `Policy.generate()` endpoint +- Handles failures and retries automatically +- Returns list[Completion] from the selected replica + ### 3. Different Service Types and Their Characteristics ```mermaid From b581d11bde0ce603cc89f444c51f38add87cc4e2 Mon Sep 17 00:00:00 2001 From: Sanyam Bhutani Date: Thu, 2 Oct 2025 19:34:03 -0700 Subject: [PATCH 08/22] add --- docs/Tutorials/2_Forge_Internals.MD | 43 +++++++++-------------------- docs/Tutorials/3_Monarch_101.MD | 2 ++ 2 files changed, 15 insertions(+), 30 deletions(-) diff --git a/docs/Tutorials/2_Forge_Internals.MD b/docs/Tutorials/2_Forge_Internals.MD index 0c810a08e..9018afe3d 100644 --- a/docs/Tutorials/2_Forge_Internals.MD +++ b/docs/Tutorials/2_Forge_Internals.MD @@ -155,14 +155,14 @@ These communication patterns (\"adverbs\") determine how your service calls are ```python responses = await policy.generate.route(prompt=question) answer = responses[0].text # Extract text from Completion object - -# Behind the scenes: -# 1. Health check eliminates failed replicas -# 2. Load balancer picks least loaded healthy replica -# 3. Request routes to that specific replica -# 4. Automatic retry on different replica if failure ``` +Behind the scenes: +1. 
Health check eliminates failed replicas +2. Load balancer picks least loaded healthy replica +3. Request routes to that specific replica +4. Automatic retry on different replica if failure + **Performance characteristics**: - **Latency**: Lowest (single network hop) - **Throughput**: Limited by single replica capacity @@ -196,7 +196,7 @@ await policy.update_weights.fanout(new_policy_version) **When to use**: You want to process results as they arrive, not wait for all. ```python -# 📝 CONCEPTUAL - Streaming requires custom implementation in your training loop +# CONCEPTUAL - Streaming requires custom implementation in your training loop # The basic ReplayBuffer doesn't have built-in streaming methods # Pattern from apps/grpo/main.py continuous training: @@ -223,7 +223,7 @@ while training: **When to use**: Side effects that don't need responses (notifications, cache updates). ```python -# 📝 CONCEPTUAL - Fire-and-forget requires custom @endpoint implementations +# CONCEPTUAL - Fire-and-forget requires custom @endpoint implementations # The basic services don't have broadcast methods built-in # You would implement custom endpoints in your ForgeActor: @@ -485,36 +485,19 @@ trainer = await RLTrainer.options( ) ``` -### Natural Backpressure Through Service APIs - -```python -# backpressure pattern - The replay buffer naturally provides backpressure -batch = await replay_buffer.sample.call_one(curr_policy_version=step) -if batch is None: - # Not enough data yet - natural rate limiting - print("Buffer not ready, collecting more experiences...") - continue -else: - # Proceed with training - loss = await trainer.train_step.call_one(batch) - print(f"Training loss: {loss}") -``` - -These patterns address the core technical challenges in distributed RL. The key insight: **Forge services handle coordination complexity automatically, letting you focus on RL algorithm logic**. - ## Service Implementation Example Let's see how a reward service is actually implemented: ```python -# ✅ COMPLETE WORKING EXAMPLE - Exact RewardActor from apps/grpo/main.py +# Exact RewardActor from apps/grpo/main.py from forge.controller import ForgeActor from monarch.actor import endpoint from forge.data.rewards import MathReward, ThinkingReward from forge.controller.service import ServiceConfig, spawn_service -# EXACT class definition from apps/grpo/main.py lines 68-83 +# class definition from apps/grpo/main.py class RewardActor(ForgeActor): def __init__(self, reward_functions: list): self.reward_functions = reward_functions @@ -573,7 +556,7 @@ from forge.data.rewards import MathReward, ThinkingReward from monarch.actor import endpoint from omegaconf import DictConfig -# EXACT service creation from apps/grpo/main.py lines 322-344 +# Service creation from apps/grpo/main.py lines 322-344 print("Initializing all services...") ( dataloader, @@ -601,7 +584,6 @@ print("Initializing all services...") print("All services initialized successfully!") -# EXACT usage patterns from apps/grpo/main.py continuous training loop async def production_training_loop(): """Real training loop pattern from apps/grpo/main.py""" step = 0 @@ -639,7 +621,6 @@ async def production_training_loop(): print(f"Step {step}, Loss: {loss:.4f}") step += 1 -# EXACT cleanup pattern from apps/grpo/main.py lines 493-504 print("Shutting down services...") await asyncio.gather( DatasetActor.shutdown(dataloader), @@ -661,3 +642,5 @@ print("All services shut down successfully!") 5. 
**Coordination**: Services coordinate through shared state (replay buffer, weight versions) This is the power of the service abstraction - complex distributed coordination looks like simple async Python code. + +In the next part we will learn about [Monarch internals](./3_Monarch_101.MD) \ No newline at end of file diff --git a/docs/Tutorials/3_Monarch_101.MD b/docs/Tutorials/3_Monarch_101.MD index 9369be13a..94c02c37e 100644 --- a/docs/Tutorials/3_Monarch_101.MD +++ b/docs/Tutorials/3_Monarch_101.MD @@ -1,5 +1,7 @@ # Part 3: The Forge-Monarch Connection +This is part 3 of our series, in the previous sections: we learned [RL Concepts and how they map to Forge](./1_RL_and_Forge_Fundamentals.MD), [Forge Internals](./2_Forge_Internals.MD). + Now let's peel back the layers. Forge services are built on top of **Monarch**, PyTorch's distributed actor framework. Understanding this connection is crucial for optimization and debugging. ## The Complete Hierarchy: Service to Silicon From cb2ce542a9a9b8611ad19fa894dc39b9cb0e7f21 Mon Sep 17 00:00:00 2001 From: Sanyam Bhutani Date: Thu, 2 Oct 2025 19:38:48 -0700 Subject: [PATCH 09/22] Update 3_Monarch_101.MD --- docs/Tutorials/3_Monarch_101.MD | 124 ++++++++++++++++---------------- 1 file changed, 62 insertions(+), 62 deletions(-) diff --git a/docs/Tutorials/3_Monarch_101.MD b/docs/Tutorials/3_Monarch_101.MD index 94c02c37e..7b3f6d310 100644 --- a/docs/Tutorials/3_Monarch_101.MD +++ b/docs/Tutorials/3_Monarch_101.MD @@ -11,24 +11,24 @@ graph TD subgraph YourCode["1. Your RL Code"] Call["await policy_service.generate.choose('What is 2+2?')"] end - + subgraph ForgeServices["2. Forge Service Layer"] ServiceInterface["ServiceInterface
• Routes .choose() to replica
• Handles load balancing
• Manages health checks"] ServiceActor["ServiceActor
• Manages replica lifecycle
• Monitors health
• Coordinates failures"] end - - subgraph MonarchLayer["3. Monarch Actor Layer"] - ActorMesh["ActorMesh[PolicyActor]
• 4 PolicyActor instances
• Each on different GPU
• Message passing interface"] - ProcMesh["ProcMesh
• 4 processes
• GPU topology: [0,1,2,3]
• Network interconnect"] + + subgraph MonarchLayer["3. Monarch Actor Layer"] + ActorMesh["ActorMesh PolicyActor
• 4 PolicyActor instances
• Each on different GPU
• Message passing interface"] + ProcMesh["ProcMesh
• 4 processes
• GPU topology: 0,1,2,3
• Network interconnect"] end - + subgraph Hardware["4. Physical Hardware"] GPU0["GPU 0
PolicyActor #1
vLLM Engine
Model Weights"] - GPU1["GPU 1
PolicyActor #2
vLLM Engine
Model Weights"] + GPU1["GPU 1
PolicyActor #2
vLLM Engine
Model Weights"] GPU2["GPU 2
PolicyActor #3
vLLM Engine
Model Weights"] GPU3["GPU 3
PolicyActor #4
vLLM Engine
Model Weights"] end - + Call --> ServiceInterface ServiceInterface --> ServiceActor ServiceActor --> ActorMesh @@ -37,7 +37,7 @@ graph TD ProcMesh --> GPU1 ProcMesh --> GPU2 ProcMesh --> GPU3 - + style Call fill:#99ff99 style ServiceActor fill:#ffcc99 style ActorMesh fill:#cc99ff @@ -55,17 +55,17 @@ graph TD subgraph Host["Single Host (8 GPUs)"] subgraph ProcMesh["ProcMesh: per_host={'gpus': 8}"] P0["Process 0
GPU 0"] - P1["Process 1
GPU 1"] + P1["Process 1
GPU 1"] P2["Process 2
GPU 2"] P3["Process 3
GPU 3"] P4["Process 4
GPU 4"] P5["Process 5
GPU 5"] - P6["Process 6
GPU 6"] + P6["Process 6
GPU 6"] P7["Process 7
GPU 7"] end - + P0 -.->|"Network"| P1 - P1 -.->|"Network"| P2 + P1 -.->|"Network"| P2 P2 -.->|"Network"| P3 P3 -.->|"Network"| P4 P4 -.->|"Network"| P5 @@ -73,7 +73,7 @@ graph TD P6 -.->|"Network"| P7 P7 -.->|"Network"| P0 end - + style P0 fill:#ff9999 style P1 fill:#ff9999 style P2 fill:#ff9999 @@ -97,8 +97,8 @@ graph TD H1P3["Process 3
GPU 3"] end end - - subgraph Host2["Host 2"] + + subgraph Host2["Host 2"] subgraph PM2["ProcMesh Segment 2"] H2P0["Process 4
GPU 0"] H2P1["Process 5
GPU 1"] @@ -106,22 +106,22 @@ graph TD H2P3["Process 7
GPU 3"] end end - + subgraph Host3["Host 3"] subgraph PM3["ProcMesh Segment 3"] H3P0["Process 8
GPU 0"] H3P1["Process 9
GPU 1"] - H3P2["Process 10
GPU 2"] + H3P2["Process 10
GPU 2"] H3P3["Process 11
GPU 3"] end end end - + H1P0 -.->|"InfiniBand"| H2P0 H1P1 -.->|"InfiniBand"| H2P1 H2P0 -.->|"InfiniBand"| H3P0 H2P1 -.->|"InfiniBand"| H3P1 - + style PM1 fill:#ff9999 style PM2 fill:#99ff99 style PM3 fill:#99ccff @@ -167,7 +167,7 @@ await counters.increment.call() value = await counters.get_value.call_one() print(f"One counter: {value}") -# choose() - random single actor +# choose() - random single actor value = await counters.get_value.choose() print(f"Random counter: {value}") @@ -190,28 +190,28 @@ await procs.stop() graph TD subgraph Creation["Actor Creation Process"] Code["mesh.spawn('policy', PolicyActor, model='Qwen/Qwen3-7B')"] - + subgraph ProcMesh["ProcMesh (4 processes)"] - P0["Process 0
GPU 0"] + P0["Process 0
GPU 0"] P1["Process 1
GPU 1"] P2["Process 2
GPU 2"] P3["Process 3
GPU 3"] end - + subgraph ActorMesh["ActorMesh[PolicyActor]"] A0["PolicyActor
Instance #0
model=Qwen/Qwen3-7B
generation_count=0"] A1["PolicyActor
Instance #1
model=Qwen/Qwen3-7B
generation_count=0"] A2["PolicyActor
Instance #2
model=Qwen/Qwen3-7B
generation_count=0"] A3["PolicyActor
Instance #3
model=Qwen/Qwen3-7B
generation_count=0"] end - + Code --> ProcMesh P0 --> A0 P1 --> A1 P2 --> A2 P3 --> A3 end - + style A0 fill:#99ff99 style A1 fill:#99ff99 style A2 fill:#99ff99 @@ -224,29 +224,29 @@ graph TD graph TD subgraph MessageFlow["Message Flow Patterns"] Client["await policy_actors.generate.METHOD(prompt)"] - + subgraph Methods["Different Adverbs Route Differently"] Choose["choose()
→ Routes to ONE actor
→ Load balanced"] - Call["call()
→ Routes to ALL actors
→ Collects all results"] + Call["call()
→ Routes to ALL actors
→ Collects all results"] Broadcast["broadcast()
→ Routes to ALL actors
→ Fire and forget"] Stream["stream()
→ Routes to ALL actors
→ Iterator of results"] end - + subgraph ActorInstances["PolicyActor Instances"] A0["Actor 0
GPU 0
generates response"] - A1["Actor 1
GPU 1
generates response"] + A1["Actor 1
GPU 1
generates response"] A2["Actor 2
GPU 2
generates response"] A3["Actor 3
GPU 3
generates response"] end - + Client --> Choose Client --> Call Client --> Broadcast Client --> Stream - + Choose -.->|"Load balanced"| A1 Call --> A0 - Call --> A1 + Call --> A1 Call --> A2 Call --> A3 Broadcast --> A0 @@ -258,7 +258,7 @@ graph TD Stream --> A2 Stream --> A3 end - + style Choose fill:#99ff99 style Call fill:#ffcc99 style Broadcast fill:#ff99cc @@ -275,31 +275,31 @@ Now the key insight: **Forge services are ServiceActors that manage ActorMeshes graph TD subgraph ServiceCreation["spawn_service() Process"] Call["await spawn_service(ServiceConfig(num_replicas=4), PolicyActor, model='Qwen')"] - + ServiceActor["ServiceActor
• Manages 4 replicas
• Handles health checks
• Routes service calls"] - - subgraph Replicas["4 Independent Replicas"] + + subgraph Replicas["4 Independent Replicas"] subgraph R0["Replica 0"] PM0["ProcMesh
1 process
GPU 0"] AM0["ActorMesh
1 PolicyActor"] end - + subgraph R1["Replica 1"] - PM1["ProcMesh
1 process
GPU 1"] + PM1["ProcMesh
1 process
GPU 1"] AM1["ActorMesh
1 PolicyActor"] end - + subgraph R2["Replica 2"] PM2["ProcMesh
1 process
GPU 2"] AM2["ActorMesh
1 PolicyActor"] end - + subgraph R3["Replica 3"] PM3["ProcMesh
1 process
GPU 3"] AM3["ActorMesh
1 PolicyActor"] end end - + Call --> ServiceActor ServiceActor --> R0 ServiceActor --> R1 @@ -310,7 +310,7 @@ graph TD PM2 --> AM2 PM3 --> AM3 end - + style ServiceActor fill:#ffcc99 style AM0 fill:#99ff99 style AM1 fill:#99ff99 @@ -324,30 +324,30 @@ graph TD graph TD subgraph CallFlow["Complete Call Flow"] UserCall["await policy_service.generate.choose('What is 2+2?')"] - + ServiceInterface["ServiceInterface
• Receives .choose() call
• Routes to ServiceActor"] - + ServiceActor["ServiceActor
• Selects healthy replica
• Load balancing logic
• Failure handling"] - + SelectedReplica["Selected Replica #2
• ProcMesh with 1 process
• ActorMesh with 1 PolicyActor"] - + PolicyActor["PolicyActor Instance
• Loads model
• Runs vLLM inference
• Returns 'The answer is 4'"] - + GPU["GPU 2
• vLLM engine
• Model weights
• KV cache
• CUDA kernels"] - + UserCall --> ServiceInterface ServiceInterface --> ServiceActor ServiceActor --> SelectedReplica SelectedReplica --> PolicyActor PolicyActor --> GPU - + GPU -.->|"Response"| PolicyActor PolicyActor -.->|"Response"| SelectedReplica SelectedReplica -.->|"Response"| ServiceActor ServiceActor -.->|"Response"| ServiceInterface ServiceInterface -.->|"'The answer is 4'"| UserCall end - + style UserCall fill:#99ff99 style ServiceActor fill:#ffcc99 style PolicyActor fill:#cc99ff @@ -361,32 +361,32 @@ In real RL systems, you have multiple services that can share or use separate Pr ```mermaid graph TD subgraph Cluster["RL Training Cluster"] - subgraph Services["Forge Services"] + subgraph Services["Forge Services"] PS["Policy Service
4 GPU replicas"] - TS["Trainer Service
2 GPU replicas"] + TS["Trainer Service
2 GPU replicas"] RS["Reward Service
4 CPU replicas"] BS["Buffer Service
1 CPU replica"] end - + subgraph MonarchInfra["Monarch Infrastructure"] subgraph GPUMesh["GPU ProcMesh (6 processes)"] G0["Process 0
GPU 0"] G1["Process 1
GPU 1"] - G2["Process 2
GPU 2"] + G2["Process 2
GPU 2"] G3["Process 3
GPU 3"] G4["Process 4
GPU 4"] G5["Process 5
GPU 5"] end - + subgraph CPUMesh["CPU ProcMesh (5 processes)"] C0["Process 0
CPU"] - C1["Process 1
CPU"] + C1["Process 1
CPU"] C2["Process 2
CPU"] C3["Process 3
CPU"] C4["Process 4
CPU"] end end - + PS --> G0 PS --> G1 PS --> G2 @@ -399,7 +399,7 @@ graph TD RS --> C3 BS --> C4 end - + style PS fill:#99ff99 style TS fill:#ff99cc style RS fill:#ffcc99 @@ -411,7 +411,7 @@ graph TD ## Key Insights: Why This Architecture Matters 1. **Process Isolation**: Each actor runs in its own process - failures don't cascade -2. **Location Transparency**: Actors can be local or remote with identical APIs +2. **Location Transparency**: Actors can be local or remote with identical APIs 3. **Structured Distribution**: ProcMesh maps directly to hardware topology 4. **Message Passing**: No shared memory means no race conditions or locks 5. **Service Abstraction**: Forge hides Monarch complexity while preserving power @@ -427,13 +427,13 @@ Understanding this hierarchy helps you: ## What You've Learned 1. **RL Fundamentals**: How RL concepts map to Forge services with REAL, working examples -2. **Service Abstraction**: How to use Forge services effectively with verified communication patterns +2. **Service Abstraction**: How to use Forge services effectively with verified communication patterns 3. **Monarch Foundation**: How Forge services connect to distributed actors and hardware ## Key Takeaways - **Services hide complexity**: Your RL code looks like simple async functions, but runs on distributed clusters -- **Communication patterns matter**: `.route()`, `.fanout()`, sessions, and `.call_one()` each serve specific purposes +- **Communication patterns matter**: `.route()`, `.fanout()`, sessions, and `.call_one()` each serve specific purposes - **Architecture understanding helps**: Knowing the Service → Actor → Process → Hardware hierarchy helps you debug, optimize, and scale - **Always verify APIs**: This guide is verified, but cross-check with source code for latest changes - **Real API patterns**: Use `.options().as_service()` not `spawn_service()`, use `.route()` not `.choose()`, etc. From 07b059777666027e91b00304d1baeafb5470ad9a Mon Sep 17 00:00:00 2001 From: Sanyam Bhutani Date: Thu, 2 Oct 2025 19:40:02 -0700 Subject: [PATCH 10/22] Update 3_Monarch_101.MD --- docs/Tutorials/3_Monarch_101.MD | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/Tutorials/3_Monarch_101.MD b/docs/Tutorials/3_Monarch_101.MD index 7b3f6d310..0b1b4bd79 100644 --- a/docs/Tutorials/3_Monarch_101.MD +++ b/docs/Tutorials/3_Monarch_101.MD @@ -198,7 +198,7 @@ graph TD P3["Process 3
GPU 3"] end - subgraph ActorMesh["ActorMesh[PolicyActor]"] + subgraph ActorMesh["ActorMesh PolicyActor"] A0["PolicyActor
Instance #0
model=Qwen/Qwen3-7B
generation_count=0"] A1["PolicyActor
Instance #1
model=Qwen/Qwen3-7B
generation_count=0"] A2["PolicyActor
Instance #2
model=Qwen/Qwen3-7B
generation_count=0"] From 55dc5b452bd757deaf31f9262976c676556cf05b Mon Sep 17 00:00:00 2001 From: Sanyam Bhutani Date: Thu, 2 Oct 2025 19:40:40 -0700 Subject: [PATCH 11/22] Update 3_Monarch_101.MD --- docs/Tutorials/3_Monarch_101.MD | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/Tutorials/3_Monarch_101.MD b/docs/Tutorials/3_Monarch_101.MD index 0b1b4bd79..52a058dcc 100644 --- a/docs/Tutorials/3_Monarch_101.MD +++ b/docs/Tutorials/3_Monarch_101.MD @@ -1,6 +1,6 @@ # Part 3: The Forge-Monarch Connection -This is part 3 of our series, in the previous sections: we learned [RL Concepts and how they map to Forge](./1_RL_and_Forge_Fundamentals.MD), [Forge Internals](./2_Forge_Internals.MD). +This is part 3 of our series, in the previous sections: we learned Part 1: [RL Concepts and how they map to Forge](./1_RL_and_Forge_Fundamentals.MD), Part 2: [Forge Internals](./2_Forge_Internals.MD). Now let's peel back the layers. Forge services are built on top of **Monarch**, PyTorch's distributed actor framework. Understanding this connection is crucial for optimization and debugging. From 7366497a21f7268bec99fc3d454bf5c40d81dd50 Mon Sep 17 00:00:00 2001 From: Sanyam Bhutani Date: Fri, 3 Oct 2025 00:22:10 -0700 Subject: [PATCH 12/22] fix funcs --- docs/Tutorials/1_RL_and_Forge_Fundamentals.MD | 152 ++++++------- docs/Tutorials/2_Forge_Internals.MD | 200 ++++++++++-------- docs/Tutorials/3_Monarch_101.MD | 14 +- 3 files changed, 199 insertions(+), 167 deletions(-) diff --git a/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD b/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD index 810ef373f..c34ae6639 100644 --- a/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD +++ b/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD @@ -114,33 +114,36 @@ Let's look at the example from above again, but this time we would use the names # Conceptual Example async def conceptual_forge_rl_step(services, step): - # 1. Get a math problem - CONCEPTUAL API - sample = await services['dataloader'].get_sample() - question, target = sample["question"], sample["answer"] + # 1. Get a math problem - Using actual DatasetActor API + sample = await services['dataloader'].sample.call_one() + question, target = sample["request"], sample["target"] - # 2. Student generates answer - CONCEPTUAL API - # Actual method names vary by implementation - responses = await services['policy'].generate(prompt=question) + # 2. Student generates answer - Using actual Policy API + responses = await services['policy'].generate.route(prompt=question) answer = responses[0].text - # 3. Teacher grades it - CONCEPTUAL API - # Actual reward evaluation varies by implementation - score = await services['reward_actor'].evaluate( + # 3. Teacher grades it - Using actual RewardActor API + score = await services['reward_actor'].evaluate_response.route( prompt=question, response=answer, target=target ) - # 4. Compare to baseline - CONCEPTUAL API - ref_logprobs = await services['ref_model'].compute_baseline(responses[0].token_ids) + # 4. Compare to baseline - Using actual ReferenceModel API + # Note: ReferenceModel.forward requires input_ids, max_req_tokens, return_logprobs + ref_logprobs = await services['ref_model'].forward.route( + input_ids, max_req_tokens, return_logprobs=True + ) - # 5. Store experience - CONCEPTUAL Episode structure - # Real Episode structure in src/forge/data_models/episode.py - episode = create_episode(responses[0], score, ref_logprobs, step) - await services['replay_buffer'].store(episode) + # 5. 
Store experience - Using actual Episode structure from apps/grpo/main.py + episode = create_episode_from_response(responses[0], score, ref_logprobs, step) + await services['replay_buffer'].add.call_one(episode) - # 6. Improve student - CONCEPTUAL API - batch = await services['replay_buffer'].get_batch(policy_version=step) + # 6. Improve student - Using actual training pattern + batch = await services['replay_buffer'].sample.call_one( + curr_policy_version=step + ) if batch is not None: - loss = await services['trainer'].update_policy(batch) + inputs, targets = batch + loss = await services['trainer'].train_step.call(inputs, targets) return loss ``` @@ -234,34 +237,38 @@ Let's see how core RL concepts map to Forge services: async def real_rl_training_step(services, step): """Single RL step using verified Forge APIs""" - # 1. Environment interaction - sample = await services['dataloader'].__next__.call_one() - prompt, target = sample["question"], sample["answer"] + # 1. Environment interaction - Using actual DatasetActor API + sample = await services['dataloader'].sample.call_one() + prompt, target = sample["request"], sample["target"] - responses = await services['policy'].generate.route(prompt=prompt) + responses = await services['policy'].generate.route(prompt) - # 2. Reward computation + # 2. Reward computation - Using actual RewardActor API score = await services['reward_actor'].evaluate_response.route( prompt=prompt, response=responses[0].text, target=target ) - # 3. Get reference logprobs - ref_logprobs = await services['ref_model'].forward.route(responses[0].token_ids) + # 3. Get reference logprobs - Using actual ReferenceModel API + # Note: ReferenceModel requires full input_ids tensor, not just tokens + input_ids = torch.cat([responses[0].prompt_ids, responses[0].token_ids]) + ref_logprobs = await services['ref_model'].forward.route( + input_ids.unsqueeze(0), max_req_tokens=512, return_logprobs=True + ) - # 4. Experience storage - Episode creation pattern - # Note: Actual Episode structure requires token tensors, not text + # 4. Experience storage - Using actual Episode pattern from GRPO episode = create_episode_from_response(responses[0], score, ref_logprobs, step) await services['replay_buffer'].add.call_one(episode) - # 5. Learning - trainer endpoint + # 5. Learning - Using actual trainer pattern batch = await services['replay_buffer'].sample.call_one( curr_policy_version=step ) if batch is not None: - loss = await services['trainer'].train_step.call_one(batch) + inputs, targets = batch # GRPO returns (inputs, targets) tuple + loss = await services['trainer'].train_step.call(inputs, targets) - # 6. Policy synchronization - weight update pattern - await services['trainer'].push_weights.call_one(step + 1) + # 6. 
Policy synchronization - Using actual weight update pattern + await services['trainer'].push_weights.call(step + 1) await services['policy'].update_weights.fanout(step + 1) return loss @@ -287,12 +294,14 @@ Forge handles behind the scenes: ### Independent Scaling ```python -from forge.actors.policy import Policy, PolicyConfig, SamplingOverrides, WorkerConfig +from forge.actors.policy import Policy from forge.actors.replay_buffer import ReplayBuffer -from forge.controller.service import shutdown_service -from apps.grpo.main import Trainer, RewardActor, ComputeAdvantages, RefModel, DatasetActor +from forge.actors.reference_model import ReferenceModel +from forge.actors.trainer import RLTrainer +from apps.grpo.main import DatasetActor, RewardActor, ComputeAdvantages from forge.data.rewards import MathReward, ThinkingReward import asyncio +import torch model = "Qwen/Qwen3-1.7B" group_size = 1 @@ -306,67 +315,60 @@ group_size = 1 ref_model, reward_actor, ) = await asyncio.gather( - # Dataset service - spawn_service( - ServiceConfig(procs_per_replica=1, num_replicas=1), - DatasetActor, + # Dataset actor (CPU) + DatasetActor.options(procs=1).as_actor( path="openai/gsm8k", - config_name="main", - split="train", + revision="main", + data_split="train", streaming=True, + model=model, ), # Policy service with GPU - spawn_service( - ServiceConfig(procs_per_replica=1, with_gpus=True, num_replicas=1), - Policy, - config=PolicyConfig( - worker_params=WorkerConfig(model=model), - sampling_params=SamplingOverrides( - num_samples=group_size, max_tokens=16 - ), - ), + Policy.options(procs=1, with_gpus=True, num_replicas=1).as_service( + engine_config={ + "model": model, + "tensor_parallel_size": 1, + "pipeline_parallel_size": 1, + "enforce_eager": False + }, + sampling_config={ + "n": group_size, + "max_tokens": 16, + "temperature": 1.0, + "top_p": 1.0 + } ), - # Trainer service with GPU - spawn_service( - ServiceConfig(procs_per_replica=1, with_gpus=True, num_replicas=1), - Trainer, - learning_rate=1e-5, - beta=0.1, - model_name=model, + # Trainer actor with GPU + RLTrainer.options(procs=1, with_gpus=True).as_actor( + # Trainer config would come from YAML in real usage + model={"name": "qwen3", "flavor": "1.7B", "hf_assets_path": f"hf://{model}"}, + optimizer={"name": "AdamW", "lr": 1e-5}, + training={"local_batch_size": 2, "seq_len": 2048} ), # Replay buffer (CPU) - spawn_service( - ServiceConfig(procs_per_replica=1, num_replicas=1), - ReplayBuffer, + ReplayBuffer.options(procs=1).as_actor( batch_size=2, max_policy_age=1, + dp_size=1 ), # Advantage computation (CPU) - spawn_service( - ServiceConfig(procs_per_replica=1, num_replicas=1), - ComputeAdvantages, - gamma=0.99, - lambda_=0.95, - ), + ComputeAdvantages.options(procs=1).as_actor(), # Reference model with GPU - spawn_service( - ServiceConfig(procs_per_replica=1, num_replicas=1, with_gpus=True), - RefModel, - model_name=model, + ReferenceModel.options(procs=1, with_gpus=True).as_actor( + model={"name": "qwen3", "flavor": "1.7B", "hf_assets_path": f"hf://{model}"}, + training={"dtype": "bfloat16"} ), # Reward actor (CPU) - spawn_service( - ServiceConfig(procs_per_replica=1, num_replicas=1), - RewardActor, - reward_functions=[MathReward(), ThinkingReward()], + RewardActor.options(procs=1, num_replicas=1).as_service( + reward_functions=[MathReward(), ThinkingReward()] ) ) ``` -Production scaling - multiply num_replicas: +Production scaling - multiply num_replicas for services or spawn multiple actors: - Policy: num_replicas=8 for high inference demand - 
RewardActor: num_replicas=16 for parallel evaluation -- Trainer: num_replicas=4 for distributed training +- Trainer: Multiple actors for distributed training (RLTrainer handles this internally) ### Fault Tolerance diff --git a/docs/Tutorials/2_Forge_Internals.MD b/docs/Tutorials/2_Forge_Internals.MD index 9018afe3d..634f04f85 100644 --- a/docs/Tutorials/2_Forge_Internals.MD +++ b/docs/Tutorials/2_Forge_Internals.MD @@ -65,36 +65,44 @@ Policy.options( ### 2. Real Service Creation Services are created using the `spawn_service` function: +Services are created using the `.options().as_service()` pattern from the actual GRPO implementation: -The spawn_service() function automatically handles: +The service creation automatically handles: - Spawning actor replicas across processes/GPUs -- Load balancing with .choose() method +- Load balancing with .route() method for services - Health monitoring and failure recovery - Message routing and serialization ```python -from forge.controller.service import ServiceConfig, spawn_service -from forge.actors.policy import Policy, PolicyConfig, SamplingOverrides, WorkerConfig +from forge.actors.policy import Policy model = "Qwen/Qwen3-1.7B" -policy = await spawn_service( - ServiceConfig(procs_per_replica=1, with_gpus=True, num_replicas=1), - Policy, - config=PolicyConfig( - worker_params=WorkerConfig(model=model), - sampling_params=SamplingOverrides( - num_samples=1, max_tokens=16 - ), - ), +policy = await Policy.options( + procs=1, + with_gpus=True, + num_replicas=1 +).as_service( + engine_config={ + "model": model, + "tensor_parallel_size": 1, + "pipeline_parallel_size": 1, + "enforce_eager": False + }, + sampling_config={ + "n": 1, + "max_tokens": 16, + "temperature": 1.0, + "top_p": 1.0 + } ) prompt = "What is 3 + 5?" -responses = await policy.generate.choose(prompt=prompt) +responses = await policy.generate.route(prompt) print(f"Response: {responses[0].text}") # Cleanup when done -await shutdown_service(policy) +await policy.shutdown() ``` ### 3. 
How Services Actually Work @@ -253,7 +261,6 @@ class CustomPolicy(Policy): # This Counter example demonstrates the session pattern from forge.controller import ForgeActor -from forge.controller.service import ServiceConfig, spawn_service, shutdown_service from monarch.actor import endpoint class ForgeCounter(ForgeActor): @@ -273,37 +280,35 @@ class ForgeCounter(ForgeActor): async def reset(self): self.value = 0 -counter_service = await spawn_service( - ServiceConfig(procs_per_replica=1, num_replicas=4), - ForgeCounter, - initial_value=0 -) +counter_service = await ForgeCounter.options( + procs=1, num_replicas=4 +).as_service(initial_value=0) # Test basic operations -await counter_service.increment.choose() -results = await counter_service.increment.call() +await counter_service.increment.route() +results = await counter_service.increment.fanout() # Get from all replicas print(f"All replica values: {results}") # STICKY SESSIONS print("\nUsing sticky sessions:") async with counter_service.session(): - await counter_service.reset.choose() - print(await counter_service.increment.choose()) # 1 - print(await counter_service.increment.choose()) # 2 - print(await counter_service.increment.choose()) # 3 + await counter_service.reset.route() # Uses .route() within session + print(await counter_service.increment.route()) # 1 + print(await counter_service.increment.route()) # 2 + print(await counter_service.increment.route()) # 3 - final_value = await counter_service.get_value.choose() + final_value = await counter_service.get_value.route() print(f"Final value on this replica: {final_value}") # 3 # Same pattern works with Policy for multi-turn conversations: # async with policy.session(): -# response1 = await policy.generate.choose(prompt=turn1) +# response1 = await policy.generate.route(turn1) # full_prompt = turn1 + response1[0].text + turn2 -# response2 = await policy.generate.choose(prompt=full_prompt) +# response2 = await policy.generate.route(full_prompt) # # Both calls hit same replica, preserving KV cache # Cleanup -await shutdown_service(counter_service) +await counter_service.shutdown() ``` **Performance impact**: Critical for maintaining KV cache in multi-turn conversations. 
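To make the sticky-session pattern concrete, here is a minimal multi-turn sketch. It assumes the `policy` service spawned earlier in this tutorial is already running; the plain string concatenation used to build the conversation is illustrative only (a real setup would apply the model's chat template), and the turn texts are made up for the example.

```python
# Minimal sketch: multi-turn generation pinned to ONE replica via a sticky session.
# Assumes `policy` was created with Policy.options(...).as_service(...) as shown above.
async def multi_turn_conversation(policy):
    turns = [
        "What is 15% of 240?",
        "Now add 12 to that result.",
    ]
    transcript = ""
    async with policy.session():           # every call below routes to the same replica
        for turn in turns:
            transcript += f"User: {turn}\nAssistant: "
            responses = await policy.generate.route(transcript)
            answer = responses[0].text      # Completion object, as in the examples above
            transcript += answer + "\n"     # earlier turns stay in that replica's KV cache
    return transcript
```

Because every `generate` call inside the `async with` block lands on the same replica, the earlier turns do not need to be re-prefilled on a different GPU - which is exactly the KV-cache benefit described above.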
@@ -395,60 +400,72 @@ print(f"Current policy version: {current_version}") Instead of manual coordination, Forge services handle speed mismatches automatically: ```python - from apps.grpo.main import Episode, Group async def simple_rl_step(): # ===== Generate a rollout ===== - sample = await dataloader.__next__.choose() - prompt, target = sample["question"], sample["answer"] + sample = await dataloader.sample.call_one() # DatasetActor is an actor, not service + prompt, target = sample["request"], sample["target"] # Correct field names print(f"Prompt: {prompt}") print(f"Target: {target}") - actions = await policy.generate.choose(prompt=prompt) + actions = await policy.generate.route(prompt=prompt) # Policy is a service print(f"Policy response: {actions[0].text}") - ref_logprobs = await ref_model.forward.choose(actions[0].token_ids) - reward = await reward_actor.evaluate_response.choose( + # Create input tensor for reference model (requires full context) + input_ids = torch.cat([actions[0].prompt_ids, actions[0].token_ids]) + ref_logprobs = await ref_model.forward.route( + input_ids.unsqueeze(0), max_req_tokens=512, return_logprobs=True + ) + reward = await reward_actor.evaluate_response.route( # RewardActor is a service prompt=prompt, response=actions[0].text, target=target ) print(f"Reward: {reward}") + # Create episode using actual GRPO Episode structure episode = Episode( - episode_id=0, - prompt=prompt, - target=target, + episode_id="0", + request=prompt, policy_version=0, + pad_id=tokenizer.pad_token_id, + request_len=512, + response_len=512, + target=target ) - episode.add_group(Group( - response=actions[0].text, - ref_logprobs=ref_logprobs, - reward=reward, - )) + # Add response data + episode.response = actions[0].text + episode.request_tokens = actions[0].prompt_ids.tolist() + episode.response_tokens = actions[0].token_ids.tolist() + episode.ref_logprobs = ref_logprobs[0] # Extract from batch dimension + episode.reward = reward - advantages = await compute_advantages.__call__.choose(episode.groups) - episode.groups[0].advantage = advantages[0] + # Compute advantages using actual ComputeAdvantages actor + group = Group.new_group(0, 1, prompt, 0, tokenizer.pad_token_id, 512, 512, target) + group.episodes[0] = episode + advantages = await compute_advantages.compute.call_one(group) # ComputeAdvantages is an actor + episode.advantage = advantages[0] print(f"Advantage: {advantages[0]}") - await replay_buffer.add.choose(episode) + await replay_buffer.add.call_one(episode) # ReplayBuffer is an actor print("Episode stored in replay buffer") # ===== Train on the batch ===== - batch = await replay_buffer.sample.choose(curr_policy_version=0) + batch = await replay_buffer.sample.call_one(curr_policy_version=0) if batch is not None: print("Training on batch...") - training_result = await trainer.train_step.choose(batch) - loss = training_result.get("loss", 0.0) + inputs, targets = batch # GRPO returns (inputs, targets) tuple + loss = await trainer.train_step.call(inputs, targets) # RLTrainer is an actor print(f"Training loss: {loss}") return loss else: print("Not enough data in buffer yet") return None +# Note: This simplified example assumes tokenizer and services are already initialized for step in range(10): print(f"\n--- RL Step {step + 1} ---") loss = await simple_rl_step() @@ -467,7 +484,7 @@ for step in range(10): policy = await Policy.options( procs=1, num_replicas=8, with_gpus=True # Many replicas for high throughput ).as_service( - engine_config=EngineConfig(model=model_name) + 
engine_config={"model": model_name, "tensor_parallel_size": 1} ) # Reward evaluation might be CPU-bound @@ -479,9 +496,10 @@ reward_actor = await RewardActor.options( # Training needs fewer but more powerful replicas trainer = await RLTrainer.options( - procs=1, num_replicas=2, with_gpus=True # Fewer but GPU-heavy + procs=1, with_gpus=True # Fewer but GPU-heavy ).as_actor( # Trainer typically uses .as_actor() not .as_service() - optimizer=Optimizer(lr=1e-5) + model={"name": "qwen3", "flavor": "1.7B"}, + optimizer={"name": "AdamW", "lr": 1e-5} ) ``` @@ -495,7 +513,6 @@ Let's see how a reward service is actually implemented: from forge.controller import ForgeActor from monarch.actor import endpoint from forge.data.rewards import MathReward, ThinkingReward -from forge.controller.service import ServiceConfig, spawn_service # class definition from apps/grpo/main.py class RewardActor(ForgeActor): @@ -515,9 +532,9 @@ class RewardActor(ForgeActor): # Return average reward across all functions return total_reward / len(self.reward_functions) if self.reward_functions else 0.0 -reward_actor = await spawn_service( - ServiceConfig(procs_per_replica=1, num_replicas=1), - RewardActor, +reward_actor = await RewardActor.options( + procs=1, num_replicas=1 +).as_service( reward_functions=[MathReward(), ThinkingReward()] ) @@ -525,7 +542,7 @@ prompt = "What is 15% of 240?" response = "15% of 240 is 36" target = "36" -score = await reward_actor.evaluate_response.choose( +score = await reward_actor.evaluate_response.route( prompt=prompt, response=response, target=target @@ -533,10 +550,10 @@ score = await reward_actor.evaluate_response.choose( print(f"Reward score: {score}") # Usually around 1.0 for correct math answers # For production scaling - increase num_replicas for parallel evaluation: -# ServiceConfig(procs_per_replica=1, num_replicas=16) # 16 parallel evaluators +# RewardActor.options(procs=1, num_replicas=16) # 16 parallel evaluators # Cleanup when done -await shutdown_service(reward_actor) +await reward_actor.shutdown() ``` ## Service Orchestration: The Training Loop @@ -547,16 +564,15 @@ Now let's see how services coordinate in a real training loop: # This is the REAL way production RL systems are built with Forge import asyncio +import torch from forge.actors.policy import Policy from forge.actors.reference_model import ReferenceModel from forge.actors.replay_buffer import ReplayBuffer from forge.actors.trainer import RLTrainer -from forge.controller.actor import ForgeActor +from apps.grpo.main import DatasetActor, RewardActor, ComputeAdvantages from forge.data.rewards import MathReward, ThinkingReward -from monarch.actor import endpoint -from omegaconf import DictConfig -# Service creation from apps/grpo/main.py lines 322-344 +# Service creation pattern from apps/grpo/main.py lines 322-344 print("Initializing all services...") ( dataloader, @@ -567,17 +583,27 @@ print("Initializing all services...") ref_model, reward_actor, ) = await asyncio.gather( - DatasetActor.options(**cfg.actors.dataset).as_actor(**cfg.dataset), - Policy.options(**cfg.services.policy).as_service(**cfg.policy), - RLTrainer.options(**cfg.actors.trainer).as_actor( - **cfg.trainer, loss=simple_grpo_loss + DatasetActor.options(procs=1).as_actor( + path="openai/gsm8k", revision="main", data_split="train", + streaming=True, model="Qwen/Qwen3-1.7B" + ), + Policy.options(procs=1, with_gpus=True, num_replicas=1).as_service( + engine_config={"model": "Qwen/Qwen3-1.7B", "tensor_parallel_size": 1}, + sampling_config={"n": 1, 
"max_tokens": 512} ), - ReplayBuffer.options(**cfg.actors.replay_buffer).as_actor( - **cfg.replay_buffer, collate=collate + RLTrainer.options(procs=1, with_gpus=True).as_actor( + model={"name": "qwen3", "flavor": "1.7B", "hf_assets_path": "hf://Qwen/Qwen3-1.7B"}, + optimizer={"name": "AdamW", "lr": 1e-5}, + training={"local_batch_size": 2, "seq_len": 2048} ), - ComputeAdvantages.options(**cfg.actors.compute_advantages).as_actor(), - ReferenceModel.options(**cfg.services.ref_model).as_service(**cfg.ref_model), - RewardActor.options(**cfg.services.reward_actor).as_service( + ReplayBuffer.options(procs=1).as_actor( + batch_size=2, max_policy_age=1, dp_size=1 + ), + ComputeAdvantages.options(procs=1).as_actor(), + ReferenceModel.options(procs=1, with_gpus=True).as_actor( + model={"name": "qwen3", "flavor": "1.7B", "hf_assets_path": "hf://Qwen/Qwen3-1.7B"} + ), + RewardActor.options(procs=1, num_replicas=1).as_service( reward_functions=[MathReward(), ThinkingReward()] ), ) @@ -593,10 +619,13 @@ async def production_training_loop(): sample = await dataloader.sample.call_one() # Policy generation service call - responses = await policy.generate.route(prompt=sample["question"]) + responses = await policy.generate.route(sample["request"]) # Correct field name - # Reference computation service call - ref_logprobs = await ref_model.forward.route(responses[0].token_ids) + # Reference computation service call (requires full input tensor) + input_ids = torch.cat([responses[0].prompt_ids, responses[0].token_ids]) + ref_logprobs = await ref_model.forward.route( + input_ids.unsqueeze(0), max_req_tokens=512, return_logprobs=True + ) # Reward evaluation service call reward = await reward_actor.evaluate_response.route( @@ -605,18 +634,19 @@ async def production_training_loop(): target=sample["answer"] ) - # Experience storage (simplified structure for illustration) - episode = create_episode(sample, responses[0], reward, ref_logprobs, step) + # Experience storage (using actual Episode structure) + episode = create_episode_from_grpo_data(sample, responses[0], reward, ref_logprobs[0], step) await replay_buffer.add.call_one(episode) - # Training when ready endpoints + # Training when ready batch = await replay_buffer.sample.call_one(curr_policy_version=step) if batch is not None: - loss = await trainer.train_step.call_one(batch) + inputs, targets = batch # GRPO returns (inputs, targets) tuple + loss = await trainer.train_step.call(inputs, targets) # Weight synchronization pattern - await trainer.push_weights.call_one(step + 1) - await policy.update_weights.route(step + 1) + await trainer.push_weights.call(step + 1) + await policy.update_weights.fanout(step + 1) # Fanout to all replicas print(f"Step {step}, Loss: {loss:.4f}") step += 1 @@ -628,7 +658,7 @@ await asyncio.gather( RLTrainer.shutdown(trainer), ReplayBuffer.shutdown(replay_buffer), ComputeAdvantages.shutdown(compute_advantages), - ref_model.shutdown(), + ReferenceModel.shutdown(ref_model), reward_actor.shutdown(), ) print("All services shut down successfully!") @@ -636,7 +666,7 @@ print("All services shut down successfully!") **Key observations:** 1. **Parallelism**: Independent operations run concurrently -2. **Load balancing**: Each `choose()` call automatically selects optimal replica +2. **Load balancing**: Each `.route()` call automatically selects optimal replica 3. **Fault tolerance**: Failures automatically retry on different replicas 4. **Resource efficiency**: CPU and GPU services scale independently 5. 
**Coordination**: Services coordinate through shared state (replay buffer, weight versions) diff --git a/docs/Tutorials/3_Monarch_101.MD b/docs/Tutorials/3_Monarch_101.MD index 52a058dcc..0cbdcbd88 100644 --- a/docs/Tutorials/3_Monarch_101.MD +++ b/docs/Tutorials/3_Monarch_101.MD @@ -9,11 +9,11 @@ Now let's peel back the layers. Forge services are built on top of **Monarch**, ```mermaid graph TD subgraph YourCode["1. Your RL Code"] - Call["await policy_service.generate.choose('What is 2+2?')"] + Call["await policy_service.generate.route('What is 2+2?')"] end subgraph ForgeServices["2. Forge Service Layer"] - ServiceInterface["ServiceInterface
• Routes .choose() to replica
• Handles load balancing
• Manages health checks"] + ServiceInterface["ServiceInterface
• Routes .route() to replica
• Handles load balancing
• Manages health checks"] ServiceActor["ServiceActor
• Manages replica lifecycle
• Monitors health
• Coordinates failures"] end @@ -167,7 +167,7 @@ await counters.increment.call() value = await counters.get_value.call_one() print(f"One counter: {value}") -# choose() - random single actor +# choose() - random single actor (actors only, not services) value = await counters.get_value.choose() print(f"Random counter: {value}") @@ -273,8 +273,8 @@ Now the key insight: **Forge services are ServiceActors that manage ActorMeshes ```mermaid graph TD - subgraph ServiceCreation["spawn_service() Process"] - Call["await spawn_service(ServiceConfig(num_replicas=4), PolicyActor, model='Qwen')"] + subgraph ServiceCreation["Service Creation Process"] + Call["await PolicyActor.options(num_replicas=4, procs=1).as_service(model='Qwen')"] ServiceActor["ServiceActor
• Manages 4 replicas
• Handles health checks
• Routes service calls"] @@ -323,9 +323,9 @@ graph TD ```mermaid graph TD subgraph CallFlow["Complete Call Flow"] - UserCall["await policy_service.generate.choose('What is 2+2?')"] + UserCall["await policy_service.generate.route('What is 2+2?')"] - ServiceInterface["ServiceInterface
• Receives .choose() call
• Routes to ServiceActor"] + ServiceInterface["ServiceInterface
• Receives .route() call
• Routes to ServiceActor"] ServiceActor["ServiceActor
• Selects healthy replica
• Load balancing logic
• Failure handling"] From 75352ab1090b855dabbe9b7420043a68fe8a7a7b Mon Sep 17 00:00:00 2001 From: Sanyam Bhutani Date: Fri, 3 Oct 2025 13:50:49 -0700 Subject: [PATCH 13/22] Update docs/Tutorials/2_Forge_Internals.MD Co-authored-by: Allen Wang <9057208+allenwang28@users.noreply.github.com> --- docs/Tutorials/2_Forge_Internals.MD | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/Tutorials/2_Forge_Internals.MD b/docs/Tutorials/2_Forge_Internals.MD index 634f04f85..09c39fb7e 100644 --- a/docs/Tutorials/2_Forge_Internals.MD +++ b/docs/Tutorials/2_Forge_Internals.MD @@ -58,7 +58,7 @@ Policy.options( num_replicas=4, # Number of replicas with_gpus=True # Allocate GPUs # Other available options: - # hosts=None + # hosts=None # the number of remote hosts used per replica ) ``` From d0ea7709f8622448934595a9ce3deeafcb14eaec Mon Sep 17 00:00:00 2001 From: Sanyam Bhutani Date: Fri, 10 Oct 2025 14:10:38 -0700 Subject: [PATCH 14/22] update part 1 and 2 --- docs/Tutorials/1_RL_and_Forge_Fundamentals.MD | 4 +- docs/Tutorials/2_Forge_Internals.MD | 56 +------------------ 2 files changed, 3 insertions(+), 57 deletions(-) diff --git a/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD b/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD index c34ae6639..32ada41cb 100644 --- a/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD +++ b/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD @@ -213,7 +213,7 @@ Each step has different: Unlike supervised learning where you process independent batches, RL requires coordination: ```python -# This won't work - creates bottlenecks and resource waste +# While this does work, it creates bottlenecks and resource waste def naive_rl_step(): # Policy waits idle while reward model works response = policy_model.generate(prompt) # GPU busy @@ -368,7 +368,7 @@ group_size = 1 Production scaling - multiply num_replicas for services or spawn multiple actors: - Policy: num_replicas=8 for high inference demand - RewardActor: num_replicas=16 for parallel evaluation -- Trainer: Multiple actors for distributed training (RLTrainer handles this internally) +- Trainer: Multiple processes for distributed training (RLTrainer handles this internally) ### Fault Tolerance diff --git a/docs/Tutorials/2_Forge_Internals.MD b/docs/Tutorials/2_Forge_Internals.MD index 09c39fb7e..c21485bb0 100644 --- a/docs/Tutorials/2_Forge_Internals.MD +++ b/docs/Tutorials/2_Forge_Internals.MD @@ -64,7 +64,6 @@ Policy.options( ### 2. Real Service Creation -Services are created using the `spawn_service` function: Services are created using the `.options().as_service()` pattern from the actual GRPO implementation: The service creation automatically handles: @@ -126,32 +125,6 @@ But behind the scenes: - Handles failures and retries automatically - Returns list[Completion] from the selected replica -### 3. Different Service Types and Their Characteristics - -```mermaid -graph TD - subgraph GPU["GPU-Intensive Services"] - PolicySvc["Policy Service
Large model inference
High GPU memory
Batch optimization"] - TrainerSvc["Trainer Service
Distributed training
Gradient sync
Massive compute"] - RefSvc["Reference Service
Frozen model
Baseline computation
Read-only ops"] - end - - subgraph CPU["CPU-Intensive Services"] - RewardSvc["Reward Service
Evaluation logic
Rule-based scoring
High throughput"] - DataSvc["Data Service
Dataset streaming
Preprocessing
I/O optimization"] - end - - subgraph Memory["Memory-Intensive Services"] - BufferSvc["Buffer Service
Experience storage
Efficient sampling
Persistence"] - MetricsSvc["Metrics Service
Logging aggregation
Performance tracking
Analytics"] - end - - style PolicySvc fill:#ff9999 - style TrainerSvc fill:#ff9999 - style RewardSvc fill:#99ff99 - style BufferSvc fill:#9999ff -``` - ## Deep Dive: Service Communication Patterns These communication patterns (\"adverbs\") determine how your service calls are routed to replicas. Understanding when to use each pattern is key to effective Forge usage. @@ -226,34 +199,7 @@ while training: **Critical insight**: This is essential for high-throughput RL where you can't wait for batches. -### 4. Fire-and-Forget Operations - -**When to use**: Side effects that don't need responses (notifications, cache updates). - -```python -# CONCEPTUAL - Fire-and-forget requires custom @endpoint implementations -# The basic services don't have broadcast methods built-in -# You would implement custom endpoints in your ForgeActor: - -class CustomPolicy(Policy): - @endpoint - async def clear_cache(self) -> None: - """Custom endpoint for cache clearing""" - self.policy_worker.clear_kv_cache() - -# Then use it (hypothetical): -# await custom_policy.clear_cache.fanout() # Clear all replica caches -# Note: Actual cache clearing would use existing Policy methods -``` - -**Performance characteristics**: -- **Latency**: Immediately returns (doesn't wait for completion) -- **Throughput**: Network limited, but non-blocking -- **Fault tolerance**: Fire-and-forget (you don't know if it worked) - -**Critical warning**: Only use for non-critical operations - you get no confirmation. - -### 5. Service Sessions for Stateful Operations +### 3. Service Sessions for Stateful Operations **When to use**: When you need multiple calls to hit the same replica (like KV cache preservation). From 9d4be6073f1de9fa3eb4280f2848a5e7b87102f4 Mon Sep 17 00:00:00 2001 From: Sanyam Bhutani Date: Sun, 12 Oct 2025 11:43:57 -0700 Subject: [PATCH 15/22] address more comments --- docs/Tutorials/1_RL_and_Forge_Fundamentals.MD | 8 ++++---- docs/Tutorials/2_Forge_Internals.MD | 4 ++-- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD b/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD index 32ada41cb..66b32a2b3 100644 --- a/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD +++ b/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD @@ -204,8 +204,8 @@ graph LR Each step has different: - **Latency requirements**: Policy inference needs low latency, training can batch -- **Scaling patterns**: Reward evaluation scales with response count, training with model size -- **Failure modes**: Policy failure stops generation, reward failure affects learning quality +- **Scaling patterns**: Need N policy replicas to keep trainer busy, plus different sharding strategies (tensor parallel for training vs replicated inference) +- **Failure modes**: Any component failure cascades to halt the entire pipeline (Forge prevents this with automatic failover) - **Resource utilization**: GPUs for inference/training, CPUs for data processing ### Problem 3: The Coordination Challenge @@ -229,9 +229,9 @@ def naive_rl_step(): ## Enter Forge: RL-Native Architecture -Forge solves these problems by treating each RL component as an **independent, scalable service** +Forge solves these problems by treating each RL component as an **independent, distributed unit** - some as fault-tolerant services (like Policy inference where failures are easy to handle), others as actors (like Trainers where recovery semantics differ) -Let's see how core RL concepts map to Forge services: +Let's see how core RL concepts map to Forge components 
(you'll notice a mix of `.route()` for services and `.call_one()` for actors - we cover when to use each in Part 2): ```python async def real_rl_training_step(services, step): diff --git a/docs/Tutorials/2_Forge_Internals.MD b/docs/Tutorials/2_Forge_Internals.MD index c21485bb0..2ed3301e5 100644 --- a/docs/Tutorials/2_Forge_Internals.MD +++ b/docs/Tutorials/2_Forge_Internals.MD @@ -140,7 +140,7 @@ answer = responses[0].text # Extract text from Completion object Behind the scenes: 1. Health check eliminates failed replicas -2. Load balancer picks least loaded healthy replica +2. Load balancer picks replica (currently round robin, configurable balancers coming soon) 3. Request routes to that specific replica 4. Automatic retry on different replica if failure @@ -302,7 +302,7 @@ async def optimized_multi_turn(): ```python # Forge ReplayBuffer endpoints (verified from source code) # Add episodes (thread-safe by actor model) -await replay_buffer.add.call_one(episode) # Note: .call_one() not .choose() +await replay_buffer.add.call_one(episode) # .choose() would work too, but .call_one() clarifies it's a singleton actor not ActorMesh # Sample batches for training batch = await replay_buffer.sample.call_one( From 1cebab5d1b3e2c70ab704f9c1589708c9415373a Mon Sep 17 00:00:00 2001 From: Sanyam Bhutani Date: Sun, 12 Oct 2025 11:46:05 -0700 Subject: [PATCH 16/22] fix multi line issue --- docs/Tutorials/1_RL_and_Forge_Fundamentals.MD | 20 +++---- docs/Tutorials/2_Forge_Internals.MD | 14 ++--- docs/Tutorials/3_Monarch_101.MD | 60 +++++++++---------- 3 files changed, 47 insertions(+), 47 deletions(-) diff --git a/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD b/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD index 66b32a2b3..26f90092c 100644 --- a/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD +++ b/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD @@ -9,12 +9,12 @@ Let's start with a simple math tutoring example to understand RL concepts with t ```mermaid graph TD subgraph Example["Math Tutoring RL Example"] - Dataset["Dataset
math problems
'What is 2+2?'"] - Policy["Policy
student AI
generates: 'The answer is 4'"] - Reward["Reward Model
Evaluation Exam
scores: 0.95 (excellent)"] - Reference["Reference Model
original student
baseline comparison"] - ReplayBuffer["Replay Buffer
notebook
stores experiences"] - Trainer["Trainer
tutor
improves student"] + Dataset["Dataset: math problems"] + Policy["Policy: student AI"] + Reward["Reward Model: scores answers"] + Reference["Reference Model: baseline"] + ReplayBuffer["Replay Buffer: stores experiences"] + Trainer["Trainer: improves student"] end Dataset --> Policy @@ -163,13 +163,13 @@ Our simple RL loop above has complex requirements: ```mermaid graph TD subgraph Components["Each Component Needs Different Resources"] - Policy["Policy (Student AI)
Generates: 'The answer is 4'
Needs: Large GPU memory
Scaling: Multiple replicas for speed"] + Policy["Policy (Student AI): Large GPU memory, Multiple replicas"] - Reward["Reward Model (Teacher)
Scores answers: 0.95
Needs: Moderate compute
Scaling: CPU or small GPU"] + Reward["Reward Model (Teacher): Moderate compute, CPU/small GPU"] - Trainer["Trainer (Tutor)
Improves student weights
Needs: Massive GPU compute
Scaling: Distributed training"] + Trainer["Trainer (Tutor): Massive GPU compute, Distributed training"] - Dataset["Dataset (Question Bank)
Provides: 'What is 2+2?'
Needs: CPU intensive I/O
Scaling: High memory bandwidth"] + Dataset["Dataset (Question Bank): CPU intensive I/O, High memory bandwidth"] end style Policy fill:#99ff99 diff --git a/docs/Tutorials/2_Forge_Internals.MD b/docs/Tutorials/2_Forge_Internals.MD index 2ed3301e5..ef53ddfe5 100644 --- a/docs/Tutorials/2_Forge_Internals.MD +++ b/docs/Tutorials/2_Forge_Internals.MD @@ -15,19 +15,19 @@ graph TD Call["Your Code:
await policy_service.generate"] subgraph ServiceLayer["Service Layer"] - Proxy["Service Proxy
Load balancing
Health checking
Request routing"] - LB["Load Balancer
Replica selection
Circuit breaker
Retry logic"] + Proxy["Service Proxy: Load balancing, Health checking"] + LB["Load Balancer: Replica selection, Circuit breaker"] end subgraph Replicas["Replica Management"] - R1["Replica 1
GPU 0
Healthy"] - R2["Replica 2
GPU 1
Overloaded"] - R3["Replica 3
GPU 2
Failed"] - R4["Replica 4
GPU 3
Healthy"] + R1["Replica 1: GPU 0, Healthy"] + R2["Replica 2: GPU 1, Overloaded"] + R3["Replica 3: GPU 2, Failed"] + R4["Replica 4: GPU 3, Healthy"] end subgraph Compute["Actual Computation"] - Actor["Policy Actor
vLLM engine
Model weights
KV cache"] + Actor["Policy Actor: vLLM engine, Model weights, KV cache"] end Call --> Proxy diff --git a/docs/Tutorials/3_Monarch_101.MD b/docs/Tutorials/3_Monarch_101.MD index 0cbdcbd88..502d8a34d 100644 --- a/docs/Tutorials/3_Monarch_101.MD +++ b/docs/Tutorials/3_Monarch_101.MD @@ -13,20 +13,20 @@ graph TD end subgraph ForgeServices["2. Forge Service Layer"] - ServiceInterface["ServiceInterface
• Routes .route() to replica
• Handles load balancing
• Manages health checks"] - ServiceActor["ServiceActor
• Manages replica lifecycle
• Monitors health
• Coordinates failures"] + ServiceInterface["ServiceInterface: Routes requests, Load balancing, Health checks"] + ServiceActor["ServiceActor: Manages replicas, Monitors health, Coordinates failures"] end subgraph MonarchLayer["3. Monarch Actor Layer"] - ActorMesh["ActorMesh PolicyActor
• 4 PolicyActor instances
• Each on different GPU
• Message passing interface"] - ProcMesh["ProcMesh
• 4 processes
• GPU topology: 0,1,2,3
• Network interconnect"] + ActorMesh["ActorMesh PolicyActor: 4 instances, Different GPUs, Message passing"] + ProcMesh["ProcMesh: 4 processes, GPU topology 0,1,2,3, Network interconnect"] end subgraph Hardware["4. Physical Hardware"] - GPU0["GPU 0
PolicyActor #1
vLLM Engine
Model Weights"] - GPU1["GPU 1
PolicyActor #2
vLLM Engine
Model Weights"] - GPU2["GPU 2
PolicyActor #3
vLLM Engine
Model Weights"] - GPU3["GPU 3
PolicyActor #4
vLLM Engine
Model Weights"] + GPU0["GPU 0: PolicyActor #1, vLLM Engine, Model Weights"] + GPU1["GPU 1: PolicyActor #2, vLLM Engine, Model Weights"] + GPU2["GPU 2: PolicyActor #3, vLLM Engine, Model Weights"] + GPU3["GPU 3: PolicyActor #4, vLLM Engine, Model Weights"] end Call --> ServiceInterface @@ -199,10 +199,10 @@ graph TD end subgraph ActorMesh["ActorMesh PolicyActor"] - A0["PolicyActor
Instance #0
model=Qwen/Qwen3-7B
generation_count=0"] - A1["PolicyActor
Instance #1
model=Qwen/Qwen3-7B
generation_count=0"] - A2["PolicyActor
Instance #2
model=Qwen/Qwen3-7B
generation_count=0"] - A3["PolicyActor
Instance #3
model=Qwen/Qwen3-7B
generation_count=0"] + A0["PolicyActor Instance #0: model=Qwen/Qwen3-7B"] + A1["PolicyActor Instance #1: model=Qwen/Qwen3-7B"] + A2["PolicyActor Instance #2: model=Qwen/Qwen3-7B"] + A3["PolicyActor Instance #3: model=Qwen/Qwen3-7B"] end Code --> ProcMesh @@ -226,17 +226,17 @@ graph TD Client["await policy_actors.generate.METHOD(prompt)"] subgraph Methods["Different Adverbs Route Differently"] - Choose["choose()
→ Routes to ONE actor
→ Load balanced"] - Call["call()
→ Routes to ALL actors
→ Collects all results"] - Broadcast["broadcast()
→ Routes to ALL actors
→ Fire and forget"] - Stream["stream()
→ Routes to ALL actors
→ Iterator of results"] + Choose["choose(): Routes to ONE actor, Load balanced"] + Call["call(): Routes to ALL actors, Collects results"] + Broadcast["broadcast(): Routes to ALL actors, Fire and forget"] + Stream["stream(): Routes to ALL actors, Iterator of results"] end subgraph ActorInstances["PolicyActor Instances"] - A0["Actor 0
GPU 0
generates response"] - A1["Actor 1
GPU 1
generates response"] - A2["Actor 2
GPU 2
generates response"] - A3["Actor 3
GPU 3
generates response"] + A0["Actor 0: GPU 0, generates response"] + A1["Actor 1: GPU 1, generates response"] + A2["Actor 2: GPU 2, generates response"] + A3["Actor 3: GPU 3, generates response"] end Client --> Choose @@ -276,26 +276,26 @@ graph TD subgraph ServiceCreation["Service Creation Process"] Call["await PolicyActor.options(num_replicas=4, procs=1).as_service(model='Qwen')"] - ServiceActor["ServiceActor
• Manages 4 replicas
• Handles health checks
• Routes service calls"] + ServiceActor["ServiceActor: Manages 4 replicas, Health checks, Routes calls"] subgraph Replicas["4 Independent Replicas"] subgraph R0["Replica 0"] - PM0["ProcMesh
1 process
GPU 0"] + PM0["ProcMesh: 1 process, GPU 0"] AM0["ActorMesh
1 PolicyActor"] end subgraph R1["Replica 1"] - PM1["ProcMesh
1 process
GPU 1"] + PM1["ProcMesh: 1 process, GPU 1"] AM1["ActorMesh
1 PolicyActor"] end subgraph R2["Replica 2"] - PM2["ProcMesh
1 process
GPU 2"] + PM2["ProcMesh: 1 process, GPU 2"] AM2["ActorMesh
1 PolicyActor"] end subgraph R3["Replica 3"] - PM3["ProcMesh
1 process
GPU 3"] + PM3["ProcMesh: 1 process, GPU 3"] AM3["ActorMesh
1 PolicyActor"] end end @@ -325,15 +325,15 @@ graph TD subgraph CallFlow["Complete Call Flow"] UserCall["await policy_service.generate.route('What is 2+2?')"] - ServiceInterface["ServiceInterface
• Receives .route() call
• Routes to ServiceActor"] + ServiceInterface["ServiceInterface: Receives .route() call, Routes to ServiceActor"] - ServiceActor["ServiceActor
• Selects healthy replica
• Load balancing logic
• Failure handling"] + ServiceActor["ServiceActor: Selects healthy replica, Load balancing, Failure handling"] - SelectedReplica["Selected Replica #2
• ProcMesh with 1 process
• ActorMesh with 1 PolicyActor"] + SelectedReplica["Selected Replica #2: ProcMesh 1 process, ActorMesh 1 PolicyActor"] - PolicyActor["PolicyActor Instance
• Loads model
• Runs vLLM inference
• Returns 'The answer is 4'"] + PolicyActor["PolicyActor Instance: Loads model, Runs vLLM inference"] - GPU["GPU 2
• vLLM engine
• Model weights
• KV cache
• CUDA kernels"] + GPU["GPU 2: vLLM engine, Model weights, KV cache, CUDA kernels"] UserCall --> ServiceInterface ServiceInterface --> ServiceActor From a001a8da844e613e36d4d90fc8a8c59eef636ede Mon Sep 17 00:00:00 2001 From: Sanyam Bhutani Date: Sun, 12 Oct 2025 11:48:16 -0700 Subject: [PATCH 17/22] fix colours --- docs/Tutorials/1_RL_and_Forge_Fundamentals.MD | 32 ++++---- docs/Tutorials/2_Forge_Internals.MD | 8 +- docs/Tutorials/3_Monarch_101.MD | 76 +++++++++---------- 3 files changed, 58 insertions(+), 58 deletions(-) diff --git a/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD b/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD index 26f90092c..2565d626e 100644 --- a/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD +++ b/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD @@ -25,9 +25,9 @@ graph TD ReplayBuffer --> Trainer Trainer --> Policy - style Policy fill:#99ff99 - style Reward fill:#ffcc99 - style Trainer fill:#ff99cc + style Policy fill:#4CAF50 + style Reward fill:#FF9800 + style Trainer fill:#E91E63 ``` ### RL Components Defined (Forge Names) @@ -100,10 +100,10 @@ graph LR C5 --> S5 C6 --> S6 - style C2 fill:#99ff99 - style S2 fill:#99ff99 - style C3 fill:#ffcc99 - style S3 fill:#ffcc99 + style C2 fill:#4CAF50 + style S2 fill:#4CAF50 + style C3 fill:#FF9800 + style S3 fill:#FF9800 ``` ### RL Step with Forge Services @@ -172,10 +172,10 @@ graph TD Dataset["Dataset (Question Bank): CPU intensive I/O, High memory bandwidth"] end - style Policy fill:#99ff99 - style Reward fill:#ffcc99 - style Trainer fill:#ff99cc - style Dataset fill:#ccccff + style Policy fill:#4CAF50 + style Reward fill:#FF9800 + style Trainer fill:#E91E63 + style Dataset fill:#2196F3 ``` ### Problem 2: Complex Interdependencies @@ -195,11 +195,11 @@ graph LR D --> E E --> A - style A fill:#99ff99 - style B fill:#ffcc99 - style C fill:#99ccff - style D fill:#ccff99 - style E fill:#ff99cc + style A fill:#4CAF50 + style B fill:#FF9800 + style C fill:#2196F3 + style D fill:#8BC34A + style E fill:#E91E63 ``` Each step has different: diff --git a/docs/Tutorials/2_Forge_Internals.MD b/docs/Tutorials/2_Forge_Internals.MD index ef53ddfe5..05a40e4a5 100644 --- a/docs/Tutorials/2_Forge_Internals.MD +++ b/docs/Tutorials/2_Forge_Internals.MD @@ -39,10 +39,10 @@ graph TD R1 --> Actor R4 --> Actor - style Call fill:#99ff99 - style LB fill:#ffcc99 - style R3 fill:#ff9999 - style Actor fill:#cc99ff + style Call fill:#4CAF50 + style LB fill:#FF9800 + style R3 fill:#F44336 + style Actor fill:#9C27B0 ``` ## Service Components Deep Dive diff --git a/docs/Tutorials/3_Monarch_101.MD b/docs/Tutorials/3_Monarch_101.MD index 502d8a34d..52bdb17d0 100644 --- a/docs/Tutorials/3_Monarch_101.MD +++ b/docs/Tutorials/3_Monarch_101.MD @@ -38,10 +38,10 @@ graph TD ProcMesh --> GPU2 ProcMesh --> GPU3 - style Call fill:#99ff99 - style ServiceActor fill:#ffcc99 - style ActorMesh fill:#cc99ff - style ProcMesh fill:#ccccff + style Call fill:#4CAF50 + style ServiceActor fill:#FF9800 + style ActorMesh fill:#9C27B0 + style ProcMesh fill:#2196F3 ``` ## Deep Dive: ProcMesh - The Foundation @@ -74,14 +74,14 @@ graph TD P7 -.->|"Network"| P0 end - style P0 fill:#ff9999 - style P1 fill:#ff9999 - style P2 fill:#ff9999 - style P3 fill:#ff9999 - style P4 fill:#ff9999 - style P5 fill:#ff9999 - style P6 fill:#ff9999 - style P7 fill:#ff9999 + style P0 fill:#F44336 + style P1 fill:#F44336 + style P2 fill:#F44336 + style P3 fill:#F44336 + style P4 fill:#F44336 + style P5 fill:#F44336 + style P6 fill:#F44336 + style P7 fill:#F44336 ``` ### Multi-Host ProcMesh @@ -122,9 +122,9 @@ 
graph TD H2P0 -.->|"InfiniBand"| H3P0 H2P1 -.->|"InfiniBand"| H3P1 - style PM1 fill:#ff9999 - style PM2 fill:#99ff99 - style PM3 fill:#99ccff + style PM1 fill:#F44336 + style PM2 fill:#4CAF50 + style PM3 fill:#2196F3 ``` ```python @@ -212,10 +212,10 @@ graph TD P3 --> A3 end - style A0 fill:#99ff99 - style A1 fill:#99ff99 - style A2 fill:#99ff99 - style A3 fill:#99ff99 + style A0 fill:#4CAF50 + style A1 fill:#4CAF50 + style A2 fill:#4CAF50 + style A3 fill:#4CAF50 ``` ### Message Routing Through ActorMesh @@ -259,10 +259,10 @@ graph TD Stream --> A3 end - style Choose fill:#99ff99 - style Call fill:#ffcc99 - style Broadcast fill:#ff99cc - style Stream fill:#cc99ff + style Choose fill:#4CAF50 + style Call fill:#FF9800 + style Broadcast fill:#E91E63 + style Stream fill:#9C27B0 ``` ## How Forge Services Use Monarch @@ -311,11 +311,11 @@ graph TD PM3 --> AM3 end - style ServiceActor fill:#ffcc99 - style AM0 fill:#99ff99 - style AM1 fill:#99ff99 - style AM2 fill:#99ff99 - style AM3 fill:#99ff99 + style ServiceActor fill:#FF9800 + style AM0 fill:#4CAF50 + style AM1 fill:#4CAF50 + style AM2 fill:#4CAF50 + style AM3 fill:#4CAF50 ``` ### Service Call to Actor Execution @@ -348,10 +348,10 @@ graph TD ServiceInterface -.->|"'The answer is 4'"| UserCall end - style UserCall fill:#99ff99 - style ServiceActor fill:#ffcc99 - style PolicyActor fill:#cc99ff - style GPU fill:#ffcccc + style UserCall fill:#4CAF50 + style ServiceActor fill:#FF9800 + style PolicyActor fill:#9C27B0 + style GPU fill:#FF5722 ``` ## Multiple Services Sharing Infrastructure @@ -400,12 +400,12 @@ graph TD BS --> C4 end - style PS fill:#99ff99 - style TS fill:#ff99cc - style RS fill:#ffcc99 - style BS fill:#cc99ff - style GPUMesh fill:#ffe6e6 - style CPUMesh fill:#e6f3ff + style PS fill:#4CAF50 + style TS fill:#E91E63 + style RS fill:#FF9800 + style BS fill:#9C27B0 + style GPUMesh fill:#FFEBEE + style CPUMesh fill:#E3F2FD ``` ## Key Insights: Why This Architecture Matters From 6b23e25fa48881e3869ad82fb6504eba99063ab6 Mon Sep 17 00:00:00 2001 From: Sanyam Bhutani Date: Sun, 12 Oct 2025 12:03:09 -0700 Subject: [PATCH 18/22] fix linter and ohter comments --- docs/Tutorials/1_RL_and_Forge_Fundamentals.MD | 114 +++++++++--------- docs/Tutorials/2_Forge_Internals.MD | 98 +++++++-------- docs/Tutorials/ReadMe.MD | 6 +- 3 files changed, 109 insertions(+), 109 deletions(-) diff --git a/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD b/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD index 2565d626e..39b6d62aa 100644 --- a/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD +++ b/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD @@ -16,7 +16,7 @@ graph TD ReplayBuffer["Replay Buffer: stores experiences"] Trainer["Trainer: improves student"] end - + Dataset --> Policy Policy --> Reward Policy --> Reference @@ -24,7 +24,7 @@ graph TD Reference --> ReplayBuffer ReplayBuffer --> Trainer Trainer --> Policy - + style Policy fill:#4CAF50 style Reward fill:#FF9800 style Trainer fill:#E91E63 @@ -47,25 +47,25 @@ graph TD def conceptual_rl_step(): # 1. Get a math problem question = dataset.sample() # "What is 2+2?" - - # 2. Student generates answer + + # 2. Student generates answer answer = policy.generate(question) # "The answer is 4" - + # 3. Teacher grades it score = reward_model.evaluate(question, answer) # 0.95 - + # 4. Compare to original student baseline = reference_model.compute_logprobs(question, answer) - + # 5. Store the experience experience = Episode(question, answer, score, baseline) replay_buffer.add(experience) - + # 6. 
When enough experiences collected, improve student batch = replay_buffer.sample(curr_policy_version=0) if batch is not None: trainer.train_step(batch) # Student gets better! - + # 🔄 See complete working example below with actual Forge service calls ``` @@ -83,7 +83,7 @@ graph LR C5["Replay Buffer"] C6["Trainer"] end - + subgraph Services["Forge Services (Real Classes)"] S1["DatasetActor"] S2["Policy"] @@ -92,14 +92,14 @@ graph LR S5["ReplayBuffer"] S6["RLTrainer"] end - + C1 --> S1 C2 --> S2 C3 --> S3 C4 --> S4 C5 --> S5 C6 --> S6 - + style C2 fill:#4CAF50 style S2 fill:#4CAF50 style C3 fill:#FF9800 @@ -117,26 +117,26 @@ async def conceptual_forge_rl_step(services, step): # 1. Get a math problem - Using actual DatasetActor API sample = await services['dataloader'].sample.call_one() question, target = sample["request"], sample["target"] - + # 2. Student generates answer - Using actual Policy API responses = await services['policy'].generate.route(prompt=question) - answer = responses[0].text - + answer = responses[0].text + # 3. Teacher grades it - Using actual RewardActor API score = await services['reward_actor'].evaluate_response.route( prompt=question, response=answer, target=target ) - + # 4. Compare to baseline - Using actual ReferenceModel API # Note: ReferenceModel.forward requires input_ids, max_req_tokens, return_logprobs ref_logprobs = await services['ref_model'].forward.route( input_ids, max_req_tokens, return_logprobs=True ) - + # 5. Store experience - Using actual Episode structure from apps/grpo/main.py episode = create_episode_from_response(responses[0], score, ref_logprobs, step) await services['replay_buffer'].add.call_one(episode) - + # 6. Improve student - Using actual training pattern batch = await services['replay_buffer'].sample.call_one( curr_policy_version=step @@ -160,23 +160,12 @@ Our simple RL loop above has complex requirements: #### Problem 1: Different Resource Needs -```mermaid -graph TD - subgraph Components["Each Component Needs Different Resources"] - Policy["Policy (Student AI): Large GPU memory, Multiple replicas"] - - Reward["Reward Model (Teacher): Moderate compute, CPU/small GPU"] - - Trainer["Trainer (Tutor): Massive GPU compute, Distributed training"] - - Dataset["Dataset (Question Bank): CPU intensive I/O, High memory bandwidth"] - end - - style Policy fill:#4CAF50 - style Reward fill:#FF9800 - style Trainer fill:#E91E63 - style Dataset fill:#2196F3 -``` +| Component | Resource Needs | Scaling Strategy | +|-----------|----------------|------------------| +| **Policy** (Student AI) | Large GPU memory | Multiple replicas for throughput | +| **Reward Heuristic** (Teacher) | Small compute | CPU or small GPU | +| **Trainer** (Tutor) | Massive GPU compute | Distributed training | +| **Dataset** (Question Bank) | CPU intensive I/O | High memory bandwidth | ### Problem 2: Complex Interdependencies @@ -187,14 +176,14 @@ graph LR C["Reference: Original Student
Provides baseline comparison"] D["Replay Buffer: Notebook
Stores: question + answer + score"] E["Trainer: Tutor
Improves student using experiences"] - + A --> B A --> C B --> D C --> D D --> E E --> A - + style A fill:#4CAF50 style B fill:#FF9800 style C fill:#2196F3 @@ -203,7 +192,7 @@ graph LR ``` Each step has different: -- **Latency requirements**: Policy inference needs low latency, training can batch +- **Latency requirements**: Policy inference needs low latency (each episode waits), training can batch multiple episodes together - **Scaling patterns**: Need N policy replicas to keep trainer busy, plus different sharding strategies (tensor parallel for training vs replicated inference) - **Failure modes**: Any component failure cascades to halt the entire pipeline (Forge prevents this with automatic failover) - **Resource utilization**: GPUs for inference/training, CPUs for data processing @@ -218,10 +207,10 @@ def naive_rl_step(): # Policy waits idle while reward model works response = policy_model.generate(prompt) # GPU busy reward = reward_model.evaluate(prompt, response) # Policy GPU idle - - # Training waits for single episode + + # Training waits for single episode loss = compute_loss(response, reward) # Batch size = 1, inefficient - + # Everything stops if any component fails if policy_fails or reward_fails or trainer_fails: entire_system_stops() @@ -233,32 +222,37 @@ Forge solves these problems by treating each RL component as an **independent, d Let's see how core RL concepts map to Forge components (you'll notice a mix of `.route()` for services and `.call_one()` for actors - we cover when to use each in Part 2): +**Quick API Reference:** (covered in detail in Part 2: Service Communication Patterns) +- `.route()` - Send request to any healthy replica in a service (load balanced) +- `.call_one()` - Send request to a single actor instance +- `.fanout()` - Send request to ALL replicas in a service + ```python async def real_rl_training_step(services, step): """Single RL step using verified Forge APIs""" - + # 1. Environment interaction - Using actual DatasetActor API sample = await services['dataloader'].sample.call_one() prompt, target = sample["request"], sample["target"] - + responses = await services['policy'].generate.route(prompt) - + # 2. Reward computation - Using actual RewardActor API score = await services['reward_actor'].evaluate_response.route( prompt=prompt, response=responses[0].text, target=target ) - + # 3. Get reference logprobs - Using actual ReferenceModel API # Note: ReferenceModel requires full input_ids tensor, not just tokens input_ids = torch.cat([responses[0].prompt_ids, responses[0].token_ids]) ref_logprobs = await services['ref_model'].forward.route( input_ids.unsqueeze(0), max_req_tokens=512, return_logprobs=True ) - + # 4. Experience storage - Using actual Episode pattern from GRPO episode = create_episode_from_response(responses[0], score, ref_logprobs, step) await services['replay_buffer'].add.call_one(episode) - + # 5. Learning - Using actual trainer pattern batch = await services['replay_buffer'].sample.call_one( curr_policy_version=step @@ -266,11 +260,11 @@ async def real_rl_training_step(services, step): if batch is not None: inputs, targets = batch # GRPO returns (inputs, targets) tuple loss = await services['trainer'].train_step.call(inputs, targets) - + # 6. 
Policy synchronization - Using actual weight update pattern await services['trainer'].push_weights.call(step + 1) await services['policy'].update_weights.fanout(step + 1) - + return loss ``` @@ -286,7 +280,7 @@ answer = responses[0].text # responses is list[Completion] Forge handles behind the scenes: - Routing to least loaded replica -- GPU memory management +- GPU memory management - Batch optimization - Failure recovery - Auto-scaling based on demand @@ -365,10 +359,16 @@ group_size = 1 ) ``` -Production scaling - multiply num_replicas for services or spawn multiple actors: -- Policy: num_replicas=8 for high inference demand -- RewardActor: num_replicas=16 for parallel evaluation -- Trainer: Multiple processes for distributed training (RLTrainer handles this internally) +**Forge Components: Services vs Actors** + +Forge has two types of distributed components: +- **Services**: Multiple replicas with automatic load balancing (like Policy, RewardActor) +- **Actors**: Single instances that handle their own internal distribution (like RLTrainer, ReplayBuffer) + +We cover this distinction in detail in Part 2, but for now this explains the scaling patterns: +- Policy service: num_replicas=8 for high inference demand +- RewardActor service: num_replicas=16 for parallel evaluation +- RLTrainer actor: Single instance with internal distributed training ### Fault Tolerance @@ -377,13 +377,13 @@ Production scaling - multiply num_replicas for services or spawn multiple actors responses = await policy.generate.route(prompt=question) answer = responses[0].text # -> Forge automatically routes to healthy replica -# -> Failed replica respawns in background +# -> Failed replica respawns in background # -> No impact on training loop # If reward service fails: score = await reward_actor.evaluate_response.route( prompt=question, response=answer, target=target -) +) ``` - Retries on different replica automatically @@ -392,4 +392,4 @@ score = await reward_actor.evaluate_response.route( This is fundamentally different from monolithic RL implementations where any component failure stops everything! -In the next Section, we will go a layer deeper and learn how ForgeServices work. Continue to [Part 2 here](./2_Forge_Internals.MD) \ No newline at end of file +In the next Section, we will go a layer deeper and learn how ForgeServices work. Continue to [Part 2 here](./2_Forge_Internals.MD) diff --git a/docs/Tutorials/2_Forge_Internals.MD b/docs/Tutorials/2_Forge_Internals.MD index 05a40e4a5..e1af9cde3 100644 --- a/docs/Tutorials/2_Forge_Internals.MD +++ b/docs/Tutorials/2_Forge_Internals.MD @@ -13,23 +13,23 @@ When you call `await policy_service.generate(question)`, here's what actually ha ```mermaid graph TD Call["Your Code:
await policy_service.generate"] - + subgraph ServiceLayer["Service Layer"] Proxy["Service Proxy: Load balancing, Health checking"] LB["Load Balancer: Replica selection, Circuit breaker"] end - + subgraph Replicas["Replica Management"] R1["Replica 1: GPU 0, Healthy"] R2["Replica 2: GPU 1, Overloaded"] R3["Replica 3: GPU 2, Failed"] R4["Replica 4: GPU 3, Healthy"] end - + subgraph Compute["Actual Computation"] Actor["Policy Actor: vLLM engine, Model weights, KV cache"] end - + Call --> Proxy Proxy --> LB LB --> R1 @@ -38,7 +38,7 @@ graph TD LB --> R4 R1 --> Actor R4 --> Actor - + style Call fill:#4CAF50 style LB fill:#FF9800 style R3 fill:#F44336 @@ -55,7 +55,7 @@ Here's the actual ServiceConfig from Forge source code: # Configuration pattern from apps/grpo/main.py: Policy.options( procs=1, # Processes per replica - num_replicas=4, # Number of replicas + num_replicas=4, # Number of replicas with_gpus=True # Allocate GPUs # Other available options: # hosts=None # the number of remote hosts used per replica @@ -69,7 +69,7 @@ Services are created using the `.options().as_service()` pattern from the actual The service creation automatically handles: - Spawning actor replicas across processes/GPUs - Load balancing with .route() method for services -- Health monitoring and failure recovery +- Health monitoring and failure recovery - Message routing and serialization ```python @@ -78,8 +78,8 @@ from forge.actors.policy import Policy model = "Qwen/Qwen3-1.7B" policy = await Policy.options( - procs=1, - with_gpus=True, + procs=1, + with_gpus=True, num_replicas=1 ).as_service( engine_config={ @@ -158,7 +158,7 @@ Behind the scenes: ```python # Get version from all policy replicas current_versions = await policy.get_version.fanout() -# Returns: [version_replica_1, version_replica_2, ...] +# Returns: [version_replica_1, version_replica_2, ...] # Update weights on all replicas await policy.update_weights.fanout(new_policy_version) @@ -193,8 +193,8 @@ while training: ``` **Performance characteristics**: -- **Latency**: Process first result immediately -- **Throughput**: Pipeline parallelism (much higher than sequential) +- **Latency**: Process first result immediately +- **Throughput**: Non-blocking async operations (much higher than waiting for full batches) - **Fault tolerance**: Continues if some replicas fail **Critical insight**: This is essential for high-throughput RL where you can't wait for batches. @@ -242,7 +242,7 @@ async with counter_service.session(): print(await counter_service.increment.route()) # 1 print(await counter_service.increment.route()) # 2 print(await counter_service.increment.route()) # 3 - + final_value = await counter_service.get_value.route() print(f"Final value on this replica: {final_value}") # 3 @@ -263,7 +263,7 @@ await counter_service.shutdown() The most complex challenge in distributed RL is maintaining state consistency while maximizing performance. -### The KV Cache Problem +### The KV Cache Problem **The challenge**: Policy inference is much faster with KV cache, but cache is tied to specific conversation history. @@ -278,16 +278,16 @@ async def naive_multi_turn(): **The solution**: Sticky sessions ensure all calls go to same replica. 
-```python +```python async def optimized_multi_turn(): async with policy.session(): # All calls guaranteed to hit same replica = cache hits response1 = await policy.generate.route(prompt=question1) - full_prompt = question1 + response1[0].text + full_prompt = question1 + response1[0].text response2 = await policy.generate.route(prompt=full_prompt) # Cache hit! conversation = full_prompt + response2[0].text response3 = await policy.generate.route(prompt=conversation) # Cache hit! - + # Session ends, replica can be garbage collected or reused ``` @@ -327,11 +327,11 @@ batch = await replay_buffer.sample.call_one( async def real_weight_sync(trainer, policy, step): # Trainer pushes weights to TorchStore with version number await trainer.push_weights.call_one(policy_version=step + 1) - - # Policy service updates to new version from TorchStore + + # Policy service updates to new version from TorchStore # Use .fanout() to update ALL policy replicas await policy.update_weights.fanout(policy_version=step + 1) - + # Check current policy version current_version = await policy.get_version.route() print(f"Current policy version: {current_version}") @@ -349,29 +349,29 @@ Instead of manual coordination, Forge services handle speed mismatches automatic from apps.grpo.main import Episode, Group async def simple_rl_step(): - + # ===== Generate a rollout ===== sample = await dataloader.sample.call_one() # DatasetActor is an actor, not service prompt, target = sample["request"], sample["target"] # Correct field names - + print(f"Prompt: {prompt}") print(f"Target: {target}") - + actions = await policy.generate.route(prompt=prompt) # Policy is a service print(f"Policy response: {actions[0].text}") - + # Create input tensor for reference model (requires full context) input_ids = torch.cat([actions[0].prompt_ids, actions[0].token_ids]) ref_logprobs = await ref_model.forward.route( input_ids.unsqueeze(0), max_req_tokens=512, return_logprobs=True - ) + ) reward = await reward_actor.evaluate_response.route( # RewardActor is a service - prompt=prompt, - response=actions[0].text, + prompt=prompt, + response=actions[0].text, target=target ) print(f"Reward: {reward}") - + # Create episode using actual GRPO Episode structure episode = Episode( episode_id="0", @@ -382,24 +382,24 @@ async def simple_rl_step(): response_len=512, target=target ) - + # Add response data episode.response = actions[0].text episode.request_tokens = actions[0].prompt_ids.tolist() episode.response_tokens = actions[0].token_ids.tolist() episode.ref_logprobs = ref_logprobs[0] # Extract from batch dimension episode.reward = reward - + # Compute advantages using actual ComputeAdvantages actor group = Group.new_group(0, 1, prompt, 0, tokenizer.pad_token_id, 512, 512, target) group.episodes[0] = episode advantages = await compute_advantages.compute.call_one(group) # ComputeAdvantages is an actor episode.advantage = advantages[0] - print(f"Advantage: {advantages[0]}") + print(f"Advantage: {advantages[0]}") await replay_buffer.add.call_one(episode) # ReplayBuffer is an actor print("Episode stored in replay buffer") - - # ===== Train on the batch ===== + + # ===== Train on the batch ===== batch = await replay_buffer.sample.call_one(curr_policy_version=0) if batch is not None: print("Training on batch...") @@ -469,12 +469,12 @@ class RewardActor(ForgeActor): async def evaluate_response(self, prompt: str, response: str, target: str) -> float: """Evaluate response quality using multiple reward functions""" total_reward = 0.0 - + for reward_fn in 
self.reward_functions: # Each reward function contributes to total score reward = reward_fn(prompt, response, target) total_reward += reward - + # Return average reward across all functions return total_reward / len(self.reward_functions) if self.reward_functions else 0.0 @@ -490,7 +490,7 @@ target = "36" score = await reward_actor.evaluate_response.route( prompt=prompt, - response=response, + response=response, target=target ) print(f"Reward score: {score}") # Usually around 1.0 for correct math answers @@ -530,7 +530,7 @@ print("Initializing all services...") reward_actor, ) = await asyncio.gather( DatasetActor.options(procs=1).as_actor( - path="openai/gsm8k", revision="main", data_split="train", + path="openai/gsm8k", revision="main", data_split="train", streaming=True, model="Qwen/Qwen3-1.7B" ), Policy.options(procs=1, with_gpus=True, num_replicas=1).as_service( @@ -559,41 +559,41 @@ print("All services initialized successfully!") async def production_training_loop(): """Real training loop pattern from apps/grpo/main.py""" step = 0 - + while True: - # Data generation + # Data generation sample = await dataloader.sample.call_one() - + # Policy generation service call responses = await policy.generate.route(sample["request"]) # Correct field name - + # Reference computation service call (requires full input tensor) input_ids = torch.cat([responses[0].prompt_ids, responses[0].token_ids]) ref_logprobs = await ref_model.forward.route( input_ids.unsqueeze(0), max_req_tokens=512, return_logprobs=True ) - - # Reward evaluation service call + + # Reward evaluation service call reward = await reward_actor.evaluate_response.route( prompt=sample["question"], response=responses[0].text, target=sample["answer"] ) - + # Experience storage (using actual Episode structure) episode = create_episode_from_grpo_data(sample, responses[0], reward, ref_logprobs[0], step) await replay_buffer.add.call_one(episode) - + # Training when ready batch = await replay_buffer.sample.call_one(curr_policy_version=step) if batch is not None: inputs, targets = batch # GRPO returns (inputs, targets) tuple loss = await trainer.train_step.call(inputs, targets) - + # Weight synchronization pattern await trainer.push_weights.call(step + 1) await policy.update_weights.fanout(step + 1) # Fanout to all replicas - + print(f"Step {step}, Loss: {loss:.4f}") step += 1 @@ -612,11 +612,11 @@ print("All services shut down successfully!") **Key observations:** 1. **Parallelism**: Independent operations run concurrently -2. **Load balancing**: Each `.route()` call automatically selects optimal replica +2. **Load balancing**: Each `.route()` call automatically selects optimal replica 3. **Fault tolerance**: Failures automatically retry on different replicas 4. **Resource efficiency**: CPU and GPU services scale independently 5. **Coordination**: Services coordinate through shared state (replay buffer, weight versions) This is the power of the service abstraction - complex distributed coordination looks like simple async Python code. 
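To make the parallelism observation concrete, here is a minimal sketch that overlaps two independent service calls with `asyncio.gather`. It reuses the service handles and call signatures shown in the loop above; the helper name and exact wiring are illustrative assumptions, not a prescribed Forge pattern:

```python
# Hypothetical sketch: reward evaluation and reference logprobs don't depend on
# each other, so they can run concurrently instead of one after the other.
import asyncio
import torch

async def score_concurrently(policy, ref_model, reward_actor, sample):
    responses = await policy.generate.route(sample["request"])
    input_ids = torch.cat([responses[0].prompt_ids, responses[0].token_ids])

    # Launch both calls at once; each one routes to its own service replica
    ref_logprobs, reward = await asyncio.gather(
        ref_model.forward.route(
            input_ids.unsqueeze(0), max_req_tokens=512, return_logprobs=True
        ),
        reward_actor.evaluate_response.route(
            prompt=sample["question"],
            response=responses[0].text,
            target=sample["answer"],
        ),
    )
    return responses[0], reward, ref_logprobs
```

Nothing about the individual service calls changes here; only the orchestration around them does.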
-In the next part we will learn about [Monarch internals](./3_Monarch_101.MD) \ No newline at end of file +In the next part we will learn about [Monarch internals](./3_Monarch_101.MD) diff --git a/docs/Tutorials/ReadMe.MD b/docs/Tutorials/ReadMe.MD index 7798b147d..084710853 100644 --- a/docs/Tutorials/ReadMe.MD +++ b/docs/Tutorials/ReadMe.MD @@ -4,7 +4,7 @@ A comprehensive guide for ML Engineers building distributed RL systems for langu Some of the examples mentioned below will be conceptual in nature for understanding. Please refer to API Docs (Coming Soon!) for more details -Welcome to the Tutorials section! This section is inspired by the A-Z PyTorch tutorial, shoutout to our PyTorch friends that remember! +Welcome to the Tutorials section! This section is inspired by the A-Z PyTorch tutorial, shoutout to our PyTorch friends that remember! ### @@ -14,6 +14,6 @@ This section currently is structured in 3 detailed parts: 2. [Forge Internals](./2_Forge_Internals.MD): Goes a layer deeper and explains the internals of Forge 3. [Monarch 101](./3_Monarch_101.MD): It's a 101 to Monarch and how Forge Talks to Monarch -Each part builds upon the next and the entire section can be consumed in roughly an hour-Grab a Chai and Enjoy! +Each part builds upon the next and the entire section can be consumed in roughly an hour-Grab a Chai and Enjoy! -If you're eager, please checkout our SFT Tutorial too (Coming soon!) as well as [App Examples](../../apps/). \ No newline at end of file +If you're eager, please checkout our SFT Tutorial too (Coming soon!) as well as [App Examples](../../apps/). From d03e84a851bb6adfdd2bd4e8c418e6d62a3d4ace Mon Sep 17 00:00:00 2001 From: Sanyam Bhutani Date: Mon, 13 Oct 2025 15:04:16 -0700 Subject: [PATCH 19/22] address felipe's comments, add image and fix sticky session examples --- docs/Tutorials/2_Forge_Internals.MD | 83 +++++++++++++++++++++++------ 1 file changed, 66 insertions(+), 17 deletions(-) diff --git a/docs/Tutorials/2_Forge_Internals.MD b/docs/Tutorials/2_Forge_Internals.MD index e1af9cde3..8189cf8a5 100644 --- a/docs/Tutorials/2_Forge_Internals.MD +++ b/docs/Tutorials/2_Forge_Internals.MD @@ -108,22 +108,54 @@ await policy.shutdown() Forge services are implemented as ServiceActors that manage collections of your ForgeActor replicas: -Forge internals - What happens behind the scenes: -1. `.as_service()` creates a `ServiceInterface` -2. `ServiceInterface` manages N replicas of your `ForgeActor` class -3. `ServiceInterface` handles routing between replicas -4. You get methods like `.route()`, `.fanout()`, etc. +When you call `.as_service()`, Forge creates a `ServiceInterface` that manages N replicas of your `ForgeActor` class and gives you methods like `.route()`, `.fanout()`, etc. 
```python -# Your code sees this: +# Your code sees this simple interface: responses = await policy.generate.route(prompt=prompt) +# But Forge handles all the complexity of replica management, load balancing, and fault tolerance ``` -But behind the scenes: -- `ServiceInterface` selects healthy replica -- Routes message to that replica's `Policy.generate()` endpoint -- Handles failures and retries automatically -- Returns list[Completion] from the selected replica +## Communication Patterns: Quick Reference + +**API Summary:** +- `.route()` - Send request to any healthy replica in a service (load balanced) +- `.call_one()` - Send request to a single actor instance +- `.fanout()` - Send request to ALL replicas in a service + +```mermaid +graph LR + subgraph Request["Your Request"] + Code["await service.method.ADVERB()"] + end + + subgraph Patterns["Communication Patterns"] + Route[".route()
→ One healthy replica"] + CallOne[".call_one()
→ Single actor"] + Fanout[".fanout()
→ ALL replicas"] + end + + subgraph Replicas["Replicas/Actors"] + R1["Replica 1"] + R2["Replica 2"] + R3["Replica 3"] + A1["Actor"] + end + + Code --> Route + Code --> CallOne + Code --> Fanout + + Route --> R2 + CallOne --> A1 + Fanout --> R1 + Fanout --> R2 + Fanout --> R3 + + style Route fill:#4CAF50 + style CallOne fill:#FF9800 + style Fanout fill:#9C27B0 +``` ## Deep Dive: Service Communication Patterns @@ -203,8 +235,10 @@ while training: **When to use**: When you need multiple calls to hit the same replica (like KV cache preservation). +**What are sticky sessions?** A session ensures all your service calls within the `async with` block go to the same replica, instead of being load-balanced across different replicas. + ```python -# This Counter example demonstrates the session pattern +# This Counter example demonstrates the difference between regular routing and sessions from forge.controller import ForgeActor from monarch.actor import endpoint @@ -230,22 +264,37 @@ counter_service = await ForgeCounter.options( procs=1, num_replicas=4 ).as_service(initial_value=0) -# Test basic operations -await counter_service.increment.route() +# WITHOUT SESSIONS: Each .route() call goes to a different replica +await counter_service.increment.route() # Might go to replica 2 +await counter_service.increment.route() # Might go to replica 1 +await counter_service.increment.route() # Might go to replica 3 + results = await counter_service.increment.fanout() # Get from all replicas print(f"All replica values: {results}") +# Output: All replica values: [1, 2, 1, 1] - Each replica has different state! +``` -# STICKY SESSIONS +The problem: each `.route()` call can go to different replicas, creating inconsistent state. + +```python +# WITH SESSIONS: All calls go to the SAME replica print("\nUsing sticky sessions:") -async with counter_service.session(): +async with counter_service.session(): # Creates a session that picks one replica await counter_service.reset.route() # Uses .route() within session print(await counter_service.increment.route()) # 1 - print(await counter_service.increment.route()) # 2 + print(await counter_service.increment.route()) # 2 print(await counter_service.increment.route()) # 3 final_value = await counter_service.get_value.route() print(f"Final value on this replica: {final_value}") # 3 +# Output: +# Using sticky sessions: +# 1 +# 2 +# 3 +# Final value on this replica: 3 + # Same pattern works with Policy for multi-turn conversations: # async with policy.session(): # response1 = await policy.generate.route(turn1) From 6ace584dcfd84a3b5e17a357f23b98e6d0d52c69 Mon Sep 17 00:00:00 2001 From: Sanyam Bhutani Date: Mon, 13 Oct 2025 15:07:27 -0700 Subject: [PATCH 20/22] fix PR tests --- docs/Tutorials/2_Forge_Internals.MD | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/docs/Tutorials/2_Forge_Internals.MD b/docs/Tutorials/2_Forge_Internals.MD index 8189cf8a5..1a9421a96 100644 --- a/docs/Tutorials/2_Forge_Internals.MD +++ b/docs/Tutorials/2_Forge_Internals.MD @@ -120,7 +120,7 @@ responses = await policy.generate.route(prompt=prompt) **API Summary:** - `.route()` - Send request to any healthy replica in a service (load balanced) -- `.call_one()` - Send request to a single actor instance +- `.call_one()` - Send request to a single actor instance - `.fanout()` - Send request to ALL replicas in a service ```mermaid @@ -128,30 +128,30 @@ graph LR subgraph Request["Your Request"] Code["await service.method.ADVERB()"] end - + subgraph Patterns["Communication 
Patterns"] Route[".route()
→ One healthy replica"] CallOne[".call_one()
→ Single actor"] Fanout[".fanout()
→ ALL replicas"] end - + subgraph Replicas["Replicas/Actors"] R1["Replica 1"] R2["Replica 2"] R3["Replica 3"] A1["Actor"] end - + Code --> Route Code --> CallOne Code --> Fanout - + Route --> R2 CallOne --> A1 Fanout --> R1 Fanout --> R2 Fanout --> R3 - + style Route fill:#4CAF50 style CallOne fill:#FF9800 style Fanout fill:#9C27B0 @@ -266,7 +266,7 @@ counter_service = await ForgeCounter.options( # WITHOUT SESSIONS: Each .route() call goes to a different replica await counter_service.increment.route() # Might go to replica 2 -await counter_service.increment.route() # Might go to replica 1 +await counter_service.increment.route() # Might go to replica 1 await counter_service.increment.route() # Might go to replica 3 results = await counter_service.increment.fanout() # Get from all replicas @@ -282,7 +282,7 @@ print("\nUsing sticky sessions:") async with counter_service.session(): # Creates a session that picks one replica await counter_service.reset.route() # Uses .route() within session print(await counter_service.increment.route()) # 1 - print(await counter_service.increment.route()) # 2 + print(await counter_service.increment.route()) # 2 print(await counter_service.increment.route()) # 3 final_value = await counter_service.get_value.route() From b67d2e3c6a47298d02fb02558b2b93a5f5e260be Mon Sep 17 00:00:00 2001 From: Sanyam Bhutani Date: Tue, 14 Oct 2025 13:54:10 -0700 Subject: [PATCH 21/22] Update 3_Monarch_101.MD --- docs/Tutorials/3_Monarch_101.MD | 201 +++++++++----------------------- 1 file changed, 52 insertions(+), 149 deletions(-) diff --git a/docs/Tutorials/3_Monarch_101.MD b/docs/Tutorials/3_Monarch_101.MD index 52bdb17d0..f3c5c5f37 100644 --- a/docs/Tutorials/3_Monarch_101.MD +++ b/docs/Tutorials/3_Monarch_101.MD @@ -50,85 +50,50 @@ graph TD ### Single Host ProcMesh -```mermaid -graph TD - subgraph Host["Single Host (8 GPUs)"] - subgraph ProcMesh["ProcMesh: per_host={'gpus': 8}"] - P0["Process 0
GPU 0"] - P1["Process 1
GPU 1"] - P2["Process 2
GPU 2"] - P3["Process 3
GPU 3"] - P4["Process 4
GPU 4"] - P5["Process 5
GPU 5"] - P6["Process 6
GPU 6"] - P7["Process 7
GPU 7"] - end - - P0 -.->|"Network"| P1 - P1 -.->|"Network"| P2 - P2 -.->|"Network"| P3 - P3 -.->|"Network"| P4 - P4 -.->|"Network"| P5 - P5 -.->|"Network"| P6 - P6 -.->|"Network"| P7 - P7 -.->|"Network"| P0 - end +**Key insight**: ProcMesh creates one process per GPU, automatically handling the process-to-hardware mapping. - style P0 fill:#F44336 - style P1 fill:#F44336 - style P2 fill:#F44336 - style P3 fill:#F44336 - style P4 fill:#F44336 - style P5 fill:#F44336 - style P6 fill:#F44336 - style P7 fill:#F44336 +```python +# This simple call: +procs = this_host().spawn_procs(per_host={"gpus": 8}) + +# Creates: +# Process 0 → GPU 0 +# Process 1 → GPU 1 +# Process 2 → GPU 2 +# Process 3 → GPU 3 +# Process 4 → GPU 4 +# Process 5 → GPU 5 +# Process 6 → GPU 6 +# Process 7 → GPU 7 ``` -### Multi-Host ProcMesh - -```mermaid -graph TD - subgraph Cluster["Multi-Host Cluster"] - subgraph Host1["Host 1"] - subgraph PM1["ProcMesh Segment 1"] - H1P0["Process 0
GPU 0"] - H1P1["Process 1
GPU 1"] - H1P2["Process 2
GPU 2"] - H1P3["Process 3
GPU 3"] - end - end - - subgraph Host2["Host 2"] - subgraph PM2["ProcMesh Segment 2"] - H2P0["Process 4
GPU 0"] - H2P1["Process 5
GPU 1"] - H2P2["Process 6
GPU 2"] - H2P3["Process 7
GPU 3"] - end - end +The beauty: you don't manage individual processes or GPU assignments - ProcMesh handles the topology for you. - subgraph Host3["Host 3"] - subgraph PM3["ProcMesh Segment 3"] - H3P0["Process 8
GPU 0"] - H3P1["Process 9
GPU 1"] - H3P2["Process 10
GPU 2"] - H3P3["Process 11
GPU 3"] - end - end - end +### Multi-Host ProcMesh - H1P0 -.->|"InfiniBand"| H2P0 - H1P1 -.->|"InfiniBand"| H2P1 - H2P0 -.->|"InfiniBand"| H3P0 - H2P1 -.->|"InfiniBand"| H3P1 +**Key insight**: ProcMesh seamlessly scales across multiple hosts with continuous process numbering. - style PM1 fill:#F44336 - style PM2 fill:#4CAF50 - style PM3 fill:#2196F3 +```python +# Same simple API works across hosts: +cluster_procs = spawn_cluster_procs( + hosts=["host1", "host2", "host3"], + per_host={"gpus": 4} +) + +# Automatically creates: +# Host 1: Processes 0-3 → GPUs 0-3 +# Host 2: Processes 4-7 → GPUs 0-3 +# Host 3: Processes 8-11 → GPUs 0-3 + +# Your code stays the same whether it's 1 host or 100 hosts +actors = cluster_procs.spawn("my_actor", MyActor) ``` +**The power**: Scale from single host to cluster without changing your actor code - ProcMesh handles all the complexity. + ```python # This shows the underlying actor system that powers Forge services +# NOTE: This is for educational purposes - use ForgeActor and .as_service() in real Forge apps! from monarch.actor import Actor, endpoint, this_proc, Future from monarch.actor import ProcMesh, this_host @@ -165,104 +130,42 @@ await counters.increment.call() # STEP 6: Different message patterns # call_one() - single actor value = await counters.get_value.call_one() -print(f"One counter: {value}") +print(f"One counter: {value}") # Output: One counter: 1 # choose() - random single actor (actors only, not services) value = await counters.get_value.choose() -print(f"Random counter: {value}") +print(f"Random counter: {value}") # Output: Random counter: 1 # call() - all actors, collect results values = await counters.get_value.call() -print(f"All counters: {values}") +print(f"All counters: {values}") # Output: All counters: [1, 1, 1, 1, 1, 1, 1, 1] # broadcast() - fire and forget -await counters.increment.broadcast() +await counters.increment.broadcast() # No return value - just sends to all actors # Cleanup await procs.stop() -``` - -## Actor Meshes: Your Code Running Distributed - -**ActorMesh** is created when you spawn actors across a ProcMesh. Each process in the ProcMesh gets one instance of your actor. - -```mermaid -graph TD - subgraph Creation["Actor Creation Process"] - Code["mesh.spawn('policy', PolicyActor, model='Qwen/Qwen3-7B')"] - - subgraph ProcMesh["ProcMesh (4 processes)"] - P0["Process 0
GPU 0"] - P1["Process 1
GPU 1"] - P2["Process 2
GPU 2"] - P3["Process 3
GPU 3"] - end - - subgraph ActorMesh["ActorMesh PolicyActor"] - A0["PolicyActor Instance #0: model=Qwen/Qwen3-7B"] - A1["PolicyActor Instance #1: model=Qwen/Qwen3-7B"] - A2["PolicyActor Instance #2: model=Qwen/Qwen3-7B"] - A3["PolicyActor Instance #3: model=Qwen/Qwen3-7B"] - end - - Code --> ProcMesh - P0 --> A0 - P1 --> A1 - P2 --> A2 - P3 --> A3 - end - style A0 fill:#4CAF50 - style A1 fill:#4CAF50 - style A2 fill:#4CAF50 - style A3 fill:#4CAF50 +# Remember: This raw Monarch code is for understanding how Forge works internally. +# In your Forge applications, use ForgeActor, .as_service(), .as_actor() instead! ``` -### Message Routing Through ActorMesh +## Actor Meshes: Your Code Running Distributed -```mermaid -graph TD - subgraph MessageFlow["Message Flow Patterns"] - Client["await policy_actors.generate.METHOD(prompt)"] - - subgraph Methods["Different Adverbs Route Differently"] - Choose["choose(): Routes to ONE actor, Load balanced"] - Call["call(): Routes to ALL actors, Collects results"] - Broadcast["broadcast(): Routes to ALL actors, Fire and forget"] - Stream["stream(): Routes to ALL actors, Iterator of results"] - end +**ActorMesh** is created when you spawn actors across a ProcMesh. Key points: - subgraph ActorInstances["PolicyActor Instances"] - A0["Actor 0: GPU 0, generates response"] - A1["Actor 1: GPU 1, generates response"] - A2["Actor 2: GPU 2, generates response"] - A3["Actor 3: GPU 3, generates response"] - end +- **One actor instance per process**: `mesh.spawn("policy", PolicyActor)` creates one PolicyActor in each process +- **Same constructor arguments**: All instances get the same initialization parameters +- **Independent state**: Each actor instance maintains its own state and memory +- **Message routing**: You can send messages to one actor or all actors using different methods - Client --> Choose - Client --> Call - Client --> Broadcast - Client --> Stream - - Choose -.->|"Load balanced"| A1 - Call --> A0 - Call --> A1 - Call --> A2 - Call --> A3 - Broadcast --> A0 - Broadcast --> A1 - Broadcast --> A2 - Broadcast --> A3 - Stream --> A0 - Stream --> A1 - Stream --> A2 - Stream --> A3 - end +```python +# Simple example: +procs = spawn_procs(per_host={"gpus": 4}) # 4 processes +policy_actors = procs.spawn("policy", PolicyActor, model="Qwen/Qwen3-7B") - style Choose fill:#4CAF50 - style Call fill:#FF9800 - style Broadcast fill:#E91E63 - style Stream fill:#9C27B0 +# Now you have 4 PolicyActor instances, one per GPU +# All initialized with the same model parameter ``` ## How Forge Services Use Monarch From cda22d849c44d3788e37179836f9093179006608 Mon Sep 17 00:00:00 2001 From: Sanyam Bhutani Date: Tue, 14 Oct 2025 13:56:36 -0700 Subject: [PATCH 22/22] fix linter --- docs/Tutorials/3_Monarch_101.MD | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/Tutorials/3_Monarch_101.MD b/docs/Tutorials/3_Monarch_101.MD index f3c5c5f37..2213e9bb5 100644 --- a/docs/Tutorials/3_Monarch_101.MD +++ b/docs/Tutorials/3_Monarch_101.MD @@ -58,7 +58,7 @@ procs = this_host().spawn_procs(per_host={"gpus": 8}) # Creates: # Process 0 → GPU 0 -# Process 1 → GPU 1 +# Process 1 → GPU 1 # Process 2 → GPU 2 # Process 3 → GPU 3 # Process 4 → GPU 4 @@ -76,13 +76,13 @@ The beauty: you don't manage individual processes or GPU assignments - ProcMesh ```python # Same simple API works across hosts: cluster_procs = spawn_cluster_procs( - hosts=["host1", "host2", "host3"], + hosts=["host1", "host2", "host3"], per_host={"gpus": 4} ) # Automatically creates: # Host 1: Processes 0-3 
→ GPUs 0-3 -# Host 2: Processes 4-7 → GPUs 0-3 +# Host 2: Processes 4-7 → GPUs 0-3 # Host 3: Processes 8-11 → GPUs 0-3 # Your code stays the same whether it's 1 host or 100 hosts @@ -155,7 +155,7 @@ await procs.stop() **ActorMesh** is created when you spawn actors across a ProcMesh. Key points: - **One actor instance per process**: `mesh.spawn("policy", PolicyActor)` creates one PolicyActor in each process -- **Same constructor arguments**: All instances get the same initialization parameters +- **Same constructor arguments**: All instances get the same initialization parameters - **Independent state**: Each actor instance maintains its own state and memory - **Message routing**: You can send messages to one actor or all actors using different methods
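To illustrate the message-routing bullet above, here is a minimal sketch of sending messages to that ActorMesh with the same adverbs used in the Counter example. It assumes `PolicyActor` exposes a `generate` endpoint; this is illustrative raw-Monarch usage for learning, not the Forge service API:

```python
# Route to ONE actor instance, picked for you
single = await policy_actors.generate.choose("What is 2+2?")

# Route to ALL four actor instances and collect one result per instance
all_results = await policy_actors.generate.call("What is 2+2?")

# Route to a single actor instance
one = await policy_actors.generate.call_one("What is 2+2?")

# Fire-and-forget to ALL instances (no results returned)
await policy_actors.generate.broadcast("What is 2+2?")
```

In real Forge applications you would wrap `PolicyActor` with `.as_service()` and use `.route()` / `.fanout()` instead, as covered in Part 2.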