From 4c92653bae1e5f8eeb72fd24ba321f41ead7364d Mon Sep 17 00:00:00 2001 From: Sanyam Bhutani Date: Thu, 2 Oct 2025 18:50:19 -0700 Subject: [PATCH 01/22] Create ReadMe.MD --- docs/Tutorials/ReadMe.MD | 11 +++++++++++ 1 file changed, 11 insertions(+) create mode 100644 docs/Tutorials/ReadMe.MD diff --git a/docs/Tutorials/ReadMe.MD b/docs/Tutorials/ReadMe.MD new file mode 100644 index 000000000..6294c8ec8 --- /dev/null +++ b/docs/Tutorials/ReadMe.MD @@ -0,0 +1,11 @@ +Welcome to the Tutorials section! This section is inspired by the A-Z PyTorch tutorial, shoutout to our friends that remember! + +This section currently is structured in 3 detailed parts: + +1. []() +2. []() +3. []() + +Each part builds upon the next and the entire section can be consumed in roughly an hour-Grab a Chai and Enjoy! + +If you're eager, please checkout our SFT Tutorial too (Coming soon!) as well as [App Examples](../../apps/). \ No newline at end of file From 430a45e4af363f6ba3cd265e652435809a3dcad5 Mon Sep 17 00:00:00 2001 From: Sanyam Bhutani Date: Thu, 2 Oct 2025 19:02:51 -0700 Subject: [PATCH 02/22] add part 1 --- docs/Tutorials/1_RL_and_Forge_Fundamentals.MD | 385 ++++++++++++++++++ docs/Tutorials/2_.MD | 0 docs/Tutorials/3_.MD | 0 docs/Tutorials/ReadMe.MD | 12 +- 4 files changed, 395 insertions(+), 2 deletions(-) create mode 100644 docs/Tutorials/1_RL_and_Forge_Fundamentals.MD create mode 100644 docs/Tutorials/2_.MD create mode 100644 docs/Tutorials/3_.MD diff --git a/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD b/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD new file mode 100644 index 000000000..96710b57a --- /dev/null +++ b/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD @@ -0,0 +1,385 @@ +# Part 1: RL Fundamentals - Using Forge Terminology + +## Core RL Components in Forge + +Let's start with a simple math tutoring example to understand RL concepts with the exact names Forge uses: + +### The Toy Example: Teaching Math + +```mermaid +graph TD + subgraph Example["Math Tutoring RL Example"] + Dataset["Dataset
math problems
'What is 2+2?'"] + Policy["Policy
student AI
generates: 'The answer is 4'"] + Reward["Reward Model
teacher
scores: 0.95 (excellent)"] + Reference["Reference Model
original student
baseline comparison"] + ReplayBuffer["Replay Buffer
notebook
stores experiences"] + Trainer["Trainer
tutor
improves student"] + end + + Dataset --> Policy + Policy --> Reward + Policy --> Reference + Reward --> ReplayBuffer + Reference --> ReplayBuffer + ReplayBuffer --> Trainer + Trainer --> Policy + + style Policy fill:#99ff99 + style Reward fill:#ffcc99 + style Trainer fill:#ff99cc +``` + +### RL Components Defined (Forge Names) + +1. **Dataset**: Provides questions/prompts (like "What is 2+2?") +2. **Policy**: The AI being trained (generates answers like "The answer is 4") +3. **Reward Model**: Evaluates answer quality (gives scores like 0.95) +4. **Reference Model**: Original policy copy (prevents drift from baseline) +5. **Replay Buffer**: Stores experiences (question + answer + score) +6. **Trainer**: Updates the policy weights based on experiences + +### The RL Learning Flow + +```python +# CONCEPTUAL EXAMPLE - see apps/grpo/main.py for GRPO Code + +def conceptual_rl_step(): + # 1. Get a math problem + question = dataset.sample() # "What is 2+2?" + + # 2. Student generates answer + answer = policy.generate(question) # "The answer is 4" + + # 3. Teacher grades it + score = reward_model.evaluate(question, answer) # 0.95 + + # 4. Compare to original student + baseline = reference_model.compute_logprobs(question, answer) + + # 5. Store the experience + experience = Episode(question, answer, score, baseline) + replay_buffer.add(experience) + + # 6. When enough experiences collected, improve student + batch = replay_buffer.sample(curr_policy_version=0) + if batch is not None: + trainer.train_step(batch) # Student gets better! + +# 🔄 See complete working example below with actual Forge service calls +``` + +## From Concepts to Forge Services + +Here's the key insight: **Each RL component becomes a Forge service**. The toy example above maps directly to Forge: + +```mermaid +graph LR + subgraph Concepts["RL Concepts"] + C1["Dataset"] + C2["Policy"] + C3["Reward Model"] + C4["Reference Model"] + C5["Replay Buffer"] + C6["Trainer"] + end + + subgraph Services["Forge Services (Real Classes)"] + S1["DatasetActor"] + S2["Policy"] + S3["RewardActor"] + S4["ReferenceModel"] + S5["ReplayBuffer"] + S6["RLTrainer"] + end + + C1 --> S1 + C2 --> S2 + C3 --> S3 + C4 --> S4 + C5 --> S5 + C6 --> S6 + + style C2 fill:#99ff99 + style S2 fill:#99ff99 + style C3 fill:#ffcc99 + style S3 fill:#ffcc99 +``` + +### RL Step with Forge Services + +```python +# Conceptual Example + +async def conceptual_forge_rl_step(services, step): + # 1. Get a math problem - CONCEPTUAL API + sample = await services['dataloader'].get_sample() + question, target = sample["question"], sample["answer"] + + # 2. Student generates answer - CONCEPTUAL API + # Actual method names vary by implementation + responses = await services['policy'].generate(prompt=question) + answer = responses[0].text + + # 3. Teacher grades it - CONCEPTUAL API + # Actual reward evaluation varies by implementation + score = await services['reward_actor'].evaluate( + prompt=question, response=answer, target=target + ) + + # 4. Compare to baseline - CONCEPTUAL API + ref_logprobs = await services['ref_model'].compute_baseline(responses[0].token_ids) + + # 5. Store experience - CONCEPTUAL Episode structure + # Real Episode structure in src/forge/data_models/episode.py + episode = create_episode(responses[0], score, ref_logprobs, step) + await services['replay_buffer'].store(episode) + + # 6. 
Improve student - CONCEPTUAL API + batch = await services['replay_buffer'].get_batch(policy_version=step) + if batch is not None: + loss = await services['trainer'].update_policy(batch) + return loss +``` + +**Key difference**: Same RL logic, but each component is now a distributed, fault-tolerant, auto-scaling service. + + +## Why This Matters: Traditional ML Infrastructure Fails + +### The Infrastructure Challenge + +Our simple RL loop above has complex requirements: + +#### Problem 1: Different Resource Needs + +```mermaid +graph TD + subgraph Components["Each Component Needs Different Resources"] + Policy["Policy (Student AI)
Generates: 'The answer is 4'
Needs: Large GPU memory
Scaling: Multiple replicas for speed"] + + Reward["Reward Model (Teacher)
Scores answers: 0.95
Needs: Moderate compute
Scaling: CPU or small GPU"] + + Trainer["Trainer (Tutor)
Improves student weights
Needs: Massive GPU compute
Scaling: Distributed training"] + + Dataset["Dataset (Question Bank)
Provides: 'What is 2+2?'
Needs: CPU-intensive I/O
Scaling: High memory bandwidth"] + end + + style Policy fill:#99ff99 + style Reward fill:#ffcc99 + style Trainer fill:#ff99cc + style Dataset fill:#ccccff +``` + +### Problem 2: Complex Interdependencies + +```mermaid +graph LR + A["Policy: Student AI
'What is 2+2?' → 'The answer is 4'"] + B["Reward: Teacher
Scores answer: 0.95"] + C["Reference: Original Student
Provides baseline comparison"] + D["Replay Buffer: Notebook
Stores: question + answer + score"] + E["Trainer: Tutor
Improves student using experiences"] + + A --> B + A --> C + B --> D + C --> D + D --> E + E --> A + + style A fill:#99ff99 + style B fill:#ffcc99 + style C fill:#99ccff + style D fill:#ccff99 + style E fill:#ff99cc +``` + +Each step has different: +- **Latency requirements**: Policy inference needs low latency, training can batch +- **Scaling patterns**: Reward evaluation scales with response count, training with model size +- **Failure modes**: Policy failure stops generation, reward failure affects learning quality +- **Resource utilization**: GPUs for inference/training, CPUs for data processing + +### Problem 3: The Coordination Challenge + +Unlike supervised learning where you process independent batches, RL requires coordination: + +```python +# This won't work - creates bottlenecks and resource waste +def naive_rl_step(): + # Policy waits idle while reward model works + response = policy_model.generate(prompt) # GPU busy + reward = reward_model.evaluate(prompt, response) # Policy GPU idle + + # Training waits for single episode + loss = compute_loss(response, reward) # Batch size = 1, inefficient + + # Everything stops if any component fails + if policy_fails or reward_fails or trainer_fails: + entire_system_stops() +``` + +## Enter Forge: RL-Native Architecture + +Forge solves these problems by treating each RL component as an **independent, scalable service** + +Let's see how core RL concepts map to Forge services: + +```python +async def real_rl_training_step(services, step): + """Single RL step using verified Forge APIs""" + + # 1. Environment interaction + sample = await services['dataloader'].__next__.call_one() + prompt, target = sample["question"], sample["answer"] + + responses = await services['policy'].generate.route(prompt=prompt) + + # 2. Reward computation + score = await services['reward_actor'].evaluate_response.route( + prompt=prompt, response=responses[0].text, target=target + ) + + # 3. Get reference logprobs + ref_logprobs = await services['ref_model'].forward.route(responses[0].token_ids) + + # 4. Experience storage - Episode creation pattern + # Note: Actual Episode structure requires token tensors, not text + episode = create_episode_from_response(responses[0], score, ref_logprobs, step) + await services['replay_buffer'].add.call_one(episode) + + # 5. Learning - trainer endpoint + batch = await services['replay_buffer'].sample.call_one( + curr_policy_version=step + ) + if batch is not None: + loss = await services['trainer'].train_step.call_one(batch) + + # 6. Policy synchronization - weight update pattern + await services['trainer'].push_weights.call_one(step + 1) + await services['policy'].update_weights.fanout(step + 1) + + return loss +``` + +**Key insight**: Each line of RL pseudocode becomes a service call. The complexity of distribution, scaling, and fault tolerance is hidden behind these simple interfaces. 
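
The loop above calls a `create_episode_from_response` helper that is not shown. Here is a minimal sketch of what such a helper could look like — the `Episode` and `Group` classes below are illustrative stand-ins only; the real structures live in `src/forge/data_models/episode.py` and `apps/grpo/main.py` and store token tensors rather than text:

```python
# Illustrative sketch - field names are assumptions, not the real Forge Episode.
from dataclasses import dataclass, field


@dataclass
class Group:  # stand-in for the Group used in apps/grpo/main.py
    response: str
    ref_logprobs: list
    reward: float
    advantage: float | None = None


@dataclass
class Episode:  # stand-in for the real Episode (which stores token tensors)
    episode_id: int
    prompt: str
    target: str
    policy_version: int
    groups: list = field(default_factory=list)


def create_episode_from_response(completion, score, ref_logprobs, step, prompt="", target=""):
    """Bundle one rollout into an Episode that the replay buffer can store."""
    # A production loop would also thread the original prompt/target through.
    episode = Episode(
        episode_id=step,
        prompt=prompt,
        target=target,
        policy_version=step,
    )
    episode.groups.append(
        Group(response=completion.text, ref_logprobs=ref_logprobs, reward=score)
    )
    return episode
```
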
+ +## What Makes This Powerful + +### Automatic Resource Management +```python +responses = await policy.generate.route(prompt=question) +answer = responses[0].text # responses is list[Completion] + +# Forge handles behind the scenes: +# - Routing to least loaded replica +# - GPU memory management +# - Batch optimization +# - Failure recovery +# - Auto-scaling based on demand +``` + +### Independent Scaling +```python + +from forge.actors.policy import Policy, PolicyConfig, SamplingOverrides, WorkerConfig +from forge.actors.replay_buffer import ReplayBuffer +from forge.controller.service import shutdown_service +from apps.grpo.main import Trainer, RewardActor, ComputeAdvantages, RefModel, DatasetActor +from forge.data.rewards import MathReward, ThinkingReward +import asyncio + +model = "Qwen/Qwen3-1.7B" +group_size = 1 + +( + dataloader, + policy, + trainer, + replay_buffer, + compute_advantages, + ref_model, + reward_actor, +) = await asyncio.gather( + # Dataset service + spawn_service( + ServiceConfig(procs_per_replica=1, num_replicas=1), + DatasetActor, + path="openai/gsm8k", + config_name="main", + split="train", + streaming=True, + ), + # Policy service with GPU + spawn_service( + ServiceConfig(procs_per_replica=1, with_gpus=True, num_replicas=1), + Policy, + config=PolicyConfig( + worker_params=WorkerConfig(model=model), + sampling_params=SamplingOverrides( + num_samples=group_size, max_tokens=16 + ), + ), + ), + # Trainer service with GPU + spawn_service( + ServiceConfig(procs_per_replica=1, with_gpus=True, num_replicas=1), + Trainer, + learning_rate=1e-5, + beta=0.1, + model_name=model, + ), + # Replay buffer (CPU) + spawn_service( + ServiceConfig(procs_per_replica=1, num_replicas=1), + ReplayBuffer, + batch_size=2, + max_policy_age=1, + ), + # Advantage computation (CPU) + spawn_service( + ServiceConfig(procs_per_replica=1, num_replicas=1), + ComputeAdvantages, + gamma=0.99, + lambda_=0.95, + ), + # Reference model with GPU + spawn_service( + ServiceConfig(procs_per_replica=1, num_replicas=1, with_gpus=True), + RefModel, + model_name=model, + ), + # Reward actor (CPU) + spawn_service( + ServiceConfig(procs_per_replica=1, num_replicas=1), + RewardActor, + reward_functions=[MathReward(), ThinkingReward()], + ) + ) + +# Production scaling - multiply num_replicas: +# Policy: num_replicas=8 for high inference demand +# RewardActor: num_replicas=16 for parallel evaluation +# Trainer: num_replicas=4 for distributed training +``` + +### Fault Tolerance +```python +# If a policy replica fails: +responses = await policy.generate.route(prompt=question) +answer = responses[0].text +# -> Forge automatically routes to healthy replica +# -> Failed replica respawns in background +# -> No impact on training loop + +# If reward service fails: +score = await reward_actor.evaluate_response.route( + prompt=question, response=answer, target=target +) +# -> Retries on different replica automatically +# -> Graceful degradation if all replicas fail +# -> System continues (may need application-level handling) +``` + +This is fundamentally different from monolithic RL implementations where any component failure stops everything. 
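
The note above that the system "may need application-level handling" is worth making concrete. A minimal sketch of such handling, assuming the reward call can still raise once Forge's automatic retries are exhausted (the exception handling shown here is illustrative, not a built-in Forge API):

```python
# Illustrative sketch: degrade gracefully if the reward service stays unavailable.
async def robust_reward(reward_actor, prompt, response, target, default=0.0):
    try:
        return await reward_actor.evaluate_response.route(
            prompt=prompt, response=response, target=target
        )
    except Exception as err:  # broad on purpose: log, fall back, keep training
        print(f"Reward service unavailable ({err}); using default reward {default}")
        return default
```

Whether a neutral default reward is acceptable, or whether the episode should simply be dropped, is exactly the kind of application-level decision referred to above.
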
diff --git a/docs/Tutorials/2_.MD b/docs/Tutorials/2_.MD new file mode 100644 index 000000000..e69de29bb diff --git a/docs/Tutorials/3_.MD b/docs/Tutorials/3_.MD new file mode 100644 index 000000000..e69de29bb diff --git a/docs/Tutorials/ReadMe.MD b/docs/Tutorials/ReadMe.MD index 6294c8ec8..01d750d06 100644 --- a/docs/Tutorials/ReadMe.MD +++ b/docs/Tutorials/ReadMe.MD @@ -1,8 +1,16 @@ -Welcome to the Tutorials section! This section is inspired by the A-Z PyTorch tutorial, shoutout to our friends that remember! +## Zero to Forge: From RL Theory to Production-Scale Implementation + +A comprehensive guide for ML Engineers building distributed RL systems for language models. + +Some of the examples mentioned below will be conceptual in nature for understanding. Please refer to API Docs (Coming Soon!) for more details + +Welcome to the Tutorials section! This section is inspired by the A-Z PyTorch tutorial, shoutout to our PyTorch friends that remember! + +### This section currently is structured in 3 detailed parts: -1. []() +1. [RL Fundamentals and Understanding Forge Terminology](./1_RL_and_Forge_Fundamentals.MD): This gives a quick refresher of Reinforcement Learning and teaches you Forge Fundamentals 2. []() 3. []() From 223b2cab881168ad6c74d7bbf5707cd1f908baa7 Mon Sep 17 00:00:00 2001 From: Sanyam Bhutani Date: Thu, 2 Oct 2025 19:06:35 -0700 Subject: [PATCH 03/22] Update 1_RL_and_Forge_Fundamentals.MD --- docs/Tutorials/1_RL_and_Forge_Fundamentals.MD | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD b/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD index 96710b57a..bcffc733c 100644 --- a/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD +++ b/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD @@ -85,6 +85,7 @@ graph LR end subgraph Services["Forge Services (Real Classes)"] + S1["DatasetActor"] S2["Policy"] S3["RewardActor"] From b9cb2cb0a3e6eb05162a2fc91fddbe1952c16080 Mon Sep 17 00:00:00 2001 From: Sanyam Bhutani Date: Thu, 2 Oct 2025 19:08:03 -0700 Subject: [PATCH 04/22] Update 1_RL_and_Forge_Fundamentals.MD --- docs/Tutorials/1_RL_and_Forge_Fundamentals.MD | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD b/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD index bcffc733c..223a6e152 100644 --- a/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD +++ b/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD @@ -85,7 +85,6 @@ graph LR end subgraph Services["Forge Services (Real Classes)"] - S1["DatasetActor"] S2["Policy"] S3["RewardActor"] @@ -109,6 +108,8 @@ graph LR ### RL Step with Forge Services +Let's look at the example from above again, but this time we would use the names from Forge: + ```python # Conceptual Example @@ -145,6 +146,8 @@ async def conceptual_forge_rl_step(services, step): **Key difference**: Same RL logic, but each component is now a distributed, fault-tolerant, auto-scaling service. +Did you realise-we are not worrying about any Infra code here! Forge Automagically handles the details behind the scenes and you can focus on writing your RL Algorthms! 
+ ## Why This Matters: Traditional ML Infrastructure Fails From 5a0190b4d009c30e4c736d06be024c2bfb07f07a Mon Sep 17 00:00:00 2001 From: Sanyam Bhutani Date: Thu, 2 Oct 2025 19:12:43 -0700 Subject: [PATCH 05/22] part 2 --- docs/Tutorials/1_RL_and_Forge_Fundamentals.MD | 36 +- docs/Tutorials/2_Forge_Internals.MD | 665 ++++++++++++++++++ docs/Tutorials/3_.MD | 0 docs/Tutorials/{2_.MD => 3_Monarch_101.MD} | 0 4 files changed, 685 insertions(+), 16 deletions(-) create mode 100644 docs/Tutorials/2_Forge_Internals.MD delete mode 100644 docs/Tutorials/3_.MD rename docs/Tutorials/{2_.MD => 3_Monarch_101.MD} (100%) diff --git a/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD b/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD index 223a6e152..810ef373f 100644 --- a/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD +++ b/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD @@ -275,15 +275,15 @@ async def real_rl_training_step(services, step): ```python responses = await policy.generate.route(prompt=question) answer = responses[0].text # responses is list[Completion] - -# Forge handles behind the scenes: -# - Routing to least loaded replica -# - GPU memory management -# - Batch optimization -# - Failure recovery -# - Auto-scaling based on demand ``` +Forge handles behind the scenes: +- Routing to least loaded replica +- GPU memory management +- Batch optimization +- Failure recovery +- Auto-scaling based on demand + ### Independent Scaling ```python @@ -361,13 +361,14 @@ group_size = 1 reward_functions=[MathReward(), ThinkingReward()], ) ) - -# Production scaling - multiply num_replicas: -# Policy: num_replicas=8 for high inference demand -# RewardActor: num_replicas=16 for parallel evaluation -# Trainer: num_replicas=4 for distributed training ``` +Production scaling - multiply num_replicas: +- Policy: num_replicas=8 for high inference demand +- RewardActor: num_replicas=16 for parallel evaluation +- Trainer: num_replicas=4 for distributed training + + ### Fault Tolerance ```python # If a policy replica fails: @@ -381,9 +382,12 @@ answer = responses[0].text score = await reward_actor.evaluate_response.route( prompt=question, response=answer, target=target ) -# -> Retries on different replica automatically -# -> Graceful degradation if all replicas fail -# -> System continues (may need application-level handling) ``` -This is fundamentally different from monolithic RL implementations where any component failure stops everything. +- Retries on different replica automatically +- Graceful degradation if all replicas fail +- System continues (may need application-level handling) + +This is fundamentally different from monolithic RL implementations where any component failure stops everything! + +In the next Section, we will go a layer deeper and learn how ForgeServices work. Continue to [Part 2 here](./2_Forge_Internals.MD) \ No newline at end of file diff --git a/docs/Tutorials/2_Forge_Internals.MD b/docs/Tutorials/2_Forge_Internals.MD new file mode 100644 index 000000000..d55eda51a --- /dev/null +++ b/docs/Tutorials/2_Forge_Internals.MD @@ -0,0 +1,665 @@ +# Part 2: Peeling Back the Abstraction - What Are Services? + +We highly recommend reading [Part 1](./1_RL_and_Forge_Fundamentals.MD) before this, it explains RL Concepts and how they land in Forge. + +Now that you see the power of the service abstraction, let's understand what's actually happening under the hood, Grab your chai! 
+ +## Service Anatomy: Beyond the Interface + +When you call `await policy_service.generate(question)`, here's what actually happens: + +```mermaid +graph TD + Call["Your Code:
await policy_service.generate"] + + subgraph ServiceLayer["Service Layer"] + Proxy["Service Proxy
Load balancing
Health checking
Request routing"] + LB["Load Balancer
Replica selection
Circuit breaker
Retry logic"] + end + + subgraph Replicas["Replica Management"] + R1["Replica 1
GPU 0
Healthy"] + R2["Replica 2
GPU 1
Overloaded"] + R3["Replica 3
GPU 2
Failed"] + R4["Replica 4
GPU 3
Healthy"] + end + + subgraph Compute["Actual Computation"] + Actor["Policy Actor
vLLM engine
Model weights
KV cache"] + end + + Call --> Proxy + Proxy --> LB + LB --> R1 + LB -.-> R2 + LB -.-> R3 + LB --> R4 + R1 --> Actor + R4 --> Actor + + style Call fill:#99ff99 + style LB fill:#ffcc99 + style R3 fill:#ff9999 + style Actor fill:#cc99ff +``` + +## Service Components Deep Dive + +### 1. Real Service Configuration + +Here's the actual ServiceConfig from Forge source code: + +```python +# Configuration pattern from apps/grpo/main.py: +Policy.options( + procs=1, # Processes per replica + num_replicas=4, # Number of replicas + with_gpus=True # Allocate GPUs + # Other available options: + # hosts=None +) + +# This is the ACTUAL way services are configured in Forge +``` + +### 2. Real Service Creation + +Services are created using the `spawn_service` function: + +```python +# This is what ACTUALLY works - copied directly from the notebook + +from forge.controller.service import ServiceConfig, spawn_service +from forge.actors.policy import Policy, PolicyConfig, SamplingOverrides, WorkerConfig + +model = "Qwen/Qwen3-1.7B" + +policy = await spawn_service( + ServiceConfig(procs_per_replica=1, with_gpus=True, num_replicas=1), + Policy, + config=PolicyConfig( + worker_params=WorkerConfig(model=model), + sampling_params=SamplingOverrides( + num_samples=1, max_tokens=16 + ), + ), +) + +prompt = "What is 3 + 5?" +responses = await policy.generate.choose(prompt=prompt) +print(f"Response: {responses[0].text}") + +# The spawn_service() function automatically handles: +# - Spawning actor replicas across processes/GPUs +# - Load balancing with .choose() method +# - Health monitoring and failure recovery +# - Message routing and serialization + +# Cleanup when done +await shutdown_service(policy) +``` + +### 3. How Services Actually Work + +Forge services are implemented as ServiceActors that manage collections of your ForgeActor replicas: + +```python +# Forge internals - What happens behind the scenes: +# 1. .as_service() creates a ServiceInterface +# 2. ServiceInterface manages N replicas of your ForgeActor class +# 3. ServiceInterface handles routing between replicas +# 4. You get methods like .route(), .fanout(), etc. + +# Your code sees this: +responses = await policy.generate.route(prompt=prompt) + +# But behind the scenes: +# - ServiceInterface selects healthy replica +# - Routes message to that replica's Policy.generate() endpoint +# - Handles failures and retries automatically +# - Returns list[Completion] from the selected replica +``` + +### 3. Different Service Types and Their Characteristics + +```mermaid +graph TD + subgraph GPU["GPU-Intensive Services"] + PolicySvc["Policy Service
Large model inference
High GPU memory
Batch optimization"] + TrainerSvc["Trainer Service
Distributed training
Gradient sync
Massive compute"] + RefSvc["Reference Service
Frozen model
Baseline computation
Read-only ops"] + end + + subgraph CPU["CPU-Intensive Services"] + RewardSvc["Reward Service
Evaluation logic
Rule-based scoring
High throughput"] + DataSvc["Data Service
Dataset streaming
Preprocessing
I/O optimization"] + end + + subgraph Memory["Memory-Intensive Services"] + BufferSvc["Buffer Service
Experience storage
Efficient sampling
Persistence"] + MetricsSvc["Metrics Service
Logging aggregation
Performance tracking
Analytics"] + end + + style PolicySvc fill:#ff9999 + style TrainerSvc fill:#ff9999 + style RewardSvc fill:#99ff99 + style BufferSvc fill:#9999ff +``` + +## Deep Dive: Service Communication Patterns + +These communication patterns (\"adverbs\") determine how your service calls are routed to replicas. Understanding when to use each pattern is key to effective Forge usage. + +### 1. `.route()` - Load Balanced Single Replica + +**When to use**: Normal request routing where any replica can handle the request. + +```python +responses = await policy.generate.route(prompt=question) +answer = responses[0].text # Extract text from Completion object + +# Behind the scenes: +# 1. Health check eliminates failed replicas +# 2. Load balancer picks least loaded healthy replica +# 3. Request routes to that specific replica +# 4. Automatic retry on different replica if failure +``` + +**Performance characteristics**: +- **Latency**: Lowest (single network hop) +- **Throughput**: Limited by single replica capacity +- **Fault tolerance**: Automatic failover to other replicas + +**Critical insight**: `.route()` is your default choice for stateless operations in Forge services. + +### 2. `.fanout()` - Broadcast with Results Collection + +**When to use**: You need responses from ALL replicas. + +```python +# Get version from all policy replicas +current_versions = await policy.get_version.fanout() +# Returns: [version_replica_1, version_replica_2, ...] + +# Update weights on all replicas +await policy.update_weights.fanout(new_policy_version) +# Broadcasts to all replicas simultaneously +``` + +**Performance characteristics**: +- **Latency**: Slowest replica determines total latency +- **Throughput**: Network bandwidth × number of replicas +- **Fault tolerance**: Fails if ANY replica fails (unless configured otherwise) + +**Critical gotcha**: Don't use `.fanout()` for high-frequency operations - it contacts all replicas. + +### 3. Streaming Operations - Custom Implementation Pattern + +**When to use**: You want to process results as they arrive, not wait for all. + +```python +# 📝 CONCEPTUAL - Streaming requires custom implementation in your training loop +# The basic ReplayBuffer doesn't have built-in streaming methods +# Pattern from apps/grpo/main.py continuous training: + +while training: + # This is the real API call pattern + batch = await replay_buffer.sample.call_one(curr_policy_version=step) + if batch is not None: + # Process batch immediately + loss = await trainer.train_step.call_one(batch) + print(f"Training loss: {loss}") + else: + await asyncio.sleep(0.1) # Wait for more data +``` + +**Performance characteristics**: +- **Latency**: Process first result immediately +- **Throughput**: Pipeline parallelism (much higher than sequential) +- **Fault tolerance**: Continues if some replicas fail + +**Critical insight**: This is essential for high-throughput RL where you can't wait for batches. + +### 4. Fire-and-Forget Operations + +**When to use**: Side effects that don't need responses (notifications, cache updates). 
+ +```python +# 📝 CONCEPTUAL - Fire-and-forget requires custom @endpoint implementations +# The basic services don't have broadcast methods built-in +# You would implement custom endpoints in your ForgeActor: + +class CustomPolicy(Policy): + @endpoint + async def clear_cache(self) -> None: + """Custom endpoint for cache clearing""" + self.policy_worker.clear_kv_cache() + +# Then use it (hypothetical): +# await custom_policy.clear_cache.fanout() # Clear all replica caches +# Note: Actual cache clearing would use existing Policy methods +``` + +**Performance characteristics**: +- **Latency**: Immediately returns (doesn't wait for completion) +- **Throughput**: Network limited, but non-blocking +- **Fault tolerance**: Fire-and-forget (you don't know if it worked) + +**Critical warning**: Only use for non-critical operations - you get no confirmation. + +### 5. Service Sessions for Stateful Operations + +**When to use**: When you need multiple calls to hit the same replica (like KV cache preservation). + +```python +# This Counter example demonstrates the session pattern + +from forge.controller import ForgeActor +from forge.controller.service import ServiceConfig, spawn_service, shutdown_service +from monarch.actor import endpoint + +class ForgeCounter(ForgeActor): + def __init__(self, initial_value: int): + self.value = initial_value + + @endpoint + def increment(self) -> int: + self.value += 1 + return self.value + + @endpoint + def get_value(self) -> int: + return self.value + + @endpoint + async def reset(self): + self.value = 0 + +counter_service = await spawn_service( + ServiceConfig(procs_per_replica=1, num_replicas=4), + ForgeCounter, + initial_value=0 +) + +# Test basic operations +await counter_service.increment.choose() +results = await counter_service.increment.call() +print(f"All replica values: {results}") + +# STICKY SESSIONS +print("\nUsing sticky sessions:") +async with counter_service.session(): + await counter_service.reset.choose() + print(await counter_service.increment.choose()) # 1 + print(await counter_service.increment.choose()) # 2 + print(await counter_service.increment.choose()) # 3 + + final_value = await counter_service.get_value.choose() + print(f"Final value on this replica: {final_value}") # 3 + +# Same pattern works with Policy for multi-turn conversations: +# async with policy.session(): +# response1 = await policy.generate.choose(prompt=turn1) +# full_prompt = turn1 + response1[0].text + turn2 +# response2 = await policy.generate.choose(prompt=full_prompt) +# # Both calls hit same replica, preserving KV cache + +# Cleanup +await shutdown_service(counter_service) +``` + +**Performance impact**: Critical for maintaining KV cache in multi-turn conversations. + +## Deep Dive: State Management Reality + +The most complex challenge in distributed RL is maintaining state consistency while maximizing performance. + +### The KV Cache Problem + +**The challenge**: Policy inference is much faster with KV cache, but cache is tied to specific conversation history. + +```python +# This breaks KV cache optimization: +async def naive_multi_turn(): + # Each call might go to different replica = cache miss + response1 = await policy_service.generate.choose(question1) + response2 = await policy_service.generate.choose(question1 + response1) # Cache miss! + response3 = await policy_service.generate.choose(conversation_so_far) # Cache miss! +``` + +**The solution**: Sticky sessions ensure all calls go to same replica. 
+ +```python +async def optimized_multi_turn(): + async with policy.session(): + # All calls guaranteed to hit same replica = cache hits + response1 = await policy.generate.route(prompt=question1) + full_prompt = question1 + response1[0].text + response2 = await policy.generate.route(prompt=full_prompt) # Cache hit! + conversation = full_prompt + response2[0].text + response3 = await policy.generate.route(prompt=conversation) # Cache hit! + + # Session ends, replica can be garbage collected or reused +``` + +**Performance impact**: Maintaining KV cache across turns avoids recomputing previous tokens. + +### Replay Buffer Consistency + +**The challenge**: Multiple trainers and experience collectors reading/writing concurrently. + +**Real Forge approach**: The ReplayBuffer actor handles concurrency internally: + +```python +# Forge ReplayBuffer endpoints (verified from source code) +# Add episodes (thread-safe by actor model) +await replay_buffer.add.call_one(episode) # Note: .call_one() not .choose() + +# Sample batches for training +batch = await replay_buffer.sample.call_one( + curr_policy_version=step_number, + batch_size=None # Optional parameter, uses default from config +) + +# Additional methods available: +# await replay_buffer.clear.call_one() # Clear buffer +# await replay_buffer.evict.call_one(curr_policy_version) # Remove old episodes +# state = await replay_buffer.state_dict.call_one() # Get state for checkpointing +``` + +**Critical insight**: The actor model provides natural thread safety - each actor processes messages sequentially. + +### Weight Synchronization Strategy + +**The challenge**: Trainer updates policy weights, but policy service needs those weights. + +```python +# Forge weight synchronization pattern from apps/grpo/main.py +async def real_weight_sync(trainer, policy, step): + # Trainer pushes weights to TorchStore with version number + await trainer.push_weights.call_one(policy_version=step + 1) + + # Policy service updates to new version from TorchStore + # Use .fanout() to update ALL policy replicas + await policy.update_weights.fanout(policy_version=step + 1) + +# Check current policy version +current_version = await policy.get_version.route() +print(f"Current policy version: {current_version}") +``` + +## Deep Dive: Asynchronous Coordination Patterns + +**The real challenge**: Different services run at different speeds, but Forge's service abstraction handles the coordination complexity. 
+ +### The Forge Approach: Let Services Handle Coordination + +Instead of manual coordination, Forge services handle speed mismatches automatically: + +```python + +from apps.grpo.main import Episode, Group + +async def simple_rl_step(): + + # ===== Generate a rollout ===== + sample = await dataloader.__next__.choose() + prompt, target = sample["question"], sample["answer"] + + print(f"Prompt: {prompt}") + print(f"Target: {target}") + + actions = await policy.generate.choose(prompt=prompt) + print(f"Policy response: {actions[0].text}") + + ref_logprobs = await ref_model.forward.choose(actions[0].token_ids) + reward = await reward_actor.evaluate_response.choose( + prompt=prompt, + response=actions[0].text, + target=target + ) + print(f"Reward: {reward}") + + episode = Episode( + episode_id=0, + prompt=prompt, + target=target, + policy_version=0, + ) + + episode.add_group(Group( + response=actions[0].text, + ref_logprobs=ref_logprobs, + reward=reward, + )) + + advantages = await compute_advantages.__call__.choose(episode.groups) + episode.groups[0].advantage = advantages[0] + print(f"Advantage: {advantages[0]}") + await replay_buffer.add.choose(episode) + print("Episode stored in replay buffer") + + # ===== Train on the batch ===== + batch = await replay_buffer.sample.choose(curr_policy_version=0) + if batch is not None: + print("Training on batch...") + training_result = await trainer.train_step.choose(batch) + loss = training_result.get("loss", 0.0) + print(f"Training loss: {loss}") + return loss + else: + print("Not enough data in buffer yet") + return None + +for step in range(10): + print(f"\n--- RL Step {step + 1} ---") + loss = await simple_rl_step() + if loss: + print(f"Step {step + 1} complete, loss: {loss:.4f}") + else: + print(f"Step {step + 1} complete, building buffer...") +``` + +### Handling Speed Mismatches with Service Scaling + +**The insight**: Scale services independently based on their bottlenecks. + +```python +# Scale fast services with more replicas +policy = await Policy.options( + procs=1, num_replicas=8, with_gpus=True # Many replicas for high throughput +).as_service( + engine_config=EngineConfig(model=model_name) +) + +# Reward evaluation might be CPU-bound +reward_actor = await RewardActor.options( + procs=1, num_replicas=16, with_gpus=False # More CPU replicas +).as_service( + reward_functions=[MathReward()] +) + +# Training needs fewer but more powerful replicas +trainer = await RLTrainer.options( + procs=1, num_replicas=2, with_gpus=True # Fewer but GPU-heavy +).as_actor( # Trainer typically uses .as_actor() not .as_service() + optimizer=Optimizer(lr=1e-5) +) +``` + +### Natural Backpressure Through Service APIs + +```python +# backpressure pattern - The replay buffer naturally provides backpressure +batch = await replay_buffer.sample.call_one(curr_policy_version=step) +if batch is None: + # Not enough data yet - natural rate limiting + print("Buffer not ready, collecting more experiences...") + continue +else: + # Proceed with training + loss = await trainer.train_step.call_one(batch) + print(f"Training loss: {loss}") +``` + +These patterns address the core technical challenges in distributed RL. The key insight: **Forge services handle coordination complexity automatically, letting you focus on RL algorithm logic**. 
+ +## Service Implementation Example + +Let's see how a reward service is actually implemented: + +```python +# ✅ COMPLETE WORKING EXAMPLE - Exact RewardActor from apps/grpo/main.py + +from forge.controller import ForgeActor +from monarch.actor import endpoint +from forge.data.rewards import MathReward, ThinkingReward +from forge.controller.service import ServiceConfig, spawn_service + +# EXACT class definition from apps/grpo/main.py lines 68-83 +class RewardActor(ForgeActor): + def __init__(self, reward_functions: list): + self.reward_functions = reward_functions + + @endpoint + async def evaluate_response(self, prompt: str, response: str, target: str) -> float: + """Evaluate response quality using multiple reward functions""" + total_reward = 0.0 + + for reward_fn in self.reward_functions: + # Each reward function contributes to total score + reward = reward_fn(prompt, response, target) + total_reward += reward + + # Return average reward across all functions + return total_reward / len(self.reward_functions) if self.reward_functions else 0.0 + +reward_actor = await spawn_service( + ServiceConfig(procs_per_replica=1, num_replicas=1), + RewardActor, + reward_functions=[MathReward(), ThinkingReward()] +) + +prompt = "What is 15% of 240?" +response = "15% of 240 is 36" +target = "36" + +score = await reward_actor.evaluate_response.choose( + prompt=prompt, + response=response, + target=target +) +print(f"Reward score: {score}") # Usually around 1.0 for correct math answers + +# For production scaling - increase num_replicas for parallel evaluation: +# ServiceConfig(procs_per_replica=1, num_replicas=16) # 16 parallel evaluators + +# Cleanup when done +await shutdown_service(reward_actor) +``` + +## Service Orchestration: The Training Loop + +Now let's see how services coordinate in a real training loop: + +```python +# This is the REAL way production RL systems are built with Forge + +import asyncio +from forge.actors.policy import Policy +from forge.actors.reference_model import ReferenceModel +from forge.actors.replay_buffer import ReplayBuffer +from forge.actors.trainer import RLTrainer +from forge.controller.actor import ForgeActor +from forge.data.rewards import MathReward, ThinkingReward +from monarch.actor import endpoint +from omegaconf import DictConfig + +# EXACT service creation from apps/grpo/main.py lines 322-344 +print("Initializing all services...") +( + dataloader, + policy, + trainer, + replay_buffer, + compute_advantages, + ref_model, + reward_actor, +) = await asyncio.gather( + DatasetActor.options(**cfg.actors.dataset).as_actor(**cfg.dataset), + Policy.options(**cfg.services.policy).as_service(**cfg.policy), + RLTrainer.options(**cfg.actors.trainer).as_actor( + **cfg.trainer, loss=simple_grpo_loss + ), + ReplayBuffer.options(**cfg.actors.replay_buffer).as_actor( + **cfg.replay_buffer, collate=collate + ), + ComputeAdvantages.options(**cfg.actors.compute_advantages).as_actor(), + ReferenceModel.options(**cfg.services.ref_model).as_service(**cfg.ref_model), + RewardActor.options(**cfg.services.reward_actor).as_service( + reward_functions=[MathReward(), ThinkingReward()] + ), +) + +print("All services initialized successfully!") + +# EXACT usage patterns from apps/grpo/main.py continuous training loop +async def production_training_loop(): + """Real training loop pattern from apps/grpo/main.py""" + step = 0 + + while True: + # Data generation + sample = await dataloader.sample.call_one() + + # Policy generation service call + responses = await 
policy.generate.route(prompt=sample["question"]) + + # Reference computation service call + ref_logprobs = await ref_model.forward.route(responses[0].token_ids) + + # Reward evaluation service call + reward = await reward_actor.evaluate_response.route( + prompt=sample["question"], + response=responses[0].text, + target=sample["answer"] + ) + + # Experience storage (simplified structure for illustration) + episode = create_episode(sample, responses[0], reward, ref_logprobs, step) + await replay_buffer.add.call_one(episode) + + # Training when ready endpoints + batch = await replay_buffer.sample.call_one(curr_policy_version=step) + if batch is not None: + loss = await trainer.train_step.call_one(batch) + + # Weight synchronization pattern + await trainer.push_weights.call_one(step + 1) + await policy.update_weights.route(step + 1) + + print(f"Step {step}, Loss: {loss:.4f}") + step += 1 + +# EXACT cleanup pattern from apps/grpo/main.py lines 493-504 +print("Shutting down services...") +await asyncio.gather( + DatasetActor.shutdown(dataloader), + policy.shutdown(), + RLTrainer.shutdown(trainer), + ReplayBuffer.shutdown(replay_buffer), + ComputeAdvantages.shutdown(compute_advantages), + ref_model.shutdown(), + reward_actor.shutdown(), +) +print("All services shut down successfully!") +``` + +**Key observations:** +1. **Parallelism**: Independent operations run concurrently +2. **Load balancing**: Each `choose()` call automatically selects optimal replica +3. **Fault tolerance**: Failures automatically retry on different replicas +4. **Resource efficiency**: CPU and GPU services scale independently +5. **Coordination**: Services coordinate through shared state (replay buffer, weight versions) + +This is the power of the service abstraction - complex distributed coordination looks like simple async Python code. diff --git a/docs/Tutorials/3_.MD b/docs/Tutorials/3_.MD deleted file mode 100644 index e69de29bb..000000000 diff --git a/docs/Tutorials/2_.MD b/docs/Tutorials/3_Monarch_101.MD similarity index 100% rename from docs/Tutorials/2_.MD rename to docs/Tutorials/3_Monarch_101.MD From 44f562435f7a2036fb2d1a758c2327dd808cb4a7 Mon Sep 17 00:00:00 2001 From: Sanyam Bhutani Date: Thu, 2 Oct 2025 19:15:10 -0700 Subject: [PATCH 06/22] add part 3 --- docs/Tutorials/3_Monarch_101.MD | 437 ++++++++++++++++++++++++++++++++ docs/Tutorials/ReadMe.MD | 4 +- 2 files changed, 439 insertions(+), 2 deletions(-) diff --git a/docs/Tutorials/3_Monarch_101.MD b/docs/Tutorials/3_Monarch_101.MD index e69de29bb..9369be13a 100644 --- a/docs/Tutorials/3_Monarch_101.MD +++ b/docs/Tutorials/3_Monarch_101.MD @@ -0,0 +1,437 @@ +# Part 3: The Forge-Monarch Connection + +Now let's peel back the layers. Forge services are built on top of **Monarch**, PyTorch's distributed actor framework. Understanding this connection is crucial for optimization and debugging. + +## The Complete Hierarchy: Service to Silicon + +```mermaid +graph TD + subgraph YourCode["1. Your RL Code"] + Call["await policy_service.generate.choose('What is 2+2?')"] + end + + subgraph ForgeServices["2. Forge Service Layer"] + ServiceInterface["ServiceInterface
• Routes .choose() to replica
• Handles load balancing
• Manages health checks"] + ServiceActor["ServiceActor
• Manages replica lifecycle
• Monitors health
• Coordinates failures"] + end + + subgraph MonarchLayer["3. Monarch Actor Layer"] + ActorMesh["ActorMesh[PolicyActor]
• 4 PolicyActor instances
• Each on different GPU
• Message passing interface"] + ProcMesh["ProcMesh
• 4 processes
• GPU topology: [0,1,2,3]
• Network interconnect"] + end + + subgraph Hardware["4. Physical Hardware"] + GPU0["GPU 0
PolicyActor #1
vLLM Engine
Model Weights"] + GPU1["GPU 1
PolicyActor #2
vLLM Engine
Model Weights"] + GPU2["GPU 2
PolicyActor #3
vLLM Engine
Model Weights"] + GPU3["GPU 3
PolicyActor #4
vLLM Engine
Model Weights"] + end + + Call --> ServiceInterface + ServiceInterface --> ServiceActor + ServiceActor --> ActorMesh + ActorMesh --> ProcMesh + ProcMesh --> GPU0 + ProcMesh --> GPU1 + ProcMesh --> GPU2 + ProcMesh --> GPU3 + + style Call fill:#99ff99 + style ServiceActor fill:#ffcc99 + style ActorMesh fill:#cc99ff + style ProcMesh fill:#ccccff +``` + +## Deep Dive: ProcMesh - The Foundation + +**ProcMesh** is Monarch's core abstraction for organizing processes across hardware. Think of it as a multi-dimensional grid that maps directly to your cluster topology. + +### Single Host ProcMesh + +```mermaid +graph TD + subgraph Host["Single Host (8 GPUs)"] + subgraph ProcMesh["ProcMesh: per_host={'gpus': 8}"] + P0["Process 0
GPU 0"] + P1["Process 1
GPU 1"] + P2["Process 2
GPU 2"] + P3["Process 3
GPU 3"] + P4["Process 4
GPU 4"] + P5["Process 5
GPU 5"] + P6["Process 6
GPU 6"] + P7["Process 7
GPU 7"] + end + + P0 -.->|"Network"| P1 + P1 -.->|"Network"| P2 + P2 -.->|"Network"| P3 + P3 -.->|"Network"| P4 + P4 -.->|"Network"| P5 + P5 -.->|"Network"| P6 + P6 -.->|"Network"| P7 + P7 -.->|"Network"| P0 + end + + style P0 fill:#ff9999 + style P1 fill:#ff9999 + style P2 fill:#ff9999 + style P3 fill:#ff9999 + style P4 fill:#ff9999 + style P5 fill:#ff9999 + style P6 fill:#ff9999 + style P7 fill:#ff9999 +``` + +### Multi-Host ProcMesh + +```mermaid +graph TD + subgraph Cluster["Multi-Host Cluster"] + subgraph Host1["Host 1"] + subgraph PM1["ProcMesh Segment 1"] + H1P0["Process 0
GPU 0"] + H1P1["Process 1
GPU 1"] + H1P2["Process 2
GPU 2"] + H1P3["Process 3
GPU 3"] + end + end + + subgraph Host2["Host 2"] + subgraph PM2["ProcMesh Segment 2"] + H2P0["Process 4
GPU 0"] + H2P1["Process 5
GPU 1"] + H2P2["Process 6
GPU 2"] + H2P3["Process 7
GPU 3"] + end + end + + subgraph Host3["Host 3"] + subgraph PM3["ProcMesh Segment 3"] + H3P0["Process 8
GPU 0"] + H3P1["Process 9
GPU 1"] + H3P2["Process 10
GPU 2"] + H3P3["Process 11
GPU 3"] + end + end + end + + H1P0 -.->|"InfiniBand"| H2P0 + H1P1 -.->|"InfiniBand"| H2P1 + H2P0 -.->|"InfiniBand"| H3P0 + H2P1 -.->|"InfiniBand"| H3P1 + + style PM1 fill:#ff9999 + style PM2 fill:#99ff99 + style PM3 fill:#99ccff +``` + +```python +# This shows the underlying actor system that powers Forge services + +from monarch.actor import Actor, endpoint, this_proc, Future +from monarch.actor import ProcMesh, this_host +import asyncio + +# STEP 1: Define a basic actor +class Counter(Actor): + def __init__(self, initial_value: int): + self.value = initial_value + + @endpoint + def increment(self) -> None: + self.value += 1 + + @endpoint + def get_value(self) -> int: + return self.value + +# STEP 2: Single actor in local process +counter: Counter = this_proc().spawn("counter", Counter, initial_value=0) + +# STEP 3: Send messages +fut: Future[int] = counter.get_value.call_one() +value = await fut +print(f"Counter value: {value}") # 0 + +# STEP 4: Multiple actors across processes +procs: ProcMesh = this_host().spawn_procs(per_host={"gpus": 8}) +counters: Counter = procs.spawn("counters", Counter, 0) + +# STEP 5: Broadcast to all actors +await counters.increment.call() + +# STEP 6: Different message patterns +# call_one() - single actor +value = await counters.get_value.call_one() +print(f"One counter: {value}") + +# choose() - random single actor +value = await counters.get_value.choose() +print(f"Random counter: {value}") + +# call() - all actors, collect results +values = await counters.get_value.call() +print(f"All counters: {values}") + +# broadcast() - fire and forget +await counters.increment.broadcast() + +# Cleanup +await procs.stop() +``` + +## Actor Meshes: Your Code Running Distributed + +**ActorMesh** is created when you spawn actors across a ProcMesh. Each process in the ProcMesh gets one instance of your actor. + +```mermaid +graph TD + subgraph Creation["Actor Creation Process"] + Code["mesh.spawn('policy', PolicyActor, model='Qwen/Qwen3-7B')"] + + subgraph ProcMesh["ProcMesh (4 processes)"] + P0["Process 0
GPU 0"] + P1["Process 1
GPU 1"] + P2["Process 2
GPU 2"] + P3["Process 3
GPU 3"] + end + + subgraph ActorMesh["ActorMesh[PolicyActor]"] + A0["PolicyActor
Instance #0
model=Qwen/Qwen3-7B
generation_count=0"] + A1["PolicyActor
Instance #1
model=Qwen/Qwen3-7B
generation_count=0"] + A2["PolicyActor
Instance #2
model=Qwen/Qwen3-7B
generation_count=0"] + A3["PolicyActor
Instance #3
model=Qwen/Qwen3-7B
generation_count=0"] + end + + Code --> ProcMesh + P0 --> A0 + P1 --> A1 + P2 --> A2 + P3 --> A3 + end + + style A0 fill:#99ff99 + style A1 fill:#99ff99 + style A2 fill:#99ff99 + style A3 fill:#99ff99 +``` + +### Message Routing Through ActorMesh + +```mermaid +graph TD + subgraph MessageFlow["Message Flow Patterns"] + Client["await policy_actors.generate.METHOD(prompt)"] + + subgraph Methods["Different Adverbs Route Differently"] + Choose["choose()
→ Routes to ONE actor
→ Load balanced"] + Call["call()
→ Routes to ALL actors
→ Collects all results"] + Broadcast["broadcast()
→ Routes to ALL actors
→ Fire and forget"] + Stream["stream()
→ Routes to ALL actors
→ Iterator of results"] + end + + subgraph ActorInstances["PolicyActor Instances"] + A0["Actor 0
GPU 0
generates response"] + A1["Actor 1
GPU 1
generates response"] + A2["Actor 2
GPU 2
generates response"] + A3["Actor 3
GPU 3
generates response"] + end + + Client --> Choose + Client --> Call + Client --> Broadcast + Client --> Stream + + Choose -.->|"Load balanced"| A1 + Call --> A0 + Call --> A1 + Call --> A2 + Call --> A3 + Broadcast --> A0 + Broadcast --> A1 + Broadcast --> A2 + Broadcast --> A3 + Stream --> A0 + Stream --> A1 + Stream --> A2 + Stream --> A3 + end + + style Choose fill:#99ff99 + style Call fill:#ffcc99 + style Broadcast fill:#ff99cc + style Stream fill:#cc99ff +``` + +## How Forge Services Use Monarch + +Now the key insight: **Forge services are ServiceActors that manage ActorMeshes of your ForgeActor replicas**. + +### The Service Creation Process + +```mermaid +graph TD + subgraph ServiceCreation["spawn_service() Process"] + Call["await spawn_service(ServiceConfig(num_replicas=4), PolicyActor, model='Qwen')"] + + ServiceActor["ServiceActor
• Manages 4 replicas
• Handles health checks
• Routes service calls"] + + subgraph Replicas["4 Independent Replicas"] + subgraph R0["Replica 0"] + PM0["ProcMesh
1 process
GPU 0"] + AM0["ActorMesh
1 PolicyActor"] + end + + subgraph R1["Replica 1"] + PM1["ProcMesh
1 process
GPU 1"] + AM1["ActorMesh
1 PolicyActor"] + end + + subgraph R2["Replica 2"] + PM2["ProcMesh
1 process
GPU 2"] + AM2["ActorMesh
1 PolicyActor"] + end + + subgraph R3["Replica 3"] + PM3["ProcMesh
1 process
GPU 3"] + AM3["ActorMesh
1 PolicyActor"] + end + end + + Call --> ServiceActor + ServiceActor --> R0 + ServiceActor --> R1 + ServiceActor --> R2 + ServiceActor --> R3 + PM0 --> AM0 + PM1 --> AM1 + PM2 --> AM2 + PM3 --> AM3 + end + + style ServiceActor fill:#ffcc99 + style AM0 fill:#99ff99 + style AM1 fill:#99ff99 + style AM2 fill:#99ff99 + style AM3 fill:#99ff99 +``` + +### Service Call to Actor Execution + +```mermaid +graph TD + subgraph CallFlow["Complete Call Flow"] + UserCall["await policy_service.generate.choose('What is 2+2?')"] + + ServiceInterface["ServiceInterface
• Receives .choose() call
• Routes to ServiceActor"] + + ServiceActor["ServiceActor
• Selects healthy replica
• Load balancing logic
• Failure handling"] + + SelectedReplica["Selected Replica #2
• ProcMesh with 1 process
• ActorMesh with 1 PolicyActor"] + + PolicyActor["PolicyActor Instance
• Loads model
• Runs vLLM inference
• Returns 'The answer is 4'"] + + GPU["GPU 2
• vLLM engine
• Model weights
• KV cache
• CUDA kernels"] + + UserCall --> ServiceInterface + ServiceInterface --> ServiceActor + ServiceActor --> SelectedReplica + SelectedReplica --> PolicyActor + PolicyActor --> GPU + + GPU -.->|"Response"| PolicyActor + PolicyActor -.->|"Response"| SelectedReplica + SelectedReplica -.->|"Response"| ServiceActor + ServiceActor -.->|"Response"| ServiceInterface + ServiceInterface -.->|"'The answer is 4'"| UserCall + end + + style UserCall fill:#99ff99 + style ServiceActor fill:#ffcc99 + style PolicyActor fill:#cc99ff + style GPU fill:#ffcccc +``` + +## Multiple Services Sharing Infrastructure + +In real RL systems, you have multiple services that can share or use separate ProcMeshes: + +```mermaid +graph TD + subgraph Cluster["RL Training Cluster"] + subgraph Services["Forge Services"] + PS["Policy Service
4 GPU replicas"] + TS["Trainer Service
2 GPU replicas"] + RS["Reward Service
4 CPU replicas"] + BS["Buffer Service
1 CPU replica"] + end + + subgraph MonarchInfra["Monarch Infrastructure"] + subgraph GPUMesh["GPU ProcMesh (6 processes)"] + G0["Process 0
GPU 0"] + G1["Process 1
GPU 1"] + G2["Process 2
GPU 2"] + G3["Process 3
GPU 3"] + G4["Process 4
GPU 4"] + G5["Process 5
GPU 5"] + end + + subgraph CPUMesh["CPU ProcMesh (5 processes)"] + C0["Process 0
CPU"] + C1["Process 1
CPU"] + C2["Process 2
CPU"] + C3["Process 3
CPU"] + C4["Process 4
CPU"] + end + end + + PS --> G0 + PS --> G1 + PS --> G2 + PS --> G3 + TS --> G4 + TS --> G5 + RS --> C0 + RS --> C1 + RS --> C2 + RS --> C3 + BS --> C4 + end + + style PS fill:#99ff99 + style TS fill:#ff99cc + style RS fill:#ffcc99 + style BS fill:#cc99ff + style GPUMesh fill:#ffe6e6 + style CPUMesh fill:#e6f3ff +``` + +## Key Insights: Why This Architecture Matters + +1. **Process Isolation**: Each actor runs in its own process - failures don't cascade +2. **Location Transparency**: Actors can be local or remote with identical APIs +3. **Structured Distribution**: ProcMesh maps directly to hardware topology +4. **Message Passing**: No shared memory means no race conditions or locks +5. **Service Abstraction**: Forge hides Monarch complexity while preserving power + +Understanding this hierarchy helps you: +- **Debug performance issues**: Is the bottleneck at service, actor, or hardware level? +- **Optimize resource usage**: How many replicas per service? GPU vs CPU processes? +- **Handle failures gracefully**: Which layer failed and how to recover? +- **Scale effectively**: Where to add resources for maximum impact? + +# Conclusion + +## What You've Learned + +1. **RL Fundamentals**: How RL concepts map to Forge services with REAL, working examples +2. **Service Abstraction**: How to use Forge services effectively with verified communication patterns +3. **Monarch Foundation**: How Forge services connect to distributed actors and hardware + +## Key Takeaways + +- **Services hide complexity**: Your RL code looks like simple async functions, but runs on distributed clusters +- **Communication patterns matter**: `.route()`, `.fanout()`, sessions, and `.call_one()` each serve specific purposes +- **Architecture understanding helps**: Knowing the Service → Actor → Process → Hardware hierarchy helps you debug, optimize, and scale +- **Always verify APIs**: This guide is verified, but cross-check with source code for latest changes +- **Real API patterns**: Use `.options().as_service()` not `spawn_service()`, use `.route()` not `.choose()`, etc. diff --git a/docs/Tutorials/ReadMe.MD b/docs/Tutorials/ReadMe.MD index 01d750d06..7798b147d 100644 --- a/docs/Tutorials/ReadMe.MD +++ b/docs/Tutorials/ReadMe.MD @@ -11,8 +11,8 @@ Welcome to the Tutorials section! This section is inspired by the A-Z PyTorch tu This section currently is structured in 3 detailed parts: 1. [RL Fundamentals and Understanding Forge Terminology](./1_RL_and_Forge_Fundamentals.MD): This gives a quick refresher of Reinforcement Learning and teaches you Forge Fundamentals -2. []() -3. []() +2. [Forge Internals](./2_Forge_Internals.MD): Goes a layer deeper and explains the internals of Forge +3. [Monarch 101](./3_Monarch_101.MD): It's a 101 to Monarch and how Forge Talks to Monarch Each part builds upon the next and the entire section can be consumed in roughly an hour-Grab a Chai and Enjoy! 
From 21b0924c466f71793891ded231f569807049b392 Mon Sep 17 00:00:00 2001 From: Sanyam Bhutani Date: Thu, 2 Oct 2025 19:26:45 -0700 Subject: [PATCH 07/22] Update 2_Forge_Internals.MD --- docs/Tutorials/2_Forge_Internals.MD | 42 ++++++++++++++--------------- 1 file changed, 20 insertions(+), 22 deletions(-) diff --git a/docs/Tutorials/2_Forge_Internals.MD b/docs/Tutorials/2_Forge_Internals.MD index d55eda51a..0c810a08e 100644 --- a/docs/Tutorials/2_Forge_Internals.MD +++ b/docs/Tutorials/2_Forge_Internals.MD @@ -8,6 +8,8 @@ Now that you see the power of the service abstraction, let's understand what's a When you call `await policy_service.generate(question)`, here's what actually happens: +(Don't worry, we will understand Services right in the next section!) + ```mermaid graph TD Call["Your Code:
await policy_service.generate"] @@ -58,17 +60,19 @@ Policy.options( # Other available options: # hosts=None ) - -# This is the ACTUAL way services are configured in Forge ``` ### 2. Real Service Creation Services are created using the `spawn_service` function: -```python -# This is what ACTUALLY works - copied directly from the notebook +The spawn_service() function automatically handles: +- Spawning actor replicas across processes/GPUs +- Load balancing with .choose() method +- Health monitoring and failure recovery +- Message routing and serialization +```python from forge.controller.service import ServiceConfig, spawn_service from forge.actors.policy import Policy, PolicyConfig, SamplingOverrides, WorkerConfig @@ -89,12 +93,6 @@ prompt = "What is 3 + 5?" responses = await policy.generate.choose(prompt=prompt) print(f"Response: {responses[0].text}") -# The spawn_service() function automatically handles: -# - Spawning actor replicas across processes/GPUs -# - Load balancing with .choose() method -# - Health monitoring and failure recovery -# - Message routing and serialization - # Cleanup when done await shutdown_service(policy) ``` @@ -103,23 +101,23 @@ await shutdown_service(policy) Forge services are implemented as ServiceActors that manage collections of your ForgeActor replicas: -```python -# Forge internals - What happens behind the scenes: -# 1. .as_service() creates a ServiceInterface -# 2. ServiceInterface manages N replicas of your ForgeActor class -# 3. ServiceInterface handles routing between replicas -# 4. You get methods like .route(), .fanout(), etc. +Forge internals - What happens behind the scenes: +1. `.as_service()` creates a `ServiceInterface` +2. `ServiceInterface` manages N replicas of your `ForgeActor` class +3. `ServiceInterface` handles routing between replicas +4. You get methods like `.route()`, `.fanout()`, etc. +```python # Your code sees this: responses = await policy.generate.route(prompt=prompt) - -# But behind the scenes: -# - ServiceInterface selects healthy replica -# - Routes message to that replica's Policy.generate() endpoint -# - Handles failures and retries automatically -# - Returns list[Completion] from the selected replica ``` +But behind the scenes: +- `ServiceInterface` selects healthy replica +- Routes message to that replica's `Policy.generate()` endpoint +- Handles failures and retries automatically +- Returns list[Completion] from the selected replica + ### 3. Different Service Types and Their Characteristics ```mermaid From b581d11bde0ce603cc89f444c51f38add87cc4e2 Mon Sep 17 00:00:00 2001 From: Sanyam Bhutani Date: Thu, 2 Oct 2025 19:34:03 -0700 Subject: [PATCH 08/22] add --- docs/Tutorials/2_Forge_Internals.MD | 43 +++++++++-------------------- docs/Tutorials/3_Monarch_101.MD | 2 ++ 2 files changed, 15 insertions(+), 30 deletions(-) diff --git a/docs/Tutorials/2_Forge_Internals.MD b/docs/Tutorials/2_Forge_Internals.MD index 0c810a08e..9018afe3d 100644 --- a/docs/Tutorials/2_Forge_Internals.MD +++ b/docs/Tutorials/2_Forge_Internals.MD @@ -155,14 +155,14 @@ These communication patterns (\"adverbs\") determine how your service calls are ```python responses = await policy.generate.route(prompt=question) answer = responses[0].text # Extract text from Completion object - -# Behind the scenes: -# 1. Health check eliminates failed replicas -# 2. Load balancer picks least loaded healthy replica -# 3. Request routes to that specific replica -# 4. Automatic retry on different replica if failure ``` +Behind the scenes: +1. 
Health check eliminates failed replicas +2. Load balancer picks least loaded healthy replica +3. Request routes to that specific replica +4. Automatic retry on different replica if failure + **Performance characteristics**: - **Latency**: Lowest (single network hop) - **Throughput**: Limited by single replica capacity @@ -196,7 +196,7 @@ await policy.update_weights.fanout(new_policy_version) **When to use**: You want to process results as they arrive, not wait for all. ```python -# 📝 CONCEPTUAL - Streaming requires custom implementation in your training loop +# CONCEPTUAL - Streaming requires custom implementation in your training loop # The basic ReplayBuffer doesn't have built-in streaming methods # Pattern from apps/grpo/main.py continuous training: @@ -223,7 +223,7 @@ while training: **When to use**: Side effects that don't need responses (notifications, cache updates). ```python -# 📝 CONCEPTUAL - Fire-and-forget requires custom @endpoint implementations +# CONCEPTUAL - Fire-and-forget requires custom @endpoint implementations # The basic services don't have broadcast methods built-in # You would implement custom endpoints in your ForgeActor: @@ -485,36 +485,19 @@ trainer = await RLTrainer.options( ) ``` -### Natural Backpressure Through Service APIs - -```python -# backpressure pattern - The replay buffer naturally provides backpressure -batch = await replay_buffer.sample.call_one(curr_policy_version=step) -if batch is None: - # Not enough data yet - natural rate limiting - print("Buffer not ready, collecting more experiences...") - continue -else: - # Proceed with training - loss = await trainer.train_step.call_one(batch) - print(f"Training loss: {loss}") -``` - -These patterns address the core technical challenges in distributed RL. The key insight: **Forge services handle coordination complexity automatically, letting you focus on RL algorithm logic**. - ## Service Implementation Example Let's see how a reward service is actually implemented: ```python -# ✅ COMPLETE WORKING EXAMPLE - Exact RewardActor from apps/grpo/main.py +# Exact RewardActor from apps/grpo/main.py from forge.controller import ForgeActor from monarch.actor import endpoint from forge.data.rewards import MathReward, ThinkingReward from forge.controller.service import ServiceConfig, spawn_service -# EXACT class definition from apps/grpo/main.py lines 68-83 +# class definition from apps/grpo/main.py class RewardActor(ForgeActor): def __init__(self, reward_functions: list): self.reward_functions = reward_functions @@ -573,7 +556,7 @@ from forge.data.rewards import MathReward, ThinkingReward from monarch.actor import endpoint from omegaconf import DictConfig -# EXACT service creation from apps/grpo/main.py lines 322-344 +# Service creation from apps/grpo/main.py lines 322-344 print("Initializing all services...") ( dataloader, @@ -601,7 +584,6 @@ print("Initializing all services...") print("All services initialized successfully!") -# EXACT usage patterns from apps/grpo/main.py continuous training loop async def production_training_loop(): """Real training loop pattern from apps/grpo/main.py""" step = 0 @@ -639,7 +621,6 @@ async def production_training_loop(): print(f"Step {step}, Loss: {loss:.4f}") step += 1 -# EXACT cleanup pattern from apps/grpo/main.py lines 493-504 print("Shutting down services...") await asyncio.gather( DatasetActor.shutdown(dataloader), @@ -661,3 +642,5 @@ print("All services shut down successfully!") 5. 
**Coordination**: Services coordinate through shared state (replay buffer, weight versions) This is the power of the service abstraction - complex distributed coordination looks like simple async Python code. + +In the next part we will learn about [Monarch internals](./3_Monarch_101.MD) \ No newline at end of file diff --git a/docs/Tutorials/3_Monarch_101.MD b/docs/Tutorials/3_Monarch_101.MD index 9369be13a..94c02c37e 100644 --- a/docs/Tutorials/3_Monarch_101.MD +++ b/docs/Tutorials/3_Monarch_101.MD @@ -1,5 +1,7 @@ # Part 3: The Forge-Monarch Connection +This is part 3 of our series, in the previous sections: we learned [RL Concepts and how they map to Forge](./1_RL_and_Forge_Fundamentals.MD), [Forge Internals](./2_Forge_Internals.MD). + Now let's peel back the layers. Forge services are built on top of **Monarch**, PyTorch's distributed actor framework. Understanding this connection is crucial for optimization and debugging. ## The Complete Hierarchy: Service to Silicon From cb2ce542a9a9b8611ad19fa894dc39b9cb0e7f21 Mon Sep 17 00:00:00 2001 From: Sanyam Bhutani Date: Thu, 2 Oct 2025 19:38:48 -0700 Subject: [PATCH 09/22] Update 3_Monarch_101.MD --- docs/Tutorials/3_Monarch_101.MD | 124 ++++++++++++++++---------------- 1 file changed, 62 insertions(+), 62 deletions(-) diff --git a/docs/Tutorials/3_Monarch_101.MD b/docs/Tutorials/3_Monarch_101.MD index 94c02c37e..7b3f6d310 100644 --- a/docs/Tutorials/3_Monarch_101.MD +++ b/docs/Tutorials/3_Monarch_101.MD @@ -11,24 +11,24 @@ graph TD subgraph YourCode["1. Your RL Code"] Call["await policy_service.generate.choose('What is 2+2?')"] end - + subgraph ForgeServices["2. Forge Service Layer"] ServiceInterface["ServiceInterface
• Routes .choose() to replica
• Handles load balancing
• Manages health checks"] ServiceActor["ServiceActor
• Manages replica lifecycle
• Monitors health
• Coordinates failures"] end - - subgraph MonarchLayer["3. Monarch Actor Layer"] - ActorMesh["ActorMesh[PolicyActor]
• 4 PolicyActor instances
• Each on different GPU
• Message passing interface"] - ProcMesh["ProcMesh
• 4 processes
• GPU topology: [0,1,2,3]
• Network interconnect"] + + subgraph MonarchLayer["3. Monarch Actor Layer"] + ActorMesh["ActorMesh PolicyActor
• 4 PolicyActor instances
• Each on different GPU
• Message passing interface"] + ProcMesh["ProcMesh
• 4 processes
• GPU topology: 0,1,2,3
• Network interconnect"] end - + subgraph Hardware["4. Physical Hardware"] GPU0["GPU 0
PolicyActor #1
vLLM Engine
Model Weights"] - GPU1["GPU 1
PolicyActor #2
vLLM Engine
Model Weights"] + GPU1["GPU 1
PolicyActor #2
vLLM Engine
Model Weights"] GPU2["GPU 2
PolicyActor #3
vLLM Engine
Model Weights"] GPU3["GPU 3
PolicyActor #4
vLLM Engine
Model Weights"] end - + Call --> ServiceInterface ServiceInterface --> ServiceActor ServiceActor --> ActorMesh @@ -37,7 +37,7 @@ graph TD ProcMesh --> GPU1 ProcMesh --> GPU2 ProcMesh --> GPU3 - + style Call fill:#99ff99 style ServiceActor fill:#ffcc99 style ActorMesh fill:#cc99ff @@ -55,17 +55,17 @@ graph TD subgraph Host["Single Host (8 GPUs)"] subgraph ProcMesh["ProcMesh: per_host={'gpus': 8}"] P0["Process 0
GPU 0"] - P1["Process 1
GPU 1"] + P1["Process 1
GPU 1"] P2["Process 2
GPU 2"] P3["Process 3
GPU 3"] P4["Process 4
GPU 4"] P5["Process 5
GPU 5"] - P6["Process 6
GPU 6"] + P6["Process 6
GPU 6"] P7["Process 7
GPU 7"] end - + P0 -.->|"Network"| P1 - P1 -.->|"Network"| P2 + P1 -.->|"Network"| P2 P2 -.->|"Network"| P3 P3 -.->|"Network"| P4 P4 -.->|"Network"| P5 @@ -73,7 +73,7 @@ graph TD P6 -.->|"Network"| P7 P7 -.->|"Network"| P0 end - + style P0 fill:#ff9999 style P1 fill:#ff9999 style P2 fill:#ff9999 @@ -97,8 +97,8 @@ graph TD H1P3["Process 3
GPU 3"] end end - - subgraph Host2["Host 2"] + + subgraph Host2["Host 2"] subgraph PM2["ProcMesh Segment 2"] H2P0["Process 4
GPU 0"] H2P1["Process 5
GPU 1"] @@ -106,22 +106,22 @@ graph TD H2P3["Process 7
GPU 3"] end end - + subgraph Host3["Host 3"] subgraph PM3["ProcMesh Segment 3"] H3P0["Process 8
GPU 0"] H3P1["Process 9
GPU 1"] - H3P2["Process 10
GPU 2"] + H3P2["Process 10
GPU 2"] H3P3["Process 11
GPU 3"] end end end - + H1P0 -.->|"InfiniBand"| H2P0 H1P1 -.->|"InfiniBand"| H2P1 H2P0 -.->|"InfiniBand"| H3P0 H2P1 -.->|"InfiniBand"| H3P1 - + style PM1 fill:#ff9999 style PM2 fill:#99ff99 style PM3 fill:#99ccff @@ -167,7 +167,7 @@ await counters.increment.call() value = await counters.get_value.call_one() print(f"One counter: {value}") -# choose() - random single actor +# choose() - random single actor value = await counters.get_value.choose() print(f"Random counter: {value}") @@ -190,28 +190,28 @@ await procs.stop() graph TD subgraph Creation["Actor Creation Process"] Code["mesh.spawn('policy', PolicyActor, model='Qwen/Qwen3-7B')"] - + subgraph ProcMesh["ProcMesh (4 processes)"] - P0["Process 0
GPU 0"] + P0["Process 0
GPU 0"] P1["Process 1
GPU 1"] P2["Process 2
GPU 2"] P3["Process 3
GPU 3"] end - + subgraph ActorMesh["ActorMesh[PolicyActor]"] A0["PolicyActor
Instance #0
model=Qwen/Qwen3-7B
generation_count=0"] A1["PolicyActor
Instance #1
model=Qwen/Qwen3-7B
generation_count=0"] A2["PolicyActor
Instance #2
model=Qwen/Qwen3-7B
generation_count=0"] A3["PolicyActor
Instance #3
model=Qwen/Qwen3-7B
generation_count=0"] end - + Code --> ProcMesh P0 --> A0 P1 --> A1 P2 --> A2 P3 --> A3 end - + style A0 fill:#99ff99 style A1 fill:#99ff99 style A2 fill:#99ff99 @@ -224,29 +224,29 @@ graph TD graph TD subgraph MessageFlow["Message Flow Patterns"] Client["await policy_actors.generate.METHOD(prompt)"] - + subgraph Methods["Different Adverbs Route Differently"] Choose["choose()
→ Routes to ONE actor
→ Load balanced"] - Call["call()
→ Routes to ALL actors
→ Collects all results"] + Call["call()
→ Routes to ALL actors
→ Collects all results"] Broadcast["broadcast()
→ Routes to ALL actors
→ Fire and forget"] Stream["stream()
→ Routes to ALL actors
→ Iterator of results"] end - + subgraph ActorInstances["PolicyActor Instances"] A0["Actor 0
GPU 0
generates response"] - A1["Actor 1
GPU 1
generates response"] + A1["Actor 1
GPU 1
generates response"] A2["Actor 2
GPU 2
generates response"] A3["Actor 3
GPU 3
generates response"] end - + Client --> Choose Client --> Call Client --> Broadcast Client --> Stream - + Choose -.->|"Load balanced"| A1 Call --> A0 - Call --> A1 + Call --> A1 Call --> A2 Call --> A3 Broadcast --> A0 @@ -258,7 +258,7 @@ graph TD Stream --> A2 Stream --> A3 end - + style Choose fill:#99ff99 style Call fill:#ffcc99 style Broadcast fill:#ff99cc @@ -275,31 +275,31 @@ Now the key insight: **Forge services are ServiceActors that manage ActorMeshes graph TD subgraph ServiceCreation["spawn_service() Process"] Call["await spawn_service(ServiceConfig(num_replicas=4), PolicyActor, model='Qwen')"] - + ServiceActor["ServiceActor
• Manages 4 replicas
• Handles health checks
• Routes service calls"] - - subgraph Replicas["4 Independent Replicas"] + + subgraph Replicas["4 Independent Replicas"] subgraph R0["Replica 0"] PM0["ProcMesh
1 process
GPU 0"] AM0["ActorMesh
1 PolicyActor"] end - + subgraph R1["Replica 1"] - PM1["ProcMesh
1 process
GPU 1"] + PM1["ProcMesh
1 process
GPU 1"] AM1["ActorMesh
1 PolicyActor"] end - + subgraph R2["Replica 2"] PM2["ProcMesh
1 process
GPU 2"] AM2["ActorMesh
1 PolicyActor"] end - + subgraph R3["Replica 3"] PM3["ProcMesh
1 process
GPU 3"] AM3["ActorMesh
1 PolicyActor"] end end - + Call --> ServiceActor ServiceActor --> R0 ServiceActor --> R1 @@ -310,7 +310,7 @@ graph TD PM2 --> AM2 PM3 --> AM3 end - + style ServiceActor fill:#ffcc99 style AM0 fill:#99ff99 style AM1 fill:#99ff99 @@ -324,30 +324,30 @@ graph TD graph TD subgraph CallFlow["Complete Call Flow"] UserCall["await policy_service.generate.choose('What is 2+2?')"] - + ServiceInterface["ServiceInterface
• Receives .choose() call
• Routes to ServiceActor"] - + ServiceActor["ServiceActor
• Selects healthy replica
• Load balancing logic
• Failure handling"] - + SelectedReplica["Selected Replica #2
• ProcMesh with 1 process
• ActorMesh with 1 PolicyActor"] - + PolicyActor["PolicyActor Instance
• Loads model
• Runs vLLM inference
• Returns 'The answer is 4'"] - + GPU["GPU 2
• vLLM engine
• Model weights
• KV cache
• CUDA kernels"] - + UserCall --> ServiceInterface ServiceInterface --> ServiceActor ServiceActor --> SelectedReplica SelectedReplica --> PolicyActor PolicyActor --> GPU - + GPU -.->|"Response"| PolicyActor PolicyActor -.->|"Response"| SelectedReplica SelectedReplica -.->|"Response"| ServiceActor ServiceActor -.->|"Response"| ServiceInterface ServiceInterface -.->|"'The answer is 4'"| UserCall end - + style UserCall fill:#99ff99 style ServiceActor fill:#ffcc99 style PolicyActor fill:#cc99ff @@ -361,32 +361,32 @@ In real RL systems, you have multiple services that can share or use separate Pr ```mermaid graph TD subgraph Cluster["RL Training Cluster"] - subgraph Services["Forge Services"] + subgraph Services["Forge Services"] PS["Policy Service
4 GPU replicas"] - TS["Trainer Service
2 GPU replicas"] + TS["Trainer Service
2 GPU replicas"] RS["Reward Service
4 CPU replicas"] BS["Buffer Service
1 CPU replica"] end - + subgraph MonarchInfra["Monarch Infrastructure"] subgraph GPUMesh["GPU ProcMesh (6 processes)"] G0["Process 0
GPU 0"] G1["Process 1
GPU 1"] - G2["Process 2
GPU 2"] + G2["Process 2
GPU 2"] G3["Process 3
GPU 3"] G4["Process 4
GPU 4"] G5["Process 5
GPU 5"] end - + subgraph CPUMesh["CPU ProcMesh (5 processes)"] C0["Process 0
CPU"] - C1["Process 1
CPU"] + C1["Process 1
CPU"] C2["Process 2
CPU"] C3["Process 3
CPU"] C4["Process 4
CPU"] end end - + PS --> G0 PS --> G1 PS --> G2 @@ -399,7 +399,7 @@ graph TD RS --> C3 BS --> C4 end - + style PS fill:#99ff99 style TS fill:#ff99cc style RS fill:#ffcc99 @@ -411,7 +411,7 @@ graph TD ## Key Insights: Why This Architecture Matters 1. **Process Isolation**: Each actor runs in its own process - failures don't cascade -2. **Location Transparency**: Actors can be local or remote with identical APIs +2. **Location Transparency**: Actors can be local or remote with identical APIs 3. **Structured Distribution**: ProcMesh maps directly to hardware topology 4. **Message Passing**: No shared memory means no race conditions or locks 5. **Service Abstraction**: Forge hides Monarch complexity while preserving power @@ -427,13 +427,13 @@ Understanding this hierarchy helps you: ## What You've Learned 1. **RL Fundamentals**: How RL concepts map to Forge services with REAL, working examples -2. **Service Abstraction**: How to use Forge services effectively with verified communication patterns +2. **Service Abstraction**: How to use Forge services effectively with verified communication patterns 3. **Monarch Foundation**: How Forge services connect to distributed actors and hardware ## Key Takeaways - **Services hide complexity**: Your RL code looks like simple async functions, but runs on distributed clusters -- **Communication patterns matter**: `.route()`, `.fanout()`, sessions, and `.call_one()` each serve specific purposes +- **Communication patterns matter**: `.route()`, `.fanout()`, sessions, and `.call_one()` each serve specific purposes - **Architecture understanding helps**: Knowing the Service → Actor → Process → Hardware hierarchy helps you debug, optimize, and scale - **Always verify APIs**: This guide is verified, but cross-check with source code for latest changes - **Real API patterns**: Use `.options().as_service()` not `spawn_service()`, use `.route()` not `.choose()`, etc. From 07b059777666027e91b00304d1baeafb5470ad9a Mon Sep 17 00:00:00 2001 From: Sanyam Bhutani Date: Thu, 2 Oct 2025 19:40:02 -0700 Subject: [PATCH 10/22] Update 3_Monarch_101.MD --- docs/Tutorials/3_Monarch_101.MD | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/Tutorials/3_Monarch_101.MD b/docs/Tutorials/3_Monarch_101.MD index 7b3f6d310..0b1b4bd79 100644 --- a/docs/Tutorials/3_Monarch_101.MD +++ b/docs/Tutorials/3_Monarch_101.MD @@ -198,7 +198,7 @@ graph TD P3["Process 3
GPU 3"] end - subgraph ActorMesh["ActorMesh[PolicyActor]"] + subgraph ActorMesh["ActorMesh PolicyActor"] A0["PolicyActor
Instance #0
model=Qwen/Qwen3-7B
generation_count=0"] A1["PolicyActor
Instance #1
model=Qwen/Qwen3-7B
generation_count=0"] A2["PolicyActor
Instance #2
model=Qwen/Qwen3-7B
generation_count=0"] From 55dc5b452bd757deaf31f9262976c676556cf05b Mon Sep 17 00:00:00 2001 From: Sanyam Bhutani Date: Thu, 2 Oct 2025 19:40:40 -0700 Subject: [PATCH 11/22] Update 3_Monarch_101.MD --- docs/Tutorials/3_Monarch_101.MD | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/Tutorials/3_Monarch_101.MD b/docs/Tutorials/3_Monarch_101.MD index 0b1b4bd79..52a058dcc 100644 --- a/docs/Tutorials/3_Monarch_101.MD +++ b/docs/Tutorials/3_Monarch_101.MD @@ -1,6 +1,6 @@ # Part 3: The Forge-Monarch Connection -This is part 3 of our series, in the previous sections: we learned [RL Concepts and how they map to Forge](./1_RL_and_Forge_Fundamentals.MD), [Forge Internals](./2_Forge_Internals.MD). +This is part 3 of our series, in the previous sections: we learned Part 1: [RL Concepts and how they map to Forge](./1_RL_and_Forge_Fundamentals.MD), Part 2: [Forge Internals](./2_Forge_Internals.MD). Now let's peel back the layers. Forge services are built on top of **Monarch**, PyTorch's distributed actor framework. Understanding this connection is crucial for optimization and debugging. From 7366497a21f7268bec99fc3d454bf5c40d81dd50 Mon Sep 17 00:00:00 2001 From: Sanyam Bhutani Date: Fri, 3 Oct 2025 00:22:10 -0700 Subject: [PATCH 12/22] fix funcs --- docs/Tutorials/1_RL_and_Forge_Fundamentals.MD | 152 ++++++------- docs/Tutorials/2_Forge_Internals.MD | 200 ++++++++++-------- docs/Tutorials/3_Monarch_101.MD | 14 +- 3 files changed, 199 insertions(+), 167 deletions(-) diff --git a/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD b/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD index 810ef373f..c34ae6639 100644 --- a/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD +++ b/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD @@ -114,33 +114,36 @@ Let's look at the example from above again, but this time we would use the names # Conceptual Example async def conceptual_forge_rl_step(services, step): - # 1. Get a math problem - CONCEPTUAL API - sample = await services['dataloader'].get_sample() - question, target = sample["question"], sample["answer"] + # 1. Get a math problem - Using actual DatasetActor API + sample = await services['dataloader'].sample.call_one() + question, target = sample["request"], sample["target"] - # 2. Student generates answer - CONCEPTUAL API - # Actual method names vary by implementation - responses = await services['policy'].generate(prompt=question) + # 2. Student generates answer - Using actual Policy API + responses = await services['policy'].generate.route(prompt=question) answer = responses[0].text - # 3. Teacher grades it - CONCEPTUAL API - # Actual reward evaluation varies by implementation - score = await services['reward_actor'].evaluate( + # 3. Teacher grades it - Using actual RewardActor API + score = await services['reward_actor'].evaluate_response.route( prompt=question, response=answer, target=target ) - # 4. Compare to baseline - CONCEPTUAL API - ref_logprobs = await services['ref_model'].compute_baseline(responses[0].token_ids) + # 4. Compare to baseline - Using actual ReferenceModel API + # Note: ReferenceModel.forward requires input_ids, max_req_tokens, return_logprobs + ref_logprobs = await services['ref_model'].forward.route( + input_ids, max_req_tokens, return_logprobs=True + ) - # 5. Store experience - CONCEPTUAL Episode structure - # Real Episode structure in src/forge/data_models/episode.py - episode = create_episode(responses[0], score, ref_logprobs, step) - await services['replay_buffer'].store(episode) + # 5. 
Store experience - Using actual Episode structure from apps/grpo/main.py + episode = create_episode_from_response(responses[0], score, ref_logprobs, step) + await services['replay_buffer'].add.call_one(episode) - # 6. Improve student - CONCEPTUAL API - batch = await services['replay_buffer'].get_batch(policy_version=step) + # 6. Improve student - Using actual training pattern + batch = await services['replay_buffer'].sample.call_one( + curr_policy_version=step + ) if batch is not None: - loss = await services['trainer'].update_policy(batch) + inputs, targets = batch + loss = await services['trainer'].train_step.call(inputs, targets) return loss ``` @@ -234,34 +237,38 @@ Let's see how core RL concepts map to Forge services: async def real_rl_training_step(services, step): """Single RL step using verified Forge APIs""" - # 1. Environment interaction - sample = await services['dataloader'].__next__.call_one() - prompt, target = sample["question"], sample["answer"] + # 1. Environment interaction - Using actual DatasetActor API + sample = await services['dataloader'].sample.call_one() + prompt, target = sample["request"], sample["target"] - responses = await services['policy'].generate.route(prompt=prompt) + responses = await services['policy'].generate.route(prompt) - # 2. Reward computation + # 2. Reward computation - Using actual RewardActor API score = await services['reward_actor'].evaluate_response.route( prompt=prompt, response=responses[0].text, target=target ) - # 3. Get reference logprobs - ref_logprobs = await services['ref_model'].forward.route(responses[0].token_ids) + # 3. Get reference logprobs - Using actual ReferenceModel API + # Note: ReferenceModel requires full input_ids tensor, not just tokens + input_ids = torch.cat([responses[0].prompt_ids, responses[0].token_ids]) + ref_logprobs = await services['ref_model'].forward.route( + input_ids.unsqueeze(0), max_req_tokens=512, return_logprobs=True + ) - # 4. Experience storage - Episode creation pattern - # Note: Actual Episode structure requires token tensors, not text + # 4. Experience storage - Using actual Episode pattern from GRPO episode = create_episode_from_response(responses[0], score, ref_logprobs, step) await services['replay_buffer'].add.call_one(episode) - # 5. Learning - trainer endpoint + # 5. Learning - Using actual trainer pattern batch = await services['replay_buffer'].sample.call_one( curr_policy_version=step ) if batch is not None: - loss = await services['trainer'].train_step.call_one(batch) + inputs, targets = batch # GRPO returns (inputs, targets) tuple + loss = await services['trainer'].train_step.call(inputs, targets) - # 6. Policy synchronization - weight update pattern - await services['trainer'].push_weights.call_one(step + 1) + # 6. 
Policy synchronization - Using actual weight update pattern + await services['trainer'].push_weights.call(step + 1) await services['policy'].update_weights.fanout(step + 1) return loss @@ -287,12 +294,14 @@ Forge handles behind the scenes: ### Independent Scaling ```python -from forge.actors.policy import Policy, PolicyConfig, SamplingOverrides, WorkerConfig +from forge.actors.policy import Policy from forge.actors.replay_buffer import ReplayBuffer -from forge.controller.service import shutdown_service -from apps.grpo.main import Trainer, RewardActor, ComputeAdvantages, RefModel, DatasetActor +from forge.actors.reference_model import ReferenceModel +from forge.actors.trainer import RLTrainer +from apps.grpo.main import DatasetActor, RewardActor, ComputeAdvantages from forge.data.rewards import MathReward, ThinkingReward import asyncio +import torch model = "Qwen/Qwen3-1.7B" group_size = 1 @@ -306,67 +315,60 @@ group_size = 1 ref_model, reward_actor, ) = await asyncio.gather( - # Dataset service - spawn_service( - ServiceConfig(procs_per_replica=1, num_replicas=1), - DatasetActor, + # Dataset actor (CPU) + DatasetActor.options(procs=1).as_actor( path="openai/gsm8k", - config_name="main", - split="train", + revision="main", + data_split="train", streaming=True, + model=model, ), # Policy service with GPU - spawn_service( - ServiceConfig(procs_per_replica=1, with_gpus=True, num_replicas=1), - Policy, - config=PolicyConfig( - worker_params=WorkerConfig(model=model), - sampling_params=SamplingOverrides( - num_samples=group_size, max_tokens=16 - ), - ), + Policy.options(procs=1, with_gpus=True, num_replicas=1).as_service( + engine_config={ + "model": model, + "tensor_parallel_size": 1, + "pipeline_parallel_size": 1, + "enforce_eager": False + }, + sampling_config={ + "n": group_size, + "max_tokens": 16, + "temperature": 1.0, + "top_p": 1.0 + } ), - # Trainer service with GPU - spawn_service( - ServiceConfig(procs_per_replica=1, with_gpus=True, num_replicas=1), - Trainer, - learning_rate=1e-5, - beta=0.1, - model_name=model, + # Trainer actor with GPU + RLTrainer.options(procs=1, with_gpus=True).as_actor( + # Trainer config would come from YAML in real usage + model={"name": "qwen3", "flavor": "1.7B", "hf_assets_path": f"hf://{model}"}, + optimizer={"name": "AdamW", "lr": 1e-5}, + training={"local_batch_size": 2, "seq_len": 2048} ), # Replay buffer (CPU) - spawn_service( - ServiceConfig(procs_per_replica=1, num_replicas=1), - ReplayBuffer, + ReplayBuffer.options(procs=1).as_actor( batch_size=2, max_policy_age=1, + dp_size=1 ), # Advantage computation (CPU) - spawn_service( - ServiceConfig(procs_per_replica=1, num_replicas=1), - ComputeAdvantages, - gamma=0.99, - lambda_=0.95, - ), + ComputeAdvantages.options(procs=1).as_actor(), # Reference model with GPU - spawn_service( - ServiceConfig(procs_per_replica=1, num_replicas=1, with_gpus=True), - RefModel, - model_name=model, + ReferenceModel.options(procs=1, with_gpus=True).as_actor( + model={"name": "qwen3", "flavor": "1.7B", "hf_assets_path": f"hf://{model}"}, + training={"dtype": "bfloat16"} ), # Reward actor (CPU) - spawn_service( - ServiceConfig(procs_per_replica=1, num_replicas=1), - RewardActor, - reward_functions=[MathReward(), ThinkingReward()], + RewardActor.options(procs=1, num_replicas=1).as_service( + reward_functions=[MathReward(), ThinkingReward()] ) ) ``` -Production scaling - multiply num_replicas: +Production scaling - multiply num_replicas for services or spawn multiple actors: - Policy: num_replicas=8 for high inference demand - 
RewardActor: num_replicas=16 for parallel evaluation -- Trainer: num_replicas=4 for distributed training +- Trainer: Multiple actors for distributed training (RLTrainer handles this internally) ### Fault Tolerance diff --git a/docs/Tutorials/2_Forge_Internals.MD b/docs/Tutorials/2_Forge_Internals.MD index 9018afe3d..634f04f85 100644 --- a/docs/Tutorials/2_Forge_Internals.MD +++ b/docs/Tutorials/2_Forge_Internals.MD @@ -65,36 +65,44 @@ Policy.options( ### 2. Real Service Creation Services are created using the `spawn_service` function: +Services are created using the `.options().as_service()` pattern from the actual GRPO implementation: -The spawn_service() function automatically handles: +The service creation automatically handles: - Spawning actor replicas across processes/GPUs -- Load balancing with .choose() method +- Load balancing with .route() method for services - Health monitoring and failure recovery - Message routing and serialization ```python -from forge.controller.service import ServiceConfig, spawn_service -from forge.actors.policy import Policy, PolicyConfig, SamplingOverrides, WorkerConfig +from forge.actors.policy import Policy model = "Qwen/Qwen3-1.7B" -policy = await spawn_service( - ServiceConfig(procs_per_replica=1, with_gpus=True, num_replicas=1), - Policy, - config=PolicyConfig( - worker_params=WorkerConfig(model=model), - sampling_params=SamplingOverrides( - num_samples=1, max_tokens=16 - ), - ), +policy = await Policy.options( + procs=1, + with_gpus=True, + num_replicas=1 +).as_service( + engine_config={ + "model": model, + "tensor_parallel_size": 1, + "pipeline_parallel_size": 1, + "enforce_eager": False + }, + sampling_config={ + "n": 1, + "max_tokens": 16, + "temperature": 1.0, + "top_p": 1.0 + } ) prompt = "What is 3 + 5?" -responses = await policy.generate.choose(prompt=prompt) +responses = await policy.generate.route(prompt) print(f"Response: {responses[0].text}") # Cleanup when done -await shutdown_service(policy) +await policy.shutdown() ``` ### 3. 
How Services Actually Work @@ -253,7 +261,6 @@ class CustomPolicy(Policy): # This Counter example demonstrates the session pattern from forge.controller import ForgeActor -from forge.controller.service import ServiceConfig, spawn_service, shutdown_service from monarch.actor import endpoint class ForgeCounter(ForgeActor): @@ -273,37 +280,35 @@ class ForgeCounter(ForgeActor): async def reset(self): self.value = 0 -counter_service = await spawn_service( - ServiceConfig(procs_per_replica=1, num_replicas=4), - ForgeCounter, - initial_value=0 -) +counter_service = await ForgeCounter.options( + procs=1, num_replicas=4 +).as_service(initial_value=0) # Test basic operations -await counter_service.increment.choose() -results = await counter_service.increment.call() +await counter_service.increment.route() +results = await counter_service.increment.fanout() # Get from all replicas print(f"All replica values: {results}") # STICKY SESSIONS print("\nUsing sticky sessions:") async with counter_service.session(): - await counter_service.reset.choose() - print(await counter_service.increment.choose()) # 1 - print(await counter_service.increment.choose()) # 2 - print(await counter_service.increment.choose()) # 3 + await counter_service.reset.route() # Uses .route() within session + print(await counter_service.increment.route()) # 1 + print(await counter_service.increment.route()) # 2 + print(await counter_service.increment.route()) # 3 - final_value = await counter_service.get_value.choose() + final_value = await counter_service.get_value.route() print(f"Final value on this replica: {final_value}") # 3 # Same pattern works with Policy for multi-turn conversations: # async with policy.session(): -# response1 = await policy.generate.choose(prompt=turn1) +# response1 = await policy.generate.route(turn1) # full_prompt = turn1 + response1[0].text + turn2 -# response2 = await policy.generate.choose(prompt=full_prompt) +# response2 = await policy.generate.route(full_prompt) # # Both calls hit same replica, preserving KV cache # Cleanup -await shutdown_service(counter_service) +await counter_service.shutdown() ``` **Performance impact**: Critical for maintaining KV cache in multi-turn conversations. 
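To make the sticky-session pattern concrete, here is a minimal multi-turn sketch. It assumes the `policy` service spawned earlier in this tutorial is already running; the plain string concatenation used to build the conversation is illustrative only (a real setup would apply the model's chat template), and the turn texts are made up for the example.

```python
# Minimal sketch: multi-turn generation pinned to ONE replica via a sticky session.
# Assumes `policy` was created with Policy.options(...).as_service(...) as shown above.
async def multi_turn_conversation(policy):
    turns = [
        "What is 15% of 240?",
        "Now add 12 to that result.",
    ]
    transcript = ""
    async with policy.session():           # every call below routes to the same replica
        for turn in turns:
            transcript += f"User: {turn}\nAssistant: "
            responses = await policy.generate.route(transcript)
            answer = responses[0].text      # Completion object, as in the examples above
            transcript += answer + "\n"     # earlier turns stay in that replica's KV cache
    return transcript
```

Because every `generate` call inside the `async with` block lands on the same replica, the earlier turns do not need to be re-prefilled on a different GPU - which is exactly the KV-cache benefit described above.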
@@ -395,60 +400,72 @@ print(f"Current policy version: {current_version}") Instead of manual coordination, Forge services handle speed mismatches automatically: ```python - from apps.grpo.main import Episode, Group async def simple_rl_step(): # ===== Generate a rollout ===== - sample = await dataloader.__next__.choose() - prompt, target = sample["question"], sample["answer"] + sample = await dataloader.sample.call_one() # DatasetActor is an actor, not service + prompt, target = sample["request"], sample["target"] # Correct field names print(f"Prompt: {prompt}") print(f"Target: {target}") - actions = await policy.generate.choose(prompt=prompt) + actions = await policy.generate.route(prompt=prompt) # Policy is a service print(f"Policy response: {actions[0].text}") - ref_logprobs = await ref_model.forward.choose(actions[0].token_ids) - reward = await reward_actor.evaluate_response.choose( + # Create input tensor for reference model (requires full context) + input_ids = torch.cat([actions[0].prompt_ids, actions[0].token_ids]) + ref_logprobs = await ref_model.forward.route( + input_ids.unsqueeze(0), max_req_tokens=512, return_logprobs=True + ) + reward = await reward_actor.evaluate_response.route( # RewardActor is a service prompt=prompt, response=actions[0].text, target=target ) print(f"Reward: {reward}") + # Create episode using actual GRPO Episode structure episode = Episode( - episode_id=0, - prompt=prompt, - target=target, + episode_id="0", + request=prompt, policy_version=0, + pad_id=tokenizer.pad_token_id, + request_len=512, + response_len=512, + target=target ) - episode.add_group(Group( - response=actions[0].text, - ref_logprobs=ref_logprobs, - reward=reward, - )) + # Add response data + episode.response = actions[0].text + episode.request_tokens = actions[0].prompt_ids.tolist() + episode.response_tokens = actions[0].token_ids.tolist() + episode.ref_logprobs = ref_logprobs[0] # Extract from batch dimension + episode.reward = reward - advantages = await compute_advantages.__call__.choose(episode.groups) - episode.groups[0].advantage = advantages[0] + # Compute advantages using actual ComputeAdvantages actor + group = Group.new_group(0, 1, prompt, 0, tokenizer.pad_token_id, 512, 512, target) + group.episodes[0] = episode + advantages = await compute_advantages.compute.call_one(group) # ComputeAdvantages is an actor + episode.advantage = advantages[0] print(f"Advantage: {advantages[0]}") - await replay_buffer.add.choose(episode) + await replay_buffer.add.call_one(episode) # ReplayBuffer is an actor print("Episode stored in replay buffer") # ===== Train on the batch ===== - batch = await replay_buffer.sample.choose(curr_policy_version=0) + batch = await replay_buffer.sample.call_one(curr_policy_version=0) if batch is not None: print("Training on batch...") - training_result = await trainer.train_step.choose(batch) - loss = training_result.get("loss", 0.0) + inputs, targets = batch # GRPO returns (inputs, targets) tuple + loss = await trainer.train_step.call(inputs, targets) # RLTrainer is an actor print(f"Training loss: {loss}") return loss else: print("Not enough data in buffer yet") return None +# Note: This simplified example assumes tokenizer and services are already initialized for step in range(10): print(f"\n--- RL Step {step + 1} ---") loss = await simple_rl_step() @@ -467,7 +484,7 @@ for step in range(10): policy = await Policy.options( procs=1, num_replicas=8, with_gpus=True # Many replicas for high throughput ).as_service( - engine_config=EngineConfig(model=model_name) + 
engine_config={"model": model_name, "tensor_parallel_size": 1} ) # Reward evaluation might be CPU-bound @@ -479,9 +496,10 @@ reward_actor = await RewardActor.options( # Training needs fewer but more powerful replicas trainer = await RLTrainer.options( - procs=1, num_replicas=2, with_gpus=True # Fewer but GPU-heavy + procs=1, with_gpus=True # Fewer but GPU-heavy ).as_actor( # Trainer typically uses .as_actor() not .as_service() - optimizer=Optimizer(lr=1e-5) + model={"name": "qwen3", "flavor": "1.7B"}, + optimizer={"name": "AdamW", "lr": 1e-5} ) ``` @@ -495,7 +513,6 @@ Let's see how a reward service is actually implemented: from forge.controller import ForgeActor from monarch.actor import endpoint from forge.data.rewards import MathReward, ThinkingReward -from forge.controller.service import ServiceConfig, spawn_service # class definition from apps/grpo/main.py class RewardActor(ForgeActor): @@ -515,9 +532,9 @@ class RewardActor(ForgeActor): # Return average reward across all functions return total_reward / len(self.reward_functions) if self.reward_functions else 0.0 -reward_actor = await spawn_service( - ServiceConfig(procs_per_replica=1, num_replicas=1), - RewardActor, +reward_actor = await RewardActor.options( + procs=1, num_replicas=1 +).as_service( reward_functions=[MathReward(), ThinkingReward()] ) @@ -525,7 +542,7 @@ prompt = "What is 15% of 240?" response = "15% of 240 is 36" target = "36" -score = await reward_actor.evaluate_response.choose( +score = await reward_actor.evaluate_response.route( prompt=prompt, response=response, target=target @@ -533,10 +550,10 @@ score = await reward_actor.evaluate_response.choose( print(f"Reward score: {score}") # Usually around 1.0 for correct math answers # For production scaling - increase num_replicas for parallel evaluation: -# ServiceConfig(procs_per_replica=1, num_replicas=16) # 16 parallel evaluators +# RewardActor.options(procs=1, num_replicas=16) # 16 parallel evaluators # Cleanup when done -await shutdown_service(reward_actor) +await reward_actor.shutdown() ``` ## Service Orchestration: The Training Loop @@ -547,16 +564,15 @@ Now let's see how services coordinate in a real training loop: # This is the REAL way production RL systems are built with Forge import asyncio +import torch from forge.actors.policy import Policy from forge.actors.reference_model import ReferenceModel from forge.actors.replay_buffer import ReplayBuffer from forge.actors.trainer import RLTrainer -from forge.controller.actor import ForgeActor +from apps.grpo.main import DatasetActor, RewardActor, ComputeAdvantages from forge.data.rewards import MathReward, ThinkingReward -from monarch.actor import endpoint -from omegaconf import DictConfig -# Service creation from apps/grpo/main.py lines 322-344 +# Service creation pattern from apps/grpo/main.py lines 322-344 print("Initializing all services...") ( dataloader, @@ -567,17 +583,27 @@ print("Initializing all services...") ref_model, reward_actor, ) = await asyncio.gather( - DatasetActor.options(**cfg.actors.dataset).as_actor(**cfg.dataset), - Policy.options(**cfg.services.policy).as_service(**cfg.policy), - RLTrainer.options(**cfg.actors.trainer).as_actor( - **cfg.trainer, loss=simple_grpo_loss + DatasetActor.options(procs=1).as_actor( + path="openai/gsm8k", revision="main", data_split="train", + streaming=True, model="Qwen/Qwen3-1.7B" + ), + Policy.options(procs=1, with_gpus=True, num_replicas=1).as_service( + engine_config={"model": "Qwen/Qwen3-1.7B", "tensor_parallel_size": 1}, + sampling_config={"n": 1, 
"max_tokens": 512} ), - ReplayBuffer.options(**cfg.actors.replay_buffer).as_actor( - **cfg.replay_buffer, collate=collate + RLTrainer.options(procs=1, with_gpus=True).as_actor( + model={"name": "qwen3", "flavor": "1.7B", "hf_assets_path": "hf://Qwen/Qwen3-1.7B"}, + optimizer={"name": "AdamW", "lr": 1e-5}, + training={"local_batch_size": 2, "seq_len": 2048} ), - ComputeAdvantages.options(**cfg.actors.compute_advantages).as_actor(), - ReferenceModel.options(**cfg.services.ref_model).as_service(**cfg.ref_model), - RewardActor.options(**cfg.services.reward_actor).as_service( + ReplayBuffer.options(procs=1).as_actor( + batch_size=2, max_policy_age=1, dp_size=1 + ), + ComputeAdvantages.options(procs=1).as_actor(), + ReferenceModel.options(procs=1, with_gpus=True).as_actor( + model={"name": "qwen3", "flavor": "1.7B", "hf_assets_path": "hf://Qwen/Qwen3-1.7B"} + ), + RewardActor.options(procs=1, num_replicas=1).as_service( reward_functions=[MathReward(), ThinkingReward()] ), ) @@ -593,10 +619,13 @@ async def production_training_loop(): sample = await dataloader.sample.call_one() # Policy generation service call - responses = await policy.generate.route(prompt=sample["question"]) + responses = await policy.generate.route(sample["request"]) # Correct field name - # Reference computation service call - ref_logprobs = await ref_model.forward.route(responses[0].token_ids) + # Reference computation service call (requires full input tensor) + input_ids = torch.cat([responses[0].prompt_ids, responses[0].token_ids]) + ref_logprobs = await ref_model.forward.route( + input_ids.unsqueeze(0), max_req_tokens=512, return_logprobs=True + ) # Reward evaluation service call reward = await reward_actor.evaluate_response.route( @@ -605,18 +634,19 @@ async def production_training_loop(): target=sample["answer"] ) - # Experience storage (simplified structure for illustration) - episode = create_episode(sample, responses[0], reward, ref_logprobs, step) + # Experience storage (using actual Episode structure) + episode = create_episode_from_grpo_data(sample, responses[0], reward, ref_logprobs[0], step) await replay_buffer.add.call_one(episode) - # Training when ready endpoints + # Training when ready batch = await replay_buffer.sample.call_one(curr_policy_version=step) if batch is not None: - loss = await trainer.train_step.call_one(batch) + inputs, targets = batch # GRPO returns (inputs, targets) tuple + loss = await trainer.train_step.call(inputs, targets) # Weight synchronization pattern - await trainer.push_weights.call_one(step + 1) - await policy.update_weights.route(step + 1) + await trainer.push_weights.call(step + 1) + await policy.update_weights.fanout(step + 1) # Fanout to all replicas print(f"Step {step}, Loss: {loss:.4f}") step += 1 @@ -628,7 +658,7 @@ await asyncio.gather( RLTrainer.shutdown(trainer), ReplayBuffer.shutdown(replay_buffer), ComputeAdvantages.shutdown(compute_advantages), - ref_model.shutdown(), + ReferenceModel.shutdown(ref_model), reward_actor.shutdown(), ) print("All services shut down successfully!") @@ -636,7 +666,7 @@ print("All services shut down successfully!") **Key observations:** 1. **Parallelism**: Independent operations run concurrently -2. **Load balancing**: Each `choose()` call automatically selects optimal replica +2. **Load balancing**: Each `.route()` call automatically selects optimal replica 3. **Fault tolerance**: Failures automatically retry on different replicas 4. **Resource efficiency**: CPU and GPU services scale independently 5. 
**Coordination**: Services coordinate through shared state (replay buffer, weight versions) diff --git a/docs/Tutorials/3_Monarch_101.MD b/docs/Tutorials/3_Monarch_101.MD index 52a058dcc..0cbdcbd88 100644 --- a/docs/Tutorials/3_Monarch_101.MD +++ b/docs/Tutorials/3_Monarch_101.MD @@ -9,11 +9,11 @@ Now let's peel back the layers. Forge services are built on top of **Monarch**, ```mermaid graph TD subgraph YourCode["1. Your RL Code"] - Call["await policy_service.generate.choose('What is 2+2?')"] + Call["await policy_service.generate.route('What is 2+2?')"] end subgraph ForgeServices["2. Forge Service Layer"] - ServiceInterface["ServiceInterface
• Routes .choose() to replica
• Handles load balancing
• Manages health checks"] + ServiceInterface["ServiceInterface
• Routes .route() to replica
• Handles load balancing
• Manages health checks"] ServiceActor["ServiceActor
• Manages replica lifecycle
• Monitors health
• Coordinates failures"] end @@ -167,7 +167,7 @@ await counters.increment.call() value = await counters.get_value.call_one() print(f"One counter: {value}") -# choose() - random single actor +# choose() - random single actor (actors only, not services) value = await counters.get_value.choose() print(f"Random counter: {value}") @@ -273,8 +273,8 @@ Now the key insight: **Forge services are ServiceActors that manage ActorMeshes ```mermaid graph TD - subgraph ServiceCreation["spawn_service() Process"] - Call["await spawn_service(ServiceConfig(num_replicas=4), PolicyActor, model='Qwen')"] + subgraph ServiceCreation["Service Creation Process"] + Call["await PolicyActor.options(num_replicas=4, procs=1).as_service(model='Qwen')"] ServiceActor["ServiceActor
• Manages 4 replicas
• Handles health checks
• Routes service calls"] @@ -323,9 +323,9 @@ graph TD ```mermaid graph TD subgraph CallFlow["Complete Call Flow"] - UserCall["await policy_service.generate.choose('What is 2+2?')"] + UserCall["await policy_service.generate.route('What is 2+2?')"] - ServiceInterface["ServiceInterface
• Receives .choose() call
• Routes to ServiceActor"] + ServiceInterface["ServiceInterface
• Receives .route() call
• Routes to ServiceActor"] ServiceActor["ServiceActor
• Selects healthy replica
• Load balancing logic
• Failure handling"] From 75352ab1090b855dabbe9b7420043a68fe8a7a7b Mon Sep 17 00:00:00 2001 From: Sanyam Bhutani Date: Fri, 3 Oct 2025 13:50:49 -0700 Subject: [PATCH 13/22] Update docs/Tutorials/2_Forge_Internals.MD Co-authored-by: Allen Wang <9057208+allenwang28@users.noreply.github.com> --- docs/Tutorials/2_Forge_Internals.MD | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/Tutorials/2_Forge_Internals.MD b/docs/Tutorials/2_Forge_Internals.MD index 634f04f85..09c39fb7e 100644 --- a/docs/Tutorials/2_Forge_Internals.MD +++ b/docs/Tutorials/2_Forge_Internals.MD @@ -58,7 +58,7 @@ Policy.options( num_replicas=4, # Number of replicas with_gpus=True # Allocate GPUs # Other available options: - # hosts=None + # hosts=None # the number of remote hosts used per replica ) ``` From d0ea7709f8622448934595a9ce3deeafcb14eaec Mon Sep 17 00:00:00 2001 From: Sanyam Bhutani Date: Fri, 10 Oct 2025 14:10:38 -0700 Subject: [PATCH 14/22] update part 1 and 2 --- docs/Tutorials/1_RL_and_Forge_Fundamentals.MD | 4 +- docs/Tutorials/2_Forge_Internals.MD | 56 +------------------ 2 files changed, 3 insertions(+), 57 deletions(-) diff --git a/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD b/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD index c34ae6639..32ada41cb 100644 --- a/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD +++ b/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD @@ -213,7 +213,7 @@ Each step has different: Unlike supervised learning where you process independent batches, RL requires coordination: ```python -# This won't work - creates bottlenecks and resource waste +# While this does work, it creates bottlenecks and resource waste def naive_rl_step(): # Policy waits idle while reward model works response = policy_model.generate(prompt) # GPU busy @@ -368,7 +368,7 @@ group_size = 1 Production scaling - multiply num_replicas for services or spawn multiple actors: - Policy: num_replicas=8 for high inference demand - RewardActor: num_replicas=16 for parallel evaluation -- Trainer: Multiple actors for distributed training (RLTrainer handles this internally) +- Trainer: Multiple processes for distributed training (RLTrainer handles this internally) ### Fault Tolerance diff --git a/docs/Tutorials/2_Forge_Internals.MD b/docs/Tutorials/2_Forge_Internals.MD index 09c39fb7e..c21485bb0 100644 --- a/docs/Tutorials/2_Forge_Internals.MD +++ b/docs/Tutorials/2_Forge_Internals.MD @@ -64,7 +64,6 @@ Policy.options( ### 2. Real Service Creation -Services are created using the `spawn_service` function: Services are created using the `.options().as_service()` pattern from the actual GRPO implementation: The service creation automatically handles: @@ -126,32 +125,6 @@ But behind the scenes: - Handles failures and retries automatically - Returns list[Completion] from the selected replica -### 3. Different Service Types and Their Characteristics - -```mermaid -graph TD - subgraph GPU["GPU-Intensive Services"] - PolicySvc["Policy Service
Large model inference
High GPU memory
Batch optimization"] - TrainerSvc["Trainer Service
Distributed training
Gradient sync
Massive compute"] - RefSvc["Reference Service
Frozen model
Baseline computation
Read-only ops"] - end - - subgraph CPU["CPU-Intensive Services"] - RewardSvc["Reward Service
Evaluation logic
Rule-based scoring
High throughput"] - DataSvc["Data Service
Dataset streaming
Preprocessing
I/O optimization"] - end - - subgraph Memory["Memory-Intensive Services"] - BufferSvc["Buffer Service
Experience storage
Efficient sampling
Persistence"] - MetricsSvc["Metrics Service
Logging aggregation
Performance tracking
Analytics"] - end - - style PolicySvc fill:#ff9999 - style TrainerSvc fill:#ff9999 - style RewardSvc fill:#99ff99 - style BufferSvc fill:#9999ff -``` - ## Deep Dive: Service Communication Patterns These communication patterns (\"adverbs\") determine how your service calls are routed to replicas. Understanding when to use each pattern is key to effective Forge usage. @@ -226,34 +199,7 @@ while training: **Critical insight**: This is essential for high-throughput RL where you can't wait for batches. -### 4. Fire-and-Forget Operations - -**When to use**: Side effects that don't need responses (notifications, cache updates). - -```python -# CONCEPTUAL - Fire-and-forget requires custom @endpoint implementations -# The basic services don't have broadcast methods built-in -# You would implement custom endpoints in your ForgeActor: - -class CustomPolicy(Policy): - @endpoint - async def clear_cache(self) -> None: - """Custom endpoint for cache clearing""" - self.policy_worker.clear_kv_cache() - -# Then use it (hypothetical): -# await custom_policy.clear_cache.fanout() # Clear all replica caches -# Note: Actual cache clearing would use existing Policy methods -``` - -**Performance characteristics**: -- **Latency**: Immediately returns (doesn't wait for completion) -- **Throughput**: Network limited, but non-blocking -- **Fault tolerance**: Fire-and-forget (you don't know if it worked) - -**Critical warning**: Only use for non-critical operations - you get no confirmation. - -### 5. Service Sessions for Stateful Operations +### 3. Service Sessions for Stateful Operations **When to use**: When you need multiple calls to hit the same replica (like KV cache preservation). From 9d4be6073f1de9fa3eb4280f2848a5e7b87102f4 Mon Sep 17 00:00:00 2001 From: Sanyam Bhutani Date: Sun, 12 Oct 2025 11:43:57 -0700 Subject: [PATCH 15/22] address more comments --- docs/Tutorials/1_RL_and_Forge_Fundamentals.MD | 8 ++++---- docs/Tutorials/2_Forge_Internals.MD | 4 ++-- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD b/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD index 32ada41cb..66b32a2b3 100644 --- a/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD +++ b/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD @@ -204,8 +204,8 @@ graph LR Each step has different: - **Latency requirements**: Policy inference needs low latency, training can batch -- **Scaling patterns**: Reward evaluation scales with response count, training with model size -- **Failure modes**: Policy failure stops generation, reward failure affects learning quality +- **Scaling patterns**: Need N policy replicas to keep trainer busy, plus different sharding strategies (tensor parallel for training vs replicated inference) +- **Failure modes**: Any component failure cascades to halt the entire pipeline (Forge prevents this with automatic failover) - **Resource utilization**: GPUs for inference/training, CPUs for data processing ### Problem 3: The Coordination Challenge @@ -229,9 +229,9 @@ def naive_rl_step(): ## Enter Forge: RL-Native Architecture -Forge solves these problems by treating each RL component as an **independent, scalable service** +Forge solves these problems by treating each RL component as an **independent, distributed unit** - some as fault-tolerant services (like Policy inference where failures are easy to handle), others as actors (like Trainers where recovery semantics differ) -Let's see how core RL concepts map to Forge services: +Let's see how core RL concepts map to Forge components 
(you'll notice a mix of `.route()` for services and `.call_one()` for actors - we cover when to use each in Part 2): ```python async def real_rl_training_step(services, step): diff --git a/docs/Tutorials/2_Forge_Internals.MD b/docs/Tutorials/2_Forge_Internals.MD index c21485bb0..2ed3301e5 100644 --- a/docs/Tutorials/2_Forge_Internals.MD +++ b/docs/Tutorials/2_Forge_Internals.MD @@ -140,7 +140,7 @@ answer = responses[0].text # Extract text from Completion object Behind the scenes: 1. Health check eliminates failed replicas -2. Load balancer picks least loaded healthy replica +2. Load balancer picks replica (currently round robin, configurable balancers coming soon) 3. Request routes to that specific replica 4. Automatic retry on different replica if failure @@ -302,7 +302,7 @@ async def optimized_multi_turn(): ```python # Forge ReplayBuffer endpoints (verified from source code) # Add episodes (thread-safe by actor model) -await replay_buffer.add.call_one(episode) # Note: .call_one() not .choose() +await replay_buffer.add.call_one(episode) # .choose() would work too, but .call_one() clarifies it's a singleton actor not ActorMesh # Sample batches for training batch = await replay_buffer.sample.call_one( From 1cebab5d1b3e2c70ab704f9c1589708c9415373a Mon Sep 17 00:00:00 2001 From: Sanyam Bhutani Date: Sun, 12 Oct 2025 11:46:05 -0700 Subject: [PATCH 16/22] fix multi line issue --- docs/Tutorials/1_RL_and_Forge_Fundamentals.MD | 20 +++---- docs/Tutorials/2_Forge_Internals.MD | 14 ++--- docs/Tutorials/3_Monarch_101.MD | 60 +++++++++---------- 3 files changed, 47 insertions(+), 47 deletions(-) diff --git a/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD b/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD index 66b32a2b3..26f90092c 100644 --- a/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD +++ b/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD @@ -9,12 +9,12 @@ Let's start with a simple math tutoring example to understand RL concepts with t ```mermaid graph TD subgraph Example["Math Tutoring RL Example"] - Dataset["Dataset
math problems
'What is 2+2?'"] - Policy["Policy
student AI
generates: 'The answer is 4'"] - Reward["Reward Model
Evaluation Exam
scores: 0.95 (excellent)"] - Reference["Reference Model
original student
baseline comparison"] - ReplayBuffer["Replay Buffer
notebook
stores experiences"] - Trainer["Trainer
tutor
improves student"] + Dataset["Dataset: math problems"] + Policy["Policy: student AI"] + Reward["Reward Model: scores answers"] + Reference["Reference Model: baseline"] + ReplayBuffer["Replay Buffer: stores experiences"] + Trainer["Trainer: improves student"] end Dataset --> Policy @@ -163,13 +163,13 @@ Our simple RL loop above has complex requirements: ```mermaid graph TD subgraph Components["Each Component Needs Different Resources"] - Policy["Policy (Student AI)
Generates: 'The answer is 4'
Needs: Large GPU memory
Scaling: Multiple replicas for speed"] + Policy["Policy (Student AI): Large GPU memory, Multiple replicas"] - Reward["Reward Model (Teacher)
Scores answers: 0.95
Needs: Moderate compute
Scaling: CPU or small GPU"] + Reward["Reward Model (Teacher): Moderate compute, CPU/small GPU"] - Trainer["Trainer (Tutor)
Improves student weights
Needs: Massive GPU compute
Scaling: Distributed training"] + Trainer["Trainer (Tutor): Massive GPU compute, Distributed training"] - Dataset["Dataset (Question Bank)
Provides: 'What is 2+2?'
Needs: CPU intensive I/O
Scaling: High memory bandwidth"] + Dataset["Dataset (Question Bank): CPU intensive I/O, High memory bandwidth"] end style Policy fill:#99ff99 diff --git a/docs/Tutorials/2_Forge_Internals.MD b/docs/Tutorials/2_Forge_Internals.MD index 2ed3301e5..ef53ddfe5 100644 --- a/docs/Tutorials/2_Forge_Internals.MD +++ b/docs/Tutorials/2_Forge_Internals.MD @@ -15,19 +15,19 @@ graph TD Call["Your Code:
await policy_service.generate"] subgraph ServiceLayer["Service Layer"] - Proxy["Service Proxy
Load balancing
Health checking
Request routing"] - LB["Load Balancer
Replica selection
Circuit breaker
Retry logic"] + Proxy["Service Proxy: Load balancing, Health checking"] + LB["Load Balancer: Replica selection, Circuit breaker"] end subgraph Replicas["Replica Management"] - R1["Replica 1
GPU 0
Healthy"] - R2["Replica 2
GPU 1
Overloaded"] - R3["Replica 3
GPU 2
Failed"] - R4["Replica 4
GPU 3
Healthy"] + R1["Replica 1: GPU 0, Healthy"] + R2["Replica 2: GPU 1, Overloaded"] + R3["Replica 3: GPU 2, Failed"] + R4["Replica 4: GPU 3, Healthy"] end subgraph Compute["Actual Computation"] - Actor["Policy Actor
vLLM engine
Model weights
KV cache"] + Actor["Policy Actor: vLLM engine, Model weights, KV cache"] end Call --> Proxy diff --git a/docs/Tutorials/3_Monarch_101.MD b/docs/Tutorials/3_Monarch_101.MD index 0cbdcbd88..502d8a34d 100644 --- a/docs/Tutorials/3_Monarch_101.MD +++ b/docs/Tutorials/3_Monarch_101.MD @@ -13,20 +13,20 @@ graph TD end subgraph ForgeServices["2. Forge Service Layer"] - ServiceInterface["ServiceInterface
• Routes .route() to replica
• Handles load balancing
• Manages health checks"] - ServiceActor["ServiceActor
• Manages replica lifecycle
• Monitors health
• Coordinates failures"] + ServiceInterface["ServiceInterface: Routes requests, Load balancing, Health checks"] + ServiceActor["ServiceActor: Manages replicas, Monitors health, Coordinates failures"] end subgraph MonarchLayer["3. Monarch Actor Layer"] - ActorMesh["ActorMesh PolicyActor
• 4 PolicyActor instances
• Each on different GPU
• Message passing interface"] - ProcMesh["ProcMesh
• 4 processes
• GPU topology: 0,1,2,3
• Network interconnect"] + ActorMesh["ActorMesh PolicyActor: 4 instances, Different GPUs, Message passing"] + ProcMesh["ProcMesh: 4 processes, GPU topology 0,1,2,3, Network interconnect"] end subgraph Hardware["4. Physical Hardware"] - GPU0["GPU 0
PolicyActor #1
vLLM Engine
Model Weights"] - GPU1["GPU 1
PolicyActor #2
vLLM Engine
Model Weights"] - GPU2["GPU 2
PolicyActor #3
vLLM Engine
Model Weights"] - GPU3["GPU 3
PolicyActor #4
vLLM Engine
Model Weights"] + GPU0["GPU 0: PolicyActor #1, vLLM Engine, Model Weights"] + GPU1["GPU 1: PolicyActor #2, vLLM Engine, Model Weights"] + GPU2["GPU 2: PolicyActor #3, vLLM Engine, Model Weights"] + GPU3["GPU 3: PolicyActor #4, vLLM Engine, Model Weights"] end Call --> ServiceInterface @@ -199,10 +199,10 @@ graph TD end subgraph ActorMesh["ActorMesh PolicyActor"] - A0["PolicyActor
Instance #0
model=Qwen/Qwen3-7B
generation_count=0"] - A1["PolicyActor
Instance #1
model=Qwen/Qwen3-7B
generation_count=0"] - A2["PolicyActor
Instance #2
model=Qwen/Qwen3-7B
generation_count=0"] - A3["PolicyActor
Instance #3
model=Qwen/Qwen3-7B
generation_count=0"] + A0["PolicyActor Instance #0: model=Qwen/Qwen3-7B"] + A1["PolicyActor Instance #1: model=Qwen/Qwen3-7B"] + A2["PolicyActor Instance #2: model=Qwen/Qwen3-7B"] + A3["PolicyActor Instance #3: model=Qwen/Qwen3-7B"] end Code --> ProcMesh @@ -226,17 +226,17 @@ graph TD Client["await policy_actors.generate.METHOD(prompt)"] subgraph Methods["Different Adverbs Route Differently"] - Choose["choose()
→ Routes to ONE actor
→ Load balanced"] - Call["call()
→ Routes to ALL actors
→ Collects all results"] - Broadcast["broadcast()
→ Routes to ALL actors
→ Fire and forget"] - Stream["stream()
→ Routes to ALL actors
→ Iterator of results"] + Choose["choose(): Routes to ONE actor, Load balanced"] + Call["call(): Routes to ALL actors, Collects results"] + Broadcast["broadcast(): Routes to ALL actors, Fire and forget"] + Stream["stream(): Routes to ALL actors, Iterator of results"] end subgraph ActorInstances["PolicyActor Instances"] - A0["Actor 0
GPU 0
generates response"] - A1["Actor 1
GPU 1
generates response"] - A2["Actor 2
GPU 2
generates response"] - A3["Actor 3
GPU 3
generates response"] + A0["Actor 0: GPU 0, generates response"] + A1["Actor 1: GPU 1, generates response"] + A2["Actor 2: GPU 2, generates response"] + A3["Actor 3: GPU 3, generates response"] end Client --> Choose @@ -276,26 +276,26 @@ graph TD subgraph ServiceCreation["Service Creation Process"] Call["await PolicyActor.options(num_replicas=4, procs=1).as_service(model='Qwen')"] - ServiceActor["ServiceActor
• Manages 4 replicas
• Handles health checks
• Routes service calls"] + ServiceActor["ServiceActor: Manages 4 replicas, Health checks, Routes calls"] subgraph Replicas["4 Independent Replicas"] subgraph R0["Replica 0"] - PM0["ProcMesh
1 process
GPU 0"] + PM0["ProcMesh: 1 process, GPU 0"] AM0["ActorMesh
1 PolicyActor"] end subgraph R1["Replica 1"] - PM1["ProcMesh
1 process
GPU 1"] + PM1["ProcMesh: 1 process, GPU 1"] AM1["ActorMesh
1 PolicyActor"] end subgraph R2["Replica 2"] - PM2["ProcMesh
1 process
GPU 2"] + PM2["ProcMesh: 1 process, GPU 2"] AM2["ActorMesh
1 PolicyActor"] end subgraph R3["Replica 3"] - PM3["ProcMesh
1 process
GPU 3"] + PM3["ProcMesh: 1 process, GPU 3"] AM3["ActorMesh
1 PolicyActor"] end end @@ -325,15 +325,15 @@ graph TD subgraph CallFlow["Complete Call Flow"] UserCall["await policy_service.generate.route('What is 2+2?')"] - ServiceInterface["ServiceInterface
• Receives .route() call
• Routes to ServiceActor"] + ServiceInterface["ServiceInterface: Receives .route() call, Routes to ServiceActor"] - ServiceActor["ServiceActor
• Selects healthy replica
• Load balancing logic
• Failure handling"] + ServiceActor["ServiceActor: Selects healthy replica, Load balancing, Failure handling"] - SelectedReplica["Selected Replica #2
• ProcMesh with 1 process
• ActorMesh with 1 PolicyActor"] + SelectedReplica["Selected Replica #2: ProcMesh 1 process, ActorMesh 1 PolicyActor"] - PolicyActor["PolicyActor Instance
• Loads model
• Runs vLLM inference
• Returns 'The answer is 4'"] + PolicyActor["PolicyActor Instance: Loads model, Runs vLLM inference"] - GPU["GPU 2
• vLLM engine
• Model weights
• KV cache
• CUDA kernels"] + GPU["GPU 2: vLLM engine, Model weights, KV cache, CUDA kernels"] UserCall --> ServiceInterface ServiceInterface --> ServiceActor From a001a8da844e613e36d4d90fc8a8c59eef636ede Mon Sep 17 00:00:00 2001 From: Sanyam Bhutani Date: Sun, 12 Oct 2025 11:48:16 -0700 Subject: [PATCH 17/22] fix colours --- docs/Tutorials/1_RL_and_Forge_Fundamentals.MD | 32 ++++---- docs/Tutorials/2_Forge_Internals.MD | 8 +- docs/Tutorials/3_Monarch_101.MD | 76 +++++++++---------- 3 files changed, 58 insertions(+), 58 deletions(-) diff --git a/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD b/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD index 26f90092c..2565d626e 100644 --- a/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD +++ b/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD @@ -25,9 +25,9 @@ graph TD ReplayBuffer --> Trainer Trainer --> Policy - style Policy fill:#99ff99 - style Reward fill:#ffcc99 - style Trainer fill:#ff99cc + style Policy fill:#4CAF50 + style Reward fill:#FF9800 + style Trainer fill:#E91E63 ``` ### RL Components Defined (Forge Names) @@ -100,10 +100,10 @@ graph LR C5 --> S5 C6 --> S6 - style C2 fill:#99ff99 - style S2 fill:#99ff99 - style C3 fill:#ffcc99 - style S3 fill:#ffcc99 + style C2 fill:#4CAF50 + style S2 fill:#4CAF50 + style C3 fill:#FF9800 + style S3 fill:#FF9800 ``` ### RL Step with Forge Services @@ -172,10 +172,10 @@ graph TD Dataset["Dataset (Question Bank): CPU intensive I/O, High memory bandwidth"] end - style Policy fill:#99ff99 - style Reward fill:#ffcc99 - style Trainer fill:#ff99cc - style Dataset fill:#ccccff + style Policy fill:#4CAF50 + style Reward fill:#FF9800 + style Trainer fill:#E91E63 + style Dataset fill:#2196F3 ``` ### Problem 2: Complex Interdependencies @@ -195,11 +195,11 @@ graph LR D --> E E --> A - style A fill:#99ff99 - style B fill:#ffcc99 - style C fill:#99ccff - style D fill:#ccff99 - style E fill:#ff99cc + style A fill:#4CAF50 + style B fill:#FF9800 + style C fill:#2196F3 + style D fill:#8BC34A + style E fill:#E91E63 ``` Each step has different: diff --git a/docs/Tutorials/2_Forge_Internals.MD b/docs/Tutorials/2_Forge_Internals.MD index ef53ddfe5..05a40e4a5 100644 --- a/docs/Tutorials/2_Forge_Internals.MD +++ b/docs/Tutorials/2_Forge_Internals.MD @@ -39,10 +39,10 @@ graph TD R1 --> Actor R4 --> Actor - style Call fill:#99ff99 - style LB fill:#ffcc99 - style R3 fill:#ff9999 - style Actor fill:#cc99ff + style Call fill:#4CAF50 + style LB fill:#FF9800 + style R3 fill:#F44336 + style Actor fill:#9C27B0 ``` ## Service Components Deep Dive diff --git a/docs/Tutorials/3_Monarch_101.MD b/docs/Tutorials/3_Monarch_101.MD index 502d8a34d..52bdb17d0 100644 --- a/docs/Tutorials/3_Monarch_101.MD +++ b/docs/Tutorials/3_Monarch_101.MD @@ -38,10 +38,10 @@ graph TD ProcMesh --> GPU2 ProcMesh --> GPU3 - style Call fill:#99ff99 - style ServiceActor fill:#ffcc99 - style ActorMesh fill:#cc99ff - style ProcMesh fill:#ccccff + style Call fill:#4CAF50 + style ServiceActor fill:#FF9800 + style ActorMesh fill:#9C27B0 + style ProcMesh fill:#2196F3 ``` ## Deep Dive: ProcMesh - The Foundation @@ -74,14 +74,14 @@ graph TD P7 -.->|"Network"| P0 end - style P0 fill:#ff9999 - style P1 fill:#ff9999 - style P2 fill:#ff9999 - style P3 fill:#ff9999 - style P4 fill:#ff9999 - style P5 fill:#ff9999 - style P6 fill:#ff9999 - style P7 fill:#ff9999 + style P0 fill:#F44336 + style P1 fill:#F44336 + style P2 fill:#F44336 + style P3 fill:#F44336 + style P4 fill:#F44336 + style P5 fill:#F44336 + style P6 fill:#F44336 + style P7 fill:#F44336 ``` ### Multi-Host ProcMesh @@ -122,9 +122,9 @@ 
graph TD H2P0 -.->|"InfiniBand"| H3P0 H2P1 -.->|"InfiniBand"| H3P1 - style PM1 fill:#ff9999 - style PM2 fill:#99ff99 - style PM3 fill:#99ccff + style PM1 fill:#F44336 + style PM2 fill:#4CAF50 + style PM3 fill:#2196F3 ``` ```python @@ -212,10 +212,10 @@ graph TD P3 --> A3 end - style A0 fill:#99ff99 - style A1 fill:#99ff99 - style A2 fill:#99ff99 - style A3 fill:#99ff99 + style A0 fill:#4CAF50 + style A1 fill:#4CAF50 + style A2 fill:#4CAF50 + style A3 fill:#4CAF50 ``` ### Message Routing Through ActorMesh @@ -259,10 +259,10 @@ graph TD Stream --> A3 end - style Choose fill:#99ff99 - style Call fill:#ffcc99 - style Broadcast fill:#ff99cc - style Stream fill:#cc99ff + style Choose fill:#4CAF50 + style Call fill:#FF9800 + style Broadcast fill:#E91E63 + style Stream fill:#9C27B0 ``` ## How Forge Services Use Monarch @@ -311,11 +311,11 @@ graph TD PM3 --> AM3 end - style ServiceActor fill:#ffcc99 - style AM0 fill:#99ff99 - style AM1 fill:#99ff99 - style AM2 fill:#99ff99 - style AM3 fill:#99ff99 + style ServiceActor fill:#FF9800 + style AM0 fill:#4CAF50 + style AM1 fill:#4CAF50 + style AM2 fill:#4CAF50 + style AM3 fill:#4CAF50 ``` ### Service Call to Actor Execution @@ -348,10 +348,10 @@ graph TD ServiceInterface -.->|"'The answer is 4'"| UserCall end - style UserCall fill:#99ff99 - style ServiceActor fill:#ffcc99 - style PolicyActor fill:#cc99ff - style GPU fill:#ffcccc + style UserCall fill:#4CAF50 + style ServiceActor fill:#FF9800 + style PolicyActor fill:#9C27B0 + style GPU fill:#FF5722 ``` ## Multiple Services Sharing Infrastructure @@ -400,12 +400,12 @@ graph TD BS --> C4 end - style PS fill:#99ff99 - style TS fill:#ff99cc - style RS fill:#ffcc99 - style BS fill:#cc99ff - style GPUMesh fill:#ffe6e6 - style CPUMesh fill:#e6f3ff + style PS fill:#4CAF50 + style TS fill:#E91E63 + style RS fill:#FF9800 + style BS fill:#9C27B0 + style GPUMesh fill:#FFEBEE + style CPUMesh fill:#E3F2FD ``` ## Key Insights: Why This Architecture Matters From 6b23e25fa48881e3869ad82fb6504eba99063ab6 Mon Sep 17 00:00:00 2001 From: Sanyam Bhutani Date: Sun, 12 Oct 2025 12:03:09 -0700 Subject: [PATCH 18/22] fix linter and ohter comments --- docs/Tutorials/1_RL_and_Forge_Fundamentals.MD | 114 +++++++++--------- docs/Tutorials/2_Forge_Internals.MD | 98 +++++++-------- docs/Tutorials/ReadMe.MD | 6 +- 3 files changed, 109 insertions(+), 109 deletions(-) diff --git a/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD b/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD index 2565d626e..39b6d62aa 100644 --- a/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD +++ b/docs/Tutorials/1_RL_and_Forge_Fundamentals.MD @@ -16,7 +16,7 @@ graph TD ReplayBuffer["Replay Buffer: stores experiences"] Trainer["Trainer: improves student"] end - + Dataset --> Policy Policy --> Reward Policy --> Reference @@ -24,7 +24,7 @@ graph TD Reference --> ReplayBuffer ReplayBuffer --> Trainer Trainer --> Policy - + style Policy fill:#4CAF50 style Reward fill:#FF9800 style Trainer fill:#E91E63 @@ -47,25 +47,25 @@ graph TD def conceptual_rl_step(): # 1. Get a math problem question = dataset.sample() # "What is 2+2?" - - # 2. Student generates answer + + # 2. Student generates answer answer = policy.generate(question) # "The answer is 4" - + # 3. Teacher grades it score = reward_model.evaluate(question, answer) # 0.95 - + # 4. Compare to original student baseline = reference_model.compute_logprobs(question, answer) - + # 5. Store the experience experience = Episode(question, answer, score, baseline) replay_buffer.add(experience) - + # 6. 
When enough experiences collected, improve student batch = replay_buffer.sample(curr_policy_version=0) if batch is not None: trainer.train_step(batch) # Student gets better! - + # 🔄 See complete working example below with actual Forge service calls ``` @@ -83,7 +83,7 @@ graph LR C5["Replay Buffer"] C6["Trainer"] end - + subgraph Services["Forge Services (Real Classes)"] S1["DatasetActor"] S2["Policy"] @@ -92,14 +92,14 @@ graph LR S5["ReplayBuffer"] S6["RLTrainer"] end - + C1 --> S1 C2 --> S2 C3 --> S3 C4 --> S4 C5 --> S5 C6 --> S6 - + style C2 fill:#4CAF50 style S2 fill:#4CAF50 style C3 fill:#FF9800 @@ -117,26 +117,26 @@ async def conceptual_forge_rl_step(services, step): # 1. Get a math problem - Using actual DatasetActor API sample = await services['dataloader'].sample.call_one() question, target = sample["request"], sample["target"] - + # 2. Student generates answer - Using actual Policy API responses = await services['policy'].generate.route(prompt=question) - answer = responses[0].text - + answer = responses[0].text + # 3. Teacher grades it - Using actual RewardActor API score = await services['reward_actor'].evaluate_response.route( prompt=question, response=answer, target=target ) - + # 4. Compare to baseline - Using actual ReferenceModel API # Note: ReferenceModel.forward requires input_ids, max_req_tokens, return_logprobs ref_logprobs = await services['ref_model'].forward.route( input_ids, max_req_tokens, return_logprobs=True ) - + # 5. Store experience - Using actual Episode structure from apps/grpo/main.py episode = create_episode_from_response(responses[0], score, ref_logprobs, step) await services['replay_buffer'].add.call_one(episode) - + # 6. Improve student - Using actual training pattern batch = await services['replay_buffer'].sample.call_one( curr_policy_version=step @@ -160,23 +160,12 @@ Our simple RL loop above has complex requirements: #### Problem 1: Different Resource Needs -```mermaid -graph TD - subgraph Components["Each Component Needs Different Resources"] - Policy["Policy (Student AI): Large GPU memory, Multiple replicas"] - - Reward["Reward Model (Teacher): Moderate compute, CPU/small GPU"] - - Trainer["Trainer (Tutor): Massive GPU compute, Distributed training"] - - Dataset["Dataset (Question Bank): CPU intensive I/O, High memory bandwidth"] - end - - style Policy fill:#4CAF50 - style Reward fill:#FF9800 - style Trainer fill:#E91E63 - style Dataset fill:#2196F3 -``` +| Component | Resource Needs | Scaling Strategy | +|-----------|----------------|------------------| +| **Policy** (Student AI) | Large GPU memory | Multiple replicas for throughput | +| **Reward Heuristic** (Teacher) | Small compute | CPU or small GPU | +| **Trainer** (Tutor) | Massive GPU compute | Distributed training | +| **Dataset** (Question Bank) | CPU intensive I/O | High memory bandwidth | ### Problem 2: Complex Interdependencies @@ -187,14 +176,14 @@ graph LR C["Reference: Original Student
Provides baseline comparison"] D["Replay Buffer: Notebook
Stores: question + answer + score"] E["Trainer: Tutor
Improves student using experiences"] - + A --> B A --> C B --> D C --> D D --> E E --> A - + style A fill:#4CAF50 style B fill:#FF9800 style C fill:#2196F3 @@ -203,7 +192,7 @@ graph LR ``` Each step has different: -- **Latency requirements**: Policy inference needs low latency, training can batch +- **Latency requirements**: Policy inference needs low latency (each episode waits), training can batch multiple episodes together - **Scaling patterns**: Need N policy replicas to keep trainer busy, plus different sharding strategies (tensor parallel for training vs replicated inference) - **Failure modes**: Any component failure cascades to halt the entire pipeline (Forge prevents this with automatic failover) - **Resource utilization**: GPUs for inference/training, CPUs for data processing @@ -218,10 +207,10 @@ def naive_rl_step(): # Policy waits idle while reward model works response = policy_model.generate(prompt) # GPU busy reward = reward_model.evaluate(prompt, response) # Policy GPU idle - - # Training waits for single episode + + # Training waits for single episode loss = compute_loss(response, reward) # Batch size = 1, inefficient - + # Everything stops if any component fails if policy_fails or reward_fails or trainer_fails: entire_system_stops() @@ -233,32 +222,37 @@ Forge solves these problems by treating each RL component as an **independent, d Let's see how core RL concepts map to Forge components (you'll notice a mix of `.route()` for services and `.call_one()` for actors - we cover when to use each in Part 2): +**Quick API Reference:** (covered in detail in Part 2: Service Communication Patterns) +- `.route()` - Send request to any healthy replica in a service (load balanced) +- `.call_one()` - Send request to a single actor instance +- `.fanout()` - Send request to ALL replicas in a service + ```python async def real_rl_training_step(services, step): """Single RL step using verified Forge APIs""" - + # 1. Environment interaction - Using actual DatasetActor API sample = await services['dataloader'].sample.call_one() prompt, target = sample["request"], sample["target"] - + responses = await services['policy'].generate.route(prompt) - + # 2. Reward computation - Using actual RewardActor API score = await services['reward_actor'].evaluate_response.route( prompt=prompt, response=responses[0].text, target=target ) - + # 3. Get reference logprobs - Using actual ReferenceModel API # Note: ReferenceModel requires full input_ids tensor, not just tokens input_ids = torch.cat([responses[0].prompt_ids, responses[0].token_ids]) ref_logprobs = await services['ref_model'].forward.route( input_ids.unsqueeze(0), max_req_tokens=512, return_logprobs=True ) - + # 4. Experience storage - Using actual Episode pattern from GRPO episode = create_episode_from_response(responses[0], score, ref_logprobs, step) await services['replay_buffer'].add.call_one(episode) - + # 5. Learning - Using actual trainer pattern batch = await services['replay_buffer'].sample.call_one( curr_policy_version=step @@ -266,11 +260,11 @@ async def real_rl_training_step(services, step): if batch is not None: inputs, targets = batch # GRPO returns (inputs, targets) tuple loss = await services['trainer'].train_step.call(inputs, targets) - + # 6. 
Policy synchronization - Using actual weight update pattern await services['trainer'].push_weights.call(step + 1) await services['policy'].update_weights.fanout(step + 1) - + return loss ``` @@ -286,7 +280,7 @@ answer = responses[0].text # responses is list[Completion] Forge handles behind the scenes: - Routing to least loaded replica -- GPU memory management +- GPU memory management - Batch optimization - Failure recovery - Auto-scaling based on demand @@ -365,10 +359,16 @@ group_size = 1 ) ``` -Production scaling - multiply num_replicas for services or spawn multiple actors: -- Policy: num_replicas=8 for high inference demand -- RewardActor: num_replicas=16 for parallel evaluation -- Trainer: Multiple processes for distributed training (RLTrainer handles this internally) +**Forge Components: Services vs Actors** + +Forge has two types of distributed components: +- **Services**: Multiple replicas with automatic load balancing (like Policy, RewardActor) +- **Actors**: Single instances that handle their own internal distribution (like RLTrainer, ReplayBuffer) + +We cover this distinction in detail in Part 2, but for now this explains the scaling patterns: +- Policy service: num_replicas=8 for high inference demand +- RewardActor service: num_replicas=16 for parallel evaluation +- RLTrainer actor: Single instance with internal distributed training ### Fault Tolerance @@ -377,13 +377,13 @@ Production scaling - multiply num_replicas for services or spawn multiple actors responses = await policy.generate.route(prompt=question) answer = responses[0].text # -> Forge automatically routes to healthy replica -# -> Failed replica respawns in background +# -> Failed replica respawns in background # -> No impact on training loop # If reward service fails: score = await reward_actor.evaluate_response.route( prompt=question, response=answer, target=target -) +) ``` - Retries on different replica automatically @@ -392,4 +392,4 @@ score = await reward_actor.evaluate_response.route( This is fundamentally different from monolithic RL implementations where any component failure stops everything! -In the next Section, we will go a layer deeper and learn how ForgeServices work. Continue to [Part 2 here](./2_Forge_Internals.MD) \ No newline at end of file +In the next Section, we will go a layer deeper and learn how ForgeServices work. Continue to [Part 2 here](./2_Forge_Internals.MD) diff --git a/docs/Tutorials/2_Forge_Internals.MD b/docs/Tutorials/2_Forge_Internals.MD index 05a40e4a5..e1af9cde3 100644 --- a/docs/Tutorials/2_Forge_Internals.MD +++ b/docs/Tutorials/2_Forge_Internals.MD @@ -13,23 +13,23 @@ When you call `await policy_service.generate(question)`, here's what actually ha ```mermaid graph TD Call["Your Code:
await policy_service.generate"] - + subgraph ServiceLayer["Service Layer"] Proxy["Service Proxy: Load balancing, Health checking"] LB["Load Balancer: Replica selection, Circuit breaker"] end - + subgraph Replicas["Replica Management"] R1["Replica 1: GPU 0, Healthy"] R2["Replica 2: GPU 1, Overloaded"] R3["Replica 3: GPU 2, Failed"] R4["Replica 4: GPU 3, Healthy"] end - + subgraph Compute["Actual Computation"] Actor["Policy Actor: vLLM engine, Model weights, KV cache"] end - + Call --> Proxy Proxy --> LB LB --> R1 @@ -38,7 +38,7 @@ graph TD LB --> R4 R1 --> Actor R4 --> Actor - + style Call fill:#4CAF50 style LB fill:#FF9800 style R3 fill:#F44336 @@ -55,7 +55,7 @@ Here's the actual ServiceConfig from Forge source code: # Configuration pattern from apps/grpo/main.py: Policy.options( procs=1, # Processes per replica - num_replicas=4, # Number of replicas + num_replicas=4, # Number of replicas with_gpus=True # Allocate GPUs # Other available options: # hosts=None # the number of remote hosts used per replica @@ -69,7 +69,7 @@ Services are created using the `.options().as_service()` pattern from the actual The service creation automatically handles: - Spawning actor replicas across processes/GPUs - Load balancing with .route() method for services -- Health monitoring and failure recovery +- Health monitoring and failure recovery - Message routing and serialization ```python @@ -78,8 +78,8 @@ from forge.actors.policy import Policy model = "Qwen/Qwen3-1.7B" policy = await Policy.options( - procs=1, - with_gpus=True, + procs=1, + with_gpus=True, num_replicas=1 ).as_service( engine_config={ @@ -158,7 +158,7 @@ Behind the scenes: ```python # Get version from all policy replicas current_versions = await policy.get_version.fanout() -# Returns: [version_replica_1, version_replica_2, ...] +# Returns: [version_replica_1, version_replica_2, ...] # Update weights on all replicas await policy.update_weights.fanout(new_policy_version) @@ -193,8 +193,8 @@ while training: ``` **Performance characteristics**: -- **Latency**: Process first result immediately -- **Throughput**: Pipeline parallelism (much higher than sequential) +- **Latency**: Process first result immediately +- **Throughput**: Non-blocking async operations (much higher than waiting for full batches) - **Fault tolerance**: Continues if some replicas fail **Critical insight**: This is essential for high-throughput RL where you can't wait for batches. @@ -242,7 +242,7 @@ async with counter_service.session(): print(await counter_service.increment.route()) # 1 print(await counter_service.increment.route()) # 2 print(await counter_service.increment.route()) # 3 - + final_value = await counter_service.get_value.route() print(f"Final value on this replica: {final_value}") # 3 @@ -263,7 +263,7 @@ await counter_service.shutdown() The most complex challenge in distributed RL is maintaining state consistency while maximizing performance. -### The KV Cache Problem +### The KV Cache Problem **The challenge**: Policy inference is much faster with KV cache, but cache is tied to specific conversation history. @@ -278,16 +278,16 @@ async def naive_multi_turn(): **The solution**: Sticky sessions ensure all calls go to same replica. 
-```python +```python async def optimized_multi_turn(): async with policy.session(): # All calls guaranteed to hit same replica = cache hits response1 = await policy.generate.route(prompt=question1) - full_prompt = question1 + response1[0].text + full_prompt = question1 + response1[0].text response2 = await policy.generate.route(prompt=full_prompt) # Cache hit! conversation = full_prompt + response2[0].text response3 = await policy.generate.route(prompt=conversation) # Cache hit! - + # Session ends, replica can be garbage collected or reused ``` @@ -327,11 +327,11 @@ batch = await replay_buffer.sample.call_one( async def real_weight_sync(trainer, policy, step): # Trainer pushes weights to TorchStore with version number await trainer.push_weights.call_one(policy_version=step + 1) - - # Policy service updates to new version from TorchStore + + # Policy service updates to new version from TorchStore # Use .fanout() to update ALL policy replicas await policy.update_weights.fanout(policy_version=step + 1) - + # Check current policy version current_version = await policy.get_version.route() print(f"Current policy version: {current_version}") @@ -349,29 +349,29 @@ Instead of manual coordination, Forge services handle speed mismatches automatic from apps.grpo.main import Episode, Group async def simple_rl_step(): - + # ===== Generate a rollout ===== sample = await dataloader.sample.call_one() # DatasetActor is an actor, not service prompt, target = sample["request"], sample["target"] # Correct field names - + print(f"Prompt: {prompt}") print(f"Target: {target}") - + actions = await policy.generate.route(prompt=prompt) # Policy is a service print(f"Policy response: {actions[0].text}") - + # Create input tensor for reference model (requires full context) input_ids = torch.cat([actions[0].prompt_ids, actions[0].token_ids]) ref_logprobs = await ref_model.forward.route( input_ids.unsqueeze(0), max_req_tokens=512, return_logprobs=True - ) + ) reward = await reward_actor.evaluate_response.route( # RewardActor is a service - prompt=prompt, - response=actions[0].text, + prompt=prompt, + response=actions[0].text, target=target ) print(f"Reward: {reward}") - + # Create episode using actual GRPO Episode structure episode = Episode( episode_id="0", @@ -382,24 +382,24 @@ async def simple_rl_step(): response_len=512, target=target ) - + # Add response data episode.response = actions[0].text episode.request_tokens = actions[0].prompt_ids.tolist() episode.response_tokens = actions[0].token_ids.tolist() episode.ref_logprobs = ref_logprobs[0] # Extract from batch dimension episode.reward = reward - + # Compute advantages using actual ComputeAdvantages actor group = Group.new_group(0, 1, prompt, 0, tokenizer.pad_token_id, 512, 512, target) group.episodes[0] = episode advantages = await compute_advantages.compute.call_one(group) # ComputeAdvantages is an actor episode.advantage = advantages[0] - print(f"Advantage: {advantages[0]}") + print(f"Advantage: {advantages[0]}") await replay_buffer.add.call_one(episode) # ReplayBuffer is an actor print("Episode stored in replay buffer") - - # ===== Train on the batch ===== + + # ===== Train on the batch ===== batch = await replay_buffer.sample.call_one(curr_policy_version=0) if batch is not None: print("Training on batch...") @@ -469,12 +469,12 @@ class RewardActor(ForgeActor): async def evaluate_response(self, prompt: str, response: str, target: str) -> float: """Evaluate response quality using multiple reward functions""" total_reward = 0.0 - + for reward_fn in 
self.reward_functions: # Each reward function contributes to total score reward = reward_fn(prompt, response, target) total_reward += reward - + # Return average reward across all functions return total_reward / len(self.reward_functions) if self.reward_functions else 0.0 @@ -490,7 +490,7 @@ target = "36" score = await reward_actor.evaluate_response.route( prompt=prompt, - response=response, + response=response, target=target ) print(f"Reward score: {score}") # Usually around 1.0 for correct math answers @@ -530,7 +530,7 @@ print("Initializing all services...") reward_actor, ) = await asyncio.gather( DatasetActor.options(procs=1).as_actor( - path="openai/gsm8k", revision="main", data_split="train", + path="openai/gsm8k", revision="main", data_split="train", streaming=True, model="Qwen/Qwen3-1.7B" ), Policy.options(procs=1, with_gpus=True, num_replicas=1).as_service( @@ -559,41 +559,41 @@ print("All services initialized successfully!") async def production_training_loop(): """Real training loop pattern from apps/grpo/main.py""" step = 0 - + while True: - # Data generation + # Data generation sample = await dataloader.sample.call_one() - + # Policy generation service call responses = await policy.generate.route(sample["request"]) # Correct field name - + # Reference computation service call (requires full input tensor) input_ids = torch.cat([responses[0].prompt_ids, responses[0].token_ids]) ref_logprobs = await ref_model.forward.route( input_ids.unsqueeze(0), max_req_tokens=512, return_logprobs=True ) - - # Reward evaluation service call + + # Reward evaluation service call reward = await reward_actor.evaluate_response.route( prompt=sample["question"], response=responses[0].text, target=sample["answer"] ) - + # Experience storage (using actual Episode structure) episode = create_episode_from_grpo_data(sample, responses[0], reward, ref_logprobs[0], step) await replay_buffer.add.call_one(episode) - + # Training when ready batch = await replay_buffer.sample.call_one(curr_policy_version=step) if batch is not None: inputs, targets = batch # GRPO returns (inputs, targets) tuple loss = await trainer.train_step.call(inputs, targets) - + # Weight synchronization pattern await trainer.push_weights.call(step + 1) await policy.update_weights.fanout(step + 1) # Fanout to all replicas - + print(f"Step {step}, Loss: {loss:.4f}") step += 1 @@ -612,11 +612,11 @@ print("All services shut down successfully!") **Key observations:** 1. **Parallelism**: Independent operations run concurrently -2. **Load balancing**: Each `.route()` call automatically selects optimal replica +2. **Load balancing**: Each `.route()` call automatically selects optimal replica 3. **Fault tolerance**: Failures automatically retry on different replicas 4. **Resource efficiency**: CPU and GPU services scale independently 5. **Coordination**: Services coordinate through shared state (replay buffer, weight versions) This is the power of the service abstraction - complex distributed coordination looks like simple async Python code. 
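To make the parallelism observation concrete, here is a minimal sketch that overlaps two independent service calls with `asyncio.gather`. It reuses the service handles and call signatures shown in the loop above; the helper name and exact wiring are illustrative assumptions, not a prescribed Forge pattern:

```python
# Hypothetical sketch: reward evaluation and reference logprobs don't depend on
# each other, so they can run concurrently instead of one after the other.
import asyncio
import torch

async def score_concurrently(policy, ref_model, reward_actor, sample):
    responses = await policy.generate.route(sample["request"])
    input_ids = torch.cat([responses[0].prompt_ids, responses[0].token_ids])

    # Launch both calls at once; each one routes to its own service replica
    ref_logprobs, reward = await asyncio.gather(
        ref_model.forward.route(
            input_ids.unsqueeze(0), max_req_tokens=512, return_logprobs=True
        ),
        reward_actor.evaluate_response.route(
            prompt=sample["question"],
            response=responses[0].text,
            target=sample["answer"],
        ),
    )
    return responses[0], reward, ref_logprobs
```

Nothing about the individual service calls changes here; only the orchestration around them does.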
-In the next part we will learn about [Monarch internals](./3_Monarch_101.MD) \ No newline at end of file +In the next part we will learn about [Monarch internals](./3_Monarch_101.MD) diff --git a/docs/Tutorials/ReadMe.MD b/docs/Tutorials/ReadMe.MD index 7798b147d..084710853 100644 --- a/docs/Tutorials/ReadMe.MD +++ b/docs/Tutorials/ReadMe.MD @@ -4,7 +4,7 @@ A comprehensive guide for ML Engineers building distributed RL systems for langu Some of the examples mentioned below will be conceptual in nature for understanding. Please refer to API Docs (Coming Soon!) for more details -Welcome to the Tutorials section! This section is inspired by the A-Z PyTorch tutorial, shoutout to our PyTorch friends that remember! +Welcome to the Tutorials section! This section is inspired by the A-Z PyTorch tutorial, shoutout to our PyTorch friends that remember! ### @@ -14,6 +14,6 @@ This section currently is structured in 3 detailed parts: 2. [Forge Internals](./2_Forge_Internals.MD): Goes a layer deeper and explains the internals of Forge 3. [Monarch 101](./3_Monarch_101.MD): It's a 101 to Monarch and how Forge Talks to Monarch -Each part builds upon the next and the entire section can be consumed in roughly an hour-Grab a Chai and Enjoy! +Each part builds upon the next and the entire section can be consumed in roughly an hour-Grab a Chai and Enjoy! -If you're eager, please checkout our SFT Tutorial too (Coming soon!) as well as [App Examples](../../apps/). \ No newline at end of file +If you're eager, please checkout our SFT Tutorial too (Coming soon!) as well as [App Examples](../../apps/). From d03e84a851bb6adfdd2bd4e8c418e6d62a3d4ace Mon Sep 17 00:00:00 2001 From: Sanyam Bhutani Date: Mon, 13 Oct 2025 15:04:16 -0700 Subject: [PATCH 19/22] address felipe's comments, add image and fix sticky session examples --- docs/Tutorials/2_Forge_Internals.MD | 83 +++++++++++++++++++++++------ 1 file changed, 66 insertions(+), 17 deletions(-) diff --git a/docs/Tutorials/2_Forge_Internals.MD b/docs/Tutorials/2_Forge_Internals.MD index e1af9cde3..8189cf8a5 100644 --- a/docs/Tutorials/2_Forge_Internals.MD +++ b/docs/Tutorials/2_Forge_Internals.MD @@ -108,22 +108,54 @@ await policy.shutdown() Forge services are implemented as ServiceActors that manage collections of your ForgeActor replicas: -Forge internals - What happens behind the scenes: -1. `.as_service()` creates a `ServiceInterface` -2. `ServiceInterface` manages N replicas of your `ForgeActor` class -3. `ServiceInterface` handles routing between replicas -4. You get methods like `.route()`, `.fanout()`, etc. +When you call `.as_service()`, Forge creates a `ServiceInterface` that manages N replicas of your `ForgeActor` class and gives you methods like `.route()`, `.fanout()`, etc. 
```python -# Your code sees this: +# Your code sees this simple interface: responses = await policy.generate.route(prompt=prompt) +# But Forge handles all the complexity of replica management, load balancing, and fault tolerance ``` -But behind the scenes: -- `ServiceInterface` selects healthy replica -- Routes message to that replica's `Policy.generate()` endpoint -- Handles failures and retries automatically -- Returns list[Completion] from the selected replica +## Communication Patterns: Quick Reference + +**API Summary:** +- `.route()` - Send request to any healthy replica in a service (load balanced) +- `.call_one()` - Send request to a single actor instance +- `.fanout()` - Send request to ALL replicas in a service + +```mermaid +graph LR + subgraph Request["Your Request"] + Code["await service.method.ADVERB()"] + end + + subgraph Patterns["Communication Patterns"] + Route[".route()
→ One healthy replica"] + CallOne[".call_one()
→ Single actor"] + Fanout[".fanout()
→ ALL replicas"] + end + + subgraph Replicas["Replicas/Actors"] + R1["Replica 1"] + R2["Replica 2"] + R3["Replica 3"] + A1["Actor"] + end + + Code --> Route + Code --> CallOne + Code --> Fanout + + Route --> R2 + CallOne --> A1 + Fanout --> R1 + Fanout --> R2 + Fanout --> R3 + + style Route fill:#4CAF50 + style CallOne fill:#FF9800 + style Fanout fill:#9C27B0 +``` ## Deep Dive: Service Communication Patterns @@ -203,8 +235,10 @@ while training: **When to use**: When you need multiple calls to hit the same replica (like KV cache preservation). +**What are sticky sessions?** A session ensures all your service calls within the `async with` block go to the same replica, instead of being load-balanced across different replicas. + ```python -# This Counter example demonstrates the session pattern +# This Counter example demonstrates the difference between regular routing and sessions from forge.controller import ForgeActor from monarch.actor import endpoint @@ -230,22 +264,37 @@ counter_service = await ForgeCounter.options( procs=1, num_replicas=4 ).as_service(initial_value=0) -# Test basic operations -await counter_service.increment.route() +# WITHOUT SESSIONS: Each .route() call goes to a different replica +await counter_service.increment.route() # Might go to replica 2 +await counter_service.increment.route() # Might go to replica 1 +await counter_service.increment.route() # Might go to replica 3 + results = await counter_service.increment.fanout() # Get from all replicas print(f"All replica values: {results}") +# Output: All replica values: [1, 2, 1, 1] - Each replica has different state! +``` -# STICKY SESSIONS +The problem: each `.route()` call can go to different replicas, creating inconsistent state. + +```python +# WITH SESSIONS: All calls go to the SAME replica print("\nUsing sticky sessions:") -async with counter_service.session(): +async with counter_service.session(): # Creates a session that picks one replica await counter_service.reset.route() # Uses .route() within session print(await counter_service.increment.route()) # 1 - print(await counter_service.increment.route()) # 2 + print(await counter_service.increment.route()) # 2 print(await counter_service.increment.route()) # 3 final_value = await counter_service.get_value.route() print(f"Final value on this replica: {final_value}") # 3 +# Output: +# Using sticky sessions: +# 1 +# 2 +# 3 +# Final value on this replica: 3 + # Same pattern works with Policy for multi-turn conversations: # async with policy.session(): # response1 = await policy.generate.route(turn1) From 6ace584dcfd84a3b5e17a357f23b98e6d0d52c69 Mon Sep 17 00:00:00 2001 From: Sanyam Bhutani Date: Mon, 13 Oct 2025 15:07:27 -0700 Subject: [PATCH 20/22] fix PR tests --- docs/Tutorials/2_Forge_Internals.MD | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/docs/Tutorials/2_Forge_Internals.MD b/docs/Tutorials/2_Forge_Internals.MD index 8189cf8a5..1a9421a96 100644 --- a/docs/Tutorials/2_Forge_Internals.MD +++ b/docs/Tutorials/2_Forge_Internals.MD @@ -120,7 +120,7 @@ responses = await policy.generate.route(prompt=prompt) **API Summary:** - `.route()` - Send request to any healthy replica in a service (load balanced) -- `.call_one()` - Send request to a single actor instance +- `.call_one()` - Send request to a single actor instance - `.fanout()` - Send request to ALL replicas in a service ```mermaid @@ -128,30 +128,30 @@ graph LR subgraph Request["Your Request"] Code["await service.method.ADVERB()"] end - + subgraph Patterns["Communication 
Patterns"] Route[".route()
→ One healthy replica"] CallOne[".call_one()
→ Single actor"] Fanout[".fanout()
→ ALL replicas"] end - + subgraph Replicas["Replicas/Actors"] R1["Replica 1"] R2["Replica 2"] R3["Replica 3"] A1["Actor"] end - + Code --> Route Code --> CallOne Code --> Fanout - + Route --> R2 CallOne --> A1 Fanout --> R1 Fanout --> R2 Fanout --> R3 - + style Route fill:#4CAF50 style CallOne fill:#FF9800 style Fanout fill:#9C27B0 @@ -266,7 +266,7 @@ counter_service = await ForgeCounter.options( # WITHOUT SESSIONS: Each .route() call goes to a different replica await counter_service.increment.route() # Might go to replica 2 -await counter_service.increment.route() # Might go to replica 1 +await counter_service.increment.route() # Might go to replica 1 await counter_service.increment.route() # Might go to replica 3 results = await counter_service.increment.fanout() # Get from all replicas @@ -282,7 +282,7 @@ print("\nUsing sticky sessions:") async with counter_service.session(): # Creates a session that picks one replica await counter_service.reset.route() # Uses .route() within session print(await counter_service.increment.route()) # 1 - print(await counter_service.increment.route()) # 2 + print(await counter_service.increment.route()) # 2 print(await counter_service.increment.route()) # 3 final_value = await counter_service.get_value.route() From b67d2e3c6a47298d02fb02558b2b93a5f5e260be Mon Sep 17 00:00:00 2001 From: Sanyam Bhutani Date: Tue, 14 Oct 2025 13:54:10 -0700 Subject: [PATCH 21/22] Update 3_Monarch_101.MD --- docs/Tutorials/3_Monarch_101.MD | 201 +++++++++----------------------- 1 file changed, 52 insertions(+), 149 deletions(-) diff --git a/docs/Tutorials/3_Monarch_101.MD b/docs/Tutorials/3_Monarch_101.MD index 52bdb17d0..f3c5c5f37 100644 --- a/docs/Tutorials/3_Monarch_101.MD +++ b/docs/Tutorials/3_Monarch_101.MD @@ -50,85 +50,50 @@ graph TD ### Single Host ProcMesh -```mermaid -graph TD - subgraph Host["Single Host (8 GPUs)"] - subgraph ProcMesh["ProcMesh: per_host={'gpus': 8}"] - P0["Process 0
GPU 0"] - P1["Process 1
GPU 1"] - P2["Process 2
GPU 2"] - P3["Process 3
GPU 3"] - P4["Process 4
GPU 4"] - P5["Process 5
GPU 5"] - P6["Process 6
GPU 6"] - P7["Process 7
GPU 7"] - end - - P0 -.->|"Network"| P1 - P1 -.->|"Network"| P2 - P2 -.->|"Network"| P3 - P3 -.->|"Network"| P4 - P4 -.->|"Network"| P5 - P5 -.->|"Network"| P6 - P6 -.->|"Network"| P7 - P7 -.->|"Network"| P0 - end +**Key insight**: ProcMesh creates one process per GPU, automatically handling the process-to-hardware mapping. - style P0 fill:#F44336 - style P1 fill:#F44336 - style P2 fill:#F44336 - style P3 fill:#F44336 - style P4 fill:#F44336 - style P5 fill:#F44336 - style P6 fill:#F44336 - style P7 fill:#F44336 +```python +# This simple call: +procs = this_host().spawn_procs(per_host={"gpus": 8}) + +# Creates: +# Process 0 → GPU 0 +# Process 1 → GPU 1 +# Process 2 → GPU 2 +# Process 3 → GPU 3 +# Process 4 → GPU 4 +# Process 5 → GPU 5 +# Process 6 → GPU 6 +# Process 7 → GPU 7 ``` -### Multi-Host ProcMesh - -```mermaid -graph TD - subgraph Cluster["Multi-Host Cluster"] - subgraph Host1["Host 1"] - subgraph PM1["ProcMesh Segment 1"] - H1P0["Process 0
GPU 0"] - H1P1["Process 1
GPU 1"] - H1P2["Process 2
GPU 2"] - H1P3["Process 3
GPU 3"] - end - end - - subgraph Host2["Host 2"] - subgraph PM2["ProcMesh Segment 2"] - H2P0["Process 4
GPU 0"] - H2P1["Process 5
GPU 1"] - H2P2["Process 6
GPU 2"] - H2P3["Process 7
GPU 3"] - end - end +The beauty: you don't manage individual processes or GPU assignments - ProcMesh handles the topology for you. - subgraph Host3["Host 3"] - subgraph PM3["ProcMesh Segment 3"] - H3P0["Process 8
GPU 0"] - H3P1["Process 9
GPU 1"] - H3P2["Process 10
GPU 2"] - H3P3["Process 11
GPU 3"] - end - end - end +### Multi-Host ProcMesh - H1P0 -.->|"InfiniBand"| H2P0 - H1P1 -.->|"InfiniBand"| H2P1 - H2P0 -.->|"InfiniBand"| H3P0 - H2P1 -.->|"InfiniBand"| H3P1 +**Key insight**: ProcMesh seamlessly scales across multiple hosts with continuous process numbering. - style PM1 fill:#F44336 - style PM2 fill:#4CAF50 - style PM3 fill:#2196F3 +```python +# Same simple API works across hosts: +cluster_procs = spawn_cluster_procs( + hosts=["host1", "host2", "host3"], + per_host={"gpus": 4} +) + +# Automatically creates: +# Host 1: Processes 0-3 → GPUs 0-3 +# Host 2: Processes 4-7 → GPUs 0-3 +# Host 3: Processes 8-11 → GPUs 0-3 + +# Your code stays the same whether it's 1 host or 100 hosts +actors = cluster_procs.spawn("my_actor", MyActor) ``` +**The power**: Scale from single host to cluster without changing your actor code - ProcMesh handles all the complexity. + ```python # This shows the underlying actor system that powers Forge services +# NOTE: This is for educational purposes - use ForgeActor and .as_service() in real Forge apps! from monarch.actor import Actor, endpoint, this_proc, Future from monarch.actor import ProcMesh, this_host @@ -165,104 +130,42 @@ await counters.increment.call() # STEP 6: Different message patterns # call_one() - single actor value = await counters.get_value.call_one() -print(f"One counter: {value}") +print(f"One counter: {value}") # Output: One counter: 1 # choose() - random single actor (actors only, not services) value = await counters.get_value.choose() -print(f"Random counter: {value}") +print(f"Random counter: {value}") # Output: Random counter: 1 # call() - all actors, collect results values = await counters.get_value.call() -print(f"All counters: {values}") +print(f"All counters: {values}") # Output: All counters: [1, 1, 1, 1, 1, 1, 1, 1] # broadcast() - fire and forget -await counters.increment.broadcast() +await counters.increment.broadcast() # No return value - just sends to all actors # Cleanup await procs.stop() -``` - -## Actor Meshes: Your Code Running Distributed - -**ActorMesh** is created when you spawn actors across a ProcMesh. Each process in the ProcMesh gets one instance of your actor. - -```mermaid -graph TD - subgraph Creation["Actor Creation Process"] - Code["mesh.spawn('policy', PolicyActor, model='Qwen/Qwen3-7B')"] - - subgraph ProcMesh["ProcMesh (4 processes)"] - P0["Process 0
GPU 0"] - P1["Process 1
GPU 1"] - P2["Process 2
GPU 2"] - P3["Process 3
GPU 3"] - end - - subgraph ActorMesh["ActorMesh PolicyActor"] - A0["PolicyActor Instance #0: model=Qwen/Qwen3-7B"] - A1["PolicyActor Instance #1: model=Qwen/Qwen3-7B"] - A2["PolicyActor Instance #2: model=Qwen/Qwen3-7B"] - A3["PolicyActor Instance #3: model=Qwen/Qwen3-7B"] - end - - Code --> ProcMesh - P0 --> A0 - P1 --> A1 - P2 --> A2 - P3 --> A3 - end - style A0 fill:#4CAF50 - style A1 fill:#4CAF50 - style A2 fill:#4CAF50 - style A3 fill:#4CAF50 +# Remember: This raw Monarch code is for understanding how Forge works internally. +# In your Forge applications, use ForgeActor, .as_service(), .as_actor() instead! ``` -### Message Routing Through ActorMesh +## Actor Meshes: Your Code Running Distributed -```mermaid -graph TD - subgraph MessageFlow["Message Flow Patterns"] - Client["await policy_actors.generate.METHOD(prompt)"] - - subgraph Methods["Different Adverbs Route Differently"] - Choose["choose(): Routes to ONE actor, Load balanced"] - Call["call(): Routes to ALL actors, Collects results"] - Broadcast["broadcast(): Routes to ALL actors, Fire and forget"] - Stream["stream(): Routes to ALL actors, Iterator of results"] - end +**ActorMesh** is created when you spawn actors across a ProcMesh. Key points: - subgraph ActorInstances["PolicyActor Instances"] - A0["Actor 0: GPU 0, generates response"] - A1["Actor 1: GPU 1, generates response"] - A2["Actor 2: GPU 2, generates response"] - A3["Actor 3: GPU 3, generates response"] - end +- **One actor instance per process**: `mesh.spawn("policy", PolicyActor)` creates one PolicyActor in each process +- **Same constructor arguments**: All instances get the same initialization parameters +- **Independent state**: Each actor instance maintains its own state and memory +- **Message routing**: You can send messages to one actor or all actors using different methods - Client --> Choose - Client --> Call - Client --> Broadcast - Client --> Stream - - Choose -.->|"Load balanced"| A1 - Call --> A0 - Call --> A1 - Call --> A2 - Call --> A3 - Broadcast --> A0 - Broadcast --> A1 - Broadcast --> A2 - Broadcast --> A3 - Stream --> A0 - Stream --> A1 - Stream --> A2 - Stream --> A3 - end +```python +# Simple example: +procs = spawn_procs(per_host={"gpus": 4}) # 4 processes +policy_actors = procs.spawn("policy", PolicyActor, model="Qwen/Qwen3-7B") - style Choose fill:#4CAF50 - style Call fill:#FF9800 - style Broadcast fill:#E91E63 - style Stream fill:#9C27B0 +# Now you have 4 PolicyActor instances, one per GPU +# All initialized with the same model parameter ``` ## How Forge Services Use Monarch From cda22d849c44d3788e37179836f9093179006608 Mon Sep 17 00:00:00 2001 From: Sanyam Bhutani Date: Tue, 14 Oct 2025 13:56:36 -0700 Subject: [PATCH 22/22] fix linter --- docs/Tutorials/3_Monarch_101.MD | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/Tutorials/3_Monarch_101.MD b/docs/Tutorials/3_Monarch_101.MD index f3c5c5f37..2213e9bb5 100644 --- a/docs/Tutorials/3_Monarch_101.MD +++ b/docs/Tutorials/3_Monarch_101.MD @@ -58,7 +58,7 @@ procs = this_host().spawn_procs(per_host={"gpus": 8}) # Creates: # Process 0 → GPU 0 -# Process 1 → GPU 1 +# Process 1 → GPU 1 # Process 2 → GPU 2 # Process 3 → GPU 3 # Process 4 → GPU 4 @@ -76,13 +76,13 @@ The beauty: you don't manage individual processes or GPU assignments - ProcMesh ```python # Same simple API works across hosts: cluster_procs = spawn_cluster_procs( - hosts=["host1", "host2", "host3"], + hosts=["host1", "host2", "host3"], per_host={"gpus": 4} ) # Automatically creates: # Host 1: Processes 0-3 
→ GPUs 0-3 -# Host 2: Processes 4-7 → GPUs 0-3 +# Host 2: Processes 4-7 → GPUs 0-3 # Host 3: Processes 8-11 → GPUs 0-3 # Your code stays the same whether it's 1 host or 100 hosts @@ -155,7 +155,7 @@ await procs.stop() **ActorMesh** is created when you spawn actors across a ProcMesh. Key points: - **One actor instance per process**: `mesh.spawn("policy", PolicyActor)` creates one PolicyActor in each process -- **Same constructor arguments**: All instances get the same initialization parameters +- **Same constructor arguments**: All instances get the same initialization parameters - **Independent state**: Each actor instance maintains its own state and memory - **Message routing**: You can send messages to one actor or all actors using different methods
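To illustrate the message-routing bullet above, here is a minimal sketch of sending messages to that ActorMesh with the same adverbs used in the Counter example. It assumes `PolicyActor` exposes a `generate` endpoint; this is illustrative raw-Monarch usage for learning, not the Forge service API:

```python
# Route to ONE actor instance, picked for you
single = await policy_actors.generate.choose("What is 2+2?")

# Route to ALL four actor instances and collect one result per instance
all_results = await policy_actors.generate.call("What is 2+2?")

# Route to a single actor instance
one = await policy_actors.generate.call_one("What is 2+2?")

# Fire-and-forget to ALL instances (no results returned)
await policy_actors.generate.broadcast("What is 2+2?")
```

In real Forge applications you would wrap `PolicyActor` with `.as_service()` and use `.route()` / `.fanout()` instead, as covered in Part 2.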