diff --git a/docs/source/tutorial_sources/README.txt b/docs/source/tutorial_sources/README.txt index 1fadb0a08..0f59efa01 100644 --- a/docs/source/tutorial_sources/README.txt +++ b/docs/source/tutorial_sources/README.txt @@ -1,5 +1,5 @@ Tutorials ========= -This gallery contains tutorials and examples to help you get started with Forge. +This gallery contains tutorials and examples to help you get started with TorchForge. Each tutorial demonstrates specific features and use cases with practical examples. diff --git a/docs/source/tutorial_sources/zero-to-forge/1_RL_and_Forge_Fundamentals.md b/docs/source/tutorial_sources/zero-to-forge/1_RL_and_Forge_Fundamentals.md index 42f234772..fd7c0cf6b 100644 --- a/docs/source/tutorial_sources/zero-to-forge/1_RL_and_Forge_Fundamentals.md +++ b/docs/source/tutorial_sources/zero-to-forge/1_RL_and_Forge_Fundamentals.md @@ -1,8 +1,8 @@ -# Part 1: RL Fundamentals - Using Forge Terminology +# Part 1: RL Fundamentals - Using TorchForge Terminology -## Core RL Components in Forge +## Core RL Components in TorchForge -Let's start with a simple math tutoring example to understand RL concepts with the exact names Forge uses: +Let's start with a simple math tutoring example to understand RL concepts with the exact names TorchForge uses: ### The Toy Example: Teaching Math @@ -30,7 +30,7 @@ graph TD style Trainer fill:#E91E63 ``` -### RL Components Defined (Forge Names) +### RL Components Defined (TorchForge Names) 1. **Dataset**: Provides questions/prompts (like "What is 2+2?") 2. **Policy**: The AI being trained (generates answers like "The answer is 4") @@ -66,12 +66,12 @@ def conceptual_rl_step(): if batch is not None: trainer.train_step(batch) # Student gets better! 
-# 🔄 See complete working example below with actual Forge service calls +# 🔄 See complete working example below with actual TorchForge service calls ``` -## From Concepts to Forge Services +## From Concepts to TorchForge Services -Here's the key insight: **Each RL component becomes a Forge service**. The toy example above maps directly to Forge: +Here's the key insight: **Each RL component becomes a TorchForge service**. The toy example above maps directly to TorchForge: ```mermaid graph LR @@ -85,7 +85,7 @@ graph LR C6["Trainer"] end - subgraph Services["Forge Services (Real Classes)"] + subgraph Services["TorchForge Services (Real Classes)"] direction TB S1["DatasetActor"] S2["Policy"] @@ -108,9 +108,9 @@ graph LR style S3 fill:#FF9800 ``` -### RL Step with Forge Services +### RL Step with TorchForge Services -Let's look at the example from above again, but this time we would use the names from Forge: +Let's look at the example from above again, but this time we would use the names from TorchForge: ```python # Conceptual Example async def conceptual_forge_rl_step(services, step): @@ -151,7 +151,7 @@ async def conceptual_forge_rl_step(services, step): **Key difference**: Same RL logic, but each component is now a distributed, fault-tolerant, auto-scaling service. -Did you realise-we are not worrying about any Infra code here! Forge Automagically handles the details behind the scenes and you can focus on writing your RL Algorthms! +Did you realise it? We are not worrying about any infra code here! TorchForge automagically handles the details behind the scenes so you can focus on writing your RL algorithms! 
## Why This Matters: Traditional ML Infrastructure Fails @@ -196,7 +196,7 @@ graph LR Each step has different: - **Latency requirements**: Policy inference needs low latency (each episode waits), training can batch multiple episodes together - **Scaling patterns**: Need N policy replicas to keep trainer busy, plus different sharding strategies (tensor parallel for training vs replicated inference) -- **Failure modes**: Any component failure cascades to halt the entire pipeline (Forge prevents this with automatic failover) +- **Failure modes**: Any component failure cascades to halt the entire pipeline (TorchForge prevents this with automatic failover) - **Resource utilization**: GPUs for inference/training, CPUs for data processing ### Problem 3: The Coordination Challenge @@ -218,11 +218,11 @@ def naive_rl_step(): entire_system_stops() ``` -## Enter Forge: RL-Native Architecture +## Enter TorchForge: RL-Native Architecture -Forge solves these problems by treating each RL component as an **independent, distributed unit** - some as fault-tolerant services (like Policy inference where failures are easy to handle), others as actors (like Trainers where recovery semantics differ) +TorchForge solves these problems by treating each RL component as an **independent, distributed unit** - some as fault-tolerant services (like Policy inference where failures are easy to handle), others as actors (like Trainers where recovery semantics differ) -Let's see how core RL concepts map to Forge components (you'll notice a mix of `.route()` for services and `.call_one()` for actors - we cover when to use each in Part 2): +Let's see how core RL concepts map to TorchForge components (you'll notice a mix of `.route()` for services and `.call_one()` for actors - we cover when to use each in Part 2): **Quick API Reference:** (covered in detail in Part 2: Service Communication Patterns) - `.route()` - Send request to any healthy replica in a service (load balanced) @@ -231,7 +231,7 @@ 
Let's see how core RL concepts map to Forge components (you'll notice a mix of ` ```python async def real_rl_training_step(services, step): - """Single RL step using verified Forge APIs""" + """Single RL step using verified TorchForge APIs""" # 1. Environment interaction - Using actual DatasetActor API sample = await services['dataloader'].sample.call_one() @@ -280,7 +280,7 @@ responses = await policy.generate.route(prompt=question) answer = responses[0].text # responses is list[Completion] ``` -Forge handles behind the scenes: +TorchForge handles behind the scenes: - Routing to least loaded replica - GPU memory management - Batch optimization @@ -361,9 +361,9 @@ group_size = 1 ) ``` -**Forge Components: Services vs Actors** +**TorchForge Components: Services vs Actors** -Forge has two types of distributed components: +TorchForge has two types of distributed components: - **Services**: Multiple replicas with automatic load balancing (like Policy, RewardActor) - **Actors**: Single instances that handle their own internal distribution (like RLTrainer, ReplayBuffer) @@ -378,7 +378,7 @@ We cover this distinction in detail in Part 2, but for now this explains the sca # If a policy replica fails: responses = await policy.generate.route(prompt=question) answer = responses[0].text -# -> Forge automatically routes to healthy replica +# -> TorchForge automatically routes to healthy replica # -> Failed replica respawns in background # -> No impact on training loop diff --git a/docs/source/tutorial_sources/zero-to-forge/2_Forge_Internals.md b/docs/source/tutorial_sources/zero-to-forge/2_Forge_Internals.md index 13e59da48..9c8f89bc2 100644 --- a/docs/source/tutorial_sources/zero-to-forge/2_Forge_Internals.md +++ b/docs/source/tutorial_sources/zero-to-forge/2_Forge_Internals.md @@ -1,6 +1,6 @@ # Part 2: Peeling Back the Abstraction - What Are Services? 
-We highly recommend reading [Part 1](./1_RL_and_Forge_Fundamentals) before this, it explains RL Concepts and how they land in Forge. +We highly recommend reading [Part 1](./1_RL_and_Forge_Fundamentals) before this; it explains RL concepts and how they land in TorchForge. Now that you see the power of the service abstraction, let's understand what's actually happening under the hood, Grab your chai! @@ -49,7 +49,7 @@ graph TD ### 1. Real Service Configuration -Here's the actual ServiceConfig from Forge source code: +Here's the actual ServiceConfig from TorchForge source code: ```python # Configuration pattern from apps/grpo/main.py: @@ -106,14 +106,14 @@ await policy.shutdown() ### 3. How Services Actually Work -Forge services are implemented as ServiceActors that manage collections of your ForgeActor replicas: +TorchForge services are implemented as ServiceActors that manage collections of your ForgeActor replicas: -When you call `.as_service()`, Forge creates a `ServiceInterface` that manages N replicas of your `ForgeActor` class and gives you methods like `.route()`, `.fanout()`, etc. +When you call `.as_service()`, TorchForge creates a `ServiceInterface` that manages N replicas of your `ForgeActor` class and gives you methods like `.route()`, `.fanout()`, etc. ```python # Your code sees this simple interface: responses = await policy.generate.route(prompt=prompt) -# But Forge handles all the complexity of replica management, load balancing, and fault tolerance +# But TorchForge handles all the complexity of replica management, load balancing, and fault tolerance ``` ## Communication Patterns: Quick Reference @@ -159,7 +159,7 @@ graph LR ## Deep Dive: Service Communication Patterns -These communication patterns (\"adverbs\") determine how your service calls are routed to replicas. Understanding when to use each pattern is key to effective Forge usage. +These communication patterns ("adverbs") determine how your service calls are routed to replicas. 
Understanding when to use each pattern is key to effective TorchForge usage. ### 1. `.route()` - Load Balanced Single Replica @@ -181,7 +181,7 @@ Behind the scenes: - **Throughput**: Limited by single replica capacity - **Fault tolerance**: Automatic failover to other replicas -**Critical insight**: `.route()` is your default choice for stateless operations in Forge services. +**Critical insight**: `.route()` is your default choice for stateless operations in TorchForge services. ### 2. `.fanout()` - Broadcast with Results Collection @@ -346,10 +346,10 @@ async def optimized_multi_turn(): **The challenge**: Multiple trainers and experience collectors reading/writing concurrently. -**Real Forge approach**: The ReplayBuffer actor handles concurrency internally: +**Real TorchForge approach**: The ReplayBuffer actor handles concurrency internally: ```python -# Forge ReplayBuffer endpoints (verified from source code) +# TorchForge ReplayBuffer endpoints (verified from source code) # Add episodes (thread-safe by actor model) await replay_buffer.add.call_one(episode) # .choose() would work too, but .call_one() clarifies it's a singleton actor not ActorMesh @@ -372,7 +372,7 @@ batch = await replay_buffer.sample.call_one( **The challenge**: Trainer updates policy weights, but policy service needs those weights. ```python -# Forge weight synchronization pattern from apps/grpo/main.py +# TorchForge weight synchronization pattern from apps/grpo/main.py async def real_weight_sync(trainer, policy, step): # Trainer pushes weights to TorchStore with version number await trainer.push_weights.call_one(policy_version=step + 1) @@ -388,11 +388,11 @@ print(f"Current policy version: {current_version}") ## Deep Dive: Asynchronous Coordination Patterns -**The real challenge**: Different services run at different speeds, but Forge's service abstraction handles the coordination complexity. 
+**The real challenge**: Different services run at different speeds, but TorchForge's service abstraction handles the coordination complexity. -### The Forge Approach: Let Services Handle Coordination +### The TorchForge Approach: Let Services Handle Coordination -Instead of manual coordination, Forge services handle speed mismatches automatically: +Instead of manual coordination, TorchForge services handle speed mismatches automatically: ```python from apps.grpo.main import Episode, Group @@ -556,7 +556,7 @@ await reward_actor.shutdown() Now let's see how services coordinate in a real training loop: ```python -# This is the REAL way production RL systems are built with Forge +# This is the REAL way production RL systems are built with TorchForge import asyncio import torch diff --git a/docs/source/tutorial_sources/zero-to-forge/3_Monarch_101.md index faf21e159..a5a28c7a6 100644 --- a/docs/source/tutorial_sources/zero-to-forge/3_Monarch_101.md +++ b/docs/source/tutorial_sources/zero-to-forge/3_Monarch_101.md @@ -1,8 +1,8 @@ -# Part 3: The Forge-Monarch Connection +# Part 3: The TorchForge-Monarch Connection -This is part 3 of our series, in the previous sections: we learned Part 1: [RL Concepts and how they map to Forge](./1_RL_and_Forge_Fundamentals), Part 2: [Forge Internals](./2_Forge_Internals). +This is Part 3 of our series. In the previous sections we covered Part 1: [RL Concepts and how they map to TorchForge](./1_RL_and_Forge_Fundamentals) and Part 2: [TorchForge Internals](./2_Forge_Internals). -Now let's peel back the layers. Forge services are built on top of **Monarch**, PyTorch's distributed actor framework. Understanding this connection is crucial for optimization and debugging. +Now let's peel back the layers. TorchForge services are built on top of **Monarch**, PyTorch's distributed actor framework. Understanding this connection is crucial for optimization and debugging. 
## The Complete Hierarchy: Service to Silicon @@ -12,7 +12,7 @@ graph TD Call["await policy_service.generate.route('What is 2+2?')"] end - subgraph ForgeServices["2. Forge Service Layer"] + subgraph ForgeServices["2. TorchForge Service Layer"] ServiceInterface["ServiceInterface: Routes requests, Load balancing, Health checks"] ServiceActor["ServiceActor: Manages replicas, Monitors health, Coordinates failures"] end @@ -92,8 +92,8 @@ actors = cluster_procs.spawn("my_actor", MyActor) **The power**: Scale from single host to cluster without changing your actor code - ProcMesh handles all the complexity. ```python -# This shows the underlying actor system that powers Forge services -# NOTE: This is for educational purposes - use ForgeActor and .as_service() in real Forge apps! +# This shows the underlying actor system that powers TorchForge services +# NOTE: This is for educational purposes - use ForgeActor and .as_service() in real TorchForge apps! from monarch.actor import Actor, endpoint, this_proc, Future from monarch.actor import ProcMesh, this_host @@ -146,8 +146,8 @@ await counters.increment.broadcast() # No return value - just sends to all acto # Cleanup await procs.stop() -# Remember: This raw Monarch code is for understanding how Forge works internally. -# In your Forge applications, use ForgeActor, .as_service(), .as_actor() instead! +# Remember: This raw Monarch code is for understanding how TorchForge works internally. +# In your TorchForge applications, use ForgeActor, .as_service(), .as_actor() instead! ``` ## Actor Meshes: Your Code Running Distributed @@ -168,9 +168,9 @@ policy_actors = procs.spawn("policy", PolicyActor, model="Qwen/Qwen3-7B") # All initialized with the same model parameter ``` -## How Forge Services Use Monarch +## How TorchForge Services Use Monarch -Now the key insight: **Forge services are ServiceActors that manage ActorMeshes of your ForgeActor replicas**. 
+Now the key insight: **TorchForge services are ServiceActors that manage ActorMeshes of your ForgeActor replicas**. ### The Service Creation Process @@ -264,7 +264,7 @@ In real RL systems, you have multiple services that can share or use separate Pr ```mermaid graph TD subgraph Cluster["RL Training Cluster"] - subgraph Services["Forge Services"] + subgraph Services["TorchForge Services"] PS["Policy Service
4 GPU replicas"] TS["Trainer Service
2 GPU replicas"] RS["Reward Service
4 CPU replicas"] @@ -317,7 +317,7 @@ graph TD 2. **Location Transparency**: Actors can be local or remote with identical APIs 3. **Structured Distribution**: ProcMesh maps directly to hardware topology 4. **Message Passing**: No shared memory means no race conditions or locks -5. **Service Abstraction**: Forge hides Monarch complexity while preserving power +5. **Service Abstraction**: TorchForge hides Monarch complexity while preserving power Understanding this hierarchy helps you: - **Debug performance issues**: Is the bottleneck at service, actor, or hardware level? @@ -329,9 +329,9 @@ Understanding this hierarchy helps you: ## What You've Learned -1. **RL Fundamentals**: How RL concepts map to Forge services with REAL, working examples -2. **Service Abstraction**: How to use Forge services effectively with verified communication patterns -3. **Monarch Foundation**: How Forge services connect to distributed actors and hardware +1. **RL Fundamentals**: How RL concepts map to TorchForge services with REAL, working examples +2. **Service Abstraction**: How to use TorchForge services effectively with verified communication patterns +3. **Monarch Foundation**: How TorchForge services connect to distributed actors and hardware ## Key Takeaways diff --git a/docs/source/zero-to-forge-intro.md b/docs/source/zero-to-forge-intro.md index 45a352bbe..c9f2e98d2 100644 --- a/docs/source/zero-to-forge-intro.md +++ b/docs/source/zero-to-forge-intro.md @@ -1,4 +1,4 @@ -# Zero to Forge: From RL Theory to Production-Scale Implementation +# Zero to TorchForge: From RL Theory to Production-Scale Implementation A comprehensive guide for ML Engineers building distributed RL systems for language models. @@ -12,9 +12,9 @@ PyTorch tutorial, shoutout to our PyTorch friends that remember! This section currently is structured in 3 detailed parts: -1. 
[RL Fundamentals and Understanding Forge Terminology](tutorials/zero-to-forge/1_RL_and_Forge_Fundamentals): This gives a quick refresher of Reinforcement Learning and teaches you Forge Fundamentals -2. [Forge Internals](tutorials/zero-to-forge/2_Forge_Internals): Goes a layer deeper and explains the internals of Forge -3. [Monarch 101](tutorials/zero-to-forge/3_Monarch_101): It's a 101 to Monarch and how Forge Talks to Monarch +1. [RL Fundamentals and Understanding TorchForge Terminology](tutorials/zero-to-forge/1_RL_and_Forge_Fundamentals): This gives a quick refresher on Reinforcement Learning and teaches you TorchForge fundamentals +2. [TorchForge Internals](tutorials/zero-to-forge/2_Forge_Internals): Goes a layer deeper and explains the internals of TorchForge +3. [Monarch 101](tutorials/zero-to-forge/3_Monarch_101): It's a 101 on Monarch and how TorchForge talks to Monarch Each part builds upon the next and the entire section can be consumed in roughly an hour - Grab a Chai and Enjoy!