diff --git a/docs/source/architecture.md b/docs/source/architecture.md new file mode 100644 index 000000000..b7c274a9a --- /dev/null +++ b/docs/source/architecture.md @@ -0,0 +1,256 @@ +# Architecture + +This guide provides a deep dive into TorchForge's architecture, explaining how Monarch, Services, and TorchStore work together to enable distributed RL. + +## The Foundation: Monarch + +At TorchForge's core is **Monarch**, a PyTorch-native distributed programming framework that brings single-controller orchestration to entire GPU clusters. + +### Single-Controller vs SPMD + +Traditional distributed training uses **SPMD (Single Program, Multiple Data)** - where multiple copies of the same script run across different machines, each with only a local view of the workflow. This works well for simple data-parallel training, but becomes notoriously difficult for complex RL workflows with: +- Asynchronous generation and training +- Multiple heterogeneous components (policy, reward model, reference model) +- Dynamic resource allocation +- Fault tolerance across components + +**Monarch's single-controller model** changes this entirely. You write one Python script that orchestrates all distributed resources, making them feel almost local. The code looks and feels like a single-machine program, but can scale across thousands of GPUs. + +### Actor Meshes + +Monarch organizes resources into multidimensional arrays called **meshes**: + +**Process Mesh** +: An array of processes spread across many hosts, typically one process per GPU + +**Actor Mesh** +: An array of actors, each running inside a separate process + +Like array programming in NumPy or PyTorch, meshes make it simple to dispatch operations efficiently across large systems. You can slice meshes, broadcast operations, and operate on entire meshes with simple APIs. + +```python +from monarch.actor import Actor, this_host + +# Create a process mesh with 8 GPUs +procs = this_host().spawn_procs({"gpus": 8}) + +# Define an actor +class PolicyActor(Actor): + @endpoint + def generate(self, prompt): + return self.model.generate(prompt) + +# Spawn actors across the mesh +actors = procs.spawn("policy", PolicyActor) + +# Call methods on the entire mesh +results = actors.generate.call_all("Hello world") +``` + +### Fault Tolerance + +Monarch provides **progressive fault handling** - you write your code as if nothing fails. When something does fail, Monarch fails fast by default, stopping the whole program like an uncaught exception. + +But you can progressively add fine-grained fault handling exactly where you need it: + +```python +try: + result = await policy.generate.route(prompt) +except ActorFailure: + # Handle failure - maybe retry with different replica + result = await policy.generate.route(prompt) +``` + +For long-running RL training, this is crucial. Hardware failures are common at scale - in Meta's Llama 3 training, there were 419 interruptions across 54 days on a 16K GPU job (roughly one failure every 3 hours). + +### RDMA and Data Plane + +Monarch separates the **control plane** (messaging) from the **data plane** (bulk data transfers). This enables direct GPU-to-GPU memory transfers across your cluster using RDMA (Remote Direct Memory Access). + +Control commands go through one optimized path, while large data transfers (like model weights) go through another path optimized for bandwidth. + +## Services: RL-Friendly Actor Abstraction + +**Services** wrap Monarch's ActorMesh with patterns common in RL. 
A service is a managed group of actor replicas with built-in load balancing, fault tolerance, and routing primitives. + +```python +# Create a policy service with 16 replicas, each using 8 GPUs +policy = PolicyActor.options( + procs=8, + with_gpus=True, + num_replicas=16 +).as_service() + +# Create a lightweight coding environment service +coder = SandboxedCoder.options( + procs=1, + with_gpus=False, + num_replicas=16 +).as_service() +``` + +### Service Adverbs + +Services provide intuitive operations called "adverbs": + +**route()** +: Load-balanced request to one replica +```python +response = await policy.generate.route(prompt) +``` + +**fanout()** +: Broadcast to ALL replicas in parallel +```python +await policy.update_weights.fanout(version) +``` + +**session()** +: Sticky sessions for stateful operations (maintains KV cache consistency) +```python +async with policy.session(): + response1 = await policy.generate.route(prompt1) + response2 = await policy.generate.route(prompt2) # Same replica +``` + +### Why Services Matter for RL + +Services solve critical infrastructure challenges: + +**Heterogeneous Scaling** +: Different components need different resources. Your policy might need 16 replicas × 8 GPUs for high-throughput vLLM inference. Your reward model might need 4 replicas × 4 GPUs. Your coding environment might need 16 lightweight CPU-only replicas. Services let each component scale independently. + +**Load Balancing** +: In async RL, multiple `continuous_rollouts()` tasks run concurrently. Services automatically distribute these rollouts across available replicas - no manual worker pool management. + +**Fault Tolerance** +: If a replica fails during a rollout, services detect it, mark it unhealthy, and route subsequent requests to healthy replicas. The failed replica gets restarted automatically. Your RL code never sees the failure. + +**Ephemeral Infrastructure** +: Services are created with your job and torn down when finished. Want to try a new reward model? Change your Python code. No standing deployments to maintain, no infrastructure to provision ahead of time. + +## TorchStore: Distributed Weight Storage + +In async RL, every training step produces new policy weights that must propagate to all inference replicas. For a 70B parameter model across 16 replicas, this means moving hundreds of gigabytes of data. **TorchStore** makes this efficient. + +### The Weight Synchronization Challenge + +Traditionally, you have two options: +1. **Build complex p2p mappings** between training and inference sharding strategies (fast but extremely complex) +2. **Use network filesystem** like NFS (simple but slow, with high infrastructure cost) + +TorchStore combines the **UX of central storage** with the **performance of in-memory p2p operations**. 
+ +### How TorchStore Works + +TorchStore is a distributed, in-memory key-value store for PyTorch tensors, built on Monarch primitives: + +```python +import torchstore as ts +from torch.distributed._tensor import distribute_tensor, Shard +from torch.distributed.device_mesh import init_device_mesh + +# Training process: store sharded weights +async def store_weights(): + device_mesh = init_device_mesh("cuda", (4,)) + tensor = model.state_dict()['layer.weight'] + dtensor = distribute_tensor(tensor, device_mesh, [Shard(0)]) + + # Each rank stores its shard + await ts.put("policy_weights_v123", dtensor) + +# Inference process: fetch with different sharding +async def load_weights(): + device_mesh = init_device_mesh("cuda", (2, 2)) # Different topology! + tensor = torch.empty_like(model.state_dict()['layer.weight']) + dtensor = distribute_tensor(tensor, device_mesh, [Shard(0)]) + + # TorchStore handles resharding automatically + await ts.get("policy_weights_v123", dtensor) +``` + +**Key Features:** + +**Automatic Resharding** +: Handles complex weight transfer between different sharding strategies transparently + +**DTensor Native** +: Works seamlessly with PyTorch's distributed tensors + +**RDMA Transfers** +: Uses RDMA for high-bandwidth data movement without blocking GPUs + +**Asynchronous Updates** +: Training and inference can read/write weights independently, enabling true async RL + +**Flexible Storage** +: Store tensors co-located with trainers, on their own storage tier, sharded or replicated - change with minimal code modifications + +### Why TorchStore Matters + +Without TorchStore, weight synchronization becomes the bottleneck in async RL. Traditional approaches either: +- Require synchronous GPU-to-GPU transfers (blocking training) +- Use slow network filesystems (minutes per update) +- Demand complex manual resharding logic (error-prone, hard to maintain) + +TorchStore solves all of these, keeping data distributed across the cluster until requested and moving it efficiently with RDMA. + +## Resource Management + +Effective resource management is crucial for training large models. + +### Memory Optimization + +**Gradient Checkpointing** +: Trade computation for memory by recomputing activations during backward pass + +**Mixed Precision** +: Use FP16/BF16 for reduced memory footprint while maintaining quality + +**Activation Offloading** +: Move activations to CPU when not needed on GPU + +**Parameter Sharding** +: Distribute model parameters across devices using FSDP + +### Compute Optimization + +**Asynchronous Execution** +: Overlap communication with computation to hide latency + +**Batch Size Tuning** +: Balance memory usage and throughput for optimal training speed + +**Dynamic Batching** +: Group requests efficiently for inference (vLLM does this) + +**Kernel Fusion** +: Combine operations to reduce memory bandwidth (torch.compile helps) + +## Distributed Training Strategies + +TorchForge leverages multiple parallelism strategies through TorchTitan: + +### Data Parallelism + +Each GPU processes different batches using the same model replica. Gradients are synchronized via all-reduce operations. + +### FSDP (Fully Sharded Data Parallel) + +**Sharding** splits model parameters, gradients, and optimizer states across multiple devices, dramatically reducing memory per GPU. FSDP is the strategy that enables training models larger than single-GPU memory. + +### Tensor Parallelism + +Split individual tensors (like large weight matrices) across devices. 
Each device computes a portion of the operation. + +### Pipeline Parallelism + +Partition model layers across devices, pipeline micro-batches through the stages for efficient utilization. + +## See Also + +- {doc}`concepts` - Core philosophy and key abstractions +- {doc}`technology_stack` - Understanding the dependency stack +- {doc}`rl_workflows` - Writing RL algorithms with these components +- {doc}`getting_started` - Installation and setup +- {doc}`usage` - Practical usage examples diff --git a/docs/source/concepts.md b/docs/source/concepts.md index 075d1ef7f..7371a813e 100644 --- a/docs/source/concepts.md +++ b/docs/source/concepts.md @@ -1,4 +1,151 @@ # Concepts -This guide covers the fundamental concepts and architecture behind TorchForge, -helping you understand how the system works and how to effectively use its components. +This guide introduces the fundamental principles and concepts behind TorchForge, helping you understand the philosophy that drives the system. + +## The Core Philosophy + +TorchForge is built on one principle: **researchers should write algorithms, not infrastructure**. + +The traditional approach to distributed RL requires you to write complex coordination logic, retry mechanisms, resource management, and synchronization code. TorchForge abstracts all of this away, letting you express RL algorithms as naturally as pseudocode while powerful infrastructure handles the distributed complexity underneath. + +## Key Abstractions + +Understanding these core abstractions helps you use TorchForge effectively: + +### Actor + +A component that encapsulates a model along with its execution logic. Actors provide: +- **Isolation**: Independent resources and failure domains +- **Flexibility**: Different parallelism strategies per actor +- **Composability**: Combine actors to create complex pipelines + +### Service + +A managed group of actor replicas with built-in routing, load balancing, and fault tolerance. Services handle operational complexity so your RL code stays clean. Think of services as horizontally scaled actors with automatic load distribution. + +### DTensor (Distributed Tensor) + +A tensor sharded across multiple devices. TorchStore handles resharding DTensors between different topologies automatically, making distributed tensor operations transparent. + +### Episode + +A complete RL interaction sequence containing: +- **Prompt**: Input to the policy +- **Response**: Generated output +- **Reward**: Feedback signal +- **Metadata**: Additional context (timestamps, model versions, etc.) + +Episodes flow through your system from generation to replay buffer to training. + +### Replay Buffer + +Stores episodes for training. Can be implemented with various strategies: +- **FIFO**: Simple queue for on-policy algorithms +- **Prioritized**: Importance sampling for off-policy learning +- **Reservoir**: Uniform sampling from history +- **Hybrid**: Mix multiple strategies + +Integrates with TorchStore for efficient distributed storage. + +## Design Principles + +### Single-Controller Model + +Traditional distributed training uses **SPMD (Single Program, Multiple Data)** - where multiple copies of the same script run across different machines, each with only a local view of the workflow. 
This works well for simple data-parallel training, but becomes notoriously difficult for complex RL workflows with: +- Asynchronous generation and training +- Multiple heterogeneous components (policy, reward model, reference model) +- Dynamic resource allocation +- Fault tolerance across components + +TorchForge adopts **Monarch's single-controller model**: You write one Python script that orchestrates all distributed resources, making them feel almost local. The code looks and feels like a single-machine program, but can scale across thousands of GPUs. + +### Composable Components + +Write your core logic once, compose it into any paradigm: +- **Synchronous on-policy** (PPO, GRPO) +- **Asynchronous off-policy** (continuous rollouts + training) +- **Hybrid approaches** (batch collection with async training) + +The same `generate_episode()` function works everywhere. Just change how you compose it. + +### Ephemeral Infrastructure + +Services are created with your job and torn down when finished: +- No standing deployments to maintain +- No infrastructure to provision ahead of time +- Want to try a new reward model? Change your Python code and rerun + +This dramatically reduces operational overhead and enables rapid experimentation. + +### Progressive Fault Tolerance + +Write code as if nothing fails. When failures do occur: +- Monarch fails fast by default (like uncaught exceptions) +- Add fine-grained fault handling exactly where you need it +- Services automatically route around failed replicas +- Failed actors restart automatically + +You choose your fault tolerance granularity based on your needs. + +## Best Practices + +### Model Selection + +- Start with smaller models for prototyping +- Use pre-configured model setups when available +- Validate configurations before large-scale training + +### Data Preparation + +- Ensure balanced and diverse training data +- Implement proper train/validation splits +- Monitor data quality throughout training +- Verify token distributions match expectations + +### Training Strategy + +- Begin with SFT before attempting GRPO +- Use gradient accumulation for larger effective batch sizes +- Monitor KL divergence to prevent policy collapse +- Implement regular checkpointing for fault tolerance +- Apply warmup schedules for stable training + +### Resource Optimization + +- Profile memory usage to identify bottlenecks +- Tune batch sizes for your hardware configuration +- Consider mixed precision training to reduce memory +- Use appropriate parallelism strategies for your model size + +### Debugging + +- Start with single-GPU training to isolate issues +- Enable verbose logging for distributed runs +- Use profiling tools to identify bottlenecks +- Validate data pipelines before full training +- Monitor loss curves and generation quality + +## Validation + +TorchForge has been validated in real-world deployments: + +- **Stanford Collaboration**: Integration with the Weaver weak verifier project, training models that hill-climb on challenging reasoning benchmarks (MATH, GPQA) +- **CoreWeave**: Large-scale training runs on 512 H100 GPU clusters with smooth, efficient performance +- **Scale**: Tested across hundreds of GPUs with continuous rollouts and asynchronous training + +## Learn More + +Dive deeper into specific topics: + +```{toctree} +:maxdepth: 1 + +architecture +technology_stack +rl_workflows +``` + +**Related Documentation:** +- {doc}`getting_started` - Installation and first training run +- {doc}`usage` - Practical configuration examples +- 
{doc}`api` - Complete API reference diff --git a/docs/source/conf.py b/docs/source/conf.py index 69b3c20b5..342188948 100644 --- a/docs/source/conf.py +++ b/docs/source/conf.py @@ -98,6 +98,8 @@ "navbar_center": "navbar-nav", "canonical_url": "https://meta-pytorch.org/forge/", "header_links_before_dropdown": 7, + "show_toc_level": 2, + "navigation_depth": 3, } theme_variables = pytorch_sphinx_theme2.get_theme_variables() @@ -117,6 +119,7 @@ "colon_fence", "deflist", "html_image", + "substitution", ] autodoc_default_options = { diff --git a/docs/source/faq.md b/docs/source/faq.md index d3c027866..ebbe4f38e 100644 --- a/docs/source/faq.md +++ b/docs/source/faq.md @@ -1,3 +1,57 @@ # FAQ -This FAQ covers common questions and issues encountered when using TorchForge. +## Common Installation Issues + +### Issue: `gh: command not found` + +**Solution**: Install GitHub CLI: +```bash +# On Ubuntu/Debian +sudo apt install gh + +# On Fedora +sudo dnf install gh + +# Then authenticate +gh auth login +``` + +### Issue: `CUDA out of memory` + +**Solution**: Reduce batch size in your config file: +```yaml +training: + batch_size: 2 # Reduced from 4 + gradient_accumulation_steps: 8 # Increased to maintain effective batch size +``` + +### Issue: `ImportError: No module named 'torch'` + +**Solution**: Ensure you activated the conda environment: +```bash +conda activate forge +``` + +### Issue: vLLM wheel download fails + +**Solution**: The vLLM wheel is hosted on GitHub releases. Ensure you're authenticated with `gh auth login` and have internet access. + +### Issue: `Unsupported GPU architecture` + +**Solution**: Check your GPU compute capability: +```bash +python -c "import torch; print(torch.cuda.get_device_capability())" +``` + +TorchForge requires compute capability 7.0 or higher (Volta architecture or newer). + +### Issue: Monarch actor spawn failures + +**Symptom**: Errors like "Failed to spawn actors" or "Process allocation failed" + +**Solution**: Verify your GPU count matches your configuration: +```bash +nvidia-smi # Check available GPUs +``` + +Ensure your config requests fewer processes than available GPUs. diff --git a/docs/source/getting_started.md b/docs/source/getting_started.md index 57e1b63c8..17e08db75 100644 --- a/docs/source/getting_started.md +++ b/docs/source/getting_started.md @@ -1,9 +1,316 @@ -# Get Started +--- +orphan: true +--- -Welcome to TorchForge! This guide will help you get up and running with TorchForge, a PyTorch-native platform specifically designed for post-training generative AI models. +# Getting Started -TorchForge specializes in post-training techniques for large language models, including: +This guide will walk you through installing TorchForge, understanding its dependencies, verifying your setup, and running your first training job. -- **Supervised Fine-Tuning (SFT)**: Adapt pre-trained models to specific tasks using labeled data -- **Generalized Reward Policy Optimization (GRPO)**: Advanced reinforcement learning for model alignment -- **Multi-GPU Distributed Training**: Efficient scaling across multiple GPUs and nodes +## System Requirements + +Before installing TorchForge, ensure your system meets the following requirements. 
+ +| Component | Requirement | Notes | +|-----------|-------------|-------| +| **Operating System** | Linux (Fedora/Ubuntu/Debian) | MacOS and Windows not currently supported | +| **Python** | 3.10 or higher | Python 3.11 recommended | +| **GPU** | NVIDIA with CUDA support | AMD GPUs not currently supported | +| **CUDA** | 12.8 or higher | Required for GPU training | +| **Minimum GPUs** | 2 for SFT, 3 for GRPO | More GPUs enable larger models | +| **RAM** | 32GB+ recommended | Depends on model size | +| **Disk Space** | 50GB+ free | For models, datasets, and checkpoints | + +## Prerequisites + +- **Conda or Miniconda**: For environment management + - Download from [conda.io](https://docs.conda.io/en/latest/miniconda.html) + +- **GitHub CLI (gh)**: Required for downloading pre-packaged dependencies + - Install instructions: [github.com/cli/cli#installation](https://github.com/cli/cli#installation) + - After installing, authenticate with: `gh auth login` + - You can use either HTTPS or SSH as the authentication protocol + +- **Git**: For cloning the repository + - Usually pre-installed on Linux systems + - Verify with: `git --version` + + +**Installation note:** The installation script provides pre-built wheels with PyTorch nightly already included. + +## Installation + +TorchForge uses pre-packaged wheels for all dependencies, making installation faster and more reliable. + +1. **Clone the Repository** + + ```bash + git clone https://github.com/meta-pytorch/forge.git + cd forge + ``` + +2. **Create Conda Environment** + + ```bash + conda create -n forge python=3.10 + conda activate forge + ``` + +3. **Run Installation Script** + + ```bash + ./scripts/install.sh + ``` + + The installation script will: + - Install system dependencies using DNF (or your package manager) + - Download pre-built wheels for PyTorch nightly, Monarch, vLLM, and TorchTitan + - Install TorchForge and all Python dependencies + - Configure the environment for GPU training + + ```{tip} + **Using sudo instead of conda**: If you prefer installing system packages directly rather than through conda, use: + `./scripts/install.sh --use-sudo` + ``` + +4. **Verify Installation** + + Test that TorchForge is properly installed: + + ```bash + python -c "import forge; print(f'TorchForge version: {forge.__version__}')" + python -c "import monarch; print('Monarch: OK')" + python -c "import vllm; print(f'vLLM version: {vllm.__version__}')" + ``` + + ```{warning} + When adding packages to `pyproject.toml`, use `uv sync --inexact` to avoid removing Monarch and vLLM. + ``` + +## Verifying Your Setup + +After installation, verify that all components are working correctly: + +1. **Check GPU Availability** + + ```bash + python -c "import torch; print(f'GPUs available: {torch.cuda.device_count()}')" + ``` + + Expected output: `GPUs available: 2` (or more) + +2. **Check CUDA Version** + + ```bash + python -c "import torch; print(f'CUDA version: {torch.version.cuda}')" + ``` + + Expected output: `CUDA version: 12.8` (or higher) + +3. **Check All Dependencies** + + ```bash + # Check core components + python -c "import torch, forge, monarch, vllm; print('All imports successful')" + + # Check specific versions + python -c " + import torch + import forge + import vllm + + print(f'PyTorch: {torch.__version__}') + print(f'TorchForge: {forge.__version__}') + print(f'vLLM: {vllm.__version__}') + print(f'CUDA: {torch.version.cuda}') + print(f'GPUs: {torch.cuda.device_count()}') + " + ``` + +4. 
**Verify Monarch** + + ```bash + python -c " + from monarch.actor import Actor, this_host + + # Test basic Monarch functionality + procs = this_host().spawn_procs({'gpus': 1}) + print('Monarch: Process spawning works') + " + ``` + +## Quick Start Examples + +Now that TorchForge is installed, let's run some training examples. + +### Example 1: Supervised Fine-Tuning (SFT) + +Fine-tune Llama 3 8B on your data. **Requires: 2+ GPUs** + +1. **Download the Model** + + ```bash + uv run forge download meta-llama/Meta-Llama-3.1-8B-Instruct \ + --output-dir /tmp/Meta-Llama-3.1-8B-Instruct \ + --ignore-patterns "original/consolidated.00.pth" + ``` + + ```{note} + Model downloads require Hugging Face authentication. Run `huggingface-cli login` first if you haven't already. + ``` + +2. **Run Training** + + ```bash + uv run forge run --nproc_per_node 2 \ + apps/sft/main.py \ + --config apps/sft/llama3_8b.yaml + ``` + + **What's Happening:** + - `--nproc_per_node 2`: Use 2 GPUs for training + - `apps/sft/main.py`: SFT training script + - `--config apps/sft/llama3_8b.yaml`: Configuration file with hyperparameters + - **TorchTitan** handles model sharding across the 2 GPUs + - **Monarch** coordinates the distributed training + + **Expected Output:** + ``` + Initializing process group... + Loading model from /tmp/Meta-Llama-3.1-8B-Instruct... + Starting training... + Epoch 1/10 | Step 100 | Loss: 2.45 | LR: 0.0001 + ... + ``` + +### Example 2: GRPO Training + +Train a model using reinforcement learning with GRPO. **Requires: 3+ GPUs** + +```bash +python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml +``` + +**What's Happening:** +- GPU 0: Policy model (being trained, powered by TorchTitan) +- GPU 1: Reference model (frozen baseline) +- GPU 2: Reward model (scoring outputs, powered by vLLM) +- **Monarch** orchestrates all three components +- **TorchStore** handles weight synchronization from training to inference + +**Expected Output:** +``` +Initializing GRPO training... +Loading policy model on GPU 0... +Loading reference model on GPU 1... +Loading reward model on GPU 2... +Episode 1 | Avg Reward: 0.75 | KL Divergence: 0.12 +... +``` + +### Example 3: Inference with vLLM + +Generate text using a trained model: + +```bash +python -m apps.vllm.main --config apps/vllm/llama3_8b.yaml +``` + +This loads your model with vLLM and starts an interactive generation session. + +## Understanding Configuration Files + +TorchForge uses YAML configuration files to manage training parameters. Let's examine a typical config: + +```yaml +# Example: apps/sft/llama3_8b.yaml +model: + name: meta-llama/Meta-Llama-3.1-8B-Instruct + path: /tmp/Meta-Llama-3.1-8B-Instruct + +training: + batch_size: 4 + learning_rate: 1e-5 + num_epochs: 10 + gradient_accumulation_steps: 4 + +distributed: + strategy: fsdp # Managed by TorchTitan + precision: bf16 + +checkpointing: + save_interval: 1000 + output_dir: /tmp/checkpoints +``` + +**Key Sections:** +- **model**: Model path and settings +- **training**: Hyperparameters like batch size and learning rate +- **distributed**: Multi-GPU strategy (FSDP, tensor parallel, etc.) handled by TorchTitan +- **checkpointing**: Where and when to save model checkpoints + +See {doc}`usage` for detailed configuration options. + +## Next Steps + +Now that you have TorchForge installed and verified: + +1. **Learn the Concepts**: Read {doc}`concepts` to understand TorchForge's architecture, including Monarch, Services, and TorchStore +2. 
**Explore Examples**: Check the `apps/` directory for more training examples +3. **Customize Training**: See {doc}`usage` for configuration patterns +4. **Read Tutorials**: Follow {doc}`tutorials` for step-by-step guides +5. **API Documentation**: Explore {doc}`api` for detailed API reference + +## Getting Help + +If you encounter issues: + +1. **Check the FAQ**: {doc}`faq` covers common questions and solutions +2. **Search Issues**: Look through [GitHub Issues](https://github.com/meta-pytorch/forge/issues) +3. **File a Bug Report**: Create a new issue with: + - Your system configuration (output of diagnostic command below) + - Full error message + - Steps to reproduce + - Expected vs actual behavior + +**Diagnostic command:** +```bash +python -c " +import torch +import forge + +try: + import monarch + monarch_status = 'OK' +except Exception as e: + monarch_status = str(e) + +try: + import vllm + vllm_version = vllm.__version__ +except Exception as e: + vllm_version = str(e) + +print(f'PyTorch: {torch.__version__}') +print(f'TorchForge: {forge.__version__}') +print(f'Monarch: {monarch_status}') +print(f'vLLM: {vllm_version}') +print(f'CUDA: {torch.version.cuda}') +print(f'GPUs: {torch.cuda.device_count()}') +" +``` + +Include this output in your bug reports! + +## Additional Resources + +- **GitHub Repository**: [github.com/meta-pytorch/forge](https://github.com/meta-pytorch/forge) +- **Example Notebooks**: Check `demo.ipynb` in the repository root +- **Contributing Guide**: [CONTRIBUTING.md](https://github.com/meta-pytorch/forge/blob/main/CONTRIBUTING.md) +- **Code of Conduct**: [CODE_OF_CONDUCT.md](https://github.com/meta-pytorch/forge/blob/main/CODE_OF_CONDUCT.md) +- **Monarch Documentation**: [meta-pytorch.org/monarch](https://meta-pytorch.org/monarch) +- **vLLM Documentation**: [docs.vllm.ai](https://docs.vllm.ai) +- **TorchTitan**: [github.com/pytorch/torchtitan](https://github.com/pytorch/torchtitan) + +--- + +**Ready to start training?** Head to {doc}`usage` for practical configuration examples and workflows, or dive into {doc}`concepts` to understand how all the pieces work together. diff --git a/docs/source/index.md b/docs/source/index.md index c450b4ca7..b9e3a1475 100644 --- a/docs/source/index.md +++ b/docs/source/index.md @@ -1,25 +1,233 @@ -# Welcome to TorchForge documentation! +# TorchForge Documentation -**TorchForge** is a PyTorch-native platform specifically designed -for post-training generative AI models. +**TorchForge** is a PyTorch-native library for RL post-training and agentic development. Built on the principle that **researchers should write algorithms, not infrastructure**. -Key Features ------------- +```{note} +**Experimental Status:** TorchForge is currently in early development. Expect bugs, incomplete features, and API changes. Please file issues on [GitHub](https://github.com/meta-pytorch/forge) for bug reports and feature requests. +``` + +## Why TorchForge? + +Reinforcement Learning has become essential to frontier AI - from instruction following and reasoning to complex research capabilities. But infrastructure complexity often dominates the actual research. + +TorchForge lets you **express RL algorithms as naturally as pseudocode**, while powerful infrastructure handles distribution, fault tolerance, and optimization underneath. 
+ +### Core Design Principles + +- **Algorithms, Not Infrastructure**: Write your RL logic without distributed systems code +- **Any Degree of Asynchrony**: From fully synchronous PPO to fully async off-policy training +- **Composable Components**: Mix and match proven frameworks (vLLM, TorchTitan) with custom logic +- **Built on Solid Foundations**: Leverages Monarch's single-controller model for simplified distributed programming + +## Foundation: The Technology Stack + +TorchForge is built on carefully selected, battle-tested components: + +::::{grid} 1 1 2 2 +:gutter: 3 + +:::{grid-item-card} **Monarch** +:link: https://meta-pytorch.org/monarch + +Single-controller distributed programming framework that orchestrates clusters like you'd program a single machine. Provides actor meshes, fault tolerance, and RDMA-based data transfers. + +**Why it matters:** Eliminates SPMD complexity, making distributed RL tractable +::: + +:::{grid-item-card} **vLLM** +:link: https://docs.vllm.ai + +High-throughput, memory-efficient inference engine with PagedAttention and continuous batching. + +**Why it matters:** Handles policy generation efficiently at scale +::: + +:::{grid-item-card} **TorchTitan** +:link: https://github.com/pytorch/torchtitan + +Meta's production-grade LLM training platform with FSDP, pipeline parallelism, and tensor parallelism. + +**Why it matters:** Battle-tested training infrastructure proven at scale +::: + +:::{grid-item-card} **TorchStore** +:link: https://github.com/meta-pytorch/torchstore + +Distributed, in-memory key-value store for PyTorch tensors built on Monarch, optimized for weight synchronization with automatic DTensor resharding. + +**Why it matters:** Solves the weight transfer bottleneck in async RL +::: + +:::: + +## What You Can Build + +::::{grid} 1 1 2 3 +:gutter: 2 + +:::{grid-item-card} Supervised Fine-Tuning +Adapt foundation models to specific tasks using labeled data with efficient multi-GPU training. +::: + +:::{grid-item-card} GRPO Training +Train models with Generalized Reward Policy Optimization for aligning with human preferences. +::: + +:::{grid-item-card} Asynchronous RL +Continuous rollout generation with non-blocking training for maximum throughput. +::: + +:::{grid-item-card} Code Execution +Safe, sandboxed code execution environments for RL on coding tasks (RLVR). +::: + +:::{grid-item-card} Tool Integration +Extensible environment system for agents that interact with tools and APIs. +::: + +:::{grid-item-card} Custom Workflows +Build your own components and compose them naturally with existing infrastructure. +::: + +:::: + +## Requirements at a Glance + +Before diving in, ensure your system meets these requirements: + +| Component | Requirement | Why It's Needed | +|-----------|-------------|-----------------| +| **Operating System** | Linux (Fedora/Ubuntu) | Dependency compatibility | +| **Python** | 3.10+ | Core runtime | +| **CUDA** | 12.8+ | GPU acceleration | +| **GPUs** | 2+ for SFT, 3+ for GRPO | Distributed training & separate policy/ref/reward models | +| **PyTorch** | Nightly build | Latest distributed features (DTensor, FSDP) | +| **Monarch** | Pre-packaged wheel | Distributed orchestration and actor system | +| **vLLM** | v0.10.0+ | Fast inference with PagedAttention | +| **TorchTitan** | Latest | Production training infrastructure | + +See {doc}`getting_started` for detailed installation instructions. 
+ +## Quick Start + +Here's what training looks like with TorchForge: -* **Post-Training Focus**: Specializes in techniques - like Supervised Fine-Tuning (SFT) and Generalized Reward Policy Optimization (GRPO) -* **PyTorch Integration**: Built natively on PyTorch with - dependencies on [PyTorch nightly](https://pytorch.org/get-started/locally/), - [Monarch](https://meta-pytorch.org/monarch), [vLLM](https://docs.vllm.ai/en/latest/), - and [TorchTitan](https://github.com/pytorch/torchtitan). -* **Multi-GPU Support**: Designed for distributed training - with minimum 3 GPU requirement for GRPO training -* **Model Support**: Includes pre-configured setups for popular models - like Llama3 8B and Qwen3.1 7B +```bash +# Install dependencies +conda create -n forge python=3.10 +conda activate forge +git clone https://github.com/meta-pytorch/forge +cd forge +./scripts/install.sh + +# Download a model +uv run forge download meta-llama/Meta-Llama-3.1-8B-Instruct \ + --output-dir /tmp/Meta-Llama-3.1-8B-Instruct + +# Run SFT training (requires 2+ GPUs) +uv run forge run --nproc_per_node 2 \ + apps/sft/main.py --config apps/sft/llama3_8b.yaml + +# Run GRPO training (requires 3+ GPUs) +python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml +``` + +## Writing RL Code + +With TorchForge, your RL logic looks like pseudocode: + +```python +async def generate_episode(dataloader, policy, reward, replay_buffer): + # Sample a prompt + prompt, target = await dataloader.sample.route() + + # Generate response + response = await policy.generate.route(prompt) + + # Score the response + reward_value = await reward.evaluate_response.route( + prompt=prompt, + response=response.text, + target=target + ) + + # Store for training + await replay_buffer.add.route( + Episode(prompt_ids=response.prompt_ids, + response_ids=response.token_ids, + reward=reward_value) + ) +``` + +No retry logic, no resource management, no synchronization code - just your algorithm. + +## Documentation Paths + +Choose your learning path: + +::::{grid} 1 1 2 2 +:gutter: 3 + +:::{grid-item-card} 🚀 Getting Started +:link: getting_started +:link-type: doc + +Installation, prerequisites, verification, and your first training run. + +**Time to first run: ~15 minutes** +::: + +:::{grid-item-card} 🧠 Core Concepts +:link: concepts +:link-type: doc + +Architecture, Monarch integration, Services, TorchStore, and how everything works together. + +**For understanding the system** +::: + +:::{grid-item-card} 💻 Usage Patterns +:link: usage +:link-type: doc + +Configuration examples, common workflows, and practical scenarios. + +**For day-to-day development** +::: + +:::{grid-item-card} 📖 API Reference +:link: api +:link-type: doc + +Complete API documentation for customization and extension. 
+ +**For deep integration** +::: + +:::: + +## Validation & Partnerships + +TorchForge has been validated in real-world deployments: + +- **Stanford Collaboration**: Integration with the Weaver weak verifier project, training models that hill-climb on challenging reasoning benchmarks (MATH, GPQA) +- **CoreWeave**: Large-scale training on 512 H100 GPU clusters with smooth, efficient performance +- **Scale**: Tested across hundreds of GPUs with continuous rollouts and asynchronous training + +## Community + +- **GitHub**: [meta-pytorch/forge](https://github.com/meta-pytorch/forge) +- **Issues**: [Report bugs and request features](https://github.com/meta-pytorch/forge/issues) +- **Contributing**: [CONTRIBUTING.md](https://github.com/meta-pytorch/forge/blob/main/CONTRIBUTING.md) +- **Code of Conduct**: [CODE_OF_CONDUCT.md](https://github.com/meta-pytorch/forge/blob/main/CODE_OF_CONDUCT.md) + +```{tip} +Before starting significant work, signal your intention in the issue tracker to coordinate with maintainers. +``` ```{toctree} -:maxdepth: 1 -:caption: Contents: +:maxdepth: 2 +:caption: Documentation getting_started concepts @@ -29,8 +237,11 @@ api faq ``` -## Indices and tables +## Indices + +* {ref}`genindex` - Index of all documented objects +* {ref}`modindex` - Python module index + +--- -* {ref}`genindex` -* {ref}`modindex` -* {ref}`search` +**License**: BSD 3-Clause | **GitHub**: [meta-pytorch/forge](https://github.com/meta-pytorch/forge) diff --git a/docs/source/rl_workflows.md b/docs/source/rl_workflows.md new file mode 100644 index 000000000..9a828e271 --- /dev/null +++ b/docs/source/rl_workflows.md @@ -0,0 +1,375 @@ +# RL Workflows + +This guide shows you how to write RL algorithms with TorchForge, from simple episode generation to complex asynchronous training loops. + +## Writing RL Algorithms + +With TorchForge's foundations (Monarch, Services, TorchStore), here's what RL code looks like: + +### Episode Generation + +```python +async def generate_episode(dataloader, policy, reward, replay_buffer): + # Sample a prompt + prompt, target = await dataloader.sample.route() + + # Generate response (vLLM handles this efficiently) + response = await policy.generate.route(prompt) + + # Score the response + reward_value = await reward.evaluate_response.route( + prompt=prompt, + response=response.text, + target=target + ) + + # Store for training + await replay_buffer.add.route( + Episode( + prompt_ids=response.prompt_ids, + response_ids=response.token_ids, + reward=reward_value + ) + ) +``` + +Notice what's **not** here: +- No retry logic +- No resource allocation +- No synchronization code +- No infrastructure complexity + +Just your algorithm. 
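+
+The `Episode` container used above isn't defined in this guide. As a rough sketch, it carries the fields described in {doc}`concepts` (the actual class in TorchForge may differ):
+
+```python
+from dataclasses import dataclass, field
+
+@dataclass
+class Episode:
+    """Illustrative only -- mirrors the fields described in the Concepts guide."""
+    prompt_ids: list[int]        # tokenized prompt fed to the policy
+    response_ids: list[int]      # tokens generated by the policy
+    reward: float                # scalar feedback from the reward service
+    metadata: dict = field(default_factory=dict)  # e.g. timestamps, policy version
+```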
+ +### Asynchronous RL + +Compose this into fully async, off-policy training: + +```python +async def async_rl_loop(num_rollout_loops: int): + # Multiple concurrent rollout generators + rollout_tasks = [ + asyncio.create_task(continuous_rollouts()) + for _ in range(num_rollout_loops) + ] + + # Continuous training + training_task = asyncio.create_task(continuous_training()) + + await asyncio.gather(*rollout_tasks, training_task) + +async def continuous_rollouts(): + """Generate rollouts continuously using latest policy.""" + while True: + await generate_episode(dataloader, policy, reward, replay_buffer) + +async def continuous_training(): + """Train continuously on available experience.""" + training_step = 0 + while True: + batch = await replay_buffer.sample.route( + curr_policy_version=training_step + ) + + if batch is None: + await asyncio.sleep(0.1) # Wait for more experience + else: + loss = await trainer.train_step.route(batch) + training_step += 1 + + # Push updated weights (TorchStore handles this) + await trainer.push_weights.route(training_step) + # Broadcast to all policy replicas + await policy.update_weights.fanout(training_step) +``` + +### Synchronous RL + +The same `generate_episode()` function works for on-policy algorithms like PPO - just compose it differently: + +```python +async def synchronous_rl(batch_size: int): + """Synchronous on-policy RL: collect batch, then train.""" + version = 0 + + while True: + # Collect a full batch with current policy version + for _ in range(batch_size): + await generate_episode(dataloader, policy, reward, replay_buffer) + + # Sample the batch we just collected + batch = await replay_buffer.sample.route( + curr_policy_version=version, + batch_size=batch_size + ) + + # Train on the complete batch + loss = await trainer.train_step.route(batch) + + # Update weights in lockstep + await trainer.push_weights.route(version + 1) + await policy.update_weights.fanout(version + 1) + version += 1 +``` + +**The Power of Composition**: Write your rollout logic once, compose it into any paradigm - on-policy, off-policy, or anywhere in between. + +## Extensible Environments + +RL often requires interacting with environments beyond text generation - executing code, using tools, running simulations. TorchForge makes these first-class citizens through the same service abstraction. + +### Code Execution + +For RL on coding tasks (RLVR - Reinforcement Learning with Verifiable Rewards): + +```python +# Lightweight CPU-only service for parallel execution +coder = SandboxedPythonCoder.options( + procs=1, + with_gpus=False, + num_replicas=16 +).as_service() + +# In your RL code +async def generate_episode(): + prompt = await dataloader.sample.route() + code = await policy.generate.route(prompt) + + # Execute safely in sandbox + stdout, stderr = await coder.execute.route(code) + reward = 1.0 if stderr == "" else 0.0 # Reward based on execution + + await replay_buffer.add.route(Episode(...)) +``` + +### Tool Integration + +Services make tools ephemeral - spawn them with your job, scale them independently, tear down when finished. The same coordination primitives work for any environment type. 
+ +```python +# Create a web browsing environment +browser = WebBrowsingEnv.options( + procs=1, + with_gpus=False, + num_replicas=8 +).as_service() + +# Use it in your RL loop +async def generate_episode(): + task = await dataloader.sample.route() + + # Agent decides on actions + action = await policy.generate.route(task) + + # Execute action in browser + result = await browser.execute_action.route(action) + + # Evaluate outcome + reward = await reward_model.evaluate.route(task, result) + + await replay_buffer.add.route(Episode(...)) +``` + +This pattern extends naturally to **agentic workflows** - agents that interact with tools, query APIs, and navigate complex environments while learning from outcomes. + +### Custom Environments + +Build your own environment service: + +```python +from monarch.actor import Actor, endpoint + +class CustomEnv(Actor): + def __init__(self): + # Initialize your environment + self.state = self.reset() + + @endpoint + async def reset(self): + """Reset environment to initial state.""" + return initial_state + + @endpoint + async def step(self, action): + """Execute action and return (observation, reward, done).""" + # Your environment logic here + return observation, reward, done + +# Deploy as a service +env = CustomEnv.options( + procs=1, + num_replicas=16 +).as_service() + +# Use in training +obs = await env.reset.route() +while not done: + action = await policy.act.route(obs) + obs, reward, done = await env.step.route(action) +``` + +## Common Patterns + +### Warmup Phase + +Start training after collecting initial experience: + +```python +async def warmup_then_train(warmup_episodes: int): + # Collect initial experience + for _ in range(warmup_episodes): + await generate_episode(dataloader, policy, reward, replay_buffer) + + # Now start training + await continuous_training() +``` + +### Evaluation Episodes + +Interleave evaluation with training: + +```python +async def train_with_eval(eval_interval: int): + training_step = 0 + + while True: + # Training phase + for _ in range(eval_interval): + await generate_episode(dataloader, policy, reward, replay_buffer) + batch = await replay_buffer.sample.route() + await trainer.train_step.route(batch) + training_step += 1 + + # Evaluation phase + eval_rewards = [] + for _ in range(100): + episode = await generate_episode( + eval_dataloader, policy, reward, None # Don't store in buffer + ) + eval_rewards.append(episode.reward) + + print(f"Step {training_step}: Eval reward = {np.mean(eval_rewards)}") +``` + +### Curriculum Learning + +Gradually increase task difficulty: + +```python +async def curriculum_training(): + difficulty = 0 + + while difficulty < max_difficulty: + # Train on current difficulty + for _ in range(episodes_per_difficulty): + prompt = await dataloader.sample.route(difficulty=difficulty) + await generate_episode_with_prompt(prompt, policy, reward, replay_buffer) + + # Evaluate performance + success_rate = await evaluate(policy, difficulty) + + # Move to next difficulty if threshold met + if success_rate > 0.8: + difficulty += 1 + print(f"Advanced to difficulty {difficulty}") +``` + +### Multi-Task Training + +Train on multiple tasks simultaneously: + +```python +async def multi_task_training(tasks: List[str]): + # Create separate dataloaders for each task + dataloaders = {task: create_dataloader(task) for task in tasks} + + while True: + # Sample task uniformly (or with custom distribution) + task = random.choice(tasks) + dataloader = dataloaders[task] + + # Generate episode for this task + await 
generate_episode(dataloader, policy, reward, replay_buffer) + + # Train uses mixed experience from all tasks + batch = await replay_buffer.sample.route() + await trainer.train_step.route(batch) +``` + +## Debugging Tips + +### Start Small + +Begin with a minimal setup to validate your logic: + +```python +# Single GPU, single replica, synchronous +policy = PolicyActor.options(procs=1, with_gpus=True).as_service() +reward = RewardActor.options(procs=1, with_gpus=True).as_service() + +# Run a few episodes +for _ in range(10): + await generate_episode(dataloader, policy, reward, replay_buffer) +``` + +Once this works, scale up to multi-GPU and async training. + +### Add Logging + +Insert logging at key points: + +```python +async def generate_episode(dataloader, policy, reward, replay_buffer): + start = time.time() + + prompt, target = await dataloader.sample.route() + print(f"Sampled prompt in {time.time() - start:.2f}s") + + gen_start = time.time() + response = await policy.generate.route(prompt) + print(f"Generated response in {time.time() - gen_start:.2f}s") + + reward_start = time.time() + reward_value = await reward.evaluate_response.route(prompt, response.text, target) + print(f"Computed reward in {time.time() - reward_start:.2f}s") + + await replay_buffer.add.route(Episode(...)) + print(f"Total episode time: {time.time() - start:.2f}s") +``` + +### Monitor Metrics + +Track key metrics: + +```python +from collections import deque + +recent_rewards = deque(maxlen=100) +recent_kls = deque(maxlen=100) + +async def continuous_training(): + training_step = 0 + + while True: + batch = await replay_buffer.sample.route() + if batch: + loss, kl = await trainer.train_step.route(batch) + recent_kls.append(kl) + + if training_step % 100 == 0: + print(f"Step {training_step}") + print(f" Avg reward: {np.mean(recent_rewards):.3f}") + print(f" Avg KL: {np.mean(recent_kls):.3f}") + print(f" Loss: {loss:.3f}") + + training_step += 1 +``` + +## See Also + +- {doc}`concepts` - Core philosophy and abstractions +- {doc}`architecture` - How Services and TorchStore enable these patterns +- {doc}`technology_stack` - Understanding the underlying components +- {doc}`usage` - Configuration and practical examples +- {doc}`tutorials` - Step-by-step guides +- {doc}`api` - Complete API reference diff --git a/docs/source/technology_stack.md b/docs/source/technology_stack.md new file mode 100644 index 000000000..1651f26c7 --- /dev/null +++ b/docs/source/technology_stack.md @@ -0,0 +1,120 @@ +# Technology Stack + +TorchForge is built on a carefully curated stack of battle-tested components, each solving specific challenges in distributed RL. Understanding this stack helps you troubleshoot issues, optimize performance, and customize your setup. + +## Monarch: The Distributed Foundation + +**What it is:** Monarch is a PyTorch-native distributed programming framework that brings single-controller orchestration to entire clusters. It's implemented with a Python frontend and Rust backend for performance and robustness. 
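+
+In code, the single-controller style looks like the actor example in {doc}`architecture`: one script spawns a process mesh and drives every actor in it (a minimal sketch reusing the APIs shown in that guide):
+
+```python
+from monarch.actor import Actor, endpoint, this_host
+
+class EchoActor(Actor):
+    @endpoint
+    def echo(self, msg):
+        return msg
+
+# One controller script orchestrates the whole mesh
+procs = this_host().spawn_procs({"gpus": 4})
+actors = procs.spawn("echo", EchoActor)
+results = actors.echo.call_all("hello")   # broadcast to every actor in the mesh
+```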
+ +**Why TorchForge needs it:** +- **Single-Controller Model**: Write code that looks like a single Python program but scales to thousands of GPUs +- **Actor Meshes**: Organize processes and actors into scalable, multi-dimensional arrays +- **Fault Tolerance**: Progressive fault handling with fast failure detection and recovery +- **RDMA Support**: Direct GPU-to-GPU memory transfers for efficient data movement + +**What it solves:** Traditional SPMD (Single Program, Multiple Data) approaches require complex coordination logic in your code. Monarch abstracts this away, letting you write RL algorithms naturally while it handles distributed complexity underneath. + +**Technical capabilities:** +- Scalable messaging with multicast trees +- Multipart messaging for zero-copy data transfers +- Integration with PyTorch's distributed primitives +- Separation of control plane (messaging) and data plane (bulk transfers) + +**Where you see it:** Every service creation, actor spawn, and distributed operation in TorchForge runs on Monarch primitives. It's the invisible orchestration layer that makes distributed RL feel simple. + +## vLLM: High-Performance Inference + +**What it is:** A fast and memory-efficient inference engine optimized for large language models, version 0.10.0 or higher required. + +**Why TorchForge needs it:** +- **PagedAttention**: Memory-efficient attention mechanism that reduces memory fragmentation +- **Continuous Batching**: Dynamic batching that maximizes GPU utilization +- **High Throughput**: Handles generation for multiple concurrent rollouts efficiently +- **Production-Ready**: Battle-tested at scale with proven performance + +**What it solves:** In RL for LLMs, policy generation is often the bottleneck. Autoregressive generation is costly, and blocking training on it kills throughput. vLLM enables fast, efficient inference that doesn't bottleneck your training loop. + +**Integration depth:** TorchForge integrates directly with vLLM's engine, giving you access to customize generation strategies, memory management, and inference logic as your research demands. You control the vLLM configuration while TorchForge handles distributed orchestration. + +**Where you see it:** Every policy generation call in your RL code runs through vLLM, whether you're doing synchronous PPO-style rollouts or fully asynchronous off-policy training. + +## TorchTitan: Production Training Infrastructure + +**What it is:** Meta's production-grade LLM training platform with advanced parallelism support, used to train Llama models on thousands of GPUs. + +**Why TorchForge needs it:** +- **FSDP (Fully Sharded Data Parallel)**: Shard parameters, gradients, and optimizer states across GPUs +- **Pipeline Parallelism**: Split model layers across devices with efficient micro-batching +- **Tensor Parallelism**: Split individual tensors across devices for very large models +- **Proven at Scale**: Battle-tested optimizations from production training runs + +**What it solves:** Modern models are too large to fit on single GPUs. TorchTitan provides the sophisticated sharding and parallelism strategies needed to train them efficiently, with optimizations proven in production environments. + +**Integration depth:** TorchForge provides direct access to TorchTitan's training step logic and sharding strategies, enabling experimentation without framework constraints. You can customize the training loop while leveraging TorchTitan's proven infrastructure. 
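+
+Elsewhere in these docs, training is driven through a trainer actor's `train_step` endpoint (`trainer.train_step.route(batch)`). A toy sketch of the shape of such an actor - with a stand-in linear model in place of a TorchTitan-sharded LLM and an illustrative loss - might look like this:
+
+```python
+import torch
+from monarch.actor import Actor, endpoint
+
+class TrainerActor(Actor):
+    """Sketch only: TorchForge's real trainer wraps TorchTitan's sharded model
+    and training step; a toy linear model stands in here."""
+    def __init__(self):
+        self.model = torch.nn.Linear(16, 1)
+        self.optimizer = torch.optim.AdamW(self.model.parameters(), lr=1e-5)
+
+    @endpoint
+    async def train_step(self, batch):
+        inputs, targets = batch            # your RL or SFT loss goes here instead
+        loss = torch.nn.functional.mse_loss(self.model(inputs), targets)
+        loss.backward()
+        self.optimizer.step()
+        self.optimizer.zero_grad()
+        return loss.item()
+```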
+ +**Where you see it:** Policy training, whether supervised fine-tuning or RL policy updates, runs through TorchTitan's training infrastructure with your choice of parallelism strategies. + +## TorchStore: Distributed Weight Storage + +**What it is:** A distributed, in-memory key-value store for PyTorch tensors, built on Monarch primitives, designed specifically for weight synchronization in distributed RL. + +**Why TorchForge needs it:** +- **Automatic Resharding**: Handles complex weight transfer between different sharding strategies +- **DTensor Support**: Native support for distributed tensors with automatic topology conversion +- **RDMA Transfers**: High-bandwidth weight movement without synchronous GPU transfers +- **Asynchronous Updates**: Training and inference can read/write weights independently + +**What it solves:** In async RL, new policy weights must propagate to all inference replicas. For a 70B parameter model across 16 replicas, this means moving hundreds of gigabytes. Traditional approaches either require synchronous GPU-to-GPU transfers (blocking training), use slow network filesystems (minutes per update), or demand complex manual resharding logic (error-prone). TorchStore solves all of these. + +**Technical capabilities:** +- Simple key-value interface with complex optimizations underneath +- Stays distributed across the cluster until requested +- Flexible storage: co-located with trainers, on storage tier, sharded or replicated + +**Where you see it:** Weight synchronization between training and inference, allowing training to continue while inference replicas asynchronously fetch updated weights without blocking either process. + +## PyTorch Nightly: Cutting-Edge Features + +**Why Nightly is required:** TorchForge leverages the latest PyTorch features that aren't yet in stable releases: +- **Native DTensor Support**: Distributed tensors that span multiple devices with automatic sharding +- **Compiled Mode Optimizations**: Performance improvements through torch.compile +- **Advanced Memory Management**: Latest FSDP and memory optimization features +- **Bug Fixes**: Continuous improvements to distributed training primitives + +**Where you see it:** Every tensor operation, distributed primitive, and training optimization builds on PyTorch nightly's latest capabilities. + +## The Integration Philosophy + +TorchForge made a conscious decision not to reinvent the wheel. Instead, we integrate battle-tested components and add the coordination layer that makes them work together seamlessly. + +**What you get:** +- **Direct component access**: Customize deeply when your research demands it +- **Proven performance**: Battle-tested at massive scale in production environments +- **Flexible composition**: Mix and match components or replace them with custom implementations +- **Simplified orchestration**: TorchForge coordinates these components so you write algorithms, not infrastructure + +**TorchForge's role:** Coordination. We make these powerful but complex components work together seamlessly, exposing simple APIs for distributed RL while preserving deep customization capabilities when you need them. 
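+
+Put together, a single controller script composes these components. A compressed sketch, reusing the service and weight-update calls from {doc}`rl_workflows` (the actor classes and replay buffer are assumed to be defined as in that guide):
+
+```python
+import asyncio
+
+# Inference service (vLLM underneath) and trainer (TorchTitan underneath)
+policy = PolicyActor.options(procs=8, with_gpus=True, num_replicas=16).as_service()
+trainer = TrainerActor.options(procs=8, with_gpus=True).as_service()
+
+async def train_loop():
+    version = 0
+    while True:
+        batch = await replay_buffer.sample.route(curr_policy_version=version)
+        if batch is None:
+            await asyncio.sleep(0.1)                  # wait for rollouts to produce experience
+            continue
+        await trainer.train_step.route(batch)
+        version += 1
+        await trainer.push_weights.route(version)     # weights go into TorchStore
+        await policy.update_weights.fanout(version)   # replicas fetch them via RDMA
+```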
+ +## Installation + +All these components are installed automatically through TorchForge's installation script: + +```bash +git clone https://github.com/meta-pytorch/forge.git +cd forge +conda create -n forge python=3.10 +conda activate forge +./scripts/install.sh +``` + +The script provides pre-built wheels for PyTorch nightly, Monarch, vLLM, and TorchTitan, ensuring compatibility and reducing installation time. + +See {doc}`getting_started` for detailed installation instructions and troubleshooting. + +## See Also + +- {doc}`concepts` - Core philosophy and key abstractions +- {doc}`architecture` - How Monarch, Services, and TorchStore work together +- {doc}`rl_workflows` - Using these components to write RL algorithms +- {doc}`getting_started` - Installation and setup guide +- {doc}`usage` - Practical configuration examples