From b1794a9ac3e01e666072174e32af4150a2384aca Mon Sep 17 00:00:00 2001 From: Alanna Burke Date: Wed, 8 Oct 2025 16:58:33 -0400 Subject: [PATCH 1/8] Updating homepage, getting started, concepts. --- docs/source/concepts.md | 243 ++++++++++++++++++++++ docs/source/conf.py | 1 + docs/source/getting_started.md | 355 ++++++++++++++++++++++++++++++++- docs/source/index.md | 169 ++++++++++++++-- 4 files changed, 742 insertions(+), 26 deletions(-) diff --git a/docs/source/concepts.md b/docs/source/concepts.md index 075d1ef7f..bd1c52d84 100644 --- a/docs/source/concepts.md +++ b/docs/source/concepts.md @@ -2,3 +2,246 @@ This guide covers the fundamental concepts and architecture behind TorchForge, helping you understand how the system works and how to effectively use its components. + +## Architecture Overview + +TorchForge is built as a modular, distributed system designed specifically for post-training large language models. The architecture follows a clear separation of concerns, with specialized components handling different aspects of the training pipeline. + +### Core Components + +The system is organized into several key layers: + +**Controller** +: The orchestration component responsible for coordinating distributed training across multiple GPUs and nodes. The [Controller](api_controller.md) is the orchestration hub that manages resource allocation and scheduling, inter-process communication coordination, fault tolerance and recovery, and distributed state management. + +**Actor** +: A component responsible for executing training or inference tasks. [Actors](api_actors.md) are the workhorses of TorchForge, handling model operations: +- **Policy Actor**: Trains the primary model being optimized +- **Reward Actor**: Evaluates generated outputs and provides reward signals +- **Reference Actor**: Maintains a frozen baseline for KL divergence computation +- **Inference Actor**: Handles efficient generation of model outputs + +**Data Layer** +: The [Data](api_data.md) layer manages all aspects of data handling including dataset loading and preprocessing, batch construction and sampling, data distribution across workers, and custom data models and transformations. A **batch** represents the number of training examples processed together in a single forward and backward pass through the model. + +**Environment Layer** +: [Environments](api_envs.md) provide execution contexts for both training and inference operations. They handle training environments for supervised and reinforcement learning, inference environments for generation and evaluation, and resource management with device allocation. + +## Post-Training and Fine-Tuning + +**Post-Training** refers to training phases that occur after initial pre-training. While **pre-training** is the initial phase where a model learns general language understanding from large text corpora, post-training adapts the model to specific tasks and behaviors. + +**Fine-Tuning** is the process of adapting a pre-trained model to a specific task or domain by continuing training on task-specific data. TorchForge supports two primary post-training paradigms: + +### Supervised Fine-Tuning (SFT) + +**SFT** is a fine-tuning approach that trains a model on labeled input-output pairs using supervised learning. The process adapts a pre-trained model to specific tasks using structured **prompt**-response pairs: + +1. **Data Preparation**: Structure your training data as prompt-response pairs, where a prompt is the input text provided to elicit a specific generation +2. 
**Model Setup**: Load a pre-trained model **checkpoint** (a saved snapshot of model weights and optimizer state) +3. **Training**: Optimize the model using a **loss function** (a mathematical function measuring the difference between predicted and actual outputs) to generate target responses given prompts +4. **Evaluation**: Validate performance on held-out examples + +SFT is typically the first step in post-training, establishing a baseline before more advanced techniques. + +### Generalized Reward Policy Optimization (GRPO) + +**GRPO** is an advanced **reinforcement learning** algorithm for aligning large language models with human preferences and optimizing for specific reward functions. In reinforcement learning, an agent learns to make decisions by receiving rewards or penalties for its actions. + +The GRPO workflow involves: + +1. **Generation Phase**: The **policy model** (the model being trained to generate outputs that maximize rewards) generates multiple candidate responses for each prompt +2. **Scoring Phase**: The **reward model** (a model trained to score output quality) evaluates each candidate and assigns scores +3. **Advantage Computation**: Calculate advantages using reward values and **reference model** KL divergence. The reference model is a frozen copy of the policy model used to prevent the policy from deviating too far from its initial behavior +4. **Policy Update**: Update the policy to increase probability of high-reward actions during each **episode** (one complete training iteration) +5. **Constraint Enforcement**: Apply KL penalties to prevent policy collapse + +GRPO requires a minimum of 3 GPUs: +- GPU 0: Policy model (trainable) +- GPU 1: Reference model (frozen) +- GPU 2: Reward model (frozen) + +**RLHF (Reinforcement Learning from Human Feedback)** is a related training approach that uses human preferences to train a reward model, which then guides policy optimization through techniques like GRPO. + +## Distributed Training + +**Distributed Training** trains a model across multiple GPUs or machines to handle larger models and datasets. TorchForge leverages PyTorch's distributed capabilities to scale training across multiple devices. A **GPU (Graphics Processing Unit)** is the hardware accelerator used for parallel computation in deep learning training and **inference** (the process of using a trained model to make predictions without updating parameters). + +### Data Parallelism + +Each GPU processes different batches of data using the same model replica. Gradients are synchronized across devices during backpropagation using collective communication operations like **All-Reduce** (synchronize gradients across all workers). + +### Model Parallelism + +**Model Parallelism** is a distributed training strategy where different parts of a model are placed on different devices, necessary for models too large to fit on a single GPU. TorchForge supports several approaches: + +**FSDP (Fully Sharded Data Parallel)** +: A PyTorch distributed training strategy that implements **sharding** (splitting model parameters, gradients, or optimizer states across multiple devices) to reduce memory footprint per device. + +**Tensor Parallelism** +: A model parallelism strategy that splits individual tensors like large weight matrices across multiple devices, where each device computes a portion of the operation. 
+ +**Pipeline Parallelism** +: A form of model parallelism where model layers are partitioned across devices and micro-batches are pipelined through the stages for efficient utilization. + +**DTensor (Distributed Tensor)** +: A tensor that is sharded across multiple devices in a distributed training setup, enabling transparent model parallelism. + +### Communication Strategies + +Efficient communication is critical for distributed training: +- **All-Reduce**: Synchronize gradients across all workers +- **Broadcast**: Share model parameters from one process to others +- **Point-to-Point**: Direct communication between specific processes + +## Data Models and Types + +Structured [data models](api_data.md) ensure type safety and consistency throughout the training pipeline: + +**Prompt Models** +: Structured input representations containing the text provided to language models + +**Response Models** +: Generated output with metadata, including the completion text and generation parameters + +**Episode Models** +: Complete RL interaction sequences capturing prompt, response, and reward information + +**Batch Models** +: Optimized batch representations for efficient parallel processing + +**Token** +: The basic unit of text processed by language models, typically representing words, subwords, or characters + +## Resource Management + +Effective resource management is crucial for training **LLMs (Large Language Models)** - deep learning models trained on vast amounts of text data, capable of understanding and generating human-like text. + +### Memory Optimization + +TorchForge employs several memory optimization techniques: + +**Gradient Checkpointing** +: Trade computation for memory by recomputing activations during the backward pass instead of storing them + +**Mixed Precision** +: Use FP16/BF16 for reduced memory footprint while maintaining model quality + +**Activation Offloading** +: Move activations to CPU when not needed on GPU, freeing up device memory + +**Parameter Sharding** +: Distribute model parameters across devices using techniques like FSDP + +### Compute Optimization + +Maximize GPU utilization through: + +**Asynchronous Execution** +: Overlap communication with computation to hide latency + +**Batch Size Tuning** +: Balance memory usage and throughput for optimal training speed + +**Dynamic Batching** +: Group requests efficiently for inference to maximize hardware utilization + +**Kernel Fusion** +: Combine operations to reduce memory bandwidth requirements + +## Integration Points + +TorchForge integrates with several PyTorch ecosystem projects to provide comprehensive functionality: + +### PyTorch Nightly + +Built on the latest PyTorch features, including: +- Native DTensor support for distributed tensors +- Compiled mode optimizations +- Advanced memory management + +### Monarch + +**Monarch** is a Meta-developed optimization framework for PyTorch that TorchForge depends on for certain operations, providing enhanced optimization capabilities. + +### vLLM + +**vLLM** is a fast and memory-efficient inference engine for large language models that TorchForge integrates for **generation** (producing text output from a language model given an input prompt): +- Paged attention for memory efficiency +- Continuous batching for throughput +- Speculative decoding support + +### TorchTitan + +**TorchTitan** is a PyTorch-based framework that TorchForge builds upon for certain distributed training capabilities, leveraging its foundations for scaling to large clusters. 
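To ground the DTensor abstraction these integrations build on, here is a minimal standalone PyTorch sketch, assuming a 4-GPU job launched with `torchrun` (TorchForge users normally never write this by hand; it only illustrates how a tensor gets sharded across a device mesh):

```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._tensor import distribute_tensor, Shard

# One process per GPU, e.g. `torchrun --nproc_per_node 4 this_script.py`
mesh = init_device_mesh("cuda", (4,))

# A large weight matrix, sharded along dim 0 across the 4 devices
weight = torch.randn(8192, 8192)
dweight = distribute_tensor(weight, mesh, placements=[Shard(0)])

# Each rank now holds only a 2048 x 8192 local shard
print(dweight.to_local().shape)
```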
+ +## Configuration Management + +TorchForge uses a hierarchical configuration system with multiple layers of settings: + +1. **Default Configs**: Sensible defaults for common scenarios +2. **Model Configs**: Pre-configured setups for popular models +3. **User Configs**: Custom overrides for specific needs +4. **Runtime Configs**: Dynamic adjustments during execution + +See [Usage](usage.md) for detailed configuration examples. + +## Training Lifecycle + +Understanding the complete training lifecycle helps you effectively use TorchForge: + +### Checkpointing + +A **checkpoint** is a saved snapshot of model weights and optimizer state, allowing training to be resumed or the model to be deployed for inference. Regular checkpointing provides fault tolerance and enables experimentation with different training configurations. + +### Warmup + +**Warmup** is a training phase where the learning rate gradually increases from a small value to the target learning rate, helping stabilize early training and prevent gradient explosions. + +### Loss Functions + +TorchForge provides specialized [loss functions](api_losses.md) for post-training: +- **Policy Gradient Losses**: For reinforcement learning optimization +- **Regularization Terms**: KL divergence constraints +- **Multi-Objective Losses**: Combine multiple training signals + +## Best Practices + +### Model Selection +- Start with smaller models for prototyping +- Use pre-configured model setups when available +- Validate configurations before large-scale training + +### Data Preparation +- Ensure balanced and diverse training data +- Implement proper train/validation splits +- Monitor data quality throughout training +- Verify token distributions match expectations + +### Training Strategy +- Begin with SFT before attempting GRPO +- Use gradient accumulation for larger effective batch sizes +- Monitor KL divergence to prevent policy collapse +- Implement regular checkpointing for fault tolerance +- Apply warmup schedules for stable training + +### Resource Optimization +- Profile memory usage to identify bottlenecks +- Tune batch sizes for your hardware configuration +- Consider mixed precision training to reduce memory +- Use appropriate parallelism strategies for your model size + +### Debugging +- Start with single-GPU training to isolate issues +- Enable verbose logging for distributed runs +- Use profiling tools to identify bottlenecks +- Validate data pipelines before full training +- Monitor loss curves and generation quality + +## See Also + +- {doc}`getting_started` - Installation and setup guide +- {doc}`usage` - Practical usage examples +- {doc}`tutorials` - Step-by-step guides +- {doc}`api` - Complete API reference +- {doc}`faq` - Common questions and troubleshooting diff --git a/docs/source/conf.py b/docs/source/conf.py index 69b3c20b5..98b02cc6c 100644 --- a/docs/source/conf.py +++ b/docs/source/conf.py @@ -117,6 +117,7 @@ "colon_fence", "deflist", "html_image", + "substitution", ] autodoc_default_options = { diff --git a/docs/source/getting_started.md b/docs/source/getting_started.md index 57e1b63c8..6012497c0 100644 --- a/docs/source/getting_started.md +++ b/docs/source/getting_started.md @@ -1,9 +1,352 @@ -# Get Started +# Getting Started -Welcome to TorchForge! This guide will help you get up and running with TorchForge, a PyTorch-native platform specifically designed for post-training generative AI models. +Welcome to TorchForge! 
This guide will walk you through installing TorchForge, verifying your setup, and running your first training job. -TorchForge specializes in post-training techniques for large language models, including: +## Prerequisites -- **Supervised Fine-Tuning (SFT)**: Adapt pre-trained models to specific tasks using labeled data -- **Generalized Reward Policy Optimization (GRPO)**: Advanced reinforcement learning for model alignment -- **Multi-GPU Distributed Training**: Efficient scaling across multiple GPUs and nodes +Before installing TorchForge, ensure your system meets the following requirements: + +### System Requirements + +| Component | Requirement | Notes | +|-----------|-------------|-------| +| **Operating System** | Linux (Fedora/Ubuntu/Debian) | MacOS and Windows not currently supported | +| **Python** | 3.10 or higher | Python 3.11 recommended | +| **GPU** | NVIDIA with CUDA support | AMD GPUs not currently supported | +| **CUDA** | 12.8 or higher | Required for GPU training | +| **Minimum GPUs** | 2 for SFT, 3 for GRPO | More GPUs enable larger models | +| **RAM** | 32GB+ recommended | Depends on model size | +| **Disk Space** | 50GB+ free | For models, datasets, and checkpoints | + +### Required Tools + +1. **Conda or Miniconda**: For environment management + - Download from [conda.io](https://docs.conda.io/en/latest/miniconda.html) + +2. **GitHub CLI (gh)**: Required for downloading pre-packaged dependencies + - Install instructions: [github.com/cli/cli#installation](https://github.com/cli/cli#installation) + - After installing, authenticate with: `gh auth login` + - You can use either HTTPS or SSH as the authentication protocol + +3. **Git**: For cloning the repository + - Usually pre-installed on Linux systems + - Verify with: `git --version` + +## Installation + +TorchForge offers two installation methods. Choose the one that fits your setup: + +### Method 1: Basic Installation (Recommended) + +This method uses pre-packaged wheels for all dependencies, making installation faster and more reliable. 
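If you want to confirm the prerequisites from the previous section are in place before you begin, a quick check looks like this (standard tools only, nothing TorchForge-specific):

```bash
gh --version       # GitHub CLI, used to fetch pre-built wheels
git --version      # needed to clone the repository
conda --version    # environment management
nvidia-smi         # confirms the NVIDIA driver and visible GPUs
```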
+ +**Step 1: Clone the Repository** + +```bash +git clone https://github.com/meta-pytorch/forge.git +cd forge +``` + +**Step 2: Create Conda Environment** + +```bash +conda create -n forge python=3.10 +conda activate forge +``` + +**Step 3: Run Installation Script** + +```bash +./scripts/install.sh +``` + +The installation script will: +- Install system dependencies using DNF (or your package manager) +- Download pre-built wheels for PyTorch nightly, Monarch, vLLM, and TorchTitan +- Install TorchForge and all Python dependencies + +```{tip} +**Using sudo instead of conda**: If you prefer installing system packages directly rather than through conda, use: +`./scripts/install.sh --use-sudo` +``` + +**Step 4: Verify Installation** + +Test that TorchForge is properly installed: + +```bash +python -c "import forge; print(forge.__version__)" +``` + +### Method 2: Meta Internal Installation (Alternative) + +For Meta employees or those with access to Meta's internal tools: + +**Step 1: Install uv Package Manager** + +```bash +curl -LsSf https://astral.sh/uv/install.sh | sh +``` + +**Step 2: Clone and Setup** + +```bash +git clone https://github.com/meta-pytorch/forge +cd forge +uv sync --all-extras +source .venv/bin/activate +``` + +**Step 3: Configure CUDA** + +```bash +# Install CUDA if needed +feature install --persist cuda_12_9 + +# Set environment variables +export CUDA_VERSION=12.9 +export NVCC=/usr/local/cuda-$CUDA_VERSION/bin/nvcc +export CUDA_NVCC_EXECUTABLE=/usr/local/cuda-$CUDA_VERSION/bin/nvcc +export CUDA_HOME=/usr/local/cuda-$CUDA_VERSION +export PATH="$CUDA_HOME/bin:$PATH" +export CUDA_INCLUDE_DIRS=$CUDA_HOME/include +export CUDA_CUDART_LIBRARY=$CUDA_HOME/lib64/libcudart.so +export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH +``` + +**Step 4: Build vLLM from Source** + +```bash +git clone https://github.com/vllm-project/vllm.git --branch v0.10.0 +cd vllm +python use_existing_torch.py +uv pip install -r requirements/build.txt +uv pip install --no-build-isolation -e . +``` + +```{warning} +When adding packages to `pyproject.toml`, use `uv sync --inexact` to avoid removing Monarch and vLLM. +``` + +## Verifying Your Setup + +After installation, verify that all components are working correctly: + +### Check GPU Availability + +```bash +python -c "import torch; print(f'GPUs available: {torch.cuda.device_count()}')" +``` + +Expected output: `GPUs available: 2` (or more) + +### Check CUDA Version + +```bash +python -c "import torch; print(f'CUDA version: {torch.version.cuda}')" +``` + +Expected output: `CUDA version: 12.8` (or higher) + +### Check Dependencies + +```bash +# Check vLLM +python -c "import vllm; print(f'vLLM: {vllm.__version__}')" + +# Check TorchForge modules +python -c "from forge import actors, controller, data; print('All modules imported successfully')" +``` + +## Quick Start Examples + +Now that TorchForge is installed, let's run some training examples: + +### Example 1: Supervised Fine-Tuning (SFT) + +This example fine-tunes Llama 3 8B on your data. **Requires: 2+ GPUs** + +**Step 1: Download the Model** + +```bash +uv run forge download meta-llama/Meta-Llama-3.1-8B-Instruct \ + --output-dir /tmp/Meta-Llama-3.1-8B-Instruct \ + --ignore-patterns "original/consolidated.00.pth" +``` + +```{note} +Model downloads require Hugging Face authentication. Run `huggingface-cli login` first if you haven't already. 
+``` + +**Step 2: Run Training** + +```bash +uv run forge run --nproc_per_node 2 \ + apps/sft/main.py \ + --config apps/sft/llama3_8b.yaml +``` + +**What's Happening:** +- `--nproc_per_node 2`: Use 2 GPUs for training +- `apps/sft/main.py`: SFT training script +- `--config apps/sft/llama3_8b.yaml`: Configuration file with hyperparameters + +**Expected Output:** +``` +Initializing process group... +Loading model from /tmp/Meta-Llama-3.1-8B-Instruct... +Starting training... +Epoch 1/10 | Step 100 | Loss: 2.45 | LR: 0.0001 +... +``` + +### Example 2: GRPO Training + +Train a model using reinforcement learning with GRPO. **Requires: 3+ GPUs** + +```bash +python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml +``` + +**What's Happening:** +- GPU 0: Policy model (being trained) +- GPU 1: Reference model (frozen baseline) +- GPU 2: Reward model (scoring outputs) + +**Expected Output:** +``` +Initializing GRPO training... +Loading policy model on GPU 0... +Loading reference model on GPU 1... +Loading reward model on GPU 2... +Episode 1 | Avg Reward: 0.75 | KL Divergence: 0.12 +... +``` + +### Example 3: Inference with vLLM + +Generate text using a trained model: + +```bash +python -m apps.vllm.main --config apps/vllm/llama3_8b.yaml +``` + +This loads your model and starts an interactive generation session. + +## Understanding Configuration Files + +TorchForge uses YAML configuration files to manage training parameters. Let's examine a typical config: + +```yaml +# Example: apps/sft/llama3_8b.yaml +model: + name: meta-llama/Meta-Llama-3.1-8B-Instruct + path: /tmp/Meta-Llama-3.1-8B-Instruct + +training: + batch_size: 4 + learning_rate: 1e-5 + num_epochs: 10 + gradient_accumulation_steps: 4 + +distributed: + strategy: fsdp + precision: bf16 + +checkpointing: + save_interval: 1000 + output_dir: /tmp/checkpoints +``` + +**Key Sections:** +- **model**: Model path and settings +- **training**: Hyperparameters like batch size and learning rate +- **distributed**: Multi-GPU strategy and precision +- **checkpointing**: Where and when to save model checkpoints + +See {doc}`usage` for detailed configuration options. + +## Common Installation Issues + +### Issue: `gh: command not found` + +**Solution**: Install GitHub CLI: +```bash +# On Ubuntu/Debian +sudo apt install gh + +# On Fedora +sudo dnf install gh + +# Then authenticate +gh auth login +``` + +### Issue: `CUDA out of memory` + +**Solution**: Reduce batch size in your config file: +```yaml +training: + batch_size: 2 # Reduced from 4 + gradient_accumulation_steps: 8 # Increased to maintain effective batch size +``` + +### Issue: `ImportError: No module named 'torch'` + +**Solution**: Ensure you activated the conda environment: +```bash +conda activate forge +``` + +### Issue: vLLM wheel download fails + +**Solution**: The vLLM wheel is hosted on GitHub releases. Ensure you're authenticated with `gh auth login` and have internet access. + +### Issue: `Unsupported GPU architecture` + +**Solution**: Check your GPU compute capability: +```bash +python -c "import torch; print(torch.cuda.get_device_capability())" +``` + +TorchForge requires compute capability 7.0 or higher (Volta architecture or newer). + +## Next Steps + +Now that you have TorchForge installed and verified: + +1. **Learn the Concepts**: Read {doc}`concepts` to understand TorchForge's architecture +2. **Explore Examples**: Check the `apps/` directory for more training examples +3. **Customize Training**: See {doc}`usage` for configuration patterns +4. 
**Read Tutorials**: Follow {doc}`tutorials` for step-by-step guides +5. **API Documentation**: Explore {doc}`api` for detailed API reference + +## Getting Help + +If you encounter issues: + +1. **Check the FAQ**: {doc}`faq` covers common questions and solutions +2. **Search Issues**: Look through [GitHub Issues](https://github.com/meta-pytorch/forge/issues) +3. **File a Bug Report**: Create a new issue with: + - Your system configuration + - Full error message + - Steps to reproduce + - Expected vs actual behavior + +```{tip} +When filing issues, include the output of: +```bash +python -c "import torch; import forge; print(f'PyTorch: {torch.__version__}\\nForge: {forge.__version__}\\nCUDA: {torch.version.cuda}\\nGPUs: {torch.cuda.device_count()}')" +``` +``` + +## Additional Resources + +- **GitHub Repository**: [github.com/meta-pytorch/forge](https://github.com/meta-pytorch/forge) +- **Example Notebooks**: Check `demo.ipynb` in the repository root +- **Contributing Guide**: [CONTRIBUTING.md](https://github.com/meta-pytorch/forge/blob/main/CONTRIBUTING.md) +- **Code of Conduct**: [CODE_OF_CONDUCT.md](https://github.com/meta-pytorch/forge/blob/main/CODE_OF_CONDUCT.md) + +--- + +**Ready to start training?** Head to {doc}`usage` for practical configuration examples and workflows. diff --git a/docs/source/index.md b/docs/source/index.md index c450b4ca7..8e9510b39 100644 --- a/docs/source/index.md +++ b/docs/source/index.md @@ -1,25 +1,150 @@ -# Welcome to TorchForge documentation! +# TorchForge Documentation -**TorchForge** is a PyTorch-native platform specifically designed -for post-training generative AI models. +**TorchForge** is a PyTorch-native platform built for post-training generative AI models and agentic development. Designed with the principle that researchers should write algorithms, not infrastructure. -Key Features ------------- +```{note} +**Early Development:** TorchForge is currently experimental. Expect bugs, incomplete features, and API changes. Please file issues on [GitHub](https://github.com/meta-pytorch/forge) for bug reports and feature requests. +``` + +## Why TorchForge? + +TorchForge introduces a "service"-centric architecture that provides the right abstractions for distributed complexity: + +- **Usability for Rapid Research**: Isolate your RL algorithms from infrastructure concerns +- **Hackability for Power Users**: Modify any part of the RL loop without touching infrastructure code +- **Scalability**: Seamlessly shift between async and synchronous training across thousands of GPUs + +## Core Capabilities + +::::{grid} 1 1 2 2 +:gutter: 3 + +:::{grid-item-card} Supervised Fine-Tuning (SFT) +:link: concepts +:link-type: doc + +Adapt pre-trained models to specific tasks using labeled data. Perfect for creating specialized models from foundation models. +::: + +:::{grid-item-card} Reinforcement Learning (GRPO) +:link: concepts +:link-type: doc + +Advanced policy optimization using Generalized Reward Policy Optimization for aligning models with human preferences and reward functions. +::: + +:::{grid-item-card} Distributed Training +:link: concepts +:link-type: doc + +Built-in support for multi-GPU and multi-node training with FSDP, tensor parallelism, and pipeline parallelism. +::: + +:::{grid-item-card} Integration Ecosystem +:link: concepts +:link-type: doc + +Seamlessly integrates with PyTorch nightly, Monarch, vLLM, and TorchTitan for a complete training and inference pipeline. 
+::: + +:::: + +## Getting Started Paths + +Choose your journey based on your experience level and goals: + +::::{grid} 1 1 2 2 +:gutter: 3 + +:::{grid-item-card} 🚀 New to TorchForge? +:link: getting_started +:link-type: doc + +Start here for installation instructions, system requirements, and your first training run. + +**Time to first run: ~15 minutes** +::: + +:::{grid-item-card} 📚 Understanding the Architecture +:link: concepts +:link-type: doc + +Learn about TorchForge's architecture, key concepts, and how the components work together. + +**Recommended for researchers** +::: -* **Post-Training Focus**: Specializes in techniques - like Supervised Fine-Tuning (SFT) and Generalized Reward Policy Optimization (GRPO) -* **PyTorch Integration**: Built natively on PyTorch with - dependencies on [PyTorch nightly](https://pytorch.org/get-started/locally/), - [Monarch](https://meta-pytorch.org/monarch), [vLLM](https://docs.vllm.ai/en/latest/), - and [TorchTitan](https://github.com/pytorch/torchtitan). -* **Multi-GPU Support**: Designed for distributed training - with minimum 3 GPU requirement for GRPO training -* **Model Support**: Includes pre-configured setups for popular models - like Llama3 8B and Qwen3.1 7B +:::{grid-item-card} 💻 Practical Usage +:link: usage +:link-type: doc + +Configuration patterns, common workflows, and real-world usage scenarios. + +**For hands-on development** +::: + +:::{grid-item-card} 📖 API Reference +:link: api +:link-type: doc + +Complete API documentation for all modules, classes, and functions. + +**For in-depth customization** +::: + +:::: + +## Quick Example + +Here's what a simple SFT training run looks like: + +```bash +# Download a model +uv run forge download meta-llama/Meta-Llama-3.1-8B-Instruct \ + --output-dir /tmp/Meta-Llama-3.1-8B-Instruct + +# Run supervised fine-tuning +uv run forge run --nproc_per_node 2 \ + apps/sft/main.py --config apps/sft/llama3_8b.yaml +``` + +See {doc}`getting_started` for complete installation and setup instructions. + +## System Requirements + +| Component | Requirement | +|-----------|-------------| +| **Python** | 3.10+ | +| **Operating System** | Linux (tested on Fedora/Ubuntu) | +| **GPU** | NVIDIA with CUDA support | +| **Minimum GPUs** | 2 for SFT, 3 for GRPO | +| **Dependencies** | PyTorch nightly, Monarch, vLLM, TorchTitan | + +## Supported Models + +TorchForge includes pre-configured setups for popular models: + +- **Llama 3 8B**: Production-ready configuration for supervised fine-tuning +- **Qwen 3.1 7B**: Optimized settings for GRPO training +- **Qwen 3 8B**: Multi-node training configurations +- **Custom Models**: Extensible architecture for any transformer-based model + +## Community & Support + +- **Documentation**: You're reading it! Use the navigation above to explore +- **GitHub Issues**: [Report bugs and request features](https://github.com/meta-pytorch/forge/issues) +- **Contributing**: See [CONTRIBUTING.md](https://github.com/meta-pytorch/forge/blob/main/CONTRIBUTING.md) +- **Code of Conduct**: [CODE_OF_CONDUCT.md](https://github.com/meta-pytorch/forge/blob/main/CODE_OF_CONDUCT.md) + +```{tip} +Signal your intention to contribute in the issue tracker before starting significant work to coordinate efforts with the maintainers. 
+``` + +## Documentation Contents ```{toctree} :maxdepth: 1 -:caption: Contents: +:caption: Documentation getting_started concepts @@ -29,8 +154,12 @@ api faq ``` -## Indices and tables +## Indices + +* {ref}`genindex` - Index of all documented objects +* {ref}`modindex` - Python module index +* {ref}`search` - Search the documentation + +--- -* {ref}`genindex` -* {ref}`modindex` -* {ref}`search` +**License**: BSD 3-Clause | **GitHub**: [meta-pytorch/forge](https://github.com/meta-pytorch/forge) From 087e2ffafdb89cc0c319c929f8103509d50ae293 Mon Sep 17 00:00:00 2001 From: Alanna Burke Date: Wed, 8 Oct 2025 19:19:12 -0400 Subject: [PATCH 2/8] Update documentation with blog post insights: enhanced homepage, comprehensive getting started with dependency details, and expanded concepts page with Monarch, Services, TorchStore, and RL workflows --- docs/source/concepts.md | 507 +++++++++++++++++++++++++-------- docs/source/getting_started.md | 182 ++++++++++-- docs/source/index.md | 230 ++++++++++----- 3 files changed, 705 insertions(+), 214 deletions(-) diff --git a/docs/source/concepts.md b/docs/source/concepts.md index bd1c52d84..3d91eb58f 100644 --- a/docs/source/concepts.md +++ b/docs/source/concepts.md @@ -1,142 +1,410 @@ # Concepts -This guide covers the fundamental concepts and architecture behind TorchForge, -helping you understand how the system works and how to effectively use its components. +This guide covers the fundamental concepts and architecture behind TorchForge, helping you understand how the system works and how to effectively use its components. -## Architecture Overview +## The Core Philosophy -TorchForge is built as a modular, distributed system designed specifically for post-training large language models. The architecture follows a clear separation of concerns, with specialized components handling different aspects of the training pipeline. +TorchForge is built on one principle: **researchers should write algorithms, not infrastructure**. -### Core Components +The traditional approach to distributed RL requires you to write complex coordination logic, retry mechanisms, resource management, and synchronization code. TorchForge abstracts all of this away, letting you express RL algorithms as naturally as pseudocode while powerful infrastructure handles the distributed complexity underneath. -The system is organized into several key layers: +## The Foundation: Monarch -**Controller** -: The orchestration component responsible for coordinating distributed training across multiple GPUs and nodes. The [Controller](api_controller.md) is the orchestration hub that manages resource allocation and scheduling, inter-process communication coordination, fault tolerance and recovery, and distributed state management. +At TorchForge's core is **Monarch**, a PyTorch-native distributed programming framework that brings single-controller orchestration to entire GPU clusters. -**Actor** -: A component responsible for executing training or inference tasks. 
[Actors](api_actors.md) are the workhorses of TorchForge, handling model operations: -- **Policy Actor**: Trains the primary model being optimized -- **Reward Actor**: Evaluates generated outputs and provides reward signals -- **Reference Actor**: Maintains a frozen baseline for KL divergence computation -- **Inference Actor**: Handles efficient generation of model outputs +### Single-Controller vs SPMD -**Data Layer** -: The [Data](api_data.md) layer manages all aspects of data handling including dataset loading and preprocessing, batch construction and sampling, data distribution across workers, and custom data models and transformations. A **batch** represents the number of training examples processed together in a single forward and backward pass through the model. +Traditional distributed training uses **SPMD (Single Program, Multiple Data)** - where multiple copies of the same script run across different machines, each with only a local view of the workflow. This works well for simple data-parallel training, but becomes notoriously difficult for complex RL workflows with: +- Asynchronous generation and training +- Multiple heterogeneous components (policy, reward model, reference model) +- Dynamic resource allocation +- Fault tolerance across components -**Environment Layer** -: [Environments](api_envs.md) provide execution contexts for both training and inference operations. They handle training environments for supervised and reinforcement learning, inference environments for generation and evaluation, and resource management with device allocation. +**Monarch's single-controller model** changes this entirely. You write one Python script that orchestrates all distributed resources, making them feel almost local. The code looks and feels like a single-machine program, but can scale across thousands of GPUs. -## Post-Training and Fine-Tuning +### Actor Meshes -**Post-Training** refers to training phases that occur after initial pre-training. While **pre-training** is the initial phase where a model learns general language understanding from large text corpora, post-training adapts the model to specific tasks and behaviors. +Monarch organizes resources into multidimensional arrays called **meshes**: -**Fine-Tuning** is the process of adapting a pre-trained model to a specific task or domain by continuing training on task-specific data. TorchForge supports two primary post-training paradigms: +**Process Mesh** +: An array of processes spread across many hosts, typically one process per GPU -### Supervised Fine-Tuning (SFT) +**Actor Mesh** +: An array of actors, each running inside a separate process -**SFT** is a fine-tuning approach that trains a model on labeled input-output pairs using supervised learning. The process adapts a pre-trained model to specific tasks using structured **prompt**-response pairs: +Like array programming in NumPy or PyTorch, meshes make it simple to dispatch operations efficiently across large systems. You can slice meshes, broadcast operations, and operate on entire meshes with simple APIs. -1. **Data Preparation**: Structure your training data as prompt-response pairs, where a prompt is the input text provided to elicit a specific generation -2. **Model Setup**: Load a pre-trained model **checkpoint** (a saved snapshot of model weights and optimizer state) -3. **Training**: Optimize the model using a **loss function** (a mathematical function measuring the difference between predicted and actual outputs) to generate target responses given prompts -4. 
**Evaluation**: Validate performance on held-out examples +```python +from monarch.actor import Actor, this_host -SFT is typically the first step in post-training, establishing a baseline before more advanced techniques. +# Create a process mesh with 8 GPUs +procs = this_host().spawn_procs({"gpus": 8}) -### Generalized Reward Policy Optimization (GRPO) +# Define an actor +class PolicyActor(Actor): + @endpoint + def generate(self, prompt): + return self.model.generate(prompt) -**GRPO** is an advanced **reinforcement learning** algorithm for aligning large language models with human preferences and optimizing for specific reward functions. In reinforcement learning, an agent learns to make decisions by receiving rewards or penalties for its actions. +# Spawn actors across the mesh +actors = procs.spawn("policy", PolicyActor) -The GRPO workflow involves: +# Call methods on the entire mesh +results = actors.generate.call_all("Hello world") +``` -1. **Generation Phase**: The **policy model** (the model being trained to generate outputs that maximize rewards) generates multiple candidate responses for each prompt -2. **Scoring Phase**: The **reward model** (a model trained to score output quality) evaluates each candidate and assigns scores -3. **Advantage Computation**: Calculate advantages using reward values and **reference model** KL divergence. The reference model is a frozen copy of the policy model used to prevent the policy from deviating too far from its initial behavior -4. **Policy Update**: Update the policy to increase probability of high-reward actions during each **episode** (one complete training iteration) -5. **Constraint Enforcement**: Apply KL penalties to prevent policy collapse +### Fault Tolerance -GRPO requires a minimum of 3 GPUs: -- GPU 0: Policy model (trainable) -- GPU 1: Reference model (frozen) -- GPU 2: Reward model (frozen) +Monarch provides **progressive fault handling** - you write your code as if nothing fails. When something does fail, Monarch fails fast by default, stopping the whole program like an uncaught exception. -**RLHF (Reinforcement Learning from Human Feedback)** is a related training approach that uses human preferences to train a reward model, which then guides policy optimization through techniques like GRPO. +But you can progressively add fine-grained fault handling exactly where you need it: -## Distributed Training +```python +try: + result = await policy.generate.route(prompt) +except ActorFailure: + # Handle failure - maybe retry with different replica + result = await policy.generate.route(prompt) +``` -**Distributed Training** trains a model across multiple GPUs or machines to handle larger models and datasets. TorchForge leverages PyTorch's distributed capabilities to scale training across multiple devices. A **GPU (Graphics Processing Unit)** is the hardware accelerator used for parallel computation in deep learning training and **inference** (the process of using a trained model to make predictions without updating parameters). +For long-running RL training, this is crucial. Hardware failures are common at scale - in Meta's Llama 3 training, there were 419 interruptions across 54 days on a 16K GPU job (roughly one failure every 3 hours). -### Data Parallelism +### RDMA and Data Plane + +Monarch separates the **control plane** (messaging) from the **data plane** (bulk data transfers). This enables direct GPU-to-GPU memory transfers across your cluster using RDMA (Remote Direct Memory Access). 
+ +Control commands go through one optimized path, while large data transfers (like model weights) go through another path optimized for bandwidth. + +## TorchForge Architecture + +With Monarch as the foundation, TorchForge builds higher-level abstractions specifically for RL workflows. + +### Services: RL-Friendly Actor Abstraction + +**Services** wrap Monarch's ActorMesh with patterns common in RL. A service is a managed group of actor replicas with built-in load balancing, fault tolerance, and routing primitives. + +```python +# Create a policy service with 16 replicas, each using 8 GPUs +policy = PolicyActor.options( + procs=8, + with_gpus=True, + num_replicas=16 +).as_service() + +# Create a lightweight coding environment service +coder = SandboxedCoder.options( + procs=1, + with_gpus=False, + num_replicas=16 +).as_service() +``` + +**Service Adverbs** provide intuitive operations: + +**route()** +: Load-balanced request to one replica +```python +response = await policy.generate.route(prompt) +``` + +**fanout()** +: Broadcast to ALL replicas in parallel +```python +await policy.update_weights.fanout(version) +``` + +**session()** +: Sticky sessions for stateful operations (maintains KV cache consistency) +```python +async with policy.session(): + response1 = await policy.generate.route(prompt1) + response2 = await policy.generate.route(prompt2) # Same replica +``` + +### Why Services Matter for RL + +Services solve critical infrastructure challenges: + +**Heterogeneous Scaling** +: Different components need different resources. Your policy might need 16 replicas × 8 GPUs for high-throughput vLLM inference. Your reward model might need 4 replicas × 4 GPUs. Your coding environment might need 16 lightweight CPU-only replicas. Services let each component scale independently. + +**Load Balancing** +: In async RL, multiple `continuous_rollouts()` tasks run concurrently. Services automatically distribute these rollouts across available replicas - no manual worker pool management. + +**Fault Tolerance** +: If a replica fails during a rollout, services detect it, mark it unhealthy, and route subsequent requests to healthy replicas. The failed replica gets restarted automatically. Your RL code never sees the failure. + +**Ephemeral Infrastructure** +: Services are created with your job and torn down when finished. Want to try a new reward model? Change your Python code. No standing deployments to maintain, no infrastructure to provision ahead of time. + +## TorchStore: Distributed Weight Storage + +In async RL, every training step produces new policy weights that must propagate to all inference replicas. For a 70B parameter model across 16 replicas, this means moving hundreds of gigabytes of data. **TorchStore** makes this efficient. + +### The Weight Synchronization Challenge + +Traditionally, you have two options: +1. **Build complex p2p mappings** between training and inference sharding strategies (fast but extremely complex) +2. **Use network filesystem** like NFS (simple but slow, with high infrastructure cost) + +TorchStore combines the **UX of central storage** with the **performance of in-memory p2p operations**. 
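To see why this matters, a rough back-of-envelope sketch (illustrative numbers only, assuming bf16 weights):

```python
params = 70e9        # 70B-parameter policy
bytes_per_param = 2  # bf16
replicas = 16        # inference replicas that need each update

per_replica_gb = params * bytes_per_param / 1e9  # ~140 GB received by each replica
total_gb = per_replica_gb * replicas             # aggregate traffic per weight version

print(f"{per_replica_gb:.0f} GB per replica, {total_gb:.0f} GB in total per update")
```

Moving that much data through a filesystem every few training steps quickly dominates step time, which is why keeping transfers in memory and over RDMA matters.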
+ +### How TorchStore Works + +TorchStore is a distributed, in-memory key-value store for PyTorch tensors, built on Monarch primitives: + +```python +import torchstore as ts +from torch.distributed._tensor import distribute_tensor, Shard +from torch.distributed.device_mesh import init_device_mesh + +# Training process: store sharded weights +async def store_weights(): + device_mesh = init_device_mesh("cuda", (4,)) + tensor = model.state_dict()['layer.weight'] + dtensor = distribute_tensor(tensor, device_mesh, [Shard(0)]) + + # Each rank stores its shard + await ts.put("policy_weights_v123", dtensor) + +# Inference process: fetch with different sharding +async def load_weights(): + device_mesh = init_device_mesh("cuda", (2, 2)) # Different topology! + tensor = torch.empty_like(model.state_dict()['layer.weight']) + dtensor = distribute_tensor(tensor, device_mesh, [Shard(0)]) + + # TorchStore handles resharding automatically + await ts.get("policy_weights_v123", dtensor) +``` + +**Key Features:** + +**Automatic Resharding** +: Handles complex weight transfer between different sharding strategies transparently + +**DTensor Native** +: Works seamlessly with PyTorch's distributed tensors + +**RDMA Transfers** +: Uses RDMA for high-bandwidth data movement without blocking GPUs + +**Asynchronous Updates** +: Training and inference can read/write weights independently, enabling true async RL + +**Flexible Storage** +: Store tensors co-located with trainers, on their own storage tier, sharded or replicated - change with minimal code modifications + +### Why TorchStore Matters + +Without TorchStore, weight synchronization becomes the bottleneck in async RL. Traditional approaches either: +- Require synchronous GPU-to-GPU transfers (blocking training) +- Use slow network filesystems (minutes per update) +- Demand complex manual resharding logic (error-prone, hard to maintain) + +TorchStore solves all of these, keeping data distributed across the cluster until requested and moving it efficiently with RDMA. + +## The RL Stack: Proven Components + +TorchForge made a conscious decision not to reinvent the wheel. We integrate battle-tested components: + +### vLLM: High-Throughput Inference -Each GPU processes different batches of data using the same model replica. Gradients are synchronized across devices during backpropagation using collective communication operations like **All-Reduce** (synchronize gradients across all workers). +**vLLM** handles policy generation with: +- **PagedAttention**: Memory-efficient attention that reduces fragmentation +- **Continuous Batching**: Dynamic batching for maximum GPU utilization +- **Production Performance**: Proven at scale -### Model Parallelism +In RL, policy generation is often the bottleneck. Autoregressive generation is costly, and blocking training on it kills throughput. vLLM enables fast, efficient inference that doesn't bottleneck your training loop. -**Model Parallelism** is a distributed training strategy where different parts of a model are placed on different devices, necessary for models too large to fit on a single GPU. TorchForge supports several approaches: +**Integration**: TorchForge integrates directly with vLLM's engine. You can customize generation strategies, memory management, and inference logic as your research demands. 
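For orientation, this is roughly what vLLM's standalone offline-generation API looks like (a minimal sketch of the engine TorchForge wraps; within TorchForge you would normally call the policy service rather than construct the engine yourself):

```python
from vllm import LLM, SamplingParams

# Any Hugging Face model id works; this one matches the examples elsewhere in these docs
llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

outputs = llm.generate(["Explain KL divergence in one sentence."], params)
print(outputs[0].outputs[0].text)
```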
-**FSDP (Fully Sharded Data Parallel)** -: A PyTorch distributed training strategy that implements **sharding** (splitting model parameters, gradients, or optimizer states across multiple devices) to reduce memory footprint per device. +### TorchTitan: Production Training -**Tensor Parallelism** -: A model parallelism strategy that splits individual tensors like large weight matrices across multiple devices, where each device computes a portion of the operation. +**TorchTitan** brings Meta's production-grade training infrastructure: +- **FSDP**: Shard parameters, gradients, and optimizer states across GPUs +- **Pipeline Parallelism**: Split model layers across devices with efficient micro-batching +- **Tensor Parallelism**: Split individual tensors across devices for very large models +- **Proven at Scale**: Used to train Llama models on thousands of GPUs -**Pipeline Parallelism** -: A form of model parallelism where model layers are partitioned across devices and micro-batches are pipelined through the stages for efficient utilization. +Modern models are too large for single GPUs. TorchTitan provides the sophisticated sharding and parallelism strategies needed to train them efficiently. -**DTensor (Distributed Tensor)** -: A tensor that is sharded across multiple devices in a distributed training setup, enabling transparent model parallelism. +**Integration**: TorchForge provides direct access to TorchTitan's training step logic and sharding strategies, enabling deep customization. -### Communication Strategies +### Role of Integration -Efficient communication is critical for distributed training: -- **All-Reduce**: Synchronize gradients across all workers -- **Broadcast**: Share model parameters from one process to others -- **Point-to-Point**: Direct communication between specific processes +These integrations give you: +- **Direct component access**: Customize deeply when needed +- **Proven performance**: Battle-tested at massive scale +- **Flexible composition**: Mix and match with custom components -## Data Models and Types +TorchForge's role is **coordination** - making these components work together seamlessly so you can express your RL algorithm naturally. 
-Structured [data models](api_data.md) ensure type safety and consistency throughout the training pipeline: +## Writing RL Algorithms -**Prompt Models** -: Structured input representations containing the text provided to language models +With these foundations, here's what RL code looks like in TorchForge: -**Response Models** -: Generated output with metadata, including the completion text and generation parameters +### Episode Generation -**Episode Models** -: Complete RL interaction sequences capturing prompt, response, and reward information +```python +async def generate_episode(dataloader, policy, reward, replay_buffer): + # Sample a prompt + prompt, target = await dataloader.sample.route() -**Batch Models** -: Optimized batch representations for efficient parallel processing + # Generate response (vLLM handles this efficiently) + response = await policy.generate.route(prompt) -**Token** -: The basic unit of text processed by language models, typically representing words, subwords, or characters + # Score the response + reward_value = await reward.evaluate_response.route( + prompt=prompt, + response=response.text, + target=target + ) + + # Store for training + await replay_buffer.add.route( + Episode( + prompt_ids=response.prompt_ids, + response_ids=response.token_ids, + reward=reward_value + ) + ) +``` + +Notice what's **not** here: +- No retry logic +- No resource allocation +- No synchronization code +- No infrastructure complexity + +Just your algorithm. + +### Asynchronous RL + +Compose this into fully async, off-policy training: + +```python +async def async_rl_loop(num_rollout_loops: int): + # Multiple concurrent rollout generators + rollout_tasks = [ + asyncio.create_task(continuous_rollouts()) + for _ in range(num_rollout_loops) + ] + + # Continuous training + training_task = asyncio.create_task(continuous_training()) + + await asyncio.gather(*rollout_tasks, training_task) + +async def continuous_rollouts(): + """Generate rollouts continuously using latest policy.""" + while True: + await generate_episode(dataloader, policy, reward, replay_buffer) + +async def continuous_training(): + """Train continuously on available experience.""" + training_step = 0 + while True: + batch = await replay_buffer.sample.route( + curr_policy_version=training_step + ) + + if batch is None: + await asyncio.sleep(0.1) # Wait for more experience + else: + loss = await trainer.train_step.route(batch) + training_step += 1 + + # Push updated weights (TorchStore handles this) + await trainer.push_weights.route(training_step) + # Broadcast to all policy replicas + await policy.update_weights.fanout(training_step) +``` + +### Synchronous RL + +The same `generate_episode()` function works for on-policy algorithms like PPO - just compose it differently: + +```python +async def synchronous_rl(batch_size: int): + """Synchronous on-policy RL: collect batch, then train.""" + version = 0 + + while True: + # Collect a full batch with current policy version + for _ in range(batch_size): + await generate_episode(dataloader, policy, reward, replay_buffer) + + # Sample the batch we just collected + batch = await replay_buffer.sample.route( + curr_policy_version=version, + batch_size=batch_size + ) + + # Train on the complete batch + loss = await trainer.train_step.route(batch) + + # Update weights in lockstep + await trainer.push_weights.route(version + 1) + await policy.update_weights.fanout(version + 1) + version += 1 +``` + +**The Power of Composition**: Write your rollout logic once, compose it into any 
paradigm - on-policy, off-policy, or anywhere in between. + +## Extensible Environments + +RL often requires interacting with environments beyond text generation - executing code, using tools, running simulations. TorchForge makes these first-class citizens through the same service abstraction. + +### Code Execution + +For RL on coding tasks (RLVR - Reinforcement Learning with Verifiable Rewards): + +```python +# Lightweight CPU-only service for parallel execution +coder = SandboxedPythonCoder.options( + procs=1, + with_gpus=False, + num_replicas=16 +).as_service() + +# In your RL code +async def generate_episode(): + prompt = await dataloader.sample.route() + code = await policy.generate.route(prompt) + + # Execute safely in sandbox + stdout, stderr = await coder.execute.route(code) + reward = 1.0 if stderr == "" else 0.0 # Reward based on execution + + await replay_buffer.add.route(Episode(...)) +``` + +### Tool Integration + +Services make tools ephemeral - spawn them with your job, scale them independently, tear down when finished. The same coordination primitives work for any environment type. + +This pattern extends naturally to **agentic workflows** - agents that interact with tools, query APIs, and navigate complex environments while learning from outcomes. ## Resource Management -Effective resource management is crucial for training **LLMs (Large Language Models)** - deep learning models trained on vast amounts of text data, capable of understanding and generating human-like text. +Effective resource management is crucial for training large models. ### Memory Optimization -TorchForge employs several memory optimization techniques: - **Gradient Checkpointing** -: Trade computation for memory by recomputing activations during the backward pass instead of storing them +: Trade computation for memory by recomputing activations during backward pass **Mixed Precision** -: Use FP16/BF16 for reduced memory footprint while maintaining model quality +: Use FP16/BF16 for reduced memory footprint while maintaining quality **Activation Offloading** -: Move activations to CPU when not needed on GPU, freeing up device memory +: Move activations to CPU when not needed on GPU **Parameter Sharding** -: Distribute model parameters across devices using techniques like FSDP +: Distribute model parameters across devices using FSDP ### Compute Optimization -Maximize GPU utilization through: - **Asynchronous Execution** : Overlap communication with computation to hide latency @@ -144,81 +412,72 @@ Maximize GPU utilization through: : Balance memory usage and throughput for optimal training speed **Dynamic Batching** -: Group requests efficiently for inference to maximize hardware utilization +: Group requests efficiently for inference (vLLM does this) **Kernel Fusion** -: Combine operations to reduce memory bandwidth requirements +: Combine operations to reduce memory bandwidth (torch.compile helps) -## Integration Points +## Distributed Training Strategies -TorchForge integrates with several PyTorch ecosystem projects to provide comprehensive functionality: +TorchForge leverages multiple parallelism strategies through TorchTitan: -### PyTorch Nightly +### Data Parallelism -Built on the latest PyTorch features, including: -- Native DTensor support for distributed tensors -- Compiled mode optimizations -- Advanced memory management +Each GPU processes different batches using the same model replica. Gradients are synchronized via all-reduce operations. 
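The synchronization step can be pictured with a tiny `torch.distributed` sketch (illustrative only; DDP and FSDP perform this for you during `backward()`):

```python
import torch.distributed as dist

# Inside a torchrun-launched process group, after loss.backward() on each rank;
# `model` is the local replica held by this rank.
for param in model.parameters():
    if param.grad is not None:
        dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)  # sum gradients across ranks
        param.grad /= dist.get_world_size()                # average them
```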
-### Monarch +### FSDP (Fully Sharded Data Parallel) -**Monarch** is a Meta-developed optimization framework for PyTorch that TorchForge depends on for certain operations, providing enhanced optimization capabilities. +**Sharding** splits model parameters, gradients, and optimizer states across multiple devices, dramatically reducing memory per GPU. FSDP is the strategy that enables training models larger than single-GPU memory. -### vLLM +### Tensor Parallelism -**vLLM** is a fast and memory-efficient inference engine for large language models that TorchForge integrates for **generation** (producing text output from a language model given an input prompt): -- Paged attention for memory efficiency -- Continuous batching for throughput -- Speculative decoding support +Split individual tensors (like large weight matrices) across devices. Each device computes a portion of the operation. -### TorchTitan +### Pipeline Parallelism -**TorchTitan** is a PyTorch-based framework that TorchForge builds upon for certain distributed training capabilities, leveraging its foundations for scaling to large clusters. +Partition model layers across devices, pipeline micro-batches through the stages for efficient utilization. -## Configuration Management +## Key Abstractions -TorchForge uses a hierarchical configuration system with multiple layers of settings: +Understanding these core abstractions helps you use TorchForge effectively: -1. **Default Configs**: Sensible defaults for common scenarios -2. **Model Configs**: Pre-configured setups for popular models -3. **User Configs**: Custom overrides for specific needs -4. **Runtime Configs**: Dynamic adjustments during execution +### Actor -See [Usage](usage.md) for detailed configuration examples. +A component that encapsulates a model along with its execution logic. Actors provide isolation (independent resources), flexibility (different parallelism strategies), and composability (combine to create complex pipelines). -## Training Lifecycle +### Service -Understanding the complete training lifecycle helps you effectively use TorchForge: +A managed group of actor replicas with built-in routing, load balancing, and fault tolerance. Services handle operational complexity so your RL code stays clean. -### Checkpointing +### DTensor (Distributed Tensor) -A **checkpoint** is a saved snapshot of model weights and optimizer state, allowing training to be resumed or the model to be deployed for inference. Regular checkpointing provides fault tolerance and enables experimentation with different training configurations. +A tensor sharded across multiple devices. TorchStore handles resharding DTensors between different topologies automatically. -### Warmup +### Episode -**Warmup** is a training phase where the learning rate gradually increases from a small value to the target learning rate, helping stabilize early training and prevent gradient explosions. +A complete RL interaction sequence - prompt, response, reward, and metadata. Episodes flow through your system from generation to training. -### Loss Functions +### Replay Buffer -TorchForge provides specialized [loss functions](api_losses.md) for post-training: -- **Policy Gradient Losses**: For reinforcement learning optimization -- **Regularization Terms**: KL divergence constraints -- **Multi-Objective Losses**: Combine multiple training signals +Stores episodes for training. Can be implemented with various strategies (FIFO, prioritized, etc.) and integrates with TorchStore for efficient storage. 
## Best Practices ### Model Selection + - Start with smaller models for prototyping - Use pre-configured model setups when available - Validate configurations before large-scale training ### Data Preparation + - Ensure balanced and diverse training data - Implement proper train/validation splits - Monitor data quality throughout training - Verify token distributions match expectations ### Training Strategy + - Begin with SFT before attempting GRPO - Use gradient accumulation for larger effective batch sizes - Monitor KL divergence to prevent policy collapse @@ -226,22 +485,32 @@ TorchForge provides specialized [loss functions](api_losses.md) for post-trainin - Apply warmup schedules for stable training ### Resource Optimization + - Profile memory usage to identify bottlenecks - Tune batch sizes for your hardware configuration - Consider mixed precision training to reduce memory - Use appropriate parallelism strategies for your model size ### Debugging + - Start with single-GPU training to isolate issues - Enable verbose logging for distributed runs - Use profiling tools to identify bottlenecks - Validate data pipelines before full training - Monitor loss curves and generation quality +## Validation + +TorchForge has been validated in real-world deployments: + +- **Stanford Collaboration**: Integration with the Weaver weak verifier project, training models that hill-climb on challenging reasoning benchmarks (MATH, GPQA) +- **CoreWeave**: Large-scale training runs on 512 H100 GPU clusters with smooth, efficient performance +- **Scale**: Tested across hundreds of GPUs with continuous rollouts and asynchronous training + ## See Also -- {doc}`getting_started` - Installation and setup guide -- {doc}`usage` - Practical usage examples +- {doc}`getting_started` - Installation, setup, and first training run +- {doc}`usage` - Practical usage examples and configuration patterns - {doc}`tutorials` - Step-by-step guides - {doc}`api` - Complete API reference - {doc}`faq` - Common questions and troubleshooting diff --git a/docs/source/getting_started.md b/docs/source/getting_started.md index 6012497c0..4d8b1580f 100644 --- a/docs/source/getting_started.md +++ b/docs/source/getting_started.md @@ -1,10 +1,10 @@ # Getting Started -Welcome to TorchForge! This guide will walk you through installing TorchForge, verifying your setup, and running your first training job. +Welcome to TorchForge! This guide will walk you through installing TorchForge, understanding its dependencies, verifying your setup, and running your first training job. -## Prerequisites +## Prerequisites -Before installing TorchForge, ensure your system meets the following requirements: +Before installing TorchForge, ensure your system meets the following requirements. ### System Requirements @@ -32,6 +32,79 @@ Before installing TorchForge, ensure your system meets the following requirement - Usually pre-installed on Linux systems - Verify with: `git --version` +## Understanding TorchForge's Dependencies + +TorchForge is built on a carefully curated stack of components, each solving specific challenges in distributed RL. Understanding these dependencies helps you troubleshoot issues and customize your setup. + +### Monarch: The Distributed Foundation + +**What it is:** Monarch is a PyTorch-native distributed programming framework that brings single-controller orchestration to entire clusters. 
+ +**Why TorchForge needs it:** +- **Single-Controller Model**: Write code that looks like a single Python program but scales to thousands of GPUs +- **Actor Meshes**: Organize processes and actors into scalable, multi-dimensional arrays +- **Fault Tolerance**: Progressive fault handling with fast failure detection and recovery +- **RDMA Support**: Direct GPU-to-GPU memory transfers for efficient data movement + +**What it solves:** Traditional SPMD (Single Program, Multiple Data) approaches require complex coordination logic in your code. Monarch abstracts this away, letting you write RL algorithms naturally while it handles distributed complexity. + +**Technical details:** Monarch is implemented with a Python frontend and a Rust backend for performance and robustness. It provides: +- Scalable messaging with multicast trees +- Multipart messaging for zero-copy data transfers +- Integration with PyTorch's distributed primitives + +### vLLM: High-Performance Inference + +**What it is:** A fast and memory-efficient inference engine optimized for large language models. + +**Why TorchForge needs it:** +- **PagedAttention**: Memory-efficient attention mechanism that reduces memory fragmentation +- **Continuous Batching**: Dynamic batching that maximizes GPU utilization +- **High Throughput**: Handles generation for multiple concurrent rollouts efficiently +- **Production-Ready**: Battle-tested at scale with proven performance + +**What it solves:** In RL for LLMs, policy generation is often the bottleneck. Autoregressive generation is costly, and blocking training on it kills throughput. vLLM enables fast, efficient inference that doesn't bottleneck your training loop. + +**Technical details:** vLLM version 0.10.0+ is required. TorchForge integrates directly with vLLM's engine, giving you access to customize generation strategies, memory management, and inference logic. + +### TorchTitan: Production Training Infrastructure + +**What it is:** Meta's production-grade LLM training platform with advanced parallelism support. + +**Why TorchForge needs it:** +- **FSDP (Fully Sharded Data Parallel)**: Shard parameters, gradients, and optimizer states across GPUs +- **Pipeline Parallelism**: Split model layers across devices with efficient micro-batching +- **Tensor Parallelism**: Split individual tensors across devices for very large models +- **Proven at Scale**: Used to train Llama models on thousands of GPUs + +**What it solves:** Modern models are too large to fit on single GPUs. TorchTitan provides the sophisticated sharding and parallelism strategies needed to train them efficiently, with optimizations battle-tested in production. + +**Technical details:** TorchForge integrates with TorchTitan for training step logic and sharding strategies, enabling experimentation without framework constraints. + +### TorchStore: Weight Synchronization + +**What it is:** A distributed, in-memory key-value store for PyTorch tensors, built on Monarch. + +**Why TorchForge needs it:** +- **Automatic Resharding**: Handles complex weight transfer between different sharding strategies +- **DTensor Support**: Native support for distributed tensors +- **RDMA Transfers**: High-bandwidth weight movement without synchronous GPU transfers +- **Asynchronous Updates**: Training and inference can read/write weights independently + +**What it solves:** In async RL, new policy weights must propagate to all inference replicas. For a 70B parameter model across 16 replicas, this means moving hundreds of gigabytes. 
TorchStore makes this efficient, handling resharding automatically and using RDMA for fast transfers. + +**Technical details:** TorchStore provides a simple key-value interface while optimizing data movement behind the scenes, staying distributed across the cluster until requested. + +### PyTorch Nightly: Cutting-Edge Features + +**Why Nightly:** TorchForge requires the latest PyTorch features: +- **Native DTensor Support**: Distributed tensors that span multiple devices +- **Compiled Mode Optimizations**: Performance improvements through torch.compile +- **Advanced Memory Management**: Latest FSDP and memory optimization features +- **Bug Fixes**: Continuous improvements to distributed training primitives + +**Installation note:** The installation script provides pre-built wheels with PyTorch nightly already included. + ## Installation TorchForge offers two installation methods. Choose the one that fits your setup: @@ -64,6 +137,7 @@ The installation script will: - Install system dependencies using DNF (or your package manager) - Download pre-built wheels for PyTorch nightly, Monarch, vLLM, and TorchTitan - Install TorchForge and all Python dependencies +- Configure the environment for GPU training ```{tip} **Using sudo instead of conda**: If you prefer installing system packages directly rather than through conda, use: @@ -75,7 +149,9 @@ The installation script will: Test that TorchForge is properly installed: ```bash -python -c "import forge; print(forge.__version__)" +python -c "import forge; print(f'TorchForge version: {forge.__version__}')" +python -c "import monarch; print('Monarch: OK')" +python -c "import vllm; print(f'vLLM version: {vllm.__version__}')" ``` ### Method 2: Meta Internal Installation (Alternative) @@ -148,14 +224,36 @@ python -c "import torch; print(f'CUDA version: {torch.version.cuda}')" Expected output: `CUDA version: 12.8` (or higher) -### Check Dependencies +### Check All Dependencies + +```bash +# Check core components +python -c "import torch, forge, monarch, vllm; print('All imports successful')" + +# Check specific versions +python -c " +import torch +import forge +import vllm + +print(f'PyTorch: {torch.__version__}') +print(f'TorchForge: {forge.__version__}') +print(f'vLLM: {vllm.__version__}') +print(f'CUDA: {torch.version.cuda}') +print(f'GPUs: {torch.cuda.device_count()}') +" +``` + +### Verify Monarch ```bash -# Check vLLM -python -c "import vllm; print(f'vLLM: {vllm.__version__}')" +python -c " +from monarch.actor import Actor, this_host -# Check TorchForge modules -python -c "from forge import actors, controller, data; print('All modules imported successfully')" +# Test basic Monarch functionality +procs = this_host().spawn_procs({'gpus': 1}) +print('Monarch: Process spawning works') +" ``` ## Quick Start Examples @@ -164,7 +262,7 @@ Now that TorchForge is installed, let's run some training examples: ### Example 1: Supervised Fine-Tuning (SFT) -This example fine-tunes Llama 3 8B on your data. **Requires: 2+ GPUs** +Fine-tune Llama 3 8B on your data. 
**Requires: 2+ GPUs** **Step 1: Download the Model** @@ -190,6 +288,8 @@ uv run forge run --nproc_per_node 2 \ - `--nproc_per_node 2`: Use 2 GPUs for training - `apps/sft/main.py`: SFT training script - `--config apps/sft/llama3_8b.yaml`: Configuration file with hyperparameters +- **TorchTitan** handles model sharding across the 2 GPUs +- **Monarch** coordinates the distributed training **Expected Output:** ``` @@ -209,9 +309,11 @@ python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml ``` **What's Happening:** -- GPU 0: Policy model (being trained) +- GPU 0: Policy model (being trained, powered by TorchTitan) - GPU 1: Reference model (frozen baseline) -- GPU 2: Reward model (scoring outputs) +- GPU 2: Reward model (scoring outputs, powered by vLLM) +- **Monarch** orchestrates all three components +- **TorchStore** handles weight synchronization from training to inference **Expected Output:** ``` @@ -231,7 +333,7 @@ Generate text using a trained model: python -m apps.vllm.main --config apps/vllm/llama3_8b.yaml ``` -This loads your model and starts an interactive generation session. +This loads your model with vLLM and starts an interactive generation session. ## Understanding Configuration Files @@ -250,7 +352,7 @@ training: gradient_accumulation_steps: 4 distributed: - strategy: fsdp + strategy: fsdp # Managed by TorchTitan precision: bf16 checkpointing: @@ -261,7 +363,7 @@ checkpointing: **Key Sections:** - **model**: Model path and settings - **training**: Hyperparameters like batch size and learning rate -- **distributed**: Multi-GPU strategy and precision +- **distributed**: Multi-GPU strategy (FSDP, tensor parallel, etc.) handled by TorchTitan - **checkpointing**: Where and when to save model checkpoints See {doc}`usage` for detailed configuration options. @@ -311,11 +413,22 @@ python -c "import torch; print(torch.cuda.get_device_capability())" TorchForge requires compute capability 7.0 or higher (Volta architecture or newer). +### Issue: Monarch actor spawn failures + +**Symptom**: Errors like "Failed to spawn actors" or "Process allocation failed" + +**Solution**: Verify your GPU count matches your configuration: +```bash +nvidia-smi # Check available GPUs +``` + +Ensure your config requests fewer processes than available GPUs. + ## Next Steps Now that you have TorchForge installed and verified: -1. **Learn the Concepts**: Read {doc}`concepts` to understand TorchForge's architecture +1. **Learn the Concepts**: Read {doc}`concepts` to understand TorchForge's architecture, including Monarch, Services, and TorchStore 2. **Explore Examples**: Check the `apps/` directory for more training examples 3. **Customize Training**: See {doc}`usage` for configuration patterns 4. **Read Tutorials**: Follow {doc}`tutorials` for step-by-step guides @@ -328,25 +441,50 @@ If you encounter issues: 1. **Check the FAQ**: {doc}`faq` covers common questions and solutions 2. **Search Issues**: Look through [GitHub Issues](https://github.com/meta-pytorch/forge/issues) 3. 
**File a Bug Report**: Create a new issue with: - - Your system configuration + - Your system configuration (output of diagnostic command below) - Full error message - Steps to reproduce - Expected vs actual behavior -```{tip} -When filing issues, include the output of: +**Diagnostic command:** ```bash -python -c "import torch; import forge; print(f'PyTorch: {torch.__version__}\\nForge: {forge.__version__}\\nCUDA: {torch.version.cuda}\\nGPUs: {torch.cuda.device_count()}')" -``` +python -c " +import torch +import forge + +try: + import monarch + monarch_status = 'OK' +except Exception as e: + monarch_status = str(e) + +try: + import vllm + vllm_version = vllm.__version__ +except Exception as e: + vllm_version = str(e) + +print(f'PyTorch: {torch.__version__}') +print(f'TorchForge: {forge.__version__}') +print(f'Monarch: {monarch_status}') +print(f'vLLM: {vllm_version}') +print(f'CUDA: {torch.version.cuda}') +print(f'GPUs: {torch.cuda.device_count()}') +" ``` +Include this output in your bug reports! + ## Additional Resources - **GitHub Repository**: [github.com/meta-pytorch/forge](https://github.com/meta-pytorch/forge) - **Example Notebooks**: Check `demo.ipynb` in the repository root - **Contributing Guide**: [CONTRIBUTING.md](https://github.com/meta-pytorch/forge/blob/main/CONTRIBUTING.md) - **Code of Conduct**: [CODE_OF_CONDUCT.md](https://github.com/meta-pytorch/forge/blob/main/CODE_OF_CONDUCT.md) +- **Monarch Documentation**: [meta-pytorch.org/monarch](https://meta-pytorch.org/monarch) +- **vLLM Documentation**: [docs.vllm.ai](https://docs.vllm.ai) +- **TorchTitan**: [github.com/pytorch/torchtitan](https://github.com/pytorch/torchtitan) --- -**Ready to start training?** Head to {doc}`usage` for practical configuration examples and workflows. +**Ready to start training?** Head to {doc}`usage` for practical configuration examples and workflows, or dive into {doc}`concepts` to understand how all the pieces work together. diff --git a/docs/source/index.md b/docs/source/index.md index 8e9510b39..67b96ce1d 100644 --- a/docs/source/index.md +++ b/docs/source/index.md @@ -1,149 +1,233 @@ # TorchForge Documentation -**TorchForge** is a PyTorch-native platform built for post-training generative AI models and agentic development. Designed with the principle that researchers should write algorithms, not infrastructure. +**TorchForge** is a PyTorch-native library for RL post-training and agentic development. Built on the principle that **researchers should write algorithms, not infrastructure**. ```{note} -**Early Development:** TorchForge is currently experimental. Expect bugs, incomplete features, and API changes. Please file issues on [GitHub](https://github.com/meta-pytorch/forge) for bug reports and feature requests. +**Experimental Status:** TorchForge is currently in early development. Expect bugs, incomplete features, and API changes. Please file issues on [GitHub](https://github.com/meta-pytorch/forge) for bug reports and feature requests. ``` ## Why TorchForge? -TorchForge introduces a "service"-centric architecture that provides the right abstractions for distributed complexity: +Reinforcement Learning has become essential to frontier AI - from instruction following and reasoning to complex research capabilities. But infrastructure complexity often dominates the actual research. 
-- **Usability for Rapid Research**: Isolate your RL algorithms from infrastructure concerns -- **Hackability for Power Users**: Modify any part of the RL loop without touching infrastructure code -- **Scalability**: Seamlessly shift between async and synchronous training across thousands of GPUs +TorchForge lets you **express RL algorithms as naturally as pseudocode**, while powerful infrastructure handles distribution, fault tolerance, and optimization underneath. -## Core Capabilities +### Core Design Principles + +- **Algorithms, Not Infrastructure**: Write your RL logic without distributed systems code +- **Any Degree of Asynchrony**: From fully synchronous PPO to fully async off-policy training +- **Composable Components**: Mix and match proven frameworks (vLLM, TorchTitan) with custom logic +- **Built on Solid Foundations**: Leverages Monarch's single-controller model for simplified distributed programming + +## Foundation: The Technology Stack + +TorchForge is built on carefully selected, battle-tested components: ::::{grid} 1 1 2 2 :gutter: 3 -:::{grid-item-card} Supervised Fine-Tuning (SFT) -:link: concepts -:link-type: doc +:::{grid-item-card} **Monarch** +:link: https://meta-pytorch.org/monarch -Adapt pre-trained models to specific tasks using labeled data. Perfect for creating specialized models from foundation models. +Single-controller distributed programming framework that orchestrates clusters like you'd program a single machine. Provides actor meshes, fault tolerance, and RDMA-based data transfers. + +**Why it matters:** Eliminates SPMD complexity, making distributed RL tractable ::: -:::{grid-item-card} Reinforcement Learning (GRPO) -:link: concepts -:link-type: doc +:::{grid-item-card} **vLLM** +:link: https://docs.vllm.ai + +High-throughput, memory-efficient inference engine with PagedAttention and continuous batching. -Advanced policy optimization using Generalized Reward Policy Optimization for aligning models with human preferences and reward functions. +**Why it matters:** Handles policy generation efficiently at scale ::: -:::{grid-item-card} Distributed Training -:link: concepts -:link-type: doc +:::{grid-item-card} **TorchTitan** +:link: https://github.com/pytorch/torchtitan + +Meta's production-grade LLM training platform with FSDP, pipeline parallelism, and tensor parallelism. -Built-in support for multi-GPU and multi-node training with FSDP, tensor parallelism, and pipeline parallelism. +**Why it matters:** Battle-tested training infrastructure proven at scale ::: -:::{grid-item-card} Integration Ecosystem -:link: concepts -:link-type: doc +:::{grid-item-card} **TorchStore** -Seamlessly integrates with PyTorch nightly, Monarch, vLLM, and TorchTitan for a complete training and inference pipeline. +Distributed, in-memory key-value store for PyTorch tensors built on Monarch, optimized for weight synchronization with automatic DTensor resharding. + +**Why it matters:** Solves the weight transfer bottleneck in async RL ::: :::: -## Getting Started Paths +## What You Can Build + +::::{grid} 1 1 2 3 +:gutter: 2 + +:::{grid-item-card} Supervised Fine-Tuning +Adapt foundation models to specific tasks using labeled data with efficient multi-GPU training. +::: + +:::{grid-item-card} GRPO Training +Train models with Generalized Reward Policy Optimization for aligning with human preferences. +::: + +:::{grid-item-card} Asynchronous RL +Continuous rollout generation with non-blocking training for maximum throughput. 
+::: + +:::{grid-item-card} Code Execution +Safe, sandboxed code execution environments for RL on coding tasks (RLVR). +::: + +:::{grid-item-card} Tool Integration +Extensible environment system for agents that interact with tools and APIs. +::: + +:::{grid-item-card} Custom Workflows +Build your own components and compose them naturally with existing infrastructure. +::: + +:::: + +## Requirements at a Glance + +Before diving in, ensure your system meets these requirements: + +| Component | Requirement | Why It's Needed | +|-----------|-------------|-----------------| +| **Operating System** | Linux (Fedora/Ubuntu) | Dependency compatibility | +| **Python** | 3.10+ | Core runtime | +| **CUDA** | 12.8+ | GPU acceleration | +| **GPUs** | 2+ for SFT, 3+ for GRPO | Distributed training & separate policy/ref/reward models | +| **PyTorch** | Nightly build | Latest distributed features (DTensor, FSDP) | +| **Monarch** | Pre-packaged wheel | Distributed orchestration and actor system | +| **vLLM** | v0.10.0+ | Fast inference with PagedAttention | +| **TorchTitan** | Latest | Production training infrastructure | + +See {doc}`getting_started` for detailed installation instructions. + +## Quick Start -Choose your journey based on your experience level and goals: +Here's what training looks like with TorchForge: + +```bash +# Install dependencies +conda create -n forge python=3.10 +conda activate forge +git clone https://github.com/meta-pytorch/forge +cd forge +./scripts/install.sh + +# Download a model +uv run forge download meta-llama/Meta-Llama-3.1-8B-Instruct \ + --output-dir /tmp/Meta-Llama-3.1-8B-Instruct + +# Run SFT training (requires 2+ GPUs) +uv run forge run --nproc_per_node 2 \ + apps/sft/main.py --config apps/sft/llama3_8b.yaml + +# Run GRPO training (requires 3+ GPUs) +python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml +``` + +## Writing RL Code + +With TorchForge, your RL logic looks like pseudocode: + +```python +async def generate_episode(dataloader, policy, reward, replay_buffer): + # Sample a prompt + prompt, target = await dataloader.sample.route() + + # Generate response + response = await policy.generate.route(prompt) + + # Score the response + reward_value = await reward.evaluate_response.route( + prompt=prompt, + response=response.text, + target=target + ) + + # Store for training + await replay_buffer.add.route( + Episode(prompt_ids=response.prompt_ids, + response_ids=response.token_ids, + reward=reward_value) + ) +``` + +No retry logic, no resource management, no synchronization code - just your algorithm. + +## Documentation Paths + +Choose your learning path: ::::{grid} 1 1 2 2 :gutter: 3 -:::{grid-item-card} 🚀 New to TorchForge? +:::{grid-item-card} 🚀 Getting Started :link: getting_started :link-type: doc -Start here for installation instructions, system requirements, and your first training run. +Installation, prerequisites, verification, and your first training run. **Time to first run: ~15 minutes** ::: -:::{grid-item-card} 📚 Understanding the Architecture +:::{grid-item-card} 🧠 Core Concepts :link: concepts :link-type: doc -Learn about TorchForge's architecture, key concepts, and how the components work together. +Architecture, Monarch integration, Services, TorchStore, and how everything works together. -**Recommended for researchers** +**For understanding the system** ::: -:::{grid-item-card} 💻 Practical Usage +:::{grid-item-card} 💻 Usage Patterns :link: usage :link-type: doc -Configuration patterns, common workflows, and real-world usage scenarios. 
+Configuration examples, common workflows, and practical scenarios. -**For hands-on development** +**For day-to-day development** ::: :::{grid-item-card} 📖 API Reference :link: api :link-type: doc -Complete API documentation for all modules, classes, and functions. +Complete API documentation for customization and extension. -**For in-depth customization** +**For deep integration** ::: :::: -## Quick Example - -Here's what a simple SFT training run looks like: - -```bash -# Download a model -uv run forge download meta-llama/Meta-Llama-3.1-8B-Instruct \ - --output-dir /tmp/Meta-Llama-3.1-8B-Instruct - -# Run supervised fine-tuning -uv run forge run --nproc_per_node 2 \ - apps/sft/main.py --config apps/sft/llama3_8b.yaml -``` - -See {doc}`getting_started` for complete installation and setup instructions. - -## System Requirements - -| Component | Requirement | -|-----------|-------------| -| **Python** | 3.10+ | -| **Operating System** | Linux (tested on Fedora/Ubuntu) | -| **GPU** | NVIDIA with CUDA support | -| **Minimum GPUs** | 2 for SFT, 3 for GRPO | -| **Dependencies** | PyTorch nightly, Monarch, vLLM, TorchTitan | - -## Supported Models +## Validation & Partnerships -TorchForge includes pre-configured setups for popular models: +TorchForge has been validated in real-world deployments: -- **Llama 3 8B**: Production-ready configuration for supervised fine-tuning -- **Qwen 3.1 7B**: Optimized settings for GRPO training -- **Qwen 3 8B**: Multi-node training configurations -- **Custom Models**: Extensible architecture for any transformer-based model +- **Stanford Collaboration**: Integration with the Weaver weak verifier project, training models that hill-climb on challenging reasoning benchmarks (MATH, GPQA) +- **CoreWeave**: Large-scale training on 512 H100 GPU clusters with smooth, efficient performance +- **Scale**: Tested across hundreds of GPUs with continuous rollouts and asynchronous training -## Community & Support +## Community -- **Documentation**: You're reading it! Use the navigation above to explore -- **GitHub Issues**: [Report bugs and request features](https://github.com/meta-pytorch/forge/issues) -- **Contributing**: See [CONTRIBUTING.md](https://github.com/meta-pytorch/forge/blob/main/CONTRIBUTING.md) +- **GitHub**: [meta-pytorch/forge](https://github.com/meta-pytorch/forge) +- **Issues**: [Report bugs and request features](https://github.com/meta-pytorch/forge/issues) +- **Contributing**: [CONTRIBUTING.md](https://github.com/meta-pytorch/forge/blob/main/CONTRIBUTING.md) - **Code of Conduct**: [CODE_OF_CONDUCT.md](https://github.com/meta-pytorch/forge/blob/main/CODE_OF_CONDUCT.md) ```{tip} -Signal your intention to contribute in the issue tracker before starting significant work to coordinate efforts with the maintainers. +Before starting significant work, signal your intention in the issue tracker to coordinate with maintainers. 
``` ## Documentation Contents ```{toctree} -:maxdepth: 1 +:maxdepth: 2 :caption: Documentation getting_started From a0b24120f0b0415df3fd387e038eae0b61ac0089 Mon Sep 17 00:00:00 2001 From: Alanna Burke Date: Fri, 10 Oct 2025 17:24:51 -0400 Subject: [PATCH 3/8] Update docs/source/getting_started.md Co-authored-by: Svetlana Karslioglu --- docs/source/getting_started.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/getting_started.md b/docs/source/getting_started.md index 4d8b1580f..6c7681c16 100644 --- a/docs/source/getting_started.md +++ b/docs/source/getting_started.md @@ -1,6 +1,6 @@ # Getting Started -Welcome to TorchForge! This guide will walk you through installing TorchForge, understanding its dependencies, verifying your setup, and running your first training job. +This guide will walk you through installing TorchForge, understanding its dependencies, verifying your setup, and running your first training job. ## Prerequisites From b6d466c23d93816a2fb783edf1355f1eeff1e79b Mon Sep 17 00:00:00 2001 From: Alanna Burke Date: Fri, 10 Oct 2025 17:24:59 -0400 Subject: [PATCH 4/8] Update docs/source/index.md Co-authored-by: Svetlana Karslioglu --- docs/source/index.md | 1 - 1 file changed, 1 deletion(-) diff --git a/docs/source/index.md b/docs/source/index.md index 67b96ce1d..cfabace79 100644 --- a/docs/source/index.md +++ b/docs/source/index.md @@ -242,7 +242,6 @@ faq * {ref}`genindex` - Index of all documented objects * {ref}`modindex` - Python module index -* {ref}`search` - Search the documentation --- From b5641759e3efcaac1c7c80effb665d8e158c9cb2 Mon Sep 17 00:00:00 2001 From: Alanna Burke Date: Fri, 10 Oct 2025 17:25:07 -0400 Subject: [PATCH 5/8] Update docs/source/index.md Co-authored-by: Svetlana Karslioglu --- docs/source/index.md | 1 - 1 file changed, 1 deletion(-) diff --git a/docs/source/index.md b/docs/source/index.md index cfabace79..4f02bedd9 100644 --- a/docs/source/index.md +++ b/docs/source/index.md @@ -224,7 +224,6 @@ TorchForge has been validated in real-world deployments: Before starting significant work, signal your intention in the issue tracker to coordinate with maintainers. ``` -## Documentation Contents ```{toctree} :maxdepth: 2 From 92ca62797be78a3495a80d51d2c6010ce4e32a39 Mon Sep 17 00:00:00 2001 From: Alanna Burke Date: Fri, 10 Oct 2025 18:59:57 -0400 Subject: [PATCH 6/8] Minor fixes and updates. --- docs/source/getting_started.md | 294 ++++++++++++++------------------- docs/source/index.md | 5 +- 2 files changed, 122 insertions(+), 177 deletions(-) diff --git a/docs/source/getting_started.md b/docs/source/getting_started.md index 4d8b1580f..6dc8d58dc 100644 --- a/docs/source/getting_started.md +++ b/docs/source/getting_started.md @@ -2,12 +2,10 @@ Welcome to TorchForge! This guide will walk you through installing TorchForge, understanding its dependencies, verifying your setup, and running your first training job. -## Prerequisites +## System Requirements Before installing TorchForge, ensure your system meets the following requirements. -### System Requirements - | Component | Requirement | Notes | |-----------|-------------|-------| | **Operating System** | Linux (Fedora/Ubuntu/Debian) | MacOS and Windows not currently supported | @@ -18,25 +16,25 @@ Before installing TorchForge, ensure your system meets the following requirement | **RAM** | 32GB+ recommended | Depends on model size | | **Disk Space** | 50GB+ free | For models, datasets, and checkpoints | -### Required Tools +## Prerequisites -1. 
**Conda or Miniconda**: For environment management - - Download from [conda.io](https://docs.conda.io/en/latest/miniconda.html) +- **Conda or Miniconda**: For environment management + - Download from [conda.io](https://docs.conda.io/en/latest/miniconda.html) -2. **GitHub CLI (gh)**: Required for downloading pre-packaged dependencies - - Install instructions: [github.com/cli/cli#installation](https://github.com/cli/cli#installation) - - After installing, authenticate with: `gh auth login` - - You can use either HTTPS or SSH as the authentication protocol +- **GitHub CLI (gh)**: Required for downloading pre-packaged dependencies + - Install instructions: [github.com/cli/cli#installation](https://github.com/cli/cli#installation) + - After installing, authenticate with: `gh auth login` + - You can use either HTTPS or SSH as the authentication protocol -3. **Git**: For cloning the repository - - Usually pre-installed on Linux systems - - Verify with: `git --version` +- **Git**: For cloning the repository + - Usually pre-installed on Linux systems + - Verify with: `git --version` ## Understanding TorchForge's Dependencies TorchForge is built on a carefully curated stack of components, each solving specific challenges in distributed RL. Understanding these dependencies helps you troubleshoot issues and customize your setup. -### Monarch: The Distributed Foundation +### Monarch **What it is:** Monarch is a PyTorch-native distributed programming framework that brings single-controller orchestration to entire clusters. @@ -53,7 +51,7 @@ TorchForge is built on a carefully curated stack of components, each solving spe - Multipart messaging for zero-copy data transfers - Integration with PyTorch's distributed primitives -### vLLM: High-Performance Inference +### vLLM **What it is:** A fast and memory-efficient inference engine optimized for large language models. @@ -67,7 +65,7 @@ TorchForge is built on a carefully curated stack of components, each solving spe **Technical details:** vLLM version 0.10.0+ is required. TorchForge integrates directly with vLLM's engine, giving you access to customize generation strategies, memory management, and inference logic. -### TorchTitan: Production Training Infrastructure +### TorchTitan **What it is:** Meta's production-grade LLM training platform with advanced parallelism support. @@ -81,7 +79,7 @@ TorchForge is built on a carefully curated stack of components, each solving spe **Technical details:** TorchForge integrates with TorchTitan for training step logic and sharding strategies, enabling experimentation without framework constraints. -### TorchStore: Weight Synchronization +### TorchStore **What it is:** A distributed, in-memory key-value store for PyTorch tensors, built on Monarch. @@ -95,7 +93,7 @@ TorchForge is built on a carefully curated stack of components, each solving spe **Technical details:** TorchStore provides a simple key-value interface while optimizing data movement behind the scenes, staying distributed across the cluster until requested. -### PyTorch Nightly: Cutting-Edge Features +### PyTorch Nightly **Why Nightly:** TorchForge requires the latest PyTorch features: - **Native DTensor Support**: Distributed tensors that span multiple devices @@ -107,198 +105,148 @@ TorchForge is built on a carefully curated stack of components, each solving spe ## Installation -TorchForge offers two installation methods. 
Choose the one that fits your setup: +TorchForge uses pre-packaged wheels for all dependencies, making installation faster and more reliable. -### Method 1: Basic Installation (Recommended) +1. **Clone the Repository** -This method uses pre-packaged wheels for all dependencies, making installation faster and more reliable. + ```bash + git clone https://github.com/meta-pytorch/forge.git + cd forge + ``` -**Step 1: Clone the Repository** +2. **Create Conda Environment** -```bash -git clone https://github.com/meta-pytorch/forge.git -cd forge -``` + ```bash + conda create -n forge python=3.10 + conda activate forge + ``` -**Step 2: Create Conda Environment** +3. **Run Installation Script** -```bash -conda create -n forge python=3.10 -conda activate forge -``` + ```bash + ./scripts/install.sh + ``` -**Step 3: Run Installation Script** + The installation script will: + - Install system dependencies using DNF (or your package manager) + - Download pre-built wheels for PyTorch nightly, Monarch, vLLM, and TorchTitan + - Install TorchForge and all Python dependencies + - Configure the environment for GPU training -```bash -./scripts/install.sh -``` + ```{tip} + **Using sudo instead of conda**: If you prefer installing system packages directly rather than through conda, use: + `./scripts/install.sh --use-sudo` + ``` -The installation script will: -- Install system dependencies using DNF (or your package manager) -- Download pre-built wheels for PyTorch nightly, Monarch, vLLM, and TorchTitan -- Install TorchForge and all Python dependencies -- Configure the environment for GPU training +4. **Verify Installation** -```{tip} -**Using sudo instead of conda**: If you prefer installing system packages directly rather than through conda, use: -`./scripts/install.sh --use-sudo` -``` + Test that TorchForge is properly installed: -**Step 4: Verify Installation** + ```bash + python -c "import forge; print(f'TorchForge version: {forge.__version__}')" + python -c "import monarch; print('Monarch: OK')" + python -c "import vllm; print(f'vLLM version: {vllm.__version__}')" + ``` -Test that TorchForge is properly installed: - -```bash -python -c "import forge; print(f'TorchForge version: {forge.__version__}')" -python -c "import monarch; print('Monarch: OK')" -python -c "import vllm; print(f'vLLM version: {vllm.__version__}')" -``` - -### Method 2: Meta Internal Installation (Alternative) - -For Meta employees or those with access to Meta's internal tools: - -**Step 1: Install uv Package Manager** - -```bash -curl -LsSf https://astral.sh/uv/install.sh | sh -``` - -**Step 2: Clone and Setup** - -```bash -git clone https://github.com/meta-pytorch/forge -cd forge -uv sync --all-extras -source .venv/bin/activate -``` - -**Step 3: Configure CUDA** - -```bash -# Install CUDA if needed -feature install --persist cuda_12_9 - -# Set environment variables -export CUDA_VERSION=12.9 -export NVCC=/usr/local/cuda-$CUDA_VERSION/bin/nvcc -export CUDA_NVCC_EXECUTABLE=/usr/local/cuda-$CUDA_VERSION/bin/nvcc -export CUDA_HOME=/usr/local/cuda-$CUDA_VERSION -export PATH="$CUDA_HOME/bin:$PATH" -export CUDA_INCLUDE_DIRS=$CUDA_HOME/include -export CUDA_CUDART_LIBRARY=$CUDA_HOME/lib64/libcudart.so -export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH -``` - -**Step 4: Build vLLM from Source** - -```bash -git clone https://github.com/vllm-project/vllm.git --branch v0.10.0 -cd vllm -python use_existing_torch.py -uv pip install -r requirements/build.txt -uv pip install --no-build-isolation -e . 
-``` - -```{warning} -When adding packages to `pyproject.toml`, use `uv sync --inexact` to avoid removing Monarch and vLLM. -``` + ```{warning} + When adding packages to `pyproject.toml`, use `uv sync --inexact` to avoid removing Monarch and vLLM. + ``` ## Verifying Your Setup After installation, verify that all components are working correctly: -### Check GPU Availability +1. **Check GPU Availability** -```bash -python -c "import torch; print(f'GPUs available: {torch.cuda.device_count()}')" -``` + ```bash + python -c "import torch; print(f'GPUs available: {torch.cuda.device_count()}')" + ``` -Expected output: `GPUs available: 2` (or more) + Expected output: `GPUs available: 2` (or more) -### Check CUDA Version +2. **Check CUDA Version** -```bash -python -c "import torch; print(f'CUDA version: {torch.version.cuda}')" -``` + ```bash + python -c "import torch; print(f'CUDA version: {torch.version.cuda}')" + ``` -Expected output: `CUDA version: 12.8` (or higher) + Expected output: `CUDA version: 12.8` (or higher) -### Check All Dependencies +3. **Check All Dependencies** -```bash -# Check core components -python -c "import torch, forge, monarch, vllm; print('All imports successful')" + ```bash + # Check core components + python -c "import torch, forge, monarch, vllm; print('All imports successful')" -# Check specific versions -python -c " -import torch -import forge -import vllm + # Check specific versions + python -c " + import torch + import forge + import vllm -print(f'PyTorch: {torch.__version__}') -print(f'TorchForge: {forge.__version__}') -print(f'vLLM: {vllm.__version__}') -print(f'CUDA: {torch.version.cuda}') -print(f'GPUs: {torch.cuda.device_count()}') -" -``` + print(f'PyTorch: {torch.__version__}') + print(f'TorchForge: {forge.__version__}') + print(f'vLLM: {vllm.__version__}') + print(f'CUDA: {torch.version.cuda}') + print(f'GPUs: {torch.cuda.device_count()}') + " + ``` -### Verify Monarch +4. **Verify Monarch** -```bash -python -c " -from monarch.actor import Actor, this_host + ```bash + python -c " + from monarch.actor import Actor, this_host -# Test basic Monarch functionality -procs = this_host().spawn_procs({'gpus': 1}) -print('Monarch: Process spawning works') -" -``` + # Test basic Monarch functionality + procs = this_host().spawn_procs({'gpus': 1}) + print('Monarch: Process spawning works') + " + ``` ## Quick Start Examples -Now that TorchForge is installed, let's run some training examples: +Now that TorchForge is installed, let's run some training examples. ### Example 1: Supervised Fine-Tuning (SFT) Fine-tune Llama 3 8B on your data. **Requires: 2+ GPUs** -**Step 1: Download the Model** - -```bash -uv run forge download meta-llama/Meta-Llama-3.1-8B-Instruct \ - --output-dir /tmp/Meta-Llama-3.1-8B-Instruct \ - --ignore-patterns "original/consolidated.00.pth" -``` - -```{note} -Model downloads require Hugging Face authentication. Run `huggingface-cli login` first if you haven't already. -``` - -**Step 2: Run Training** - -```bash -uv run forge run --nproc_per_node 2 \ - apps/sft/main.py \ - --config apps/sft/llama3_8b.yaml -``` - -**What's Happening:** -- `--nproc_per_node 2`: Use 2 GPUs for training -- `apps/sft/main.py`: SFT training script -- `--config apps/sft/llama3_8b.yaml`: Configuration file with hyperparameters -- **TorchTitan** handles model sharding across the 2 GPUs -- **Monarch** coordinates the distributed training - -**Expected Output:** -``` -Initializing process group... -Loading model from /tmp/Meta-Llama-3.1-8B-Instruct... -Starting training... 
-Epoch 1/10 | Step 100 | Loss: 2.45 | LR: 0.0001 -... -``` +1. **Download the Model** + + ```bash + uv run forge download meta-llama/Meta-Llama-3.1-8B-Instruct \ + --output-dir /tmp/Meta-Llama-3.1-8B-Instruct \ + --ignore-patterns "original/consolidated.00.pth" + ``` + + ```{note} + Model downloads require Hugging Face authentication. Run `huggingface-cli login` first if you haven't already. + ``` + +2. **Run Training** + + ```bash + uv run forge run --nproc_per_node 2 \ + apps/sft/main.py \ + --config apps/sft/llama3_8b.yaml + ``` + + **What's Happening:** + - `--nproc_per_node 2`: Use 2 GPUs for training + - `apps/sft/main.py`: SFT training script + - `--config apps/sft/llama3_8b.yaml`: Configuration file with hyperparameters + - **TorchTitan** handles model sharding across the 2 GPUs + - **Monarch** coordinates the distributed training + + **Expected Output:** + ``` + Initializing process group... + Loading model from /tmp/Meta-Llama-3.1-8B-Instruct... + Starting training... + Epoch 1/10 | Step 100 | Loss: 2.45 | LR: 0.0001 + ... + ``` ### Example 2: GRPO Training diff --git a/docs/source/index.md b/docs/source/index.md index 67b96ce1d..7c409041e 100644 --- a/docs/source/index.md +++ b/docs/source/index.md @@ -51,6 +51,7 @@ Meta's production-grade LLM training platform with FSDP, pipeline parallelism, a ::: :::{grid-item-card} **TorchStore** +:link: https://github.com/meta-pytorch/torchstore Distributed, in-memory key-value store for PyTorch tensors built on Monarch, optimized for weight synchronization with automatic DTensor resharding. @@ -224,11 +225,8 @@ TorchForge has been validated in real-world deployments: Before starting significant work, signal your intention in the issue tracker to coordinate with maintainers. ``` -## Documentation Contents - ```{toctree} :maxdepth: 2 -:caption: Documentation getting_started concepts @@ -242,7 +240,6 @@ faq * {ref}`genindex` - Index of all documented objects * {ref}`modindex` - Python module index -* {ref}`search` - Search the documentation --- From 34640e7b4fe986fa4cb6116de2a400841d47fcd0 Mon Sep 17 00:00:00 2001 From: Alanna Burke Date: Fri, 10 Oct 2025 19:17:54 -0400 Subject: [PATCH 7/8] Update docs/source/getting_started.md Co-authored-by: Svetlana Karslioglu --- docs/source/getting_started.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/docs/source/getting_started.md b/docs/source/getting_started.md index f8863eaca..55e480983 100644 --- a/docs/source/getting_started.md +++ b/docs/source/getting_started.md @@ -1,3 +1,7 @@ +--- +orphan: true +--- + # Getting Started This guide will walk you through installing TorchForge, understanding its dependencies, verifying your setup, and running your first training job. From 32c8d789074facc52584c4c77c53619feae74b75 Mon Sep 17 00:00:00 2001 From: Alanna Burke Date: Fri, 10 Oct 2025 20:49:48 -0400 Subject: [PATCH 8/8] Restructing info. 
--- docs/source/architecture.md | 256 ++++++++++++++++ docs/source/concepts.md | 499 +++++--------------------------- docs/source/conf.py | 2 + docs/source/faq.md | 56 +++- docs/source/getting_started.md | 126 -------- docs/source/index.md | 2 +- docs/source/rl_workflows.md | 375 ++++++++++++++++++++++++ docs/source/technology_stack.md | 120 ++++++++ 8 files changed, 876 insertions(+), 560 deletions(-) create mode 100644 docs/source/architecture.md create mode 100644 docs/source/rl_workflows.md create mode 100644 docs/source/technology_stack.md diff --git a/docs/source/architecture.md b/docs/source/architecture.md new file mode 100644 index 000000000..b7c274a9a --- /dev/null +++ b/docs/source/architecture.md @@ -0,0 +1,256 @@ +# Architecture + +This guide provides a deep dive into TorchForge's architecture, explaining how Monarch, Services, and TorchStore work together to enable distributed RL. + +## The Foundation: Monarch + +At TorchForge's core is **Monarch**, a PyTorch-native distributed programming framework that brings single-controller orchestration to entire GPU clusters. + +### Single-Controller vs SPMD + +Traditional distributed training uses **SPMD (Single Program, Multiple Data)** - where multiple copies of the same script run across different machines, each with only a local view of the workflow. This works well for simple data-parallel training, but becomes notoriously difficult for complex RL workflows with: +- Asynchronous generation and training +- Multiple heterogeneous components (policy, reward model, reference model) +- Dynamic resource allocation +- Fault tolerance across components + +**Monarch's single-controller model** changes this entirely. You write one Python script that orchestrates all distributed resources, making them feel almost local. The code looks and feels like a single-machine program, but can scale across thousands of GPUs. + +### Actor Meshes + +Monarch organizes resources into multidimensional arrays called **meshes**: + +**Process Mesh** +: An array of processes spread across many hosts, typically one process per GPU + +**Actor Mesh** +: An array of actors, each running inside a separate process + +Like array programming in NumPy or PyTorch, meshes make it simple to dispatch operations efficiently across large systems. You can slice meshes, broadcast operations, and operate on entire meshes with simple APIs. + +```python +from monarch.actor import Actor, this_host + +# Create a process mesh with 8 GPUs +procs = this_host().spawn_procs({"gpus": 8}) + +# Define an actor +class PolicyActor(Actor): + @endpoint + def generate(self, prompt): + return self.model.generate(prompt) + +# Spawn actors across the mesh +actors = procs.spawn("policy", PolicyActor) + +# Call methods on the entire mesh +results = actors.generate.call_all("Hello world") +``` + +### Fault Tolerance + +Monarch provides **progressive fault handling** - you write your code as if nothing fails. When something does fail, Monarch fails fast by default, stopping the whole program like an uncaught exception. + +But you can progressively add fine-grained fault handling exactly where you need it: + +```python +try: + result = await policy.generate.route(prompt) +except ActorFailure: + # Handle failure - maybe retry with different replica + result = await policy.generate.route(prompt) +``` + +For long-running RL training, this is crucial. 
Hardware failures are common at scale - in Meta's Llama 3 training, there were 419 interruptions across 54 days on a 16K GPU job (roughly one failure every 3 hours). + +### RDMA and Data Plane + +Monarch separates the **control plane** (messaging) from the **data plane** (bulk data transfers). This enables direct GPU-to-GPU memory transfers across your cluster using RDMA (Remote Direct Memory Access). + +Control commands go through one optimized path, while large data transfers (like model weights) go through another path optimized for bandwidth. + +## Services: RL-Friendly Actor Abstraction + +**Services** wrap Monarch's ActorMesh with patterns common in RL. A service is a managed group of actor replicas with built-in load balancing, fault tolerance, and routing primitives. + +```python +# Create a policy service with 16 replicas, each using 8 GPUs +policy = PolicyActor.options( + procs=8, + with_gpus=True, + num_replicas=16 +).as_service() + +# Create a lightweight coding environment service +coder = SandboxedCoder.options( + procs=1, + with_gpus=False, + num_replicas=16 +).as_service() +``` + +### Service Adverbs + +Services provide intuitive operations called "adverbs": + +**route()** +: Load-balanced request to one replica +```python +response = await policy.generate.route(prompt) +``` + +**fanout()** +: Broadcast to ALL replicas in parallel +```python +await policy.update_weights.fanout(version) +``` + +**session()** +: Sticky sessions for stateful operations (maintains KV cache consistency) +```python +async with policy.session(): + response1 = await policy.generate.route(prompt1) + response2 = await policy.generate.route(prompt2) # Same replica +``` + +### Why Services Matter for RL + +Services solve critical infrastructure challenges: + +**Heterogeneous Scaling** +: Different components need different resources. Your policy might need 16 replicas × 8 GPUs for high-throughput vLLM inference. Your reward model might need 4 replicas × 4 GPUs. Your coding environment might need 16 lightweight CPU-only replicas. Services let each component scale independently. + +**Load Balancing** +: In async RL, multiple `continuous_rollouts()` tasks run concurrently. Services automatically distribute these rollouts across available replicas - no manual worker pool management. + +**Fault Tolerance** +: If a replica fails during a rollout, services detect it, mark it unhealthy, and route subsequent requests to healthy replicas. The failed replica gets restarted automatically. Your RL code never sees the failure. + +**Ephemeral Infrastructure** +: Services are created with your job and torn down when finished. Want to try a new reward model? Change your Python code. No standing deployments to maintain, no infrastructure to provision ahead of time. + +## TorchStore: Distributed Weight Storage + +In async RL, every training step produces new policy weights that must propagate to all inference replicas. For a 70B parameter model across 16 replicas, this means moving hundreds of gigabytes of data. **TorchStore** makes this efficient. + +### The Weight Synchronization Challenge + +Traditionally, you have two options: +1. **Build complex p2p mappings** between training and inference sharding strategies (fast but extremely complex) +2. **Use network filesystem** like NFS (simple but slow, with high infrastructure cost) + +TorchStore combines the **UX of central storage** with the **performance of in-memory p2p operations**. 
+ +### How TorchStore Works + +TorchStore is a distributed, in-memory key-value store for PyTorch tensors, built on Monarch primitives: + +```python +import torchstore as ts +from torch.distributed._tensor import distribute_tensor, Shard +from torch.distributed.device_mesh import init_device_mesh + +# Training process: store sharded weights +async def store_weights(): + device_mesh = init_device_mesh("cuda", (4,)) + tensor = model.state_dict()['layer.weight'] + dtensor = distribute_tensor(tensor, device_mesh, [Shard(0)]) + + # Each rank stores its shard + await ts.put("policy_weights_v123", dtensor) + +# Inference process: fetch with different sharding +async def load_weights(): + device_mesh = init_device_mesh("cuda", (2, 2)) # Different topology! + tensor = torch.empty_like(model.state_dict()['layer.weight']) + dtensor = distribute_tensor(tensor, device_mesh, [Shard(0)]) + + # TorchStore handles resharding automatically + await ts.get("policy_weights_v123", dtensor) +``` + +**Key Features:** + +**Automatic Resharding** +: Handles complex weight transfer between different sharding strategies transparently + +**DTensor Native** +: Works seamlessly with PyTorch's distributed tensors + +**RDMA Transfers** +: Uses RDMA for high-bandwidth data movement without blocking GPUs + +**Asynchronous Updates** +: Training and inference can read/write weights independently, enabling true async RL + +**Flexible Storage** +: Store tensors co-located with trainers, on their own storage tier, sharded or replicated - change with minimal code modifications + +### Why TorchStore Matters + +Without TorchStore, weight synchronization becomes the bottleneck in async RL. Traditional approaches either: +- Require synchronous GPU-to-GPU transfers (blocking training) +- Use slow network filesystems (minutes per update) +- Demand complex manual resharding logic (error-prone, hard to maintain) + +TorchStore solves all of these, keeping data distributed across the cluster until requested and moving it efficiently with RDMA. + +## Resource Management + +Effective resource management is crucial for training large models. + +### Memory Optimization + +**Gradient Checkpointing** +: Trade computation for memory by recomputing activations during backward pass + +**Mixed Precision** +: Use FP16/BF16 for reduced memory footprint while maintaining quality + +**Activation Offloading** +: Move activations to CPU when not needed on GPU + +**Parameter Sharding** +: Distribute model parameters across devices using FSDP + +### Compute Optimization + +**Asynchronous Execution** +: Overlap communication with computation to hide latency + +**Batch Size Tuning** +: Balance memory usage and throughput for optimal training speed + +**Dynamic Batching** +: Group requests efficiently for inference (vLLM does this) + +**Kernel Fusion** +: Combine operations to reduce memory bandwidth (torch.compile helps) + +## Distributed Training Strategies + +TorchForge leverages multiple parallelism strategies through TorchTitan: + +### Data Parallelism + +Each GPU processes different batches using the same model replica. Gradients are synchronized via all-reduce operations. + +### FSDP (Fully Sharded Data Parallel) + +**Sharding** splits model parameters, gradients, and optimizer states across multiple devices, dramatically reducing memory per GPU. FSDP is the strategy that enables training models larger than single-GPU memory. + +### Tensor Parallelism + +Split individual tensors (like large weight matrices) across devices. 
Each device computes a portion of the operation. + +### Pipeline Parallelism + +Partition model layers across devices, pipeline micro-batches through the stages for efficient utilization. + +## See Also + +- {doc}`concepts` - Core philosophy and key abstractions +- {doc}`technology_stack` - Understanding the dependency stack +- {doc}`rl_workflows` - Writing RL algorithms with these components +- {doc}`getting_started` - Installation and setup +- {doc}`usage` - Practical usage examples diff --git a/docs/source/concepts.md b/docs/source/concepts.md index 3d91eb58f..7371a813e 100644 --- a/docs/source/concepts.md +++ b/docs/source/concepts.md @@ -1,6 +1,6 @@ # Concepts -This guide covers the fundamental concepts and architecture behind TorchForge, helping you understand how the system works and how to effectively use its components. +This guide introduces the fundamental principles and concepts behind TorchForge, helping you understand the philosophy that drives the system. ## The Core Philosophy @@ -8,458 +8,84 @@ TorchForge is built on one principle: **researchers should write algorithms, not The traditional approach to distributed RL requires you to write complex coordination logic, retry mechanisms, resource management, and synchronization code. TorchForge abstracts all of this away, letting you express RL algorithms as naturally as pseudocode while powerful infrastructure handles the distributed complexity underneath. -## The Foundation: Monarch - -At TorchForge's core is **Monarch**, a PyTorch-native distributed programming framework that brings single-controller orchestration to entire GPU clusters. - -### Single-Controller vs SPMD - -Traditional distributed training uses **SPMD (Single Program, Multiple Data)** - where multiple copies of the same script run across different machines, each with only a local view of the workflow. This works well for simple data-parallel training, but becomes notoriously difficult for complex RL workflows with: -- Asynchronous generation and training -- Multiple heterogeneous components (policy, reward model, reference model) -- Dynamic resource allocation -- Fault tolerance across components - -**Monarch's single-controller model** changes this entirely. You write one Python script that orchestrates all distributed resources, making them feel almost local. The code looks and feels like a single-machine program, but can scale across thousands of GPUs. - -### Actor Meshes - -Monarch organizes resources into multidimensional arrays called **meshes**: - -**Process Mesh** -: An array of processes spread across many hosts, typically one process per GPU - -**Actor Mesh** -: An array of actors, each running inside a separate process - -Like array programming in NumPy or PyTorch, meshes make it simple to dispatch operations efficiently across large systems. You can slice meshes, broadcast operations, and operate on entire meshes with simple APIs. - -```python -from monarch.actor import Actor, this_host - -# Create a process mesh with 8 GPUs -procs = this_host().spawn_procs({"gpus": 8}) - -# Define an actor -class PolicyActor(Actor): - @endpoint - def generate(self, prompt): - return self.model.generate(prompt) - -# Spawn actors across the mesh -actors = procs.spawn("policy", PolicyActor) - -# Call methods on the entire mesh -results = actors.generate.call_all("Hello world") -``` - -### Fault Tolerance - -Monarch provides **progressive fault handling** - you write your code as if nothing fails. 
When something does fail, Monarch fails fast by default, stopping the whole program like an uncaught exception. - -But you can progressively add fine-grained fault handling exactly where you need it: - -```python -try: - result = await policy.generate.route(prompt) -except ActorFailure: - # Handle failure - maybe retry with different replica - result = await policy.generate.route(prompt) -``` - -For long-running RL training, this is crucial. Hardware failures are common at scale - in Meta's Llama 3 training, there were 419 interruptions across 54 days on a 16K GPU job (roughly one failure every 3 hours). - -### RDMA and Data Plane - -Monarch separates the **control plane** (messaging) from the **data plane** (bulk data transfers). This enables direct GPU-to-GPU memory transfers across your cluster using RDMA (Remote Direct Memory Access). - -Control commands go through one optimized path, while large data transfers (like model weights) go through another path optimized for bandwidth. - -## TorchForge Architecture - -With Monarch as the foundation, TorchForge builds higher-level abstractions specifically for RL workflows. - -### Services: RL-Friendly Actor Abstraction - -**Services** wrap Monarch's ActorMesh with patterns common in RL. A service is a managed group of actor replicas with built-in load balancing, fault tolerance, and routing primitives. - -```python -# Create a policy service with 16 replicas, each using 8 GPUs -policy = PolicyActor.options( - procs=8, - with_gpus=True, - num_replicas=16 -).as_service() - -# Create a lightweight coding environment service -coder = SandboxedCoder.options( - procs=1, - with_gpus=False, - num_replicas=16 -).as_service() -``` - -**Service Adverbs** provide intuitive operations: - -**route()** -: Load-balanced request to one replica -```python -response = await policy.generate.route(prompt) -``` - -**fanout()** -: Broadcast to ALL replicas in parallel -```python -await policy.update_weights.fanout(version) -``` - -**session()** -: Sticky sessions for stateful operations (maintains KV cache consistency) -```python -async with policy.session(): - response1 = await policy.generate.route(prompt1) - response2 = await policy.generate.route(prompt2) # Same replica -``` - -### Why Services Matter for RL - -Services solve critical infrastructure challenges: - -**Heterogeneous Scaling** -: Different components need different resources. Your policy might need 16 replicas × 8 GPUs for high-throughput vLLM inference. Your reward model might need 4 replicas × 4 GPUs. Your coding environment might need 16 lightweight CPU-only replicas. Services let each component scale independently. - -**Load Balancing** -: In async RL, multiple `continuous_rollouts()` tasks run concurrently. Services automatically distribute these rollouts across available replicas - no manual worker pool management. - -**Fault Tolerance** -: If a replica fails during a rollout, services detect it, mark it unhealthy, and route subsequent requests to healthy replicas. The failed replica gets restarted automatically. Your RL code never sees the failure. - -**Ephemeral Infrastructure** -: Services are created with your job and torn down when finished. Want to try a new reward model? Change your Python code. No standing deployments to maintain, no infrastructure to provision ahead of time. - -## TorchStore: Distributed Weight Storage - -In async RL, every training step produces new policy weights that must propagate to all inference replicas. 
For a 70B parameter model across 16 replicas, this means moving hundreds of gigabytes of data. **TorchStore** makes this efficient. - -### The Weight Synchronization Challenge - -Traditionally, you have two options: -1. **Build complex p2p mappings** between training and inference sharding strategies (fast but extremely complex) -2. **Use network filesystem** like NFS (simple but slow, with high infrastructure cost) - -TorchStore combines the **UX of central storage** with the **performance of in-memory p2p operations**. - -### How TorchStore Works - -TorchStore is a distributed, in-memory key-value store for PyTorch tensors, built on Monarch primitives: - -```python -import torchstore as ts -from torch.distributed._tensor import distribute_tensor, Shard -from torch.distributed.device_mesh import init_device_mesh - -# Training process: store sharded weights -async def store_weights(): - device_mesh = init_device_mesh("cuda", (4,)) - tensor = model.state_dict()['layer.weight'] - dtensor = distribute_tensor(tensor, device_mesh, [Shard(0)]) - - # Each rank stores its shard - await ts.put("policy_weights_v123", dtensor) - -# Inference process: fetch with different sharding -async def load_weights(): - device_mesh = init_device_mesh("cuda", (2, 2)) # Different topology! - tensor = torch.empty_like(model.state_dict()['layer.weight']) - dtensor = distribute_tensor(tensor, device_mesh, [Shard(0)]) - - # TorchStore handles resharding automatically - await ts.get("policy_weights_v123", dtensor) -``` - -**Key Features:** - -**Automatic Resharding** -: Handles complex weight transfer between different sharding strategies transparently - -**DTensor Native** -: Works seamlessly with PyTorch's distributed tensors - -**RDMA Transfers** -: Uses RDMA for high-bandwidth data movement without blocking GPUs - -**Asynchronous Updates** -: Training and inference can read/write weights independently, enabling true async RL - -**Flexible Storage** -: Store tensors co-located with trainers, on their own storage tier, sharded or replicated - change with minimal code modifications - -### Why TorchStore Matters - -Without TorchStore, weight synchronization becomes the bottleneck in async RL. Traditional approaches either: -- Require synchronous GPU-to-GPU transfers (blocking training) -- Use slow network filesystems (minutes per update) -- Demand complex manual resharding logic (error-prone, hard to maintain) - -TorchStore solves all of these, keeping data distributed across the cluster until requested and moving it efficiently with RDMA. - -## The RL Stack: Proven Components - -TorchForge made a conscious decision not to reinvent the wheel. We integrate battle-tested components: - -### vLLM: High-Throughput Inference - -**vLLM** handles policy generation with: -- **PagedAttention**: Memory-efficient attention that reduces fragmentation -- **Continuous Batching**: Dynamic batching for maximum GPU utilization -- **Production Performance**: Proven at scale - -In RL, policy generation is often the bottleneck. Autoregressive generation is costly, and blocking training on it kills throughput. vLLM enables fast, efficient inference that doesn't bottleneck your training loop. - -**Integration**: TorchForge integrates directly with vLLM's engine. You can customize generation strategies, memory management, and inference logic as your research demands. 
- -### TorchTitan: Production Training - -**TorchTitan** brings Meta's production-grade training infrastructure: -- **FSDP**: Shard parameters, gradients, and optimizer states across GPUs -- **Pipeline Parallelism**: Split model layers across devices with efficient micro-batching -- **Tensor Parallelism**: Split individual tensors across devices for very large models -- **Proven at Scale**: Used to train Llama models on thousands of GPUs - -Modern models are too large for single GPUs. TorchTitan provides the sophisticated sharding and parallelism strategies needed to train them efficiently. - -**Integration**: TorchForge provides direct access to TorchTitan's training step logic and sharding strategies, enabling deep customization. - -### Role of Integration - -These integrations give you: -- **Direct component access**: Customize deeply when needed -- **Proven performance**: Battle-tested at massive scale -- **Flexible composition**: Mix and match with custom components - -TorchForge's role is **coordination** - making these components work together seamlessly so you can express your RL algorithm naturally. - -## Writing RL Algorithms - -With these foundations, here's what RL code looks like in TorchForge: - -### Episode Generation - -```python -async def generate_episode(dataloader, policy, reward, replay_buffer): - # Sample a prompt - prompt, target = await dataloader.sample.route() - - # Generate response (vLLM handles this efficiently) - response = await policy.generate.route(prompt) - - # Score the response - reward_value = await reward.evaluate_response.route( - prompt=prompt, - response=response.text, - target=target - ) - - # Store for training - await replay_buffer.add.route( - Episode( - prompt_ids=response.prompt_ids, - response_ids=response.token_ids, - reward=reward_value - ) - ) -``` - -Notice what's **not** here: -- No retry logic -- No resource allocation -- No synchronization code -- No infrastructure complexity - -Just your algorithm. 
- -### Asynchronous RL - -Compose this into fully async, off-policy training: - -```python -async def async_rl_loop(num_rollout_loops: int): - # Multiple concurrent rollout generators - rollout_tasks = [ - asyncio.create_task(continuous_rollouts()) - for _ in range(num_rollout_loops) - ] - - # Continuous training - training_task = asyncio.create_task(continuous_training()) - - await asyncio.gather(*rollout_tasks, training_task) - -async def continuous_rollouts(): - """Generate rollouts continuously using latest policy.""" - while True: - await generate_episode(dataloader, policy, reward, replay_buffer) - -async def continuous_training(): - """Train continuously on available experience.""" - training_step = 0 - while True: - batch = await replay_buffer.sample.route( - curr_policy_version=training_step - ) - - if batch is None: - await asyncio.sleep(0.1) # Wait for more experience - else: - loss = await trainer.train_step.route(batch) - training_step += 1 - - # Push updated weights (TorchStore handles this) - await trainer.push_weights.route(training_step) - # Broadcast to all policy replicas - await policy.update_weights.fanout(training_step) -``` - -### Synchronous RL - -The same `generate_episode()` function works for on-policy algorithms like PPO - just compose it differently: - -```python -async def synchronous_rl(batch_size: int): - """Synchronous on-policy RL: collect batch, then train.""" - version = 0 - - while True: - # Collect a full batch with current policy version - for _ in range(batch_size): - await generate_episode(dataloader, policy, reward, replay_buffer) - - # Sample the batch we just collected - batch = await replay_buffer.sample.route( - curr_policy_version=version, - batch_size=batch_size - ) - - # Train on the complete batch - loss = await trainer.train_step.route(batch) - - # Update weights in lockstep - await trainer.push_weights.route(version + 1) - await policy.update_weights.fanout(version + 1) - version += 1 -``` - -**The Power of Composition**: Write your rollout logic once, compose it into any paradigm - on-policy, off-policy, or anywhere in between. - -## Extensible Environments - -RL often requires interacting with environments beyond text generation - executing code, using tools, running simulations. TorchForge makes these first-class citizens through the same service abstraction. - -### Code Execution - -For RL on coding tasks (RLVR - Reinforcement Learning with Verifiable Rewards): - -```python -# Lightweight CPU-only service for parallel execution -coder = SandboxedPythonCoder.options( - procs=1, - with_gpus=False, - num_replicas=16 -).as_service() - -# In your RL code -async def generate_episode(): - prompt = await dataloader.sample.route() - code = await policy.generate.route(prompt) - - # Execute safely in sandbox - stdout, stderr = await coder.execute.route(code) - reward = 1.0 if stderr == "" else 0.0 # Reward based on execution - - await replay_buffer.add.route(Episode(...)) -``` - -### Tool Integration - -Services make tools ephemeral - spawn them with your job, scale them independently, tear down when finished. The same coordination primitives work for any environment type. - -This pattern extends naturally to **agentic workflows** - agents that interact with tools, query APIs, and navigate complex environments while learning from outcomes. - -## Resource Management - -Effective resource management is crucial for training large models. 
- -### Memory Optimization - -**Gradient Checkpointing** -: Trade computation for memory by recomputing activations during backward pass - -**Mixed Precision** -: Use FP16/BF16 for reduced memory footprint while maintaining quality - -**Activation Offloading** -: Move activations to CPU when not needed on GPU - -**Parameter Sharding** -: Distribute model parameters across devices using FSDP - -### Compute Optimization +## Key Abstractions -**Asynchronous Execution** -: Overlap communication with computation to hide latency +Understanding these core abstractions helps you use TorchForge effectively: -**Batch Size Tuning** -: Balance memory usage and throughput for optimal training speed +### Actor -**Dynamic Batching** -: Group requests efficiently for inference (vLLM does this) +A component that encapsulates a model along with its execution logic. Actors provide: +- **Isolation**: Independent resources and failure domains +- **Flexibility**: Different parallelism strategies per actor +- **Composability**: Combine actors to create complex pipelines -**Kernel Fusion** -: Combine operations to reduce memory bandwidth (torch.compile helps) +### Service -## Distributed Training Strategies +A managed group of actor replicas with built-in routing, load balancing, and fault tolerance. Services handle operational complexity so your RL code stays clean. Think of services as horizontally scaled actors with automatic load distribution. -TorchForge leverages multiple parallelism strategies through TorchTitan: +### DTensor (Distributed Tensor) -### Data Parallelism +A tensor sharded across multiple devices. TorchStore handles resharding DTensors between different topologies automatically, making distributed tensor operations transparent. -Each GPU processes different batches using the same model replica. Gradients are synchronized via all-reduce operations. +### Episode -### FSDP (Fully Sharded Data Parallel) +A complete RL interaction sequence containing: +- **Prompt**: Input to the policy +- **Response**: Generated output +- **Reward**: Feedback signal +- **Metadata**: Additional context (timestamps, model versions, etc.) -**Sharding** splits model parameters, gradients, and optimizer states across multiple devices, dramatically reducing memory per GPU. FSDP is the strategy that enables training models larger than single-GPU memory. +Episodes flow through your system from generation to replay buffer to training. -### Tensor Parallelism +### Replay Buffer -Split individual tensors (like large weight matrices) across devices. Each device computes a portion of the operation. +Stores episodes for training. Can be implemented with various strategies: +- **FIFO**: Simple queue for on-policy algorithms +- **Prioritized**: Importance sampling for off-policy learning +- **Reservoir**: Uniform sampling from history +- **Hybrid**: Mix multiple strategies -### Pipeline Parallelism +Integrates with TorchStore for efficient distributed storage. -Partition model layers across devices, pipeline micro-batches through the stages for efficient utilization. +## Design Principles -## Key Abstractions +### Single-Controller Model -Understanding these core abstractions helps you use TorchForge effectively: +Traditional distributed training uses **SPMD (Single Program, Multiple Data)** - where multiple copies of the same script run across different machines, each with only a local view of the workflow. 
This works well for simple data-parallel training, but becomes notoriously difficult for complex RL workflows with: +- Asynchronous generation and training +- Multiple heterogeneous components (policy, reward model, reference model) +- Dynamic resource allocation +- Fault tolerance across components -### Actor +TorchForge adopts **Monarch's single-controller model**: You write one Python script that orchestrates all distributed resources, making them feel almost local. The code looks and feels like a single-machine program, but can scale across thousands of GPUs. -A component that encapsulates a model along with its execution logic. Actors provide isolation (independent resources), flexibility (different parallelism strategies), and composability (combine to create complex pipelines). +### Composable Components -### Service +Write your core logic once, compose it into any paradigm: +- **Synchronous on-policy** (PPO, GRPO) +- **Asynchronous off-policy** (continuous rollouts + training) +- **Hybrid approaches** (batch collection with async training) -A managed group of actor replicas with built-in routing, load balancing, and fault tolerance. Services handle operational complexity so your RL code stays clean. +The same `generate_episode()` function works everywhere. Just change how you compose it. -### DTensor (Distributed Tensor) +### Ephemeral Infrastructure -A tensor sharded across multiple devices. TorchStore handles resharding DTensors between different topologies automatically. +Services are created with your job and torn down when finished: +- No standing deployments to maintain +- No infrastructure to provision ahead of time +- Want to try a new reward model? Change your Python code and rerun -### Episode +This dramatically reduces operational overhead and enables rapid experimentation. -A complete RL interaction sequence - prompt, response, reward, and metadata. Episodes flow through your system from generation to training. +### Progressive Fault Tolerance -### Replay Buffer +Write code as if nothing fails. When failures do occur: +- Monarch fails fast by default (like uncaught exceptions) +- Add fine-grained fault handling exactly where you need it +- Services automatically route around failed replicas +- Failed actors restart automatically -Stores episodes for training. Can be implemented with various strategies (FIFO, prioritized, etc.) and integrates with TorchStore for efficient storage. +You choose your fault tolerance granularity based on your needs. 
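+
+For example, a minimal sketch of opting into fine-grained handling around a single routed call, assuming replica failures surface as an `ActorFailure` exception (the helper name and retry budget are illustrative):
+
+```python
+async def generate_with_retry(policy, prompt, max_attempts: int = 3):
+    """Retry a routed call; each route() is load-balanced to a healthy replica."""
+    for attempt in range(max_attempts):
+        try:
+            return await policy.generate.route(prompt)
+        except ActorFailure:
+            # The failed replica is marked unhealthy and restarted automatically;
+            # re-raise only once the retry budget is exhausted.
+            if attempt == max_attempts - 1:
+                raise
+```
+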
## Best Practices @@ -507,10 +133,19 @@ TorchForge has been validated in real-world deployments: - **CoreWeave**: Large-scale training runs on 512 H100 GPU clusters with smooth, efficient performance - **Scale**: Tested across hundreds of GPUs with continuous rollouts and asynchronous training -## See Also +## Learn More + +Dive deeper into specific topics: + +```{toctree} +:maxdepth: 1 + +architecture +technology_stack +rl_workflows +``` -- {doc}`getting_started` - Installation, setup, and first training run -- {doc}`usage` - Practical usage examples and configuration patterns -- {doc}`tutorials` - Step-by-step guides +**Related Documentation:** +- {doc}`getting_started` - Installation and first training run +- {doc}`usage` - Practical configuration examples - {doc}`api` - Complete API reference -- {doc}`faq` - Common questions and troubleshooting diff --git a/docs/source/conf.py b/docs/source/conf.py index 98b02cc6c..342188948 100644 --- a/docs/source/conf.py +++ b/docs/source/conf.py @@ -98,6 +98,8 @@ "navbar_center": "navbar-nav", "canonical_url": "https://meta-pytorch.org/forge/", "header_links_before_dropdown": 7, + "show_toc_level": 2, + "navigation_depth": 3, } theme_variables = pytorch_sphinx_theme2.get_theme_variables() diff --git a/docs/source/faq.md b/docs/source/faq.md index d3c027866..ebbe4f38e 100644 --- a/docs/source/faq.md +++ b/docs/source/faq.md @@ -1,3 +1,57 @@ # FAQ -This FAQ covers common questions and issues encountered when using TorchForge. +## Common Installation Issues + +### Issue: `gh: command not found` + +**Solution**: Install GitHub CLI: +```bash +# On Ubuntu/Debian +sudo apt install gh + +# On Fedora +sudo dnf install gh + +# Then authenticate +gh auth login +``` + +### Issue: `CUDA out of memory` + +**Solution**: Reduce batch size in your config file: +```yaml +training: + batch_size: 2 # Reduced from 4 + gradient_accumulation_steps: 8 # Increased to maintain effective batch size +``` + +### Issue: `ImportError: No module named 'torch'` + +**Solution**: Ensure you activated the conda environment: +```bash +conda activate forge +``` + +### Issue: vLLM wheel download fails + +**Solution**: The vLLM wheel is hosted on GitHub releases. Ensure you're authenticated with `gh auth login` and have internet access. + +### Issue: `Unsupported GPU architecture` + +**Solution**: Check your GPU compute capability: +```bash +python -c "import torch; print(torch.cuda.get_device_capability())" +``` + +TorchForge requires compute capability 7.0 or higher (Volta architecture or newer). + +### Issue: Monarch actor spawn failures + +**Symptom**: Errors like "Failed to spawn actors" or "Process allocation failed" + +**Solution**: Verify your GPU count matches your configuration: +```bash +nvidia-smi # Check available GPUs +``` + +Ensure your config requests fewer processes than available GPUs. diff --git a/docs/source/getting_started.md b/docs/source/getting_started.md index 55e480983..17e08db75 100644 --- a/docs/source/getting_started.md +++ b/docs/source/getting_started.md @@ -34,76 +34,6 @@ Before installing TorchForge, ensure your system meets the following requirement - Usually pre-installed on Linux systems - Verify with: `git --version` -## Understanding TorchForge's Dependencies - -TorchForge is built on a carefully curated stack of components, each solving specific challenges in distributed RL. Understanding these dependencies helps you troubleshoot issues and customize your setup. 
- -### Monarch - -**What it is:** Monarch is a PyTorch-native distributed programming framework that brings single-controller orchestration to entire clusters. - -**Why TorchForge needs it:** -- **Single-Controller Model**: Write code that looks like a single Python program but scales to thousands of GPUs -- **Actor Meshes**: Organize processes and actors into scalable, multi-dimensional arrays -- **Fault Tolerance**: Progressive fault handling with fast failure detection and recovery -- **RDMA Support**: Direct GPU-to-GPU memory transfers for efficient data movement - -**What it solves:** Traditional SPMD (Single Program, Multiple Data) approaches require complex coordination logic in your code. Monarch abstracts this away, letting you write RL algorithms naturally while it handles distributed complexity. - -**Technical details:** Monarch is implemented with a Python frontend and a Rust backend for performance and robustness. It provides: -- Scalable messaging with multicast trees -- Multipart messaging for zero-copy data transfers -- Integration with PyTorch's distributed primitives - -### vLLM - -**What it is:** A fast and memory-efficient inference engine optimized for large language models. - -**Why TorchForge needs it:** -- **PagedAttention**: Memory-efficient attention mechanism that reduces memory fragmentation -- **Continuous Batching**: Dynamic batching that maximizes GPU utilization -- **High Throughput**: Handles generation for multiple concurrent rollouts efficiently -- **Production-Ready**: Battle-tested at scale with proven performance - -**What it solves:** In RL for LLMs, policy generation is often the bottleneck. Autoregressive generation is costly, and blocking training on it kills throughput. vLLM enables fast, efficient inference that doesn't bottleneck your training loop. - -**Technical details:** vLLM version 0.10.0+ is required. TorchForge integrates directly with vLLM's engine, giving you access to customize generation strategies, memory management, and inference logic. - -### TorchTitan - -**What it is:** Meta's production-grade LLM training platform with advanced parallelism support. - -**Why TorchForge needs it:** -- **FSDP (Fully Sharded Data Parallel)**: Shard parameters, gradients, and optimizer states across GPUs -- **Pipeline Parallelism**: Split model layers across devices with efficient micro-batching -- **Tensor Parallelism**: Split individual tensors across devices for very large models -- **Proven at Scale**: Used to train Llama models on thousands of GPUs - -**What it solves:** Modern models are too large to fit on single GPUs. TorchTitan provides the sophisticated sharding and parallelism strategies needed to train them efficiently, with optimizations battle-tested in production. - -**Technical details:** TorchForge integrates with TorchTitan for training step logic and sharding strategies, enabling experimentation without framework constraints. - -### TorchStore - -**What it is:** A distributed, in-memory key-value store for PyTorch tensors, built on Monarch. - -**Why TorchForge needs it:** -- **Automatic Resharding**: Handles complex weight transfer between different sharding strategies -- **DTensor Support**: Native support for distributed tensors -- **RDMA Transfers**: High-bandwidth weight movement without synchronous GPU transfers -- **Asynchronous Updates**: Training and inference can read/write weights independently - -**What it solves:** In async RL, new policy weights must propagate to all inference replicas. 
For a 70B parameter model across 16 replicas, this means moving hundreds of gigabytes. TorchStore makes this efficient, handling resharding automatically and using RDMA for fast transfers. - -**Technical details:** TorchStore provides a simple key-value interface while optimizing data movement behind the scenes, staying distributed across the cluster until requested. - -### PyTorch Nightly - -**Why Nightly:** TorchForge requires the latest PyTorch features: -- **Native DTensor Support**: Distributed tensors that span multiple devices -- **Compiled Mode Optimizations**: Performance improvements through torch.compile -- **Advanced Memory Management**: Latest FSDP and memory optimization features -- **Bug Fixes**: Continuous improvements to distributed training primitives **Installation note:** The installation script provides pre-built wheels with PyTorch nightly already included. @@ -320,62 +250,6 @@ checkpointing: See {doc}`usage` for detailed configuration options. -## Common Installation Issues - -### Issue: `gh: command not found` - -**Solution**: Install GitHub CLI: -```bash -# On Ubuntu/Debian -sudo apt install gh - -# On Fedora -sudo dnf install gh - -# Then authenticate -gh auth login -``` - -### Issue: `CUDA out of memory` - -**Solution**: Reduce batch size in your config file: -```yaml -training: - batch_size: 2 # Reduced from 4 - gradient_accumulation_steps: 8 # Increased to maintain effective batch size -``` - -### Issue: `ImportError: No module named 'torch'` - -**Solution**: Ensure you activated the conda environment: -```bash -conda activate forge -``` - -### Issue: vLLM wheel download fails - -**Solution**: The vLLM wheel is hosted on GitHub releases. Ensure you're authenticated with `gh auth login` and have internet access. - -### Issue: `Unsupported GPU architecture` - -**Solution**: Check your GPU compute capability: -```bash -python -c "import torch; print(torch.cuda.get_device_capability())" -``` - -TorchForge requires compute capability 7.0 or higher (Volta architecture or newer). - -### Issue: Monarch actor spawn failures - -**Symptom**: Errors like "Failed to spawn actors" or "Process allocation failed" - -**Solution**: Verify your GPU count matches your configuration: -```bash -nvidia-smi # Check available GPUs -``` - -Ensure your config requests fewer processes than available GPUs. - ## Next Steps Now that you have TorchForge installed and verified: diff --git a/docs/source/index.md b/docs/source/index.md index e76771961..b9e3a1475 100644 --- a/docs/source/index.md +++ b/docs/source/index.md @@ -225,9 +225,9 @@ TorchForge has been validated in real-world deployments: Before starting significant work, signal your intention in the issue tracker to coordinate with maintainers. ``` - ```{toctree} :maxdepth: 2 +:caption: Documentation getting_started concepts diff --git a/docs/source/rl_workflows.md b/docs/source/rl_workflows.md new file mode 100644 index 000000000..9a828e271 --- /dev/null +++ b/docs/source/rl_workflows.md @@ -0,0 +1,375 @@ +# RL Workflows + +This guide shows you how to write RL algorithms with TorchForge, from simple episode generation to complex asynchronous training loops. 
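+
+Throughout this guide, `dataloader`, `policy`, `reward`, `replay_buffer`, and `trainer` refer to components that have already been deployed as services. A minimal sketch of that setup follows; the actor class names, replica counts, and GPU sizing are illustrative placeholders for your own implementations:
+
+```python
+# Illustrative setup - substitute your own actor classes and sizing.
+policy = PolicyActor.options(procs=8, with_gpus=True, num_replicas=4).as_service()
+reward = RewardActor.options(procs=1, with_gpus=True, num_replicas=2).as_service()
+trainer = TrainerActor.options(procs=8, with_gpus=True).as_service()
+replay_buffer = ReplayBuffer.options(procs=1, with_gpus=False).as_service()
+dataloader = DatasetActor.options(procs=1, with_gpus=False).as_service()
+```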
+ +## Writing RL Algorithms + +With TorchForge's foundations (Monarch, Services, TorchStore), here's what RL code looks like: + +### Episode Generation + +```python +async def generate_episode(dataloader, policy, reward, replay_buffer): + # Sample a prompt + prompt, target = await dataloader.sample.route() + + # Generate response (vLLM handles this efficiently) + response = await policy.generate.route(prompt) + + # Score the response + reward_value = await reward.evaluate_response.route( + prompt=prompt, + response=response.text, + target=target + ) + + # Store for training + await replay_buffer.add.route( + Episode( + prompt_ids=response.prompt_ids, + response_ids=response.token_ids, + reward=reward_value + ) + ) +``` + +Notice what's **not** here: +- No retry logic +- No resource allocation +- No synchronization code +- No infrastructure complexity + +Just your algorithm. + +### Asynchronous RL + +Compose this into fully async, off-policy training: + +```python +async def async_rl_loop(num_rollout_loops: int): + # Multiple concurrent rollout generators + rollout_tasks = [ + asyncio.create_task(continuous_rollouts()) + for _ in range(num_rollout_loops) + ] + + # Continuous training + training_task = asyncio.create_task(continuous_training()) + + await asyncio.gather(*rollout_tasks, training_task) + +async def continuous_rollouts(): + """Generate rollouts continuously using latest policy.""" + while True: + await generate_episode(dataloader, policy, reward, replay_buffer) + +async def continuous_training(): + """Train continuously on available experience.""" + training_step = 0 + while True: + batch = await replay_buffer.sample.route( + curr_policy_version=training_step + ) + + if batch is None: + await asyncio.sleep(0.1) # Wait for more experience + else: + loss = await trainer.train_step.route(batch) + training_step += 1 + + # Push updated weights (TorchStore handles this) + await trainer.push_weights.route(training_step) + # Broadcast to all policy replicas + await policy.update_weights.fanout(training_step) +``` + +### Synchronous RL + +The same `generate_episode()` function works for on-policy algorithms like PPO - just compose it differently: + +```python +async def synchronous_rl(batch_size: int): + """Synchronous on-policy RL: collect batch, then train.""" + version = 0 + + while True: + # Collect a full batch with current policy version + for _ in range(batch_size): + await generate_episode(dataloader, policy, reward, replay_buffer) + + # Sample the batch we just collected + batch = await replay_buffer.sample.route( + curr_policy_version=version, + batch_size=batch_size + ) + + # Train on the complete batch + loss = await trainer.train_step.route(batch) + + # Update weights in lockstep + await trainer.push_weights.route(version + 1) + await policy.update_weights.fanout(version + 1) + version += 1 +``` + +**The Power of Composition**: Write your rollout logic once, compose it into any paradigm - on-policy, off-policy, or anywhere in between. + +## Extensible Environments + +RL often requires interacting with environments beyond text generation - executing code, using tools, running simulations. TorchForge makes these first-class citizens through the same service abstraction. 
+ +### Code Execution + +For RL on coding tasks (RLVR - Reinforcement Learning with Verifiable Rewards): + +```python +# Lightweight CPU-only service for parallel execution +coder = SandboxedPythonCoder.options( + procs=1, + with_gpus=False, + num_replicas=16 +).as_service() + +# In your RL code +async def generate_episode(): + prompt = await dataloader.sample.route() + code = await policy.generate.route(prompt) + + # Execute safely in sandbox + stdout, stderr = await coder.execute.route(code) + reward = 1.0 if stderr == "" else 0.0 # Reward based on execution + + await replay_buffer.add.route(Episode(...)) +``` + +### Tool Integration + +Services make tools ephemeral - spawn them with your job, scale them independently, tear down when finished. The same coordination primitives work for any environment type. + +```python +# Create a web browsing environment +browser = WebBrowsingEnv.options( + procs=1, + with_gpus=False, + num_replicas=8 +).as_service() + +# Use it in your RL loop +async def generate_episode(): + task = await dataloader.sample.route() + + # Agent decides on actions + action = await policy.generate.route(task) + + # Execute action in browser + result = await browser.execute_action.route(action) + + # Evaluate outcome + reward = await reward_model.evaluate.route(task, result) + + await replay_buffer.add.route(Episode(...)) +``` + +This pattern extends naturally to **agentic workflows** - agents that interact with tools, query APIs, and navigate complex environments while learning from outcomes. + +### Custom Environments + +Build your own environment service: + +```python +from monarch.actor import Actor, endpoint + +class CustomEnv(Actor): + def __init__(self): + # Initialize your environment + self.state = self.reset() + + @endpoint + async def reset(self): + """Reset environment to initial state.""" + return initial_state + + @endpoint + async def step(self, action): + """Execute action and return (observation, reward, done).""" + # Your environment logic here + return observation, reward, done + +# Deploy as a service +env = CustomEnv.options( + procs=1, + num_replicas=16 +).as_service() + +# Use in training +obs = await env.reset.route() +while not done: + action = await policy.act.route(obs) + obs, reward, done = await env.step.route(action) +``` + +## Common Patterns + +### Warmup Phase + +Start training after collecting initial experience: + +```python +async def warmup_then_train(warmup_episodes: int): + # Collect initial experience + for _ in range(warmup_episodes): + await generate_episode(dataloader, policy, reward, replay_buffer) + + # Now start training + await continuous_training() +``` + +### Evaluation Episodes + +Interleave evaluation with training: + +```python +async def train_with_eval(eval_interval: int): + training_step = 0 + + while True: + # Training phase + for _ in range(eval_interval): + await generate_episode(dataloader, policy, reward, replay_buffer) + batch = await replay_buffer.sample.route() + await trainer.train_step.route(batch) + training_step += 1 + + # Evaluation phase + eval_rewards = [] + for _ in range(100): + episode = await generate_episode( + eval_dataloader, policy, reward, None # Don't store in buffer + ) + eval_rewards.append(episode.reward) + + print(f"Step {training_step}: Eval reward = {np.mean(eval_rewards)}") +``` + +### Curriculum Learning + +Gradually increase task difficulty: + +```python +async def curriculum_training(): + difficulty = 0 + + while difficulty < max_difficulty: + # Train on current difficulty + for _ in 
range(episodes_per_difficulty): + prompt = await dataloader.sample.route(difficulty=difficulty) + await generate_episode_with_prompt(prompt, policy, reward, replay_buffer) + + # Evaluate performance + success_rate = await evaluate(policy, difficulty) + + # Move to next difficulty if threshold met + if success_rate > 0.8: + difficulty += 1 + print(f"Advanced to difficulty {difficulty}") +``` + +### Multi-Task Training + +Train on multiple tasks simultaneously: + +```python +async def multi_task_training(tasks: List[str]): + # Create separate dataloaders for each task + dataloaders = {task: create_dataloader(task) for task in tasks} + + while True: + # Sample task uniformly (or with custom distribution) + task = random.choice(tasks) + dataloader = dataloaders[task] + + # Generate episode for this task + await generate_episode(dataloader, policy, reward, replay_buffer) + + # Train uses mixed experience from all tasks + batch = await replay_buffer.sample.route() + await trainer.train_step.route(batch) +``` + +## Debugging Tips + +### Start Small + +Begin with a minimal setup to validate your logic: + +```python +# Single GPU, single replica, synchronous +policy = PolicyActor.options(procs=1, with_gpus=True).as_service() +reward = RewardActor.options(procs=1, with_gpus=True).as_service() + +# Run a few episodes +for _ in range(10): + await generate_episode(dataloader, policy, reward, replay_buffer) +``` + +Once this works, scale up to multi-GPU and async training. + +### Add Logging + +Insert logging at key points: + +```python +async def generate_episode(dataloader, policy, reward, replay_buffer): + start = time.time() + + prompt, target = await dataloader.sample.route() + print(f"Sampled prompt in {time.time() - start:.2f}s") + + gen_start = time.time() + response = await policy.generate.route(prompt) + print(f"Generated response in {time.time() - gen_start:.2f}s") + + reward_start = time.time() + reward_value = await reward.evaluate_response.route(prompt, response.text, target) + print(f"Computed reward in {time.time() - reward_start:.2f}s") + + await replay_buffer.add.route(Episode(...)) + print(f"Total episode time: {time.time() - start:.2f}s") +``` + +### Monitor Metrics + +Track key metrics: + +```python +from collections import deque + +recent_rewards = deque(maxlen=100) +recent_kls = deque(maxlen=100) + +async def continuous_training(): + training_step = 0 + + while True: + batch = await replay_buffer.sample.route() + if batch: + loss, kl = await trainer.train_step.route(batch) + recent_kls.append(kl) + + if training_step % 100 == 0: + print(f"Step {training_step}") + print(f" Avg reward: {np.mean(recent_rewards):.3f}") + print(f" Avg KL: {np.mean(recent_kls):.3f}") + print(f" Loss: {loss:.3f}") + + training_step += 1 +``` + +## See Also + +- {doc}`concepts` - Core philosophy and abstractions +- {doc}`architecture` - How Services and TorchStore enable these patterns +- {doc}`technology_stack` - Understanding the underlying components +- {doc}`usage` - Configuration and practical examples +- {doc}`tutorials` - Step-by-step guides +- {doc}`api` - Complete API reference diff --git a/docs/source/technology_stack.md b/docs/source/technology_stack.md new file mode 100644 index 000000000..1651f26c7 --- /dev/null +++ b/docs/source/technology_stack.md @@ -0,0 +1,120 @@ +# Technology Stack + +TorchForge is built on a carefully curated stack of battle-tested components, each solving specific challenges in distributed RL. 
Understanding this stack helps you troubleshoot issues, optimize performance, and customize your setup. + +## Monarch: The Distributed Foundation + +**What it is:** Monarch is a PyTorch-native distributed programming framework that brings single-controller orchestration to entire clusters. It's implemented with a Python frontend and Rust backend for performance and robustness. + +**Why TorchForge needs it:** +- **Single-Controller Model**: Write code that looks like a single Python program but scales to thousands of GPUs +- **Actor Meshes**: Organize processes and actors into scalable, multi-dimensional arrays +- **Fault Tolerance**: Progressive fault handling with fast failure detection and recovery +- **RDMA Support**: Direct GPU-to-GPU memory transfers for efficient data movement + +**What it solves:** Traditional SPMD (Single Program, Multiple Data) approaches require complex coordination logic in your code. Monarch abstracts this away, letting you write RL algorithms naturally while it handles distributed complexity underneath. + +**Technical capabilities:** +- Scalable messaging with multicast trees +- Multipart messaging for zero-copy data transfers +- Integration with PyTorch's distributed primitives +- Separation of control plane (messaging) and data plane (bulk transfers) + +**Where you see it:** Every service creation, actor spawn, and distributed operation in TorchForge runs on Monarch primitives. It's the invisible orchestration layer that makes distributed RL feel simple. + +## vLLM: High-Performance Inference + +**What it is:** A fast and memory-efficient inference engine optimized for large language models, version 0.10.0 or higher required. + +**Why TorchForge needs it:** +- **PagedAttention**: Memory-efficient attention mechanism that reduces memory fragmentation +- **Continuous Batching**: Dynamic batching that maximizes GPU utilization +- **High Throughput**: Handles generation for multiple concurrent rollouts efficiently +- **Production-Ready**: Battle-tested at scale with proven performance + +**What it solves:** In RL for LLMs, policy generation is often the bottleneck. Autoregressive generation is costly, and blocking training on it kills throughput. vLLM enables fast, efficient inference that doesn't bottleneck your training loop. + +**Integration depth:** TorchForge integrates directly with vLLM's engine, giving you access to customize generation strategies, memory management, and inference logic as your research demands. You control the vLLM configuration while TorchForge handles distributed orchestration. + +**Where you see it:** Every policy generation call in your RL code runs through vLLM, whether you're doing synchronous PPO-style rollouts or fully asynchronous off-policy training. + +## TorchTitan: Production Training Infrastructure + +**What it is:** Meta's production-grade LLM training platform with advanced parallelism support, used to train Llama models on thousands of GPUs. + +**Why TorchForge needs it:** +- **FSDP (Fully Sharded Data Parallel)**: Shard parameters, gradients, and optimizer states across GPUs +- **Pipeline Parallelism**: Split model layers across devices with efficient micro-batching +- **Tensor Parallelism**: Split individual tensors across devices for very large models +- **Proven at Scale**: Battle-tested optimizations from production training runs + +**What it solves:** Modern models are too large to fit on single GPUs. 
TorchTitan provides the sophisticated sharding and parallelism strategies needed to train them efficiently, with optimizations proven in production environments. + +**Integration depth:** TorchForge provides direct access to TorchTitan's training step logic and sharding strategies, enabling experimentation without framework constraints. You can customize the training loop while leveraging TorchTitan's proven infrastructure. + +**Where you see it:** Policy training, whether supervised fine-tuning or RL policy updates, runs through TorchTitan's training infrastructure with your choice of parallelism strategies. + +## TorchStore: Distributed Weight Storage + +**What it is:** A distributed, in-memory key-value store for PyTorch tensors, built on Monarch primitives, designed specifically for weight synchronization in distributed RL. + +**Why TorchForge needs it:** +- **Automatic Resharding**: Handles complex weight transfer between different sharding strategies +- **DTensor Support**: Native support for distributed tensors with automatic topology conversion +- **RDMA Transfers**: High-bandwidth weight movement without synchronous GPU transfers +- **Asynchronous Updates**: Training and inference can read/write weights independently + +**What it solves:** In async RL, new policy weights must propagate to all inference replicas. For a 70B parameter model across 16 replicas, this means moving hundreds of gigabytes. Traditional approaches either require synchronous GPU-to-GPU transfers (blocking training), use slow network filesystems (minutes per update), or demand complex manual resharding logic (error-prone). TorchStore solves all of these. + +**Technical capabilities:** +- Simple key-value interface with complex optimizations underneath +- Stays distributed across the cluster until requested +- Flexible storage: co-located with trainers, on storage tier, sharded or replicated + +**Where you see it:** Weight synchronization between training and inference, allowing training to continue while inference replicas asynchronously fetch updated weights without blocking either process. + +## PyTorch Nightly: Cutting-Edge Features + +**Why Nightly is required:** TorchForge leverages the latest PyTorch features that aren't yet in stable releases: +- **Native DTensor Support**: Distributed tensors that span multiple devices with automatic sharding +- **Compiled Mode Optimizations**: Performance improvements through torch.compile +- **Advanced Memory Management**: Latest FSDP and memory optimization features +- **Bug Fixes**: Continuous improvements to distributed training primitives + +**Where you see it:** Every tensor operation, distributed primitive, and training optimization builds on PyTorch nightly's latest capabilities. + +## The Integration Philosophy + +TorchForge made a conscious decision not to reinvent the wheel. Instead, we integrate battle-tested components and add the coordination layer that makes them work together seamlessly. + +**What you get:** +- **Direct component access**: Customize deeply when your research demands it +- **Proven performance**: Battle-tested at massive scale in production environments +- **Flexible composition**: Mix and match components or replace them with custom implementations +- **Simplified orchestration**: TorchForge coordinates these components so you write algorithms, not infrastructure + +**TorchForge's role:** Coordination. 
We make these powerful but complex components work together seamlessly, exposing simple APIs for distributed RL while preserving deep customization capabilities when you need them. + +## Installation + +All these components are installed automatically through TorchForge's installation script: + +```bash +git clone https://github.com/meta-pytorch/forge.git +cd forge +conda create -n forge python=3.10 +conda activate forge +./scripts/install.sh +``` + +The script provides pre-built wheels for PyTorch nightly, Monarch, vLLM, and TorchTitan, ensuring compatibility and reducing installation time. + +See {doc}`getting_started` for detailed installation instructions and troubleshooting. + +## See Also + +- {doc}`concepts` - Core philosophy and key abstractions +- {doc}`architecture` - How Monarch, Services, and TorchStore work together +- {doc}`rl_workflows` - Using these components to write RL algorithms +- {doc}`getting_started` - Installation and setup guide +- {doc}`usage` - Practical configuration examples