diff --git a/docs/source/conf.py b/docs/source/conf.py index 2ee6771ea..13acd25cc 100644 --- a/docs/source/conf.py +++ b/docs/source/conf.py @@ -140,8 +140,8 @@ def get_version_path(): "navbar_center": "navbar-nav", "canonical_url": "https://meta-pytorch.org/forge/", "header_links_before_dropdown": 7, - "show_nav_level": 2, "show_toc_level": 2, + "navigation_depth": 3, } theme_variables = pytorch_sphinx_theme2.get_theme_variables() @@ -173,6 +173,7 @@ def get_version_path(): "colon_fence", "deflist", "html_image", + "substitution", ] # Configure MyST parser to treat mermaid code blocks as mermaid directives diff --git a/docs/source/getting_started.md b/docs/source/getting_started.md index 3fe46de7e..6c2218806 100644 --- a/docs/source/getting_started.md +++ b/docs/source/getting_started.md @@ -1,9 +1,289 @@ -# Get Started +# Getting Started -Welcome to TorchForge! This guide will help you get up and running with TorchForge, a PyTorch-native platform specifically designed for post-training generative AI models. +This guide will walk you through installing TorchForge, understanding its dependencies, verifying your setup, and running your first training job. -TorchForge specializes in post-training techniques for large language models, including: +## System Requirements -- **Supervised Fine-Tuning (SFT)**: Adapt pre-trained models to specific tasks using labeled data -- **Group Relative Policy Optimization (GRPO)**: Advanced reinforcement learning for model alignment -- **Multi-GPU Distributed Training**: Efficient scaling across multiple GPUs and nodes +Before installing TorchForge, ensure your system meets the following requirements. 
+ +| Component | Requirement | Notes | +|-----------|-------------|-------| +| **Operating System** | Linux (Fedora/Ubuntu/Debian) | MacOS and Windows not currently supported | +| **Python** | 3.10 or higher | Python 3.11 recommended | +| **GPU** | NVIDIA with CUDA support | AMD GPUs not currently supported | +| **Minimum GPUs** | 2+ for SFT, 3+ for GRPO | More GPUs enable larger models | +| **CUDA** | 12.8 | Required for GPU training | +| **RAM** | 32GB+ recommended | Depends on model size | +| **Disk Space** | 50GB+ free | For models, datasets, and checkpoints | +| **PyTorch** | Nightly build | Latest distributed features (DTensor, FSDP) | +| **Monarch** | Pre-packaged wheel | Distributed orchestration and actor system | +| **vLLM** | v0.10.0+ | Fast inference with PagedAttention | +| **TorchTitan** | Latest | Production training infrastructure | + + +## Prerequisites + +- **Conda or Miniconda**: For environment management + - Download from [conda.io](https://docs.conda.io/en/latest/miniconda.html) + +- **GitHub CLI (gh)**: Required for downloading pre-packaged dependencies + - Install instructions: [github.com/cli/cli#installation](https://github.com/cli/cli#installation) + - After installing, authenticate with: `gh auth login` + - You can use either HTTPS or SSH as the authentication protocol + +- **Git**: For cloning the repository + - Usually pre-installed on Linux systems + - Verify with: `git --version` + + +**Installation note:** The installation script provides pre-built wheels with PyTorch nightly already included. + +## Installation + +TorchForge uses pre-packaged wheels for all dependencies, making installation faster and more reliable. + +1. **Clone the Repository** + + ```bash + git clone https://github.com/meta-pytorch/forge.git + cd forge + ``` + +2. **Create Conda Environment** + + ```bash + conda create -n forge python=3.10 + conda activate forge + ``` + +3. 
**Run Installation Script** + + ```bash + ./scripts/install.sh + ``` + + The installation script will: + - Install system dependencies using DNF (or your package manager) + - Download pre-built wheels for PyTorch nightly, Monarch, vLLM, and TorchTitan + - Install TorchForge and all Python dependencies + - Configure the environment for GPU training + + ```{tip} + **Using sudo instead of conda**: If you prefer installing system packages directly rather than through conda, use: + `./scripts/install.sh --use-sudo` + ``` + + ```{warning} + When adding packages to `pyproject.toml`, use `uv sync --inexact` to avoid removing Monarch and vLLM. + ``` + +## Verifying Your Setup + +After installation, verify that all components are working correctly: + +1. **Check GPU Availability** + + ```bash + python -c "import torch; print(f'GPUs available: {torch.cuda.device_count()}')" + ``` + + Expected output: `GPUs available: 2` (or more) + +2. **Check CUDA Version** + + ```bash + python -c "import torch; print(f'CUDA version: {torch.version.cuda}')" + ``` + + Expected output: `CUDA version: 12.8` +3. **Check All Dependencies** + + ```bash + # Check core components + python -c "import torch, forge, monarch, vllm; print('All imports successful')" + + # Check specific versions + python -c " + import torch + import forge + import vllm + + print(f'PyTorch: {torch.__version__}') + print(f'TorchForge: {forge.__version__}') + print(f'vLLM: {vllm.__version__}') + print(f'CUDA: {torch.version.cuda}') + print(f'GPUs: {torch.cuda.device_count()}') + " + ``` + +4. **Verify Monarch** + + ```bash + python -c " + from monarch.actor import Actor, this_host + + # Test basic Monarch functionality + procs = this_host().spawn_procs({'gpus': 1}) + print('Monarch: Process spawning works') + " + ``` + +## Quick Start Examples + +Now that TorchForge is installed, let's run some training examples. 
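One environment detail worth knowing before launching anything: the examples below each assume a minimum GPU count, and on a machine with more GPUs you can pin a run to specific devices with the standard `CUDA_VISIBLE_DEVICES` environment variable (a general CUDA/PyTorch convention, not a TorchForge-specific flag):

```shell
# Expose only the first two GPUs to the training process;
# inside the process they are renumbered as devices 0 and 1.
export CUDA_VISIBLE_DEVICES=0,1
echo "Using GPUs: $CUDA_VISIBLE_DEVICES"  # prints: Using GPUs: 0,1
```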
+
+Here's what training looks like with TorchForge:
+
+```bash
+# Install dependencies
+conda create -n forge python=3.10
+conda activate forge
+git clone https://github.com/meta-pytorch/forge
+cd forge
+./scripts/install.sh
+
+# Download a model
+hf download meta-llama/Meta-Llama-3.1-8B-Instruct --local-dir /tmp/Meta-Llama-3.1-8B-Instruct --exclude "original/consolidated.00.pth"
+
+# Run SFT training (requires 2+ GPUs)
+uv run forge run --nproc_per_node 2 \
+  apps/sft/main.py --config apps/sft/llama3_8b.yaml
+
+# Run GRPO training (requires 3+ GPUs)
+python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml
+```
+
+### Example 1: Supervised Fine-Tuning (SFT)
+
+Fine-tune Llama 3.1 8B on your data. **Requires: 2+ GPUs**
+
+1. **Download the Model**
+
+   ```bash
+   uv run forge download meta-llama/Meta-Llama-3.1-8B-Instruct \
+     --output-dir /tmp/Meta-Llama-3.1-8B-Instruct \
+     --ignore-patterns "original/consolidated.00.pth"
+   ```
+
+   ```{note}
+   Model downloads require Hugging Face authentication. Run `huggingface-cli login` first if you haven't already.
+   ```
+
+2. **Run Training**
+
+   ```bash
+   uv run forge run --nproc_per_node 2 \
+     apps/sft/main.py --config apps/sft/llama3_8b.yaml
+   ```
+
+   **What's Happening:**
+   - `--nproc_per_node 2`: Use 2 GPUs for training
+   - `apps/sft/main.py`: SFT training script
+   - `--config apps/sft/llama3_8b.yaml`: Configuration file with hyperparameters
+   - **TorchTitan** handles model sharding across the 2 GPUs
+   - **Monarch** coordinates the distributed training
+
+### Example 2: GRPO Training
+
+Train a model using reinforcement learning with GRPO.
**Requires: 3+ GPUs**
+
+```bash
+python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml
+```
+
+**What's Happening:**
+- GPU 0: Trainer model (being trained, powered by TorchTitan)
+- GPU 1: Reference model (frozen baseline, powered by TorchTitan)
+- GPU 2: Policy model (scoring outputs, powered by vLLM)
+- **Monarch** orchestrates all three components
+- **TorchStore** handles weight synchronization from training to inference
+
+## Understanding Configuration Files
+
+TorchForge uses YAML configuration files to manage training parameters. Let's examine a typical config:
+
+```yaml
+# Example: apps/sft/llama3_8b.yaml
+model:
+  name: meta-llama/Meta-Llama-3.1-8B-Instruct
+  path: /tmp/Meta-Llama-3.1-8B-Instruct
+
+training:
+  batch_size: 4
+  learning_rate: 1e-5
+  num_epochs: 10
+  gradient_accumulation_steps: 4
+
+distributed:
+  strategy: fsdp  # Managed by TorchTitan
+  precision: bf16
+
+checkpointing:
+  save_interval: 1000
+  output_dir: /tmp/checkpoints
+```
+
+**Key Sections:**
+- **model**: Model path and settings
+- **training**: Hyperparameters like batch size and learning rate
+- **distributed**: Multi-GPU strategy (FSDP, tensor parallel, etc.) handled by TorchTitan
+- **checkpointing**: Where and when to save model checkpoints
+
+## Next Steps
+
+Now that you have TorchForge installed and verified:
+
+1. **Learn the Concepts**: Read {doc}`concepts` to understand TorchForge's architecture, including Monarch, Services, and TorchStore
+2. **Explore Examples**: Check the `apps/` directory for more training examples
+3. **Read Tutorials**: Follow {doc}`tutorials` for step-by-step guides
+4. **API Documentation**: Explore {doc}`api` for detailed API reference
+
+## Getting Help
+
+If you encounter issues:
+
+1. **Search Issues**: Look through [GitHub Issues](https://github.com/meta-pytorch/forge/issues)
+2. 
**File a Bug Report**: Create a new issue with: + - Your system configuration (output of diagnostic command below) + - Full error message + - Steps to reproduce + - Expected vs actual behavior + +**Diagnostic command:** +```bash +python -c " +import torch +import forge + +try: + import monarch + monarch_status = 'OK' +except Exception as e: + monarch_status = str(e) + +try: + import vllm + vllm_version = vllm.__version__ +except Exception as e: + vllm_version = str(e) + +print(f'PyTorch: {torch.__version__}') +print(f'TorchForge: {forge.__version__}') +print(f'Monarch: {monarch_status}') +print(f'vLLM: {vllm_version}') +print(f'CUDA: {torch.version.cuda}') +print(f'GPUs: {torch.cuda.device_count()}') +" +``` + +Include this output in your bug reports! + +## Additional Resources + +- **Contributing Guide**: [CONTRIBUTING.md](https://github.com/meta-pytorch/forge/blob/main/CONTRIBUTING.md) +- **Code of Conduct**: [CODE_OF_CONDUCT.md](https://github.com/meta-pytorch/forge/blob/main/CODE_OF_CONDUCT.md) +- **Monarch Documentation**: [meta-pytorch.org/monarch](https://meta-pytorch.org/monarch) +- **vLLM Documentation**: [docs.vllm.ai](https://docs.vllm.ai) +- **TorchTitan**: [github.com/pytorch/torchtitan](https://github.com/pytorch/torchtitan) diff --git a/docs/source/index.md b/docs/source/index.md index 802d62baa..074fa228f 100644 --- a/docs/source/index.md +++ b/docs/source/index.md @@ -1,11 +1,183 @@ -# Welcome to TorchForge documentation! +# TorchForge Documentation -**TorchForge** is a PyTorch-native platform specifically designed -for post-training generative AI models. +**TorchForge** is a PyTorch-native library for RL post-training and agentic development. Built on the principle that **researchers should write algorithms, not infrastructure**. -Key Features ------------- +```{note} +**Experimental Status:** TorchForge is currently in early development. Expect bugs, incomplete features, and API changes. 
Please file issues on [GitHub](https://github.com/meta-pytorch/forge) for bug reports and feature requests. +``` + +## Why TorchForge? + +Reinforcement Learning has become essential to frontier AI - from instruction following and reasoning to complex research capabilities. But infrastructure complexity often dominates the actual research. + +TorchForge lets you **express RL algorithms as naturally as pseudocode**, while powerful infrastructure handles distribution, fault tolerance, and optimization underneath. + +### Core Design Principles + +- **Algorithms, Not Infrastructure**: Write your RL logic without distributed systems code +- **Any Degree of Asynchrony**: From fully synchronous PPO to fully async off-policy training +- **Composable Components**: Mix and match proven frameworks (vLLM, TorchTitan) with custom logic +- **Built on Solid Foundations**: Leverages Monarch's single-controller model for simplified distributed programming + +## Foundation: The Technology Stack + +TorchForge is built on carefully selected, battle-tested components: + +::::{grid} 1 1 2 2 +:gutter: 3 + +:::{grid-item-card} **Monarch** +:link: https://meta-pytorch.org/monarch + +Single-controller distributed programming framework that orchestrates clusters like you'd program a single machine. Provides actor meshes, fault tolerance, and RDMA-based data transfers. + +**Why it matters:** Eliminates SPMD complexity, making distributed RL tractable +::: + +:::{grid-item-card} **vLLM** +:link: https://docs.vllm.ai + +High-throughput, memory-efficient inference engine with PagedAttention and continuous batching. + +**Why it matters:** Handles policy generation efficiently at scale +::: + +:::{grid-item-card} **TorchTitan** +:link: https://github.com/pytorch/torchtitan + +Meta's production-grade LLM training platform with FSDP, pipeline parallelism, and tensor parallelism. 
+
+**Why it matters:** Battle-tested training infrastructure proven at scale
+:::
+
+:::{grid-item-card} **TorchStore**
+:link: https://github.com/meta-pytorch/torchstore
+
+Distributed, in-memory key-value store for PyTorch tensors built on Monarch, optimized for weight synchronization with automatic DTensor resharding.
+
+**Why it matters:** Solves the weight transfer bottleneck in async RL
+:::
+
+::::
+
+## What You Can Build
+
+::::{grid} 1 1 2 3
+:gutter: 2
+
+:::{grid-item-card} Supervised Fine-Tuning
+Adapt foundation models to specific tasks using labeled data with efficient multi-GPU training.
+:::
+
+:::{grid-item-card} GRPO Training
+Train models with Group Relative Policy Optimization for aligning with human preferences.
+:::
+
+:::{grid-item-card} Asynchronous RL
+Continuous rollout generation with non-blocking training for maximum throughput.
+:::
+
+:::{grid-item-card} Code Execution
+Safe, sandboxed code execution environments for RL on coding tasks (RLVR).
+:::
+
+:::{grid-item-card} Tool Integration
+Extensible environment system for agents that interact with tools and APIs.
+:::
+
+:::{grid-item-card} Custom Workflows
+Build your own components and compose them naturally with existing infrastructure.
+:::
+
+::::
+
+## Requirements at a Glance
+
+Before diving in, check out {doc}`getting_started` and ensure your system meets the requirements.
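That page walks through the full verification steps; as a rough stdlib-only sketch (the helper `check_requirements` below is illustrative, not part of the TorchForge API), you can probe two of the most commonly missed requirements before installing anything:

```python
import shutil
import sys

def check_requirements(min_python=(3, 10)):
    """Best-effort probe of two documented requirements (illustrative only)."""
    return {
        # Python 3.10+ is required; 3.11 is recommended
        "python_ok": sys.version_info[:2] >= min_python,
        # nvidia-smi on PATH is a rough proxy for a usable NVIDIA driver
        "nvidia_driver_found": shutil.which("nvidia-smi") is not None,
    }

print(check_requirements())
```

Deeper checks (CUDA 12.8, GPU count) still require the `torch`-based verification commands in {doc}`getting_started`.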
+ +## Writing RL Code + +With TorchForge, your RL logic looks like pseudocode: + +```python +async def generate_episode(dataloader, policy, reward, replay_buffer): + # Sample a prompt + prompt, target = await dataloader.sample.route() + + # Generate response + response = await policy.generate.route(prompt) + + # Score the response + reward_value = await reward.evaluate_response.route( + prompt=prompt, + response=response.text, + target=target + ) + + # Store for training + await replay_buffer.add.route( + Episode(prompt_ids=response.prompt_ids, + response_ids=response.token_ids, + reward=reward_value) + ) +``` + +No retry logic, no resource management, no synchronization code - just your algorithm. + +## Documentation Paths + +Choose your learning path: + +::::{grid} 1 1 2 2 +:gutter: 3 + +:::{grid-item-card} 🚀 Getting Started +:link: getting_started +:link-type: doc + +Installation, prerequisites, verification, and your first training run. + +**Time to first run: ~15 minutes** +::: + +:::{grid-item-card} 💻 Tutorials +:link: tutorials +:link-type: doc + +Step-by-step guides and practical examples for training with TorchForge. + +**For hands-on development** +::: + +:::{grid-item-card} 📖 API Reference +:link: api +:link-type: doc + +Complete API documentation for customization and extension. 
+ +**For deep integration** +::: + +:::: + +## Validation & Partnerships + +TorchForge has been validated in real-world deployments: + +- **Stanford Collaboration**: Integration with the Weaver weak verifier project, training models that hill-climb on challenging reasoning benchmarks (MATH, GPQA) +- **CoreWeave**: Large-scale training on 512 H100 GPU clusters with smooth, efficient performance +- **Scale**: Tested across hundreds of GPUs with continuous rollouts and asynchronous training + +## Community + +- **GitHub**: [meta-pytorch/forge](https://github.com/meta-pytorch/forge) +- **Issues**: [Report bugs and request features](https://github.com/meta-pytorch/forge/issues) +- **Contributing**: [CONTRIBUTING.md](https://github.com/meta-pytorch/forge/blob/main/CONTRIBUTING.md) +- **Code of Conduct**: [CODE_OF_CONDUCT.md](https://github.com/meta-pytorch/forge/blob/main/CODE_OF_CONDUCT.md) + +```{tip} +Before starting significant work, signal your intention in the issue tracker to coordinate with maintainers. +``` * **Post-Training Focus**: Specializes in techniques like Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) * **PyTorch Integration**: Built natively on PyTorch with @@ -18,17 +190,19 @@ Key Features like Llama3 8B and Qwen3.1 7B ```{toctree} -:maxdepth: 1 -:caption: Contents: +:maxdepth: 2 +:caption: Documentation getting_started -concepts tutorials api ``` -## Indices and tables +## Indices + +* {ref}`genindex` - Index of all documented objects +* {ref}`modindex` - Python module index + +--- -* {ref}`genindex` -* {ref}`modindex` -* {ref}`search` +**License**: BSD 3-Clause | **GitHub**: [meta-pytorch/forge](https://github.com/meta-pytorch/forge)