22 commits
b1794a9
Updating homepage, getting started, concepts.
AlannaBurke Oct 8, 2025
087e2ff
Update documentation with blog post insights: enhanced homepage, comp…
AlannaBurke Oct 8, 2025
a0b2412
Update docs/source/getting_started.md
AlannaBurke Oct 10, 2025
b6d466c
Update docs/source/index.md
AlannaBurke Oct 10, 2025
b564175
Update docs/source/index.md
AlannaBurke Oct 10, 2025
92ca627
Minor fixes and updates.
AlannaBurke Oct 10, 2025
f4b951b
Merge branch 'getting-started' of github.com:meta-pytorch/forge into …
AlannaBurke Oct 10, 2025
34640e7
Update docs/source/getting_started.md
AlannaBurke Oct 10, 2025
32c8d78
Restructing info.
AlannaBurke Oct 11, 2025
e448c90
Merge branch 'main' of github.com:meta-pytorch/forge into getting-sta…
AlannaBurke Oct 14, 2025
ce9b472
Update docs/source/getting_started.md
AlannaBurke Oct 14, 2025
e998d94
Merge branch 'getting-started' of github.com:meta-pytorch/forge into …
AlannaBurke Oct 15, 2025
c89393c
Updating gpu references.
AlannaBurke Oct 15, 2025
7a31e26
Updating toctree entries.
AlannaBurke Oct 15, 2025
af4eae7
Removing FAQs
AlannaBurke Oct 15, 2025
9d49ee6
Removing FAQ references.
AlannaBurke Oct 15, 2025
c410375
Update docs/source/getting_started.md
AlannaBurke Oct 15, 2025
6c70c8f
Merge branch 'main' into getting-started
AlannaBurke Oct 15, 2025
f9b136a
docs: Improve homepage and getting started pages
AlannaBurke Oct 17, 2025
c41f035
Updating index and getting started pages.
AlannaBurke Oct 17, 2025
e39f29b
Removing broken links and references to concepts.
AlannaBurke Oct 17, 2025
c6c309a
Minor doc changes.
AlannaBurke Oct 18, 2025
3 changes: 2 additions & 1 deletion docs/source/conf.py
@@ -140,8 +140,8 @@ def get_version_path():
"navbar_center": "navbar-nav",
"canonical_url": "https://meta-pytorch.org/forge/",
"header_links_before_dropdown": 7,
"show_nav_level": 2,
"show_toc_level": 2,
"navigation_depth": 3,
}

theme_variables = pytorch_sphinx_theme2.get_theme_variables()
@@ -173,6 +173,7 @@ def get_version_path():
"colon_fence",
"deflist",
"html_image",
"substitution",
]

# Configure MyST parser to treat mermaid code blocks as mermaid directives
306 changes: 300 additions & 6 deletions docs/source/getting_started.md
@@ -1,9 +1,303 @@
-# Get Started
-
-Welcome to TorchForge! This guide will help you get up and running with TorchForge, a PyTorch-native platform specifically designed for post-training generative AI models.
-
-TorchForge specializes in post-training techniques for large language models, including:
-
-- **Supervised Fine-Tuning (SFT)**: Adapt pre-trained models to specific tasks using labeled data
-- **Group Relative Policy Optimization (GRPO)**: Advanced reinforcement learning for model alignment
-- **Multi-GPU Distributed Training**: Efficient scaling across multiple GPUs and nodes
+---
+orphan: true
+---
+
+# Getting Started
+
+This guide will walk you through installing TorchForge, understanding its dependencies, verifying your setup, and running your first training job.
+
## System Requirements

Before installing TorchForge, ensure your system meets the following requirements.

| Component | Requirement | Notes |
|-----------|-------------|-------|
| **Operating System** | Linux (Fedora/Ubuntu/Debian) | macOS and Windows are not currently supported |
| **Python** | 3.10 or higher | Python 3.11 recommended |
| **GPU** | NVIDIA with CUDA support | AMD GPUs not currently supported |
| **Minimum GPUs** | 2+ for SFT, 3+ for GRPO | More GPUs enable larger models |
| **CUDA** | 12.8 | Required for GPU training |
| **RAM** | 32GB+ recommended | Depends on model size |
| **Disk Space** | 50GB+ free | For models, datasets, and checkpoints |
| **PyTorch** | Nightly build | Latest distributed features (DTensor, FSDP) |
| **Monarch** | Pre-packaged wheel | Distributed orchestration and actor system |
| **vLLM** | v0.10.0+ | Fast inference with PagedAttention |
| **TorchTitan** | Latest | Production training infrastructure |
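If you want to check the disk-space requirement from the table above before downloading models, a stdlib-only helper like this works (a convenience sketch, not part of TorchForge):

```python
import shutil

def free_disk_gb(path: str = "/tmp") -> float:
    """Return free disk space at `path` in gigabytes."""
    return shutil.disk_usage(path).free / 1024**3

# The table above recommends 50GB+ free for models, datasets, and checkpoints.
if free_disk_gb() < 50:
    print("Warning: less than 50GB free; large checkpoints may not fit")
else:
    print("Disk space check passed")
```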


## Prerequisites

- **Conda or Miniconda**: For environment management
- Download from [conda.io](https://docs.conda.io/en/latest/miniconda.html)

- **GitHub CLI (gh)**: Required for downloading pre-packaged dependencies
- Install instructions: [github.com/cli/cli#installation](https://github.com/cli/cli#installation)
- After installing, authenticate with: `gh auth login`
- You can use either HTTPS or SSH as the authentication protocol

- **Git**: For cloning the repository
- Usually pre-installed on Linux systems
- Verify with: `git --version`
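You can confirm all three tools are on your `PATH` before running the installer. A small stdlib-only check (a hypothetical helper, not shipped with TorchForge):

```python
import shutil

def check_prerequisites(tools=("git", "gh", "conda")) -> dict[str, bool]:
    """Map each required command-line tool to whether it is found on PATH."""
    return {tool: shutil.which(tool) is not None for tool in tools}

missing = [tool for tool, found in check_prerequisites().items() if not found]
if missing:
    print("Missing tools:", ", ".join(missing))
else:
    print("All prerequisites found")
```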


**Installation note:** The installation script provides pre-built wheels with PyTorch nightly already included.

## Installation

TorchForge uses pre-packaged wheels for all dependencies, making installation faster and more reliable.

1. **Clone the Repository**

```bash
git clone https://github.com/meta-pytorch/forge.git
cd forge
```

2. **Create Conda Environment**

```bash
conda create -n forge python=3.10
conda activate forge
```

3. **Run Installation Script**
Contributor comment:

this may be ok for now, but even possibly as soon as EOD today we may have different instructions cc @joecummings

If we keep a script, what the script does will be different. I think we can ship this for now, and update this once we're done


```bash
./scripts/install.sh
```

The installation script will:
- Install system dependencies using DNF (or your package manager)
- Download pre-built wheels for PyTorch nightly, Monarch, vLLM, and TorchTitan
- Install TorchForge and all Python dependencies
- Configure the environment for GPU training

```{tip}
**Using sudo instead of conda**: If you prefer installing system packages directly rather than through conda, use:
`./scripts/install.sh --use-sudo`
```

```{warning}
When adding packages to `pyproject.toml`, use `uv sync --inexact` to avoid removing Monarch and vLLM.
```

## Verifying Your Setup

After installation, verify that all components are working correctly:

1. **Check GPU Availability**

```bash
python -c "import torch; print(f'GPUs available: {torch.cuda.device_count()}')"
```

Expected output: `GPUs available: 2` (or more)

2. **Check CUDA Version**

```bash
python -c "import torch; print(f'CUDA version: {torch.version.cuda}')"
```

Expected output: `CUDA version: 12.8`

3. **Check All Dependencies**

```bash
# Check core components
python -c "import torch, forge, monarch, vllm; print('All imports successful')"

# Check specific versions
python -c "
import torch
import forge
import vllm
print(f'PyTorch: {torch.__version__}')
print(f'TorchForge: {forge.__version__}')
print(f'vLLM: {vllm.__version__}')
print(f'CUDA: {torch.version.cuda}')
print(f'GPUs: {torch.cuda.device_count()}')
"
```

4. **Verify Monarch**

```bash
python -c "
from monarch.actor import Actor, this_host
# Test basic Monarch functionality
procs = this_host().spawn_procs({'gpus': 1})
print('Monarch: Process spawning works')
"
```

## Quick Start Examples

Now that TorchForge is installed, let's run some training examples.

Here's what training looks like with TorchForge:

```bash
# Install dependencies
conda create -n forge python=3.10
conda activate forge
git clone https://github.com/meta-pytorch/forge
cd forge
./scripts/install.sh

# Download a model
uv run forge download meta-llama/Meta-Llama-3.1-8B-Instruct \
--output-dir /tmp/Meta-Llama-3.1-8B-Instruct

# Run SFT training (requires 2+ GPUs)
uv run forge run --nproc_per_node 2 \
apps/sft/main.py --config apps/sft/llama3_8b.yaml

# Run GRPO training (requires 3+ GPUs)
python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml
```

### Example 1: Supervised Fine-Tuning (SFT)

Fine-tune Llama 3 8B on your data. **Requires: 2+ GPUs**

1. **Download the Model**

Contributor comment:

@ebsmothers @daniellepintz - could you two please review these commands for SFT and ensure this is what we want?

Contributor comment:

@AlannaBurke let's use this command

`hf download meta-llama/Meta-Llama-3.1-8B-Instruct --local-dir /tmp/Meta-Llama-3.1-8B-Instruct --exclude "original/consolidated.00.pth"`

Contributor comment (@daniellepintz, Oct 18, 2025):

Actually there is no need to download the model anymore; we simplified that in the configs. Users just need to run this command once they have access to the model:

`python -m apps.sft.main --config apps/sft/llama3_8b.yaml`

this is an easier workflow for users

```bash
uv run forge download meta-llama/Meta-Llama-3.1-8B-Instruct \
--output-dir /tmp/Meta-Llama-3.1-8B-Instruct \
--ignore-patterns "original/consolidated.00.pth"
```

```{note}
Model downloads require Hugging Face authentication. Run `huggingface-cli login` first if you haven't already.
```

2. **Run Training**

```bash
uv run forge run --nproc_per_node 2 \
apps/sft/main.py \
--config apps/sft/llama3_8b.yaml
```

**What's Happening:**
- `--nproc_per_node 2`: Use 2 GPUs for training
- `apps/sft/main.py`: SFT training script
- `--config apps/sft/llama3_8b.yaml`: Configuration file with hyperparameters
- **TorchTitan** handles model sharding across the 2 GPUs
- **Monarch** coordinates the distributed training

**Expected Output:**
```
Initializing process group...
Loading model from /tmp/Meta-Llama-3.1-8B-Instruct...
Starting training...
Epoch 1/10 | Step 100 | Loss: 2.45 | LR: 0.0001
...
```

### Example 2: GRPO Training

Train a model using reinforcement learning with GRPO. **Requires: 3+ GPUs**

```bash
python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml
```

**What's Happening:**
- GPU 0: Trainer model (being trained, powered by TorchTitan)
- GPU 1: Reference model (frozen baseline, powered by TorchTitan)
- GPU 2: Policy model (scoring outputs, powered by vLLM)
- **Monarch** orchestrates all three components
- **TorchStore** handles weight synchronization from training to inference
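The "group relative" part of GRPO can be sketched in a few lines: rewards for a group of completions sampled from the same prompt are normalized against the group's own mean and standard deviation, which removes the need for a separate value network. This is a minimal illustration of the idea, not TorchForge's actual implementation:

```python
from statistics import mean, stdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against the mean/std of its sampling group."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Three completions scored for the same prompt: the best gets a positive
# advantage, the worst a negative one, and they sum to roughly zero.
print(group_relative_advantages([1.0, 0.5, 0.0]))
```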

## Understanding Configuration Files

TorchForge uses YAML configuration files to manage training parameters. Let's examine a typical config:

```yaml
# Example: apps/sft/llama3_8b.yaml
model:
  name: meta-llama/Meta-Llama-3.1-8B-Instruct
  path: /tmp/Meta-Llama-3.1-8B-Instruct

training:
  batch_size: 4
  learning_rate: 1e-5
  num_epochs: 10
  gradient_accumulation_steps: 4

distributed:
  strategy: fsdp  # Managed by TorchTitan
  precision: bf16

checkpointing:
  save_interval: 1000
  output_dir: /tmp/checkpoints
```

**Key Sections:**

- **model**: Model path and settings
- **training**: Hyperparameters like batch size and learning rate
- **distributed**: Multi-GPU strategy (FSDP, tensor parallel, etc.) handled by TorchTitan
- **checkpointing**: Where and when to save model checkpoints

See the configuration examples in your training scripts for detailed options.
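One useful number to derive from these fields is the effective batch size per optimizer step, which combines the per-GPU batch size, gradient accumulation, and GPU count. A quick sanity check using the illustrative values above (assuming the 2-GPU SFT setup):

```python
def effective_batch_size(per_gpu_batch: int, grad_accum_steps: int, num_gpus: int) -> int:
    """Examples contributing to each optimizer step across all GPUs."""
    return per_gpu_batch * grad_accum_steps * num_gpus

# batch_size=4, gradient_accumulation_steps=4, 2 GPUs
print(effective_batch_size(4, 4, 2))  # 32
```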

## Next Steps

Now that you have TorchForge installed and verified:

1. **Explore Examples**: Check the `apps/` directory for more training examples
2. **Read Tutorials**: Follow {doc}`tutorials` for step-by-step guides
3. **API Documentation**: Explore {doc}`api` for detailed API reference

## Getting Help

If you encounter issues:

1. **Search Issues**: Look through [GitHub Issues](https://github.com/meta-pytorch/forge/issues)
2. **File a Bug Report**: Create a new issue with:
- Your system configuration (output of diagnostic command below)
- Full error message
- Steps to reproduce
- Expected vs actual behavior

**Diagnostic command:**
Contributor comment:

this is a really good idea - let's keep this, and I think we should come up with a script for this in our issue templates..

cc @joecummings @daniellepintz ? not sure who to tag here

```bash
python -c "
import torch
import forge
try:
    import monarch
    monarch_status = 'OK'
except Exception as e:
    monarch_status = str(e)
try:
    import vllm
    vllm_version = vllm.__version__
except Exception as e:
    vllm_version = str(e)
print(f'PyTorch: {torch.__version__}')
print(f'TorchForge: {forge.__version__}')
print(f'Monarch: {monarch_status}')
print(f'vLLM: {vllm_version}')
print(f'CUDA: {torch.version.cuda}')
print(f'GPUs: {torch.cuda.device_count()}')
"
```

Include this output in your bug reports!

## Additional Resources

- **Contributing Guide**: [CONTRIBUTING.md](https://github.com/meta-pytorch/forge/blob/main/CONTRIBUTING.md)
- **Code of Conduct**: [CODE_OF_CONDUCT.md](https://github.com/meta-pytorch/forge/blob/main/CODE_OF_CONDUCT.md)
- **Monarch Documentation**: [meta-pytorch.org/monarch](https://meta-pytorch.org/monarch)
- **vLLM Documentation**: [docs.vllm.ai](https://docs.vllm.ai)
- **TorchTitan**: [github.com/pytorch/torchtitan](https://github.com/pytorch/torchtitan)