Docs Content Part 1: Homepage and getting started #448

base: main

---
orphan: true
---

# Getting Started

Welcome to TorchForge, a PyTorch-native platform designed for post-training generative AI models. TorchForge specializes in post-training techniques for large language models, including:

- **Supervised Fine-Tuning (SFT)**: Adapt pre-trained models to specific tasks using labeled data
- **Group Relative Policy Optimization (GRPO)**: Advanced reinforcement learning for model alignment
- **Multi-GPU Distributed Training**: Efficient scaling across multiple GPUs and nodes

This guide walks you through installing TorchForge, understanding its dependencies, verifying your setup, and running your first training job.

## System Requirements

Before installing TorchForge, ensure your system meets the following requirements.

| Component | Requirement | Notes |
|-----------|-------------|-------|
| **Operating System** | Linux (Fedora/Ubuntu/Debian) | macOS and Windows not currently supported |
| **Python** | 3.10 or higher | Python 3.11 recommended |
| **GPU** | NVIDIA with CUDA support | AMD GPUs not currently supported |
| **CUDA** | 12.8 or higher | Required for GPU training |
| **Minimum GPUs** | 2 for SFT, 3 for GRPO | More GPUs enable larger models |
| **RAM** | 32GB+ recommended | Depends on model size |
| **Disk Space** | 50GB+ free | For models, datasets, and checkpoints |

## Prerequisites

- **Conda or Miniconda**: For environment management
  - Download from [conda.io](https://docs.conda.io/en/latest/miniconda.html)

- **GitHub CLI (gh)**: Required for downloading pre-packaged dependencies
  - Install instructions: [github.com/cli/cli#installation](https://github.com/cli/cli#installation)
  - After installing, authenticate with: `gh auth login`
  - You can use either HTTPS or SSH as the authentication protocol

- **Git**: For cloning the repository
  - Usually pre-installed on Linux systems
  - Verify with: `git --version`
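
Before moving on, you can sanity-check these prerequisites from the shell. These are standard CLI commands, shown here only as a quick convenience:

```bash
# Confirm the prerequisites above are installed and working
conda --version    # Conda or Miniconda
gh --version       # GitHub CLI
gh auth status     # Verifies you are authenticated with GitHub
git --version      # Git
```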

**Installation note:** The installation script provides pre-built wheels with PyTorch nightly already included.

## Installation

TorchForge uses pre-packaged wheels for all dependencies, making installation faster and more reliable.

1. **Clone the Repository**

   ```bash
   git clone https://github.com/meta-pytorch/forge.git
   cd forge
   ```

2. **Create Conda Environment**

   ```bash
   conda create -n forge python=3.10
   conda activate forge
   ```

3. **Run Installation Script**

   > **Reviewer comment:** This may be OK for now, but possibly as soon as EOD today we may have different instructions (cc @joecummings). If we keep a script, what the script does will be different. I think we can ship this for now and update it once we're done.

   ```bash
   ./scripts/install.sh
   ```

   The installation script will:
   - Install system dependencies using DNF (or your package manager)
   - Download pre-built wheels for PyTorch nightly, Monarch, vLLM, and TorchTitan
   - Install TorchForge and all Python dependencies
   - Configure the environment for GPU training

   ```{tip}
   **Using sudo instead of conda**: If you prefer installing system packages directly rather than through conda, use:
   `./scripts/install.sh --use-sudo`
   ```
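
   Once the script finishes, you can spot-check that the key wheels landed in your environment. This is ordinary `pip` usage, not a TorchForge command, and exact package names may vary by build:

   ```bash
   # List the core packages the script is expected to install
   pip list | grep -iE 'torch|vllm|monarch|forge'
   ```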

4. **Verify Installation**

   Test that TorchForge is properly installed:

   ```bash
   python -c "import forge; print(f'TorchForge version: {forge.__version__}')"
   python -c "import monarch; print('Monarch: OK')"
   python -c "import vllm; print(f'vLLM version: {vllm.__version__}')"
   ```

```{warning}
When adding packages to `pyproject.toml`, use `uv sync --inexact` to avoid removing Monarch and vLLM.
```
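
For example, after declaring a new dependency in `pyproject.toml`, the sync step would look like this (`--inexact` tells `uv` not to uninstall packages that aren't declared in the project):

```bash
# Re-sync the environment without removing the pre-built
# Monarch and vLLM wheels that pyproject.toml doesn't declare
uv sync --inexact
```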

## Verifying Your Setup

After installation, verify that all components are working correctly:

1. **Check GPU Availability**

   ```bash
   python -c "import torch; print(f'GPUs available: {torch.cuda.device_count()}')"
   ```

   Expected output: `GPUs available: 2` (or more)

2. **Check CUDA Version**

   ```bash
   python -c "import torch; print(f'CUDA version: {torch.version.cuda}')"
   ```

   Expected output: `CUDA version: 12.8` (or higher)

3. **Check All Dependencies**

   ```bash
   # Check core components
   python -c "import torch, forge, monarch, vllm; print('All imports successful')"

   # Check specific versions
   python -c "
   import torch
   import forge
   import vllm
   print(f'PyTorch: {torch.__version__}')
   print(f'TorchForge: {forge.__version__}')
   print(f'vLLM: {vllm.__version__}')
   print(f'CUDA: {torch.version.cuda}')
   print(f'GPUs: {torch.cuda.device_count()}')
   "
   ```

4. **Verify Monarch**

   ```bash
   python -c "
   from monarch.actor import this_host
   # Test basic Monarch functionality: spawn processes on this host
   procs = this_host().spawn_procs({'gpus': 1})
   print('Monarch: Process spawning works')
   "
   ```

## Quick Start Examples

Now that TorchForge is installed, let's run some training examples.

### Example 1: Supervised Fine-Tuning (SFT)

Fine-tune Llama 3.1 8B Instruct on your data. **Requires: 2+ GPUs**

1. **Download the Model**

   > **Reviewer comments:** @ebsmothers @daniellepintz, could you two please review these commands for SFT and ensure this is what we want?
   >
   > @AlannaBurke, let's use this command.
   >
   > Actually, there is no need to download the model anymore; we simplified that in the configs. Users just need to run this command once they have access to the model. This is an easier workflow for users.

   ```bash
   uv run forge download meta-llama/Meta-Llama-3.1-8B-Instruct \
     --output-dir /tmp/Meta-Llama-3.1-8B-Instruct \
     --ignore-patterns "original/consolidated.00.pth"
   ```

   ```{note}
   Model downloads require Hugging Face authentication. Run `huggingface-cli login` first if you haven't already.
   ```

2. **Run Training**

   ```bash
   uv run forge run --nproc_per_node 2 \
     apps/sft/main.py \
     --config apps/sft/llama3_8b.yaml
   ```

   **What's Happening:**
   - `--nproc_per_node 2`: Use 2 GPUs for training
   - `apps/sft/main.py`: SFT training script
   - `--config apps/sft/llama3_8b.yaml`: Configuration file with hyperparameters
   - **TorchTitan** handles model sharding across the 2 GPUs
   - **Monarch** coordinates the distributed training

   **Expected Output:**

   ```
   Initializing process group...
   Loading model from /tmp/Meta-Llama-3.1-8B-Instruct...
   Starting training...
   Epoch 1/10 | Step 100 | Loss: 2.45 | LR: 0.0001
   ...
   ```

### Example 2: GRPO Training

Train a model using reinforcement learning with GRPO. **Requires: 3+ GPUs**

```bash
python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml
```

**What's Happening:**
- GPU 0: Policy model (being trained, powered by TorchTitan)
- GPU 1: Reference model (frozen baseline)
- GPU 2: Reward model (scoring outputs, powered by vLLM)
- **Monarch** orchestrates all three components
- **TorchStore** handles weight synchronization from training to inference

**Expected Output:**

```
Initializing GRPO training...
Loading policy model on GPU 0...
Loading reference model on GPU 1...
Loading reward model on GPU 2...
Episode 1 | Avg Reward: 0.75 | KL Divergence: 0.12
...
```

### Example 3: Inference with vLLM

Generate text using a trained model:

```bash
python -m apps.vllm.main --config apps/vllm/llama3_8b.yaml
```

This loads your model with vLLM and starts an interactive generation session.
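
If you want to script generation yourself rather than use the interactive app, here is a minimal sketch using vLLM's offline Python API. The model path and sampling values are illustrative placeholders, not TorchForge defaults:

```python
# Minimal offline generation with vLLM (illustrative values)
from vllm import LLM, SamplingParams

llm = LLM(model="/tmp/Meta-Llama-3.1-8B-Instruct")  # local checkpoint or HF model ID
params = SamplingParams(temperature=0.8, max_tokens=128)

outputs = llm.generate(["Explain post-training in one paragraph."], params)
for out in outputs:
    print(out.outputs[0].text)
```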

## Understanding Configuration Files

TorchForge uses YAML configuration files to manage training parameters. Let's examine a typical config:

```yaml
# Example: apps/sft/llama3_8b.yaml
model:
  name: meta-llama/Meta-Llama-3.1-8B-Instruct
  path: /tmp/Meta-Llama-3.1-8B-Instruct

training:
  batch_size: 4
  learning_rate: 1e-5
  num_epochs: 10
  gradient_accumulation_steps: 4

distributed:
  strategy: fsdp  # Managed by TorchTitan
  precision: bf16

checkpointing:
  save_interval: 1000
  output_dir: /tmp/checkpoints
```

**Key Sections:**
- **model**: Model path and settings
- **training**: Hyperparameters like batch size and learning rate
- **distributed**: Multi-GPU strategy (FSDP, tensor parallel, etc.) handled by TorchTitan
- **checkpointing**: Where and when to save model checkpoints

See {doc}`usage` for detailed configuration options.
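
Because these are plain YAML files, you can inspect them or derive variants with standard tooling. A sketch using PyYAML (generic YAML handling, not a TorchForge API; the keys assume the example config above):

```python
# Read a config, tweak one value, and write a variant (PyYAML)
import yaml

with open("apps/sft/llama3_8b.yaml") as f:
    cfg = yaml.safe_load(f)

cfg["training"]["batch_size"] = 8  # try a larger batch
with open("apps/sft/llama3_8b_bs8.yaml", "w") as f:
    yaml.safe_dump(cfg, f)
```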

## Next Steps

Now that you have TorchForge installed and verified:

1. **Learn the Concepts**: Read {doc}`concepts` to understand TorchForge's architecture, including Monarch, Services, and TorchStore
2. **Explore Examples**: Check the `apps/` directory for more training examples
3. **Customize Training**: See {doc}`usage` for configuration patterns
4. **Read Tutorials**: Follow {doc}`tutorials` for step-by-step guides
5. **API Documentation**: Explore {doc}`api` for detailed API reference

## Getting Help

If you encounter issues:

1. **Search Issues**: Look through [GitHub Issues](https://github.com/meta-pytorch/forge/issues)
2. **File a Bug Report**: Create a new issue with:
   - Your system configuration (output of the diagnostic command below)
   - The full error message
   - Steps to reproduce
   - Expected vs. actual behavior

**Diagnostic command:**

> **Reviewer comment:** This is a really good idea; let's keep this. I think we should come up with a script for this in our issue templates (cc @joecummings @daniellepintz? not sure who to tag here).

```bash
python -c "
import torch
import forge
try:
    import monarch
    monarch_status = 'OK'
except Exception as e:
    monarch_status = str(e)
try:
    import vllm
    vllm_version = vllm.__version__
except Exception as e:
    vllm_version = str(e)
print(f'PyTorch: {torch.__version__}')
print(f'TorchForge: {forge.__version__}')
print(f'Monarch: {monarch_status}')
print(f'vLLM: {vllm_version}')
print(f'CUDA: {torch.version.cuda}')
print(f'GPUs: {torch.cuda.device_count()}')
"
```

Include this output in your bug reports!

## Additional Resources

- **Contributing Guide**: [CONTRIBUTING.md](https://github.com/meta-pytorch/forge/blob/main/CONTRIBUTING.md)
- **Code of Conduct**: [CODE_OF_CONDUCT.md](https://github.com/meta-pytorch/forge/blob/main/CODE_OF_CONDUCT.md)
- **Monarch Documentation**: [meta-pytorch.org/monarch](https://meta-pytorch.org/monarch)
- **vLLM Documentation**: [docs.vllm.ai](https://docs.vllm.ai)
- **TorchTitan**: [github.com/pytorch/torchtitan](https://github.com/pytorch/torchtitan)

---

**Ready to start training?** Head to {doc}`usage` for practical configuration examples and workflows, or dive into {doc}`concepts` to understand how all the pieces work together.