Commit ff5c5e1

Fixing updates to index and getting started.

1 parent 025d730
2 files changed (+0, -308 lines)

docs/source/getting_started.md

Lines changed: 0 additions & 298 deletions
@@ -36,304 +36,6 @@ Before installing TorchForge, ensure your system meets the following requirement
- Verify with: `git --version`

**Installation note:** The installation script provides pre-built wheels with PyTorch nightly already included.

## Installation

TorchForge uses pre-packaged wheels for all dependencies, making installation faster and more reliable.
1. **Clone the Repository**

   ```bash
   git clone https://github.com/meta-pytorch/forge.git
   cd forge
   ```

2. **Create Conda Environment**

   ```bash
   conda create -n forge python=3.10
   conda activate forge
   ```

3. **Run Installation Script**

   ```bash
   ./scripts/install.sh
   ```
The installation script will:
- Install system dependencies using DNF (or your package manager)
- Download pre-built wheels for PyTorch nightly, Monarch, vLLM, and TorchTitan
- Install TorchForge and all Python dependencies
- Configure the environment for GPU training

```{tip}
**Using sudo instead of conda**: If you prefer installing system packages directly rather than through conda, use:
`./scripts/install.sh --use-sudo`
```

```{warning}
When adding packages to `pyproject.toml`, use `uv sync --inexact` to avoid removing Monarch and vLLM.
```
## Verifying Your Setup

After installation, verify that all components are working correctly:

1. **Check GPU Availability**

   ```bash
   python -c "import torch; print(f'GPUs available: {torch.cuda.device_count()}')"
   ```

   Expected output: `GPUs available: 2` (or more)

2. **Check CUDA Version**

   ```bash
   python -c "import torch; print(f'CUDA version: {torch.version.cuda}')"
   ```

   Expected output: `CUDA version: 12.8`
3. **Check All Dependencies**

   ```bash
   # Check core components
   python -c "import torch, forge, monarch, vllm; print('All imports successful')"

   # Check specific versions
   python -c "
   import torch
   import forge
   import vllm

   print(f'PyTorch: {torch.__version__}')
   print(f'TorchForge: {forge.__version__}')
   print(f'vLLM: {vllm.__version__}')
   print(f'CUDA: {torch.version.cuda}')
   print(f'GPUs: {torch.cuda.device_count()}')
   "
   ```

4. **Verify Monarch**

   ```bash
   python -c "
   from monarch.actor import Actor, this_host

   # Test basic Monarch functionality
   procs = this_host().spawn_procs({'gpus': 1})
   print('Monarch: Process spawning works')
   "
   ```
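The four checks above can also be folded into one defensive script. The following is a stdlib-only sketch (the `report` helper is illustrative, not a TorchForge API) that prints either a version string or the import error for each component, so a partially broken install still produces a full report:

```python
# Illustrative helper, not part of TorchForge: report each component's
# version (or the reason its import failed) in one pass.
import importlib


def report(modules=("torch", "forge", "monarch", "vllm")):
    """Map each module name to its version string, 'OK', or the import error."""
    status = {}
    for name in modules:
        try:
            mod = importlib.import_module(name)
            status[name] = getattr(mod, "__version__", "OK")
        except Exception as exc:  # missing wheel, ABI mismatch, etc.
            status[name] = f"FAILED: {exc}"
    return status


for name, info in report().items():
    print(f"{name}: {info}")
```

Unlike the one-liners above, this keeps going past the first failing import, which is useful when diagnosing which wheel is missing.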
## Quick Start Examples

Now that TorchForge is installed, let's run some training examples.

Here's what training looks like with TorchForge:

```bash
# Install dependencies
conda create -n forge python=3.10
conda activate forge
git clone https://github.com/meta-pytorch/forge
cd forge
./scripts/install.sh

# Download a model
uv run forge download meta-llama/Meta-Llama-3.1-8B-Instruct \
  --output-dir /tmp/Meta-Llama-3.1-8B-Instruct

# Run SFT training (requires 2+ GPUs)
uv run forge run --nproc_per_node 2 \
  apps/sft/main.py --config apps/sft/llama3_8b.yaml

# Run GRPO training (requires 3+ GPUs)
python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml
```
### Example 1: Supervised Fine-Tuning (SFT)

Fine-tune Llama 3.1 8B on your data. **Requires: 2+ GPUs**

1. **Download the Model**

   ```bash
   uv run forge download meta-llama/Meta-Llama-3.1-8B-Instruct \
     --output-dir /tmp/Meta-Llama-3.1-8B-Instruct \
     --ignore-patterns "original/consolidated.00.pth"
   ```

   ```{note}
   Model downloads require Hugging Face authentication. Run `huggingface-cli login` first if you haven't already.
   ```

2. **Run Training**

   ```bash
   uv run forge run --nproc_per_node 2 \
     apps/sft/main.py \
     --config apps/sft/llama3_8b.yaml
   ```

**What's Happening:**
- `--nproc_per_node 2`: Use 2 GPUs for training
- `apps/sft/main.py`: SFT training script
- `--config apps/sft/llama3_8b.yaml`: Configuration file with hyperparameters
- **TorchTitan** handles model sharding across the 2 GPUs
- **Monarch** coordinates the distributed training

**Expected Output:**
```
Initializing process group...
Loading model from /tmp/Meta-Llama-3.1-8B-Instruct...
Starting training...
Epoch 1/10 | Step 100 | Loss: 2.45 | LR: 0.0001
...
```
### Example 2: GRPO Training

Train a model using reinforcement learning with GRPO. **Requires: 3+ GPUs**

```bash
python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml
```

**What's Happening:**
- GPU 0: Trainer model (being trained, powered by TorchTitan)
- GPU 1: Reference model (frozen baseline, powered by TorchTitan)
- GPU 2: Policy model (generating completions, powered by vLLM)
- **Monarch** orchestrates all three components
- **TorchStore** handles weight synchronization from training to inference
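For intuition about what the trainer on GPU 0 optimizes: GRPO scores each sampled completion relative to the other completions for the same prompt, rather than using a learned value function. A stdlib-only sketch of that group-relative normalization, independent of TorchForge's actual implementation:

```python
# Sketch of GRPO's group-relative advantage, independent of TorchForge's
# implementation: rewards for one prompt's sampled completions are
# normalized within that group.
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """(reward - group mean) / (group std + eps) for each completion."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


print(group_relative_advantages([1.0, 0.0, 0.5, 0.5]))
```

Completions scoring above the group mean get positive advantages and are reinforced; those below get negative advantages.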
## Understanding Configuration Files

TorchForge uses YAML configuration files to manage training parameters. Let's examine a typical config:

```yaml
# Example: apps/sft/llama3_8b.yaml
model:
  name: meta-llama/Meta-Llama-3.1-8B-Instruct
  path: /tmp/Meta-Llama-3.1-8B-Instruct

training:
  batch_size: 4
  learning_rate: 1e-5
  num_epochs: 10
  gradient_accumulation_steps: 4

distributed:
  strategy: fsdp  # Managed by TorchTitan
  precision: bf16

checkpointing:
  save_interval: 1000
  output_dir: /tmp/checkpoints
```

**Key Sections:**
- **model**: Model path and settings
- **training**: Hyperparameters like batch size and learning rate
- **distributed**: Multi-GPU strategy (FSDP, tensor parallel, etc.) handled by TorchTitan
- **checkpointing**: Where and when to save model checkpoints
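One quantity worth computing from the `training` section is the effective batch size: `batch_size` times `gradient_accumulation_steps` times the number of data-parallel ranks. A minimal sketch (the function name is illustrative, not a TorchForge API):

```python
# Illustrative arithmetic only; the parameter names mirror the YAML keys
# above, not any TorchForge API.
def effective_batch_size(batch_size: int,
                         grad_accum_steps: int,
                         data_parallel_ranks: int) -> int:
    """Samples consumed per optimizer step across all ranks."""
    return batch_size * grad_accum_steps * data_parallel_ranks


# With the example config (batch_size=4, gradient_accumulation_steps=4)
# on 2 GPUs:
print(effective_batch_size(4, 4, 2))  # → 32
```

Keeping this product constant is the usual way to preserve training dynamics when you change GPU count or per-device batch size.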
## Next Steps

Now that you have TorchForge installed and verified:

1. **Learn the Concepts**: Read {doc}`concepts` to understand TorchForge's architecture, including Monarch, Services, and TorchStore
2. **Explore Examples**: Check the `apps/` directory for more training examples
3. **Read Tutorials**: Follow {doc}`tutorials` for step-by-step guides
4. **API Documentation**: Explore {doc}`api` for a detailed API reference
## Getting Help

If you encounter issues:

1. **Search Issues**: Look through [GitHub Issues](https://github.com/meta-pytorch/forge/issues)
2. **File a Bug Report**: Create a new issue with:
   - Your system configuration (output of the diagnostic command below)
   - Full error message
   - Steps to reproduce
   - Expected vs. actual behavior

**Diagnostic command:**

```bash
python -c "
import torch
import forge

try:
    import monarch
    monarch_status = 'OK'
except Exception as e:
    monarch_status = str(e)

try:
    import vllm
    vllm_version = vllm.__version__
except Exception as e:
    vllm_version = str(e)

print(f'PyTorch: {torch.__version__}')
print(f'TorchForge: {forge.__version__}')
print(f'Monarch: {monarch_status}')
print(f'vLLM: {vllm_version}')
print(f'CUDA: {torch.version.cuda}')
print(f'GPUs: {torch.cuda.device_count()}')
"
```

Include this output in your bug reports!
## Additional Resources

- **Contributing Guide**: [CONTRIBUTING.md](https://github.com/meta-pytorch/forge/blob/main/CONTRIBUTING.md)
- **Code of Conduct**: [CODE_OF_CONDUCT.md](https://github.com/meta-pytorch/forge/blob/main/CODE_OF_CONDUCT.md)
- **Monarch Documentation**: [meta-pytorch.org/monarch](https://meta-pytorch.org/monarch)
- **vLLM Documentation**: [docs.vllm.ai](https://docs.vllm.ai)
- **TorchTitan**: [github.com/pytorch/torchtitan](https://github.com/pytorch/torchtitan)
# Getting Started

This guide will walk you through installing TorchForge, understanding its dependencies, verifying your setup, and running your first training job.

## System Requirements

Before installing TorchForge, ensure your system meets the following requirements.

| Component | Requirement | Notes |
|-----------|-------------|-------|
| **Operating System** | Linux (Fedora/Ubuntu/Debian) | macOS and Windows not currently supported |
| **Python** | 3.10 or higher | Python 3.11 recommended |
| **GPU** | NVIDIA with CUDA support | AMD GPUs not currently supported |
| **Minimum GPUs** | 2+ for SFT, 3+ for GRPO | More GPUs enable larger models |
| **CUDA** | 12.8 | Required for GPU training |
| **RAM** | 32GB+ recommended | Depends on model size |
| **Disk Space** | 50GB+ free | For models, datasets, and checkpoints |
| **PyTorch** | Nightly build | Latest distributed features (DTensor, FSDP) |
| **Monarch** | Pre-packaged wheel | Distributed orchestration and actor system |
| **vLLM** | v0.10.0+ | Fast inference with PagedAttention |
| **TorchTitan** | Latest | Production training infrastructure |

## Prerequisites

- **Conda or Miniconda**: For environment management
  - Download from [conda.io](https://docs.conda.io/en/latest/miniconda.html)

- **GitHub CLI (gh)**: Required for downloading pre-packaged dependencies
  - Install instructions: [github.com/cli/cli#installation](https://github.com/cli/cli#installation)
  - After installing, authenticate with: `gh auth login`
  - You can use either HTTPS or SSH as the authentication protocol

- **Git**: For cloning the repository
  - Usually pre-installed on Linux systems
  - Verify with: `git --version`
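Before running the installer, you can confirm each prerequisite above is on your `PATH`; a small stdlib-only sketch (not a TorchForge utility):

```python
# Check that each prerequisite command is discoverable; warn rather than fail.
import shutil

for tool in ("conda", "gh", "git"):
    path = shutil.which(tool)
    print(f"{tool}: {path if path else 'NOT FOUND - install before continuing'}")
```

Note this only confirms the tools exist; `gh` still needs `gh auth login` before the install script can download wheels.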
**Installation note:** The installation script provides pre-built wheels with PyTorch nightly already included.

## Installation

docs/source/index.md

Lines changed: 0 additions & 10 deletions

@@ -201,8 +201,6 @@ Before starting significant work, signal your intention in the issue tracker to

```{toctree}
:maxdepth: 2
:caption: Documentation
:maxdepth: 2
:caption: Documentation

getting_started
concepts

@@ -218,11 +216,3 @@ api

---

**License**: BSD 3-Clause | **GitHub**: [meta-pytorch/forge](https://github.com/meta-pytorch/forge)

## Indices

* {ref}`genindex` - Index of all documented objects
* {ref}`modindex` - Python module index

---

**License**: BSD 3-Clause | **GitHub**: [meta-pytorch/forge](https://github.com/meta-pytorch/forge)
