**Installation note:** The installation script provides pre-built wheels with PyTorch nightly already included.

## Installation

TorchForge uses pre-packaged wheels for all dependencies, making installation faster and more reliable.

1. **Clone the Repository**

```bash
git clone https://github.com/meta-pytorch/forge.git
cd forge
```

2. **Create Conda Environment**

```bash
conda create -n forge python=3.10
conda activate forge
```

3. **Run Installation Script**

```bash
./scripts/install.sh
```

The installation script will:
- Install system dependencies using DNF (or your package manager)
- Download pre-built wheels for PyTorch nightly, Monarch, vLLM, and TorchTitan
- Install TorchForge and all Python dependencies
- Configure the environment for GPU training

```{tip}
**Using sudo instead of conda**: If you prefer installing system packages directly rather than through conda, use:
`./scripts/install.sh --use-sudo`
```

```{warning}
When adding packages to `pyproject.toml`, use `uv sync --inexact` to avoid removing Monarch and vLLM.
```

## Verifying Your Setup

After installation, verify that all components are working correctly:

1. **Check GPU Availability**

```bash
python -c "import torch; print(f'GPUs available: {torch.cuda.device_count()}')"
```

Expected output: `GPUs available: 2` (or more)

2. **Check CUDA Version**

```bash
python -c "import torch; print(f'CUDA version: {torch.version.cuda}')"
```

Expected output: `CUDA version: 12.8`

3. **Check All Dependencies**

```bash
# Check core components
python -c "import torch, forge, monarch, vllm; print('All imports successful')"

# Check specific versions
python -c "
import torch
import forge
import vllm

print(f'PyTorch: {torch.__version__}')
print(f'TorchForge: {forge.__version__}')
print(f'vLLM: {vllm.__version__}')
print(f'CUDA: {torch.version.cuda}')
print(f'GPUs: {torch.cuda.device_count()}')
"
```

4. **Verify Monarch**

```bash
python -c "
from monarch.actor import Actor, this_host

# Test basic Monarch functionality
procs = this_host().spawn_procs({'gpus': 1})
print('Monarch: Process spawning works')
"
```

## Quick Start Examples

Now that TorchForge is installed, let's run some training examples.

Here's what training looks like with TorchForge:

```bash
# Install dependencies
conda create -n forge python=3.10
conda activate forge
git clone https://github.com/meta-pytorch/forge
cd forge
./scripts/install.sh

# Download a model
uv run forge download meta-llama/Meta-Llama-3.1-8B-Instruct \
  --output-dir /tmp/Meta-Llama-3.1-8B-Instruct

# Run SFT training (requires 2+ GPUs)
uv run forge run --nproc_per_node 2 \
  apps/sft/main.py --config apps/sft/llama3_8b.yaml

# Run GRPO training (requires 3+ GPUs)
python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml
```

### Example 1: Supervised Fine-Tuning (SFT)

Fine-tune Llama 3 8B on your data. **Requires: 2+ GPUs**

1. **Download the Model**

```bash
uv run forge download meta-llama/Meta-Llama-3.1-8B-Instruct \
  --output-dir /tmp/Meta-Llama-3.1-8B-Instruct \
  --ignore-patterns "original/consolidated.00.pth"
```

```{note}
Model downloads require Hugging Face authentication. Run `huggingface-cli login` first if you haven't already.
```

2. **Run Training**

```bash
uv run forge run --nproc_per_node 2 \
  apps/sft/main.py \
  --config apps/sft/llama3_8b.yaml
```

**What's Happening:**
- `--nproc_per_node 2`: Use 2 GPUs for training
- `apps/sft/main.py`: SFT training script
- `--config apps/sft/llama3_8b.yaml`: Configuration file with hyperparameters
- **TorchTitan** handles model sharding across the 2 GPUs
- **Monarch** coordinates the distributed training

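For intuition on what sharding buys you, here is a back-of-envelope estimate (my own sketch, not a TorchForge utility) of per-GPU memory for the weights of an 8B-parameter model in bf16 split across 2 GPUs:

```python
# Rough estimate of per-GPU memory for fully sharded model weights.
# Assumes bf16 (2 bytes per parameter); gradients, optimizer state,
# and activations add substantially more on top of this.
def weight_memory_per_gpu_gb(num_params: float, bytes_per_param: int, num_gpus: int) -> float:
    return num_params * bytes_per_param / num_gpus / 1024**3

print(f"{weight_memory_per_gpu_gb(8e9, 2, 2):.1f} GB of weights per GPU")
# → 7.5 GB of weights per GPU
```

Doubling the GPU count halves the per-device weight footprint, which is why more GPUs enable larger models.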
**Expected Output:**
```
Initializing process group...
Loading model from /tmp/Meta-Llama-3.1-8B-Instruct...
Starting training...
Epoch 1/10 | Step 100 | Loss: 2.45 | LR: 0.0001
...
```

### Example 2: GRPO Training

Train a model using reinforcement learning with GRPO. **Requires: 3+ GPUs**

```bash
python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml
```

**What's Happening:**
- GPU 0: Trainer model (being trained, powered by TorchTitan)
- GPU 1: Reference model (frozen baseline, powered by TorchTitan)
- GPU 2: Policy model (scoring outputs, powered by vLLM)
- **Monarch** orchestrates all three components
- **TorchStore** handles weight synchronization from training to inference

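That three-way layout can be captured in a tiny preflight sketch (hypothetical names, not part of the TorchForge API) that fails fast when the host has too few devices:

```python
# Hypothetical sketch: the GRPO role layout described above,
# plus a fail-fast check that enough GPUs are present.
GRPO_ROLES = {
    0: "trainer (TorchTitan)",
    1: "reference (TorchTitan)",
    2: "policy (vLLM)",
}

def check_gpu_count(available: int) -> None:
    required = len(GRPO_ROLES)
    if available < required:
        raise RuntimeError(f"GRPO needs {required}+ GPUs, found {available}")

check_gpu_count(3)  # OK; with torch installed, pass torch.cuda.device_count()
```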
## Understanding Configuration Files

TorchForge uses YAML configuration files to manage training parameters. Let's examine a typical config:

```yaml
# Example: apps/sft/llama3_8b.yaml
model:
  name: meta-llama/Meta-Llama-3.1-8B-Instruct
  path: /tmp/Meta-Llama-3.1-8B-Instruct

training:
  batch_size: 4
  learning_rate: 1e-5
  num_epochs: 10
  gradient_accumulation_steps: 4

distributed:
  strategy: fsdp  # Managed by TorchTitan
  precision: bf16

checkpointing:
  save_interval: 1000
  output_dir: /tmp/checkpoints
```

**Key Sections:**
- **model**: Model path and settings
- **training**: Hyperparameters like batch size and learning rate
- **distributed**: Multi-GPU strategy (FSDP, tensor parallel, etc.) handled by TorchTitan
- **checkpointing**: Where and when to save model checkpoints

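One number worth deriving from these fields is the effective global batch size. Assuming `batch_size` is per device and a 2-process launch (how the trainer actually interprets these fields may differ), it works out to:

```python
# Effective global batch size for the example config, assuming
# batch_size is per device and nproc_per_node is 2.
batch_size = 4          # training.batch_size
grad_accum_steps = 4    # training.gradient_accumulation_steps
num_processes = 2       # --nproc_per_node

effective_batch_size = batch_size * grad_accum_steps * num_processes
print(effective_batch_size)  # → 32
```

If loss curves look unstable after changing any one of these three knobs, check whether the effective batch size changed with it.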
## Next Steps

Now that you have TorchForge installed and verified:

1. **Learn the Concepts**: Read {doc}`concepts` to understand TorchForge's architecture, including Monarch, Services, and TorchStore
2. **Explore Examples**: Check the `apps/` directory for more training examples
3. **Read Tutorials**: Follow {doc}`tutorials` for step-by-step guides
4. **API Documentation**: Explore {doc}`api` for detailed API reference

## Getting Help

If you encounter issues:

1. **Search Issues**: Look through [GitHub Issues](https://github.com/meta-pytorch/forge/issues)
2. **File a Bug Report**: Create a new issue with:
   - Your system configuration (output of the diagnostic command below)
   - Full error message
   - Steps to reproduce
   - Expected vs actual behavior

**Diagnostic command:**
```bash
python -c "
import torch
import forge

try:
    import monarch
    monarch_status = 'OK'
except Exception as e:
    monarch_status = str(e)

try:
    import vllm
    vllm_version = vllm.__version__
except Exception as e:
    vllm_version = str(e)

print(f'PyTorch: {torch.__version__}')
print(f'TorchForge: {forge.__version__}')
print(f'Monarch: {monarch_status}')
print(f'vLLM: {vllm_version}')
print(f'CUDA: {torch.version.cuda}')
print(f'GPUs: {torch.cuda.device_count()}')
"
```

Include this output in your bug reports!

## Additional Resources

- **Contributing Guide**: [CONTRIBUTING.md](https://github.com/meta-pytorch/forge/blob/main/CONTRIBUTING.md)
- **Code of Conduct**: [CODE_OF_CONDUCT.md](https://github.com/meta-pytorch/forge/blob/main/CODE_OF_CONDUCT.md)
- **Monarch Documentation**: [meta-pytorch.org/monarch](https://meta-pytorch.org/monarch)
- **vLLM Documentation**: [docs.vllm.ai](https://docs.vllm.ai)
- **TorchTitan**: [github.com/pytorch/torchtitan](https://github.com/pytorch/torchtitan)
# Getting Started

This guide will walk you through installing TorchForge, understanding its dependencies, verifying your setup, and running your first training job.

## System Requirements

Before installing TorchForge, ensure your system meets the following requirements.

| Component | Requirement | Notes |
|-----------|-------------|-------|
| **Operating System** | Linux (Fedora/Ubuntu/Debian) | macOS and Windows not currently supported |
| **Python** | 3.10 or higher | Python 3.11 recommended |
| **GPU** | NVIDIA with CUDA support | AMD GPUs not currently supported |
| **Minimum GPUs** | 2+ for SFT, 3+ for GRPO | More GPUs enable larger models |
| **CUDA** | 12.8 | Required for GPU training |
| **RAM** | 32GB+ recommended | Depends on model size |
| **Disk Space** | 50GB+ free | For models, datasets, and checkpoints |
| **PyTorch** | Nightly build | Latest distributed features (DTensor, FSDP) |
| **Monarch** | Pre-packaged wheel | Distributed orchestration and actor system |
| **vLLM** | v0.10.0+ | Fast inference with PagedAttention |
| **TorchTitan** | Latest | Production training infrastructure |

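A few rows of the requirements table can be sanity-checked with the standard library alone before you install anything. This is a minimal sketch of my own (not a TorchForge tool); GPU and CUDA checks need PyTorch, so they are covered by the verification steps instead:

```python
import shutil
import sys

# Minimal host preflight against the requirements table above.
# Only checks Python version and free disk; GPU/CUDA need PyTorch.
def preflight(min_python=(3, 10), min_free_gb=50):
    issues = []
    if sys.version_info[:2] < min_python:
        issues.append(
            f"Python {sys.version_info.major}.{sys.version_info.minor} "
            f"is below {min_python[0]}.{min_python[1]}"
        )
    free_gb = shutil.disk_usage(".").free / 1024**3
    if free_gb < min_free_gb:
        issues.append(f"only {free_gb:.0f} GB free, {min_free_gb}+ recommended")
    return issues

print(preflight() or "host looks OK")
```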

## Prerequisites

- **Conda or Miniconda**: For environment management
  - Download from [conda.io](https://docs.conda.io/en/latest/miniconda.html)

- **GitHub CLI (gh)**: Required for downloading pre-packaged dependencies
  - Install instructions: [github.com/cli/cli#installation](https://github.com/cli/cli#installation)
  - After installing, authenticate with: `gh auth login`
  - You can use either HTTPS or SSH as the authentication protocol

- **Git**: For cloning the repository
  - Usually pre-installed on Linux systems
  - Verify with: `git --version`