|
4 | 4 | "cell_type": "markdown", |
5 | 5 | "metadata": {}, |
6 | 6 | "source": [ |
7 | | - "# SFT Training - Interactive Configuration Notebook\n", |
| 7 | + "# 🚀 The SFT Training Story: From Configuration to Completion\n", |
8 | 8 | "\n", |
9 | | - "This notebook allows you to configure and run SFT training **without any YAML files**!\n", |
| 9 | + "Welcome to an interactive journey through **Supervised Fine-Tuning (SFT)** in Forge!\n", |
10 | 10 | "\n", |
11 | | - "## Benefits\n", |
| 11 | + "## What You'll Learn\n", |
12 | 12 | "\n", |
13 | | - "✅ No external YAML files needed \n", |
14 | | - "✅ Interactive configuration in separate cells \n", |
15 | | - "✅ Easy to modify and experiment \n", |
16 | | - "✅ All configuration visible in notebook \n", |
17 | | - "✅ Quick templates for common scenarios" |
| 13 | + "This notebook tells the complete story of how SFT training works:\n", |
| 14 | + "\n", |
| 15 | + "1. **🎭 The Actor Model** - Understanding TrainerActor\n", |
| 16 | + "2. **🔧 Setup Phase** - Loading models, data, and checkpoints\n", |
| 17 | + "3. **🏃 Training Loop** - Forward passes, backprop, optimization\n", |
| 18 | + "4. **📊 Validation** - Measuring progress on held-out data\n", |
| 19 | + "5. **🧹 Cleanup** - Saving checkpoints and releasing resources\n", |
| 20 | + "\n", |
| 21 | + "---\n", |
| 22 | + "\n", |
| 23 | + "## The Forge Actor Architecture\n", |
| 24 | + "\n", |
| 25 | + "### What is a TrainerActor?\n", |
| 26 | + "\n", |
| 27 | + "Think of a **TrainerActor** as the conductor of an orchestra:\n", |
| 28 | + "- 🎭 **Manages multiple processes** across GPUs or nodes\n", |
| 29 | + "- 🔧 **Controls the lifecycle** of training (setup → train → cleanup)\n", |
| 30 | + "- 📊 **Coordinates distributed training** with FSDP, tensor parallelism, etc.\n", |
| 31 | + "\n", |
| 32 | + "### The Training Journey\n", |
| 33 | + "\n", |
| 34 | + "```\n", |
| 35 | + "┌─────────────────────────────────────────┐\n", |
| 36 | + "│ 1. Configuration 📋 │ ← You define parameters\n", |
| 37 | + "│ (model, data, hyperparameters) │\n", |
| 38 | + "└──────────────┬──────────────────────────┘\n", |
| 39 | + " ↓\n", |
| 40 | + "┌─────────────────────────────────────────┐\n", |
| 41 | + "│ 2. Spawn Actor 🎭 │ ← Forge creates distributed processes\n", |
| 42 | + "│ (launch 8 GPU processes) │\n", |
| 43 | + "└──────────────┬──────────────────────────┘\n", |
| 44 | + " ↓\n", |
| 45 | + "┌─────────────────────────────────────────┐\n", |
| 46 | + "│ 3. Setup Phase 🔧 │ ← Load model, data, checkpoints\n", |
| 47 | + "│ - Initialize model with FSDP │\n", |
| 48 | + "│ - Load training dataset │\n", |
| 49 | + "│ - Load validation dataset │\n", |
| 50 | + "│ - Restore from checkpoint (if any) │\n", |
| 51 | + "└──────────────┬──────────────────────────┘\n", |
| 52 | + " ↓\n", |
| 53 | + "┌─────────────────────────────────────────┐\n", |
| 54 | + "│ 4. Training Loop 🔄 │ ← The main training process\n", |
| 55 | + "│ FOR each step: │\n", |
| 56 | + "│ → Get batch from dataloader │\n", |
| 57 | + "│ → Forward pass (compute loss) │\n", |
| 58 | + "│ → Backward pass (compute grads) │\n", |
| 59 | + "│ → Optimizer step (update weights) │\n", |
| 60 | + "│ → [Optional] Run validation │\n", |
| 61 | + "│ → [Optional] Save checkpoint │\n", |
| 62 | + "└──────────────┬──────────────────────────┘\n", |
| 63 | + " ↓\n", |
| 64 | + "┌─────────────────────────────────────────┐\n", |
| 65 | + "│ 5. Cleanup Phase 🧹 │ ← Save final state\n", |
| 66 | + "│ - Save final checkpoint │\n", |
| 67 | + "│ - Release GPU memory │\n", |
| 68 | + "│ - Stop all processes │\n", |
| 69 | + "└─────────────────────────────────────────┘\n", |
| 70 | + "```\n", |
| 71 | + "\n", |
| 72 | + "### Why This Architecture?\n", |
| 73 | + "\n", |
| 74 | + "✅ **Automatic Distribution** - Forge handles multi-GPU/multi-node complexity \n", |
| 75 | + "✅ **Fault Tolerance** - Checkpointing enables recovery from failures \n", |
| 76 | + "✅ **Flexibility** - Easy to switch between 1 GPU, 8 GPUs, or multiple nodes \n", |
| 77 | + "✅ **Production-Ready** - Used at Meta for large-scale training\n", |
| 78 | + "\n", |
| 79 | + "---\n", |
| 80 | + "\n", |
| 81 | + "Let's configure your training!" |
18 | 82 | ] |
19 | 83 | }, |
20 | 84 | { |
21 | 85 | "cell_type": "markdown", |
22 | 86 | "metadata": {}, |
23 | 87 | "source": [ |
24 | | - "## Step 1: Import Dependencies" |
| 88 | + "---\n", |
| 89 | + "\n", |
| 90 | + "# 📚 Part 1: Configuration\n", |
| 91 | + "\n", |
| 92 | + "## The Foundation - Defining Your Training\n", |
| 93 | + "\n", |
| 94 | + "Before we can train, we need to tell Forge:\n", |
| 95 | + "- **What model** to train (Llama3-8B, Qwen3-32B, etc.)\n", |
| 96 | + "- **What data** to use (datasets, batch sizes)\n", |
| 97 | + "- **How to train** (learning rate, optimizer, steps)\n", |
| 98 | + "- **Where to run** (GPUs, FSDP settings)\n", |
| 99 | + "\n", |
| 100 | + "Let's start by importing our tools..." |
| 101 | + ] |
| 102 | + }, |
| 103 | + { |
| 104 | + "cell_type": "markdown", |
| 105 | + "metadata": {}, |
| 106 | + "source": [ |
| 107 | + "## Import Dependencies\n", |
| 108 | + "\n", |
| 109 | + "These imports give us access to:\n", |
| 110 | + "- **OmegaConf**: Configuration management\n", |
| 111 | + "- **TrainerActor**: The main training orchestrator\n", |
| 112 | + "- **SpawnActor**: Helper for creating distributed actors" |
25 | 113 | ] |
26 | 114 | }, |
27 | 115 | { |
|
76 | 164 | "cell_type": "markdown", |
77 | 165 | "metadata": {}, |
78 | 166 | "source": [ |
79 | | - "## Step 2: Configure Model and Process Settings\n", |
| 167 | + "## Configure Model and Process Settings\n", |
80 | 168 | "\n", |
81 | 169 | "Define your model configuration and how many processes to use." |
82 | 170 | ] |
|
132 | 220 | "cell_type": "markdown", |
133 | 221 | "metadata": {}, |
134 | 222 | "source": [ |
135 | | - "## Step 3: Configure Optimizer and LR Scheduler" |
| 223 | + "## Configure Optimizer and LR Scheduler" |
136 | 224 | ] |
137 | 225 | }, |
138 | 226 | { |
|
184 | 272 | "cell_type": "markdown", |
185 | 273 | "metadata": {}, |
186 | 274 | "source": [ |
187 | | - "## Step 4: Configure Training Settings\n", |
| 275 | + "## Configure Training Settings\n", |
188 | 276 | "\n", |
189 | 277 | "**Key parameters to adjust for your experiment:**" |
190 | 278 | ] |
|
232 | 320 | "cell_type": "markdown", |
233 | 321 | "metadata": {}, |
234 | 322 | "source": [ |
235 | | - "## Step 5: Configure Parallelism Settings" |
| 323 | + "## Configure Parallelism Settings" |
236 | 324 | ] |
237 | 325 | }, |
238 | 326 | { |
|
280 | 368 | "cell_type": "markdown", |
281 | 369 | "metadata": {}, |
282 | 370 | "source": [ |
283 | | - "## Step 6: Configure Checkpoint and Activation Checkpointing" |
| 371 | + "## Configure Checkpoint and Activation Checkpointing" |
284 | 372 | ] |
285 | 373 | }, |
286 | 374 | { |
|
342 | 430 | "cell_type": "markdown", |
343 | 431 | "metadata": {}, |
344 | 432 | "source": [ |
345 | | - "## Step 7: Configure Communication Settings" |
| 433 | + "## Configure Communication Settings" |
346 | 434 | ] |
347 | 435 | }, |
348 | 436 | { |
|
379 | 467 | "cell_type": "markdown", |
380 | 468 | "metadata": {}, |
381 | 469 | "source": [ |
382 | | - "## Step 8: Combine All Configurations\n", |
| 470 | + "## Combine All Configurations\n", |
383 | 471 | "\n", |
384 | 472 | "Now let's merge everything into a complete configuration!" |
385 | 473 | ] |
|
475 | 563 | "cell_type": "markdown", |
476 | 564 | "metadata": {}, |
477 | 565 | "source": [ |
478 | | - "## Step 9: Run Training\n", |
479 | | - "\n", |
480 | | - "### Option A: Automatic Lifecycle Management (Recommended)\n", |
481 | | - "\n", |
482 | | - "Use `run_actor()` for automatic setup, training, and cleanup:" |
| 566 | + "## Part 2: Run Training\n" |
483 | 567 | ] |
484 | 568 | }, |
485 | 569 | { |
|
488 | 572 | "metadata": {}, |
489 | 573 | "outputs": [], |
490 | 574 | "source": [ |
491 | | - "# Run training with automatic lifecycle management\n", |
| 575 | + "\n", |
492 | 576 | "await run_actor(TrainerActor, cfg)" |
493 | 577 | ] |
494 | 578 | }, |
495 | 579 | { |
496 | 580 | "cell_type": "markdown", |
497 | 581 | "metadata": {}, |
498 | 582 | "source": [ |
499 | | - "## Alternative: Manual Lifecycle Control\n", |
| 583 | + "---\n", |
| 584 | + "\n", |
| 585 | + "# 🎭 Part 2: The Actor Lifecycle\n", |
| 586 | + "\n", |
| 587 | + "## Understanding Spawn, Setup, Train, and Cleanup\n", |
| 588 | + "\n", |
| 589 | + "### Phase 1: Spawn the Actor 🎭\n", |
| 590 | + "\n", |
| 591 | + "**What's happening:**\n", |
| 592 | + "- `SpawnActor` creates a launcher for `TrainerActor`\n", |
| 593 | + "- `spawn()` launches 8 Python processes (one per GPU)\n", |
| 594 | + "- Each process initializes:\n", |
| 595 | + " - CUDA device assignment (GPU 0, 1, 2, ...)\n", |
| 596 | + " - Distributed communication (NCCL)\n", |
| 597 | + " - Process group setup (RANK, LOCAL_RANK, WORLD_SIZE)\n", |
| 598 | + "\n", |
| 599 | + "**Behind the scenes:**\n", |
| 600 | + "```\n", |
| 601 | + "GPU 0: Process 0 (RANK=0, LOCAL_RANK=0)\n", |
| 602 | + "GPU 1: Process 1 (RANK=1, LOCAL_RANK=1)\n", |
| 603 | + "...\n", |
| 604 | + "GPU 7: Process 7 (RANK=7, LOCAL_RANK=7)\n", |
| 605 | + "```\n", |
| 606 | + "\n", |
| 607 | + "All processes are now waiting for instructions!\n", |
| 608 | + "### What Happens When You Run This?\n", |
500 | 609 | "\n", |
501 | | - "For more control, manage each phase separately.\n", |
| 610 | + "1. **Spawn** 🎭: Forge creates 8 GPU processes (based on `procs: 8`)\n", |
| 611 | + "2. **Setup** 🔧: Each process loads its shard of the model + data\n", |
| 612 | + "3. **Train** 🏃: Training loop runs for 1000 steps\n", |
| 613 | + "4. **Cleanup** 🧹: Final checkpoint saved, resources released\n", |
502 | 614 | "\n", |
503 | | - "### Create and Spawn the Actor" |
| 615 | + "Uncomment the line below to start training!" |
504 | 616 | ] |
505 | 617 | }, |
506 | 618 | { |
|