
Commit bf0839a

Author: Hossein Kavianihamedani (committed)

Adding more explanation, narrating the overall story flow, and removing the extra steps

1 parent 95650dd commit bf0839a

File tree

1 file changed: +137 −25 lines changed


apps/sft/interactive_config_notebook.ipynb

Lines changed: 137 additions & 25 deletions
@@ -4,24 +4,112 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "# SFT Training - Interactive Configuration Notebook\n",
+ "# 🚀 The SFT Training Story: From Configuration to Completion\n",
  "\n",
- "This notebook allows you to configure and run SFT training **without any YAML files**!\n",
+ "Welcome to an interactive journey through **Supervised Fine-Tuning (SFT)** in Forge!\n",
  "\n",
- "## Benefits\n",
+ "## What You'll Learn\n",
  "\n",
- "✅ No external YAML files needed \n",
- "✅ Interactive configuration in separate cells \n",
- "✅ Easy to modify and experiment \n",
- "✅ All configuration visible in notebook \n",
- "✅ Quick templates for common scenarios"
+ "This notebook tells the complete story of how SFT training works:\n",
+ "\n",
+ "1. **🎭 The Actor Model** - Understanding TrainerActor\n",
+ "2. **🔧 Setup Phase** - Loading models, data, and checkpoints\n",
+ "3. **🏃 Training Loop** - Forward passes, backprop, optimization\n",
+ "4. **📊 Validation** - Measuring progress on held-out data\n",
+ "5. **🧹 Cleanup** - Saving checkpoints and releasing resources\n",
+ "\n",
+ "---\n",
+ "\n",
+ "## The Forge Actor Architecture\n",
+ "\n",
+ "### What is a TrainerActor?\n",
+ "\n",
+ "Think of a **TrainerActor** as the conductor of an orchestra:\n",
+ "- 🎭 **Manages multiple processes** across GPUs or nodes\n",
+ "- 🔧 **Controls the lifecycle** of training (setup → train → cleanup)\n",
+ "- 📊 **Coordinates distributed training** with FSDP, tensor parallelism, etc.\n",
+ "\n",
+ "### The Training Journey\n",
+ "\n",
+ "```\n",
+ "┌─────────────────────────────────────────┐\n",
+ "│ 1. Configuration 📋 │ ← You define parameters\n",
+ "│ (model, data, hyperparameters) │\n",
+ "└──────────────┬──────────────────────────┘\n",
+ " ↓\n",
+ "┌─────────────────────────────────────────┐\n",
+ "│ 2. Spawn Actor 🎭 │ ← Forge creates distributed processes\n",
+ "│ (launch 8 GPU processes) │\n",
+ "└──────────────┬──────────────────────────┘\n",
+ " ↓\n",
+ "┌─────────────────────────────────────────┐\n",
+ "│ 3. Setup Phase 🔧 │ ← Load model, data, checkpoints\n",
+ "│ - Initialize model with FSDP │\n",
+ "│ - Load training dataset │\n",
+ "│ - Load validation dataset │\n",
+ "│ - Restore from checkpoint (if any) │\n",
+ "└──────────────┬──────────────────────────┘\n",
+ " ↓\n",
+ "┌─────────────────────────────────────────┐\n",
+ "│ 4. Training Loop 🔄 │ ← The main training process\n",
+ "│ FOR each step: │\n",
+ "│ → Get batch from dataloader │\n",
+ "│ → Forward pass (compute loss) │\n",
+ "│ → Backward pass (compute grads) │\n",
+ "│ → Optimizer step (update weights) │\n",
+ "│ → [Optional] Run validation │\n",
+ "│ → [Optional] Save checkpoint │\n",
+ "└──────────────┬──────────────────────────┘\n",
+ " ↓\n",
+ "┌─────────────────────────────────────────┐\n",
+ "│ 5. Cleanup Phase 🧹 │ ← Save final state\n",
+ "│ - Save final checkpoint │\n",
+ "│ - Release GPU memory │\n",
+ "│ - Stop all processes │\n",
+ "└─────────────────────────────────────────┘\n",
+ "```\n",
+ "\n",
+ "### Why This Architecture?\n",
+ "\n",
+ "✅ **Automatic Distribution** - Forge handles multi-GPU/multi-node complexity \n",
+ "✅ **Fault Tolerance** - Checkpointing enables recovery from failures \n",
+ "✅ **Flexibility** - Easy to switch between 1 GPU, 8 GPUs, or multiple nodes \n",
+ "✅ **Production-Ready** - Used at Meta for large-scale training\n",
+ "\n",
+ "---\n",
+ "\n",
+ "Let's configure your training!"
  ]
  },
  {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "## Step 1: Import Dependencies"
+ "---\n",
+ "\n",
+ "# 📚 Part 1: Configuration\n",
+ "\n",
+ "## The Foundation - Defining Your Training\n",
+ "\n",
+ "Before we can train, we need to tell Forge:\n",
+ "- **What model** to train (Llama3-8B, Qwen3-32B, etc.)\n",
+ "- **What data** to use (datasets, batch sizes)\n",
+ "- **How to train** (learning rate, optimizer, steps)\n",
+ "- **Where to run** (GPUs, FSDP settings)\n",
+ "\n",
+ "Let's start by importing our tools..."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Import Dependencies\n",
+ "\n",
+ "These imports give us access to:\n",
+ "- **OmegaConf**: Configuration management\n",
+ "- **TrainerActor**: The main training orchestrator\n",
+ "- **SpawnActor**: Helper for creating distributed actors"
  ]
  },
  {
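
The rewritten introduction above describes a single configure-then-run flow built around `TrainerActor` and `run_actor`. A minimal sketch of that flow, assuming the notebook's OmegaConf-based configuration; the import paths and config keys below are illustrative assumptions, not taken from the commit:

```python
from omegaconf import OmegaConf

# Hypothetical import paths -- the notebook's own import cell defines the real ones.
from forge.actors import TrainerActor, run_actor

# 1. Configuration: what to train, on what data, how, and where.
cfg = OmegaConf.create({
    "model": {"name": "llama3_8b"},                       # illustrative key
    "processes": {"procs": 8},                            # one worker process per GPU
    "training": {"steps": 1000, "local_batch_size": 1},   # illustrative keys
})

# 2-5. Spawn, setup, train, and cleanup are handled by run_actor's lifecycle
# management; in the notebook this call is awaited directly from a cell.
async def main() -> None:
    await run_actor(TrainerActor, cfg)
```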
@@ -76,7 +164,7 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "## Step 2: Configure Model and Process Settings\n",
+ "## Configure Model and Process Settings\n",
  "\n",
  "Define your model configuration and how many processes to use."
  ]
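
The cell under this heading defines the model and the process count. A hedged sketch of such a section, assuming OmegaConf dict configs; the field names are illustrative, not the notebook's actual schema:

```python
from omegaconf import OmegaConf

# Illustrative model and process settings (field names are assumptions).
model_cfg = OmegaConf.create({
    "model": {
        "name": "llama3_8b",   # which model to fine-tune
    },
    "processes": {
        "procs": 8,            # number of worker processes (one per GPU)
        "with_gpus": True,     # assumed flag for GPU-backed workers
    },
})
```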
@@ -132,7 +220,7 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "## Step 3: Configure Optimizer and LR Scheduler"
+ "## Configure Optimizer and LR Scheduler"
  ]
  },
  {
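
The optimizer and scheduler cell renamed here carries the usual hyperparameters. An illustrative sketch, with key names and values as assumptions rather than the notebook's actual contents:

```python
from omegaconf import OmegaConf

# Illustrative optimizer and LR-scheduler settings.
optim_cfg = OmegaConf.create({
    "optimizer": {"name": "AdamW", "lr": 1e-5, "eps": 1e-8},
    "lr_scheduler": {"warmup_steps": 200},
})
```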
@@ -184,7 +272,7 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "## Step 4: Configure Training Settings\n",
+ "## Configure Training Settings\n",
  "\n",
  "**Key parameters to adjust for your experiment:**"
  ]
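
"Key parameters to adjust for your experiment" typically means step count, batch size, sequence length, and validation cadence. A hedged example of that section; the keys are assumptions:

```python
from omegaconf import OmegaConf

# Illustrative training knobs.
training_cfg = OmegaConf.create({
    "training": {
        "steps": 1000,             # total optimizer steps
        "local_batch_size": 1,     # per-process batch size
        "seq_len": 2048,           # maximum sequence length
        "val_every_n_steps": 100,  # assumed key: how often to run validation
    },
})
```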
@@ -232,7 +320,7 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "## Step 5: Configure Parallelism Settings"
+ "## Configure Parallelism Settings"
  ]
  },
  {
@@ -280,7 +368,7 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "## Step 6: Configure Checkpoint and Activation Checkpointing"
+ "## Configure Checkpoint and Activation Checkpointing"
  ]
  },
  {
@@ -342,7 +430,7 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "## Step 7: Configure Communication Settings"
+ "## Configure Communication Settings"
  ]
  },
  {
@@ -379,7 +467,7 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "## Step 8: Combine All Configurations\n",
+ "## Combine All Configurations\n",
  "\n",
  "Now let's merge everything into a complete configuration!"
  ]
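
The "Combine All Configurations" cell merges the per-section dicts into the single `cfg` passed to training. OmegaConf's `merge` does exactly this; the sketch below reuses the illustrative configs from the earlier sketches:

```python
from omegaconf import OmegaConf

# Later arguments win on key collisions, so overrides can be layered last.
cfg = OmegaConf.merge(model_cfg, optim_cfg, training_cfg)

# Inspect the final configuration before handing it to the TrainerActor.
print(OmegaConf.to_yaml(cfg))
```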
@@ -475,11 +563,7 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "## Step 9: Run Training\n",
- "\n",
- "### Option A: Automatic Lifecycle Management (Recommended)\n",
- "\n",
- "Use `run_actor()` for automatic setup, training, and cleanup:"
+ "## Part 2: Run Training\n"
  ]
  },
  {
@@ -488,19 +572,47 @@
  "metadata": {},
  "outputs": [],
  "source": [
- "# Run training with automatic lifecycle management\n",
+ "\n",
  "await run_actor(TrainerActor, cfg)"
  ]
  },
  {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "## Alternative: Manual Lifecycle Control\n",
+ "---\n",
+ "\n",
+ "# 🎭 Part 2: The Actor Lifecycle\n",
+ "\n",
+ "## Understanding Spawn, Setup, Train, and Cleanup\n",
+ "\n",
+ "### Phase 1: Spawn the Actor 🎭\n",
+ "\n",
+ "**What's happening:**\n",
+ "- `SpawnActor` creates a launcher for `TrainerActor`\n",
+ "- `spawn()` launches 8 Python processes (one per GPU)\n",
+ "- Each process initializes:\n",
+ " - CUDA device assignment (GPU 0, 1, 2, ...)\n",
+ " - Distributed communication (NCCL)\n",
+ " - Process group setup (RANK, LOCAL_RANK, WORLD_SIZE)\n",
+ "\n",
+ "**Behind the scenes:**\n",
+ "```\n",
+ "GPU 0: Process 0 (RANK=0, LOCAL_RANK=0)\n",
+ "GPU 1: Process 1 (RANK=1, LOCAL_RANK=1)\n",
+ "...\n",
+ "GPU 7: Process 7 (RANK=7, LOCAL_RANK=7)\n",
+ "```\n",
+ "\n",
+ "All processes are now waiting for instructions!\n",
+ "### What Happens When You Run This?\n",
  "\n",
- "For more control, manage each phase separately.\n",
+ "1. **Spawn** 🎭: Forge creates 8 GPU processes (based on `procs: 8`)\n",
+ "2. **Setup** 🔧: Each process loads its shard of the model + data\n",
+ "3. **Train** 🏃: Training loop runs for 1000 steps\n",
+ "4. **Cleanup** 🧹: Final checkpoint saved, resources released\n",
  "\n",
- "### Create and Spawn the Actor"
+ "Uncomment the line below to start training!"
  ]
  },
  {
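
The spawn description added above ends at per-process RANK, LOCAL_RANK, WORLD_SIZE, and NCCL setup. The sketch below shows what each of the eight spawned workers conventionally does with those environment variables; this is standard torch.distributed usage, not Forge's internal implementation:

```python
import os

import torch
import torch.distributed as dist

# Each spawned worker reads its identity from environment variables set by the launcher.
rank = int(os.environ["RANK"])              # global index across all processes
local_rank = int(os.environ["LOCAL_RANK"])  # index on this node -> which GPU to claim
world_size = int(os.environ["WORLD_SIZE"])  # total number of processes (8 here)

# Pin this process to its GPU and join the NCCL process group.
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)

# From here every worker can take part in collective ops (all-reduce, broadcast, ...)
# and waits for the TrainerActor's setup / train / cleanup instructions.
```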
