|
4 | 4 | "cell_type": "markdown", |
5 | 5 | "metadata": {}, |
6 | 6 | "source": [ |
7 | | - "# SFT Training - Interactive Configuration Notebook\n", |
| 7 | + "# 🚀 The SFT Training Story: From Configuration to Completion\n", |
8 | 8 | "\n", |
9 | | - "This notebook allows you to configure and run SFT training **without any YAML files**!\n", |
| 9 | + "Welcome to an interactive journey through **Supervised Fine-Tuning (SFT)** in Forge!\n", |
10 | 10 | "\n", |
11 | | - "## Benefits\n", |
| 11 | + "## What You'll Learn\n", |
12 | 12 | "\n", |
13 | | - "✅ No external YAML files needed \n", |
14 | | - "✅ Interactive configuration in separate cells \n", |
15 | | - "✅ Easy to modify and experiment \n", |
16 | | - "✅ All configuration visible in notebook \n", |
17 | | - "✅ Quick templates for common scenarios" |
| 13 | + "This notebook tells the complete story of how SFT training works:\n", |
| 14 | + "\n", |
| 15 | + "1. **🎭 The Actor Model** - Understanding TrainerActor\n", |
| 16 | + "2. **🔧 Setup Phase** - Loading models, data, and checkpoints\n", |
| 17 | + "3. **🏃 Training Loop** - Forward passes, backprop, optimization\n", |
| 18 | + "4. **📊 Validation** - Measuring progress on held-out data\n", |
| 19 | + "5. **🧹 Cleanup** - Saving checkpoints and releasing resources\n", |
| 20 | + "\n", |
| 21 | + "---\n", |
| 22 | + "\n", |
| 23 | + "## The Forge Actor Architecture\n", |
| 24 | + "\n", |
| 25 | + "### What is a TrainerActor?\n", |
| 26 | + "\n", |
| 27 | + "Think of a **TrainerActor** as the conductor of an orchestra:\n", |
| 28 | + "- 🎭 **Manages multiple processes** across GPUs or nodes\n", |
| 29 | + "- 🔧 **Controls the lifecycle** of training (setup → train → cleanup)\n", |
| 30 | + "- 📊 **Coordinates distributed training** with FSDP, tensor parallelism, etc.\n", |
| 31 | + "\n", |
| 32 | + "### The Training Journey\n", |
| 33 | + "\n", |
| 34 | + "```\n", |
| 35 | + "┌─────────────────────────────────────────┐\n", |
| 36 | + "│ 1. Configuration 📋 │ ← You define parameters\n", |
| 37 | + "│ (model, data, hyperparameters) │\n", |
| 38 | + "└──────────────┬──────────────────────────┘\n", |
| 39 | + " ↓\n", |
| 40 | + "┌─────────────────────────────────────────┐\n", |
| 41 | + "│ 2. Spawn Actor 🎭 │ ← Forge creates distributed processes\n", |
| 42 | + "│ (launch 8 GPU processes) │\n", |
| 43 | + "└──────────────┬──────────────────────────┘\n", |
| 44 | + " ↓\n", |
| 45 | + "┌─────────────────────────────────────────┐\n", |
| 46 | + "│ 3. Setup Phase 🔧 │ ← Load model, data, checkpoints\n", |
| 47 | + "│ - Initialize model with FSDP │\n", |
| 48 | + "│ - Load training dataset │\n", |
| 49 | + "│ - Load validation dataset │\n", |
| 50 | + "│ - Restore from checkpoint (if any) │\n", |
| 51 | + "└──────────────┬──────────────────────────┘\n", |
| 52 | + " ↓\n", |
| 53 | + "┌─────────────────────────────────────────┐\n", |
| 54 | + "│ 4. Training Loop 🔄 │ ← The main training process\n", |
| 55 | + "│ FOR each step: │\n", |
| 56 | + "│ → Get batch from dataloader │\n", |
| 57 | + "│ → Forward pass (compute loss) │\n", |
| 58 | + "│ → Backward pass (compute grads) │\n", |
| 59 | + "│ → Optimizer step (update weights) │\n", |
| 60 | + "│ → [Optional] Run validation │\n", |
| 61 | + "│ → [Optional] Save checkpoint │\n", |
| 62 | + "└──────────────┬──────────────────────────┘\n", |
| 63 | + " ↓\n", |
| 64 | + "┌─────────────────────────────────────────┐\n", |
| 65 | + "│ 5. Cleanup Phase 🧹 │ ← Save final state\n", |
| 66 | + "│ - Save final checkpoint │\n", |
| 67 | + "│ - Release GPU memory │\n", |
| 68 | + "│ - Stop all processes │\n", |
| 69 | + "└─────────────────────────────────────────┘\n", |
| 70 | + "```\n", |
| 71 | + "\n", |
| 72 | + "### Why This Architecture?\n", |
| 73 | + "\n", |
| 74 | + "✅ **Automatic Distribution** - Forge handles multi-GPU/multi-node complexity \n", |
| 75 | + "✅ **Fault Tolerance** - Checkpointing enables recovery from failures \n", |
| 76 | + "✅ **Flexibility** - Easy to switch between 1 GPU, 8 GPUs, or multiple nodes \n", |
| 77 | + "✅ **Production-Ready** - Used at Meta for large-scale training\n", |
| 78 | + "\n", |
| 79 | + "---\n", |
| 80 | + "\n", |
| 81 | + "Let's configure your training!" |
18 | 82 | ] |
19 | 83 | }, |
20 | 84 | { |
21 | 85 | "cell_type": "markdown", |
22 | 86 | "metadata": {}, |
23 | 87 | "source": [ |
24 | | - "## Step 1: Import Dependencies" |
| 88 | + "---\n", |
| 89 | + "\n", |
| 90 | + "# 📚 Part 1: Configuration\n", |
| 91 | + "\n", |
| 92 | + "## The Foundation - Defining Your Training\n", |
| 93 | + "\n", |
| 94 | + "Before we can train, we need to tell Forge:\n", |
| 95 | + "- **What model** to train (Llama3-8B, Qwen3-32B, etc.)\n", |
| 96 | + "- **What data** to use (datasets, batch sizes)\n", |
| 97 | + "- **How to train** (learning rate, optimizer, steps)\n", |
| 98 | + "- **Where to run** (GPUs, FSDP settings)\n", |
| 99 | + "\n", |
| 100 | + "Let's start by importing our tools..." |
| 101 | + ] |
| 102 | + }, |
| 103 | + { |
| 104 | + "cell_type": "markdown", |
| 105 | + "metadata": {}, |
| 106 | + "source": [ |
| 107 | + "## Import Dependencies\n", |
| 108 | + "\n", |
| 109 | + "These imports give us access to:\n", |
| 110 | + "- **OmegaConf**: Configuration management\n", |
| 111 | + "- **TrainerActor**: The main training orchestrator\n", |
| 112 | + "- **SpawnActor**: Helper for creating distributed actors" |
25 | 113 | ] |
26 | 114 | }, |
27 | 115 | { |
|
76 | 164 | "cell_type": "markdown", |
77 | 165 | "metadata": {}, |
78 | 166 | "source": [ |
79 | | - "## Step 2: Configure Model and Process Settings\n", |
| 167 | + "## Configure Model and Process Settings\n", |
80 | 168 | "\n", |
81 | 169 | "Define your model configuration and how many processes to use." |
82 | 170 | ] |
|
132 | 220 | "cell_type": "markdown", |
133 | 221 | "metadata": {}, |
134 | 222 | "source": [ |
135 | | - "## Step 3: Configure Optimizer and LR Scheduler" |
| 223 | + "## Configure Optimizer and LR Scheduler" |
136 | 224 | ] |
137 | 225 | }, |
138 | 226 | { |
|
184 | 272 | "cell_type": "markdown", |
185 | 273 | "metadata": {}, |
186 | 274 | "source": [ |
187 | | - "## Step 4: Configure Training Settings\n", |
| 275 | + "## Configure Training Settings\n", |
188 | 276 | "\n", |
189 | 277 | "**Key parameters to adjust for your experiment:**" |
190 | 278 | ] |
|
232 | 320 | "cell_type": "markdown", |
233 | 321 | "metadata": {}, |
234 | 322 | "source": [ |
235 | | - "## Step 5: Configure Parallelism Settings" |
| 323 | + "## Configure Parallelism Settings" |
236 | 324 | ] |
237 | 325 | }, |
238 | 326 | { |
|
280 | 368 | "cell_type": "markdown", |
281 | 369 | "metadata": {}, |
282 | 370 | "source": [ |
283 | | - "## Step 6: Configure Checkpoint and Activation Checkpointing" |
| 371 | + "## Configure Checkpoint and Activation Checkpointing" |
284 | 372 | ] |
285 | 373 | }, |
286 | 374 | { |
|
342 | 430 | "cell_type": "markdown", |
343 | 431 | "metadata": {}, |
344 | 432 | "source": [ |
345 | | - "## Step 7: Configure Communication Settings" |
| 433 | + "## Configure Communication Settings" |
346 | 434 | ] |
347 | 435 | }, |
348 | 436 | { |
|
379 | 467 | "cell_type": "markdown", |
380 | 468 | "metadata": {}, |
381 | 469 | "source": [ |
382 | | - "## Step 8: Combine All Configurations\n", |
| 470 | + "## Combine All Configurations\n", |
383 | 471 | "\n", |
384 | 472 | "Now let's merge everything into a complete configuration!" |
385 | 473 | ] |
|
475 | 563 | "cell_type": "markdown", |
476 | 564 | "metadata": {}, |
477 | 565 | "source": [ |
478 | | - "## Step 9: Run Training\n", |
479 | | - "\n", |
480 | | - "### Option A: Automatic Lifecycle Management (Recommended)\n", |
481 | | - "\n", |
482 | | - "Use `run_actor()` for automatic setup, training, and cleanup:" |
| 566 | + "## Part 2: Run Training\n" |
483 | 567 | ] |
484 | 568 | }, |
485 | 569 | { |
|
488 | 572 | "metadata": {}, |
489 | 573 | "outputs": [], |
490 | 574 | "source": [ |
491 | | - "# Run training with automatic lifecycle management\n", |
| 575 | + "\n", |
492 | 576 | "await run_actor(TrainerActor, cfg)" |
493 | 577 | ] |
494 | 578 | }, |
495 | 579 | { |
496 | 580 | "cell_type": "markdown", |
497 | 581 | "metadata": {}, |
498 | 582 | "source": [ |
499 | | - "## Alternative: Manual Lifecycle Control\n", |
| 583 | + "---\n", |
| 584 | + "\n", |
| 585 | + "# 🎭 Part 2: The Actor Lifecycle\n", |
| 586 | + "\n", |
| 587 | + "## Understanding Spawn, Setup, Train, and Cleanup\n", |
| 588 | + "\n", |
| 589 | + "### Phase 1: Spawn the Actor 🎭\n", |
| 590 | + "\n", |
| 591 | + "**What's happening:**\n", |
| 592 | + "- `SpawnActor` creates a launcher for `TrainerActor`\n", |
| 593 | + "- `spawn()` launches 8 Python processes (one per GPU)\n", |
| 594 | + "- Each process initializes:\n", |
| 595 | + " - CUDA device assignment (GPU 0, 1, 2, ...)\n", |
| 596 | + " - Distributed communication (NCCL)\n", |
| 597 | + " - Process group setup (RANK, LOCAL_RANK, WORLD_SIZE)\n", |
| 598 | + "\n", |
| 599 | + "**Behind the scenes:**\n", |
| 600 | + "```\n", |
| 601 | + "GPU 0: Process 0 (RANK=0, LOCAL_RANK=0)\n", |
| 602 | + "GPU 1: Process 1 (RANK=1, LOCAL_RANK=1)\n", |
| 603 | + "...\n", |
| 604 | + "GPU 7: Process 7 (RANK=7, LOCAL_RANK=7)\n", |
| 605 | + "```\n", |
| 606 | + "\n", |
| 607 | + "All processes are now waiting for instructions!\n", |
| 608 | + "### What Happens When You Run This?\n", |
500 | 609 | "\n", |
501 | | - "For more control, manage each phase separately.\n", |
| 610 | + "1. **Spawn** 🎭: Forge creates 8 GPU processes (based on `procs: 8`)\n", |
| 611 | + "2. **Setup** 🔧: Each process loads its shard of the model + data\n", |
| 612 | + "3. **Train** 🏃: Training loop runs for 1000 steps\n", |
| 613 | + "4. **Cleanup** 🧹: Final checkpoint saved, resources released\n", |
502 | 614 | "\n", |
503 | | - "### Create and Spawn the Actor" |
| 615 | + "Uncomment the line below to start training!" |
504 | 616 | ] |
505 | 617 | }, |
506 | 618 | { |
|