|
12 | 12 | "\n", |
13 | 13 | "This notebook tells the complete story of how SFT training works:\n", |
14 | 14 | "\n", |
15 | | - "1. **🎭 The Actor Model** - Understanding TrainerActor\n", |
| 15 | + "1. **🎭 The Actor Model** - Understanding TrainerActor (built on Monarch)\n", |
16 | 16 | "2. **🔧 Setup Phase** - Loading models, data, and checkpoints\n", |
17 | 17 | "3. **🏃 Training Loop** - Forward passes, backprop, optimization\n", |
18 | 18 | "5. **🧹 Cleanup** - Saving checkpoints and releasing resources\n", |
|
21 | 21 | "\n", |
22 | 22 | "## The Forge Actor Architecture\n", |
23 | 23 | "\n", |
| 24 | + "### What is Monarch?\n", |
| 25 | + "\n", |
| 26 | + "**Monarch** is Meta's distributed actor framework that powers Forge:\n", |
| 27 | + "- 🌐 **Distributed by design** - Built for multi-node, multi-GPU training\n", |
| 28 | + "- 🎭 **Actor model** - Encapsulates distributed processes as actors\n", |
| 29 | + "- 📡 **Remote communication** - Seamless RPC between actors\n", |
| 30 | + "- 🔧 **Lifecycle management** - Spawn → Setup → Run → Cleanup pattern\n", |
| 31 | + "\n", |
| 32 | + "Forge leverages Monarch to abstract away distributed training complexity!\n", |
| 33 | + "\n", |
| 34 | + "For more information on Monarch, visit https://github.com/meta-pytorch/monarch/tree/main/docs\n", |
| 35 | + "\n", |
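| | + "To make the actor pattern concrete, here is a minimal sketch modeled on the Counter example in Monarch's docs (the exact API - `proc_mesh`, `spawn`, `.call()` - is an assumption and may vary across Monarch versions):\n", |
| | + "\n", |
| | + "```python\n", |
| | + "from monarch.actor import Actor, endpoint, proc_mesh\n", |
| | + "\n", |
| | + "class Counter(Actor):\n", |
| | + "    def __init__(self, start: int):\n", |
| | + "        self.value = start\n", |
| | + "\n", |
| | + "    @endpoint\n", |
| | + "    async def incr(self) -> None:\n", |
| | + "        self.value += 1\n", |
| | + "\n", |
| | + "# Inside an async main(): spawn one actor per process, then fan out a call\n", |
| | + "procs = await proc_mesh(gpus=8)                       # 8 processes, 1 per GPU\n", |
| | + "counters = await procs.spawn(\"counters\", Counter, 0)\n", |
| | + "await counters.incr.call()                            # runs on every instance\n", |
| | + "```\n", |
| | + "\n", |
| | + "`TrainerActor` follows this same pattern, with `setup()`, `train()`, and `cleanup()` as its endpoints.\n", |
| | + "\n", |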
24 | 36 | "### What is a TrainerActor?\n", |
25 | 37 | "\n", |
26 | | - "Think of a **TrainerActor** as the conductor of an orchestra:\n", |
| 38 | + "A **TrainerActor** is Forge's Monarch actor for training:\n", |
27 | 39 | "- 🎭 **Manages multiple processes** across GPUs or nodes\n", |
28 | | - "- 🔧 **Controls the lifecycle** of training (setup → train → cleanup)\n", |
| 40 | + "- 🔧 **Controls the lifecycle** using Monarch's actor pattern\n", |
29 | 41 | "- 📊 **Coordinates distributed training** with FSDP, tensor parallelism, etc.\n", |
30 | 42 | "\n", |
31 | | - "### The Training Journey\n", |
| 43 | + "Think of it as the conductor of an orchestra - coordinating 8 GPU processes working together!\n", |
| 44 | + "\n", |
| 45 | + "### The Training Journey (Monarch Actor Lifecycle)\n", |
32 | 46 | "\n", |
33 | 47 | "```\n", |
34 | 48 | "┌─────────────────────────────────────────┐\n", |
|
37 | 51 | "└──────────────┬──────────────────────────┘\n", |
38 | 52 | " ↓\n", |
39 | 53 | "┌─────────────────────────────────────────┐\n", |
40 | | - "│ 2. Spawn Actor 🎭 │ ← Forge creates distributed processes\n", |
| 54 | + "│ 2. Spawn Actor 🎭 [MONARCH] │ ← Monarch creates distributed processes\n", |
41 | 55 | "│ (launch 8 GPU processes) │\n", |
42 | 56 | "└──────────────┬──────────────────────────┘\n", |
43 | 57 | " ↓\n", |
44 | 58 | "┌─────────────────────────────────────────┐\n", |
45 | | - "│ 3. Setup Phase 🔧 │ ← Load model, data, checkpoints\n", |
| 59 | + "│ 3. Setup Phase 🔧 [MONARCH] │ ← Actor.setup() endpoint\n", |
46 | 60 | "│ - Initialize model with FSDP │\n", |
47 | | - "│ - Load training dataset │ │\n", |
| 61 | + "│ - Load training dataset │\n", |
48 | 62 | "│ - Restore from checkpoint (if any) │\n", |
49 | 63 | "└──────────────┬──────────────────────────┘\n", |
50 | 64 | " ↓\n", |
51 | 65 | "┌─────────────────────────────────────────┐\n", |
52 | | - "│ 4. Training Loop 🔄 │ ← The main training process\n", |
| 66 | + "│ 4. Training Loop 🔄 [MONARCH] │ ← Actor.train() endpoint\n", |
53 | 67 | "│ FOR each step: │\n", |
54 | 68 | "│ → Get batch from dataloader │\n", |
55 | 69 | "│ → Forward pass (compute loss) │\n", |
|
60 | 74 | "└──────────────┬──────────────────────────┘\n", |
61 | 75 | " ↓\n", |
62 | 76 | "┌─────────────────────────────────────────┐\n", |
63 | | - "│ 5. Cleanup Phase 🧹 │ ← Save final state\n", |
| 77 | + "│ 5. Cleanup Phase 🧹 [MONARCH] │ ← Actor.cleanup() endpoint\n", |
64 | 78 | "│ - Save final checkpoint │\n", |
65 | 79 | "│ - Release GPU memory │\n", |
66 | 80 | "│ - Stop all processes │\n", |
|
69 | 83 | "\n", |
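| | + "In driver code, this whole journey collapses to a handful of actor calls. A hedged sketch (the spawn signature and endpoint names follow this notebook's pseudocode, not a verbatim Forge API):\n", |
| | + "\n", |
| | + "```python\n", |
| | + "# Sketch of the lifecycle: spawn -> setup -> train -> cleanup\n", |
| | + "procs = await proc_mesh(gpus=8)                     # 1-2. launch 8 GPU processes\n", |
| | + "trainer = await procs.spawn(\"trainer\", TrainerActor, cfg)\n", |
| | + "await trainer.setup.call()                          # 3. load model/data/checkpoint\n", |
| | + "await trainer.train.call()                          # 4. run the training loop\n", |
| | + "await trainer.cleanup.call()                        # 5. save and release resources\n", |
| | + "```\n", |
| | + "\n", |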
70 | 84 | "### Why This Architecture?\n", |
71 | 85 | "\n", |
72 | | - "✅ **Automatic Distribution** - Forge handles multi-GPU/multi-node complexity \n", |
| 86 | + "✅ **Automatic Distribution** - Monarch handles multi-GPU/multi-node complexity \n", |
73 | 87 | "✅ **Fault Tolerance** - Checkpointing enables recovery from failures \n", |
74 | 88 | "✅ **Flexibility** - Easy to switch between 1 GPU, 8 GPUs, or multiple nodes \n", |
75 | | - "✅ **Production-Ready** - Used at Meta for large-scale training\n", |
| 89 | + "✅ **Production-Ready** - Used at Meta for large-scale training \n", |
| 90 | + "✅ **Actor Pattern** - Clean separation of concerns with lifecycle methods\n", |
76 | 91 | "\n", |
| 92 | + "#### For more information regarding Forge visit: https://github.com/meta-pytorch/torchforge/tree/main/docs\n", |
77 | 93 | "---\n", |
78 | 94 | "\n", |
79 | 95 | "Let's configure your training!" |
|
502 | 518 | "cell_type": "markdown", |
503 | 519 | "metadata": {}, |
504 | 520 | "source": [ |
505 | | - "### Phase 2: Setup 🔧\n", |
| 521 | + "### Phase 2: Setup 🔧 [Monarch Endpoint]\n", |
506 | 522 | "\n", |
507 | 523 | "**What's happening:**\n", |
508 | | - "- **Model Loading**: Each process loads its shard of the model\n", |
509 | | - " - With FSDP, GPU 0 might get layers 0-10\n", |
| 524 | + "\n", |
| 525 | + "Monarch calls the `@endpoint` decorated `setup()` method on all 8 actor instances:\n", |
| 526 | + "\n", |
| 527 | + "```python\n", |
| 528 | + "class TrainerActor:\n", |
| 529 | + " @endpoint\n", |
| 530 | + " async def setup(self):\n", |
| 531 | + " # This runs on all 8 GPUs simultaneously\n", |
| 532 | + " ...\n", |
| 533 | + "```\n", |
| 534 | + "\n", |
| 535 | + "Each actor instance:\n", |
| 536 | + "- **Loads its shard of the model**: With FSDP, each GPU only loads ~1/8th\n", |
| 537 | + " - GPU 0 might get layers 0-10\n", |
510 | 538 | " - GPU 1 gets layers 11-20, etc.\n", |
511 | | - " - Each GPU only holds ~1/8th of the full model\n", |
512 | | - "- **Dataset Loading**: Training and validation dataloaders created\n", |
513 | | - " - Same dataset, but different random seeds per GPU\n", |
514 | | - " - Ensures each GPU sees different data\n", |
515 | | - "- **Checkpoint Loading**: If resuming, restore training state\n", |
516 | | - " - Model weights, optimizer state, current step number\n", |
| 539 | + "- **Creates dataloaders**: Same dataset, different random seeds per GPU\n", |
| 540 | + "- **Restores checkpoint**: If resuming, loads saved state\n", |
517 | 541 | "\n", |
518 | 542 | "**What `setup()` does internally:**\n", |
519 | 543 | "```python\n", |
520 | | - "def setup(self):\n", |
521 | | - " # 1. Initialize model with FSDP\n", |
| 544 | + "@endpoint\n", |
| 545 | + "async def setup(self):\n", |
| 546 | + " # 1. Initialize model with FSDP sharding\n", |
522 | 547 | " self.model = load_model_with_fsdp(cfg.model)\n", |
523 | 548 | " \n", |
524 | 549 | " # 2. Create training dataloader\n", |
|
537 | 562 | " self.checkpointer.load(step=self.current_step)\n", |
538 | 563 | "```\n", |
539 | 564 | "\n", |
540 | | - "After setup, all 8 GPUs are synchronized and ready to train!" |
| 565 | + "**Monarch magic:**\n", |
| 566 | + "- The `@endpoint` decorator makes this method callable remotely\n", |
| 567 | + "- Monarch ensures all 8 actors complete setup before proceeding\n", |
| 568 | + "- Distributed state (model shards) automatically synchronized\n", |
| 569 | + "\n", |
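| | + "From the driver's perspective, that fan-out is a single awaited call (hedged sketch; `trainer` is the actor mesh spawned earlier):\n", |
| | + "\n", |
| | + "```python\n", |
| | + "# Invokes setup() on all 8 actors and blocks until every one has finished\n", |
| | + "await trainer.setup.call()\n", |
| | + "```\n", |
| | + "\n", |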
| 570 | + "After setup, all 8 GPU actors are synchronized and ready to train!" |
541 | 571 | ] |
542 | 572 | }, |
543 | 573 | { |
544 | 574 | "cell_type": "code", |
545 | 575 | "execution_count": null, |
546 | | - "metadata": { |
547 | | - "output": { |
548 | | - "id": 693658349895675, |
549 | | - "loadingStatus": "loaded" |
550 | | - } |
551 | | - }, |
| 576 | + "metadata": {}, |
552 | 577 | "outputs": [], |
553 | 578 | "source": [ |
554 | 579 | "# Setup (load data, checkpoints, etc.)\n", |
|
560 | 585 | "cell_type": "markdown", |
561 | 586 | "metadata": {}, |
562 | 587 | "source": [ |
563 | | - "### Phase 3: Training Loop 🔄\n", |
| 588 | + "### Phase 3: Training Loop 🔄 [Monarch Endpoint]\n", |
564 | 589 | "\n", |
565 | 590 | "**What's happening:**\n", |
566 | 591 | "\n", |
567 | | - "The training loop runs for `cfg.training.steps` iterations. Each step:\n", |
| 592 | + "Monarch calls the `@endpoint` decorated `train()` method, which runs the training loop for `cfg.training.steps` iterations. Each step:\n", |
568 | 593 | "\n", |
569 | 594 | "```python\n", |
570 | | - "for step in range(current_step, max_steps):\n", |
571 | | - " # 1. Get next batch from dataloader\n", |
572 | | - " batch = next(train_dataloader)\n", |
573 | | - " # Shape: [batch_size, seq_len] per GPU\n", |
574 | | - " \n", |
575 | | - " # 2. Forward pass - compute predictions and loss\n", |
576 | | - " outputs = model(batch['input_ids'])\n", |
577 | | - " loss = compute_loss(outputs, batch['labels'])\n", |
578 | | - " \n", |
579 | | - " # 3. Backward pass - compute gradients\n", |
580 | | - " loss.backward()\n", |
581 | | - " # FSDP automatically synchronizes gradients across all GPUs!\n", |
582 | | - " \n", |
583 | | - " # 4. Optimizer step - update model weights\n", |
584 | | - " optimizer.step()\n", |
585 | | - " optimizer.zero_grad()\n", |
586 | | - " \n", |
587 | | - " # 5. Periodic validation (if enabled)\n", |
588 | | - " if validation_enabled and step % eval_interval == 0:\n", |
589 | | - " val_metrics = evaluate()\n", |
590 | | - " log(f\"Step {step}: Val Loss = {val_metrics['val_loss']}\")\n", |
591 | | - " \n", |
592 | | - " # 6. Periodic checkpointing\n", |
593 | | - " if step % checkpoint_interval == 0:\n", |
594 | | - " save_checkpoint(step)\n", |
| 595 | + "@endpoint\n", |
| 596 | + "async def train(self):\n", |
| 597 | + " for step in range(current_step, max_steps):\n", |
| 598 | + " # 1. Get next batch from dataloader\n", |
| 599 | + " batch = next(train_dataloader)\n", |
| 600 | + " # Shape: [batch_size, seq_len] per GPU\n", |
| 601 | + "\n", |
| 602 | + " # 2. Forward pass - compute predictions and loss\n", |
| 603 | + " outputs = model(batch['input_ids'])\n", |
| 604 | + " loss = compute_loss(outputs, batch['labels'])\n", |
| 605 | + "\n", |
| 606 | + " # 3. Backward pass - compute gradients\n", |
| 607 | + " loss.backward()\n", |
| 608 | + " # FSDP automatically synchronizes gradients across all GPUs!\n", |
| 609 | + "\n", |
| 610 | + " # 4. Optimizer step - update model weights\n", |
| 611 | + " optimizer.step()\n", |
| 612 | + " optimizer.zero_grad()\n", |
| 613 | + "\n", |
| 614 | + " # 5. Periodic validation (if enabled)\n", |
| 615 | + " if validation_enabled and step % eval_interval == 0:\n", |
| 616 | + " val_metrics = evaluate()\n", |
| 617 | + " log(f\"Step {step}: Val Loss = {val_metrics['val_loss']}\")\n", |
| 618 | + "\n", |
| 619 | + " # 6. Periodic checkpointing\n", |
| 620 | + " if step % checkpoint_interval == 0:\n", |
| 621 | + " save_checkpoint(step)\n", |
595 | 622 | "```\n", |
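| | + "\n", |
| | + "The gradient synchronization in step 3 comes from the FSDP wrapper, not from the loop itself. A minimal stand-alone sketch (`my_model`, `batch`, and `compute_loss` are placeholders; assumes `torchrun` launched one process per GPU):\n", |
| | + "\n", |
| | + "```python\n", |
| | + "import torch.distributed as dist\n", |
| | + "from torch.distributed.fsdp import FullyShardedDataParallel as FSDP\n", |
| | + "\n", |
| | + "dist.init_process_group(\"nccl\")           # torchrun supplies rank/world size\n", |
| | + "model = FSDP(my_model.cuda())             # parameters sharded across ranks\n", |
| | + "outputs = model(batch[\"input_ids\"])       # forward all-gathers shards on demand\n", |
| | + "loss = compute_loss(outputs, batch[\"labels\"])\n", |
| | + "loss.backward()                           # grads reduce-scattered across GPUs\n", |
| | + "```\n", |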
596 | 623 | "\n", |
597 | 624 | "**Key insights:**\n", |
|
604 | 631 | "- Training loss decreasing over time\n", |
605 | 632 | "- Periodic validation metrics (if enabled)\n", |
606 | 633 | "- Checkpoint saves at regular intervals\n", |
607 | | - "- Step timing information (seconds per step)" |
| 634 | + "- Step timing information (seconds per step)\n", |
| 635 | + "\n", |
| 636 | + "**Monarch magic:**\n", |
| 637 | + "- The `@endpoint` decorator makes this long-running training loop remotely callable\n", |
| 638 | + "- All 8 actor instances run training in sync\n", |
| 639 | + "- Monarch handles any RPC timeouts for long-running operations" |
608 | 640 | ] |
609 | 641 | }, |
610 | 642 | { |
611 | 643 | "cell_type": "code", |
612 | 644 | "execution_count": null, |
613 | | - "metadata": { |
614 | | - "output": { |
615 | | - "id": 4257826794454822, |
616 | | - "loadingStatus": "loaded" |
617 | | - } |
618 | | - }, |
| 645 | + "metadata": {}, |
619 | 646 | "outputs": [], |
620 | 647 | "source": [ |
621 | 648 | "# Run training\n", |
|