Commit 293d659

merge notebook updates from Jamie
Signed-off-by: h-guo18 <[email protected]>
1 parent e7a98f7 commit 293d659

File tree

1 file changed: +176 −7 lines changed

examples/speculative_decoding/example.ipynb

Lines changed: 176 additions & 7 deletions
@@ -21,8 +21,8 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Convert Model\n",
-    "Let's load the base model and convert it to EAGLE3 Model"
+    "## Convert Model for Speculative Decoding\n",
+    "Here, we'll adapt our base model for speculative decoding by attaching a smaller EAGLE head. The upcoming code first loads meta-llama/Llama-3.2-1B as our base model and then configures the new draft head. To ensure compatibility, the draft head's dimensions must match those of the target model. Finally, the modelopt toolkit attaches this new, untrained head, leaving us with a combined model that is ready for the training phase."
    ]
   },
   {
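
For context, the conversion step this cell describes (the code cell itself is unchanged in this commit) might look roughly like the sketch below. It assumes the `modelopt.torch.speculative` `convert()` entry point with an "eagle" mode; the config keys shown are illustrative placeholders, and the notebook's actual cell remains the authoritative version.

```python
# A minimal sketch of the conversion described above, not part of this diff.
# Assumes modelopt.torch.speculative exposes convert() with an "eagle" mode;
# the config keys below are illustrative placeholders.
import modelopt.torch.speculative as mtsp
from transformers import AutoModelForCausalLM

base_model = "meta-llama/Llama-3.2-1B"
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype="bfloat16")

# Hypothetical EAGLE draft-head config; its dimensions must match the target model.
eagle_config = {
    "eagle_architecture_config": {
        "hidden_size": model.config.hidden_size,
        "vocab_size": model.config.vocab_size,
    }
}

# Attach the new, untrained draft head to the target model.
mtsp.convert(model, [("eagle", eagle_config)])
```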
@@ -76,8 +76,8 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Train Draft Model On Daring-Anteater\n",
-    "Then we can start training the eagle model with HF trainer."
+    "## Train Draft Head On Daring-Anteater\n",
+    "We will fine-tune the draft head on the Daring-Anteater dataset using the standard Hugging Face Trainer. Only the draft head's weights are updated during this process; the original target model remains unchanged. After training, our speculative decoding model will be ready for export and deployment. Note that training time depends heavily on the number of epochs (default=4) and the hardware being used."
    ]
   },
   {
@@ -106,7 +106,7 @@
     "\n",
     "training_args = TrainingArguments(\n",
     "    output_dir=\"/tmp/eagle_bf16\",\n",
-    "    num_train_epochs=2,\n",
+    "    num_train_epochs=4,\n",
     "    per_device_train_batch_size=1,\n",
     "    per_device_eval_batch_size=1,\n",
     ")\n",
@@ -156,7 +156,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Deployment\n",
+    "## Deploying on TensorRT-LLM\n",
     "\n",
     "Here we show an example to deploy on TRT-LLM with `trtllm-serve` and [TRT-LLM container](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release). See [Deployment](https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/speculative_decoding#deployment) section for more info. \n",
     "\n",
@@ -288,7 +288,7 @@
     "    \"model\": base_model,\n",
     "    \"messages\": [\n",
     "        {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n",
-    "        {\"role\": \"user\", \"content\": \"Hi, write me a story about a cat\"},\n",
+    "        {\"role\": \"user\", \"content\": \"Tell me about speculative decoding.\"},\n",
     "    ],\n",
     "    \"max_tokens\": 512,\n",
     "    \"temperature\": 0,\n",
@@ -319,6 +319,175 @@
    "source": [
     "!docker rm -f trtllm_serve_spec"
    ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Deploying on SGLang\n",
+    "Here, we deploy our trained model using SGLang. The following code defines the command needed to run the SGLang server with our speculative decoding configuration."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# SGLang server launch command shell script\n",
+    "sglang_serve_script = f\"\"\"python3 -m sglang.launch_server \\\\\n",
+    "    --model {base_model} \\\\\n",
+    "    --host 0.0.0.0 \\\\\n",
+    "    --port 30000 \\\\\n",
+    "    --speculative-algorithm EAGLE3 \\\\\n",
+    "    --speculative-eagle-topk 8 \\\\\n",
+    "    --speculative-draft-model-path /tmp/hf_ckpt \\\\\n",
+    "    --speculative-num-draft-tokens 3 \\\\\n",
+    "    --speculative-num-steps 3 \\\\\n",
+    "    --mem-fraction 0.6 \\\\\n",
+    "    --cuda-graph-max-bs 2 \\\\\n",
+    "    --dtype float16\n",
+    "\"\"\"\n",
+    "\n",
+    "with open(\"/tmp/sglang_serve.sh\", \"w\") as f:\n",
+    "    f.write(sglang_serve_script)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Launch the SGLang server inside a Docker container as a background process."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "import subprocess\n",
+    "import threading\n",
+    "\n",
+    "container_name = \"sglang_serve_spec\"\n",
+    "home_dir = os.path.expanduser(\"~\")\n",
+    "hf_cache_dir = os.path.join(home_dir, \".cache\", \"huggingface\")\n",
+    "\n",
+    "# Ensure the Hugging Face cache directory exists. It should already exist at ~/.cache/huggingface from when the model files for meta-llama/Llama-3.2-1B were downloaded earlier.\n",
+    "os.makedirs(hf_cache_dir, exist_ok=True)\n",
+    "\n",
+    "docker_cmd = [\n",
+    "    \"docker\",\n",
+    "    \"run\",\n",
+    "    \"--rm\",\n",
+    "    \"--net\",\n",
+    "    \"host\",\n",
+    "    \"--shm-size=32g\",\n",
+    "    \"--gpus\",\n",
+    "    \"all\",\n",
+    "    \"-v\",\n",
+    "    f\"{hf_cache_dir}:/root/.cache/huggingface\",\n",
+    "    \"-v\",\n",
+    "    \"/tmp:/tmp\",\n",
+    "    \"--ipc=host\",\n",
+    "    \"--name\",\n",
+    "    container_name,\n",
+    "    \"lmsysorg/sglang:latest\",\n",
+    "    \"bash\",\n",
+    "    \"-c\",\n",
+    "    \"bash /tmp/sglang_serve.sh\",\n",
+    "]\n",
+    "\n",
+    "# Launch the Docker container\n",
+    "proc = subprocess.Popen(\n",
+    "    docker_cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True, bufsize=1\n",
+    ")\n",
+    "\n",
+    "\n",
+    "# Stream the process output\n",
+    "def stream_output(pipe):\n",
+    "    for line in iter(pipe.readline, \"\"):\n",
+    "        print(line, end=\"\")\n",
+    "\n",
+    "\n",
+    "# Use a thread to stream the output without blocking the notebook\n",
+    "thread = threading.Thread(target=stream_output, args=(proc.stdout,))\n",
+    "thread.daemon = True\n",
+    "thread.start()\n",
+    "\n",
+    "print(\n",
+    "    f\"Starting SGLang server in Docker (PID: {proc.pid}, container name: {container_name}) in the background:\"\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As with TRT-LLM, please wait for the service to fully start inside the container. \n",
+    "Once you see the message `INFO: Application startup complete.`, you can proceed to send requests to the service:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import json\n",
+    "\n",
+    "import requests\n",
+    "\n",
+    "payload = {\n",
+    "    \"model\": base_model,\n",
+    "    \"messages\": [\n",
+    "        {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n",
+    "        {\"role\": \"user\", \"content\": \"Tell me about speculative decoding.\"},\n",
+    "    ],\n",
+    "    \"max_tokens\": 512,\n",
+    "    \"temperature\": 0,\n",
+    "}\n",
+    "headers = {\"Content-Type\": \"application/json\", \"Accept\": \"application/json\"}\n",
+    "\n",
+    "# Send request to the SGLang server\n",
+    "response = requests.post(\n",
+    "    \"http://localhost:30000/v1/chat/completions\", headers=headers, data=json.dumps(payload)\n",
+    ")\n",
+    "output = response.json()\n",
+    "\n",
+    "print(output)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Clean up the container."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!docker rm -f sglang_serve_spec"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Deploying on vLLM (Coming Soon)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "While vLLM is another popular, high-performance inference server, direct support for speculative decoding with this demo notebook is still under active development. This notebook will be updated once deployment is possible."
+   ]
   }
  ],
  "metadata": {
