diff --git a/usage-cookbook/Nemotron-3-Super/README.md b/usage-cookbook/Nemotron-3-Super/README.md index 10a6c358..8763d5cb 100644 --- a/usage-cookbook/Nemotron-3-Super/README.md +++ b/usage-cookbook/Nemotron-3-Super/README.md @@ -13,6 +13,7 @@ These notebooks provide end-to-end recipes for deploying and customizing Nemotro - **[vllm_cookbook.ipynb](vllm_cookbook.ipynb)** — Deploy Nemotron-3-Super with vLLM. - **[sglang_cookbook.ipynb](sglang_cookbook.ipynb)** — Deploy Nemotron-3-Super with SGLang. - **[trtllm_cookbook.ipynb](trtllm_cookbook.ipynb)** — Deploy Nemotron-3-Super with TensorRT-LLM. +- **[batch_throughput_cookbook.ipynb](batch_throughput_cookbook.ipynb)** — High-throughput batch processing with vLLM: offline batch inference, concurrent server requests, bulk document classification, and throughput benchmarking. - **{doc}`AdvancedDeploymentGuide `** — Production deployment configurations for vLLM, SGLang, and TRT-LLM across GPU topologies (GB200, B200, DGX Spark), including MTP speculative decoding, expert parallelism, and tuning guidance. ### Fine-Tuning diff --git a/usage-cookbook/Nemotron-3-Super/batch_throughput_cookbook.ipynb b/usage-cookbook/Nemotron-3-Super/batch_throughput_cookbook.ipynb new file mode 100644 index 00000000..68c24941 --- /dev/null +++ b/usage-cookbook/Nemotron-3-Super/batch_throughput_cookbook.ipynb @@ -0,0 +1,755 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# High-Throughput Batch Processing with Nemotron 3 Super\n", + "\n", + "This notebook demonstrates how to run **high-throughput batch inference** with `nvidia/NVIDIA-Nemotron-3-Super-120B-A12B` using [vLLM](https://docs.vllm.ai).\n", + "\n", + "Nemotron 3 Super achieves 5× the throughput of the previous Nemotron Super and 7.5× that of Qwen3.5-122B on long-output workloads. 
This cookbook shows how to unlock that throughput for real batch workloads like document classification, summarization, and data extraction.\n", + "\n", + "**What you'll learn:**\n", + "1. Configuring vLLM for maximum throughput (CUTLASS backend, EP, batch sizing)\n", + "2. Offline batch inference with vLLM's `LLM` class\n", + "3. Concurrent server requests with async Python\n", + "4. Practical use case: bulk document classification with structured JSON output\n", + "5. Measuring and comparing throughput across configurations\n", + "\n", + "For model details, see the [model card](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16). For latency-optimized single-request usage, see [vllm_cookbook.ipynb](./vllm_cookbook.ipynb).\n", + "\n", + "Prerequisites:\n", + "- NVIDIA GPU with recent drivers (>= 264 GB VRAM for BF16, >= 160 GB VRAM for FP8, >= 80 GB VRAM for NVFP4) and CUDA 12.x.\n", + "- Python 3.10+" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Install dependencies" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# If pip is not found\n", + "!python -m ensurepip --default-pip" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%pip install -U vllm==0.17.1 torch==2.10.0 flashinfer-python==0.6.4 flashinfer-cubin==0.6.4 'nvidia-cutlass-dsl>=4.4.0.dev1' openai httpx --extra-index-url https://download.pytorch.org/whl/cu128" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Verify GPU\n", + "\n", + "Confirm CUDA is available and your GPU is visible to PyTorch."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# GPU environment check\n", + "import torch\n", + "print(f\"CUDA available: {torch.cuda.is_available()}\")\n", + "print(f\"Num GPUs: {torch.cuda.device_count()}\")\n", + "if torch.cuda.is_available():\n", + " for i in range(torch.cuda.device_count()):\n", + " print(f\"GPU[{i}]: {torch.cuda.get_device_name(i)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Part 1: Throughput-Optimized Server Configuration\n", + "\n", + "vLLM supports two MoE backend modes for Nemotron 3 Super:\n", + "\n", + "| Mode | Env Var | Backend | Best For |\n", + "|------|---------|---------|----------|\n", + "| **Latency** | `VLLM_FLASHINFER_MOE_BACKEND=latency` | TRT-LLM Gen kernels | Online serving, interactive chat |\n", + "| **Throughput** | `VLLM_FLASHINFER_MOE_BACKEND=throughput` | CUTLASS | Offline batch jobs, bulk processing |\n", + "\n", + "The `throughput` backend trades per-request latency for higher aggregate tokens/sec across many concurrent requests.\n", + "\n", + "### Latency-optimized (default from vllm_cookbook)\n", + "\n", + "```bash\n", + "VLLM_FLASHINFER_MOE_BACKEND=latency \\\n", + "vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \\\n", + " --tensor-parallel-size 4 \\\n", + " --trust-remote-code \\\n", + " --gpu-memory-utilization 0.9 \\\n", + " --enable-expert-parallel\n", + "```\n", + "\n", + "### Throughput-optimized (for batch workloads)\n", + "\n", + "```bash\n", + "VLLM_FLASHINFER_MOE_BACKEND=throughput \\\n", + "VLLM_USE_FLASHINFER_MOE_FP8=1 \\\n", + "vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \\\n", + " --tensor-parallel-size 4 \\\n", + " --trust-remote-code \\\n", + " --gpu-memory-utilization 0.95 \\\n", + " --enable-expert-parallel \\\n", + " --max-num-seqs 256 \\\n", + " --disable-log-requests\n", + "```\n", + "\n", + "Key differences for throughput:\n", + "- 
`VLLM_FLASHINFER_MOE_BACKEND=throughput` selects CUTLASS kernels optimized for batch processing\n", + "- `--gpu-memory-utilization 0.95` allocates more memory for KV cache (more concurrent sequences)\n", + "- `--max-num-seqs 256` allows more concurrent sequences in a batch\n", + "- `--disable-log-requests` reduces logging overhead during high-volume processing\n", + "\n", + "See the [Advanced Deployment Guide](./AdvancedDeploymentGuide/README.md) for GPU-specific configurations (GB200, B200, DGX Spark)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Part 2: Offline Batch Inference\n", + "\n", + "For bulk processing without running a server, use vLLM's `LLM` class directly.\n", + "This is the simplest way to process a large dataset: load the model once, pass all prompts, collect results." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from vllm import LLM, SamplingParams\n", + "from transformers import AutoTokenizer\n", + "import time\n", + "\n", + "# Choose your precision/GPU configuration\n", + "model_id = \"nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16\"\n", + "# model_id = \"nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8\"\n", + "# model_id = \"nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4\"\n", + "\n", + "llm = LLM(\n", + " model=model_id,\n", + " trust_remote_code=True,\n", + " dtype=\"auto\",\n", + " tensor_parallel_size=4, # TP=4 for BF16, TP=2 for FP8, TP=1 for NVFP4\n", + " kv_cache_dtype=\"fp8\",\n", + " gpu_memory_utilization=0.95,\n", + ")\n", + "\n", + "tokenizer = AutoTokenizer.from_pretrained(model_id)\n", + "print(\"Model ready for batch inference\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Prepare a batch of prompts\n", + "\n", + "We'll generate a sample dataset of 50 documents to classify. In practice, this would be your CSV, JSONL, or database rows." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Sample documents for classification\n", + "sample_documents = [\n", + " \"A new CRISPR-based gene therapy shows promise in treating sickle cell disease in clinical trials.\",\n", + " \"Tesla stock surged 8% after reporting record quarterly deliveries of 500,000 vehicles.\",\n", + " \"Researchers at MIT developed a neural network that can predict protein folding 100x faster.\",\n", + " \"The Federal Reserve held interest rates steady, signaling a wait-and-see approach to inflation.\",\n", + " \"SpaceX successfully launched its 50th Starlink mission this year, deploying 60 satellites.\",\n", + " \"A longitudinal study found that Mediterranean diets reduce cardiovascular risk by 30%.\",\n", + " \"NVIDIA announced its next-generation Blackwell GPU architecture at GTC 2026.\",\n", + " \"The European Central Bank cut rates by 25 basis points to stimulate economic growth.\",\n", + " \"Google DeepMind published a breakthrough in mathematical theorem proving using LLMs.\",\n", + " \"Apple's Vision Pro headset sold 1 million units in its first quarter of availability.\",\n", + " \"New evidence suggests that gut microbiome diversity is linked to cognitive performance.\",\n", + " \"Amazon Web Services reported 35% year-over-year growth in cloud computing revenue.\",\n", + " \"A team at Stanford created a robotic system that can perform autonomous surgery.\",\n", + " \"Bitcoin reached a new all-time high above $150,000 amid institutional adoption.\",\n", + " \"Moderna announced a combined flu-COVID vaccine with 90% efficacy in Phase 3 trials.\",\n", + " \"Meta released its latest open-source model, achieving state-of-the-art on coding benchmarks.\",\n", + " \"Climate scientists reported that 2025 was the hottest year in recorded history.\",\n", + " \"Toyota unveiled a solid-state battery EV with 900-mile range for 2027 production.\",\n", + " \"A new antibiotic 
compound was discovered that kills drug-resistant bacteria in mice.\",\n", + " \"Microsoft Azure announced native support for NVIDIA NIM microservices.\",\n", + "]\n", + "\n", + "# Duplicate to create a larger batch (adjust multiplier for your workload)\n", + "documents = sample_documents * 5 # 100 documents\n", + "print(f\"Total documents to classify: {len(documents)}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Build classification prompts\n", + "SYSTEM_PROMPT = \"\"\"You are a document classifier. For each document, output exactly one JSON object with these fields:\n", + "- \"category\": one of [\"Science\", \"Technology\", \"Finance\", \"Health\", \"Business\"]\n", + "- \"confidence\": a float between 0 and 1\n", + "- \"summary\": a one-sentence summary (max 20 words)\n", + "\n", + "Output ONLY the JSON object, no other text.\"\"\"\n", + "\n", + "prompts = [\n", + " tokenizer.apply_chat_template(\n", + " [\n", + " {\"role\": \"system\", \"content\": SYSTEM_PROMPT},\n", + " {\"role\": \"user\", \"content\": f\"Classify this document:\\n\\n{doc}\"},\n", + " ],\n", + " tokenize=False,\n", + " add_generation_prompt=True,\n", + " )\n", + " for doc in documents\n", + "]\n", + "\n", + "print(f\"Prepared {len(prompts)} prompts\")\n", + "print(f\"\\nExample prompt (first 200 chars):\\n{prompts[0][:200]}...\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Run batch inference and measure throughput" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "params = SamplingParams(\n", + " temperature=0.0, # Deterministic for classification\n", + " max_tokens=150, # JSON output is compact\n", + ")\n", + "\n", + "# Run the batch\n", + "start = time.perf_counter()\n", + "outputs = llm.generate(prompts, sampling_params=params)\n", + "elapsed = time.perf_counter() - start\n", + "\n", + "# Calculate throughput 
metrics\n", + "total_output_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)\n", + "total_input_tokens = sum(len(o.prompt_token_ids) for o in outputs)\n", + "\n", + "print(f\"Batch size: {len(prompts)}\")\n", + "print(f\"Wall-clock time: {elapsed:.2f}s\")\n", + "print(f\"Total input tokens: {total_input_tokens:,}\")\n", + "print(f\"Total output tokens: {total_output_tokens:,}\")\n", + "print(f\"Output tok/s: {total_output_tokens / elapsed:.1f}\")\n", + "print(f\"Docs/sec: {len(prompts) / elapsed:.1f}\")\n", + "print(f\"Avg time/doc: {elapsed / len(prompts) * 1000:.1f}ms\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "\n", + "# Parse and display results\n", + "results = []\n", + "parse_errors = 0\n", + "\n", + "for output in outputs:\n", + " text = output.outputs[0].text.strip()\n", + " try:\n", + " parsed = json.loads(text)\n", + " results.append(parsed)\n", + " except json.JSONDecodeError:\n", + " parse_errors += 1\n", + " results.append({\"raw\": text, \"error\": \"parse_failed\"})\n", + "\n", + "print(f\"Successfully parsed: {len(results) - parse_errors}/{len(results)}\")\n", + "print(f\"Parse errors: {parse_errors}\")\n", + "print()\n", + "\n", + "# Show first 5 results\n", + "for i in range(min(5, len(results))):\n", + " print(f\"Doc {i+1}: {documents[i][:80]}...\")\n", + " print(f\" -> {json.dumps(results[i])}\")\n", + " print()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Release GPU memory before server mode\n", + "\n", + "The offline LLM object holds GPU memory. Release it before starting a vLLM server in the next section."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import gc\n", + "\n", + "del llm\n", + "del tokenizer\n", + "gc.collect()\n", + "\n", + "if torch.cuda.is_available():\n", + " torch.cuda.empty_cache()\n", + " torch.cuda.ipc_collect()\n", + "\n", + "print(\"GPU memory released. Ready for server mode.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Part 3: Concurrent Server Requests\n", + "\n", + "For production workloads, you'll typically run vLLM as a server and send requests concurrently.\n", + "This is useful when:\n", + "- Your data arrives as a stream (files, database rows, API calls)\n", + "- You need to integrate with existing pipelines\n", + "- You want to control concurrency independently from model batching\n", + "\n", + "### Start the server\n", + "\n", + "Run this in a terminal before continuing:\n", + "\n", + "```bash\n", + "VLLM_FLASHINFER_MOE_BACKEND=throughput \\\n", + "VLLM_USE_FLASHINFER_MOE_FP8=1 \\\n", + "vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \\\n", + " --tensor-parallel-size 4 \\\n", + " --trust-remote-code \\\n", + " --gpu-memory-utilization 0.95 \\\n", + " --enable-expert-parallel \\\n", + " --max-num-seqs 256 \\\n", + " --disable-log-requests \\\n", + " --port 5000\n", + "```\n", + "\n", + "Wait until you see `Uvicorn running on http://0.0.0.0:5000` before running the next cells." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import asyncio\n", + "import httpx\n", + "import json\n", + "import time\n", + "from typing import Any\n", + "\n", + "SERVER_URL = \"http://127.0.0.1:5000/v1/chat/completions\"\n", + "MODEL_NAME = \"nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16\"\n", + "\n", + "SYSTEM_PROMPT = \"\"\"You are a document classifier. 
For each document, output exactly one JSON object with these fields:\n", + "- \"category\": one of [\"Science\", \"Technology\", \"Finance\", \"Health\", \"Business\"]\n", + "- \"confidence\": a float between 0 and 1\n", + "- \"summary\": a one-sentence summary (max 20 words)\n", + "\n", + "Output ONLY the JSON object, no other text.\"\"\"\n", + "\n", + "\n", + "async def classify_one(\n", + " client: httpx.AsyncClient,\n", + " document: str,\n", + " semaphore: asyncio.Semaphore,\n", + ") -> dict[str, Any]:\n", + " \"\"\"Send a single classification request.\"\"\"\n", + " async with semaphore:\n", + " payload = {\n", + " \"model\": MODEL_NAME,\n", + " \"messages\": [\n", + " {\"role\": \"system\", \"content\": SYSTEM_PROMPT},\n", + " {\"role\": \"user\", \"content\": f\"Classify this document:\\n\\n{document}\"},\n", + " ],\n", + " \"temperature\": 0.0,\n", + " \"max_tokens\": 150,\n", + " }\n", + " t0 = time.perf_counter()\n", + " resp = await client.post(SERVER_URL, json=payload, timeout=120.0)\n", + " latency = time.perf_counter() - t0\n", + "\n", + " data = resp.json()\n", + " text = data[\"choices\"][0][\"message\"][\"content\"]\n", + " usage = data.get(\"usage\", {})\n", + "\n", + " return {\n", + " \"text\": text,\n", + " \"latency\": latency,\n", + " \"prompt_tokens\": usage.get(\"prompt_tokens\", 0),\n", + " \"completion_tokens\": usage.get(\"completion_tokens\", 0),\n", + " }\n", + "\n", + "\n", + "async def batch_classify(\n", + " documents: list[str],\n", + " concurrency: int = 32,\n", + ") -> list[dict[str, Any]]:\n", + " \"\"\"Classify all documents concurrently with bounded concurrency.\"\"\"\n", + " semaphore = asyncio.Semaphore(concurrency)\n", + " async with httpx.AsyncClient() as client:\n", + " tasks = [\n", + " classify_one(client, doc, semaphore)\n", + " for doc in documents\n", + " ]\n", + " return await asyncio.gather(*tasks)\n", + "\n", + "\n", + "print(\"Batch classification functions defined.\")" + ] + }, + { + "cell_type": "markdown", + 
"metadata": {}, + "source": [ + "### Run concurrent classification" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Reuse the same sample documents from Part 2\n", + "sample_documents = [\n", + " \"A new CRISPR-based gene therapy shows promise in treating sickle cell disease in clinical trials.\",\n", + " \"Tesla stock surged 8% after reporting record quarterly deliveries of 500,000 vehicles.\",\n", + " \"Researchers at MIT developed a neural network that can predict protein folding 100x faster.\",\n", + " \"The Federal Reserve held interest rates steady, signaling a wait-and-see approach to inflation.\",\n", + " \"SpaceX successfully launched its 50th Starlink mission this year, deploying 60 satellites.\",\n", + " \"A longitudinal study found that Mediterranean diets reduce cardiovascular risk by 30%.\",\n", + " \"NVIDIA announced its next-generation Blackwell GPU architecture at GTC 2026.\",\n", + " \"The European Central Bank cut rates by 25 basis points to stimulate economic growth.\",\n", + " \"Google DeepMind published a breakthrough in mathematical theorem proving using LLMs.\",\n", + " \"Apple's Vision Pro headset sold 1 million units in its first quarter of availability.\",\n", + " \"New evidence suggests that gut microbiome diversity is linked to cognitive performance.\",\n", + " \"Amazon Web Services reported 35% year-over-year growth in cloud computing revenue.\",\n", + " \"A team at Stanford created a robotic system that can perform autonomous surgery.\",\n", + " \"Bitcoin reached a new all-time high above $150,000 amid institutional adoption.\",\n", + " \"Moderna announced a combined flu-COVID vaccine with 90% efficacy in Phase 3 trials.\",\n", + " \"Meta released its latest open-source model, achieving state-of-the-art on coding benchmarks.\",\n", + " \"Climate scientists reported that 2025 was the hottest year in recorded history.\",\n", + " \"Toyota unveiled a solid-state battery 
EV with 900-mile range for 2027 production.\",\n", + " \"A new antibiotic compound was discovered that kills drug-resistant bacteria in mice.\",\n", + " \"Microsoft Azure announced native support for NVIDIA NIM microservices.\",\n", + "]\n", + "\n", + "documents = sample_documents * 5 # 100 documents\n", + "\n", + "# Run batch classification\n", + "start = time.perf_counter()\n", + "results = await batch_classify(documents, concurrency=32)\n", + "total_time = time.perf_counter() - start\n", + "\n", + "# Aggregate metrics\n", + "total_prompt_tokens = sum(r[\"prompt_tokens\"] for r in results)\n", + "total_completion_tokens = sum(r[\"completion_tokens\"] for r in results)\n", + "latencies = [r[\"latency\"] for r in results]\n", + "\n", + "print(f\"Documents processed: {len(results)}\")\n", + "print(f\"Concurrency: 32\")\n", + "print(f\"Wall-clock time: {total_time:.2f}s\")\n", + "print(f\"Docs/sec: {len(results) / total_time:.1f}\")\n", + "print(f\"Total output tokens: {total_completion_tokens:,}\")\n", + "print(f\"Output tok/s: {total_completion_tokens / total_time:.1f}\")\n", + "print(f\"Avg latency/req: {sum(latencies) / len(latencies) * 1000:.0f}ms\")\n", + "print(f\"P50 latency: {sorted(latencies)[len(latencies)//2] * 1000:.0f}ms\")\n", + "print(f\"P99 latency: {sorted(latencies)[int(len(latencies)*0.99)] * 1000:.0f}ms\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Effect of concurrency on throughput\n", + "\n", + "Higher concurrency lets vLLM batch more requests together, improving throughput up to a point.\n", + "Beyond the server's `--max-num-seqs`, additional requests queue without benefit." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "concurrency_levels = [1, 4, 16, 32, 64, 128]\n", + "concurrency_results = []\n", + "\n", + "for c in concurrency_levels:\n", + " start = time.perf_counter()\n", + " res = await batch_classify(documents[:50], concurrency=c) # Use 50 docs per test\n", + " elapsed = time.perf_counter() - start\n", + " tokens = sum(r[\"completion_tokens\"] for r in res)\n", + " concurrency_results.append({\n", + " \"concurrency\": c,\n", + " \"wall_time\": elapsed,\n", + " \"docs_per_sec\": 50 / elapsed,\n", + " \"tok_per_sec\": tokens / elapsed,\n", + " })\n", + " print(f\"Concurrency {c:>3d}: {elapsed:6.2f}s | {50/elapsed:5.1f} docs/s | {tokens/elapsed:6.1f} tok/s\")\n", + "\n", + "print(\"\\nHigher concurrency improves throughput until the server's batch capacity is saturated.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Part 4: Processing Files from Disk\n", + "\n", + "In real workloads, your data comes from files. Here's a pattern for processing a JSONL file\n", + "line-by-line with concurrent requests and writing results back to disk." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "from pathlib import Path\n", + "\n", + "# Create a sample input file\n", + "input_path = Path(\"/tmp/nemotron_batch_input.jsonl\")\n", + "output_path = Path(\"/tmp/nemotron_batch_output.jsonl\")\n", + "\n", + "with open(input_path, \"w\") as f:\n", + " for i, doc in enumerate(documents):\n", + " f.write(json.dumps({\"id\": i, \"text\": doc}) + \"\\n\")\n", + "\n", + "print(f\"Wrote {len(documents)} documents to {input_path}\")\n", + "\n", + "\n", + "async def process_jsonl(\n", + " input_file: Path,\n", + " output_file: Path,\n", + " concurrency: int = 32,\n", + ") -> dict:\n", + " \"\"\"Process a JSONL file: classify each document and write results.\"\"\"\n", + " # Read input\n", + " records = []\n", + " with open(input_file) as f:\n", + " for line in f:\n", + " records.append(json.loads(line))\n", + "\n", + " # Classify all documents\n", + " texts = [r[\"text\"] for r in records]\n", + " start = time.perf_counter()\n", + " results = await batch_classify(texts, concurrency=concurrency)\n", + " elapsed = time.perf_counter() - start\n", + "\n", + " # Write output\n", + " with open(output_file, \"w\") as f:\n", + " for record, result in zip(records, results):\n", + " try:\n", + " classification = json.loads(result[\"text\"])\n", + " except json.JSONDecodeError:\n", + " classification = {\"raw\": result[\"text\"], \"error\": \"parse_failed\"}\n", + " output = {\n", + " **record,\n", + " \"classification\": classification,\n", + " \"latency_ms\": round(result[\"latency\"] * 1000),\n", + " }\n", + " f.write(json.dumps(output) + \"\\n\")\n", + "\n", + " return {\n", + " \"total_docs\": len(records),\n", + " \"wall_time\": elapsed,\n", + " \"docs_per_sec\": len(records) / elapsed,\n", + " }\n", + "\n", + "\n", + "stats = await process_jsonl(input_path, output_path, concurrency=32)\n", + "print(f\"\\nProcessed {stats['total_docs']} docs in 
{stats['wall_time']:.2f}s ({stats['docs_per_sec']:.1f} docs/s)\")\n", + "print(f\"Results written to {output_path}\")\n", + "\n", + "# Show first 3 output records\n", + "with open(output_path) as f:\n", + " for i, line in enumerate(f):\n", + " if i >= 3:\n", + " break\n", + " record = json.loads(line)\n", + " print(f\"\\nDoc {record['id']}: {record['text'][:60]}...\")\n", + " print(f\" -> {json.dumps(record['classification'])}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Part 5: Throughput Benchmarking\n", + "\n", + "Use `vllm bench serve` for standardized throughput measurement.\n", + "This is the recommended way to benchmark before and after configuration changes.\n", + "\n", + "### Quick benchmark against your running server\n", + "\n", + "```bash\n", + "# Synthetic benchmark: fixed input/output lengths\n", + "vllm bench serve \\\n", + " --model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \\\n", + " --base-url http://127.0.0.1:5000 \\\n", + " --num-prompts 100 \\\n", + " --random-input-len 512 \\\n", + " --random-output-len 256 \\\n", + " --request-rate inf\n", + "```\n", + "\n", + "### Compare backend modes\n", + "\n", + "Run the benchmark twice - once with each backend - to see the difference:\n", + "\n", + "| Configuration | Best for | Expected throughput profile |\n", + "|--------------|----------|---------------------------|\n", + "| `VLLM_FLASHINFER_MOE_BACKEND=latency` | Interactive, low-latency serving | Lower TTFT, lower aggregate tok/s |\n", + "| `VLLM_FLASHINFER_MOE_BACKEND=throughput` | Batch jobs, offline processing | Higher aggregate tok/s, higher per-request latency |\n", + "\n", + "The throughput backend can deliver significantly higher aggregate tokens/sec when many requests are in flight simultaneously,\n", + "at the cost of higher per-request latency. For batch workloads where you care about total wall-clock time (not individual response speed),\n", + "the throughput backend is the right choice." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Scaling estimates\n", + "\n", + "Use your benchmark results to estimate processing time for larger workloads:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Replace with your measured docs/sec from Part 2 or Part 3\n", + "measured_docs_per_sec = stats[\"docs_per_sec\"]\n", + "\n", + "workloads = [\n", + " (\"Small dataset\", 1_000),\n", + " (\"Medium dataset\", 10_000),\n", + " (\"Large dataset\", 100_000),\n", + " (\"Patent-scale (3.5M)\", 3_500_000),\n", + "]\n", + "\n", + "print(f\"Based on measured throughput: {measured_docs_per_sec:.1f} docs/sec\\n\")\n", + "print(f\"{'Workload':<25} {'Documents':>12} {'Est. Time':>15}\")\n", + "print(\"-\" * 55)\n", + "for name, count in workloads:\n", + " seconds = count / measured_docs_per_sec\n", + " if seconds < 60:\n", + " time_str = f\"{seconds:.0f}s\"\n", + " elif seconds < 3600:\n", + " time_str = f\"{seconds/60:.1f} min\"\n", + " else:\n", + " time_str = f\"{seconds/3600:.1f} hours\"\n", + " print(f\"{name:<25} {count:>12,} {time_str:>15}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Cleanup and shutdown\n", + "\n", + "To free resources after this notebook:\n", + "\n", + "1. Release in-process model memory (run the next cell).\n", + "2. Stop the `vllm serve` process in the terminal where it was started (`Ctrl+C`).\n", + "3. If needed, restart the kernel to ensure a clean state." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import gc\n", + "import torch\n", + "\n", + "# Release in-process model/tokenizer objects if present\n", + "for name in (\"llm\", \"tokenizer\"):\n", + " if name in globals():\n", + " del globals()[name]\n", + "\n", + "gc.collect()\n", + "\n", + "if torch.cuda.is_available():\n", + " torch.cuda.empty_cache()\n", + " torch.cuda.ipc_collect()\n", + "\n", + "# Clean up temp files\n", + "from pathlib import Path\n", + "for p in [Path(\"/tmp/nemotron_batch_input.jsonl\"), Path(\"/tmp/nemotron_batch_output.jsonl\")]:\n", + " p.unlink(missing_ok=True)\n", + "\n", + "print(\"Cleanup complete.\")\n", + "print(\"If vllm serve is running in a terminal, stop it with Ctrl+C.\")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.6" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}