diff --git a/usage-cookbook/Nemotron-3-Super/README.md b/usage-cookbook/Nemotron-3-Super/README.md index 10a6c358..8763d5cb 100644 --- a/usage-cookbook/Nemotron-3-Super/README.md +++ b/usage-cookbook/Nemotron-3-Super/README.md @@ -13,6 +13,7 @@ These notebooks provide end-to-end recipes for deploying and customizing Nemotro - **[vllm_cookbook.ipynb](vllm_cookbook.ipynb)** — Deploy Nemotron-3-Super with vLLM. - **[sglang_cookbook.ipynb](sglang_cookbook.ipynb)** — Deploy Nemotron-3-Super with SGLang. - **[trtllm_cookbook.ipynb](trtllm_cookbook.ipynb)** — Deploy Nemotron-3-Super with TensorRT-LLM. +- **[batch_throughput_cookbook.ipynb](batch_throughput_cookbook.ipynb)** — High-throughput batch processing with vLLM: offline batch inference, concurrent server requests, bulk document classification, and throughput benchmarking. - **{doc}`AdvancedDeploymentGuide `** — Production deployment configurations for vLLM, SGLang, and TRT-LLM across GPU topologies (GB200, B200, DGX Spark), including MTP speculative decoding, expert parallelism, and tuning guidance. ### Fine-Tuning diff --git a/usage-cookbook/Nemotron-3-Super/batch_throughput_cookbook.ipynb b/usage-cookbook/Nemotron-3-Super/batch_throughput_cookbook.ipynb new file mode 100644 index 00000000..68c24941 --- /dev/null +++ b/usage-cookbook/Nemotron-3-Super/batch_throughput_cookbook.ipynb @@ -0,0 +1,755 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# High-Throughput Batch Processing with Nemotron 3 Super\n", + "\n", + "This notebook demonstrates how to run **high-throughput batch inference** with `nvidia/NVIDIA-Nemotron-3-Super-120B-A12B` using [vLLM](https://docs.vllm.ai).\n", + "\n", + "Nemotron 3 Super achieves 5× the throughput of the previous Nemotron Super and 7.5× that of Qwen3.5-122B on long-output workloads. 
This cookbook shows how to unlock that throughput for real batch workloads like document classification, summarization, and data extraction.\n", + "\n", + "**What you'll learn:**\n", + "1. Configuring vLLM for maximum throughput (CUTLASS backend, EP, batch sizing)\n", + "2. Offline batch inference with vLLM's `LLM` class\n", + "3. Concurrent server requests with async Python\n", + "4. Practical use case: bulk document classification with structured JSON output\n", + "5. Measuring and comparing throughput across configurations\n", + "\n", + "For model details, see the [model card](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16). For latency-optimized single-request usage, see [vllm_cookbook.ipynb](./vllm_cookbook.ipynb).\n", + "\n", + "Prerequisites:\n", + "- NVIDIA GPU with recent drivers (>= 264 GB VRAM for BF16, >= 160 GB VRAM for FP8, >= 80 GB VRAM for NVFP4) and CUDA 12.x.\n", + "- Python 3.10+" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Install dependencies" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# If pip is not found\n", + "!python -m ensurepip --default-pip" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%pip install -U vllm==0.17.1 torch==2.10.0 flashinfer-python==0.6.4 flashinfer-cubin==0.6.4 'nvidia-cutlass-dsl>=4.4.0.dev1' openai httpx --extra-index-url https://download.pytorch.org/whl/cu128" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Verify GPU\n", + "\n", + "Confirm CUDA is available and your GPU is visible to PyTorch."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# GPU environment check\n", + "import torch\n", + "print(f\"CUDA available: {torch.cuda.is_available()}\")\n", + "print(f\"Num GPUs: {torch.cuda.device_count()}\")\n", + "if torch.cuda.is_available():\n", + " for i in range(torch.cuda.device_count()):\n", + " print(f\"GPU[{i}]: {torch.cuda.get_device_name(i)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Part 1: Throughput-Optimized Server Configuration\n", + "\n", + "vLLM supports two MoE backend modes for Nemotron 3 Super:\n", + "\n", + "| Mode | Env Var | Backend | Best For |\n", + "|------|---------|---------|----------|\n", + "| **Latency** | `VLLM_FLASHINFER_MOE_BACKEND=latency` | TRT-LLM Gen kernels | Online serving, interactive chat |\n", + "| **Throughput** | `VLLM_FLASHINFER_MOE_BACKEND=throughput` | CUTLASS | Offline batch jobs, bulk processing |\n", + "\n", + "The `throughput` backend trades per-request latency for higher aggregate tokens/sec across many concurrent requests.\n", + "\n", + "### Latency-optimized (default from vllm_cookbook)\n", + "\n", + "```bash\n", + "VLLM_FLASHINFER_MOE_BACKEND=latency \\\n", + "vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \\\n", + " --tensor-parallel-size 4 \\\n", + " --trust-remote-code \\\n", + " --gpu-memory-utilization 0.9 \\\n", + " --enable-expert-parallel\n", + "```\n", + "\n", + "### Throughput-optimized (for batch workloads)\n", + "\n", + "```bash\n", + "VLLM_FLASHINFER_MOE_BACKEND=throughput \\\n", + "VLLM_USE_FLASHINFER_MOE_FP8=1 \\\n", + "vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \\\n", + " --tensor-parallel-size 4 \\\n", + " --trust-remote-code \\\n", + " --gpu-memory-utilization 0.95 \\\n", + " --enable-expert-parallel \\\n", + " --max-num-seqs 256 \\\n", + " --disable-log-requests\n", + "```\n", + "\n", + "Key differences for throughput:\n", + "- 
`VLLM_FLASHINFER_MOE_BACKEND=throughput` selects CUTLASS kernels optimized for batch processing\n", + "- `--gpu-memory-utilization 0.95` allocates more memory for KV cache (more concurrent sequences)\n", + "- `--max-num-seqs 256` allows more concurrent sequences in a batch\n", + "- `--disable-log-requests` reduces logging overhead during high-volume processing\n", + "\n", + "See the [Advanced Deployment Guide](./AdvancedDeploymentGuide/README.md) for GPU-specific configurations (GB200, B200, DGX Spark)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Part 2: Offline Batch Inference\n", + "\n", + "For bulk processing without running a server, use vLLM's `LLM` class directly.\n", + "This is the simplest way to process a large dataset: load the model once, pass all prompts, collect results." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from vllm import LLM, SamplingParams\n", + "from transformers import AutoTokenizer\n", + "import time\n", + "\n", + "# Choose your precision/GPU configuration\n", + "model_id = \"nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16\"\n", + "# model_id = \"nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8\"\n", + "# model_id = \"nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4\"\n", + "\n", + "llm = LLM(\n", + " model=model_id,\n", + " trust_remote_code=True,\n", + " dtype=\"auto\",\n", + " tensor_parallel_size=4, # TP=4 for BF16, TP=2 for FP8, TP=1 for NVFP4\n", + " kv_cache_dtype=\"fp8\",\n", + " gpu_memory_utilization=0.95,\n", + ")\n", + "\n", + "tokenizer = AutoTokenizer.from_pretrained(model_id)\n", + "print(\"Model ready for batch inference\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Prepare a batch of prompts\n", + "\n", + "We'll generate a sample dataset of 50 documents to classify. In practice, this would be your CSV, JSONL, or database rows." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Sample documents for classification\n", + "sample_documents = [\n", + " \"A new CRISPR-based gene therapy shows promise in treating sickle cell disease in clinical trials.\",\n", + " \"Tesla stock surged 8% after reporting record quarterly deliveries of 500,000 vehicles.\",\n", + " \"Researchers at MIT developed a neural network that can predict protein folding 100x faster.\",\n", + " \"The Federal Reserve held interest rates steady, signaling a wait-and-see approach to inflation.\",\n", + " \"SpaceX successfully launched its 50th Starlink mission this year, deploying 60 satellites.\",\n", + " \"A longitudinal study found that Mediterranean diets reduce cardiovascular risk by 30%.\",\n", + " \"NVIDIA announced its next-generation Blackwell GPU architecture at GTC 2026.\",\n", + " \"The European Central Bank cut rates by 25 basis points to stimulate economic growth.\",\n", + " \"Google DeepMind published a breakthrough in mathematical theorem proving using LLMs.\",\n", + " \"Apple's Vision Pro headset sold 1 million units in its first quarter of availability.\",\n", + " \"New evidence suggests that gut microbiome diversity is linked to cognitive performance.\",\n", + " \"Amazon Web Services reported 35% year-over-year growth in cloud computing revenue.\",\n", + " \"A team at Stanford created a robotic system that can perform autonomous surgery.\",\n", + " \"Bitcoin reached a new all-time high above $150,000 amid institutional adoption.\",\n", + " \"Moderna announced a combined flu-COVID vaccine with 90% efficacy in Phase 3 trials.\",\n", + " \"Meta released its latest open-source model, achieving state-of-the-art on coding benchmarks.\",\n", + " \"Climate scientists reported that 2025 was the hottest year in recorded history.\",\n", + " \"Toyota unveiled a solid-state battery EV with 900-mile range for 2027 production.\",\n", + " \"A new antibiotic 
compound was discovered that kills drug-resistant bacteria in mice.\",\n", + " \"Microsoft Azure announced native support for NVIDIA NIM microservices.\",\n", + "]\n", + "\n", + "# Duplicate to create a larger batch (adjust multiplier for your workload)\n", + "documents = sample_documents * 5 # 100 documents\n", + "print(f\"Total documents to classify: {len(documents)}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Build classification prompts\n", + "SYSTEM_PROMPT = \"\"\"You are a document classifier. For each document, output exactly one JSON object with these fields:\n", + "- \"category\": one of [\"Science\", \"Technology\", \"Finance\", \"Health\", \"Business\"]\n", + "- \"confidence\": a float between 0 and 1\n", + "- \"summary\": a one-sentence summary (max 20 words)\n", + "\n", + "Output ONLY the JSON object, no other text.\"\"\"\n", + "\n", + "prompts = [\n", + " tokenizer.apply_chat_template(\n", + " [\n", + " {\"role\": \"system\", \"content\": SYSTEM_PROMPT},\n", + " {\"role\": \"user\", \"content\": f\"Classify this document:\\n\\n{doc}\"},\n", + " ],\n", + " tokenize=False,\n", + " add_generation_prompt=True,\n", + " )\n", + " for doc in documents\n", + "]\n", + "\n", + "print(f\"Prepared {len(prompts)} prompts\")\n", + "print(f\"\\nExample prompt (first 200 chars):\\n{prompts[0][:200]}...\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Run batch inference and measure throughput" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "params = SamplingParams(\n", + " temperature=0.0, # Deterministic for classification\n", + " max_tokens=150, # JSON output is compact\n", + ")\n", + "\n", + "# Run the batch\n", + "start = time.perf_counter()\n", + "outputs = llm.generate(prompts, sampling_params=params)\n", + "elapsed = time.perf_counter() - start\n", + "\n", + "# Calculate throughput 
metrics\n", + "total_output_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)\n", + "total_input_tokens = sum(len(o.prompt_token_ids) for o in outputs)\n", + "\n", + "print(f\"Batch size: {len(prompts)}\")\n", + "print(f\"Wall-clock time: {elapsed:.2f}s\")\n", + "print(f\"Total input tokens: {total_input_tokens:,}\")\n", + "print(f\"Total output tokens: {total_output_tokens:,}\")\n", + "print(f\"Output tok/s: {total_output_tokens / elapsed:.1f}\")\n", + "print(f\"Docs/sec: {len(prompts) / elapsed:.1f}\")\n", + "print(f\"Avg time/doc: {elapsed / len(prompts) * 1000:.1f}ms\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "\n", + "# Parse and display results\n", + "results = []\n", + "parse_errors = 0\n", + "\n", + "for output in outputs:\n", + " text = output.outputs[0].text.strip()\n", + " try:\n", + " parsed = json.loads(text)\n", + " results.append(parsed)\n", + " except json.JSONDecodeError:\n", + " parse_errors += 1\n", + " results.append({\"raw\": text, \"error\": \"parse_failed\"})\n", + "\n", + "print(f\"Successfully parsed: {len(results) - parse_errors}/{len(results)}\")\n", + "print(f\"Parse errors: {parse_errors}\")\n", + "print()\n", + "\n", + "# Show first 5 results\n", + "for i in range(min(5, len(results))):\n", + " print(f\"Doc {i+1}: {documents[i][:80]}...\")\n", + " print(f\" -> {json.dumps(results[i])}\")\n", + " print()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Release GPU memory before server mode\n", + "\n", + "The offline LLM object holds GPU memory. Release it before starting a vLLM server in the next section."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import gc\n", + "\n", + "del llm\n", + "del tokenizer\n", + "gc.collect()\n", + "\n", + "if torch.cuda.is_available():\n", + " torch.cuda.empty_cache()\n", + " torch.cuda.ipc_collect()\n", + "\n", + "print(\"GPU memory released. Ready for server mode.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Part 3: Concurrent Server Requests\n", + "\n", + "For production workloads, you'll typically run vLLM as a server and send requests concurrently.\n", + "This is useful when:\n", + "- Your data arrives as a stream (files, database rows, API calls)\n", + "- You need to integrate with existing pipelines\n", + "- You want to control concurrency independently from model batching\n", + "\n", + "### Start the server\n", + "\n", + "Run this in a terminal before continuing:\n", + "\n", + "```bash\n", + "VLLM_FLASHINFER_MOE_BACKEND=throughput \\\n", + "VLLM_USE_FLASHINFER_MOE_FP8=1 \\\n", + "vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \\\n", + " --tensor-parallel-size 4 \\\n", + " --trust-remote-code \\\n", + " --gpu-memory-utilization 0.95 \\\n", + " --enable-expert-parallel \\\n", + " --max-num-seqs 256 \\\n", + " --disable-log-requests \\\n", + " --port 5000\n", + "```\n", + "\n", + "Wait until you see `Uvicorn running on http://0.0.0.0:5000` before running the next cells." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import asyncio\n", + "import httpx\n", + "import json\n", + "import time\n", + "from typing import Any\n", + "\n", + "SERVER_URL = \"http://127.0.0.1:5000/v1/chat/completions\"\n", + "MODEL_NAME = \"nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16\"\n", + "\n", + "SYSTEM_PROMPT = \"\"\"You are a document classifier. 
For each document, output exactly one JSON object with these fields:\n", + "- \"category\": one of [\"Science\", \"Technology\", \"Finance\", \"Health\", \"Business\"]\n", + "- \"confidence\": a float between 0 and 1\n", + "- \"summary\": a one-sentence summary (max 20 words)\n", + "\n", + "Output ONLY the JSON object, no other text.\"\"\"\n", + "\n", + "\n", + "async def classify_one(\n", + " client: httpx.AsyncClient,\n", + " document: str,\n", + " semaphore: asyncio.Semaphore,\n", + ") -> dict[str, Any]:\n", + " \"\"\"Send a single classification request.\"\"\"\n", + " async with semaphore:\n", + " payload = {\n", + " \"model\": MODEL_NAME,\n", + " \"messages\": [\n", + " {\"role\": \"system\", \"content\": SYSTEM_PROMPT},\n", + " {\"role\": \"user\", \"content\": f\"Classify this document:\\n\\n{document}\"},\n", + " ],\n", + " \"temperature\": 0.0,\n", + " \"max_tokens\": 150,\n", + " }\n", + " t0 = time.perf_counter()\n", + " resp = await client.post(SERVER_URL, json=payload, timeout=120.0)\n", + " latency = time.perf_counter() - t0\n", + "\n", + " data = resp.json()\n", + " text = data[\"choices\"][0][\"message\"][\"content\"]\n", + " usage = data.get(\"usage\", {})\n", + "\n", + " return {\n", + " \"text\": text,\n", + " \"latency\": latency,\n", + " \"prompt_tokens\": usage.get(\"prompt_tokens\", 0),\n", + " \"completion_tokens\": usage.get(\"completion_tokens\", 0),\n", + " }\n", + "\n", + "\n", + "async def batch_classify(\n", + " documents: list[str],\n", + " concurrency: int = 32,\n", + ") -> list[dict[str, Any]]:\n", + " \"\"\"Classify all documents concurrently with bounded concurrency.\"\"\"\n", + " semaphore = asyncio.Semaphore(concurrency)\n", + " async with httpx.AsyncClient() as client:\n", + " tasks = [\n", + " classify_one(client, doc, semaphore)\n", + " for doc in documents\n", + " ]\n", + " return await asyncio.gather(*tasks)\n", + "\n", + "\n", + "print(\"Batch classification functions defined.\")" + ] + }, + { + "cell_type": "markdown", + 
"metadata": {}, + "source": [ + "### Run concurrent classification" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Reuse the same sample documents from Part 2\n", + "sample_documents = [\n", + " \"A new CRISPR-based gene therapy shows promise in treating sickle cell disease in clinical trials.\",\n", + " \"Tesla stock surged 8% after reporting record quarterly deliveries of 500,000 vehicles.\",\n", + " \"Researchers at MIT developed a neural network that can predict protein folding 100x faster.\",\n", + " \"The Federal Reserve held interest rates steady, signaling a wait-and-see approach to inflation.\",\n", + " \"SpaceX successfully launched its 50th Starlink mission this year, deploying 60 satellites.\",\n", + " \"A longitudinal study found that Mediterranean diets reduce cardiovascular risk by 30%.\",\n", + " \"NVIDIA announced its next-generation Blackwell GPU architecture at GTC 2026.\",\n", + " \"The European Central Bank cut rates by 25 basis points to stimulate economic growth.\",\n", + " \"Google DeepMind published a breakthrough in mathematical theorem proving using LLMs.\",\n", + " \"Apple's Vision Pro headset sold 1 million units in its first quarter of availability.\",\n", + " \"New evidence suggests that gut microbiome diversity is linked to cognitive performance.\",\n", + " \"Amazon Web Services reported 35% year-over-year growth in cloud computing revenue.\",\n", + " \"A team at Stanford created a robotic system that can perform autonomous surgery.\",\n", + " \"Bitcoin reached a new all-time high above $150,000 amid institutional adoption.\",\n", + " \"Moderna announced a combined flu-COVID vaccine with 90% efficacy in Phase 3 trials.\",\n", + " \"Meta released its latest open-source model, achieving state-of-the-art on coding benchmarks.\",\n", + " \"Climate scientists reported that 2025 was the hottest year in recorded history.\",\n", + " \"Toyota unveiled a solid-state battery 
EV with 900-mile range for 2027 production.\",\n", + " \"A new antibiotic compound was discovered that kills drug-resistant bacteria in mice.\",\n", + " \"Microsoft Azure announced native support for NVIDIA NIM microservices.\",\n", + "]\n", + "\n", + "documents = sample_documents * 5 # 100 documents\n", + "\n", + "# Run batch classification\n", + "start = time.perf_counter()\n", + "results = await batch_classify(documents, concurrency=32)\n", + "total_time = time.perf_counter() - start\n", + "\n", + "# Aggregate metrics\n", + "total_prompt_tokens = sum(r[\"prompt_tokens\"] for r in results)\n", + "total_completion_tokens = sum(r[\"completion_tokens\"] for r in results)\n", + "latencies = [r[\"latency\"] for r in results]\n", + "\n", + "print(f\"Documents processed: {len(results)}\")\n", + "print(f\"Concurrency: 32\")\n", + "print(f\"Wall-clock time: {total_time:.2f}s\")\n", + "print(f\"Docs/sec: {len(results) / total_time:.1f}\")\n", + "print(f\"Total output tokens: {total_completion_tokens:,}\")\n", + "print(f\"Output tok/s: {total_completion_tokens / total_time:.1f}\")\n", + "print(f\"Avg latency/req: {sum(latencies) / len(latencies) * 1000:.0f}ms\")\n", + "print(f\"P50 latency: {sorted(latencies)[len(latencies)//2] * 1000:.0f}ms\")\n", + "print(f\"P99 latency: {sorted(latencies)[int(len(latencies)*0.99)] * 1000:.0f}ms\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Effect of concurrency on throughput\n", + "\n", + "Higher concurrency lets vLLM batch more requests together, improving throughput up to a point.\n", + "Beyond the server's `--max-num-seqs`, additional requests queue without benefit." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "concurrency_levels = [1, 4, 16, 32, 64, 128]\n", + "concurrency_results = []\n", + "\n", + "for c in concurrency_levels:\n", + " start = time.perf_counter()\n", + " res = await batch_classify(documents[:50], concurrency=c) # Use 50 docs per test\n", + " elapsed = time.perf_counter() - start\n", + " tokens = sum(r[\"completion_tokens\"] for r in res)\n", + " concurrency_results.append({\n", + " \"concurrency\": c,\n", + " \"wall_time\": elapsed,\n", + " \"docs_per_sec\": 50 / elapsed,\n", + " \"tok_per_sec\": tokens / elapsed,\n", + " })\n", + " print(f\"Concurrency {c:>3d}: {elapsed:6.2f}s | {50/elapsed:5.1f} docs/s | {tokens/elapsed:6.1f} tok/s\")\n", + "\n", + "print(\"\\nHigher concurrency improves throughput until the server's batch capacity is saturated.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Part 4: Processing Files from Disk\n", + "\n", + "In real workloads, your data comes from files. Here's a pattern for processing a JSONL file\n", + "line-by-line with concurrent requests and writing results back to disk." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "from pathlib import Path\n", + "\n", + "# Create a sample input file\n", + "input_path = Path(\"/tmp/nemotron_batch_input.jsonl\")\n", + "output_path = Path(\"/tmp/nemotron_batch_output.jsonl\")\n", + "\n", + "with open(input_path, \"w\") as f:\n", + " for i, doc in enumerate(documents):\n", + " f.write(json.dumps({\"id\": i, \"text\": doc}) + \"\\n\")\n", + "\n", + "print(f\"Wrote {len(documents)} documents to {input_path}\")\n", + "\n", + "\n", + "async def process_jsonl(\n", + " input_file: Path,\n", + " output_file: Path,\n", + " concurrency: int = 32,\n", + ") -> dict:\n", + " \"\"\"Process a JSONL file: classify each document and write results.\"\"\"\n", + " # Read input\n", + " records = []\n", + " with open(input_file) as f:\n", + " for line in f:\n", + " records.append(json.loads(line))\n", + "\n", + " # Classify all documents\n", + " texts = [r[\"text\"] for r in records]\n", + " start = time.perf_counter()\n", + " results = await batch_classify(texts, concurrency=concurrency)\n", + " elapsed = time.perf_counter() - start\n", + "\n", + " # Write output\n", + " with open(output_file, \"w\") as f:\n", + " for record, result in zip(records, results):\n", + " try:\n", + " classification = json.loads(result[\"text\"])\n", + " except json.JSONDecodeError:\n", + " classification = {\"raw\": result[\"text\"], \"error\": \"parse_failed\"}\n", + " output = {\n", + " **record,\n", + " \"classification\": classification,\n", + " \"latency_ms\": round(result[\"latency\"] * 1000),\n", + " }\n", + " f.write(json.dumps(output) + \"\\n\")\n", + "\n", + " return {\n", + " \"total_docs\": len(records),\n", + " \"wall_time\": elapsed,\n", + " \"docs_per_sec\": len(records) / elapsed,\n", + " }\n", + "\n", + "\n", + "stats = await process_jsonl(input_path, output_path, concurrency=32)\n", + "print(f\"\\nProcessed {stats['total_docs']} docs in 
{stats['wall_time']:.2f}s ({stats['docs_per_sec']:.1f} docs/s)\")\n", + "print(f\"Results written to {output_path}\")\n", + "\n", + "# Show first 3 output records\n", + "with open(output_path) as f:\n", + " for i, line in enumerate(f):\n", + " if i >= 3:\n", + " break\n", + " record = json.loads(line)\n", + " print(f\"\\nDoc {record['id']}: {record['text'][:60]}...\")\n", + " print(f\" -> {json.dumps(record['classification'])}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Part 5: Throughput Benchmarking\n", + "\n", + "Use `vllm bench serve` for standardized throughput measurement.\n", + "This is the recommended way to benchmark before and after configuration changes.\n", + "\n", + "### Quick benchmark against your running server\n", + "\n", + "```bash\n", + "# Synthetic benchmark: fixed input/output lengths\n", + "vllm bench serve \\\n", + " --model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \\\n", + " --base-url http://127.0.0.1:5000 \\\n", + " --num-prompts 100 \\\n", + " --random-input-len 512 \\\n", + " --random-output-len 256 \\\n", + " --request-rate inf\n", + "```\n", + "\n", + "### Compare backend modes\n", + "\n", + "Run the benchmark twice - once with each backend - to see the difference:\n", + "\n", + "| Configuration | Best for | Expected throughput profile |\n", + "|--------------|----------|---------------------------|\n", + "| `VLLM_FLASHINFER_MOE_BACKEND=latency` | Interactive, low-latency serving | Lower TTFT, lower aggregate tok/s |\n", + "| `VLLM_FLASHINFER_MOE_BACKEND=throughput` | Batch jobs, offline processing | Higher aggregate tok/s, higher per-request latency |\n", + "\n", + "The throughput backend can deliver significantly higher aggregate tokens/sec when many requests are in flight simultaneously,\n", + "at the cost of higher per-request latency. For batch workloads where you care about total wall-clock time (not individual response speed),\n", + "the throughput backend is the right choice." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Scaling estimates\n", + "\n", + "Use your benchmark results to estimate processing time for larger workloads:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Replace with your measured docs/sec from Part 2 or Part 3\n", + "measured_docs_per_sec = stats[\"docs_per_sec\"]\n", + "\n", + "workloads = [\n", + " (\"Small dataset\", 1_000),\n", + " (\"Medium dataset\", 10_000),\n", + " (\"Large dataset\", 100_000),\n", + " (\"Patent-scale (3.5M)\", 3_500_000),\n", + "]\n", + "\n", + "print(f\"Based on measured throughput: {measured_docs_per_sec:.1f} docs/sec\\n\")\n", + "print(f\"{'Workload':<25} {'Documents':>12} {'Est. Time':>15}\")\n", + "print(\"-\" * 55)\n", + "for name, count in workloads:\n", + " seconds = count / measured_docs_per_sec\n", + " if seconds < 60:\n", + " time_str = f\"{seconds:.0f}s\"\n", + " elif seconds < 3600:\n", + " time_str = f\"{seconds/60:.1f} min\"\n", + " else:\n", + " time_str = f\"{seconds/3600:.1f} hours\"\n", + " print(f\"{name:<25} {count:>12,} {time_str:>15}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Cleanup and shutdown\n", + "\n", + "To free resources after this notebook:\n", + "\n", + "1. Release in-process model memory (run the next cell).\n", + "2. Stop the `vllm serve` process in the terminal where it was started (`Ctrl+C`).\n", + "3. If needed, restart the kernel to ensure a clean state." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import gc\n", + "import torch\n", + "\n", + "# Release in-process model/tokenizer objects if present\n", + "for name in (\"llm\", \"tokenizer\"):\n", + " if name in globals():\n", + " del globals()[name]\n", + "\n", + "gc.collect()\n", + "\n", + "if torch.cuda.is_available():\n", + " torch.cuda.empty_cache()\n", + " torch.cuda.ipc_collect()\n", + "\n", + "# Clean up temp files\n", + "from pathlib import Path\n", + "for p in [Path(\"/tmp/nemotron_batch_input.jsonl\"), Path(\"/tmp/nemotron_batch_output.jsonl\")]:\n", + " p.unlink(missing_ok=True)\n", + "\n", + "print(\"Cleanup complete.\")\n", + "print(\"If vllm serve is running in a terminal, stop it with Ctrl+C.\")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.6" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}