Add high-throughput batch processing cookbook for Nemotron 3 Super #107
Description
Problem:
Nemotron 3 Super achieves 5x throughput over the previous Nemotron Super and 7.5x over Qwen3.5-122B, but the repository has no cookbook demonstrating how to actually run high-throughput batch inference. The Advanced Deployment Guide mentions the throughput backend mode for "offline batch jobs" but never demonstrates it.
The community is already building batch workloads organically (classifying 3.5M patents on a single RTX 5090, bulk code review at 12.5s per file), but there is no official guidance on optimal configuration.
Proposed Solution:
Add usage-cookbook/Nemotron-3-Super/batch_throughput_cookbook.ipynb demonstrating:
- Server configuration for throughput (CUTLASS backend, EP, batch size tuning)
- Offline batch inference with vLLM's `LLM` class
- Async concurrent requests against an OpenAI-compatible server
- Practical use case: bulk document classification with structured JSON output
- Throughput measurement and latency vs throughput backend comparison
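To make the async concurrent-request item concrete, here is a minimal sketch of the fan-out pattern the cookbook could use against an OpenAI-compatible server. The endpoint call is stubbed (`classify_document` is a hypothetical placeholder, not an actual API from this repo) so the control flow is self-contained; the real notebook would issue chat-completion requests instead.

```python
import asyncio

MAX_CONCURRENCY = 8  # tune to the server's effective batch size

async def classify_document(doc: str) -> str:
    # Placeholder for a real chat-completion call that would return a
    # structured JSON label from the server.
    await asyncio.sleep(0)  # yield control, as a network call would
    return f'{{"label": "example", "chars": {len(doc)}}}'

async def run_batch(docs: list[str]) -> list[str]:
    sem = asyncio.Semaphore(MAX_CONCURRENCY)

    async def bounded(doc: str) -> str:
        async with sem:  # cap the number of in-flight requests
            return await classify_document(doc)

    # Launch all requests concurrently; gather preserves input order.
    return await asyncio.gather(*(bounded(d) for d in docs))

results = asyncio.run(run_batch(["doc one", "doc two", "doc three"]))
```

Bounding concurrency with a semaphore keeps the server's queue saturated without overwhelming it, which is the usual trade-off when measuring throughput against a batching backend.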
The notebook follows the existing vllm_cookbook.ipynb pattern and requires no external API keys.
Why now:
With the Nemotron 3 Super launch and GTC next week, community interest in throughput optimization is at its peak. Official guidance would validate NVIDIA's throughput claims with reproducible benchmarks.
I'm willing to implement this. Happy to adjust based on feedback.