Add high-throughput batch processing cookbook for Nemotron 3 Super #109

Open
mvanhorn wants to merge 1 commit into NVIDIA-NeMo:main from mvanhorn:osc/107-batch-throughput-cookbook
Conversation

@mvanhorn
Contributor

Fixes #107

Summary

Adds usage-cookbook/Nemotron-3-Super/batch_throughput_cookbook.ipynb demonstrating high-throughput batch inference with Nemotron 3 Super via vLLM.

What's covered:

  • Throughput-optimized server configuration (CUTLASS backend, EP, batch sizing) vs latency-optimized defaults
  • Offline batch inference with vLLM's LLM class for zero-server batch processing
  • Async concurrent requests with httpx and bounded concurrency
  • Practical use case: bulk document classification with structured JSON output
  • Concurrency scaling analysis showing throughput vs concurrency level
  • JSONL file-to-file processing pattern for production pipelines
  • Throughput benchmarking guidance with vllm bench serve
  • Scaling estimates from measured throughput to larger workloads
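The async bounded-concurrency and JSONL file-to-file patterns listed above can be sketched with the standard library alone. The cookbook drives a running vLLM server with httpx; here a stub stands in for the HTTP call, and all names (`classify`, `process_jsonl`, the `MAX_CONCURRENCY` value) are illustrative rather than taken from the notebook:

```python
import asyncio
import json
from pathlib import Path

MAX_CONCURRENCY = 8  # cap on in-flight requests; tune to the server's batch size

async def classify(doc: str) -> dict:
    # Placeholder for the real call (an httpx POST to the vLLM
    # OpenAI-compatible endpoint in the cookbook).
    await asyncio.sleep(0)  # simulate awaiting I/O
    return {"label": "patent" if "claim" in doc else "other"}

async def process_jsonl(src: Path, dst: Path) -> None:
    """Read records from src, classify them concurrently, write dst."""
    sem = asyncio.Semaphore(MAX_CONCURRENCY)

    async def worker(line: str) -> dict:
        rec = json.loads(line)
        async with sem:  # bounded concurrency: at most MAX_CONCURRENCY awaits
            rec["result"] = await classify(rec["text"])
        return rec

    lines = src.read_text().splitlines()
    results = await asyncio.gather(*(worker(line) for line in lines))
    dst.write_text("".join(json.dumps(r) + "\n" for r in results))
```

Swapping the stub for a real httpx request keeps the same shape: the semaphore bounds concurrency while `asyncio.gather` keeps the pipeline full.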

Also updates the Nemotron-3-Super cookbook README with the new entry.

Research & Context

Community Signals

  • Reddit r/LocalLLaMA: a user classified 3.5M US patents on a single RTX 5090 with Nemotron 9B (401 upvotes), the most viral piece of Nemotron community content
  • Greptile tested Nemotron 3 Super for bulk code review and reported it "punches far above its weight class" at 12.5 seconds per review
  • The Advanced Deployment Guide mentions VLLM_FLASHINFER_MOE_BACKEND=throughput for "offline batch jobs" but never demonstrates it
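A throughput-mode launch along the lines that last bullet hints at might look like this. Only the `VLLM_FLASHINFER_MOE_BACKEND=throughput` env var is taken from the Advanced Deployment Guide; the model ID and flags below are illustrative, not the cookbook's actual command:

```shell
# Sketch of a throughput-oriented vLLM launch (values are illustrative).
# The MoE backend env var is the one the deployment guide mentions for
# offline batch jobs; larger max-num-seqs trades latency for throughput.
export VLLM_FLASHINFER_MOE_BACKEND=throughput
vllm serve nvidia/Nemotron-3-Super \
  --max-num-seqs 256 \
  --enable-expert-parallel
```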

Gap

  • Every existing cookbook demonstrates single-request or streaming use cases
  • No cookbook shows how to run high-throughput batch inference despite throughput being the headline stat (5x over previous Nemotron Super, 7.5x over Qwen3.5-122B)
  • vLLM's own docs cover offline batch generically but not Nemotron-specific optimizations (EP, latent MoE, NVFP4)
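Turning a measured throughput into a workload estimate, the final item in the summary above, is plain arithmetic. A minimal sketch; every number here is made up for illustration, not a benchmark:

```python
# Extrapolate wall-clock time for a large batch job from a short trial run.
# All numbers are illustrative placeholders, not measured results.
measured_docs = 1_000        # docs processed in the trial run
measured_seconds = 80.0      # trial wall time
workload_docs = 3_500_000    # e.g. a patent-scale corpus

throughput = measured_docs / measured_seconds        # docs per second
est_hours = workload_docs / throughput / 3600        # projected wall time

print(f"{throughput:.1f} docs/s -> ~{est_hours:.1f} h for {workload_docs:,} docs")
```

The estimate assumes throughput stays flat as the job scales, which is roughly true for saturating batch workloads but should be re-checked at the target concurrency.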

Competitive Context

  • Qwen 3.5 has detailed batch processing examples in its docs
  • No competing project combines Nemotron-specific configs with practical batch workloads

This contribution was developed with AI assistance (Claude Code).

Demonstrates offline batch inference, concurrent server requests,
bulk document classification, and throughput benchmarking with vLLM.

Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
mvanhorn force-pushed the osc/107-batch-throughput-cookbook branch from d40e08f to d0ca395 on March 12, 2026 07:09