Add high-throughput batch processing cookbook for Nemotron 3 Super #109

Open
mvanhorn wants to merge 1 commit into NVIDIA-NeMo:main from mvanhorn:osc/107-batch-throughput-cookbook
Conversation

@mvanhorn
Contributor

Fixes #107

Summary

Adds usage-cookbook/Nemotron-3-Super/batch_throughput_cookbook.ipynb demonstrating high-throughput batch inference with Nemotron 3 Super via vLLM.

What's covered:

  • Throughput-optimized server configuration (CUTLASS backend, EP, batch sizing) vs latency-optimized defaults
  • Offline batch inference with vLLM's LLM class for zero-server batch processing
  • Async concurrent requests with httpx and bounded concurrency
  • Practical use case: bulk document classification with structured JSON output
  • Concurrency scaling analysis showing throughput vs concurrency level
  • JSONL file-to-file processing pattern for production pipelines
  • Throughput benchmarking guidance with vllm bench serve
  • Scaling estimates from measured throughput to larger workloads
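The async bounded-concurrency and JSONL file-to-file patterns listed above can be sketched with the standard library alone. The cookbook drives a running vLLM server with httpx; here a stub stands in for the HTTP call, and all names (`classify`, `process_jsonl`, the `MAX_CONCURRENCY` value) are illustrative rather than taken from the notebook:

```python
import asyncio
import json
from pathlib import Path

MAX_CONCURRENCY = 8  # cap on in-flight requests; tune to the server's batch size

async def classify(doc: str) -> dict:
    # Placeholder for the real call (an httpx POST to the vLLM
    # OpenAI-compatible endpoint in the cookbook).
    await asyncio.sleep(0)  # simulate awaiting I/O
    return {"label": "patent" if "claim" in doc else "other"}

async def process_jsonl(src: Path, dst: Path) -> None:
    """Read records from src, classify them concurrently, write dst."""
    sem = asyncio.Semaphore(MAX_CONCURRENCY)

    async def worker(line: str) -> dict:
        rec = json.loads(line)
        async with sem:  # bounded concurrency: at most MAX_CONCURRENCY awaits
            rec["result"] = await classify(rec["text"])
        return rec

    lines = src.read_text().splitlines()
    results = await asyncio.gather(*(worker(line) for line in lines))
    dst.write_text("".join(json.dumps(r) + "\n" for r in results))
```

Swapping the stub for a real httpx request keeps the same shape: the semaphore bounds concurrency while `asyncio.gather` keeps the pipeline full.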

Also updates the Nemotron-3-Super cookbook README with the new entry.

Research & Context

Community Signals

  • Reddit r/LocalLLaMA: a user classified 3.5M US patents on a single RTX 5090 with Nemotron 9B (401 upvotes), the most viral piece of Nemotron community content
  • Greptile tested Nemotron 3 Super for bulk code review and reported it "punches far above its weight class" at 12.5 seconds per review
  • The Advanced Deployment Guide mentions VLLM_FLASHINFER_MOE_BACKEND=throughput for "offline batch jobs" but never demonstrates it
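A throughput-mode launch along the lines that last bullet hints at might look like this. Only the `VLLM_FLASHINFER_MOE_BACKEND=throughput` env var is taken from the Advanced Deployment Guide; the model ID and flags below are illustrative, not the cookbook's actual command:

```shell
# Sketch of a throughput-oriented vLLM launch (values are illustrative).
# The MoE backend env var is the one the deployment guide mentions for
# offline batch jobs; larger max-num-seqs trades latency for throughput.
export VLLM_FLASHINFER_MOE_BACKEND=throughput
vllm serve nvidia/Nemotron-3-Super \
  --max-num-seqs 256 \
  --enable-expert-parallel
```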

Gap

  • Every existing cookbook demonstrates single-request or streaming use cases
  • No cookbook shows how to run high-throughput batch inference despite throughput being the headline stat (5x over previous Nemotron Super, 7.5x over Qwen3.5-122B)
  • vLLM's own docs cover offline batch generically but not Nemotron-specific optimizations (EP, latent MoE, NVFP4)
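Turning a measured throughput into a workload estimate, the final item in the summary above, is plain arithmetic. A minimal sketch; every number here is made up for illustration, not a benchmark:

```python
# Extrapolate wall-clock time for a large batch job from a short trial run.
# All numbers are illustrative placeholders, not measured results.
measured_docs = 1_000        # docs processed in the trial run
measured_seconds = 80.0      # trial wall time
workload_docs = 3_500_000    # e.g. a patent-scale corpus

throughput = measured_docs / measured_seconds        # docs per second
est_hours = workload_docs / throughput / 3600        # projected wall time

print(f"{throughput:.1f} docs/s -> ~{est_hours:.1f} h for {workload_docs:,} docs")
```

The estimate assumes throughput stays flat as the job scales, which is roughly true for saturating batch workloads but should be re-checked at the target concurrency.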

Competitive Context

  • Qwen 3.5 has detailed batch processing examples in its docs
  • No competing project combines Nemotron-specific configs with practical batch workloads

This contribution was developed with AI assistance (Claude Code).

Demonstrates offline batch inference, concurrent server requests,
bulk document classification, and throughput benchmarking with vLLM.

Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
mvanhorn force-pushed the osc/107-batch-throughput-cookbook branch from d40e08f to d0ca395 on March 12, 2026 07:09