Add high-throughput batch processing cookbook for Nemotron 3 Super #107
Description
Problem:
Nemotron 3 Super achieves 5x throughput over the previous Nemotron Super and 7.5x over Qwen3.5-122B, but the repository has no cookbook demonstrating how to actually run high-throughput batch inference. The Advanced Deployment Guide mentions the throughput backend mode for "offline batch jobs" but never demonstrates it.
The community is already building batch workloads organically (classifying 3.5M patents on a single RTX 5090, bulk code review at 12.5s per file), but there is no official guidance on optimal configuration.
Proposed Solution:
Add usage-cookbook/Nemotron-3-Super/batch_throughput_cookbook.ipynb demonstrating:
- Server configuration for throughput (CUTLASS backend, EP, batch size tuning)
- Offline batch inference with vLLM's `LLM` class
- Async concurrent requests against an OpenAI-compatible server
- Practical use case: bulk document classification with structured JSON output
- Throughput measurement and latency vs throughput backend comparison
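To make the async concurrent-request item concrete, here is a minimal sketch of the fan-out pattern the cookbook could use against an OpenAI-compatible server. The endpoint call is stubbed (`classify_document` is a hypothetical placeholder, not an actual API from this repo) so the control flow is self-contained; the real notebook would issue chat-completion requests instead.

```python
import asyncio

MAX_CONCURRENCY = 8  # tune to the server's effective batch size

async def classify_document(doc: str) -> str:
    # Placeholder for a real chat-completion call that would return a
    # structured JSON label from the server.
    await asyncio.sleep(0)  # yield control, as a network call would
    return f'{{"label": "example", "chars": {len(doc)}}}'

async def run_batch(docs: list[str]) -> list[str]:
    sem = asyncio.Semaphore(MAX_CONCURRENCY)

    async def bounded(doc: str) -> str:
        async with sem:  # cap the number of in-flight requests
            return await classify_document(doc)

    # Launch all requests concurrently; gather preserves input order.
    return await asyncio.gather(*(bounded(d) for d in docs))

results = asyncio.run(run_batch(["doc one", "doc two", "doc three"]))
```

Bounding concurrency with a semaphore keeps the server's queue saturated without overwhelming it, which is the usual trade-off when measuring throughput against a batching backend.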
The notebook follows the existing vllm_cookbook.ipynb pattern and requires no external API keys.
Why now:
With the Nemotron 3 Super launch and GTC next week, community interest in throughput optimization is at its peak. Official guidance would validate NVIDIA's throughput claims with reproducible benchmarks.
I'm willing to implement this. Happy to adjust based on feedback.