Add high-throughput batch processing cookbook for Nemotron 3 Super #107

@mvanhorn

Description

Problem:
Nemotron 3 Super achieves 5x throughput over the previous Nemotron Super and 7.5x over Qwen3.5-122B, but the repository has no cookbook demonstrating how to actually run high-throughput batch inference. The Advanced Deployment Guide mentions the throughput backend mode for "offline batch jobs" but never demonstrates it.

The community is already building batch workloads organically (classifying 3.5M patents on a single RTX 5090, bulk code review at 12.5 s per file), but there's no official guidance on optimal configuration.

Proposed Solution:
Add usage-cookbook/Nemotron-3-Super/batch_throughput_cookbook.ipynb demonstrating:

  1. Server configuration for throughput (CUTLASS backend, EP, batch size tuning)
  2. Offline batch inference with vLLM's LLM class
  3. Async concurrent requests against an OpenAI-compatible server
  4. Practical use case: bulk document classification with structured JSON output
  5. Throughput measurement and latency vs throughput backend comparison
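For item 1, the server launch could look something like the following. This is a sketch only: the model ID is a placeholder, and the exact flags for the CUTLASS backend and EP should be verified against the vLLM version the cookbook targets (`--enable-expert-parallel` and `--max-num-seqs` are standard vLLM options; backend selection may differ for the actual release):

```shell
# Hypothetical throughput-oriented launch (placeholder model name).
#   --enable-expert-parallel : expert parallelism (EP) for the MoE layers
#   --max-num-seqs           : a large in-flight batch favors throughput over latency
vllm serve nvidia/Nemotron-3-Super \
  --enable-expert-parallel \
  --max-num-seqs 512 \
  --gpu-memory-utilization 0.95
```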

The notebook follows the existing vllm_cookbook.ipynb pattern and requires no external API keys.

Why now:
With the Nemotron 3 Super launch and GTC next week, community interest in throughput optimization is at its peak. Official guidance would validate NVIDIA's throughput claims with reproducible benchmarks.

I'm willing to implement this. Happy to adjust based on feedback.
