
Conversation

farshadghodsian
Contributor

@farshadghodsian farshadghodsian commented Aug 29, 2025

What does this PR do?

Type of change: New example

Overview:
Adding a QAT Jupyter notebook example that walks users through applying Quantization Aware Training (QAT) to an LLM, Meta's Llama-3.1-8b, and serving it via the TensorRT-LLM Docker container.

Usage

See the new QAT walkthrough notebook in ./examples/llm_qat/notebooks for usage instructions.

Testing

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes
  • Did you write any new necessary tests?: Yes
  • Did you add or update any necessary documentation?: Yes
  • Did you update Changelog?: No

Additional Information

Summary by CodeRabbit

  • New Features

    • End-to-end Quantization Aware Training walkthrough for large models: NVFP4 calibration, quantization, QAT training, checkpointing, export, and TensorRT-LLM deployment with example inference.
  • Documentation

    • Step-by-step notebook covering prerequisites, model/dataset setup, training/calibration/quantization workflow, sample outputs, config notes, and Docker-based deployment/serving with an example request.
  • Chores

    • Added notebook dependencies: ipywidgets, nvidia-modelopt[all], and trl.
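
For orientation, a minimal sketch of the flow the notebook covers (not the notebook's exact code; the model name, calibration loop, and export path are placeholders, and the ModelOpt calls assume the nvidia-modelopt APIs discussed in the review below):

import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-8B"  # placeholder; see the review discussion about the model used in the notebook
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

def forward_loop(m):
    # run a small calibration subset through the model (see the calibration discussion below)
    ...

# 1. Insert NVFP4 quantizer nodes and calibrate
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)

# 2. Fine-tune with the quantizers in place (QAT), e.g. via TRL's SFTTrainer (omitted here)

# 3. Export a Hugging Face-style checkpoint that TensorRT-LLM can serve
export_hf_checkpoint(model, export_dir="qat_nvfp4_hf")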

@farshadghodsian farshadghodsian requested a review from a team as a code owner August 29, 2025 20:55

copy-pr-bot bot commented Aug 29, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@farshadghodsian farshadghodsian force-pushed the QAT-Walkthrough-Notebook branch 2 times, most recently from 5d1bf5a to e587a06 Compare August 29, 2025 20:56
@kevalmorabia97 kevalmorabia97 requested review from realAsma and removed request for Edwardf0t1 August 30, 2025 06:12
@farshadghodsian farshadghodsian force-pushed the QAT-Walkthrough-Notebook branch 4 times, most recently from 1b37baa to 01b1a65 Compare September 3, 2025 23:17
@farshadghodsian farshadghodsian requested review from a team as code owners September 3, 2025 23:17
@farshadghodsian farshadghodsian force-pushed the QAT-Walkthrough-Notebook branch 2 times, most recently from 374d9a6 to a531dcb Compare September 5, 2025 22:03

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

♻️ Duplicate comments (4)
examples/llm_qat/notebooks/QAT_QAD_Walkthrough.ipynb (4)

346-355: Pass tokenizer via tokenizer= or rely on auto-detection; don’t use processing_class for tokenizers.

Keeps behavior aligned with TRL expectations and avoids subtle tokenization issues.

 trainer = SFTTrainer(
     model=model,
     args=training_args,
     train_dataset=dataset[script_args.dataset_train_split],
     eval_dataset=dataset[script_args.dataset_test_split],
-    processing_class=tokenizer,
+    tokenizer=tokenizer,  # or remove entirely to rely on auto-detection
 )

26-41: Fix “Dependancies” typos and align dependency list with requirements.

Multiple typos and omissions (datasets/accelerate/peft). Also tweak the path sentence.

-## Installing Prerequisites and Dependancies
+## Installing Prerequisites and Dependencies
-If you haven't already, install the required dependencies for this notebook. Key dependancies include:
+If you haven't already, install the required dependencies for this notebook. Key dependencies include:
 - nvidia-modelopt
 - torch
 - transformers
-- jupyterlab
+- datasets
+- accelerate
+- peft
+- jupyterlab
-
-This repo contains a `examples/llm_qat/notebooks/requirements.txt` file that can be used to install all required dependancies.
+This repository contains `examples/llm_qat/notebooks/requirements.txt` to install all required dependencies.

374-391: Calibration forward loop should be eval+no-grad and move tensors to model device.

Prevents unnecessary grad tracking and device mismatch errors.

 def forward_loop(model):
-    for data in data_loader:
-        model(**data)
+    model.eval()
+    # best-effort device selection
+    try:
+        device = next(model.parameters()).device
+    except StopIteration:
+        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+    with torch.inference_mode():
+        for data in data_loader:
+            data = {k: (v.to(device) if hasattr(v, "to") else v) for k, v in data.items()}
+            _ = model(**data)
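
For completeness, the data_loader used above might be built from a small eval subset along these lines (the column name and sample count are assumptions, not the notebook's exact code):

# Assumed: `dataset` and `tokenizer` are already loaded as earlier in the notebook.
calib_texts = dataset["test"]["text"][:64]  # small eval subset; the "text" column is an assumption
data_loader = [
    tokenizer(t, return_tensors="pt", truncation=True, max_length=512)
    for t in calib_texts
]  # any iterable of kwargs dicts works with the forward_loop above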

16-19: QAT is training-time (not post-training) + grammar fix.

Clarify definition and fix subject-verb and tense issues.

-**Quantization Aware Training (QAT)** is a method that learn the effects of quantization during neural network post-training to preserve accuracy when deploying models in very-low-precision formats. QAT inserts quantizer nodes into the computational graph, mimicking the rounding and clamping operations that occur during actual quantization. This allows the model to adapt its weights and activations to mitigate accuracy loss.
-
-This notebook demonstrates how to apply Quantization Aware Training (QAT) to an LLM, Qwen3-8b in this example, with NVIDIA's TensorRT Model Optimizer (ModelOpt) QAT toolkit. We walk through downloading and loading the model, calibrates on a small eval subset, applying NVFP4 quantization and finally deploying the quantized model to TensorRT-LLM.
+**Quantization Aware Training (QAT)** simulates quantization during training (not post‑training) so the model adapts to low‑precision rounding and clamping, preserving accuracy at deployment.
+
+This notebook applies QAT to Qwen/Qwen3‑8B using NVIDIA’s TensorRT Model Optimizer (ModelOpt) QAT toolkit. We walk through downloading and loading the model, calibrating on a small eval subset, applying NVFP4 quantization, and finally deploying the quantized model to TensorRT‑LLM.
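
To make "simulates quantization during training" concrete, a toy fake-quantization round trip (illustrative only; ModelOpt's inserted quantizer nodes do this, plus straight-through gradients, internally):

import torch

def fake_quantize(w, num_bits=4):
    # quantize-dequantize in the forward pass so training "sees" rounding/clamping error
    qmax = 2 ** (num_bits - 1) - 1
    scale = w.abs().amax() / qmax
    return torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale

w = torch.randn(4, 4)
print((w - fake_quantize(w)).abs().max())  # the error QAT teaches the model to tolerate
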
🧹 Nitpick comments (5)
examples/llm_qat/notebooks/QAT_QAD_Walkthrough.ipynb (5)

611-617: Make the Docker run command more portable.

Mount the notebook requirements dir explicitly; also keep the image tag as a placeholder to avoid encouraging RCs.

-docker run --rm --ipc=host -it \
+docker run --rm --ipc=host -it \
   --ulimit stack=67108864   --ulimit memlock=-1 \
   --gpus all   -p 8000:8000   -e TRTLLM_ENABLE_PDL=1 \
   -v ~/.cache:/root/.cache:rw --name tensorrt_llm \
   -v $(pwd)/qwen3-8b-qat-multilingual-reasoner/:/app/tensorrt_llm/qat \
-  nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc2  /bin/bash
+  nvcr.io/nvidia/tensorrt-llm/release:<LATEST_TAG> /bin/bash

713-718: Parameterize tensor/pipeline parallelism or document GPU requirement.

--tp_size 8 will fail on machines with <8 visible GPUs.

-trtllm-serve /app/tensorrt_llm/saved_models_checkpoint-450_nvfp4_hf/  \
-  --max_batch_size 1 --max_num_tokens 1024 \
-  --max_seq_len 4096 --tp_size 8 --pp_size 1 \
+trtllm-serve /app/tensorrt_llm/saved_models_checkpoint-450_nvfp4_hf/  \
+  --max_batch_size 1 --max_num_tokens 1024 \
+  --max_seq_len 4096 --tp_size ${TP_SIZE:-1} --pp_size ${PP_SIZE:-1} \
   --host 0.0.0.0 --port 8000 \
   --kv_cache_free_gpu_memory_fraction 0.95
# Note: set TP_SIZE/PP_SIZE according to available GPUs and engine build.

774-786: Align “model” field in curl with the served model name.

Reduce confusion by matching the folder name (or use “default” if the server ignores it).

-curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
-    "model": "Qwen3/qwen3-8b-qat-multilingual-reasoner",
+curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
+    "model": "saved_models_checkpoint-450_nvfp4_hf",
     "messages": [
         {
             "role": "user",
             "content": "What is NVIDIAs advantage for inference?"
         }
     ],
     "max_tokens": 1024,
     "top_p": 0.9
 }' -w "\n"
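
If unsure what name the server registered, one could query the models endpoint first (assuming trtllm-serve exposes the standard OpenAI-compatible route):

import json, urllib.request

# Assumed: the server started in the docker/trtllm-serve steps above is reachable on localhost:8000.
with urllib.request.urlopen("http://localhost:8000/v1/models") as resp:
    print(json.dumps(json.load(resp), indent=2))  # use the returned "id" as the "model" field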

205-212: Optional: pass use_fast=True when available.

Tokenizer perf can improve with fast tokenizers.

-tokenizer = AutoTokenizer.from_pretrained(
-    model_args.model_name_or_path,
-)
+tokenizer = AutoTokenizer.from_pretrained(
+    model_args.model_name_or_path,
+    use_fast=True,
+)

401-411: Quantization call: clarify that some configs don’t need calibration.

Minor doc tweak to reduce confusion about when forward_loop is used.

-mtq.quantize(model, quantization_config, forward_loop)
+mtq.quantize(model, quantization_config, forward_loop)  # pass forward_loop only for configs that require calibration
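
As a quick sanity check after the quantize call, the inserted quantizers can be listed (assuming the ModelOpt summary utility is available in the pinned version):

import modelopt.torch.quantization as mtq

# Print which modules received quantizers and their configured formats
mtq.print_quant_summary(model)
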
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 374d9a6 and a531dcb.

📒 Files selected for processing (2)
  • examples/llm_qat/notebooks/QAT_QAD_Walkthrough.ipynb (1 hunks)
  • examples/llm_qat/notebooks/requirements.txt (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • examples/llm_qat/notebooks/requirements.txt
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-05T21:48:21.342Z
Learnt from: farshadghodsian
PR: NVIDIA/TensorRT-Model-Optimizer#278
File: examples/llm_qat/notebooks/QAT_QAD_Walkthrough.ipynb:346-355
Timestamp: 2025-09-05T21:48:21.342Z
Learning: SFTTrainer from the TRL library can automatically detect and use tokenizers when they are already present in the model directory or model configuration, making explicit tokenizer parameter specification optional in such cases.

Applied to files:

  • examples/llm_qat/notebooks/QAT_QAD_Walkthrough.ipynb
🔇 Additional comments (1)
examples/llm_qat/notebooks/QAT_QAD_Walkthrough.ipynb (1)

286-301: Use correct SFTConfig parameter names
Replace the invalid max_length with max_seq_length and eval_strategy with evaluation_strategy in your SFTConfig call.

 training_args = SFTConfig(
     output_dir="qwen3-8b-qat-multilingual-reasoner",
     num_train_epochs=1,
     learning_rate=2e-5,
     per_device_train_batch_size=1,
     per_device_eval_batch_size=1,
     gradient_accumulation_steps=2,
-    max_length=4096,
+    max_seq_length=4096,
     warmup_ratio=0.03,
-    eval_strategy="steps",
+    evaluation_strategy="steps",
     eval_on_start=True,
     logging_steps=50,
     save_steps=450,
     eval_steps=50,
     save_total_limit=2,
 )

(max_seq_length is the supported truncation parameter in SFTConfig) (huggingface.co)
(use evaluation_strategy to set evaluation intervals) (huggingface.co)

@kevalmorabia97
Collaborator

@farshadghodsian your commits are still not verified with an ssh key. Please refer to the steps here: https://github.com/NVIDIA/TensorRT-Model-Optimizer?tab=contributing-ov-file#%EF%B8%8F-signing-your-work

@farshadghodsian
Contributor Author

> @farshadghodsian your commits are still not verified with an ssh key. Please refer to the steps here: https://github.com/NVIDIA/TensorRT-Model-Optimizer?tab=contributing-ov-file#%EF%B8%8F-signing-your-work

I see my issue now. I forgot to add my signing key to my GitHub account. My commits should now be verified. ✅

@kevalmorabia97 kevalmorabia97 enabled auto-merge (squash) September 8, 2025 16:50

codecov bot commented Sep 8, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.88%. Comparing base (358b0c6) to head (4926fa7).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #278   +/-   ##
=======================================
  Coverage   73.88%   73.88%           
=======================================
  Files         172      172           
  Lines       17444    17444           
=======================================
  Hits        12888    12888           
  Misses       4556     4556           

☔ View full report in Codecov by Sentry.

@kevalmorabia97
Collaborator

Code quality checks are failing:
https://github.com/NVIDIA/TensorRT-Model-Optimizer/actions/runs/17557846268/job/49867165107?pr=278
You can check CONTRIBUTING.md for steps to fix this

@farshadghodsian farshadghodsian force-pushed the QAT-Walkthrough-Notebook branch 5 times, most recently from 5ca089e to 0a95934 Compare September 10, 2025 21:38
@farshadghodsian farshadghodsian force-pushed the QAT-Walkthrough-Notebook branch 3 times, most recently from 2bc96c7 to 638b8dd Compare September 10, 2025 22:23
@kevalmorabia97 kevalmorabia97 enabled auto-merge (squash) September 11, 2025 03:37
@kevalmorabia97
Collaborator

/ok to test 4926fa7

@kevalmorabia97 kevalmorabia97 merged commit 76e8ce2 into NVIDIA:main Sep 11, 2025
22 checks passed
benchislett pushed a commit that referenced this pull request Sep 15, 2025