Add QAT Walkthrough Notebook example #278
Conversation
Force-pushed from 5d1bf5a to e587a06
Force-pushed from 1b37baa to 01b1a65
Force-pushed from 374d9a6 to a531dcb
Actionable comments posted: 5
♻️ Duplicate comments (4)
examples/llm_qat/notebooks/QAT_QAD_Walkthrough.ipynb (4)
346-355: Pass the tokenizer via tokenizer= or rely on auto-detection; don't use processing_class for tokenizers. Keeps behavior aligned with TRL expectations and avoids subtle tokenization issues.

 trainer = SFTTrainer(
     model=model,
     args=training_args,
     train_dataset=dataset[script_args.dataset_train_split],
     eval_dataset=dataset[script_args.dataset_test_split],
-    processing_class=tokenizer,
+    tokenizer=tokenizer,  # or remove entirely to rely on auto-detection
 )
26-41: Fix "Dependancies" typos and align the dependency list with requirements. Multiple typos and omissions (datasets/accelerate/peft). Also tweak the path sentence.

-## Installing Prerequisites and Dependancies
+## Installing Prerequisites and Dependencies

-If you haven't already, install the required dependencies for this notebook. Key dependancies include:
+If you haven't already, install the required dependencies for this notebook. Key dependencies include:
 - nvidia-modelopt
 - torch
 - transformers
-- jupyterlab
+- datasets
+- accelerate
+- peft
+- jupyterlab

-This repo contains a `examples/llm_qat/notebooks/requirements.txt` file that can be used to install all required dependancies.
+This repository contains `examples/llm_qat/notebooks/requirements.txt` to install all required dependencies.
374-391: The calibration forward loop should run in eval mode with no grad tracking and move tensors to the model device. Prevents unnecessary grad tracking and device-mismatch errors.

 def forward_loop(model):
-    for data in data_loader:
-        model(**data)
+    model.eval()
+    # best-effort device selection
+    try:
+        device = next(model.parameters()).device
+    except StopIteration:
+        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+    with torch.inference_mode():
+        for data in data_loader:
+            data = {k: (v.to(device) if hasattr(v, "to") else v) for k, v in data.items()}
+            _ = model(**data)
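For context on what this loop iterates, here is a minimal, hedged sketch of how such a data_loader could be built; the Wikitext subset, batch size, and sequence length below are illustrative stand-ins for the notebook's actual eval subset, not values from the notebook itself.

import torch
from datasets import load_dataset
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # padding is needed for batched calibration

# Placeholder calibration text; the notebook uses its own eval subset instead.
texts = [t for t in load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"] if t.strip()][:128]

def collate(batch):
    # Tokenize on the fly; returns input_ids/attention_mask tensors suitable for model(**data).
    return tokenizer(batch, return_tensors="pt", padding=True, truncation=True, max_length=512)

data_loader = DataLoader(texts, batch_size=4, collate_fn=collate)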
16-19: QAT is training-time (not post-training) + grammar fix. Clarify the definition and fix subject-verb and tense issues.

-**Quantization Aware Training (QAT)** is a method that learn the effects of quantization during neural network post-training to preserve accuracy when deploying models in very-low-precision formats. QAT inserts quantizer nodes into the computational graph, mimicking the rounding and clamping operations that occur during actual quantization. This allows the model to adapt its weights and activations to mitigate accuracy loss.
-
-This notebook demonstrates how to apply Quantization Aware Training (QAT) to an LLM, Qwen3-8b in this example, with NVIDIA's TensorRT Model Optimizer (ModelOpt) QAT toolkit. We walk through downloading and loading the model, calibrates on a small eval subset, applying NVFP4 quantization and finally deploying the quantized model to TensorRT-LLM.
+**Quantization Aware Training (QAT)** simulates quantization during training (not post-training) so the model adapts to low-precision rounding and clamping, preserving accuracy at deployment.
+
+This notebook applies QAT to Qwen/Qwen3-8B using NVIDIA's TensorRT Model Optimizer (ModelOpt) QAT toolkit. We walk through downloading and loading the model, calibrating on a small eval subset, applying NVFP4 quantization, and finally deploying the quantized model to TensorRT-LLM.
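To make the calibrate-then-fine-tune flow described here concrete, a compact, hedged sketch follows. Assumptions not taken from the notebook: NVFP4_DEFAULT_CFG is available in the installed ModelOpt version, and data_loader is a small calibration loader like the one sketched in the comment on lines 374-391.

import modelopt.torch.quantization as mtq
import torch
from transformers import AutoModelForCausalLM

# Load the base model in a training-friendly dtype.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B", torch_dtype=torch.bfloat16, device_map="auto"
)

def forward_loop(m):
    # Run a handful of calibration batches so activation ranges can be observed.
    m.eval()
    with torch.inference_mode():
        for batch in data_loader:
            m(**{k: v.to(m.device) for k, v in batch.items()})

# Insert fake-quantization (quantizer) nodes and calibrate them.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)

# The quantized model is then fine-tuned as usual (e.g., with TRL's SFTTrainer),
# letting the weights adapt to the simulated low-precision rounding and clamping.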
🧹 Nitpick comments (5)
examples/llm_qat/notebooks/QAT_QAD_Walkthrough.ipynb (5)
611-617: Make the Docker run command more portable. Mount the notebook requirements dir explicitly; also keep the image tag as a placeholder to avoid encouraging RCs.

-docker run --rm --ipc=host -it \
+docker run --rm --ipc=host -it \
     --ulimit stack=67108864 --ulimit memlock=-1 \
     --gpus all -p 8000:8000 -e TRTLLM_ENABLE_PDL=1 \
     -v ~/.cache:/root/.cache:rw --name tensorrt_llm \
     -v $(pwd)/qwen3-8b-qat-multilingual-reasoner/:/app/tensorrt_llm/qat \
-    nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc2 /bin/bash
+    nvcr.io/nvidia/tensorrt-llm/release:<LATEST_TAG> /bin/bash
713-718: Parameterize tensor/pipeline parallelism or document the GPU requirement. --tp_size 8 will fail on machines with fewer than 8 visible GPUs.

-trtllm-serve /app/tensorrt_llm/saved_models_checkpoint-450_nvfp4_hf/ \
-    --max_batch_size 1 --max_num_tokens 1024 \
-    --max_seq_len 4096 --tp_size 8 --pp_size 1 \
+trtllm-serve /app/tensorrt_llm/saved_models_checkpoint-450_nvfp4_hf/ \
+    --max_batch_size 1 --max_num_tokens 1024 \
+    --max_seq_len 4096 --tp_size ${TP_SIZE:-1} --pp_size ${PP_SIZE:-1} \
     --host 0.0.0.0 --port 8000 \
     --kv_cache_free_gpu_memory_fraction 0.95
 # Note: set TP_SIZE/PP_SIZE according to available GPUs and engine build.
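A hedged companion sketch for deriving TP_SIZE from the GPUs visible inside the container (nvidia-smi -L prints one line per GPU); adjust if the deployment targets a fixed parallelism:

# Derive a default tensor-parallel size from the visible GPU count.
export TP_SIZE=$(nvidia-smi -L | wc -l)
export PP_SIZE=1
echo "Serving with --tp_size ${TP_SIZE} --pp_size ${PP_SIZE}"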
774-786: Align the "model" field in the curl request with the served model name. Reduces confusion by matching the folder name (or use "default" if the server ignores it).

-curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
-  "model": "Qwen3/qwen3-8b-qat-multilingual-reasoner",
+curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
+  "model": "saved_models_checkpoint-450_nvfp4_hf",
   "messages": [
     { "role": "user", "content": "What is NVIDIAs advantage for inference?" }
   ],
   "max_tokens": 1024,
   "top_p": 0.9
 }' -w "\n"
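If helpful, the served model name can be confirmed first, assuming trtllm-serve exposes the standard OpenAI-compatible models endpoint:

# List registered model IDs; reuse the returned id in the "model" field above.
curl -s localhost:8000/v1/models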
205-212: Optional: pass use_fast=True when available. Tokenizer performance can improve with fast tokenizers.

-tokenizer = AutoTokenizer.from_pretrained(
-    model_args.model_name_or_path,
-)
+tokenizer = AutoTokenizer.from_pretrained(
+    model_args.model_name_or_path,
+    use_fast=True,
+)
401-411: Quantization call: clarify that some configs don't need calibration. Minor doc tweak to reduce confusion about when forward_loop is used.

-mtq.quantize(model, quantization_config, forward_loop)
+mtq.quantize(model, quantization_config, forward_loop)  # pass forward_loop only for configs that require calibration
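As an optional sanity check right after this call (assuming the installed ModelOpt exposes this helper), the quantizer placement can be inspected:

import modelopt.torch.quantization as mtq

# Print which modules received quantizers and their effective settings.
mtq.print_quant_summary(model)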
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- examples/llm_qat/notebooks/QAT_QAD_Walkthrough.ipynb (1 hunks)
- examples/llm_qat/notebooks/requirements.txt (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- examples/llm_qat/notebooks/requirements.txt
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-05T21:48:21.342Z
Learnt from: farshadghodsian
PR: NVIDIA/TensorRT-Model-Optimizer#278
File: examples/llm_qat/notebooks/QAT_QAD_Walkthrough.ipynb:346-355
Timestamp: 2025-09-05T21:48:21.342Z
Learning: SFTTrainer from the TRL library can automatically detect and use tokenizers when they are already present in the model directory or model configuration, making explicit tokenizer parameter specification optional in such cases.
Applied to files:
examples/llm_qat/notebooks/QAT_QAD_Walkthrough.ipynb
🔇 Additional comments (1)
examples/llm_qat/notebooks/QAT_QAD_Walkthrough.ipynb (1)
286-301: Use correct SFTConfig parameter names. Replace the invalid max_length with max_seq_length and eval_strategy with evaluation_strategy in your SFTConfig call.

 training_args = SFTConfig(
     output_dir="qwen3-8b-qat-multilingual-reasoner",
     num_train_epochs=1,
     learning_rate=2e-5,
     per_device_train_batch_size=1,
     per_device_eval_batch_size=1,
     gradient_accumulation_steps=2,
-    max_length=4096,
+    max_seq_length=4096,
     warmup_ratio=0.03,
-    eval_strategy="steps",
+    evaluation_strategy="steps",
     eval_on_start=True,
     logging_steps=50,
     save_steps=450,
     eval_steps=50,
     save_total_limit=2,
 )

(max_seq_length is the supported truncation parameter in SFTConfig) (huggingface.co)
(use evaluation_strategy to set evaluation intervals) (huggingface.co)
Force-pushed from a531dcb to e0caa1c
@farshadghodsian your commits are still not verified with an SSH key. Please refer to the steps here: https://github.com/NVIDIA/TensorRT-Model-Optimizer?tab=contributing-ov-file#%EF%B8%8F-signing-your-work
I see my issue now. I forgot to add my signing key to my GitHub account. My commits should now be verified. ✅
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@           Coverage Diff           @@
## main #278 +/- ##
=======================================
Coverage 73.88% 73.88%
=======================================
Files 172 172
Lines 17444 17444
=======================================
Hits 12888 12888
Misses 4556 4556
☔ View full report in Codecov by Sentry.
Code quality checks are failing: |
Force-pushed from 5ca089e to 0a95934
Force-pushed from 2bc96c7 to 638b8dd
Signed-off-by: Farshad Ghodsian <[email protected]>
Signed-off-by: Farshad Ghodsian <[email protected]>
Force-pushed from 638b8dd to 4926fa7
/ok to test 4926fa7
Signed-off-by: Farshad Ghodsian <[email protected]>
What does this PR do?
Type of change: New example
Overview:
Adding a QAT Jupyter Notebook example that walks users through how to apply Quantization Aware Training (QAT) to an LLM, Meta's Llama-3.1-8b, and serve it via the TensorRT-LLM Docker container.
Usage
See the new QAT Walkthrough Notebook in ./examples/llm_qat/notebooks for usage instructions.
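A minimal launch sketch, assuming a working Python environment at the repository root (exact setup may differ):

# Install the notebook dependencies and open the walkthrough.
pip install -r examples/llm_qat/notebooks/requirements.txt
jupyter lab examples/llm_qat/notebooks/QAT_QAD_Walkthrough.ipynb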
Testing
Before your PR is "Ready for review"
Additional Information
Summary by CodeRabbit
New Features
Documentation
Chores