diff --git a/examples/llm_qat/notebooks/QAT_QAD_Walkthrough.ipynb b/examples/llm_qat/notebooks/QAT_QAD_Walkthrough.ipynb new file mode 100644 index 00000000..bc271325 --- /dev/null +++ b/examples/llm_qat/notebooks/QAT_QAD_Walkthrough.ipynb @@ -0,0 +1,853 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "b54dd4ac-5b4f-42fb-af1a-8f3001abd08a", + "metadata": {}, + "source": [ + "# NVIDIA ModelOpt Quantization Aware Training (QAT) Walkthrough" + ] + }, + { + "cell_type": "markdown", + "id": "a695be45-1472-42bc-824e-5c992a487fa7", + "metadata": {}, + "source": [ + "**Quantization Aware Training (QAT)** is a method that learns the effects of quantization during additional training (fine-tuning) so that accuracy is preserved when deploying models in very low-precision formats. QAT inserts quantizer nodes into the computational graph, mimicking the rounding and clamping operations that occur during actual quantization. This allows the model to adapt its weights and activations to mitigate accuracy loss.\n", + "\n", + "This notebook demonstrates how to apply Quantization Aware Training (QAT) to an LLM, Qwen3-8B in this example, with NVIDIA's TensorRT Model Optimizer (ModelOpt) QAT toolkit. We walk through downloading and loading the model, calibrating it on a small evaluation subset, applying NVFP4 quantization, and finally deploying the quantized model to TensorRT-LLM." + ] + }, + { + "cell_type": "markdown", + "id": "c3f7f931-ac38-494e-aea8-ca2cd6d05794", + "metadata": {}, + "source": [ + "## Installing Prerequisites and Dependencies" + ] + }, + { + "cell_type": "markdown", + "id": "d7d4f25f-e569-42cf-8022-bb7cc6f9ea6e", + "metadata": {}, + "source": [ + "If you haven't already, install the required dependencies for this notebook. Key dependencies include:\n", + "\n", + "- nvidia-modelopt\n", + "- torch\n", + "- transformers\n", + "- jupyterlab\n", + "\n", + "This repo contains an `examples/llm_qat/notebooks/requirements.txt` file that can be used to install all required dependencies." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ab464a07-8a19-43a9-a715-81ccef350253", + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "!pip install -r requirements.txt" + ] + }, + { + "cell_type": "markdown", + "id": "99c6ca5d-0d08-4b6c-814f-b8a92a8469f2", + "metadata": {}, + "source": [ + "## Setting HuggingFace Token and Model for Download (Optional)" + ] + }, + { + "cell_type": "markdown", + "id": "09f3c6a7-1c9c-4254-9524-7e253528d9d7", + "metadata": {}, + "source": [ + "If your model requires authentication *(not required for Qwen3-8B)*, set the HF_TOKEN environment variable, making sure to update it to include your token (e.g. `%env HF_TOKEN=hf_abdxyz...`). Be careful to remove your token from this notebook before checking in your code to any public repository." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "eb2071be-df85-4961-92b0-567830a37d71", + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "%env HF_TOKEN=" + ] + }, + { + "cell_type": "markdown", + "id": "4eab1f6a-5855-4f2a-a982-27b7e756deca", + "metadata": {}, + "source": [ + "We will use **Qwen/Qwen3-8B** in this example, but you can change the model name to suit your needs."
+ ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "6d25c2b1-a68b-4748-ac29-e8a893ce1762", + "metadata": {}, + "outputs": [], + "source": [ + "model_name = \"Qwen/Qwen3-8B\"" + ] + }, + { + "cell_type": "markdown", + "id": "41b7c23f-748b-4c9e-8883-d6ca24af46ed", + "metadata": {}, + "source": [ + "## Import Required Libraries" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "6f45b026-0beb-4249-87f5-1263033d6832", + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "ModelOpt save/restore enabled for `transformers` library.\n", + "ModelOpt save/restore enabled for `diffusers` library.\n", + "ModelOpt save/restore enabled for `peft` library.\n" + ] + } + ], + "source": [ + "import modelopt.torch.opt as mto\n", + "\n", + "# Enable automatic save/load of modelopt state huggingface checkpointing\n", + "# modelopt state will be saved automatically to \"modelopt_state.pth\"\n", + "mto.enable_huggingface_checkpointing()" + ] + }, + { + "cell_type": "markdown", + "id": "71c0278d-eeb6-47a0-9630-15783ef684aa", + "metadata": {}, + "source": [ + "## Model Configuration\n", + "\n", + "Configure the model parameters including the model path, attention implementation, and data type. Set up the model configuration and prepare the model loading arguments." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b6af94af-1de6-4cb1-959b-98fb3f4e1932", + "metadata": {}, + "outputs": [], + "source": [ + "from trl import ModelConfig\n", + "\n", + "model_args = ModelConfig(\n", + " model_name_or_path=model_name,\n", + " attn_implementation=\"eager\",\n", + " torch_dtype=\"bfloat16\",\n", + ")\n", + "model_kwargs = {\n", + " \"revision\": model_args.model_revision,\n", + " \"trust_remote_code\": model_args.trust_remote_code,\n", + " \"attn_implementation\": model_args.attn_implementation,\n", + " \"torch_dtype\": model_args.torch_dtype,\n", + " \"use_cache\": False,\n", + " \"device_map\": \"auto\",\n", + "}" + ] + }, + { + "cell_type": "markdown", + "id": "37218e5f-f707-4e72-92e7-3712047ca283", + "metadata": {}, + "source": [ + "## Load the Model and Tokenizer\n", + "\n", + "Load the pre-trained model and tokenizer with the specified configuration." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "5427ffdc-ee1f-4a81-b30b-ee06d978c4fe", + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "d2b24577690f4639bfe5b203146315ab", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Loading checkpoint shards: 0%| | 0/5 [00:00\n", + " \n", + " \n", + " [450/450 09:41, Epoch 1/1]\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Step | Training Loss | Validation Loss | Entropy | Num Tokens | Mean Token Accuracy\n",
+ "0 | No log | 2.010943 | 0.638744 | 0.000000 | 0.653663\n",
+ "50 | 1.140700 | 0.940151 | 0.973159 | 50336.000000 | 0.741705\n",
+ "100 | 0.879300 | 0.899297 | 0.885603 | 108411.000000 | 0.746180\n",
+ "150 | 0.931500 | 0.887055 | 0.896838 | 162282.000000 | 0.748110\n",
+ "200 | 0.902700 | 0.877437 | 0.890031 | 215991.000000 | 0.747941\n",
+ "250 | 0.889600 | 0.875439 | 0.881631 | 266630.000000 | 0.749481\n",
+ "300 | 0.898200 | 0.874025 | 0.868418 | 321852.000000 | 0.751207\n",
+ "350 | 0.929300 | 0.871051 | 0.879694 | 379927.000000 | 0.749429\n",
+ "400 | 0.884500 | 0.873047 | 0.881471 | 433364.000000 | 0.750523\n",
+ "450 | 0.925100 | 0.874171 | 0.879648 | 489988.000000 | 0.749288
" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Saved ModelOpt state to qwen3-8b-qat-multilingual-reasoner/checkpoint-450/modelopt_state.pth\n" + ] + }, + { + "data": { + "text/plain": [ + "TrainOutput(global_step=450, training_loss=0.9311961958143447, metrics={'train_runtime': 598.9058, 'train_samples_per_second': 1.503, 'train_steps_per_second': 0.751, 'total_flos': 2.225056725656371e+16, 'train_loss': 0.9311961958143447, 'epoch': 1.0})" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "trainer.train()" + ] + }, + { + "cell_type": "markdown", + "id": "10acc50c-c876-41d5-8f7e-00dab8842ccd", + "metadata": {}, + "source": [ + "**Note:** The QAT checkpoint for `nvfp4` config can be created by using `--quant_cfg NVFP4_DEFAULT_CFG` in QAT example.\n", + "\n", + "See more details on deployment of quantized model [here](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/cb98b221e1b1730226257e20b4c81ebb259fc2d6/examples/llm_ptq/README.md)." + ] + }, + { + "cell_type": "markdown", + "id": "f9741ead-937c-42a8-9c06-0cb547605ebf", + "metadata": {}, + "source": [ + "# Deploying the QAT Model with TensorRT-LLM" + ] + }, + { + "cell_type": "markdown", + "id": "99b10dfe-dacc-4d8d-96ea-afb6ac9e5bc7", + "metadata": {}, + "source": [ + "Once you have completed the above QAT workflow you should now have a model in the checkpoint folder `./qwen3-8b-qat-multilingual-reasoner/checkpoint-450` which contains the model files including the checkpoints and tokenizer. You can use this folder to serve the QAT NVFP4 model in TensorRT-LLM via Docker." + ] + }, + { + "cell_type": "markdown", + "id": "c40a9b0a-49ba-4860-80d6-eabe06d06e5e", + "metadata": {}, + "source": [ + "## Running TensorRT-LLM Docker" + ] + }, + { + "cell_type": "markdown", + "id": "0bcf77bb-b8ac-4969-a11e-c697e3b5760b", + "metadata": {}, + "source": [ + "The easiest way to get started with TensorRT-LLM is to run a TensorRT-LLM docker container. Visit the [NGC TensorRT-LLM Release page](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release) to find the most up-to-date NGC container image to use.\n", + "\n", + "Open a new bash shell and run the following Docker command to start the TensorRT-LLM container in interactive mode (change the image tag to match latest release):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2b736ebf-5970-4157-8fe1-c2242d3c3950", + "metadata": {}, + "outputs": [], + "source": [ + "%%sh # [run in command line outside notebook]\n", + "\n", + "docker run --rm --ipc=host -it \\\n", + " --ulimit stack=67108864 --ulimit memlock=-1 \\\n", + " --gpus all -p 8000:8000 -e TRTLLM_ENABLE_PDL=1 \\\n", + " -v ~/.cache:/root/.cache:rw --name tensorrt_llm \\\n", + " -v $(pwd)/qwen3-8b-qat-multilingual-reasoner/:/app/tensorrt_llm/qat \\\n", + " nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc2 /bin/bash" + ] + }, + { + "cell_type": "markdown", + "id": "0509e8ee-519d-4905-8663-c49e02712403", + "metadata": {}, + "source": [ + "## Exporting Quantized Model for deployment\n", + "Before deploying the model with TensorRT-LLM you will need to export the model checkpoint files. This is similar to the step you take for a quantized PTQ Model. 
To export the unified Hugging Face checkpoints, which can be deployed on TensorRT-LLM (PyTorch backend), vLLM, and SGLang, you will need to run the [huggingface_example.sh](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/examples/llm_ptq/scripts/huggingface_example.sh) script found in the TensorRT Model Optimizer repo." + ] + }, + { + "cell_type": "markdown", + "id": "515f0e2b-0b27-44b7-bec7-2fcda65df140", + "metadata": {}, + "source": [ + "**Clone the TensorRT Model Optimizer repo inside the Docker container**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "363b64fc-6d18-4b40-bd65-e90e55305b03", + "metadata": {}, + "outputs": [], + "source": [ + "%%sh # [run in TensorRT-LLM container]\n", + "\n", + "git clone https://github.com/NVIDIA/TensorRT-Model-Optimizer.git" + ] + }, + { + "cell_type": "markdown", + "id": "67dbd985-f7ae-4f33-a450-64f4ce5b2c4a", + "metadata": {}, + "source": [ + "**Install ModelOpt prerequisites**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "44be970c-3567-4108-8009-25b6312c1eb8", + "metadata": {}, + "outputs": [], + "source": [ + "%%sh # [run in TensorRT-LLM container]\n", + "\n", + "cd TensorRT-Model-Optimizer/\n", + "pip install -e ." + ] + }, + { + "cell_type": "markdown", + "id": "6962b7ef-d8fc-4f71-a841-82045774df2d", + "metadata": {}, + "source": [ + "**Run the HuggingFace checkpoint conversion script**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cff980a0-b6e6-439d-9ee6-633e9978f97d", + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "%%sh # [run in TensorRT-LLM container]\n", + "\n", + "# Set the export path for converted checkpoints. The script saves the converted checkpoint in ${ROOT_SAVE_PATH}/saved_models_${MODEL_FULL_NAME}\n", + "export ROOT_SAVE_PATH=/app/tensorrt_llm\n", + "\n", + "# Run the conversion script\n", + "cd ..\n", + "bash TensorRT-Model-Optimizer/examples/llm_ptq/scripts/huggingface_example.sh --model $(pwd)/qat/checkpoint-450/ --quant nvfp4 --export_fmt hf" + ] + }, + { + "cell_type": "markdown", + "id": "156b1088-7cc6-4fcc-9e4d-0f6dd464bb20", + "metadata": {}, + "source": [ + "## Serving the Model" + ] + }, + { + "cell_type": "markdown", + "id": "01b5b230-47af-49bb-8227-d8c5d676e685", + "metadata": {}, + "source": [ + "Run the following `trtllm-serve` command to launch the inference server." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3994c997-e1a8-472e-ada7-96cf97156dd0", + "metadata": {}, + "outputs": [], + "source": [ + "%%sh # [run in TensorRT-LLM container]\n", + "\n", + "trtllm-serve /app/tensorrt_llm/saved_models_checkpoint-450_nvfp4_hf/ \\\n", + " --max_batch_size 1 --max_num_tokens 1024 \\\n", + " --max_seq_len 4096 --tp_size 8 --pp_size 1 \\\n", + " --host 0.0.0.0 --port 8000 \\\n", + " --kv_cache_free_gpu_memory_fraction 0.95" + ] + }, + { + "cell_type": "markdown", + "id": "94ce91ae-045f-4e6d-bf0b-9b29f52ffb66", + "metadata": {}, + "source": [ + "## Sending an Inference Request to TensorRT-LLM Server" + ] + }, + { + "cell_type": "markdown", + "id": "8fba7c89-1ce0-4f7c-822e-b72786a25199", + "metadata": {}, + "source": [ + "In another terminal, or in the cell below, run the example curl command to send an inference request to the server."
+ ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "1c989f9a-1b51-476d-8e30-a8f8327b96fe", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + " % Total % Received % Xferd Average Speed Time Time Time Current\n", + " Dload Upload Total Spent Left Speed\n", + "100 240 0 0 100 240 0 108 0:00:02 0:00:02 --:--:-- 108" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{\"id\":\"chatcmpl-6ec07862eaaf4487b8ea85d5d37c90ac\",\"object\":\"chat.completion\",\"created\":1757004926,\"model\":\"Qwen3/qwen3-8b-qat-multilingual-reasoner\",\"choices\":[{\"index\":0,\"message\":{\"role\":\"assistant\",\"content\":\"\\n\\n\\n\\nNVIDIA's advantage in **inference** (the process of using a trained machine learning model to make predictions or decisions) lies in its **dedicated hardware and software ecosystem**, which provides **high performance, efficiency, and scalability** for AI workloads. Here's a breakdown of key advantages:\\n\\n---\\n\\n### **1. Hardware Specialization: GPUs (Graphics Processing Units)**\\n- **High Parallelism**: NVIDIA GPUs are optimized for parallel processing, making them ideal for the massive computations required in inference (e.g., image recognition, NLP, autonomous driving).\\n- **Tensor Cores**: NVIDIA's **Tensor Cores** (introduced in the Volta and Turing architectures) accelerate matrix operations critical for deep learning, enabling faster inference with lower power consumption.\\n- **NVIDIA A100/H100 GPUs**: These are among the most powerful GPUs for inference, offering **high throughput** and **low latency** for real-time applications (e.g., video streaming, robotics).\\n\\n---\\n\\n### **2. Software Ecosystem: CUDA and cuDNN**\\n- **CUDA (Compute Unified Device Architecture)**: A parallel computing platform that allows developers to harness GPU power for inference tasks.\\n- **cuDNN (CUDA Deep Neural Network Library)**: A library optimized for deep learning operations, enabling **high-performance inference** with minimal code changes.\\n- **TensorRT**: A high-performance deep learning inference optimizer and runtime that:\\n - **Optimizes models** (e.g., pruning, quantization) for faster execution.\\n - **Supports multiple frameworks** (TensorFlow, PyTorch, ONNX).\\n - **Leverages NVIDIA's hardware** (e.g., Tensor Cores) for maximum efficiency.\\n\\n---\\n\\n### **3. Cloud and Edge Solutions**\\n- **NVIDIA Cloud**: Offers scalable GPU resources for inference workloads, enabling businesses to deploy AI models without managing hardware.\\n- **NVIDIA Edge AI**: Hardware like the **Jetson series** (Jetson AGX, Jetson Nano) provides **low-power, compact inference solutions** for edge devices (e.g., drones, IoT sensors).\\n- **NVIDIA Clara**: A platform for healthcare AI, enabling inference in medical imaging and diagnostics.\\n\\n---\\n\\n### **4. Integration with AI Frameworks**\\n- **Framework Support**: TensorRT integrates seamlessly with popular frameworks (TensorFlow, PyTorch, ONNX), allowing developers to deploy models without rewriting code.\\n- **Model Optimization**: Tools like **TensorRT-LLM** (for large language models) and **TensorRT-MLI** (for mobile) enable efficient inference on diverse hardware.\\n\\n---\\n\\n### **5. 
Cost and Efficiency**\\n- **Energy Efficiency**: NVIDIA's hardware is designed to deliver **high performance per watt**, reducing operational costs for data centers and edge devices.\\n- **Scalability**: From small-scale edge devices to large-scale cloud clusters, NVIDIA's ecosystem supports **flexible deployment**.\\n\\n---\\n\\n### **6. Real-World Applications**\\n- **Autonomous Vehicles**: Real-time object detection and decision-making using NVIDIA DRIVE platforms.\\n- **Healthcare**: AI-driven diagnostics and imaging analysis with NVIDIA Clara.\\n- **Retail**: Customer behavior analysis and personalized recommendations using edge AI.\\n\\n---\\n\\n### **7. Ecosystem and Community**\\n- **Developer Tools**: NVIDIA provides extensive documentation, SDKs, and community support for developers.\\n- **Partnerships**: Collaborations with cloud providers (AWS, Azure, Google Cloud) and industry leaders to accelerate adoption.\\n\\n---\\n\\n### **Key Competitors and Differentiation**\\n- While competitors like **AMD** and **Intel** also offer GPU/TPU solutions, NVIDIA's **deep integration of hardware, software, and ecosystem** creates a **comprehensive advantage** for inference workloads. For example:\\n - **TensorRT** is a **proprietary tool** that outperforms open-source alternatives in optimization.\\n - **NVIDIA's A100/H100 GPUs** are **industry benchmarks** for high-performance inference.\\n\\n---\\n\\n### **Challenges**\\n- **Cost**: High-end NVIDIA hardware can be expensive.\\n- **Complexity**: Setting up the full ecosystem (CUDA, TensorRT, etc.) requires technical expertise.\\n\\n---\\n\\n### **Conclusion**\\nNVIDIA's **dedicated hardware (GPUs)**, **optimized software (TensorRT, cuDNN)**, and **end-to-end ecosystem** make it the **leading choice for inference** in AI applications. Its ability to balance **performance, efficiency, and scalability** across cloud, edge, and on-premise environments solidifies its position as a key player in the AI industry.\\n\\n\\nNVIDIA's advantage in **inference** (the application of trained machine learning models to make predictions) stems from its **dedicated hardware, optimized software, and ecosystem**, which deliver **high performance, efficiency, and scalability**. Here's a structured summary of its key strengths:\\n\\n---\\n\\n### **1. 
Hardware Specialization**\\n- **GPUs (Graphics Processing Units)**: \\n\",\"reasoning_content\":null,\"reasoning\":null,\"tool_calls\":[]},\"logprobs\":null,\"finish_reason\":\"length\",\"stop_reason\":null,\"mm_embedding_handle\":null,\"disaggregated_params\":null,\"avg_decoded_tokens_per_iter\":1.0}],\"usage\":{\"prompt_tokens\":18,\"total_tokens\":1042,\"completion_tokens\":1024},\"prompt_token_ids\":null}" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100 5699 100 5459 100 240 2294 100 0:00:02 0:00:02 --:--:-- 2395\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n" + ] + } + ], + "source": [ + "%%sh\n", + "curl localhost:8000/v1/chat/completions -H \"Content-Type: application/json\" -d '{\n", + " \"model\": \"Qwen3/qwen3-8b-qat-multilingual-reasoner\",\n", + " \"messages\": [\n", + " {\n", + " \"role\": \"user\",\n", + " \"content\": \"What is NVIDIAs advantage for inference?\"\n", + " }\n", + " ],\n", + " \"max_tokens\": 1024,\n", + " \"top_p\": 0.9\n", + "}' -w \"\\n\"" + ] + }, + { + "cell_type": "markdown", + "id": "eca645fb-d8d2-4c98-9cb2-afbae7d30d6c", + "metadata": {}, + "source": [ + "## Stop the TensorRT-LLM Docker contrainer" + ] + }, + { + "cell_type": "markdown", + "id": "acaeaaf3-fe9d-4592-9a5d-fee9721e507d", + "metadata": {}, + "source": [ + "Finally, clean up the TensorRT-LLM server by stopping and exiting the Docker container. Alternatively you can run the below cell to stop the running container. The container should automatically delete itself once stopped as it was started with the `--rm` flag." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fb78741b-30cb-46f2-a292-c5192cbca9ed", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "tensorrt_llm\n" + ] + } + ], + "source": [ + "!docker stop tensorrt_llm" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/examples/llm_qat/notebooks/requirements.txt b/examples/llm_qat/notebooks/requirements.txt new file mode 100644 index 00000000..2b20f7b1 --- /dev/null +++ b/examples/llm_qat/notebooks/requirements.txt @@ -0,0 +1,3 @@ +ipywidgets +nvidia-modelopt[all] +trl