
Conversation


@cjluo-nv cjluo-nv commented Sep 5, 2025

What does this PR do?

Clean up

Overview:

As TRT-LLM is moving to the torch backend, we have also removed all usage of TRT-LLM's TensorRT (TRT) backend from our examples.

It includes:

  1. By default, we export the quantized checkpoint in the unified Hugging Face (HF) format (see the sketch after this list).
  2. Delete trtllm-build and its related wrappers.
  3. Deprecate vlm_eval in examples, since it depends on TRT-LLM's TRT backend. We plan to re-introduce it once NVIDIA's eval tool is available.
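To make the default concrete, below is a minimal sketch of consuming such an exported checkpoint with TRT-LLM's torch-backend llmapi. The checkpoint path is a placeholder, and the exact output fields may vary across TRT-LLM versions; treat this as illustrative rather than as the exact commands shipped in the examples.

from tensorrt_llm.llmapi import KvCacheConfig
from tensorrt_llm.llmapi.llm import LLM

# Point at the quantized HF-format checkpoint directory written by examples/llm_ptq (placeholder path).
llm = LLM(
    model="/path/to/quantized_hf_checkpoint",
    kv_cache_config=KvCacheConfig(free_gpu_memory_fraction=0.7),
)

# Generate from a couple of prompts and print the decoded text.
for output in llm.generate(["Hello, my name is", "The capital of France is"]):
    print(output.outputs[0].text)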

Summary by CodeRabbit

  • New Features

    • ONNX quantization now defaults high-precision weights to FP16.
    • Upgraded TensorRT-LLM dependency to 1.1.0rc2 (Linux docs to 1.1.0rc2.post2).
  • Refactor

    • Simplified, quant-first workflows: unified HF checkpoint export by default, removed engine_dir usage, consolidated SAVE_PATH, streamlined scripts and CLI options.
  • Deprecations/Bug Fixes

    • Deprecated TRT-LLM backend paths in examples; removed/retired VLM eval utilities and int8_sq format in VLM PTQ; --export_fmt flag deprecated/ignored.
  • Documentation

    • Updated READMEs, notebooks, and changelog to reflect new defaults and flags; added low_memory_mode guidance.
  • Chores

    • Updated PyTorch requirement in examples; adjusted CODEOWNERS.


coderabbitai bot commented Sep 5, 2025

Walkthrough

The changes deprecate TRT-LLM engine build paths, unify HF checkpoint export in PTQ, remove VLM evaluation scripts, refactor LLM deployment to use TRT-LLM public API, update tests to the new quant/kv-cache flows, and adjust docs, notebooks, and requirements to reflect the streamlined quantization-first workflow.

Changes

• Changelog & Docs
  Files: CHANGELOG.rst, docs/source/getting_started/_installation_for_Linux.rst, examples/llm_eval/README.md, examples/llm_ptq/README.md, examples/vlm_ptq/README.md, examples/llm_sparsity/README.md
  Summary: Reclassify deprecations; document unified HF export and low-memory mode; update TensorRT-LLM version to 1.1.0rc2(.post2); replace engine_dir with quantized checkpoint dir in examples; simplify sparsity guidance.
• LLM PTQ Core & Scripts
  Files: examples/llm_ptq/hf_ptq.py, examples/llm_ptq/run_tensorrt_llm.py, examples/llm_ptq/scripts/huggingface_example.sh, examples/llm_ptq/scripts/parser.sh, examples/llm_ptq/modelopt_to_tensorrt_llm.py
  Summary: Unify export to HF by default; deprecate/ignore --export_fmt; select export path by model support/sparsity; always generate context logits; simplify CLI/scripts to quant-first; remove TRT-LLM engine builder script; tighten sparsity/QFORMAT validation.
• VLM Eval Removal
  Files: examples/vlm_eval/* (README.md, convert_gqa_for_eval.py, gqa.sh, model_gqa_loader.py, requirements-vila.txt, vlm_eval_utils.py), .github/CODEOWNERS
  Summary: Remove VLM GQA evaluation pipeline, utilities, and extra deps; drop CODEOWNERS entry for vlm_eval path.
• VLM PTQ Streamlining
  Files: examples/vlm_ptq/scripts/huggingface_example.sh, examples/vlm_ptq/vlm_run.py, examples/vlm_ptq/vlm_visual_engine.py
  Summary: Switch to quantization-first flow with simplified paths; add optional TRT-LLM quickstart; remove standalone VLM run/engine-build CLIs.
• LLM Deployment Refactor
  Files: modelopt/deploy/llm/generate.py
  Summary: LLM now subclasses TRTLLM; initialize from checkpoint config.json; add version-gated context logits/stop-words; configure KV cache, CUDA graphs, and MoE EP; route calls through public API.
• Tests: PTQ and Eval
  Files: tests/examples/llm_ptq/test_llm_ptq.py, tests/examples/llm_eval/test_llm_eval.py, tests/_test_utils/ptq_utils.py, tests/_test_utils/examples/run_command.py, tests/gpu/torch/export/test_unified_hf_export_and_check_safetensors.py
  Summary: Drop HF/export flags and FP16 variants; default tasks to quant; add max_sm/batch fields; adjust llama eval to batched run and cleanup; remove explicit --export_fmt from invocations.
• Tests: VLM PTQ
  Files: tests/examples/vlm_ptq/test_qwen_vl.py, tests/_test_utils/model.py
  Summary: Switch from LLAVA to QWEN2-VL path constant; broaden quantization variants; simplify command (remove type/tp).
• Notebooks & Examples
  Files: examples/llm_qat/notebooks/QAT_QAD_Walkthrough.ipynb, examples/speculative_decoding/example.ipynb, tests/examples/speculative_decoding/test_medusa.py
  Summary: Remove --export_fmt from PTQ invocations; consolidate speculative decoding PTQ to HF-only path.
• Dependency Update
  Files: examples/gpt-oss/requirements.txt
  Summary: Change PyTorch requirement from "torch >= 2.8.0" to "torch>2.7.1".

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant U as User
  participant PTQ as hf_ptq.py
  participant V as Validator
  participant E as Exporter
  Note over PTQ: Unified export selection
  U->>PTQ: Invoke with model/quant args (--export_fmt deprecated)
  PTQ->>V: Validate qformat/kv_cache_qformat
  V-->>PTQ: Validation result
  alt Model unsupported by HF runtime or sparsity != dense
    PTQ->>E: export_tensorrt_llm_checkpoint(checkpoint_dir)
    Note over E: Emits deprecation/runtime warnings
  else
    PTQ->>E: export_hf_checkpoint(checkpoint_dir)
  end
  PTQ-->>U: Save artifacts to SAVE_PATH
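A condensed sketch of the selection logic shown in the diagram above. The exporter entry points are the ones named in the diagram (modelopt.torch.export); the helper name, the hf_supported flag, and the exact keyword arguments are illustrative, not copied from hf_ptq.py.

import warnings

from modelopt.torch.export import export_hf_checkpoint, export_tensorrt_llm_checkpoint


def export_quantized_checkpoint(model, decoder_type, save_path, sparsity_fmt="dense", hf_supported=True):
    """Default to the unified HF checkpoint; fall back to the TRT-LLM checkpoint only when needed."""
    if not hf_supported or sparsity_fmt != "dense":
        # Legacy path for models the torch runtime does not cover (e.g. t5/bart/whisper) or for sparse weights.
        warnings.warn("TRT-LLM checkpoint export is deprecated; prefer the unified HF export.")
        export_tensorrt_llm_checkpoint(model, decoder_type, export_dir=save_path)
    else:
        export_hf_checkpoint(model, export_dir=save_path)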
sequenceDiagram
  autonumber
  participant App as Application
  participant L as LLM (TRTLLM)
  participant FS as Filesystem
  participant TRT as tensorrt_llm
  Note over L: Initialize from HF checkpoint
  App->>L: __init__(checkpoint_dir, tp, max_batch_size, ...)
  L->>FS: Read config.json
  FS-->>L: Config (seq len, experts, etc.)
  L->>TRT: Version check (>= 1.1.0rc2)
  alt Version supports context logits/stop-words
    Note over L: enable gather_context_logits
  else
    Note over L: disable context logits/stop-words
  end
  App->>L: generate(inputs, stop_words?)
  L-->>App: Outputs (with version-gated features)
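And a rough sketch of the version gate from the second diagram; this is not the actual modelopt/deploy/llm/generate.py code, and the helper names are made up for illustration.

import json
from pathlib import Path

import tensorrt_llm
from packaging.version import parse as parse_version

MIN_TRTLLM_FOR_CONTEXT_LOGITS = parse_version("1.1.0rc2")


def load_checkpoint_config(checkpoint_dir: str) -> dict:
    # The exported checkpoint ships a config.json with sequence length, expert counts, etc.
    with open(Path(checkpoint_dir) / "config.json") as config_file:
        return json.load(config_file)


def supports_context_logits() -> bool:
    # Context logits and stop-words handling are only switched on for new enough TRT-LLM versions.
    return parse_version(tensorrt_llm.__version__) >= MIN_TRTLLM_FOR_CONTEXT_LOGITS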

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60–90 minutes

Poem

A hop, a skip—goodbye old engines’ maze,
I nibble on checkpoints in quantized haze.
HF paths aligned, my whiskers twitch with glee,
Context logits? Only if the versions agree.
VLM scripts fade like footprints in dew—
Thump-thump! New flows, sleek and true. 🐇✨

Pre-merge checks

✅ Passed checks (3 passed)
• Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
• Title Check: ✅ Passed. The title "Deprecate TRTLLM-build in examples" succinctly and accurately captures the primary change in the diff: deprecating/removing the TRT-LLM engine-build flow and related example scripts and wrappers while shifting example exports toward Hugging Face checkpoints. It is a single, concise sentence that directly relates to the main deletions and deprecations (e.g., removal of modelopt_to_tensorrt_llm.py, adjustments across llm/vlm examples, and README/CHANGELOG updates). The phrasing is clear and specific enough for a teammate scanning PR history to understand the primary intent.
• Docstring Coverage: ✅ Passed. No functions found in the changes; docstring coverage check skipped.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1885e81 and be09664.

📒 Files selected for processing (1)
  • CHANGELOG.rst (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • CHANGELOG.rst
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: linux
  • GitHub Check: wait-checks / wait
  • GitHub Check: code-quality
  • GitHub Check: build-docs


Signed-off-by: Chenjie Luo <[email protected]>
@cjluo-nv cjluo-nv requested a review from Edwardf0t1 September 5, 2025 22:05
), "One or more quantization formats provided are not supported for tensorrt llm export"
assert all(
qformat
in [


do you think we can pull this list of supported qformats as a variable and reuse in other places (in the auto quantize section)?

Collaborator Author


ACK. This PR is pretty large. I hope we can move the improvements to a follow-up.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 11

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
examples/llm_ptq/scripts/huggingface_example.sh (2)

150-159: Logic bug: only the last qformat is checked for bf16/fp16.

$qformat here is whatever remained from a prior loop. If any qformat is bf16/fp16, do the symlink path.

-    if [ "$qformat" == "bf16" ] || [ "$qformat" == "fp16" ]; then
+    needs_symlink=false
+    IFS=','; for q in $QFORMAT; do
+        if [ "$q" = "bf16" ] || [ "$q" = "fp16" ]; then needs_symlink=true; fi
+    done; IFS=' '
+    if $needs_symlink; then
         if [ -d "$MODEL_PATH" ]; then
             MODEL_CONFIG_EXIST=true
             MODEL_CONFIG=$MODEL_PATH/config.json
-            for file in $MODEL_PATH/*; do ln -sf "$file" $SAVE_PATH/; done
+            for file in "$MODEL_PATH"/*; do ln -sf "$file" "$SAVE_PATH/"; done
         else
             echo "Please use the model directory where the config.json file is present."
             exit 1
         fi
     fi

301-314: Subshell exit won’t stop the script; make the log-wait loop run in current shell.

exit 1 inside the pipeline’s subshell won’t abort the parent. Use process substitution and read -r.

-    tail -f $SAVE_PATH/serve.txt | while read line; do
+    while read -r line; do
         if echo "$line" | grep -q "Application startup complete"; then
             echo "Application startup complete."
             break
         fi
         if ! kill -0 $SERVE_PID 2>/dev/null; then
             echo "trtllm-serve has exited."
             exit 1
         fi
-    done
+    done < <(tail -n +1 -f "$SAVE_PATH/serve.txt")

Optionally, add a timeout to avoid indefinite hangs.

🧹 Nitpick comments (19)
CHANGELOG.rst (1)

9-12: Tighten deprecation bullets and clarify default behavior.

Suggest minor wording for precision and consistency.

-- TRT-LLM's TRT backend in ``examples/llm_ptq`` and ``examples/vlm_ptq``.
- - ``--export_fmt`` flag in ``examples/llm_ptq`` is removed. By default we export to the unified Hugging Face checkpoint format.
- - ``examples/vlm_eval`` as it depends on the deprecated TRT-LLM's TRT backend.
+ - Deprecate TRT-LLM's TensorRT backend in ``examples/llm_ptq`` and ``examples/vlm_ptq``.
+ - Remove the ``--export_fmt`` flag in ``examples/llm_ptq``; exports default to the unified Hugging Face checkpoint format.
+ - Deprecate ``examples/vlm_eval`` (depends on the deprecated TensorRT backend).
examples/llm_ptq/hf_ptq.py (1)

92-105: De-duplicate allowed qformats (single source of truth).

The hardcoded list risks drifting from other checks. Define a constant and reuse it here and below.

-    assert all(
-        qformat
-        in [
-            "fp8",
-            "int4_awq",
-            "nvfp4",
-            "nvfp4_awq",
-            "w4a8_awq",
-            "fp8_pb_wo",
-            "w4a8_mxfp4_fp8",
-            "nvfp4_mlp_only",
-        ]
-        for qformat in qformat_list
-    ), "One or more quantization formats provided are not supported for unified checkpoint export"
+    assert all(q in ALLOWED_UNIFIED_HF_QFORMATS for q in qformat_list), (
+        "One or more quantization formats provided are not supported for unified checkpoint export"
+    )

Add near QUANT_CFG_CHOICES:

ALLOWED_UNIFIED_HF_QFORMATS = {
    "fp8",
    "int4_awq",
    "nvfp4",
    "nvfp4_awq",
    "w4a8_awq",
    "fp8_pb_wo",
    "w4a8_mxfp4_fp8",
    "nvfp4_mlp_only",
}
modelopt/deploy/llm/generate.py (2)

113-120: MOE detection heuristic is brittle.

Scanning only top-level keys for “num” and “experts” may miss/false-trigger. Consider a recursive search for typical MOE markers (e.g., num_experts, num_local_experts) or a config flag.

I can propose a small recursive helper to detect MOE fields robustly.
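For reference, a sketch of such a helper; the marker names are common MoE config keys chosen as examples, so extend the set for the configs you actually need to support.

MOE_CONFIG_KEYS = {"num_experts", "num_local_experts", "moe_num_experts", "n_routed_experts"}


def config_has_moe(config) -> bool:
    """Recursively search a (possibly nested) config for MoE expert-count fields."""
    if isinstance(config, dict):
        for key, value in config.items():
            if key in MOE_CONFIG_KEYS and isinstance(value, int) and value > 1:
                return True
            if config_has_moe(value):
                return True
    elif isinstance(config, list):
        return any(config_has_moe(item) for item in config)
    return False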


136-139: Align version gating and ImportError guidance.

You gate features at ≥1.1.0rc2 but the ImportError message above still says “1.0.0rc or later.” Update the message to match the minimum required for llmapi + context logits UX.

-except ImportError:
-    print("Please upgrade tensorrt-llm to 1.0.0rc or later")
+except ImportError:
+    print("Please upgrade tensorrt-llm to 1.1.0rc2 or later")
examples/llm_ptq/scripts/parser.sh (3)

92-96: Usage/help text is outdated and incomplete.

It omits many supported options (e.g., kv_cache_quant, calib_batch_size, num_samples, etc.). Update to reduce user confusion.

Apply this minimal improvement:

-    echo "Usage: $0 --model=<MODEL_PATH> --quant=<QFORMAT> --tasks=<TASK,...>"
-    echo "Optional args: --sparsity=<SPARSITY_FMT> --awq_block_size=<AWQ_BLOCK_SIZE> --calib=<CALIB_SIZE>"
+    echo "Usage: $0 --model=<MODEL_PATH> --quant=<QFORMAT> --tasks=<TASK,...>"
+    echo "Optional: --kv_cache_quant=FMT --sparsity=FMT --awq_block_size=N --calib=N --calib_batch_size=N --auto_quantize_bits=N"
+    echo "          --input=N --output=N --batch=N --lm_eval_tasks=LIST --lm_eval_limit=N --simple_eval_tasks=LIST --num_samples=N"
+    echo "          --trust_remote_code --use_seq_device_map --gpu_max_mem_percentage=N --kv_cache_free_gpu_memory_fraction=F"
+    echo "          --low_memory_mode --no-verbose --calib_dataset=NAME"

117-123: Message text references deprecated TRT-LLM TRT path and has grammar issues.

Keep messaging backend-agnostic and fix grammar.

-    if [[ "$QFORMAT" == *"awq"* ]]; then
-        echo "Sparsity is not compatible with 'awq' quantization for TRT-LLM deployment."
+    if [[ "$QFORMAT" == *"awq"* ]]; then
+        echo "Sparsity is not compatible with 'awq' quantization (no speedup)."
         exit 1  # Exit script with an error
     fi

141-141: Typo: use num_samples (plural) for consistency.

-  echo "num_sample: $NUM_SAMPLES"
+  echo "num_samples: $NUM_SAMPLES"
examples/vlm_ptq/scripts/huggingface_example.sh (2)

73-84: Dead branch: MODEL_TYPE is never set by parser.sh now.

parser.sh no longer exposes --type/MODEL_TYPE. This VILA setup block is unreachable.

Options:

  • Remove this block; or
  • Gate via a new flag (e.g., --enable_vila_setup) parsed in parser.sh; or
  • Expect MODEL_TYPE via env and document it.
    Do you want a patch to add --enable_vila_setup?

101-113: Minor: keep comments aligned with actual behavior.

Header says “Prepare datasets for TRT-LLM benchmark” but this section only sets and uses quickstart. Consider rewording to avoid confusion.

examples/llm_ptq/scripts/huggingface_example.sh (10)

187-194: Outdated Blackwell guidance mentions engine build.

If using Torch llmapi, suggest wording that avoids “build engine”.

-            echo "Please build the tensorrt_llm engine on Blackwell GPU for deployment. Checkpoint export_path: $SAVE_PATH"
+            echo "This quantization requires deployment on a Blackwell (SM100) GPU with the TRT-LLM Torch backend. Checkpoint export_path: $SAVE_PATH"

201-206: Initialize RUN_ARGS to avoid unbound expansions and quote paths.

-    if $TRUST_REMOTE_CODE; then
+    RUN_ARGS=""
+    if $TRUST_REMOTE_CODE; then
         RUN_ARGS+=" --trust_remote_code "
     fi
-
-    python run_tensorrt_llm.py --engine_dir=$SAVE_PATH $RUN_ARGS
+    python run_tensorrt_llm.py --engine_dir="$SAVE_PATH" $RUN_ARGS

285-289: Quote command substitution result (SC2046) and prefer a single match.

Avoid word-splitting and pick one file deterministically.

-    JSONL_PATH=$(readlink -f $(find $SAVE_PATH -type f -name '*.jsonl'))
+    JSONL_PATH=$(readlink -f "$(find "$SAVE_PATH" -type f -name '*.jsonl' -print -quit)")

299-303: Quote paths for safety; minor robustness.

-    trtllm-serve $SAVE_PATH --host 0.0.0.0 --port $PORT >$SAVE_PATH/serve.txt 2>&1 &
+    trtllm-serve "$SAVE_PATH" --host 0.0.0.0 --port "$PORT" >"$SAVE_PATH/serve.txt" 2>&1 &

373-380: Quote model and dataset paths.

-    MODEL_ARGS="--model_path $SAVE_PATH "
+    MODEL_ARGS="--model_path \"$SAVE_PATH\" "
@@
-        trtllm-bench --model $MODEL_PATH $MODEL_ARGS throughput $EXTRA_ARGS --dataset $DATASET_TXT | tee -a $BENCHMARK_RESULT
+        trtllm-bench --model "$MODEL_PATH" $MODEL_ARGS throughput $EXTRA_ARGS --dataset "$DATASET_TXT" | tee -a "$BENCHMARK_RESULT"
@@
-        trtllm-bench --model $MODEL_PATH $MODEL_ARGS latency $EXTRA_ARGS --dataset $DATASET_TXT | tee -a $BENCHMARK_RESULT
+        trtllm-bench --model "$MODEL_PATH" $MODEL_ARGS latency $EXTRA_ARGS --dataset "$DATASET_TXT" | tee -a "$BENCHMARK_RESULT"

236-247: Consistent quoting for paths and args.

-    LM_EVAL_RESULT=${SAVE_PATH}/lm_eval.txt
+    LM_EVAL_RESULT="${SAVE_PATH}/lm_eval.txt"
@@
-    python lm_eval_tensorrt_llm.py \
+    python lm_eval_tensorrt_llm.py \
         --model trt-llm \
-        --model_args tokenizer=$MODEL_PATH,engine_dir=$SAVE_PATH,max_gen_toks=$BUILD_MAX_OUTPUT_LEN \
+        --model_args tokenizer="$MODEL_PATH",engine_dir="$SAVE_PATH",max_gen_toks="$BUILD_MAX_OUTPUT_LEN" \
         --tasks $LM_EVAL_TASKS \
-        --batch_size $BUILD_MAX_BATCH_SIZE $lm_eval_flags | tee $LM_EVAL_RESULT
+        --batch_size "$BUILD_MAX_BATCH_SIZE" $lm_eval_flags | tee "$LM_EVAL_RESULT"

255-276: Quoting and portability nits in MMLU block.

-    MMLU_RESULT=${SAVE_PATH}/mmlu.txt
+    MMLU_RESULT="${SAVE_PATH}/mmlu.txt"
@@
-        tar -xf /tmp/mmlu.tar -C data && mv data/data $MMLU_DATA_PATH
+        tar -xf /tmp/mmlu.tar -C data && mv data/data "$MMLU_DATA_PATH"
@@
-        --engine_dir $SAVE_PATH \
-        --data_dir $MMLU_DATA_PATH | tee $MMLU_RESULT
+        --engine_dir "$SAVE_PATH" \
+        --data_dir "$MMLU_DATA_PATH" | tee "$MMLU_RESULT"

318-328: Quote args and results paths in livecodebench/simple_eval.

-        bash run_livecodebench.sh $MODEL_FULL_NAME $BUILD_MAX_BATCH_SIZE $BUILD_MAX_OUTPUT_LEN $PORT | tee $SAVE_PATH/livecodebench.txt
+        bash run_livecodebench.sh "$MODEL_FULL_NAME" "$BUILD_MAX_BATCH_SIZE" "$BUILD_MAX_OUTPUT_LEN" "$PORT" | tee "$SAVE_PATH/livecodebench.txt"
-        mkdir -p $SAVE_PATH/livecodebench
-        mv LiveCodeBench/output/$MODEL_FULL_NAME/* $SAVE_PATH/livecodebench
+        mkdir -p "$SAVE_PATH/livecodebench"
+        mv "LiveCodeBench/output/$MODEL_FULL_NAME/"* "$SAVE_PATH/livecodebench"
@@
-        bash run_simple_eval.sh $MODEL_FULL_NAME $SIMPLE_EVAL_TASKS $BUILD_MAX_OUTPUT_LEN $PORT | tee $SAVE_PATH/simple_eval.txt
+        bash run_simple_eval.sh "$MODEL_FULL_NAME" "$SIMPLE_EVAL_TASKS" "$BUILD_MAX_OUTPUT_LEN" "$PORT" | tee "$SAVE_PATH/simple_eval.txt"

384-391: Clean-up quoting.

-    rm -f $SAVE_PATH/*.json
-    rm -f $SAVE_PATH/*.safetensors
-    rm -f $SAVE_PATH/*/*.json
-    rm -f $SAVE_PATH/*/*.engine
-    rm -f $SAVE_PATH/*/*.cache
+    rm -f "$SAVE_PATH"/*.json
+    rm -f "$SAVE_PATH"/*.safetensors
+    rm -f "$SAVE_PATH"/*/*.json
+    rm -f "$SAVE_PATH"/*/*.engine
+    rm -f "$SAVE_PATH"/*/*.cache

21-24: Optional: add set -o pipefail with set -e.

Improves failure detection in piped commands.

 set -e
+set -o pipefail
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 0d279f1 and 888a89e.

📒 Files selected for processing (18)
  • CHANGELOG.rst (1 hunks)
  • docker/Dockerfile (1 hunks)
  • examples/llm_ptq/hf_ptq.py (4 hunks)
  • examples/llm_ptq/modelopt_to_tensorrt_llm.py (0 hunks)
  • examples/llm_ptq/run_tensorrt_llm.py (1 hunks)
  • examples/llm_ptq/scripts/huggingface_example.sh (14 hunks)
  • examples/llm_ptq/scripts/parser.sh (2 hunks)
  • examples/vlm_eval/README.md (0 hunks)
  • examples/vlm_eval/convert_gqa_for_eval.py (0 hunks)
  • examples/vlm_eval/gqa.sh (0 hunks)
  • examples/vlm_eval/model_gqa_loader.py (0 hunks)
  • examples/vlm_eval/requirements-vila.txt (0 hunks)
  • examples/vlm_eval/vlm_eval_utils.py (0 hunks)
  • examples/vlm_ptq/scripts/huggingface_example.sh (3 hunks)
  • examples/vlm_ptq/vlm_run.py (0 hunks)
  • examples/vlm_ptq/vlm_visual_engine.py (0 hunks)
  • modelopt/deploy/llm/generate.py (8 hunks)
  • tests/examples/llm_ptq/test_llm_ptq.py (0 hunks)
💤 Files with no reviewable changes (10)
  • examples/vlm_ptq/vlm_visual_engine.py
  • examples/vlm_eval/README.md
  • examples/vlm_eval/gqa.sh
  • examples/vlm_ptq/vlm_run.py
  • examples/vlm_eval/requirements-vila.txt
  • tests/examples/llm_ptq/test_llm_ptq.py
  • examples/vlm_eval/model_gqa_loader.py
  • examples/vlm_eval/convert_gqa_for_eval.py
  • examples/vlm_eval/vlm_eval_utils.py
  • examples/llm_ptq/modelopt_to_tensorrt_llm.py
🧰 Additional context used
🧬 Code graph analysis (2)
examples/llm_ptq/run_tensorrt_llm.py (1)
modelopt/deploy/llm/generate.py (1)
  • generate_context_logits (264-292)
modelopt/deploy/llm/generate.py (1)
examples/llm_eval/lm_eval_tensorrt_llm.py (1)
  • TRTLLM (41-188)
🪛 Shellcheck (0.10.0)
examples/llm_ptq/scripts/huggingface_example.sh

[warning] 196-196: Remove quotes from right-hand side of =~ to match as a regex rather than literally.

(SC2076)


[warning] 288-288: Quote this to prevent word splitting.

(SC2046)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: linux
  • GitHub Check: wait-checks / wait
  • GitHub Check: code-quality
  • GitHub Check: build-docs
🔇 Additional comments (6)
docker/Dockerfile (1)

1-1: Please verify llmapi feature support in the new base image
Build and run the Docker image and, inside the container, confirm that tensorrt_llm.__version__ >= "1.1.0rc2" and that the llmapi symbols (CudaGraphConfig, KvCacheConfig, LLM) are present. For example:

docker build -f docker/Dockerfile -t trt-llm-img .
docker run --rm trt-llm-img python3 - <<'PYTHON'
from packaging.version import parse as V
import importlib

import tensorrt_llm as trtllm
assert V(trtllm.__version__) >= V("1.1.0rc2"), "Tensorrt-llm version is too low"

m = importlib.import_module("tensorrt_llm.llmapi")
assert hasattr(m, "CudaGraphConfig")
assert hasattr(m, "KvCacheConfig")

from tensorrt_llm.llmapi.llm import LLM
assert LLM is not None
print("llmapi check passed:", trtllm.__version__)
PYTHON
examples/llm_ptq/hf_ptq.py (1)

586-615: HF vs. TRT-LLM export split: add a brief rationale and ensure docs match.

Given TRT backend deprecation, clarify here that TRT-LLM checkpoint export is used only where the torch runtime lacks support (t5/bart/whisper or sparse). Also ensure CHANGELOG/README reflect this split.

Would you like me to add a short comment block and update README accordingly?

examples/llm_ptq/scripts/huggingface_example.sh (4)

80-99: Ensure SAVE_PATH exists before symlinking.

You mkdir -p later (Line 98). Good. Keep it before any potential early use. Already correct; no action needed.


345-371: Benchmark dataset prep pathing looks good.

Defaulting TRT_LLM_CODE_PATH and generating/reusing dataset deterministically is fine.


45-49: ROOT_SAVE_PATH defaulting is solid.


21-23: Parsing reuse from parser.sh is consistent.

Comment on lines +214 to 229
if not args.auto_quantize_bits:
    assert (
        args.qformat
        in [
            "int4_awq",
            "fp8",
            "nvfp4",
            "nvfp4_awq",
            "w4a8_awq",
            "fp8_pb_wo",
            "w4a8_mxfp4_fp8",
            "nvfp4_mlp_only",
        ]
        or args.kv_cache_qformat in KV_QUANT_CFG_CHOICES
    ), f"Quantization format {args.qformat} not supported for HF export path"


⚠️ Potential issue

Assertion is ineffective: the "or args.kv_cache_qformat in KV_QUANT_CFG_CHOICES" clause is effectively always True.

This currently lets unsupported qformats through. Tighten the check.

-    if not args.auto_quantize_bits:
-        assert (
-            args.qformat
-            in [
-                "int4_awq",
-                "fp8",
-                "nvfp4",
-                "nvfp4_awq",
-                "w4a8_awq",
-                "fp8_pb_wo",
-                "w4a8_mxfp4_fp8",
-                "nvfp4_mlp_only",
-            ]
-            or args.kv_cache_qformat in KV_QUANT_CFG_CHOICES
-        ), f"Quantization format {args.qformat} not supported for HF export path"
+    if not args.auto_quantize_bits:
+        assert args.qformat in ALLOWED_UNIFIED_HF_QFORMATS, (
+            f"Quantization format {args.qformat} not supported for HF export path"
+        )

Note: If you intended to allow “KV-cache-only” quant, handle that as a separate branch instead of weakening this assert.

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines 37 to 44
#Iterate over list of qformats provided and check if they are supported in HF export path
IFS=","
for qformat in $QFORMAT; do
    case $qformat in
    fp16 | bf16 | fp8 | fp8_pc_pt | fp8_pb_wo | int4_awq | nvfp4 | nvfp4_awq | w4a8_awq | w4a8_nvfp4_fp8 | w4a8_mxfp4_fp8) ;;
    *)
        echo "Unsupported quant argument: Expected one of: [fp16, bf16, fp8, fp8_pc_pt, fp8_pb_wo, int4_awq, nvfp4, nvfp4_awq, w4a8_awq, w4a8_nvfp4_fp8, w4a8_mxfp4_fp8]" >&2
        exit 1

💡 Verification agent

🧩 Analysis chain

Follow-up: finalize the allowed quant list after deduplication.

Please confirm the authoritative allowed set for the HF path and Torch llmapi to avoid drift.

Run to locate all hardcoded QFORMAT lists:


🏁 Script executed:

#!/bin/bash
rg -nP --type=sh -C2 '\b(fp16|bf16|fp8|fp8_pc_pt|fp8_pb_wo|int8_sq|int4_awq|nvfp4|nvfp4_awq|w4a8_awq|w4a8_nvfp4_fp8|w4a8_mxfp4_fp8)\b'

Length of output: 9308


Centralize quant-format lists for HF export and Torch llmapi
Move both HF-export (fp16, bf16, fp8, fp8_pc_pt, fp8_pb_wo, int4_awq, nvfp4, nvfp4_awq, w4a8_awq, w4a8_nvfp4_fp8, w4a8_mxfp4_fp8) and Torch-llmapi (fp16, bf16, fp8, nvfp4, int4_awq, w4a8_awq) formats into shared constants and update the case statements in examples/llm_ptq/scripts/huggingface_example.sh to reference them, ensuring a single source of truth and preventing drift.

🤖 Prompt for AI Agents
In examples/llm_ptq/scripts/huggingface_example.sh around lines 37-44, the
supported quant-format list is hard-coded in the case statement; centralize the
HF-export and Torch-llmapi format lists into a single shared constant file
(e.g., examples/llm_ptq/scripts/quant_formats.sh), export two variables
(HF_EXPORT_QFORMATS and TORCH_LLMAPI_QFORMATS), source that file from
huggingface_example.sh, and replace the inline case pattern with a check that
tests the qformat against the appropriate exported list (e.g., iterate or use a
helper function that returns true if qformat is in HF_EXPORT_QFORMATS) so both
scripts consume the same source of truth and avoid drift.

Comment on lines 196 to 200
if [[ ! " fp8 nvfp4 bf16 fp16 int4_awq w4a8_awq " =~ " ${QFORMAT} " ]]; then
echo "Quant $QFORMAT not supported with the TensorRT-LLM torch llmapi. Allowed values are: fp8, nvfp4, bf16, fp16, int4_awq, w4a8_awq"
exit 0
fi


⚠️ Potential issue

Regex-in-string and comma-list membership check (SC2076) + incorrect handling of comma-separated QFORMAT.

Current check treats the whole comma-joined list as one token.

-    if [[ ! " fp8 nvfp4 bf16 fp16 int4_awq w4a8_awq " =~ " ${QFORMAT} " ]]; then
-        echo "Quant $QFORMAT not supported with the TensorRT-LLM torch llmapi. Allowed values are: fp8, nvfp4, bf16, fp16, int4_awq, w4a8_awq"
+    unsupported=""
+    IFS=','; for q in $QFORMAT; do
+        case "$q" in fp8|nvfp4|bf16|fp16|int4_awq|w4a8_awq) : ;; * ) unsupported+="${unsupported:+,}$q" ;; esac
+    done; IFS=' '
+    if [ -n "$unsupported" ]; then
+        echo "Quant(s) [$unsupported] not supported with the TensorRT-LLM Torch llmapi. Allowed: fp8, nvfp4, bf16, fp16, int4_awq, w4a8_awq"
         exit 0
     fi
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if [[ ! " fp8 nvfp4 bf16 fp16 int4_awq w4a8_awq " =~ " ${QFORMAT} " ]]; then
echo "Quant $QFORMAT not supported with the TensorRT-LLM torch llmapi. Allowed values are: fp8, nvfp4, bf16, fp16, int4_awq, w4a8_awq"
exit 0
fi
# Validate that each comma-separated QFORMAT is in the allowed list.
unsupported=""
# Split on commas
IFS=','
for q in $QFORMAT; do
case "$q" in
fp8|nvfp4|bf16|fp16|int4_awq|w4a8_awq)
# supported, do nothing
;;
*)
# collect any unsupported formats
unsupported+="${unsupported:+,}$q"
;;
esac
done
# Restore IFS to a single space (or save/restore original IFS if needed)
IFS=' '
if [ -n "$unsupported" ]; then
echo "Quant(s) [$unsupported] not supported with the TensorRT-LLM Torch llmapi. Allowed: fp8, nvfp4, bf16, fp16, int4_awq, w4a8_awq"
exit 0
fi
🧰 Tools
🪛 Shellcheck (0.10.0)

[warning] 196-196: Remove quotes from right-hand side of =~ to match as a regex rather than literally.

(SC2076)

🤖 Prompt for AI Agents
In examples/llm_ptq/scripts/huggingface_example.sh around lines 196-200, the
current membership check uses a regex-in-string and fails for comma-separated
QFORMAT values; update it to split QFORMAT on commas and validate each token
against an explicit allowed list. Specifically: define an array or string of
allowed formats (fp8 nvfp4 bf16 fp16 int4_awq w4a8_awq), split QFORMAT by ','
into tokens, trim whitespace, loop over tokens and for each token check
membership using a case statement or a simple loop comparison; if any token is
not in the allowed list, echo the same error message and exit with a non-zero
status (exit 1).


# Parse command-line options
ARGS=$(getopt -o "" -l "type:,model:,quant:,kv_cache_quant:,tp:,calib_tp:,pp:,sparsity:,awq_block_size:,calib:,calib_batch_size:,auto_quantize_bits:,input:,output:,batch:,tasks:,export_fmt:,lm_eval_tasks:,lm_eval_limit:,simple_eval_tasks:,trust_remote_code,use_seq_device_map,gpu_max_mem_percentage:,kv_cache_free_gpu_memory_fraction:,low_memory_mode,no-verbose,calib_dataset:" -n "$0" -- "$@")
ARGS=$(getopt -o "" -l "model:,quant:,kv_cache_quant:,sparsity:,awq_block_size:,calib:,calib_batch_size:,auto_quantize_bits:,input:,output:,batch:,tasks:,lm_eval_tasks:,lm_eval_limit:,simple_eval_tasks:,trust_remote_code,use_seq_device_map,gpu_max_mem_percentage:,kv_cache_free_gpu_memory_fraction:,low_memory_mode,no-verbose,calib_dataset:" -n "$0" -- "$@")

⚠️ Potential issue

Missing getopt entry for --num_samples causes parsing failure.

You parse --num_samples below (Line 57) but it’s not declared in the getopt long options, so passing it will fail.

Apply this diff:

-  ARGS=$(getopt -o "" -l "model:,quant:,kv_cache_quant:,sparsity:,awq_block_size:,calib:,calib_batch_size:,auto_quantize_bits:,input:,output:,batch:,tasks:,lm_eval_tasks:,lm_eval_limit:,simple_eval_tasks:,trust_remote_code,use_seq_device_map,gpu_max_mem_percentage:,kv_cache_free_gpu_memory_fraction:,low_memory_mode,no-verbose,calib_dataset:" -n "$0" -- "$@")
+  ARGS=$(getopt -o "" -l "model:,quant:,kv_cache_quant:,sparsity:,awq_block_size:,calib:,calib_batch_size:,auto_quantize_bits:,input:,output:,batch:,tasks:,lm_eval_tasks:,lm_eval_limit:,simple_eval_tasks:,num_samples:,trust_remote_code,use_seq_device_map,gpu_max_mem_percentage:,kv_cache_free_gpu_memory_fraction:,low_memory_mode,no-verbose,calib_dataset:" -n "$0" -- "$@")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
ARGS=$(getopt -o "" -l "model:,quant:,kv_cache_quant:,sparsity:,awq_block_size:,calib:,calib_batch_size:,auto_quantize_bits:,input:,output:,batch:,tasks:,lm_eval_tasks:,lm_eval_limit:,simple_eval_tasks:,trust_remote_code,use_seq_device_map,gpu_max_mem_percentage:,kv_cache_free_gpu_memory_fraction:,low_memory_mode,no-verbose,calib_dataset:" -n "$0" -- "$@")
ARGS=$(getopt -o "" -l "model:,quant:,kv_cache_quant:,sparsity:,awq_block_size:,calib:,calib_batch_size:,auto_quantize_bits:,input:,output:,batch:,tasks:,lm_eval_tasks:,lm_eval_limit:,simple_eval_tasks:,num_samples:,trust_remote_code,use_seq_device_map,gpu_max_mem_percentage:,kv_cache_free_gpu_memory_fraction:,low_memory_mode,no-verbose,calib_dataset:" -n "$0" -- "$@")
🤖 Prompt for AI Agents
In examples/llm_ptq/scripts/parser.sh around line 37, the getopt long options
list is missing the --num_samples entry which causes parsing failures when that
flag is used; update the long options string to include num_samples: (with a
trailing colon) among the other long options so getopt recognizes and parses
--num_samples correctly, keeping the comma separation/format consistent with the
existing options.

Comment on lines +122 to 125
        trt_kv_cache_config = TRT_KvCacheConfig(free_gpu_memory_fraction=0.7)
        trt_kv_cache_config.max_tokens = self._max_seq_len * (
            max_batch_size if max_batch_size > 0 else 8
        )

🛠️ Refactor suggestion

kv_cache_config argument is ignored; honor user overrides.

Currently hardcodes free_gpu_memory_fraction and max_tokens. Apply values from kv_cache_config when provided.

-        trt_kv_cache_config = TRT_KvCacheConfig(free_gpu_memory_fraction=0.7)
-        trt_kv_cache_config.max_tokens = self._max_seq_len * (
-            max_batch_size if max_batch_size > 0 else 8
-        )
+        trt_kv_cache_config = TRT_KvCacheConfig(
+            free_gpu_memory_fraction=kv_cache_config.get("free_gpu_memory_fraction", 0.7)
+        )
+        trt_kv_cache_config.max_tokens = kv_cache_config.get(
+            "max_tokens",
+            self._max_seq_len * (max_batch_size if max_batch_size > 0 else 8),
+        )

If more fields are supported by TRT_KvCacheConfig, merge them generically:

for k, v in kv_cache_config.items():
    if hasattr(trt_kv_cache_config, k):
        setattr(trt_kv_cache_config, k, v)
🤖 Prompt for AI Agents
In modelopt/deploy/llm/generate.py around lines 122 to 125, the code creates a
TRT_KvCacheConfig with hardcoded free_gpu_memory_fraction and max_tokens and
ignores the incoming kv_cache_config parameter; update this to first construct
the default TRT_KvCacheConfig (with max_tokens defaulting to self._max_seq_len *
(max_batch_size if max_batch_size > 0 else 8)), then if kv_cache_config is
provided, iterate its items and for each key value pair set the attribute on the
TRT_KvCacheConfig instance only if that attribute exists (so user-supplied
overrides like free_gpu_memory_fraction and max_tokens take effect), ensuring no
unknown keys raise errors.


codecov bot commented Sep 5, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.82%. Comparing base (d94fc1b) to head (8c4a6da).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #297   +/-   ##
=======================================
  Coverage   73.82%   73.82%           
=======================================
  Files         172      172           
  Lines       17438    17438           
=======================================
  Hits        12873    12873           
  Misses       4565     4565           

☔ View full report in Codecov by Sentry.

kevalmorabia97 commented Sep 6, 2025

Missing changes in

Since this is a big change, please make sure to test it:

  • Rebuild docker with latest dockerfile and push it to gitlab-master.nvidia.com:5005/omniml/modelopt/modelopt_examples:latest
  • Trigger internal CI and make sure gpu, example-ada: llm_ptq and example-ada: vlm_ptq pass

Co-authored-by: Keval Morabia <[email protected]>
Signed-off-by: Chenjie Luo <[email protected]>
@noeyy-mino

The llm_sparsity example mentions "export tensorrt-llm checkpoint" and the llm_eval example mentions "TensorRT LLM"; these may also need to be updated.

@knote2019

[image] Some models need the TP size to be set, so run_tensorrt_llm.py in llm_ptq may also need one more parameter to set TP, right? Or is there another way to do so?

@cjluo-nv
Collaborator Author

[image] Some models need the TP size to be set, so run_tensorrt_llm.py in llm_ptq may also need one more parameter to set TP, right? Or is there another way to do so?

By default, TP is set to all visible GPUs. If you need customization, feel free to customize this script as well; a rough sketch follows below.
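For reference, a rough sketch of what that customization could look like; the --tp flag, the import path, and the generate_text call are assumptions for illustration rather than part of the current script, while the wrapper's tp argument is documented as "0 means use all visible GPUs".

import argparse

from modelopt.deploy.llm import LLM  # assumed import path for the wrapper defined in modelopt/deploy/llm/generate.py


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--engine_dir", required=True, help="Quantized checkpoint directory (SAVE_PATH)")
    parser.add_argument("--tp", type=int, default=0, help="Tensor parallel size; 0 uses all visible GPUs")
    args = parser.parse_args()

    # Forward the user-selected TP size to the torch-backend LLM wrapper.
    llm = LLM(args.engine_dir, tp=args.tp)

    # generate_text is assumed here for illustration; use whatever generation call your version exposes.
    print(llm.generate_text(["Hello, my name is"], max_new_tokens=32))


if __name__ == "__main__":
    main()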

Signed-off-by: Chenjie Luo <[email protected]>
…/TensorRT-Model-Optimizer into chenjiel/deprecate_trtllm_build_2
@cjluo-nv cjluo-nv requested a review from a team as a code owner September 16, 2025 15:59
@cjluo-nv cjluo-nv requested a review from kaix-nv September 16, 2025 15:59
@meenchen meenchen self-requested a review September 16, 2025 16:02

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
examples/vlm_ptq/scripts/huggingface_example.sh (1)

150-155: Bug: qformat is undefined; test against $QFORMAT and handle comma lists.

Current branch never triggers. Fix and support multiple tokens.

-    if [ "$qformat" == "bf16" ] || [ "$qformat" == "fp16" ]; then
+    # If QFORMAT is exactly a single precision token (no quant), just link HF files.
+    if [[ "$QFORMAT" == "bf16" || "$QFORMAT" == "fp16" ]]; then

If mixed values are possible, split and detect:

-    if [ "$qformat" == "bf16" ] || [ "$qformat" == "fp16" ]; then
+    is_prec_only=false
+    IFS=','; set -- $QFORMAT; [ $# -eq 1 ] && { [ "$1" = "bf16" ] || [ "$1" = "fp16" ]; } && is_prec_only=true; IFS=' '
+    if $is_prec_only; then
examples/llm_ptq/scripts/parser.sh (1)

36-68: Add missing getopt long option num_samples or remove its case arm.
In examples/llm_ptq/scripts/parser.sh (lines 36–68) the case block handles --num_samples but the getopt -l long-options list does not — add num_samples: to the -l string or remove the case handling.

examples/llm_ptq/scripts/huggingface_example.sh (1)

148-159: Bug: uses undefined $qformat; bf16/fp16 passthrough check broken.

This block references $qformat outside any loop and fails for comma-separated QFORMAT.

-if [[ $TASKS =~ "quant" ]] || [[ ! -d "$SAVE_PATH" ]] || [[ ! $(ls -A $SAVE_PATH) ]]; then
-
-    if [ "$qformat" == "bf16" ] || [ "$qformat" == "fp16" ]; then
+if [[ $TASKS =~ "quant" ]] || [[ ! -d "$SAVE_PATH" ]] || [[ ! $(ls -A $SAVE_PATH) ]]; then
+
+    # If QFORMAT is a single precision (bf16/fp16), symlink files instead of re-quantizing
+    if [[ "$QFORMAT" =~ ^(bf16|fp16)$ ]]; then
         if [ -d "$MODEL_PATH" ]; then
             MODEL_CONFIG_EXIST=true
             MODEL_CONFIG=$MODEL_PATH/config.json
-            for file in $MODEL_PATH/*; do ln -sf "$file" $SAVE_PATH/; done
+            mkdir -p "$SAVE_PATH"
+            for file in "$MODEL_PATH"/*; do ln -sf "$file" "$SAVE_PATH/"; done
         else
             echo "Please use the model directory where the config.json file is present."
             exit 1
         fi
     fi
♻️ Duplicate comments (15)
examples/llm_ptq/run_tensorrt_llm.py (1)

83-84: Guard context‑logits call to avoid runtime assert on unsupported versions.

Keep example robust; skip when unsupported instead of asserting inside wrapper.

-    logits = llm.generate_context_logits(input_texts)
-    print(f"Generated logits: {logits}")
+    if hasattr(llm, "generate_context_logits"):
+        try:
+            logits = llm.generate_context_logits(input_texts)
+            print(f"Generated logits: {logits}")
+        except AssertionError as e:
+            print(f"Generated logits: skipped ({e})")
+    else:
+        print("Generated logits: skipped (not supported)")
examples/vlm_ptq/scripts/huggingface_example.sh (4)

37-47: Remove duplicate and divergent QFORMAT validation.

This duplicates the later validator (Lines 65–75) and has a different allowlist. Drop this block to avoid drift.

-#Iterate over list of qformats provided and check if they are supported in HF export path
-IFS=","
-for qformat in $QFORMAT; do
-    case $qformat in
-    fp16 | bf16 | fp8 | fp8_pc_pt | fp8_pb_wo | int4_awq | nvfp4 | nvfp4_awq | w4a8_awq | w4a8_nvfp4_fp8 | w4a8_mxfp4_fp8) ;;
-    *)
-        echo "Unsupported quant argument: Expected one of: [fp16, bf16, fp8, fp8_pc_pt, fp8_pb_wo, int4_awq, nvfp4, nvfp4_awq, w4a8_awq, w4a8_nvfp4_fp8, w4a8_mxfp4_fp8]" >&2
-        exit 1
-        ;;
-    esac
-done
-IFS=" "

182-185: Outdated guidance: references trtllm-build after deprecating TRT backend in examples.

Replace with Torch-backend guidance or explicit “not supported in this example” message.

-        echo "Sparse quantization detected (SPARSITY_FMT=$SPARSITY_FMT). Please deploy with the TRT-LLM using trtllm-build. Checkpoint export_path: $SAVE_PATH"
+        echo "Sparse quantization detected (SPARSITY_FMT=$SPARSITY_FMT). Sparse deployment is not supported in this HF example. Use the TRT-LLM Torch backend per official docs. Checkpoint export_path: $SAVE_PATH"

196-199: Fix comma-list membership check (SC2076) and validate each token.

Current regex-in-string breaks for lists.

-    if [[ ! " fp8 nvfp4 bf16 fp16 int4_awq w4a8_awq " =~ " ${QFORMAT} " ]]; then
-        echo "Quant $QFORMAT not supported with the TensorRT-LLM torch llmapi. Allowed values are: fp8, nvfp4, bf16, fp16, int4_awq, w4a8_awq"
-        exit 0
-    fi
+    unsupported=""
+    IFS=','; for q in $QFORMAT; do
+        case "$q" in fp8|nvfp4|bf16|fp16|int4_awq|w4a8_awq) : ;; * ) unsupported+="${unsupported:+,}$q" ;; esac
+    done; IFS=' '
+    if [ -n "$unsupported" ]; then
+        echo "Quant(s) [$unsupported] not supported with the TensorRT-LLM Torch llmapi. Allowed: fp8, nvfp4, bf16, fp16, int4_awq, w4a8_awq"
+        exit 0
+    fi

337-341: PP may be unset after parser changes; comparison can error.

Drop the check or default PP=1 safely.

-    if [ "$PP" -ne 1 ]; then
-        echo "Benchmark does not work with multi PP. Please run the c++ benchmark in the TensorRT-LLM repo..."
-        exit 1
-    fi
+    # Benchmark assumes single pipeline parallelism.
examples/llm_ptq/hf_ptq.py (2)

92-105: Single source of truth for allowed qformats.

Extract the list into a constant and reuse here and below.

+ALLOWED_UNIFIED_HF_QFORMATS = {
+    "fp8",
+    "int4_awq",
+    "nvfp4",
+    "nvfp4_awq",
+    "w4a8_awq",
+    "fp8_pb_wo",
+    "w4a8_mxfp4_fp8",
+    "nvfp4_mlp_only",
+}
 ...
-    assert all(
-        qformat
-        in [
-            "fp8",
-            "int4_awq",
-            "nvfp4",
-            "nvfp4_awq",
-            "w4a8_awq",
-            "fp8_pb_wo",
-            "w4a8_mxfp4_fp8",
-            "nvfp4_mlp_only",
-        ]
-        for qformat in qformat_list
-    ), "One or more quantization formats provided are not supported for unified checkpoint export"
+    assert all(q in ALLOWED_UNIFIED_HF_QFORMATS for q in qformat_list), (
+        "One or more quantization formats provided are not supported for unified checkpoint export"
+    )

214-229: Fix assert: current or args.kv_cache_qformat in KV_QUANT_CFG_CHOICES makes it always true.

Enforce qformat membership independently.

-    if not args.auto_quantize_bits:
-        assert (
-            args.qformat
-            in [
-                "int4_awq",
-                "fp8",
-                "nvfp4",
-                "nvfp4_awq",
-                "w4a8_awq",
-                "fp8_pb_wo",
-                "w4a8_mxfp4_fp8",
-                "nvfp4_mlp_only",
-            ]
-            or args.kv_cache_qformat in KV_QUANT_CFG_CHOICES
-        ), f"Quantization format {args.qformat} not supported for HF export path"
+    if not args.auto_quantize_bits:
+        assert args.qformat in ALLOWED_UNIFIED_HF_QFORMATS, (
+            f"Quantization format {args.qformat} not supported for HF export path"
+        )
+        assert args.kv_cache_qformat in KV_QUANT_CFG_CHOICES, (
+            f"KV cache quantization format {args.kv_cache_qformat} is not supported"
+        )
modelopt/deploy/llm/generate.py (2)

69-78: Docstring param name is wrong (“engine_dir”); use “checkpoint_dir”.

Aligns with the signature and deprecates engine semantics. This was called out earlier and still needs fixing.

-        Args:
-            engine_dir: the directory path of the TensorRT-LLM engine.
+        Args:
+            checkpoint_dir: Path to the checkpoint/config directory (unified HF/TRT-LLM format).
             tokenizer: the tokenizer. For example, a tokenizer from the Huggingface model.
             kv_cache_config: the kv cache config as a dict. Please refer to
                 https://nvidia.github.io/TensorRT-LLM/performance/performance-tuning-guide/
             medusa_choices: The medusa choices for the decoding config.
             tp: the tensor parallel size (for the torch backend). If 0, it will be set to the number of GPUs.
             trust_remote_code: whether to trust the remote code (for the torch backend).
             max_batch_size: Max batch size for the LLM backend. If 0, it is not specified.

121-126: Honor user-provided kv_cache_config overrides.

Currently hardcoded; apply provided values when present. This was previously flagged and remains.

-        trt_kv_cache_config = TRT_KvCacheConfig(free_gpu_memory_fraction=0.7)
-        trt_kv_cache_config.max_tokens = self._max_seq_len * (
-            max_batch_size if max_batch_size > 0 else 8
-        )
+        trt_kv_cache_config = TRT_KvCacheConfig(free_gpu_memory_fraction=0.7)
+        trt_kv_cache_config.max_tokens = self._max_seq_len * (
+            max_batch_size if max_batch_size > 0 else 8
+        )
+        if kv_cache_config:
+            for k, v in kv_cache_config.items():
+                if hasattr(trt_kv_cache_config, k):
+                    setattr(trt_kv_cache_config, k, v)
examples/llm_ptq/scripts/parser.sh (1)

37-37: Add missing --num_samples to getopt long options.

Currently parsed in case but undeclared in getopt, causing “unrecognized option” errors when used. Mirrors an earlier review; still unresolved.

-  ARGS=$(getopt -o "" -l "model:,quant:,kv_cache_quant:,sparsity:,awq_block_size:,calib:,calib_batch_size:,auto_quantize_bits:,input:,output:,batch:,tasks:,lm_eval_tasks:,lm_eval_limit:,simple_eval_tasks:,trust_remote_code,use_seq_device_map,gpu_max_mem_percentage:,kv_cache_free_gpu_memory_fraction:,low_memory_mode,no-verbose,calib_dataset:" -n "$0" -- "$@")
+  ARGS=$(getopt -o "" -l "model:,quant:,kv_cache_quant:,sparsity:,awq_block_size:,calib:,calib_batch_size:,auto_quantize_bits:,input:,output:,batch:,tasks:,lm_eval_tasks:,lm_eval_limit:,simple_eval_tasks:,num_samples:,trust_remote_code,use_seq_device_map,gpu_max_mem_percentage:,kv_cache_free_gpu_memory_fraction:,low_memory_mode,no-verbose,calib_dataset:" -n "$0" -- "$@")
examples/llm_ptq/scripts/huggingface_example.sh (5)

37-47: Remove duplicate QFORMAT validator (keep single source of truth).

This block duplicates the validator below and drifts in the allowed set. Delete this one and keep the later validator.

-#Iterate over list of qformats provided and check if they are supported in HF export path
-IFS=","
-for qformat in $QFORMAT; do
-    case $qformat in
-    fp16 | bf16 | fp8 | fp8_pc_pt | fp8_pb_wo | int4_awq | nvfp4 | nvfp4_awq | w4a8_awq | w4a8_nvfp4_fp8 | w4a8_mxfp4_fp8) ;;
-    *)
-        echo "Unsupported quant argument: Expected one of: [fp16, bf16, fp8, fp8_pc_pt, fp8_pb_wo, int4_awq, nvfp4, nvfp4_awq, w4a8_awq, w4a8_nvfp4_fp8, w4a8_mxfp4_fp8]" >&2
-        exit 1
-        ;;
-    esac
-done
-IFS=" "

182-185: Update sparse-quant message; remove trtllm-build reference.

Keep messaging consistent with TRT backend deprecation.

-    if [[ "$SPARSITY_FMT" != "dense" ]]; then
-        echo "Sparse quantization detected (SPARSITY_FMT=$SPARSITY_FMT). Please deploy with the TRT-LLM using trtllm-build. Checkpoint export_path: $SAVE_PATH"
-        exit 0
-    fi
+    if [[ "$SPARSITY_FMT" != "dense" ]]; then
+        echo "Sparse quantization detected (SPARSITY_FMT=$SPARSITY_FMT). Not supported in this HF example; use the TRT-LLM Torch backend sparse deployment guide. Checkpoint export_path: $SAVE_PATH"
+        exit 0
+    fi

92-95: Remove deprecated EXPORT_FORMAT from name; add guard for legacy env.

Interpolating ${EXPORT_FORMAT} yields a trailing underscore and empty token.

-MODEL_FULL_NAME=${MODEL_NAME}_${QFORMAT_MODIFIED}${KV_CACHE_QUANT:+_kv_${KV_CACHE_QUANT}}_${EXPORT_FORMAT}
+MODEL_FULL_NAME=${MODEL_NAME}_${QFORMAT_MODIFIED}${KV_CACHE_QUANT:+_kv_${KV_CACHE_QUANT}}

Add an early guard (near argument parsing):

+if [ -n "${EXPORT_FORMAT:-}" ]; then
+  echo "ERROR: EXPORT_FORMAT is deprecated; remove it from your environment." >&2
+  exit 1
+fi

337-341: PP check can error when PP is unset.

Use a safe default or remove the check per deprecation.

-    if [ "$PP" -ne 1 ]; then
+    if [ "${PP:-1}" -ne 1 ]; then
         echo "Benchmark does not work with multi PP. Please run the c++ benchmark in the TensorRT-LLM repo..."
         exit 1
     fi

196-199: Fix QFORMAT membership check (SC2076) and comma-list handling.

Current regex-in-string check is wrong for comma-separated values.

-    if [[ ! " fp8 nvfp4 bf16 fp16 int4_awq w4a8_awq " =~ " ${QFORMAT} " ]]; then
-        echo "Quant $QFORMAT not supported with the TensorRT-LLM torch llmapi. Allowed values are: fp8, nvfp4, bf16, fp16, int4_awq, w4a8_awq"
-        exit 0
-    fi
+    unsupported=""
+    IFS=','; for q in $QFORMAT; do
+        case "$q" in fp8|nvfp4|bf16|fp16|int4_awq|w4a8_awq) : ;; * ) unsupported+="${unsupported:+,}$q" ;; esac
+    done; IFS=' '
+    if [ -n "$unsupported" ]; then
+        echo "Quant(s) [$unsupported] not supported with the TensorRT-LLM Torch llmapi. Allowed: fp8, nvfp4, bf16, fp16, int4_awq, w4a8_awq"
+        exit 0
+    fi
🧹 Nitpick comments (26)
CHANGELOG.rst (2)

9-11: Changelog accuracy: --export_fmt is deprecated/ignored, not removed.

hf_ptq.py still accepts --export_fmt (default "hf") and only warns. Update the note to “Deprecated and ignored in 0.37; removal planned next release.”

Apply:

- - ``--export_fmt`` flag in ``examples/llm_ptq`` is removed. By default we export to the unified Hugging Face checkpoint format.
+ - ``--export_fmt`` flag in ``examples/llm_ptq`` is deprecated and ignored. By default we export to the unified Hugging Face checkpoint format. It will be removed in the next release.

9-11: Add TensorRT-LLM upgrade to 1.1.0rc2 in v0.37 CHANGELOG

Add an entry under v0.37 "New Features" (or "Misc"): "Upgrade TensorRT-LLM dependency to 1.1.0rc2." Evidence: docker/Dockerfile uses nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc2 and modelopt/deploy/llm/generate.py prompts upgrading to 1.1.0rc2 or later.

Follow-ups to verify/clean:

  • .github/CODEOWNERS still references examples/vlm_eval — remove or justify keeping.
  • Update or document remaining uses of --export_fmt and trtllm-build (e.g., examples/llm_ptq/hf_ptq.py, tests/gpu/torch/export/test_unified_hf_export_and_check_safetensors.py, various docs/examples).
examples/llm_sparsity/README.md (1)

151-151: Clarify backend and minimum TRT-LLM version.

Add a note that the linked instructions use the TRT backend; if your repo examples favor the Torch runtime elsewhere, point that out and specify the minimum TRT-LLM version that supports --weight_sparsity.

docker/Dockerfile (1)

1-1: Pin base image/version and reflect in docs.

Consider pinning by digest or an ARG to avoid pulling moving RCs and update install docs accordingly.

Apply:

-FROM nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc2
+ARG TRT_LLM_TAG=1.1.0rc2
+FROM nvcr.io/nvidia/tensorrt-llm/release:${TRT_LLM_TAG}
examples/llm_eval/README.md (3)

96-96: Tighten placeholder: make it clear engine_dir accepts a quantized HF checkpoint.

Suggest renaming the placeholder to <quantized_checkpoint_dir> for consistency across this page.

-python lm_eval_tensorrt_llm.py --model trt-llm --model_args tokenizer=<HF model folder>,engine_dir=<Quantized checkpoint dir> --tasks <comma separated tasks> --batch_size <engine batch size>
+python lm_eval_tensorrt_llm.py --model trt-llm --model_args tokenizer=<HF model folder>,engine_dir=<quantized_checkpoint_dir> --tasks <comma separated tasks> --batch_size <engine batch size>

143-143: Same placeholder fix for MMLU section.

-python mmlu.py --model_name causal --model_path <HF model folder or model card> --engine_dir <Quantized checkpoint dir>
+python mmlu.py --model_name causal --model_path <HF model folder or model card> --engine_dir <quantized_checkpoint_dir>

166-166: Same placeholder fix for MT-Bench section.

-bash run_fastchat.sh -h <HF model folder or model card> <Quantized checkpoint dir>
+bash run_fastchat.sh -h <HF model folder or model card> <quantized_checkpoint_dir>
examples/vlm_ptq/scripts/huggingface_example.sh (1)

286-289: Quote subshell to avoid word splitting when computing JSONL_PATH.

Prevents breakage with spaces/newlines.

-    JSONL_PATH=$(readlink -f $(find $SAVE_PATH -type f -name '*.jsonl'))
+    JSONL_PATH="$(readlink -f "$(find "$SAVE_PATH" -type f -name '*.jsonl' -print -quit)")"
examples/llm_ptq/hf_ptq.py (3)

126-143: Avoid hidden global: auto_quantize reads args.kv_cache_qformat.

Pass KV format explicitly to keep function pure.

-def auto_quantize(
-    model, qformat, auto_quantize_bits, calib_dataloader, calibrate_loop, batch_size=1
-):
+def auto_quantize(
+    model, qformat, auto_quantize_bits, calib_dataloader, calibrate_loop, kv_cache_qformat: str, batch_size: int = 1
+):
 ...
-    enable_quant_kv_cache = args.kv_cache_qformat != "none"
+    enable_quant_kv_cache = kv_cache_qformat != "none"
 ...
-        kv_cache_quant_cfg = getattr(mtq, KV_QUANT_CFG_CHOICES[args.kv_cache_qformat])["quant_cfg"]
+        kv_cache_quant_cfg = getattr(mtq, KV_QUANT_CFG_CHOICES[kv_cache_qformat])["quant_cfg"]

Call site (outside hunk):

-        model = auto_quantize(
+        model = auto_quantize(
             model,
             args.qformat,
             args.auto_quantize_bits,
             calib_dataloader,
             calibrate_loop,
+            args.kv_cache_qformat,
             args.batch_size,
         )

700-705: Sharpen deprecation message and type.

Make the help explicit and warn with DeprecationWarning.

-    parser.add_argument(
+    parser.add_argument(
         "--export_fmt",
         required=False,
         default="hf",
         choices=["tensorrt_llm", "hf"],
-        help="Deprecated. Please avoid using this argument.",
+        help="Deprecated and ignored in 0.37; will be removed next release.",
     )

759-761: Use DeprecationWarning category.

-    if args.export_fmt != "hf":
-        warnings.warn("Deprecated. --export_fmt will be ignored.")
+    if args.export_fmt != "hf":
+        warnings.warn("Deprecated. --export_fmt is ignored and will be removed in the next release.", DeprecationWarning)
examples/llm_ptq/scripts/parser.sh (3)

141-141: Typo in printed field name (num_sample → num_samples).

Avoid confusion with the flag name.

-  echo "num_sample: $NUM_SAMPLES"
+  echo "num_samples: $NUM_SAMPLES"

117-123: Outdated error message still references TRT-LLM deployment.

The repo is deprecating the TRT backend; update the message to reflect the Torch path or keep it backend-agnostic.

-        echo "Sparsity is not compatible with 'awq' quantization for TRT-LLM deployment."
+        echo "Sparsity is not compatible with 'awq' quantization on the torch backend."

92-97: Usage string omits newly supported flags.

Include --num_samples to keep help accurate.

-    echo "Optional args: --sparsity=<SPARSITY_FMT> --awq_block_size=<AWQ_BLOCK_SIZE> --calib=<CALIB_SIZE>"
+    echo "Optional args: --sparsity=<SPARSITY_FMT> --awq_block_size=<AWQ_BLOCK_SIZE> --calib=<CALIB_SIZE> --num_samples=<NUM_SAMPLES>"
modelopt/deploy/llm/generate.py (6)

109-111: Fail fast when no GPU is available (tp resolves to 0).

Avoid passing tensor_parallel_size=0 to the backend.

         if tp < 1:
-            tp = torch.cuda.device_count()
+            tp = torch.cuda.device_count()
+            if tp < 1:
+                raise RuntimeError("No CUDA devices available; torch backend requires at least 1 GPU.")

183-189: Avoid enabling beam search when beam_width == 1.

Minor perf/clarity improvement.

-        sampling_config = SamplingParams(
-            max_tokens=max_new_tokens,
-            use_beam_search=True,
-            best_of=beam_width,
-            stop=stop_words,
-            **kwargs,
-        )
+        use_beam = beam_width > 1
+        sampling_config = SamplingParams(
+            max_tokens=max_new_tokens,
+            use_beam_search=use_beam,
+            best_of=beam_width,
+            stop=stop_words,
+            **kwargs,
+        )

269-279: Fix return type annotation (torch.Tensor).

Use the correct Tensor type.

-    ) -> list[torch.tensor]:
+    ) -> list[torch.Tensor]:

79-83: Harden config.json loading with a clearer error.

Give a precise message when the checkpoint is missing/invalid.

-        with open(Path(checkpoint_dir) / "config.json") as config_file:
-            config = json.load(config_file)
+        cfg_path = Path(checkpoint_dir) / "config.json"
+        try:
+            with open(cfg_path) as config_file:
+                config = json.load(config_file)
+        except FileNotFoundError as e:
+            raise FileNotFoundError(f"config.json not found at: {cfg_path}") from e
+        except json.JSONDecodeError as e:
+            raise ValueError(f"Invalid JSON in {cfg_path}: {e}") from e

112-120: Heuristic MoE detection may misfire.

Scanning keys for “num” and “experts” is brittle; consider reading a known path (e.g., config.get("moe", {}).get("num_experts")) or checking values are ints > 0 before toggling EP/attention DP.
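
A minimal sketch of the stricter check (the "moe"/"num_experts" key path is an assumption about the exported config.json, not a confirmed schema):

```python
def moe_expert_count(config: dict) -> int:
    # Read one known path instead of scanning all keys for "num"/"experts".
    moe_cfg = config.get("moe")
    if not isinstance(moe_cfg, dict):
        return 0
    num_experts = moe_cfg.get("num_experts", 0)
    return num_experts if isinstance(num_experts, int) and num_experts > 0 else 0

# Toggle EP/attention DP only on a positive integer expert count:
# enable_expert_parallel = moe_expert_count(config) > 0
```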


57-66: Remove unused medusa_choices parameter or implement support.

It’s accepted then immediately asserted unsupported; drop it to avoid API confusion.

-        medusa_choices: Any = None,
+        # medusa_choices removed; not supported by torch llmapi
examples/llm_ptq/scripts/huggingface_example.sh (6)

65-75: Finalize allowed quant list and centralize it.

Confirm the authoritative HF-export set and make this the single source of truth. Consider sourcing from a shared file to avoid drift.

-IFS=","
-for qformat in $QFORMAT; do
-    case $qformat in
-    fp8 | fp8_pc_pt | fp8_pb_wo | int8_sq | int4_awq | w4a8_awq | fp16 | bf16 | nvfp4 | nvfp4_awq | w4a8_nvfp4_fp8 | w4a8_mxfp4_fp8) ;;
+IFS=","
+for qformat in $QFORMAT; do
+    case $qformat in
+    fp16|bf16|fp8|fp8_pc_pt|fp8_pb_wo|int4_awq|nvfp4|nvfp4_awq|w4a8_awq|w4a8_nvfp4_fp8|w4a8_mxfp4_fp8) ;;
     *)
-        echo "Unknown quant argument: Expected one of: [fp8, fp8_pc_pt, fp8_pb_wo, int8_sq, int4_awq, w4a8_awq, fp16, bf16, nvfp4, nvfp4_awq, w4a8_nvfp4_fp8, w4a8_mxfp4_fp8]" >&2
+        echo "Unknown quant argument: Expected one of: [fp16, bf16, fp8, fp8_pc_pt, fp8_pb_wo, int4_awq, nvfp4, nvfp4_awq, w4a8_awq, w4a8_nvfp4_fp8, w4a8_mxfp4_fp8]" >&2
         exit 1
         ;;
     esac
 done
 IFS=" "

Optional: move allowed lists to examples/llm_ptq/scripts/quant_formats.sh and source here.
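
A rough sketch of that refactor; the file name quant_formats.sh and the variable ALLOWED_QFORMATS are suggestions, not existing script contents:

```bash
# examples/llm_ptq/scripts/quant_formats.sh (proposed)
ALLOWED_QFORMATS="fp16 bf16 fp8 fp8_pc_pt fp8_pb_wo int4_awq nvfp4 nvfp4_awq w4a8_awq w4a8_nvfp4_fp8 w4a8_mxfp4_fp8"

# huggingface_example.sh would then source it and validate against one list:
source "$(dirname "$0")/quant_formats.sh"
for qformat in ${QFORMAT//,/ }; do
    case " $ALLOWED_QFORMATS " in
    *" $qformat "*) ;;
    *)
        echo "Unknown quant argument: $qformat. Expected one of: [$ALLOWED_QFORMATS]" >&2
        exit 1
        ;;
    esac
done
```

Both this script and the VLM variant could source the same file so the allowed set is defined once.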


176-180: Enc–Dec guidance contradicts deprecating TRT build path.

Linking to the TRT-LLM enc_dec example is fine, but avoid implying that the engine build path is required here. Clarify the Torch backend route or mark it as unsupported in this script.

-        echo "Please continue to deployment with the TRT-LLM enc_dec example, https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/enc_dec. Checkpoint export_path: $SAVE_PATH"
+        echo "Enc–Dec models are not deployed by this HF example. Please follow the TRT-LLM Torch backend enc_dec guide. Checkpoint export_path: $SAVE_PATH"

201-205: Robust boolean handling for RUN_ARGS (avoid executing variable values as commands).

Prefer explicit string checks to avoid issues when vars are unset.

-    if $TRUST_REMOTE_CODE; then
+    if [ "${TRUST_REMOTE_CODE,,}" = "true" ]; then
         RUN_ARGS+=" --trust_remote_code "
     fi
-    python run_tensorrt_llm.py --engine_dir=$SAVE_PATH $RUN_ARGS
+    python run_tensorrt_llm.py --engine_dir="$SAVE_PATH" $RUN_ARGS

285-289: Quote command substitution; handle multiple JSONLs deterministically.

Prevent word-splitting and choose a single file (or collect all).

-    JSONL_PATH=$(readlink -f $(find $SAVE_PATH -type f -name '*.jsonl'))
+    JSONL_PATH="$(find "$SAVE_PATH" -type f -name '*.jsonl' -print0 | xargs -0 -r ls -1t | head -n1)"

297-305: Confirm trtllm-serve is the intended Torch backend server.

If Torch llmapi is the target, verify trtllm-serve supports this flow; otherwise switch to the correct serve entrypoint or gate this task.
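
If gating is the chosen route, a minimal guard could look like this (a sketch; it only checks that the entrypoint exists, not that it supports this checkpoint layout):

```bash
if ! command -v trtllm-serve >/dev/null 2>&1; then
    echo "trtllm-serve not found in this environment; skipping the serve task." >&2
    exit 0
fi
```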


121-135: Consistent boolean parsing across the script.

Multiple places rely on executing “true/false”. Switch to string comparisons for resilience.

-if $TRUST_REMOTE_CODE; then
+if [ "${TRUST_REMOTE_CODE,,}" = "true" ]; then
     PTQ_ARGS+=" --trust_remote_code "
 fi

-if $USE_SEQ_DEVICE_MAP; then
+if [ "${USE_SEQ_DEVICE_MAP,,}" = "true" ]; then
     PTQ_ARGS+=" --use_seq_device_map "
 fi

-if ! $VERBOSE; then
+if [ "${VERBOSE,,}" != "true" ]; then
     PTQ_ARGS+=" --no-verbose "
 fi

Also applies to: 201-205

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0d279f1 and 92e6900.

📒 Files selected for processing (20)
  • CHANGELOG.rst (1 hunks)
  • docker/Dockerfile (1 hunks)
  • examples/llm_eval/README.md (3 hunks)
  • examples/llm_ptq/hf_ptq.py (6 hunks)
  • examples/llm_ptq/modelopt_to_tensorrt_llm.py (0 hunks)
  • examples/llm_ptq/run_tensorrt_llm.py (1 hunks)
  • examples/llm_ptq/scripts/huggingface_example.sh (14 hunks)
  • examples/llm_ptq/scripts/parser.sh (2 hunks)
  • examples/llm_sparsity/README.md (1 hunks)
  • examples/vlm_eval/README.md (0 hunks)
  • examples/vlm_eval/convert_gqa_for_eval.py (0 hunks)
  • examples/vlm_eval/gqa.sh (0 hunks)
  • examples/vlm_eval/model_gqa_loader.py (0 hunks)
  • examples/vlm_eval/requirements-vila.txt (0 hunks)
  • examples/vlm_eval/vlm_eval_utils.py (0 hunks)
  • examples/vlm_ptq/scripts/huggingface_example.sh (3 hunks)
  • examples/vlm_ptq/vlm_run.py (0 hunks)
  • examples/vlm_ptq/vlm_visual_engine.py (0 hunks)
  • modelopt/deploy/llm/generate.py (8 hunks)
  • tests/examples/llm_ptq/test_llm_ptq.py (0 hunks)
💤 Files with no reviewable changes (10)
  • examples/vlm_eval/requirements-vila.txt
  • examples/vlm_ptq/vlm_visual_engine.py
  • tests/examples/llm_ptq/test_llm_ptq.py
  • examples/vlm_eval/vlm_eval_utils.py
  • examples/vlm_eval/convert_gqa_for_eval.py
  • examples/vlm_eval/README.md
  • examples/vlm_eval/gqa.sh
  • examples/vlm_eval/model_gqa_loader.py
  • examples/llm_ptq/modelopt_to_tensorrt_llm.py
  • examples/vlm_ptq/vlm_run.py
🧰 Additional context used
🧬 Code graph analysis (2)
examples/llm_ptq/run_tensorrt_llm.py (1)
modelopt/deploy/llm/generate.py (1)
  • generate_context_logits (264-292)
modelopt/deploy/llm/generate.py (1)
examples/llm_eval/lm_eval_tensorrt_llm.py (1)
  • TRTLLM (41-188)
🪛 Shellcheck (0.10.0)
examples/llm_ptq/scripts/huggingface_example.sh

[warning] 196-196: Remove quotes from right-hand side of =~ to match as a regex rather than literally.

(SC2076)


[warning] 288-288: Quote this to prevent word splitting.

(SC2046)

🔇 Additional comments (6)
examples/llm_ptq/hf_ptq.py (1)

585-614: Confirm export routing rules.

For “t5/bart/whisper or sparsity!=dense” you still export TRT-LLM checkpoints. Ensure docs/tutorials reflect that these models require TRT-LLM export (not unified HF) and that CI covers this path.
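
For reference, the routing under review reduces to roughly the following (a paraphrase for documentation purposes, not the literal code in hf_ptq.py):

```python
def choose_export_path(model_type: str, sparsity_fmt: str) -> str:
    """Simplified restatement of the export routing rule being discussed."""
    if model_type in ("t5", "bart", "whisper") or sparsity_fmt != "dense":
        return "tensorrt_llm_checkpoint"  # checkpoint export only, no engine build
    return "unified_hf"  # default unified Hugging Face export

assert choose_export_path("t5", "dense") == "tensorrt_llm_checkpoint"
assert choose_export_path("llama", "sparsegpt") == "tensorrt_llm_checkpoint"
assert choose_export_path("llama", "dense") == "unified_hf"
```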

examples/vlm_ptq/scripts/huggingface_example.sh (1)

373-375: Confirm trtllm-bench flag usage (--model vs --model_path)

Script passes both --model $MODEL_PATH and --model_path $SAVE_PATH; verification failed (trtllm-bench not available). Run trtllm-bench --help locally to confirm which flag expects the model/engine vs a save/checkpoint path and update the script to use only the correct flag.
Location: examples/vlm_ptq/scripts/huggingface_example.sh:373-375

modelopt/deploy/llm/generate.py (1)

24-36: ImportError message LGTM.

Clear guidance on required tensorrt-llm version.

examples/llm_ptq/scripts/huggingface_example.sh (3)

236-246: Output paths refactor to SAVE_PATH looks good.


317-322: Good: results consolidated under SAVE_PATH with stable naming.


342-375: Benchmark path updates look consistent with SAVE_PATH and Torch backend.

--pyt_ckpt_path=$MODEL_PATH \
--export_path=$SAVE_PATH \
--sparsity_fmt=$SPARSITY_FMT \
--qformat=$QFORMAT \

⚠️ Potential issue

MODEL_FULL_NAME includes the deprecated ${EXPORT_FORMAT} and produces a trailing underscore.

Remove ${EXPORT_FORMAT} and optionally fail fast if set.

-MODEL_FULL_NAME=${MODEL_NAME}_${QFORMAT_MODIFIED}${KV_CACHE_QUANT:+_kv_${KV_CACHE_QUANT}}_${EXPORT_FORMAT}
+MODEL_FULL_NAME=${MODEL_NAME}_${QFORMAT_MODIFIED}${KV_CACHE_QUANT:+_kv_${KV_CACHE_QUANT}}
+# Optional guard for legacy env:
+[ -n "${EXPORT_FORMAT:-}" ] && { echo "EXPORT_FORMAT is deprecated and unused"; exit 1; }

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In examples/vlm_ptq/scripts/huggingface_example.sh around line 92,
MODEL_FULL_NAME is built using the deprecated ${EXPORT_FORMAT}, producing a
trailing underscore; remove ${EXPORT_FORMAT} from the MODEL_FULL_NAME assembly
and update any variable expansions to avoid leaving stray underscores, and add a
fast-fail check near script start that exits with an error if EXPORT_FORMAT is
set (or non-empty) so the deprecated variable is not used unintentionally.

Comment on lines +61 to 66
kv_cache_config: dict[str, int | float] = {},
medusa_choices: Any = None,
tp: int = 0,
trust_remote_code: bool = False,
max_batch_size: int = 0,
):

⚠️ Potential issue

Avoid mutable default for kv_cache_config.

Use None and initialize inside to prevent cross-instance state leaks.

-        kv_cache_config: dict[str, int | float] = {},
+        kv_cache_config: dict[str, int | float] | None = None,
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
kv_cache_config: dict[str, int | float] = {},
medusa_choices: Any = None,
tp: int = 0,
trust_remote_code: bool = False,
max_batch_size: int = 0,
):
kv_cache_config: dict[str, int | float] | None = None,
medusa_choices: Any = None,
tp: int = 0,
trust_remote_code: bool = False,
max_batch_size: int = 0,
):
🤖 Prompt for AI Agents
In modelopt/deploy/llm/generate.py around lines 61 to 66, the function parameter
kv_cache_config uses a mutable default ({}), which can leak state across calls;
change the signature to accept None (e.g., kv_cache_config: Optional[dict[str,
int | float]] = None) and inside the function set kv_cache_config = {} if it's
None, preserving the same type and behavior while avoiding shared mutable
defaults.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (8)
tests/examples/speculative_decoding/test_medusa.py (1)

32-44: HF helper and CLI out of sync — hf_ptq.py still exposes --export_fmt; sync required.

  • examples/llm_ptq/hf_ptq.py: parser.add_argument("--export_path") at line 655; "--export_fmt" still present (~line 691) with deprecation warning (~line 751).
  • Verification did not detect --pyt_ckpt_path or --qformat in the check. Either remove the deprecated --export_fmt from hf_ptq.py to complete the HF-only change, or update the helper to pass --export_fmt and ensure --pyt_ckpt_path and --qformat are exposed.
examples/speculative_decoding/example.ipynb (4)

295-343: Remove TRT‑LLM conversion/build sections; they contradict this PR’s deprecation.

Notebook still converts to TRT‑LLM and calls trtllm-build. Per PR: “remove usage of TRT‑LLM’s TRT backend in repository examples” and “delete trtllm-build and related wrappers.” Keep this example fully HF/vLLM/torch-only.

Apply this diff to drop the deprecated flow and replace with a simple HF inference example:

-Then convert the Unified ckeckpoint to TRTLLM checkpoint.
-
-!python TensorRT-LLM/examples/eagle/convert_checkpoint.py --model_dir /tmp/hf_ckpt --output_dir /tmp/trtllm_ckpt --num_eagle_layers 5 --max_non_leaves_per_layer 4 --max_draft_len 25 --dtype float16
-
-Last, build a TensorRT-LLM engine.
-
-!trtllm-build --checkpoint_dir /tmp/trtllm_ckpt --output_dir /tmp/trtllm_engine --gemm_plugin float16 --use_paged_context_fmha enable --speculative_decoding_mode eagle  --max_batch_size 4
-
-To run the EAGLE engine, please refer to [TensorRT-LLM/examples/eagle](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/eagle):
-
-!python ../run.py --engine_dir /tmp/trtllm_engine \
-                  --tokenizer_dir /tmp/eagle_fp8_qat \
-                  --max_output_len=100 \
-                  --eagle_choices="[[0],[1],[2],[3],[0,0],[0,1],[0,2],[1,0],[1,1],[2,0],[2,1],[3,0],[0,0,0],[0,0,1],[0,0,2],[0,1,0],[0,1,1],[0,2,0],[0,2,1],[1,0,0],[0,0,0,0],[0,0,0,1],[0,0,0,2],[0,0,0,0,0],[0,0,0,0,1]]" \
-                  --temperature 1.0 \
-                  --input_text "Once upon"
+Run a quick HF inference from the quantized (or QAT) checkpoint:
+
+```python
+import torch, transformers
+tok = transformers.AutoTokenizer.from_pretrained("/tmp/eagle_fp8_qat")
+mdl = transformers.AutoModelForCausalLM.from_pretrained("/tmp/eagle_fp8_qat", torch_dtype="auto", device_map="auto")
+out = mdl.generate(**tok("Once upon", return_tensors="pt").to(mdl.device), max_new_tokens=100, temperature=1.0)
+print(tok.decode(out[0], skip_special_tokens=True))
+```

If you plan to showcase TRT‑LLM’s torch backend later, add a “TODO” cell pointing to its docs, without invoking trtllm-build.


213-219: Avoid mutating global default quant config.

FP8_DEFAULT_CFG is likely shared; in-place edits can leak globally. Deep‑copy before modifying.

Apply this diff:

+import copy
 ...
-quant_cfg = getattr(mtq, "FP8_DEFAULT_CFG")
+quant_cfg = copy.deepcopy(getattr(mtq, "FP8_DEFAULT_CFG"))
quant_cfg["quant_cfg"]["*output_quantizer"] = {
    "num_bits": (4, 3),
    "axis": None,
    "enable": True,
}

220-221: Duplicate argument passed to create_forward_loop.

Likely signature is create_forward_loop(dataloader, ...); passing it positionally and by name can raise “multiple values for argument ‘dataloader’.”

Apply this diff:

-calibrate_loop = dataset_utils.create_forward_loop(calib_dataloader, dataloader=calib_dataloader)
+calibrate_loop = dataset_utils.create_forward_loop(calib_dataloader)

295-343: Sweep and replace legacy TRT‑LLM / trtllm-build references (repo‑wide)

Multiple files still reference trtllm-build / TRT‑LLM / --engine_dir / speculative_decoding_mode; update or mark as legacy and point to the current export/deploy workflow. Affected files found:

  • examples/speculative_decoding/example.ipynb (convert_checkpoint.py, trtllm-build, run.py --engine_dir)
  • examples/speculative_decoding/README.md (external TRT‑LLM doc link)
  • examples/llm_ptq/README.md, run_tensorrt_llm.py, scripts/huggingface_example.sh (trtllm-build guidance, --engine_dir calls, sparse-quantization message)
  • examples/llm_eval/run_fastchat.sh and examples/llm_eval/README.md (--engine_dir usage)
  • examples/windows/accuracy_benchmark/trtllm_utils.py and README.md (--engine_dir / trt-llm examples)
  • examples/vlm_ptq/utils.py (--engine_dir)
  • examples/llm_sparsity/README.md (trtllm-build guidance)

Action: remove or update these references to the current recommended workflow (or clearly label them as legacy and link to migration steps).
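
One way to enumerate the remaining references before editing (a ripgrep sketch; adjust patterns and paths as needed):

```bash
rg -n --hidden \
  -e 'trtllm-build' \
  -e 'engine_dir' \
  -e 'speculative_decoding_mode' \
  examples/ docs/
```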

examples/llm_ptq/README.md (2)

83-104: Remove trtllm-build and legacy TRT backend export from examples docs.

This README still instructs users to export legacy TRT‑LLM checkpoints and run trtllm-build, which contradicts this PR’s goal to deprecate the TRT backend in examples. Recommend deleting these legacy sections or moving them to a clearly marked deprecated appendix, and pointing users to unified HF export + TRT‑LLM PyTorch LLM API instead.

Apply this doc diff to excise legacy guidance and add a short deprecation note:

@@
-#### (Legacy) TensorRT-LLM Checkpoints
-...
-After the TensorRT-LLM checkpoint export, you can use the `trtllm-build` build command to build the engines from the exported checkpoints. Please check the [TensorRT-LLM Build API](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/architecture/workflow.md#build-apis) documentation for reference.
+> Deprecated: Legacy TensorRT-LLM (C++ backend) checkpoint export and engine build are no longer covered by these examples. Use unified Hugging Face checkpoints and deploy via TRT-LLM’s PyTorch backend (LLM API), vLLM, or SGLang.

Also applies to: 369-391


346-368: Standardize on "NVFP4" and update the TensorRT‑LLM version note to v0.17.0

Replace all "FP4" occurrences with "NVFP4" (or define "FP4 = NVFP4" at first mention) in examples/llm_ptq/README.md (lines 346–368; also 110–121, 132–133). Update the footnote to: "NVFP4 inference requires Blackwell GPUs and TensorRT‑LLM v0.17.0 or later." Verify the vLLM/SGLang support-matrix rows match vendor docs (vendor docs use "NVFP4" and NVFP4 support was added in TRT‑LLM v0.17.0).

examples/llm_qat/notebooks/QAT_QAD_Walkthrough.ipynb (1)

687-695: Fix shell working-dir bug and use absolute model path in conversion step.

cd .. will not persist across %%sh cells and likely breaks the relative path. Also prefer the explicit container‑mounted path for the model.

Apply:

 %%sh # [run in TensorRT-LLM container]
 
 # set export path for converted checkpoints. The script saves the converted checkpoint in ${ROOT_SAVE_PATH}/saved_models_${MODEL_FULL_NAME}
 export ROOT_SAVE_PATH=/app/tensorrt_llm
 
-# run conversion script
-cd ..
-bash TensorRT-Model-Optimizer/examples/llm_ptq/scripts/huggingface_example.sh --model $(pwd)/qat/checkpoint-450/ --quant nvfp4
+# run conversion script (no cd; use absolute, volume-mounted model path)
+bash TensorRT-Model-Optimizer/examples/llm_ptq/scripts/huggingface_example.sh \
+  --model /app/tensorrt_llm/qat/checkpoint-450 \
+  --quant nvfp4
♻️ Duplicate comments (1)
tests/examples/llm_ptq/test_llm_ptq.py (1)

113-114: Set TP explicitly so multi‑GPU is actually exercised (tp=2).

Defaults often leave TP=1; make intent explicit.

Apply this diff:

-        PTQCommand(quant="nvfp4", min_gpu=2),
+        PTQCommand(quant="nvfp4", min_gpu=2, tp=2),
🧹 Nitpick comments (21)
tests/examples/llm_ptq/test_llm_ptq.py (1)

94-96: Auto‑quant combos include nvfp4; gate on SM100 or split matrix.

Without gating, CI on SM89 may select nvfp4 paths depending on heuristics. Add min_sm=100 to nvfp4‑containing cases (or split into SM‑specific param sets).

Apply this diff:

 PTQCommand(
-            quant="nvfp4,fp8",
+            quant="nvfp4,fp8", min_sm=100,
             kv_cache_quant="fp8",
             calib_batch_size=4,
             auto_quantize_bits=6.4,
 ),
 PTQCommand(
-            quant="nvfp4,fp8",
+            quant="nvfp4,fp8", min_sm=100,
             kv_cache_quant="nvfp4",
             calib_batch_size=4,
             auto_quantize_bits=6.4,
 ),

If auto‑quant logic already filters by capability, feel free to skip—but please confirm by pointing to the selection code.

Also applies to: 100-104

tests/examples/speculative_decoding/test_medusa.py (2)

68-68: Add a quick artifact smoke‑check after PTQ.

Assert the HF export actually produced quantized files; catches silent no‑ops and future CLI drift.

Apply:

-    _run_hf_ptq(medusa_path, tmp_path / "medusa-tinyllama-hf", "fp8")
+    hf_export = tmp_path / "medusa-tinyllama-hf"
+    _run_hf_ptq(medusa_path, hf_export, "fp8")
+    assert hf_export.exists(), "HF PTQ export directory not created"
+    assert any(hf_export.glob("*.safetensors")), "Expected .safetensors in HF export"

22-28: Avoid session‑wide pip installs in tests; prefer skip/xfail guard.

The autouse fixture mutates the env and can break offline CI. Recommend skipping when transformers≥4.50 instead of installing a different version at runtime.

Replace the fixture with a module‑level skip:

-# TODO: Medusa QAT FSDP test hangs if transformers>=4.50
-@pytest.fixture(scope="session", autouse=True)
-def install_transformers_lt_4_50():
-    subprocess.run(
-        ["pip", "install", "transformers<4.50"],
-        check=True,
-    )
+# TODO: Medusa QAT FSDP test hangs if transformers>=4.50 — skip to keep CI hermetic
+try:
+    import transformers  # type: ignore
+    _ver = tuple(int(x) for x in transformers.__version__.split(".")[:2])
+except Exception:
+    _ver = None
+pytestmark = pytest.mark.skipif(_ver is None or _ver >= (4, 50),
+                                reason="Known hang with transformers>=4.50; skip until fixed")
examples/speculative_decoding/example.ipynb (6)

17-18: Fix wording to match new default (HF), not “unified export format”.

The markdown still says “export to unified export format,” which conflicts with the PR objective “export to Hugging Face (HF) format by default.”

Apply this diff to align the text:

-First, quantize the base model (Llama-3.2-1B-Instruct) into FP8 and export to unified export format.
+First, quantize the base model (Llama-3.2-1B-Instruct) into FP8 and export to Hugging Face (HF) format (default).

58-59: Don’t hardcode API key in example.

Remove --api-key token-abc123 or document using an env var. Hardcoding encourages bad practice.

Apply this diff:

-!vllm serve /tmp/llama3.2_1B_fp8 --api-key token-abc123 --port 8000  --tensor-parallel-size 1 --quantization=modelopt
+!vllm serve /tmp/llama3.2_1B_fp8 --port 8000 --tensor-parallel-size 1 --quantization=modelopt
+# If auth is desired:
+# VLLM_API_KEY=$YOUR_TOKEN vllm serve ... --api-key "$VLLM_API_KEY"

129-136: Import path robustness for local utilities.

from speculative_decoding.eagle_utils ... assumes notebook CWD makes the package importable. Add a short path guard to avoid ImportError when run from repo root.

Apply this diff at the top of the cell:

+import os, sys
+nb_dir = os.path.dirname(os.getcwd()) if os.path.basename(os.getcwd()) == "speculative_decoding" else os.getcwd()
+if os.path.isdir(os.path.join(nb_dir, "speculative_decoding")) and nb_dir not in sys.path:
+    sys.path.insert(0, nb_dir)
 from speculative_decoding.eagle_utils import DataCollatorWithPadding, LazySupervisedDataset

187-188: Grammar: “an EAGLE model”.

Minor copy edit.

-Now we have a EAGLE model in BF16 format. Next, we quantize this model into FP8 (PTQ).
+Now we have an EAGLE model in BF16 format. Next, we quantize this model into FP8 (PTQ).

231-232: Hyphenate “fine-tune”.

Minor copy edit.

-To maintain the accuracy, we need to finetune the model (QAT).
+To maintain accuracy, we need to fine-tune the model (QAT).

295-296: Typo: “checkpoint”.

If you keep the section title (even after removing TRT‑LLM steps), fix the typo.

-Then convert the Unified ckeckpoint to TRTLLM checkpoint.
+Then convert the unified checkpoint...
examples/llm_ptq/README.md (5)

35-38: Clarify TRT‑LLM dependency as optional and prefer PyTorch backend wording.

Reword to emphasize TRT‑LLM install is optional and, when used, the PyTorch backend (LLM API) is the supported path in examples. Avoid implying C++ backend is required here.

-If you want to deploy the quantized model on TRT-LLM, you will also need to install the TRT-LLM dependencies as per the ...
-For NeMo models, use the NeMo container `nvcr.io/nvidia/nemo:25.04` or later which has all the dependencies including TRT-LLM installed.
+If you plan to deploy via TRT‑LLM, optionally install TRT‑LLM and use its PyTorch backend (LLM API) as per the ...
+For NeMo models, use the NeMo container `nvcr.io/nvidia/nemo:25.04` or later; install TRT‑LLM only if you deploy with its PyTorch backend.

151-153: Fix duplicated word.

“the the final …” → “the final …”.

-... so that
-the the final mixed precision quantized model has ...
+... so that
+the final mixed precision quantized model has ...

280-286: Grammar: “supports provide two paths”.

Tighten the sentence.

-Model Optimizer supports provide two paths to export the quantized model:
+Model Optimizer provides two paths to export the quantized model:

270-276: Broaden section title; remove TRT‑specific framing.

These benchmarks are framework‑agnostic. Rename to “Accuracy Validation” and avoid implying TRT‑only.

-### TensorRT-LLM Validation
+### Accuracy Validation

241-247: Nice practical notes; minor casing consistency.

Use “Hugging Face” consistently (not “Huggingface”).

-> *If the Huggingface model calibration fails ...
+> *If the Hugging Face model calibration fails ...
-> *Huggingface models trained with ...
+> *Hugging Face models trained with ...
-> *... for the Huggingface tokenizer ...
+> *... for the Hugging Face tokenizer ...

Also applies to: 248-253

examples/llm_qat/notebooks/QAT_QAD_Walkthrough.ipynb (7)

387-390: Run calibration without grads and in eval mode.

Saves memory and prevents autograd graph buildup.

Apply:

-def forward_loop(model):
-    for data in data_loader:
-        model(**data)
+def forward_loop(model):
+    model.eval()
+    with torch.inference_mode():
+        for data in data_loader:
+            model(**data)

16-19: Fix QAT definition and grammar.

QAT learns quantization effects during training (not “post‑training”).

Apply:

-**Quantization Aware Training (QAT)** is a method that learn the effects of quantization during neural network post-training to preserve accuracy when deploying models in very-low-precision formats.
+**Quantization Aware Training (QAT)** is a method that learns the effects of quantization during training to preserve accuracy when deploying models in low‑precision formats.
@@
-This notebook demonstrates how to apply Quantization Aware Training (QAT) to an LLM, Qwen3-8b in this example, with NVIDIA's TensorRT Model Optimizer (ModelOpt) QAT toolkit. We walk through downloading and loading the model, calibrates on a small eval subset, applying NVFP4 quantization and finally deploying the quantized model to TensorRT-LLM.
+This notebook demonstrates how to apply QAT to an LLM (Qwen3‑8B here) with NVIDIA’s TensorRT Model Optimizer (ModelOpt) QAT toolkit. We walk through downloading and loading the model, calibrating on a small eval subset, applying NVFP4 quantization, and finally deploying the quantized model with TensorRT‑LLM.

26-27: Typo and wording fixes (user‑facing text).

Polish for readability and consistency; aligns with PR copy edits.

Apply:

-## Installing Prerequisites and Dependancies
+## Installing Prerequisites and Dependencies
@@
-If you haven't already, install the required dependencies for this notebook. Key dependancies include:
+If you haven't already, install the required dependencies for this notebook. Key dependencies include:
@@
-This repo contains a `examples/llm_qat/notebooks/requirements.txt` file that can be used to install all required dependancies.
+This repo contains an `examples/llm_qat/notebooks/requirements.txt` file that can be used to install all required dependencies.
@@
-## Setting HuggingFace Token and Model for Download (Optional)
+## Setting Hugging Face token and model for download (optional)
@@
-include you token
+include your token
@@
-These can be controled by
+These can be controlled by
@@
-# Deploying the QAT Model with TensorRT-LLM
+# Deploying the QAT model with TensorRT‑LLM (Torch backend)
@@
-## Exporting Quantized Model for deployment
-Before deploying the model with TensorRT-LLM you will need to export the model checkpoint files. This is similar to the step you take for a quantized PTQ Model. To export the unified Hugging Face checkpoints, which can be deployed on TensorRT-LLM Pytorch, vLLM and SGLang you will need to run the [huggingface_example.sh](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/examples/llm_ptq/scripts/huggingface_example.sh) script found in the TensorRT Model Optimizer repo. 
+## Exporting quantized model for deployment
+Before deploying with TensorRT‑LLM you need to export the model checkpoint files. To export unified Hugging Face checkpoints (deployable on TensorRT‑LLM Torch, vLLM, and SGLang), run the [huggingface_example.sh](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/examples/llm_ptq/scripts/huggingface_example.sh) script from the TensorRT Model Optimizer repo.
@@
-## Stop the TensorRT-LLM Docker contrainer
+## Stop the TensorRT‑LLM Docker container

Also applies to: 34-41, 61-61, 69-69, 417-417, 574-575, 625-627, 802-802


73-83: Avoid embedding secrets in notebooks; prefer interactive Hugging Face login.

Safer UX and fewer accidental leaks.

Apply:

-%env HF_TOKEN=<YOUR_HUGGINGFACE_TOKEN>
+python - <<'PY'
+from huggingface_hub import login
+login()  # Prompts securely; stores token in keyring or HF cache
+PY

724-725: TP size should reflect available GPUs; make it adaptive.

Hardcoding --tp_size 8 can fail on smaller hosts.

Apply:

-  --max_seq_len 4096 --tp_size 8 --pp_size 1 \
+  --max_seq_len 4096 --tp_size "${TP_SIZE:-$(nvidia-smi -L | wc -l)}" --pp_size 1 \

And document overriding TP_SIZE when needed.


720-727: Avoid hardcoding the model output directory — use runtime discovery (CLI flags confirmed)

Replace the hardcoded path with a discovery step; confirmed trtllm-serve flags are --max_seq_len, --max_batch_size, --max_num_tokens (TensorRT‑LLM 1.1.x, Torch backend).

Apply:

 %%sh # [run in TensorRT-LLM container]
 
-trtllm-serve /app/tensorrt_llm/saved_models_checkpoint-450_nvfp4_hf/  \
+# discover the most recent NVFP4 export
+MODEL_OUT_DIR="$(ls -dt /app/tensorrt_llm/saved_models_*_nvfp4* 2>/dev/null | head -n 1)"
+test -n "$MODEL_OUT_DIR" || { echo "No exported model found under /app/tensorrt_llm"; exit 1; }
+trtllm-serve "$MODEL_OUT_DIR"  \
   --max_batch_size 1 --max_num_tokens 1024 \
   --max_seq_len 4096 --tp_size 8 --pp_size 1 \
   --host 0.0.0.0 --port 8000 \
   --kv_cache_free_gpu_memory_fraction 0.95

612-618: Parameterize TRT‑LLM container tag; don't hard‑code an RC

Replace the hard‑coded image in examples/llm_qat/notebooks/QAT_QAD_Walkthrough.ipynb (lines 612–618) with an env var (TRTLLM_IMAGE) so users can override; as of 2025-09-17 the latest NGC tag for TensorRT‑LLM (Torch backend) is 1.1.0rc3 (modified 2025-09-04) — update the notebook/docs to recommend the preferred tag or leave no default.
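
A possible shape for the notebook cell (a sketch; the default tag shown is only an example, and it assumes the NVIDIA Container Toolkit is installed for --gpus):

```bash
# Let users override the container tag instead of hard-coding an RC.
TRTLLM_IMAGE="${TRTLLM_IMAGE:-nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc2}"
docker run --rm --gpus all -it "$TRTLLM_IMAGE" bash
```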

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a355999 and 6a47146.

📒 Files selected for processing (10)
  • .github/CODEOWNERS (0 hunks)
  • docs/source/getting_started/_installation_for_Linux.rst (1 hunks)
  • examples/llm_ptq/README.md (3 hunks)
  • examples/llm_ptq/hf_ptq.py (5 hunks)
  • examples/llm_qat/notebooks/QAT_QAD_Walkthrough.ipynb (1 hunks)
  • examples/speculative_decoding/example.ipynb (1 hunks)
  • examples/vlm_ptq/README.md (1 hunks)
  • tests/examples/llm_ptq/test_llm_ptq.py (1 hunks)
  • tests/examples/speculative_decoding/test_medusa.py (2 hunks)
  • tests/gpu/torch/export/test_unified_hf_export_and_check_safetensors.py (0 hunks)
💤 Files with no reviewable changes (2)
  • .github/CODEOWNERS
  • tests/gpu/torch/export/test_unified_hf_export_and_check_safetensors.py
✅ Files skipped from review due to trivial changes (2)
  • docs/source/getting_started/_installation_for_Linux.rst
  • examples/vlm_ptq/README.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • examples/llm_ptq/hf_ptq.py
🧰 Additional context used
🧬 Code graph analysis (1)
tests/examples/llm_ptq/test_llm_ptq.py (1)
tests/_test_utils/ptq_utils.py (3)
  • PTQCommand (28-87)
  • param_str (82-87)
  • run (44-80)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: linux
  • GitHub Check: wait-checks / wait
  • GitHub Check: code-quality
  • GitHub Check: build-docs
🔇 Additional comments (9)
tests/examples/llm_ptq/test_llm_ptq.py (4)

109-109: SM‑bounded case looks good.

w4a8_awq constrained to SM89 only is clear and safe.


110-110: Correct: nvfp4 gated to SM100.

This avoids accidental runs on unsupported GPUs.


117-118: LGTM: simpler invocation via TINY_LLAMA_PATH.

Fixture removal and direct constant use are fine.


79-80: Keep explicit 'none' — code expects that literal to mean “no KV-cache quant.” examples/llm_ptq/hf_ptq.py checks args.kv_cache_qformat != 'none' (around lines 246 and 454), so tests using kv_cache_quant='none' are correct; do not remove.

examples/speculative_decoding/example.ipynb (3)

26-26: LGTM: dropping --export_fmt aligns with new default HF export.

This change matches the PR’s “HF by default” direction. No functional concerns.


9-11: Fix typos in markdown.

  • “medule” → “module”.
    Apply this diff:
-The speculative decoding medule needs to learn to predict tokens from the base model. Therefore, we need to prepare the data generated from the base model.
+The speculative decoding module needs to learn to predict tokens from the base model. Therefore, we need to prepare the data generated from the base model.

75-77: Path/availability note for prepare_data.sh.

Ensure this script exists in repo or provide a link; otherwise, the cell will fail.

If it’s in this example folder, consider making the call relative:

-!bash prepare_data.sh --data_path /tmp/Daring-Anteater/train.jsonl --output_path /tmp/finetune/data.jsonl --max_token 2048
+!bash examples/speculative_decoding/prepare_data.sh --data_path /tmp/Daring-Anteater/train.jsonl --output_path /tmp/finetune/data.jsonl --max_token 2048
examples/llm_ptq/README.md (2)

256-259: Avoid steering users to legacy export in DeepSeek example.

Change “export to TensorRT‑LLM” to “export unified HF checkpoints and deploy with TRT‑LLM/vLLM/SGLang”.

-[PTQ for DeepSeek](../deepseek/README.md) shows how to quantize the DeepSeek model with FP4 and export to TensorRT-LLM.
+[PTQ for DeepSeek](../deepseek/README.md) shows how to quantize the DeepSeek model with FP4 and export unified HF checkpoints for deployment with TRT‑LLM/vLLM/SGLang.

206-206: Update README to match parser's task list and lm_eval defaults. examples/llm_ptq/scripts/parser.sh defines VALID_TASKS=("quant" "mmlu" "mtbench" "lm_eval" "livecodebench" "simple_eval"), default TASKS="quant", and LM_EVAL_TASKS="mmlu,gsm8k" (--lm_eval_tasks is supported). README (examples/llm_ptq/README.md, around line 206) currently lists only "quant,mmlu,lm_eval,livecodebench" — add "mtbench" and "simple_eval" and optionally document the LM_EVAL_TASKS default.

Likely an incorrect or invalid review comment.

> *If a GPU OOM error occurs during model quantization despite sufficient memory, setting the --use_seq_device_map flag can help. This enforces sequential device mapping, distributing the model across GPUs and utilizing up to 80% of each GPU's memory.*
> *You can now add `--low_memory_mode` to the command when setting `--export_fmt=hf` to lower the memory requirements of the PTQ process. With this mode, the script will compress model weights to low precision before calibration. This mode is only supported for FP8 and NVFP4 with max calibration.*
> *You can add `--low_memory_mode` to the command to lower the memory requirements of the PTQ process. With this mode, the script will compress model weights to low precision before calibration. This mode is only supported for FP8 and NVFP4 with max calibration.*

⚠️ Potential issue

🧩 Analysis chain

Ensure --low_memory_mode is implemented and gated as documented.

README states it’s only supported for FP8/NVFP4 with max calibration. Verify hf_ptq.py enforces this and that weight compression happens before calibration.


🏁 Script executed:

#!/usr/bin/env bash
# Confirm flag definition and usage constraints
rg -n -C2 --type=py 'low_memory_mode|--low_memory_mode' examples/llm_ptq/hf_ptq.py
rg -n -C3 --type=py 'calibrate\(|mtq\.calibrate|init_quantized_weights|NVFP4|FP8' examples/llm_ptq/hf_ptq.py

Length of output: 2446


Enforce --low_memory_mode gating and compress weights before calibration

README says low_memory_mode is only supported for FP8 / NVFP4 with max calibration. The flag is defined and referenced but there is no explicit enforcement or obvious compression-before-calibration:

  • examples/llm_ptq/hf_ptq.py:730–731 (parser.add_argument for --low_memory_mode) and 232 (if not args.low_memory_mode).
  • Calibration calls: 138–142 (mtq.calibrate(..., algorithm="max")) and 186–189 (mtq.calibrate / mtq.quantize branch).
  • init_quantized_weights context: 252–255 (used around model loading/calibration).

Action items:

  • Validate and error out if args.low_memory_mode is set while the chosen quant config is not FP8 or NVFP4, or the calibration algorithm is not "max" (see the sketch after this list).
  • Ensure weight compression path for low_memory_mode runs before any mtq.calibrate / mtq.quantize calls (move compression into the HF checkpoint load path used when args.low_memory_mode is True).
  • Add a small test or assertion exercising low_memory_mode to fail fast on unsupported configs.
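
A minimal sketch of the first action item, assuming the existing argparse names (low_memory_mode, qformat, auto_quantize_bits) and treating auto-quantize as a proxy for "not max calibration":

```python
LOW_MEMORY_SUPPORTED_QFORMATS = {"fp8", "nvfp4"}

def validate_low_memory_mode(args) -> None:
    """Fail fast on configurations the README documents as unsupported."""
    if not args.low_memory_mode:
        return
    if args.qformat not in LOW_MEMORY_SUPPORTED_QFORMATS:
        raise ValueError(
            f"--low_memory_mode only supports FP8/NVFP4, got qformat={args.qformat!r}"
        )
    if getattr(args, "auto_quantize_bits", None):
        raise ValueError(
            "--low_memory_mode requires max calibration; it is not supported with auto-quantize."
        )
```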

Comment on lines 301 to 305
### Quantize and Export

```bash
python hf_ptq.py --pyt_ckpt_path <huggingface_model_card> --qformat fp8 --export_fmt hf --export_path <quantized_ckpt_path> --trust_remote_code
python hf_ptq.py --pyt_ckpt_path <huggingface_model_card> --qformat fp8 --export_path <quantized_ckpt_path> --trust_remote_code
```

⚠️ Potential issue

🧩 Analysis chain

Command examples: confirm defaults now export unified HF without extra flags.

Looks good; verify hf_ptq.py and huggingface_example.sh default to unified HF format and no longer accept --export_fmt.

Also applies to: 311-313


🏁 Script executed:

#!/usr/bin/env bash
# Check CLI surfaces for --export_fmt remnants and default export path
rg -n --type=sh --type=py 'export_fmt|export_hf_checkpoint|export_tensorrt_llm_checkpoint' examples/llm_ptq/{hf_ptq.py,scripts/huggingface_example.sh}

Length of output: 474


Remove deprecated --export_fmt from hf_ptq CLI and enforce HF-only export

examples/llm_ptq/hf_ptq.py still defines/uses --export_fmt (arg at L691; forced/warned at L750–751) and has related references at L40–41 and call sites at L594/L607 — remove the CLI flag and deprecation handling, and update docs/examples accordingly (examples/llm_ptq/scripts/huggingface_example.sh: no matches).

🤖 Prompt for AI Agents
In examples/llm_ptq/README.md around lines 301–305, update the docs and code to
remove the deprecated --export_fmt CLI flag and enforce HuggingFace-only export:
remove the --export_fmt argument definition in examples/llm_ptq/hf_ptq.py
(around L691), delete any deprecation warnings/handling (around L750–751) and
references at the top of the file (L40–41), update or remove call sites that
pass --export_fmt (L594 and L607), and adjust
examples/llm_ptq/scripts/huggingface_example.sh and README usage lines to call
hf_ptq without --export_fmt and to document only HF export (e.g., keep
--export_path and --trust_remote_code). Ensure any code paths or conditionals
that previously branched on export_fmt are simplified to HF-only export and
remove related tests or grep matches.

@kevalmorabia97 kevalmorabia97 force-pushed the chenjiel/deprecate_trtllm_build_2 branch from 6a47146 to 48440c8 Compare September 17, 2025 05:24
@kevalmorabia97 kevalmorabia97 requested a review from a team as a code owner September 17, 2025 05:24
…o chenjiel/deprecate_trtllm_build_2

Signed-off-by: Chenjie Luo <[email protected]>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
tests/examples/llm_ptq/test_llm_ptq.py (1)

113-114: Enable multi‑GPU semantics: set TP explicitly.

Default inference_tensor_parallel is 1; this won’t exercise TP across GPUs. Pass tp=2 to actually test multi‑GPU.

-        PTQCommand(quant="nvfp4", min_gpu=2),
+        PTQCommand(quant="nvfp4", min_gpu=2, tp=2),
examples/llm_ptq/hf_ptq.py (1)

214-229: Fix weakened qformat validation (always‑true because of kv‑cache clause).

The condition “... or args.kv_cache_qformat in KV_QUANT_CFG_CHOICES” makes the assert pass even for unsupported qformats (kv_cache_qformat always valid due to argparse choices). Validate qformat independently.

-    if not args.auto_quantize_bits:
-        assert (
-            args.qformat
-            in [
-                "int4_awq",
-                "fp8",
-                "nvfp4",
-                "nvfp4_awq",
-                "w4a8_awq",
-                "fp8_pb_wo",
-                "w4a8_mxfp4_fp8",
-                "nvfp4_mlp_only",
-            ]
-            or args.kv_cache_qformat in KV_QUANT_CFG_CHOICES
-        ), f"Quantization format {args.qformat} not supported for HF export path"
+    if not args.auto_quantize_bits:
+        assert args.qformat in QUANT_CFG_CHOICES, (
+            f"Quantization format {args.qformat} not supported for HF export path"
+        )
🧹 Nitpick comments (7)
examples/gpt-oss/requirements.txt (1)

6-6: Align Torch lower bound with code using torch.compiler.set_stance or guard the call.

hf_ptq.py calls torch.compiler.set_stance("force_eager"), which may not exist in some 2.7.x releases. Either:

  • raise lower bound to torch>=2.8.0, or
  • keep >2.7.1 but guard the call with hasattr to avoid AttributeError.

Apply one of:

- torch>2.7.1
+ torch>=2.8.0

Or in hf_ptq.py (if you prefer keeping >2.7.1):

- torch.compiler.set_stance("force_eager")
+ if hasattr(torch.compiler, "set_stance"):
+     torch.compiler.set_stance("force_eager")
tests/examples/speculative_decoding/test_medusa.py (1)

32-41: Drop export_fmt usage: good; but avoid session‑global pip install side‑effects.

  • The _run_hf_ptq change removing --export_fmt is correct and aligns with unified HF export. LGTM.
  • The session‑autouse fixture that pip installs transformers<4.50 affects the entire test session and can conflict with other tests’ version pins. Scope the requirement to this test (e.g., WithRequirements pattern) or switch to a function/class‑scoped fixture.

Here’s a scoped alternative:

-@pytest.fixture(scope="session", autouse=True)
-def install_transformers_lt_4_50():
-    subprocess.run(
-        ["pip", "install", "transformers<4.50"],
-        check=True,
-    )
+class TestMedusa(WithRequirements):
+    requirements = [("transformers", "<4.50")]
+
+    def test_llama_medusa_fp8_qat(...):
+        ...

Also applies to: 68-68

examples/llm_ptq/hf_ptq.py (5)

92-106: Avoid duplicating allowed qformats; derive from QUANT_CFG_CHOICES to prevent drift.

The hard‑coded list can diverge from QUANT_CFG_CHOICES. Use its keys.

-    assert all(
-        qformat
-        in [
-            "fp8",
-            "int4_awq",
-            "nvfp4",
-            "nvfp4_awq",
-            "w4a8_awq",
-            "fp8_pb_wo",
-            "w4a8_mxfp4_fp8",
-            "nvfp4_mlp_only",
-        ]
-        for qformat in qformat_list
-    ), "One or more quantization formats provided are not supported for unified checkpoint export"
+    allowed = set(QUANT_CFG_CHOICES.keys())
+    assert all(fmt in allowed for fmt in qformat_list), (
+        "One or more quantization formats provided are not supported for unified checkpoint export"
+    )

126-143: auto_quantize leaks CLI globals (args); pass needed flags explicitly.

Using args.kv_cache_qformat inside auto_quantize couples it to the CLI flow and breaks reuse/imports. Pass kv_cache_qformat (and any other needed fields) as parameters.

-def auto_quantize(
-    model, qformat, auto_quantize_bits, calib_dataloader, calibrate_loop, batch_size=1
-):
+def auto_quantize(
+    model, qformat, auto_quantize_bits, calib_dataloader, calibrate_loop, batch_size=1, kv_cache_qformat="none"
+):
@@
-    enable_quant_kv_cache = args.kv_cache_qformat != "none"
+    enable_quant_kv_cache = kv_cache_qformat != "none"
@@
-        kv_cache_quant_cfg = getattr(mtq, KV_QUANT_CFG_CHOICES[args.kv_cache_qformat])["quant_cfg"]
+        kv_cache_quant_cfg = getattr(mtq, KV_QUANT_CFG_CHOICES[kv_cache_qformat])["quant_cfg"]

And update the call site in quantize_model:

-        model = auto_quantize(
+        model = auto_quantize(
             model,
             args.qformat,
             args.auto_quantize_bits,
             calib_dataloader,
             calibrate_loop,
-            args.batch_size,
+            args.batch_size,
+            kv_cache_qformat=args.kv_cache_qformat,
         )

586-590: Message clarity: call out “TRT‑LLM checkpoint export (no engine build)” explicitly.

To avoid confusion with the deprecated TRT engine path, make the warning text explicit that this exports a TensorRT-LLM checkpoint only (for models the torch runtime does not yet support), not an engine.

-            warnings.warn(
-                "Still exporting TensorRT-LLM checkpoints for models not supported by the TensorRT-LLM torch runtime."
-            )
+            warnings.warn(
+                "Exporting a TensorRT-LLM CHECKPOINT (no engine build) for models not yet supported by the TRT-LLM torch runtime."
+            )

601-606: Redundant sparsity assert.

This branch already implies dense sparsity, so the explicit assert is redundant; drop it to reduce noise.

-            assert args.sparsity_fmt == "dense", (
-                f"Sparsity format {args.sparsity_fmt} not supported by unified export api."
-            )

693-696: Use DeprecationWarning category for --export_fmt.

So users can filter warnings; current UserWarning is noisier.

-        help="Deprecated. Please avoid using this argument.",
+        help="Deprecated; ignored. Will be removed in a future release.",
-    if args.export_fmt != "hf":
-        warnings.warn("Deprecated. --export_fmt forced to hf.")
+    if args.export_fmt != "hf":
+        warnings.warn("Deprecated. --export_fmt is ignored and forced to 'hf'.", DeprecationWarning)

Also applies to: 750-752

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6a47146 and 48440c8.

📒 Files selected for processing (11)
  • .github/CODEOWNERS (0 hunks)
  • docs/source/getting_started/_installation_for_Linux.rst (1 hunks)
  • examples/gpt-oss/requirements.txt (1 hunks)
  • examples/llm_ptq/README.md (3 hunks)
  • examples/llm_ptq/hf_ptq.py (5 hunks)
  • examples/llm_qat/notebooks/QAT_QAD_Walkthrough.ipynb (1 hunks)
  • examples/speculative_decoding/example.ipynb (1 hunks)
  • examples/vlm_ptq/README.md (1 hunks)
  • tests/examples/llm_ptq/test_llm_ptq.py (1 hunks)
  • tests/examples/speculative_decoding/test_medusa.py (2 hunks)
  • tests/gpu/torch/export/test_unified_hf_export_and_check_safetensors.py (0 hunks)
💤 Files with no reviewable changes (2)
  • .github/CODEOWNERS
  • tests/gpu/torch/export/test_unified_hf_export_and_check_safetensors.py
✅ Files skipped from review due to trivial changes (1)
  • docs/source/getting_started/_installation_for_Linux.rst
🚧 Files skipped from review as they are similar to previous changes (4)
  • examples/speculative_decoding/example.ipynb
  • examples/llm_qat/notebooks/QAT_QAD_Walkthrough.ipynb
  • examples/vlm_ptq/README.md
  • examples/llm_ptq/README.md
🧰 Additional context used
🧬 Code graph analysis (2)
examples/llm_ptq/hf_ptq.py (1)
modelopt/torch/export/model_config_export.py (1)
  • export_tensorrt_llm_checkpoint (443-553)
tests/examples/llm_ptq/test_llm_ptq.py (1)
tests/_test_utils/ptq_utils.py (3)
  • PTQCommand (28-87)
  • param_str (82-87)
  • run (44-80)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: linux
  • GitHub Check: wait-checks / wait
  • GitHub Check: code-quality
  • GitHub Check: build-docs

@kevalmorabia97
Collaborator

Internal CI with latest container running: https://gitlab-master.nvidia.com/omniml/modelopt/-/pipelines/35039819

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (2)
tests/examples/llm_ptq/test_llm_ptq.py (2)

113-113: Explicit TP=2 for multi‑GPU: good fix and matches prior feedback.

Keeps behavior unambiguous across runners.

Run to confirm parser wiring of --tp:

#!/bin/bash
rg -n -C2 --hidden '\b(tp|tensor[_-]?parallel)\b' examples/llm_ptq

93-104: Make autoquant_kv_cache deterministic per arch.

Ensure fp8 path runs on sm89 and nvfp4 path is skipped unless sm100.

-PTQCommand(
-    quant="nvfp4,fp8",
-    kv_cache_quant="fp8",
-    calib_batch_size=4,
-    auto_quantize_bits=6.4,
-),
-PTQCommand(
-    quant="nvfp4,fp8",
-    kv_cache_quant="nvfp4",
-    calib_batch_size=4,
-    auto_quantize_bits=6.4,
-),
+PTQCommand(
+    quant="fp8",
+    kv_cache_quant="fp8",
+    calib_batch_size=4,
+    auto_quantize_bits=6.4,
+),
+PTQCommand(
+    quant="nvfp4",
+    kv_cache_quant="nvfp4",
+    calib_batch_size=4,
+    auto_quantize_bits=6.4,
+    min_sm=100,
+),
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3b3d08b and 070063c.

📒 Files selected for processing (1)
  • tests/examples/llm_ptq/test_llm_ptq.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/examples/llm_ptq/test_llm_ptq.py (1)
tests/_test_utils/ptq_utils.py (3)
  • PTQCommand (28-87)
  • param_str (82-87)
  • run (44-80)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: wait-checks / wait
  • GitHub Check: build-docs
  • GitHub Check: code-quality
🔇 Additional comments (2)
tests/examples/llm_ptq/test_llm_ptq.py (2)

108-111: SM‑targeted smoke tests look good.

w4a8_awq pinned to sm89 and nvfp4 to sm100 aligns with hardware support.


117-118: LGTM: using TINY_LLAMA_PATH directly simplifies the test.

> *If a GPU OOM error occurs during model quantization despite sufficient memory, setting the --use_seq_device_map flag can help. This enforces sequential device mapping, distributing the model across GPUs and utilizing up to 80% of each GPU's memory.*
> *You can now add `--low_memory_mode` to the command when setting `--export_fmt=hf` to lower the memory requirements of the PTQ process. With this mode, the script will compress model weights to low precision before calibration. This mode is only supported for FP8 and NVFP4 with max calibration.*
> *You can add `--low_memory_mode` to the command to lower the memory requirements of the PTQ process. With this mode, the script will compress model weights to low precision before calibration. This mode is only supported for FP8 and NVFP4 with max calibration.*
Contributor


Quick question: I remember we had some accuracy degradation with --low_memory_mode - is that correct?

Contributor

@realAsma realAsma left a comment


Looks great, I did an overall review - I might have missed some details.

@cjluo-nv cjluo-nv force-pushed the chenjiel/deprecate_trtllm_build_2 branch from 070063c to f4b3f73 Compare September 17, 2025 18:47
Signed-off-by: Chenjie Luo <[email protected]>
@cjluo-nv cjluo-nv force-pushed the chenjiel/deprecate_trtllm_build_2 branch from f4b3f73 to 1885e81 Compare September 17, 2025 18:49

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (10)
pyproject.toml (1)

119-119: Custom pytest addopts (strict markers, durations, coverage) are now commented out; was this intentional?

Commenting out addopts disables strict marker enforcement, slow‑test timing, and coverage reports; CI expecting coverage.xml may break. Prior feedback said this was reverted.

Option A — restore previous behavior (add explicit --cov target to avoid no-op):

-#addopts = "-ra --cov-report=term-missing --cov-report=html --cov-report=xml:coverage.xml --cov-config=pyproject.toml --durations=20 --strict-markers"
+addopts = "-ra --cov=modelopt --cov-report=term-missing --cov-report=xml:coverage.xml --cov-config=pyproject.toml --durations=20 --strict-markers"

Option B — minimal local defaults; move coverage flags to CI:

-#addopts = "-ra --cov-report=term-missing --cov-report=html --cov-report=xml:coverage.xml --cov-config=pyproject.toml --durations=20 --strict-markers"
+addopts = "-ra --strict-markers --durations=20"

Run to verify CI coupling before choosing:

#!/bin/bash
# Find CI steps/artifacts depending on coverage.xml or pytest-cov flags
fd -HI --type f '.github' '.gitlab' | xargs -I{} rg -n --hidden -C2 -e 'coverage\.xml|--cov(=|\b)|pytest.+--cov' {}
# Also check for Codecov/Sonar references
rg -n -C2 -e 'codecov|sonar.*coverage' .github .gitlab || true
tests/examples/llm_ptq/test_llm_ptq.py (3)

83-85: Gate NVFP4 variants to Blackwell (SM100).

These entries will run on SM89 runners and fail.

-        PTQCommand(quant="nvfp4"),
-        PTQCommand(quant="nvfp4_awq"),
+        PTQCommand(quant="nvfp4", min_sm=100),
+        PTQCommand(quant="nvfp4_awq", min_sm=100),

86-92: Split auto‑quant by arch; keep NVFP4 behind SM100.

Prevents mixed list from breaking on SM89.

-        PTQCommand(
-            quant="int4_awq,nvfp4,fp8,w4a8_awq",
-            calib_batch_size=4,
-            auto_quantize_bits=6.4,
-            kv_cache_quant="none",
-        ),
+        PTQCommand(
+            quant="int4_awq,fp8,w4a8_awq",
+            calib_batch_size=4,
+            auto_quantize_bits=6.4,
+            kv_cache_quant="none",
+        ),
+        PTQCommand(
+            quant="int4_awq,nvfp4",
+            calib_batch_size=4,
+            auto_quantize_bits=6.4,
+            kv_cache_quant="none",
+            min_sm=100,
+        ),

113-114: Set TP explicitly for multi‑GPU cases to actually engage tensor‑parallel.

min_gpu=2 alone won’t enable TP.

-        PTQCommand(quant="fp8", min_gpu=2, min_sm=89),
-        PTQCommand(quant="nvfp4", min_gpu=2, min_sm=100),
+        PTQCommand(quant="fp8", min_gpu=2, min_sm=89, tp=2),
+        PTQCommand(quant="nvfp4", min_gpu=2, min_sm=100, tp=2),
examples/llm_ptq/scripts/huggingface_example.sh (3)

131-138: Bug: using $qformat outside its loop; use $QFORMAT and quote paths.

Otherwise the bf16/fp16 fast‑path may never trigger; also fix brittle symlink loop.

-    if [ "$qformat" == "bf16" ] || [ "$qformat" == "fp16" ]; then
+    if [ "$QFORMAT" = "bf16" ] || [ "$QFORMAT" = "fp16" ]; then
         if [ -d "$MODEL_PATH" ]; then
             MODEL_CONFIG_EXIST=true
             MODEL_CONFIG=$MODEL_PATH/config.json
-            for file in $MODEL_PATH/*; do ln -sf "$file" $SAVE_PATH/; done
+            for file in "$MODEL_PATH"/*; do ln -sf "$file" "$SAVE_PATH"/; done

Also make the “empty dir” check robust under set -e:

-if [[ $TASKS =~ "quant" ]] || [[ ! -d "$SAVE_PATH" ]] || [[ ! $(ls -A $SAVE_PATH) ]]; then
+if [[ $TASKS =~ "quant" ]] || [[ ! -d "$SAVE_PATH" ]] || ! ls -A "$SAVE_PATH" >/dev/null 2>&1; then

167-170: Message still references deprecated trtllm-build.

Align with Torch backend guidance.

-        echo "Sparse quantization detected (SPARSITY_FMT=$SPARSITY_FMT). Please deploy with the TRT-LLM using trtllm-build. Checkpoint export_path: $SAVE_PATH"
+        echo "Sparse quantization detected (SPARSITY_FMT=$SPARSITY_FMT). This HF example does not deploy sparse models; use the TensorRT‑LLM Torch backend sparse deployment flow. Checkpoint export_path: $SAVE_PATH"

181-184: Fix SC2076 and support comma‑separated QFORMAT values.

Current regex‑in‑string check is wrong and fails for lists.

-    if [[ ! " fp8 nvfp4 bf16 fp16 " =~ " ${QFORMAT} " ]]; then
-        echo "Quant $QFORMAT specified. Please read TensorRT-LLM quantization support matrix https://nvidia.github.io/TensorRT-LLM/features/quantization.html#quantization-in-tensorrt-llm and use TensorRT-LLM for deployment. Checkpoint export_path: $SAVE_PATH"
-        exit 0
-    fi
+    unsupported=""
+    IFS=','; for q in $QFORMAT; do
+        case "$q" in fp8|nvfp4|bf16|fp16) : ;; *) unsupported+="${unsupported:+,}$q" ;; esac
+    done; IFS=' '
+    if [ -n "$unsupported" ]; then
+        echo "Quant(s) [$unsupported] not supported by the TensorRT‑LLM Torch runtime; please consult the support matrix and deploy with TensorRT‑LLM as appropriate. Checkpoint export_path: $SAVE_PATH"
+        exit 0
+    fi
examples/vlm_ptq/scripts/huggingface_example.sh (2)

89-103: Make empty‑dir test robust and standardize on python3.

Prevents brittle failures under set -e and mismatched interpreters.

-if [[ $TASKS =~ "quant" ]] || [[ ! -d "$SAVE_PATH" ]] || [[ ! $(ls -A $SAVE_PATH) ]]; then
+if [[ $TASKS =~ "quant" ]] || [[ ! -d "$SAVE_PATH" ]] || ! ls -A "$SAVE_PATH" >/dev/null 2>&1; then
@@
-        python ../llm_ptq/hf_ptq.py \
+        python3 ../llm_ptq/hf_ptq.py \
             --pyt_ckpt_path=$MODEL_PATH \
-            --export_path=$SAVE_PATH \
-            --qformat=$QFORMAT \
+            --export_path="$SAVE_PATH" \
+            --qformat="$QFORMAT" \
             --calib_size=$CALIB_SIZE \
             --batch_size=$CALIB_BATCH_SIZE \
             $PTQ_ARGS

104-116: Reorder NVFP4 Blackwell guard before early exit.

Currently unreachable if QFORMAT != fp8.

-if [[ "$QFORMAT" != "fp8" ]]; then
-    echo "For quant format $QFORMAT, please refer to the TensorRT-LLM documentation for deployment. Checkpoint saved to $SAVE_PATH."
-    exit 0
-fi
-
-if [[ "$QFORMAT" == *"nvfp4"* ]] || [[ "$KV_CACHE_QUANT" == *"nvfp4"* ]]; then
+if [[ "$QFORMAT" == *"nvfp4"* ]] || [[ "${KV_CACHE_QUANT:-}" == *"nvfp4"* ]]; then
     cuda_major=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader -i 0 | cut -d. -f1)
 
     if [ "$cuda_major" -lt 10 ]; then
         echo "Please deploy the NVFP4 checkpoint on a Blackwell GPU. Checkpoint export_path: $SAVE_PATH"
         exit 0
     fi
 fi
+
+if [[ "$QFORMAT" != "fp8" ]]; then
+    echo "For quant format $QFORMAT, please refer to the TensorRT-LLM documentation for deployment. Checkpoint saved to $SAVE_PATH."
+    exit 0
+fi
examples/llm_ptq/hf_ptq.py (1)

214-229: Assertion is ineffective; it always passes because kv_cache_qformat is always valid.

This lets unsupported qformats through to export.

-    if not args.auto_quantize_bits:
-        assert (
-            args.qformat
-            in [
-                "int4_awq",
-                "fp8",
-                "nvfp4",
-                "nvfp4_awq",
-                "w4a8_awq",
-                "fp8_pb_wo",
-                "w4a8_mxfp4_fp8",
-                "nvfp4_mlp_only",
-            ]
-            or args.kv_cache_qformat in KV_QUANT_CFG_CHOICES
-        ), f"Quantization format {args.qformat} not supported for HF export path"
+    if not args.auto_quantize_bits:
+        assert args.qformat in ALLOWED_UNIFIED_HF_QFORMATS, (
+            f"Quantization format {args.qformat} not supported for unified HF export"
+        )

Add near the top (after QUANT_CFG_CHOICES) to centralize:

+ALLOWED_UNIFIED_HF_QFORMATS = {
+    "int4_awq",
+    "fp8",
+    "nvfp4",
+    "nvfp4_awq",
+    "w4a8_awq",
+    "fp8_pb_wo",
+    "w4a8_mxfp4_fp8",
+    "nvfp4_mlp_only",
+}
🧹 Nitpick comments (9)
tests/examples/llm_ptq/test_llm_ptq.py (2)

80-81: Add min_gpu=2 when requesting tp=2 to avoid single‑GPU CI failures.

Without gating, this will try TP=2 on 1 GPU and fail.

-        PTQCommand(quant="int8_sq", kv_cache_quant="none", tp=2, pp=2),
+        PTQCommand(quant="int8_sq", kv_cache_quant="none", tp=2, pp=2, min_gpu=2),

108-111: Fix misleading comment or gate accordingly.

Comment says “sm100” but min_sm=89.

-        PTQCommand(quant="fp8", kv_cache_quant="none", min_sm=89),  # sm100
+        PTQCommand(quant="fp8", kv_cache_quant="none", min_sm=89),  # sm89
examples/llm_ptq/scripts/huggingface_example.sh (2)

275-276: Quote command substitution to avoid word splitting; pick one JSONL.

Prevents SC2046 and multi‑file expansion issues.

-    JSONL_PATH=$(readlink -f $(find $SAVE_PATH -type f -name '*.jsonl'))
+    JSONL_PATH=$(readlink -f "$(find "$SAVE_PATH" -type f -name '*.jsonl' | head -n1)")

52-62: Avoid hard‑coding qformat allow‑lists in multiple places.

Source a single quant_formats.sh to keep HF/Torch lists in sync.

+# quant_formats.sh exports HF_EXPORT_QFORMATS and TORCH_LLMAPI_QFORMATS
+. "$script_dir/quant_formats.sh"
-#Iterate over list of qformats provided and check if they are valid
-IFS=","
-for qformat in $QFORMAT; do
-    case $qformat in
-    fp8 | fp8_pc_pt | fp8_pb_wo | int8_sq | int4_awq | w4a8_awq | fp16 | bf16 | nvfp4 | nvfp4_awq | w4a8_nvfp4_fp8 | w4a8_mxfp4_fp8) ;;
-    *)
-        echo "Unknown quant argument: Expected one of: [fp8, fp8_pc_pt, fp8_pb_wo, int8_sq, int4_awq, w4a8_awq, fp16, bf16, nvfp4, nvfp4_awq, w4a8_nvfp4_fp8, w4a8_mxfp4_fp8]" >&2
-        exit 1
-        ;;
-    esac
-done
+IFS=','; for q in $QFORMAT; do
+  case " $HF_EXPORT_QFORMATS " in *" $q "*) : ;; *) echo "Unknown quant: $q"; exit 1 ;; esac
+done; IFS=' '
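The proposed quant_formats.sh does not exist in this PR; a minimal sketch of its contents, reusing the two allow-lists already quoted in this review, could look like:

#!/bin/bash
# Hypothetical quant_formats.sh (illustrative): single source of truth for
# the HF-export and Torch-runtime quantization format allow-lists.
HF_EXPORT_QFORMATS="fp8 fp8_pc_pt fp8_pb_wo int8_sq int4_awq w4a8_awq fp16 bf16 nvfp4 nvfp4_awq w4a8_nvfp4_fp8 w4a8_mxfp4_fp8"
TORCH_LLMAPI_QFORMATS="fp8 nvfp4 bf16 fp16"
export HF_EXPORT_QFORMATS TORCH_LLMAPI_QFORMATS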
examples/llm_ptq/hf_ptq.py (3)

92-105: Deduplicate allowed‑format checks; reuse a single constant.

Keeps auto‑quant validation consistent with main arg validation.

-    assert all(
-        qformat
-        in [
-            "fp8",
-            "int4_awq",
-            "nvfp4",
-            "nvfp4_awq",
-            "w4a8_awq",
-            "fp8_pb_wo",
-            "w4a8_mxfp4_fp8",
-            "nvfp4_mlp_only",
-        ]
-        for qformat in qformat_list
-    ), "One or more quantization formats provided are not supported for unified checkpoint export"
+    assert all(q in ALLOWED_UNIFIED_HF_QFORMATS for q in qformat_list), (
+        "One or more quantization formats are not supported for unified checkpoint export"
+    )

86-93: Avoid hidden dependency on global args inside auto_quantize.

Pass needed values explicitly to prevent surprises on import.

-def auto_quantize(
-    model, qformat, auto_quantize_bits, calib_dataloader, calibrate_loop, batch_size=1
-):
+def auto_quantize(
+    model,
+    qformat,
+    auto_quantize_bits,
+    calib_dataloader,
+    calibrate_loop,
+    kv_cache_qformat,
+    batch_size=1,
+):
@@
-    enable_quant_kv_cache = args.kv_cache_qformat != "none"
+    enable_quant_kv_cache = kv_cache_qformat != "none"
@@
-        kv_cache_quant_cfg = getattr(mtq, KV_QUANT_CFG_CHOICES[args.kv_cache_qformat])["quant_cfg"]
+        kv_cache_quant_cfg = getattr(mtq, KV_QUANT_CFG_CHOICES[kv_cache_qformat])["quant_cfg"]

And at the call site:

-        model = auto_quantize(
+        model = auto_quantize(
             model,
             args.qformat,
             args.auto_quantize_bits,
             calib_dataloader,
             calibrate_loop,
+            args.kv_cache_qformat,
             args.batch_size,
         )

Also applies to: 178-185


545-548: Typo: “quantizaton”.

Minor doc fix.

-        assert model_type != "dbrx", f"Does not support export {model_type} without quantizaton"
+        assert model_type != "dbrx", f"Does not support export {model_type} without quantization"
examples/llm_ptq/scripts/parser.sh (1)

135-141: Remove stale prints for unset variables (BUILD_MAX_INPUT_LEN, NUM_SAMPLES).

These were dropped from parsing; printing them is confusing.

-  echo "input: $BUILD_MAX_INPUT_LEN"
   echo "output: $BUILD_MAX_OUTPUT_LEN"
   echo "batch: $BUILD_MAX_BATCH_SIZE"
@@
-  echo "num_sample: $NUM_SAMPLES"
CHANGELOG.rst (1)

9-12: Clarify deprecation wording.

Tighten language and fix grammar.

-- TRT-LLM's TRT backend in ``examples/llm_ptq`` and ``examples/vlm_ptq``. Tasks ``build`` and ``benchmark`` support are removed and replaced with ``quant``. For performance evaluation, please use ``trtllm-bench`` directly.
-- ``--export_fmt`` flag in ``examples/llm_ptq`` is removed. By default we export to the unified Hugging Face checkpoint format.
-- ``int8_sq`` quantization format is deprecated from the ``examples/vlm_ptq`` respect to the TensorRT-LLM's torch backend switch. Please refer to the previous releases if this quantization format is needed.
-- ``examples/vlm_eval`` as it depends on the deprecated TRT-LLM's TRT backend.
+- Deprecate the TensorRT‑LLM TRT backend in ``examples/llm_ptq`` and ``examples/vlm_ptq``. The ``build`` and ``benchmark`` tasks are removed; use ``quant`` plus ``trtllm-bench`` for performance evaluation.
+- Remove ``--export_fmt`` in ``examples/llm_ptq``. Export defaults to the unified Hugging Face checkpoint format.
+- Deprecate ``int8_sq`` in ``examples/vlm_ptq`` due to the switch to the TensorRT‑LLM Torch backend. Refer to previous releases if needed.
+- Deprecate ``examples/vlm_eval`` (it depends on the removed TRT backend).
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 070063c and 1885e81.

📒 Files selected for processing (7)
  • CHANGELOG.rst (1 hunks)
  • examples/llm_ptq/hf_ptq.py (5 hunks)
  • examples/llm_ptq/scripts/huggingface_example.sh (9 hunks)
  • examples/llm_ptq/scripts/parser.sh (3 hunks)
  • examples/vlm_ptq/scripts/huggingface_example.sh (4 hunks)
  • pyproject.toml (1 hunks)
  • tests/examples/llm_ptq/test_llm_ptq.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
examples/llm_ptq/hf_ptq.py (1)
modelopt/torch/export/model_config_export.py (1)
  • export_tensorrt_llm_checkpoint (443-553)
tests/examples/llm_ptq/test_llm_ptq.py (1)
tests/_test_utils/ptq_utils.py (3)
  • PTQCommand (28-87)
  • param_str (82-87)
  • run (44-80)
🪛 Shellcheck (0.10.0)
examples/llm_ptq/scripts/huggingface_example.sh

[warning] 181-181: Remove quotes from right-hand side of =~ to match as a regex rather than literally.

(SC2076)


[warning] 275-275: Quote this to prevent word splitting.

(SC2046)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: linux
  • GitHub Check: wait-checks / wait


# Parse command-line options
-ARGS=$(getopt -o "" -l "type:,model:,quant:,kv_cache_quant:,tp:,calib_tp:,pp:,sparsity:,awq_block_size:,calib:,calib_batch_size:,auto_quantize_bits:,input:,output:,batch:,tasks:,export_fmt:,lm_eval_tasks:,lm_eval_limit:,simple_eval_tasks:,trust_remote_code,use_seq_device_map,gpu_max_mem_percentage:,kv_cache_free_gpu_memory_fraction:,low_memory_mode,no-verbose,calib_dataset:" -n "$0" -- "$@")
+ARGS=$(getopt -o "" -l "model:,quant:,kv_cache_quant:,tp:,pp:,sparsity:,awq_block_size:,calib:,calib_batch_size:,auto_quantize_bits:,output:,batch:,tasks:,lm_eval_tasks:,lm_eval_limit:,simple_eval_tasks:,trust_remote_code,use_seq_device_map,gpu_max_mem_percentage:,kv_cache_free_gpu_memory_fraction:,low_memory_mode,no-verbose,calib_dataset:" -n "$0" -- "$@")

⚠️ Potential issue

🧩 Analysis chain

Optional: add --num_samples back or delete related code across scripts.

If still referenced externally, re‑add to getopt; else ensure nothing depends on it.


🏁 Script executed:

#!/bin/bash
# Find any remaining references to --num_samples / NUM_SAMPLES
rg -n --hidden -C2 '\b(--num_samples|NUM_SAMPLES)\b' || true

Length of output: 420


Re-add --num_samples to getopt or remove NUM_SAMPLES usage

examples/llm_ptq/scripts/parser.sh still references NUM_SAMPLES (line 141: echo "num_sample: $NUM_SAMPLES"); either add --num_samples to the ARGS getopt long options (around line 39) or remove the NUM_SAMPLES variable, its echo, and any dependent code.

🤖 Prompt for AI Agents
In examples/llm_ptq/scripts/parser.sh around line 39, the getopt long options
list is missing --num_samples while the script later references NUM_SAMPLES; fix
by adding "num_samples:" to the long options string (so getopt recognizes and
parses --num_samples) and ensure the subsequent case/assignment logic maps that
option into the NUM_SAMPLES variable, or alternatively remove all references to
NUM_SAMPLES and its echo if you intend to drop the feature.
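For reference, the re-add path only needs the getopt entry plus a matching case arm. The sketch below is self-contained but omits the script's other options for brevity; whether to restore or drop NUM_SAMPLES remains the maintainers' call.

#!/bin/bash
# Sketch: restoring --num_samples parsing (unrelated options omitted)
ARGS=$(getopt -o "" -l "num_samples:,output:" -n "$0" -- "$@")
eval set -- "$ARGS"
while true; do
    case "$1" in
    --num_samples) NUM_SAMPLES="$2"; shift 2 ;;
    --output) BUILD_MAX_OUTPUT_LEN="$2"; shift 2 ;;
    --) shift; break ;;
    esac
done
echo "num_sample: $NUM_SAMPLES"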

Signed-off-by: Keval Morabia <[email protected]>
@cjluo-nv cjluo-nv enabled auto-merge (squash) September 17, 2025 19:44
@kevalmorabia97 kevalmorabia97 merged commit 146d1d9 into main Sep 17, 2025
20 checks passed
@kevalmorabia97 kevalmorabia97 deleted the chenjiel/deprecate_trtllm_build_2 branch September 17, 2025 21:52
yeyu-nvidia pushed a commit that referenced this pull request Sep 18, 2025
Signed-off-by: Chenjie Luo <[email protected]>
Signed-off-by: Chenjie Luo <[email protected]>
Signed-off-by: Keval Morabia <[email protected]>
Co-authored-by: Chenjie Luo <[email protected]>
Co-authored-by: Keval Morabia <[email protected]>
Signed-off-by: Ye Yu <[email protected]>