Deprecate TRTLLM-build in examples #297
Conversation
Signed-off-by: Chenjie Luo <[email protected]>
Walkthrough

The changes deprecate TRT-LLM engine build paths, unify HF checkpoint export in PTQ, remove VLM evaluation scripts, refactor LLM deployment to use the TRT-LLM public API, update tests to the new quant/kv-cache flows, and adjust docs, notebooks, and requirements to reflect the streamlined quantization-first workflow.

Sequence Diagram(s)

sequenceDiagram
autonumber
participant U as User
participant PTQ as hf_ptq.py
participant V as Validator
participant E as Exporter
Note over PTQ: Unified export selection
U->>PTQ: Invoke with model/quant args (--export_fmt deprecated)
PTQ->>V: Validate qformat/kv_cache_qformat
V-->>PTQ: Validation result
alt Model unsupported by HF runtime or sparsity != dense
PTQ->>E: export_tensorrt_llm_checkpoint(checkpoint_dir)
Note over E: Emits deprecation/runtime warnings
else
PTQ->>E: export_hf_checkpoint(checkpoint_dir)
end
PTQ-->>U: Save artifacts to SAVE_PATH
sequenceDiagram
autonumber
participant App as Application
participant L as LLM (TRTLLM)
participant FS as Filesystem
participant TRT as tensorrt_llm
Note over L: Initialize from HF checkpoint
App->>L: __init__(checkpoint_dir, tp, max_batch_size, ...)
L->>FS: Read config.json
FS-->>L: Config (seq len, experts, etc.)
L->>TRT: Version check (>= 1.1.0rc2)
alt Version supports context logits/stop-words
Note over L: enable gather_context_logits
else
Note over L: disable context logits/stop-words
end
App->>L: generate(inputs, stop_words?)
L-->>App: Outputs (with version-gated features)
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60–90 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
Signed-off-by: Chenjie Luo <[email protected]>
), "One or more quantization formats provided are not supported for tensorrt llm export" | ||
assert all( | ||
qformat | ||
in [ |
Do you think we can pull this list of supported qformats into a variable and reuse it in other places (e.g. in the auto-quantize section)?
ACK. This PR is pretty large. I hope we can move the improvements to a follow-up PR.
Actionable comments posted: 11
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
examples/llm_ptq/scripts/huggingface_example.sh (2)
150-159
: Logic bug: only the last qformat is checked for bf16/fp16.

$qformat here is whatever remained from a prior loop. If any qformat is bf16/fp16, do the symlink path.

- if [ "$qformat" == "bf16" ] || [ "$qformat" == "fp16" ]; then
+ needs_symlink=false
+ IFS=','; for q in $QFORMAT; do
+     if [ "$q" = "bf16" ] || [ "$q" = "fp16" ]; then needs_symlink=true; fi
+ done; IFS=' '
+ if $needs_symlink; then
      if [ -d "$MODEL_PATH" ]; then
          MODEL_CONFIG_EXIST=true
          MODEL_CONFIG=$MODEL_PATH/config.json
-         for file in $MODEL_PATH/*; do ln -sf "$file" $SAVE_PATH/; done
+         for file in "$MODEL_PATH"/*; do ln -sf "$file" "$SAVE_PATH/"; done
      else
          echo "Please use the model directory where the config.json file is present."
          exit 1
      fi
  fi
301-314
: Subshell exit won't stop the script; make the log-wait loop run in current shell.

exit 1 inside the pipeline's subshell won't abort the parent. Use process substitution and read -r.

- tail -f $SAVE_PATH/serve.txt | while read line; do
+ while read -r line; do
      if echo "$line" | grep -q "Application startup complete"; then
          echo "Application startup complete."
          break
      fi
      if ! kill -0 $SERVE_PID 2>/dev/null; then
          echo "trtllm-serve has exited."
          exit 1
      fi
- done
+ done < <(tail -n +1 -f "$SAVE_PATH/serve.txt")

Optionally, add a timeout to avoid indefinite hangs.
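For example, a minimal sketch of such a timeout, assuming the same SERVE_PID and $SAVE_PATH/serve.txt as above; the 300-second limit is only illustrative:

# Wait for startup, but give up after an assumed limit of ~300 seconds.
deadline=$((SECONDS + 300))
started=false
while [ "$SECONDS" -lt "$deadline" ]; do
    if grep -q "Application startup complete" "$SAVE_PATH/serve.txt"; then
        started=true
        break
    fi
    if ! kill -0 "$SERVE_PID" 2>/dev/null; then
        echo "trtllm-serve has exited." >&2
        exit 1
    fi
    sleep 2
done
if ! $started; then
    echo "Timed out waiting for trtllm-serve startup." >&2
    exit 1
fi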
🧹 Nitpick comments (19)
CHANGELOG.rst (1)
9-12
: Tighten deprecation bullets and clarify default behavior.

Suggest minor wording for precision and consistency.

- - TRT-LLM's TRT backend in ``examples/llm_ptq`` and ``examples/vlm_ptq``.
- - ``--export_fmt`` flag in ``examples/llm_ptq`` is removed. By default we export to the unified Hugging Face checkpoint format.
- - ``examples/vlm_eval`` as it depends on the deprecated TRT-LLM's TRT backend.
+ - Deprecate TRT-LLM's TensorRT backend in ``examples/llm_ptq`` and ``examples/vlm_ptq``.
+ - Remove the ``--export_fmt`` flag in ``examples/llm_ptq``; exports default to the unified Hugging Face checkpoint format.
+ - Deprecate ``examples/vlm_eval`` (depends on the deprecated TensorRT backend).

examples/llm_ptq/hf_ptq.py (1)
92-105
: De-duplicate allowed qformats (single source of truth).

The hardcoded list risks drifting from other checks. Define a constant and reuse it here and below.

- assert all(
-     qformat
-     in [
-         "fp8",
-         "int4_awq",
-         "nvfp4",
-         "nvfp4_awq",
-         "w4a8_awq",
-         "fp8_pb_wo",
-         "w4a8_mxfp4_fp8",
-         "nvfp4_mlp_only",
-     ]
-     for qformat in qformat_list
- ), "One or more quantization formats provided are not supported for unified checkpoint export"
+ assert all(q in ALLOWED_UNIFIED_HF_QFORMATS for q in qformat_list), (
+     "One or more quantization formats provided are not supported for unified checkpoint export"
+ )

Add near QUANT_CFG_CHOICES:

ALLOWED_UNIFIED_HF_QFORMATS = {
    "fp8",
    "int4_awq",
    "nvfp4",
    "nvfp4_awq",
    "w4a8_awq",
    "fp8_pb_wo",
    "w4a8_mxfp4_fp8",
    "nvfp4_mlp_only",
}

modelopt/deploy/llm/generate.py (2)
113-120
: MOE detection heuristic is brittle.

Scanning only top-level keys for “num” and “experts” may miss/false-trigger. Consider a recursive search for typical MOE markers (e.g., num_experts, num_local_experts) or a config flag.
I can propose a small recursive helper to detect MOE fields robustly.
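For illustration, a minimal sketch of such a helper, assuming the config is the nested dict loaded from config.json; the marker names below are assumptions, not a confirmed list:

MOE_MARKER_KEYS = {"num_experts", "num_local_experts", "moe_num_experts"}  # assumed marker names

def has_moe_markers(cfg) -> bool:
    """Recursively look for an MOE expert-count field with a positive integer value."""
    if isinstance(cfg, dict):
        for key, value in cfg.items():
            if key in MOE_MARKER_KEYS and isinstance(value, int) and value > 0:
                return True
            if has_moe_markers(value):
                return True
    elif isinstance(cfg, list):
        return any(has_moe_markers(item) for item in cfg)
    return False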
136-139
: Align version gating and ImportError guidance.

You gate features at ≥1.1.0rc2 but the ImportError message above still says “1.0.0rc or later.” Update the message to match the minimum required for llmapi + context logits UX.

-except ImportError:
-    print("Please upgrade tensorrt-llm to 1.0.0rc or later")
+except ImportError:
+    print("Please upgrade tensorrt-llm to 1.1.0rc2 or later")

examples/llm_ptq/scripts/parser.sh (3)
92-96
: Usage/help text is outdated and incomplete.

It omits many supported options (e.g., kv_cache_quant, calib_batch_size, num_samples, etc.). Update to reduce user confusion.

Apply this minimal improvement:

- echo "Usage: $0 --model=<MODEL_PATH> --quant=<QFORMAT> --tasks=<TASK,...>"
- echo "Optional args: --sparsity=<SPARSITY_FMT> --awq_block_size=<AWQ_BLOCK_SIZE> --calib=<CALIB_SIZE>"
+ echo "Usage: $0 --model=<MODEL_PATH> --quant=<QFORMAT> --tasks=<TASK,...>"
+ echo "Optional: --kv_cache_quant=FMT --sparsity=FMT --awq_block_size=N --calib=N --calib_batch_size=N --auto_quantize_bits=N"
+ echo "          --input=N --output=N --batch=N --lm_eval_tasks=LIST --lm_eval_limit=N --simple_eval_tasks=LIST --num_samples=N"
+ echo "          --trust_remote_code --use_seq_device_map --gpu_max_mem_percentage=N --kv_cache_free_gpu_memory_fraction=F"
+ echo "          --low_memory_mode --no-verbose --calib_dataset=NAME"
117-123
: Message text references deprecated TRT-LLM TRT path and has grammar issues.

Keep messaging backend-agnostic and fix grammar.

- if [[ "$QFORMAT" == *"awq"* ]]; then
-     echo "Sparsity is not compatible with 'awq' quantization for TRT-LLM deployment."
+ if [[ "$QFORMAT" == *"awq"* ]]; then
+     echo "Sparsity is not compatible with 'awq' quantization (no speedup)."
      exit 1 # Exit script with an error
  fi
141-141
: Typo: use num_samples (plural) for consistency.

- echo "num_sample: $NUM_SAMPLES"
+ echo "num_samples: $NUM_SAMPLES"
73-84
: Dead branch: MODEL_TYPE is never set by parser.sh now.
parser.sh
no longer exposes--type
/MODEL_TYPE. This VILA setup block is unreachable.Options:
- Remove this block; or
- Gate via a new flag (e.g.,
--enable_vila_setup
) parsed inparser.sh
; or- Expect MODEL_TYPE via env and document it.
Do you want a patch to add--enable_vila_setup
?
101-113
: Minor: keep comments aligned with actual behavior.Header says “Prepare datasets for TRT-LLM benchmark” but this section only sets and uses quickstart. Consider rewording to avoid confusion.
examples/llm_ptq/scripts/huggingface_example.sh (10)
187-194
: Outdated Blackwell guidance mentions engine build.If using Torch llmapi, suggest wording that avoids “build engine”.
- echo "Please build the tensorrt_llm engine on Blackwell GPU for deployment. Checkpoint export_path: $SAVE_PATH" + echo "This quantization requires deployment on a Blackwell (SM100) GPU with the TRT-LLM Torch backend. Checkpoint export_path: $SAVE_PATH"
201-206
: Initialize RUN_ARGS to avoid unbound expansions and quote paths.- if $TRUST_REMOTE_CODE; then + RUN_ARGS="" + if $TRUST_REMOTE_CODE; then RUN_ARGS+=" --trust_remote_code " fi - - python run_tensorrt_llm.py --engine_dir=$SAVE_PATH $RUN_ARGS + python run_tensorrt_llm.py --engine_dir="$SAVE_PATH" $RUN_ARGS
285-289
: Quote command substitution result (SC2046) and prefer a single match.Avoid word-splitting and pick one file deterministically.
- JSONL_PATH=$(readlink -f $(find $SAVE_PATH -type f -name '*.jsonl')) + JSONL_PATH=$(readlink -f "$(find "$SAVE_PATH" -type f -name '*.jsonl' -print -quit)")
299-303
: Quote paths for safety; minor robustness.- trtllm-serve $SAVE_PATH --host 0.0.0.0 --port $PORT >$SAVE_PATH/serve.txt 2>&1 & + trtllm-serve "$SAVE_PATH" --host 0.0.0.0 --port "$PORT" >"$SAVE_PATH/serve.txt" 2>&1 &
373-380
: Quote model and dataset paths.- MODEL_ARGS="--model_path $SAVE_PATH " + MODEL_ARGS="--model_path \"$SAVE_PATH\" " @@ - trtllm-bench --model $MODEL_PATH $MODEL_ARGS throughput $EXTRA_ARGS --dataset $DATASET_TXT | tee -a $BENCHMARK_RESULT + trtllm-bench --model "$MODEL_PATH" $MODEL_ARGS throughput $EXTRA_ARGS --dataset "$DATASET_TXT" | tee -a "$BENCHMARK_RESULT" @@ - trtllm-bench --model $MODEL_PATH $MODEL_ARGS latency $EXTRA_ARGS --dataset $DATASET_TXT | tee -a $BENCHMARK_RESULT + trtllm-bench --model "$MODEL_PATH" $MODEL_ARGS latency $EXTRA_ARGS --dataset "$DATASET_TXT" | tee -a "$BENCHMARK_RESULT"
236-247
: Consistent quoting for paths and args.- LM_EVAL_RESULT=${SAVE_PATH}/lm_eval.txt + LM_EVAL_RESULT="${SAVE_PATH}/lm_eval.txt" @@ - python lm_eval_tensorrt_llm.py \ + python lm_eval_tensorrt_llm.py \ --model trt-llm \ - --model_args tokenizer=$MODEL_PATH,engine_dir=$SAVE_PATH,max_gen_toks=$BUILD_MAX_OUTPUT_LEN \ + --model_args tokenizer="$MODEL_PATH",engine_dir="$SAVE_PATH",max_gen_toks="$BUILD_MAX_OUTPUT_LEN" \ --tasks $LM_EVAL_TASKS \ - --batch_size $BUILD_MAX_BATCH_SIZE $lm_eval_flags | tee $LM_EVAL_RESULT + --batch_size "$BUILD_MAX_BATCH_SIZE" $lm_eval_flags | tee "$LM_EVAL_RESULT"
255-276
: Quoting and portability nits in MMLU block.- MMLU_RESULT=${SAVE_PATH}/mmlu.txt + MMLU_RESULT="${SAVE_PATH}/mmlu.txt" @@ - tar -xf /tmp/mmlu.tar -C data && mv data/data $MMLU_DATA_PATH + tar -xf /tmp/mmlu.tar -C data && mv data/data "$MMLU_DATA_PATH" @@ - --engine_dir $SAVE_PATH \ - --data_dir $MMLU_DATA_PATH | tee $MMLU_RESULT + --engine_dir "$SAVE_PATH" \ + --data_dir "$MMLU_DATA_PATH" | tee "$MMLU_RESULT"
318-328
: Quote args and results paths in livecodebench/simple_eval.- bash run_livecodebench.sh $MODEL_FULL_NAME $BUILD_MAX_BATCH_SIZE $BUILD_MAX_OUTPUT_LEN $PORT | tee $SAVE_PATH/livecodebench.txt + bash run_livecodebench.sh "$MODEL_FULL_NAME" "$BUILD_MAX_BATCH_SIZE" "$BUILD_MAX_OUTPUT_LEN" "$PORT" | tee "$SAVE_PATH/livecodebench.txt" - mkdir -p $SAVE_PATH/livecodebench - mv LiveCodeBench/output/$MODEL_FULL_NAME/* $SAVE_PATH/livecodebench + mkdir -p "$SAVE_PATH/livecodebench" + mv "LiveCodeBench/output/$MODEL_FULL_NAME/"* "$SAVE_PATH/livecodebench" @@ - bash run_simple_eval.sh $MODEL_FULL_NAME $SIMPLE_EVAL_TASKS $BUILD_MAX_OUTPUT_LEN $PORT | tee $SAVE_PATH/simple_eval.txt + bash run_simple_eval.sh "$MODEL_FULL_NAME" "$SIMPLE_EVAL_TASKS" "$BUILD_MAX_OUTPUT_LEN" "$PORT" | tee "$SAVE_PATH/simple_eval.txt"
384-391
: Clean-up quoting.- rm -f $SAVE_PATH/*.json - rm -f $SAVE_PATH/*.safetensors - rm -f $SAVE_PATH/*/*.json - rm -f $SAVE_PATH/*/*.engine - rm -f $SAVE_PATH/*/*.cache + rm -f "$SAVE_PATH"/*.json + rm -f "$SAVE_PATH"/*.safetensors + rm -f "$SAVE_PATH"/*/*.json + rm -f "$SAVE_PATH"/*/*.engine + rm -f "$SAVE_PATH"/*/*.cache
21-24
: Optional: add set -o pipefail with set -e. Improves failure detection in piped commands.

  set -e
+ set -o pipefail
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (18)
- CHANGELOG.rst (1 hunks)
- docker/Dockerfile (1 hunks)
- examples/llm_ptq/hf_ptq.py (4 hunks)
- examples/llm_ptq/modelopt_to_tensorrt_llm.py (0 hunks)
- examples/llm_ptq/run_tensorrt_llm.py (1 hunks)
- examples/llm_ptq/scripts/huggingface_example.sh (14 hunks)
- examples/llm_ptq/scripts/parser.sh (2 hunks)
- examples/vlm_eval/README.md (0 hunks)
- examples/vlm_eval/convert_gqa_for_eval.py (0 hunks)
- examples/vlm_eval/gqa.sh (0 hunks)
- examples/vlm_eval/model_gqa_loader.py (0 hunks)
- examples/vlm_eval/requirements-vila.txt (0 hunks)
- examples/vlm_eval/vlm_eval_utils.py (0 hunks)
- examples/vlm_ptq/scripts/huggingface_example.sh (3 hunks)
- examples/vlm_ptq/vlm_run.py (0 hunks)
- examples/vlm_ptq/vlm_visual_engine.py (0 hunks)
- modelopt/deploy/llm/generate.py (8 hunks)
- tests/examples/llm_ptq/test_llm_ptq.py (0 hunks)
💤 Files with no reviewable changes (10)
- examples/vlm_ptq/vlm_visual_engine.py
- examples/vlm_eval/README.md
- examples/vlm_eval/gqa.sh
- examples/vlm_ptq/vlm_run.py
- examples/vlm_eval/requirements-vila.txt
- tests/examples/llm_ptq/test_llm_ptq.py
- examples/vlm_eval/model_gqa_loader.py
- examples/vlm_eval/convert_gqa_for_eval.py
- examples/vlm_eval/vlm_eval_utils.py
- examples/llm_ptq/modelopt_to_tensorrt_llm.py
🧰 Additional context used
🧬 Code graph analysis (2)
examples/llm_ptq/run_tensorrt_llm.py (1)
modelopt/deploy/llm/generate.py (1)
generate_context_logits
(264-292)
modelopt/deploy/llm/generate.py (1)
examples/llm_eval/lm_eval_tensorrt_llm.py (1)
TRTLLM
(41-188)
🪛 Shellcheck (0.10.0)
examples/llm_ptq/scripts/huggingface_example.sh
[warning] 196-196: Remove quotes from right-hand side of =~ to match as a regex rather than literally.
(SC2076)
[warning] 288-288: Quote this to prevent word splitting.
(SC2046)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: linux
- GitHub Check: wait-checks / wait
- GitHub Check: code-quality
- GitHub Check: build-docs
🔇 Additional comments (6)
docker/Dockerfile (1)
1-1
: Please verify llmapi feature support in the new base image

Build and run the Docker image and, inside the container, confirm that tensorrt_llm.__version__ >= "1.1.0rc2" and that the llmapi symbols (CudaGraphConfig, KvCacheConfig, LLM) are present. For example:

docker build -f docker/Dockerfile -t trt-llm-img .
docker run --rm trt-llm-img python3 - <<'PYTHON'
from packaging.version import parse as V
import importlib
import tensorrt_llm as trtllm
assert V(trtllm.__version__) >= V("1.1.0rc2"), "Tensorrt-llm version is too low"
m = importlib.import_module("tensorrt_llm.llmapi")
assert hasattr(m, "CudaGraphConfig")
assert hasattr(m, "KvCacheConfig")
from tensorrt_llm.llmapi.llm import LLM
assert LLM is not None
print("llmapi check passed:", trtllm.__version__)
PYTHON

examples/llm_ptq/hf_ptq.py (1)
586-615
: HF vs. TRT-LLM export split: add a brief rationale and ensure docs match. Given TRT backend deprecation, clarify here that TRT-LLM checkpoint export is used only where the torch runtime lacks support (t5/bart/whisper or sparse). Also ensure CHANGELOG/README reflect this split.
Would you like me to add a short comment block and update README accordingly?
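For reference, a minimal sketch of what that comment block could say, based on the export routing described in this review; the exact wording is an assumption, not final text:

# Export routing rationale (illustrative wording):
# The unified Hugging Face checkpoint is the default export path. We fall back to
# export_tensorrt_llm_checkpoint() only where the TRT-LLM torch runtime lacks
# support today: encoder-decoder models (t5/bart/whisper) and non-dense
# (sparse) checkpoints.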
examples/llm_ptq/scripts/huggingface_example.sh (4)
80-99
: Ensure SAVE_PATH exists before symlinking. You mkdir -p later (Line 98). Good. Keep it before any potential early use. Already correct; no action needed.
345-371
: Benchmark dataset prep pathing looks good.Defaulting TRT_LLM_CODE_PATH and generating/reusing dataset deterministically is fine.
45-49
: ROOT_SAVE_PATH defaulting is solid.
21-23
: Parsing reuse from parser.sh is consistent.
if not args.auto_quantize_bits:
    assert (
        args.qformat
        in [
            "int4_awq",
            "fp8",
            "nvfp4",
            "nvfp4_awq",
            "w4a8_awq",
            "fp8_pb_wo",
            "w4a8_mxfp4_fp8",
            "nvfp4_mlp_only",
        ]
        or args.kv_cache_qformat in KV_QUANT_CFG_CHOICES
    ), f"Quantization format {args.qformat} not supported for HF export path"
Assertion is ineffective; the or args.kv_cache_qformat in KV_QUANT_CFG_CHOICES clause is always True.
This currently lets unsupported qformats through. Tighten the check.
- if not args.auto_quantize_bits:
- assert (
- args.qformat
- in [
- "int4_awq",
- "fp8",
- "nvfp4",
- "nvfp4_awq",
- "w4a8_awq",
- "fp8_pb_wo",
- "w4a8_mxfp4_fp8",
- "nvfp4_mlp_only",
- ]
- or args.kv_cache_qformat in KV_QUANT_CFG_CHOICES
- ), f"Quantization format {args.qformat} not supported for HF export path"
+ if not args.auto_quantize_bits:
+ assert args.qformat in ALLOWED_UNIFIED_HF_QFORMATS, (
+ f"Quantization format {args.qformat} not supported for HF export path"
+ )
Note: If you intended to allow “KV-cache-only” quant, handle that as a separate branch instead of weakening this assert.
Committable suggestion skipped: line range outside the PR's diff.
#Iterate over list of qformats provided and check if they are supported in HF export path
IFS=","
for qformat in $QFORMAT; do
    case $qformat in
    fp16 | bf16 | fp8 | fp8_pc_pt | fp8_pb_wo | int4_awq | nvfp4 | nvfp4_awq | w4a8_awq | w4a8_nvfp4_fp8 | w4a8_mxfp4_fp8) ;;
    *)
        echo "Unsupported quant argument: Expected one of: [fp16, bf16, fp8, fp8_pc_pt, fp8_pb_wo, int4_awq, nvfp4, nvfp4_awq, w4a8_awq, w4a8_nvfp4_fp8, w4a8_mxfp4_fp8]" >&2
        exit 1
💡 Verification agent
🧩 Analysis chain
Follow-up: finalize the allowed quant list after deduplication.
Please confirm the authoritative allowed set for the HF path and Torch llmapi to avoid drift.
Run to locate all hardcoded QFORMAT lists:
🏁 Script executed:
#!/bin/bash
rg -nP --type=sh -C2 '\b(fp16|bf16|fp8|fp8_pc_pt|fp8_pb_wo|int8_sq|int4_awq|nvfp4|nvfp4_awq|w4a8_awq|w4a8_nvfp4_fp8|w4a8_mxfp4_fp8)\b'
Length of output: 9308
Centralize quant-format lists for HF export and Torch llmapi

Move both the HF-export formats (fp16, bf16, fp8, fp8_pc_pt, fp8_pb_wo, int4_awq, nvfp4, nvfp4_awq, w4a8_awq, w4a8_nvfp4_fp8, w4a8_mxfp4_fp8) and the Torch-llmapi formats (fp16, bf16, fp8, nvfp4, int4_awq, w4a8_awq) into shared constants and update the case statements in examples/llm_ptq/scripts/huggingface_example.sh to reference them, ensuring a single source of truth and preventing drift.
🤖 Prompt for AI Agents
In examples/llm_ptq/scripts/huggingface_example.sh around lines 37-44, the
supported quant-format list is hard-coded in the case statement; centralize the
HF-export and Torch-llmapi format lists into a single shared constant file
(e.g., examples/llm_ptq/scripts/quant_formats.sh), export two variables
(HF_EXPORT_QFORMATS and TORCH_LLMAPI_QFORMATS), source that file from
huggingface_example.sh, and replace the inline case pattern with a check that
tests the qformat against the appropriate exported list (e.g., iterate or use a
helper function that returns true if qformat is in HF_EXPORT_QFORMATS) so both
scripts consume the same source of truth and avoid drift.
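For illustration, a minimal sketch of such a shared file and helper, using the file name and variable names suggested above; none of this exists in the repo yet and the format lists are taken from the validators quoted in this review:

# examples/llm_ptq/scripts/quant_formats.sh (proposed, not yet in the repo)
HF_EXPORT_QFORMATS="fp16 bf16 fp8 fp8_pc_pt fp8_pb_wo int4_awq nvfp4 nvfp4_awq w4a8_awq w4a8_nvfp4_fp8 w4a8_mxfp4_fp8"
TORCH_LLMAPI_QFORMATS="fp16 bf16 fp8 nvfp4 int4_awq w4a8_awq"
export HF_EXPORT_QFORMATS TORCH_LLMAPI_QFORMATS

# Returns 0 if $1 is one of the space-separated formats in $2.
is_supported_qformat() {
    local fmt=$1 allowed=$2 q
    for q in $allowed; do
        [ "$q" = "$fmt" ] && return 0
    done
    return 1
}

The consuming scripts would then source quant_formats.sh and call is_supported_qformat "$qformat" "$HF_EXPORT_QFORMATS" (or the Torch-llmapi list) instead of repeating the inline case statements.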
if [[ ! " fp8 nvfp4 bf16 fp16 int4_awq w4a8_awq " =~ " ${QFORMAT} " ]]; then | ||
echo "Quant $QFORMAT not supported with the TensorRT-LLM torch llmapi. Allowed values are: fp8, nvfp4, bf16, fp16, int4_awq, w4a8_awq" | ||
exit 0 | ||
fi | ||
|
Regex-in-string and comma-list membership check (SC2076) + incorrect handling of comma-separated QFORMAT.
Current check treats the whole comma-joined list as one token.
- if [[ ! " fp8 nvfp4 bf16 fp16 int4_awq w4a8_awq " =~ " ${QFORMAT} " ]]; then
- echo "Quant $QFORMAT not supported with the TensorRT-LLM torch llmapi. Allowed values are: fp8, nvfp4, bf16, fp16, int4_awq, w4a8_awq"
+ unsupported=""
+ IFS=','; for q in $QFORMAT; do
+ case "$q" in fp8|nvfp4|bf16|fp16|int4_awq|w4a8_awq) : ;; * ) unsupported+="${unsupported:+,}$q" ;; esac
+ done; IFS=' '
+ if [ -n "$unsupported" ]; then
+ echo "Quant(s) [$unsupported] not supported with the TensorRT-LLM Torch llmapi. Allowed: fp8, nvfp4, bf16, fp16, int4_awq, w4a8_awq"
exit 0
fi
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
if [[ ! " fp8 nvfp4 bf16 fp16 int4_awq w4a8_awq " =~ " ${QFORMAT} " ]]; then | |
echo "Quant $QFORMAT not supported with the TensorRT-LLM torch llmapi. Allowed values are: fp8, nvfp4, bf16, fp16, int4_awq, w4a8_awq" | |
exit 0 | |
fi | |
# Validate that each comma-separated QFORMAT is in the allowed list. | |
unsupported="" | |
# Split on commas | |
IFS=',' | |
for q in $QFORMAT; do | |
case "$q" in | |
fp8|nvfp4|bf16|fp16|int4_awq|w4a8_awq) | |
# supported, do nothing | |
;; | |
*) | |
# collect any unsupported formats | |
unsupported+="${unsupported:+,}$q" | |
;; | |
esac | |
done | |
# Restore IFS to a single space (or save/restore original IFS if needed) | |
IFS=' ' | |
if [ -n "$unsupported" ]; then | |
echo "Quant(s) [$unsupported] not supported with the TensorRT-LLM Torch llmapi. Allowed: fp8, nvfp4, bf16, fp16, int4_awq, w4a8_awq" | |
exit 0 | |
fi |
🧰 Tools
🪛 Shellcheck (0.10.0)
[warning] 196-196: Remove quotes from right-hand side of =~ to match as a regex rather than literally.
(SC2076)
🤖 Prompt for AI Agents
In examples/llm_ptq/scripts/huggingface_example.sh around lines 196-200, the
current membership check uses a regex-in-string and fails for comma-separated
QFORMAT values; update it to split QFORMAT on commas and validate each token
against an explicit allowed list. Specifically: define an array or string of
allowed formats (fp8 nvfp4 bf16 fp16 int4_awq w4a8_awq), split QFORMAT by ','
into tokens, trim whitespace, loop over tokens and for each token check
membership using a case statement or a simple loop comparison; if any token is
not in the allowed list, echo the same error message and exit with a non-zero
status (exit 1).
examples/llm_ptq/scripts/parser.sh (Outdated)
# Parse command-line options
ARGS=$(getopt -o "" -l "type:,model:,quant:,kv_cache_quant:,tp:,calib_tp:,pp:,sparsity:,awq_block_size:,calib:,calib_batch_size:,auto_quantize_bits:,input:,output:,batch:,tasks:,export_fmt:,lm_eval_tasks:,lm_eval_limit:,simple_eval_tasks:,trust_remote_code,use_seq_device_map,gpu_max_mem_percentage:,kv_cache_free_gpu_memory_fraction:,low_memory_mode,no-verbose,calib_dataset:" -n "$0" -- "$@") | ||
ARGS=$(getopt -o "" -l "model:,quant:,kv_cache_quant:,sparsity:,awq_block_size:,calib:,calib_batch_size:,auto_quantize_bits:,input:,output:,batch:,tasks:,lm_eval_tasks:,lm_eval_limit:,simple_eval_tasks:,trust_remote_code,use_seq_device_map,gpu_max_mem_percentage:,kv_cache_free_gpu_memory_fraction:,low_memory_mode,no-verbose,calib_dataset:" -n "$0" -- "$@") |
Missing getopt entry for --num_samples causes parsing failure.
You parse --num_samples below (Line 57) but it's not declared in the getopt long options, so passing it will fail.
Apply this diff:
- ARGS=$(getopt -o "" -l "model:,quant:,kv_cache_quant:,sparsity:,awq_block_size:,calib:,calib_batch_size:,auto_quantize_bits:,input:,output:,batch:,tasks:,lm_eval_tasks:,lm_eval_limit:,simple_eval_tasks:,trust_remote_code,use_seq_device_map,gpu_max_mem_percentage:,kv_cache_free_gpu_memory_fraction:,low_memory_mode,no-verbose,calib_dataset:" -n "$0" -- "$@")
+ ARGS=$(getopt -o "" -l "model:,quant:,kv_cache_quant:,sparsity:,awq_block_size:,calib:,calib_batch_size:,auto_quantize_bits:,input:,output:,batch:,tasks:,lm_eval_tasks:,lm_eval_limit:,simple_eval_tasks:,num_samples:,trust_remote_code,use_seq_device_map,gpu_max_mem_percentage:,kv_cache_free_gpu_memory_fraction:,low_memory_mode,no-verbose,calib_dataset:" -n "$0" -- "$@")
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
ARGS=$(getopt -o "" -l "model:,quant:,kv_cache_quant:,sparsity:,awq_block_size:,calib:,calib_batch_size:,auto_quantize_bits:,input:,output:,batch:,tasks:,lm_eval_tasks:,lm_eval_limit:,simple_eval_tasks:,trust_remote_code,use_seq_device_map,gpu_max_mem_percentage:,kv_cache_free_gpu_memory_fraction:,low_memory_mode,no-verbose,calib_dataset:" -n "$0" -- "$@") | |
ARGS=$(getopt -o "" -l "model:,quant:,kv_cache_quant:,sparsity:,awq_block_size:,calib:,calib_batch_size:,auto_quantize_bits:,input:,output:,batch:,tasks:,lm_eval_tasks:,lm_eval_limit:,simple_eval_tasks:,num_samples:,trust_remote_code,use_seq_device_map,gpu_max_mem_percentage:,kv_cache_free_gpu_memory_fraction:,low_memory_mode,no-verbose,calib_dataset:" -n "$0" -- "$@") |
🤖 Prompt for AI Agents
In examples/llm_ptq/scripts/parser.sh around line 37, the getopt long options
list is missing the --num_samples entry which causes parsing failures when that
flag is used; update the long options string to include num_samples: (with a
trailing colon) among the other long options so getopt recognizes and parses
--num_samples correctly, keeping the comma separation/format consistent with the
existing options.
trt_kv_cache_config = TRT_KvCacheConfig(free_gpu_memory_fraction=0.7)
trt_kv_cache_config.max_tokens = self._max_seq_len * (
    max_batch_size if max_batch_size > 0 else 8
)
🛠️ Refactor suggestion
kv_cache_config argument is ignored; honor user overrides.
Currently hardcodes free_gpu_memory_fraction and max_tokens. Apply values from kv_cache_config when provided.
- trt_kv_cache_config = TRT_KvCacheConfig(free_gpu_memory_fraction=0.7)
- trt_kv_cache_config.max_tokens = self._max_seq_len * (
- max_batch_size if max_batch_size > 0 else 8
- )
+ trt_kv_cache_config = TRT_KvCacheConfig(
+ free_gpu_memory_fraction=kv_cache_config.get("free_gpu_memory_fraction", 0.7)
+ )
+ trt_kv_cache_config.max_tokens = kv_cache_config.get(
+ "max_tokens",
+ self._max_seq_len * (max_batch_size if max_batch_size > 0 else 8),
+ )
If more fields are supported by TRT_KvCacheConfig, merge them generically:
for k, v in kv_cache_config.items():
if hasattr(trt_kv_cache_config, k):
setattr(trt_kv_cache_config, k, v)
🤖 Prompt for AI Agents
In modelopt/deploy/llm/generate.py around lines 122 to 125, the code creates a
TRT_KvCacheConfig with hardcoded free_gpu_memory_fraction and max_tokens and
ignores the incoming kv_cache_config parameter; update this to first construct
the default TRT_KvCacheConfig (with max_tokens defaulting to self._max_seq_len *
(max_batch_size if max_batch_size > 0 else 8)), then if kv_cache_config is
provided, iterate its items and for each key value pair set the attribute on the
TRT_KvCacheConfig instance only if that attribute exists (so user-supplied
overrides like free_gpu_memory_fraction and max_tokens take effect), ensuring no
unknown keys raise errors.
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

@@ Coverage Diff @@
## main #297 +/- ##
=======================================
Coverage 73.82% 73.82%
=======================================
Files 172 172
Lines 17438 17438
=======================================
Hits 12873 12873
Misses 4565 4565

☔ View full report in Codecov by Sentry.
Missing changes in
Since this is a big change, please make sure to test it
Co-authored-by: Keval Morabia <[email protected]> Signed-off-by: Chenjie Luo <[email protected]>
The llm_sparsity example mentions "export tensorrt-llm checkpoint" and the llm_eval example mentions "TensorRT LLM", which may also need to be updated.
Signed-off-by: Chenjie Luo <[email protected]>
…/TensorRT-Model-Optimizer into chenjiel/deprecate_trtllm_build_2
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
examples/vlm_ptq/scripts/huggingface_example.sh (1)
150-155
: Bug: qformat is undefined; test against $QFORMAT and handle comma lists.

Current branch never triggers. Fix and support multiple tokens.

- if [ "$qformat" == "bf16" ] || [ "$qformat" == "fp16" ]; then
+ # If QFORMAT is exactly a single precision token (no quant), just link HF files.
+ if [[ "$QFORMAT" == "bf16" || "$QFORMAT" == "fp16" ]]; then

If mixed values are possible, split and detect:

- if [ "$qformat" == "bf16" ] || [ "$qformat" == "fp16" ]; then
+ is_prec_only=false
+ IFS=','; set -- $QFORMAT; [ $# -eq 1 ] && { [ "$1" = "bf16" ] || [ "$1" = "fp16" ]; } && is_prec_only=true; IFS=' '
+ if $is_prec_only; then

examples/llm_ptq/scripts/parser.sh (1)
36-68
: Add missing getopt long option num_samples or remove its case arm.

In examples/llm_ptq/scripts/parser.sh (lines 36–68) the case block handles --num_samples but the getopt -l long-options list does not — add num_samples: to the -l string or remove the case handling.

examples/llm_ptq/scripts/huggingface_example.sh (1)
148-159
: Bug: uses undefined $qformat; bf16/fp16 passthrough check broken. This block references $qformat outside any loop and fails for comma-separated QFORMAT.
-if [[ $TASKS =~ "quant" ]] || [[ ! -d "$SAVE_PATH" ]] || [[ ! $(ls -A $SAVE_PATH) ]]; then - - if [ "$qformat" == "bf16" ] || [ "$qformat" == "fp16" ]; then +if [[ $TASKS =~ "quant" ]] || [[ ! -d "$SAVE_PATH" ]] || [[ ! $(ls -A $SAVE_PATH) ]]; then + + # If QFORMAT is a single precision (bf16/fp16), symlink files instead of re-quantizing + if [[ "$QFORMAT" =~ ^(bf16|fp16)$ ]]; then if [ -d "$MODEL_PATH" ]; then MODEL_CONFIG_EXIST=true MODEL_CONFIG=$MODEL_PATH/config.json - for file in $MODEL_PATH/*; do ln -sf "$file" $SAVE_PATH/; done + mkdir -p "$SAVE_PATH" + for file in "$MODEL_PATH"/*; do ln -sf "$file" "$SAVE_PATH/"; done else echo "Please use the model directory where the config.json file is present." exit 1 fi fi
♻️ Duplicate comments (15)
examples/llm_ptq/run_tensorrt_llm.py (1)
83-84
: Guard context‑logits call to avoid runtime assert on unsupported versions. Keep example robust; skip when unsupported instead of asserting inside wrapper.
- logits = llm.generate_context_logits(input_texts) - print(f"Generated logits: {logits}") + if hasattr(llm, "generate_context_logits"): + try: + logits = llm.generate_context_logits(input_texts) + print(f"Generated logits: {logits}") + except AssertionError as e: + print(f"Generated logits: skipped ({e})") + else: + print("Generated logits: skipped (not supported)")examples/vlm_ptq/scripts/huggingface_example.sh (4)
37-47
: Remove duplicate and divergent QFORMAT validation. This duplicates the later validator (Lines 65–75) and has a different allowlist. Drop this block to avoid drift.
-#Iterate over list of qformats provided and check if they are supported in HF export path -IFS="," -for qformat in $QFORMAT; do - case $qformat in - fp16 | bf16 | fp8 | fp8_pc_pt | fp8_pb_wo | int4_awq | nvfp4 | nvfp4_awq | w4a8_awq | w4a8_nvfp4_fp8 | w4a8_mxfp4_fp8) ;; - *) - echo "Unsupported quant argument: Expected one of: [fp16, bf16, fp8, fp8_pc_pt, fp8_pb_wo, int4_awq, nvfp4, nvfp4_awq, w4a8_awq, w4a8_nvfp4_fp8, w4a8_mxfp4_fp8]" >&2 - exit 1 - ;; - esac -done -IFS=" "
182-185
: Outdated guidance: referencestrtllm-build
after deprecating TRT backend in examples.Replace with Torch-backend guidance or explicit “not supported in this example” message.
- echo "Sparse quantization detected (SPARSITY_FMT=$SPARSITY_FMT). Please deploy with the TRT-LLM using trtllm-build. Checkpoint export_path: $SAVE_PATH" + echo "Sparse quantization detected (SPARSITY_FMT=$SPARSITY_FMT). Sparse deployment is not supported in this HF example. Use the TRT-LLM Torch backend per official docs. Checkpoint export_path: $SAVE_PATH"
196-199
: Fix comma-list membership check (SC2076) and validate each token.Current regex-in-string breaks for lists.
- if [[ ! " fp8 nvfp4 bf16 fp16 int4_awq w4a8_awq " =~ " ${QFORMAT} " ]]; then - echo "Quant $QFORMAT not supported with the TensorRT-LLM torch llmapi. Allowed values are: fp8, nvfp4, bf16, fp16, int4_awq, w4a8_awq" - exit 0 - fi + unsupported="" + IFS=','; for q in $QFORMAT; do + case "$q" in fp8|nvfp4|bf16|fp16|int4_awq|w4a8_awq) : ;; * ) unsupported+="${unsupported:+,}$q" ;; esac + done; IFS=' ' + if [ -n "$unsupported" ]; then + echo "Quant(s) [$unsupported] not supported with the TensorRT-LLM Torch llmapi. Allowed: fp8, nvfp4, bf16, fp16, int4_awq, w4a8_awq" + exit 0 + fi
337-341
:PP
may be unset after parser changes; comparison can error.Drop the check or default
PP=1
safely.- if [ "$PP" -ne 1 ]; then - echo "Benchmark does not work with multi PP. Please run the c++ benchmark in the TensorRT-LLM repo..." - exit 1 - fi + # Benchmark assumes single pipeline parallelism.examples/llm_ptq/hf_ptq.py (2)
92-105
: Single source of truth for allowed qformats.Extract the list into a constant and reuse here and below.
+ALLOWED_UNIFIED_HF_QFORMATS = { + "fp8", + "int4_awq", + "nvfp4", + "nvfp4_awq", + "w4a8_awq", + "fp8_pb_wo", + "w4a8_mxfp4_fp8", + "nvfp4_mlp_only", +} ... - assert all( - qformat - in [ - "fp8", - "int4_awq", - "nvfp4", - "nvfp4_awq", - "w4a8_awq", - "fp8_pb_wo", - "w4a8_mxfp4_fp8", - "nvfp4_mlp_only", - ] - for qformat in qformat_list - ), "One or more quantization formats provided are not supported for unified checkpoint export" + assert all(q in ALLOWED_UNIFIED_HF_QFORMATS for q in qformat_list), ( + "One or more quantization formats provided are not supported for unified checkpoint export" + )
214-229
: Fix assert: currentor args.kv_cache_qformat in KV_QUANT_CFG_CHOICES
makes it always true.Enforce qformat membership independently.
- if not args.auto_quantize_bits: - assert ( - args.qformat - in [ - "int4_awq", - "fp8", - "nvfp4", - "nvfp4_awq", - "w4a8_awq", - "fp8_pb_wo", - "w4a8_mxfp4_fp8", - "nvfp4_mlp_only", - ] - or args.kv_cache_qformat in KV_QUANT_CFG_CHOICES - ), f"Quantization format {args.qformat} not supported for HF export path" + if not args.auto_quantize_bits: + assert args.qformat in ALLOWED_UNIFIED_HF_QFORMATS, ( + f"Quantization format {args.qformat} not supported for HF export path" + ) + assert args.kv_cache_qformat in KV_QUANT_CFG_CHOICES, ( + f"KV cache quantization format {args.kv_cache_qformat} is not supported" + )modelopt/deploy/llm/generate.py (2)
69-78
: Docstring param name is wrong (“engine_dir”); use “checkpoint_dir”.Aligns with the signature and deprecates engine semantics. This was called out earlier and still needs fixing.
- Args: - engine_dir: the directory path of the TensorRT-LLM engine. + Args: + checkpoint_dir: Path to the checkpoint/config directory (unified HF/TRT-LLM format). tokenizer: the tokenizer. For example, a tokenizer from the Huggingface model. kv_cache_config: the kv cache config as a dict. Please refer to https://nvidia.github.io/TensorRT-LLM/performance/performance-tuning-guide/ medusa_choices: The medusa choices for the decoding config. tp: the tensor parallel size (for the torch backend). If 0, it will be set to the number of GPUs. trust_remote_code: whether to trust the remote code (for the torch backend). max_batch_size: Max batch size for the LLM backend. If 0, it is not specified.
121-126
: Honor user-provided kv_cache_config overrides. Currently hardcoded; apply provided values when present. This was previously flagged and remains.
- trt_kv_cache_config = TRT_KvCacheConfig(free_gpu_memory_fraction=0.7) - trt_kv_cache_config.max_tokens = self._max_seq_len * ( - max_batch_size if max_batch_size > 0 else 8 - ) + trt_kv_cache_config = TRT_KvCacheConfig(free_gpu_memory_fraction=0.7) + trt_kv_cache_config.max_tokens = self._max_seq_len * ( + max_batch_size if max_batch_size > 0 else 8 + ) + if kv_cache_config: + for k, v in kv_cache_config.items(): + if hasattr(trt_kv_cache_config, k): + setattr(trt_kv_cache_config, k, v)examples/llm_ptq/scripts/parser.sh (1)
37-37
: Add missing --num_samples to getopt long options.Currently parsed in case but undeclared in getopt, causing “unrecognized option” errors when used. Mirrors an earlier review; still unresolved.
- ARGS=$(getopt -o "" -l "model:,quant:,kv_cache_quant:,sparsity:,awq_block_size:,calib:,calib_batch_size:,auto_quantize_bits:,input:,output:,batch:,tasks:,lm_eval_tasks:,lm_eval_limit:,simple_eval_tasks:,trust_remote_code,use_seq_device_map,gpu_max_mem_percentage:,kv_cache_free_gpu_memory_fraction:,low_memory_mode,no-verbose,calib_dataset:" -n "$0" -- "$@") + ARGS=$(getopt -o "" -l "model:,quant:,kv_cache_quant:,sparsity:,awq_block_size:,calib:,calib_batch_size:,auto_quantize_bits:,input:,output:,batch:,tasks:,lm_eval_tasks:,lm_eval_limit:,simple_eval_tasks:,num_samples:,trust_remote_code,use_seq_device_map,gpu_max_mem_percentage:,kv_cache_free_gpu_memory_fraction:,low_memory_mode,no-verbose,calib_dataset:" -n "$0" -- "$@")examples/llm_ptq/scripts/huggingface_example.sh (5)
37-47
: Remove duplicate QFORMAT validator (keep single source of truth). This block duplicates the validator below and drifts in the allowed set. Delete this one and keep the later validator.
-#Iterate over list of qformats provided and check if they are supported in HF export path -IFS="," -for qformat in $QFORMAT; do - case $qformat in - fp16 | bf16 | fp8 | fp8_pc_pt | fp8_pb_wo | int4_awq | nvfp4 | nvfp4_awq | w4a8_awq | w4a8_nvfp4_fp8 | w4a8_mxfp4_fp8) ;; - *) - echo "Unsupported quant argument: Expected one of: [fp16, bf16, fp8, fp8_pc_pt, fp8_pb_wo, int4_awq, nvfp4, nvfp4_awq, w4a8_awq, w4a8_nvfp4_fp8, w4a8_mxfp4_fp8]" >&2 - exit 1 - ;; - esac -done -IFS=" "
182-185
: Update sparse-quant message; remove trtllm-build reference.Keep messaging consistent with TRT backend deprecation.
- if [[ "$SPARSITY_FMT" != "dense" ]]; then - echo "Sparse quantization detected (SPARSITY_FMT=$SPARSITY_FMT). Please deploy with the TRT-LLM using trtllm-build. Checkpoint export_path: $SAVE_PATH" - exit 0 - fi + if [[ "$SPARSITY_FMT" != "dense" ]]; then + echo "Sparse quantization detected (SPARSITY_FMT=$SPARSITY_FMT). Not supported in this HF example; use the TRT-LLM Torch backend sparse deployment guide. Checkpoint export_path: $SAVE_PATH" + exit 0 + fi
92-95
: Remove deprecated EXPORT_FORMAT from name; add guard for legacy env. Interpolating ${EXPORT_FORMAT} yields a trailing underscore and empty token.
-MODEL_FULL_NAME=${MODEL_NAME}_${QFORMAT_MODIFIED}${KV_CACHE_QUANT:+_kv_${KV_CACHE_QUANT}}_${EXPORT_FORMAT} +MODEL_FULL_NAME=${MODEL_NAME}_${QFORMAT_MODIFIED}${KV_CACHE_QUANT:+_kv_${KV_CACHE_QUANT}}Add an early guard (near argument parsing):
+if [ -n "${EXPORT_FORMAT:-}" ]; then + echo "ERROR: EXPORT_FORMAT is deprecated; remove it from your environment." >&2 + exit 1 +fi
337-341
: PP check can error when PP is unset.Use a safe default or remove the check per deprecation.
- if [ "$PP" -ne 1 ]; then + if [ "${PP:-1}" -ne 1 ]; then echo "Benchmark does not work with multi PP. Please run the c++ benchmark in the TensorRT-LLM repo..." exit 1 fi
196-199
: Fix QFORMAT membership check (SC2076) and comma-list handling. Current regex-in-string check is wrong for comma-separated values.
- if [[ ! " fp8 nvfp4 bf16 fp16 int4_awq w4a8_awq " =~ " ${QFORMAT} " ]]; then - echo "Quant $QFORMAT not supported with the TensorRT-LLM torch llmapi. Allowed values are: fp8, nvfp4, bf16, fp16, int4_awq, w4a8_awq" - exit 0 - fi + unsupported="" + IFS=','; for q in $QFORMAT; do + case "$q" in fp8|nvfp4|bf16|fp16|int4_awq|w4a8_awq) : ;; * ) unsupported+="${unsupported:+,}$q" ;; esac + done; IFS=' ' + if [ -n "$unsupported" ]; then + echo "Quant(s) [$unsupported] not supported with the TensorRT-LLM Torch llmapi. Allowed: fp8, nvfp4, bf16, fp16, int4_awq, w4a8_awq" + exit 0 + fi
🧹 Nitpick comments (26)
CHANGELOG.rst (2)
9-11
: Changelog accuracy:--export_fmt
is deprecated/ignored, not removed.hf_ptq.py still accepts
--export_fmt
(default "hf") and only warns. Update the note to “Deprecated and ignored in 0.37; removal planned next release.”Apply:
- - ``--export_fmt`` flag in ``examples/llm_ptq`` is removed. By default we export to the unified Hugging Face checkpoint format. + - ``--export_fmt`` flag in ``examples/llm_ptq`` is deprecated and ignored. By default we export to the unified Hugging Face checkpoint format. It will be removed in the next release.
9-11
: Add TensorRT-LLM upgrade to 1.1.0rc2 in v0.37 CHANGELOGAdd an entry under v0.37 "New Features" (or "Misc"): "Upgrade TensorRT-LLM dependency to 1.1.0rc2." Evidence: docker/Dockerfile uses nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc2 and modelopt/deploy/llm/generate.py prompts upgrading to 1.1.0rc2 or later.
Follow-ups to verify/clean:
- .github/CODEOWNERS still references examples/vlm_eval — remove or justify keeping.
- Update or document remaining uses of --export_fmt and trtllm-build (e.g., examples/llm_ptq/hf_ptq.py, tests/gpu/torch/export/test_unified_hf_export_and_check_safetensors.py, various docs/examples).
examples/llm_sparsity/README.md (1)
151-151
: Clarify backend and minimum TRT-LLM version.Add a note that the linked instructions use the TRT backend; if your repo examples favor the Torch runtime elsewhere, point that out and specify the minimum TRT-LLM version that supports
--weight_sparsity
.docker/Dockerfile (1)
1-1
: Pin base image/version and reflect in docs.Consider pinning by digest or an ARG to avoid pulling moving RCs and update install docs accordingly.
Apply:
-FROM nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc2 +ARG TRT_LLM_TAG=1.1.0rc2 +FROM nvcr.io/nvidia/tensorrt-llm/release:${TRT_LLM_TAG}examples/llm_eval/README.md (3)
96-96
: Tighten placeholder: make it clearengine_dir
accepts a quantized HF checkpoint.Suggest renaming the placeholder to
<quantized_checkpoint_dir>
for consistency across this page.-python lm_eval_tensorrt_llm.py --model trt-llm --model_args tokenizer=<HF model folder>,engine_dir=<Quantized checkpoint dir> --tasks <comma separated tasks> --batch_size <engine batch size> +python lm_eval_tensorrt_llm.py --model trt-llm --model_args tokenizer=<HF model folder>,engine_dir=<quantized_checkpoint_dir> --tasks <comma separated tasks> --batch_size <engine batch size>
143-143
: Same placeholder fix for MMLU section.-python mmlu.py --model_name causal --model_path <HF model folder or model card> --engine_dir <Quantized checkpoint dir> +python mmlu.py --model_name causal --model_path <HF model folder or model card> --engine_dir <quantized_checkpoint_dir>
166-166
: Same placeholder fix for MT-Bench section.-bash run_fastchat.sh -h <HF model folder or model card> <Quantized checkpoint dir> +bash run_fastchat.sh -h <HF model folder or model card> <quantized_checkpoint_dir>examples/vlm_ptq/scripts/huggingface_example.sh (1)
286-289
: Quote subshell to avoid word splitting when computing JSONL_PATH.Prevents breakage with spaces/newlines.
- JSONL_PATH=$(readlink -f $(find $SAVE_PATH -type f -name '*.jsonl')) + JSONL_PATH="$(readlink -f "$(find "$SAVE_PATH" -type f -name '*.jsonl' -print -quit)")"examples/llm_ptq/hf_ptq.py (3)
126-143
: Avoid hidden global:auto_quantize
readsargs.kv_cache_qformat
.Pass KV format explicitly to keep function pure.
-def auto_quantize( - model, qformat, auto_quantize_bits, calib_dataloader, calibrate_loop, batch_size=1 -): +def auto_quantize( + model, qformat, auto_quantize_bits, calib_dataloader, calibrate_loop, kv_cache_qformat: str, batch_size: int = 1 +): ... - enable_quant_kv_cache = args.kv_cache_qformat != "none" + enable_quant_kv_cache = kv_cache_qformat != "none" ... - kv_cache_quant_cfg = getattr(mtq, KV_QUANT_CFG_CHOICES[args.kv_cache_qformat])["quant_cfg"] + kv_cache_quant_cfg = getattr(mtq, KV_QUANT_CFG_CHOICES[kv_cache_qformat])["quant_cfg"]Call site (outside hunk):
- model = auto_quantize( + model = auto_quantize( model, args.qformat, args.auto_quantize_bits, calib_dataloader, calibrate_loop, + args.kv_cache_qformat, args.batch_size, )
700-705
: Sharpen deprecation message and type.Make the help explicit and warn with DeprecationWarning.
- parser.add_argument( + parser.add_argument( "--export_fmt", required=False, default="hf", choices=["tensorrt_llm", "hf"], - help="Deprecated. Please avoid using this argument.", + help="Deprecated and ignored in 0.37; will be removed next release.", )
759-761
: Use DeprecationWarning category.- if args.export_fmt != "hf": - warnings.warn("Deprecated. --export_fmt will be ignored.") + if args.export_fmt != "hf": + warnings.warn("Deprecated. --export_fmt is ignored and will be removed in the next release.", DeprecationWarning)examples/llm_ptq/scripts/parser.sh (3)
141-141
: Typo in printed field name (num_sample → num_samples).Avoid confusion with the flag name.
- echo "num_sample: $NUM_SAMPLES" + echo "num_samples: $NUM_SAMPLES"
117-123
: Outdated error message still references TRT-LLM deployment.The repo is deprecating TRT backend; update the message to reflect the torch path or stay backend-agnostic.
- echo "Sparsity is not compatible with 'awq' quantization for TRT-LLM deployment." + echo "Sparsity is not compatible with 'awq' quantization on the torch backend."
92-97
: Usage string omits newly supported flags.Include --num_samples to keep help accurate.
- echo "Optional args: --sparsity=<SPARSITY_FMT> --awq_block_size=<AWQ_BLOCK_SIZE> --calib=<CALIB_SIZE>" + echo "Optional args: --sparsity=<SPARSITY_FMT> --awq_block_size=<AWQ_BLOCK_SIZE> --calib=<CALIB_SIZE> --num_samples=<NUM_SAMPLES>"modelopt/deploy/llm/generate.py (6)
109-111
: Fail fast when no GPU is available (tp resolves to 0).Avoid passing tensor_parallel_size=0 to backend.
if tp < 1: - tp = torch.cuda.device_count() + tp = torch.cuda.device_count() + if tp < 1: + raise RuntimeError("No CUDA devices available; torch backend requires at least 1 GPU.")
183-189
: Avoid enabling beam search when beam_width == 1.Minor perf/clarity improvement.
- sampling_config = SamplingParams( - max_tokens=max_new_tokens, - use_beam_search=True, - best_of=beam_width, - stop=stop_words, - **kwargs, - ) + use_beam = beam_width > 1 + sampling_config = SamplingParams( + max_tokens=max_new_tokens, + use_beam_search=use_beam, + best_of=beam_width, + stop=stop_words, + **kwargs, + )
269-279
: Fix return type annotation (torch.Tensor).Use the correct Tensor type.
- ) -> list[torch.tensor]: + ) -> list[torch.Tensor]:
79-83
: Harden config.json loading with a clearer error.Give a precise message when the checkpoint is missing/invalid.
- with open(Path(checkpoint_dir) / "config.json") as config_file: - config = json.load(config_file) + cfg_path = Path(checkpoint_dir) / "config.json" + try: + with open(cfg_path) as config_file: + config = json.load(config_file) + except FileNotFoundError as e: + raise FileNotFoundError(f"config.json not found at: {cfg_path}") from e + except json.JSONDecodeError as e: + raise ValueError(f"Invalid JSON in {cfg_path}: {e}") from e
112-120
: Heuristic MOE detection may misfire.Scanning keys for “num” and “experts” is brittle; consider reading a known path (e.g., config.get("moe", {}).get("num_experts")) or checking values are ints > 0 before toggling EP/attention DP.
57-66
: Remove unused medusa_choices parameter or implement support.It’s accepted then immediately asserted unsupported; drop it to avoid API confusion.
- medusa_choices: Any = None, + # medusa_choices removed; not supported by torch llmapiexamples/llm_ptq/scripts/huggingface_example.sh (6)
65-75
: Finalize allowed quant list and centralize it.Confirm the authoritative HF-export set and make this the single source of truth. Consider sourcing from a shared file to avoid drift.
-IFS="," -for qformat in $QFORMAT; do - case $qformat in - fp8 | fp8_pc_pt | fp8_pb_wo | int8_sq | int4_awq | w4a8_awq | fp16 | bf16 | nvfp4 | nvfp4_awq | w4a8_nvfp4_fp8 | w4a8_mxfp4_fp8) ;; +IFS="," +for qformat in $QFORMAT; do + case $qformat in + fp16|bf16|fp8|fp8_pc_pt|fp8_pb_wo|int4_awq|nvfp4|nvfp4_awq|w4a8_awq|w4a8_nvfp4_fp8|w4a8_mxfp4_fp8) ;; *) - echo "Unknown quant argument: Expected one of: [fp8, fp8_pc_pt, fp8_pb_wo, int8_sq, int4_awq, w4a8_awq, fp16, bf16, nvfp4, nvfp4_awq, w4a8_nvfp4_fp8, w4a8_mxfp4_fp8]" >&2 + echo "Unknown quant argument: Expected one of: [fp16, bf16, fp8, fp8_pc_pt, fp8_pb_wo, int4_awq, nvfp4, nvfp4_awq, w4a8_awq, w4a8_nvfp4_fp8, w4a8_mxfp4_fp8]" >&2 exit 1 ;; esac done IFS=" "Optional: move allowed lists to examples/llm_ptq/scripts/quant_formats.sh and source here.
176-180
: Enc–Dec guidance contradicts deprecating TRT build path.Linking to TRT-LLM enc_dec example is fine, but avoid implying engine build path is required here. Clarify Torch backend route or mark unsupported in this script.
- echo "Please continue to deployment with the TRT-LLM enc_dec example, https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/enc_dec. Checkpoint export_path: $SAVE_PATH" + echo "Enc–Dec models are not deployed by this HF example. Please follow the TRT-LLM Torch backend enc_dec guide. Checkpoint export_path: $SAVE_PATH"
201-205
: Robust boolean handling for RUN_ARGS (avoid relying on shell builtins).Prefer explicit string checks to avoid issues when vars are unset.
- if $TRUST_REMOTE_CODE; then + if [ "${TRUST_REMOTE_CODE,,}" = "true" ]; then RUN_ARGS+=" --trust_remote_code " fi - python run_tensorrt_llm.py --engine_dir=$SAVE_PATH $RUN_ARGS + python run_tensorrt_llm.py --engine_dir="$SAVE_PATH" $RUN_ARGS
285-289
: Quote command substitution; handle multiple JSONLs deterministically.Prevent word-splitting and choose a single file (or collect all).
- JSONL_PATH=$(readlink -f $(find $SAVE_PATH -type f -name '*.jsonl')) + JSONL_PATH="$(find "$SAVE_PATH" -type f -name '*.jsonl' -print0 | xargs -0 -r ls -1t | head -n1)"
297-305
: Confirm trtllm-serve is the intended Torch backend server.If Torch llmapi is the target, verify trtllm-serve supports this flow; otherwise switch to the correct serve entrypoint or gate this task.
121-135
: Consistent boolean parsing across the script.Multiple places rely on executing “true/false”. Switch to string comparisons for resilience.
-if $TRUST_REMOTE_CODE; then +if [ "${TRUST_REMOTE_CODE,,}" = "true" ]; then PTQ_ARGS+=" --trust_remote_code " fi -if $USE_SEQ_DEVICE_MAP; then +if [ "${USE_SEQ_DEVICE_MAP,,}" = "true" ]; then PTQ_ARGS+=" --use_seq_device_map " fi -if ! $VERBOSE; then +if [ "${VERBOSE,,}" != "true" ]; then PTQ_ARGS+=" --no-verbose " fiAlso applies to: 201-205
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (20)
- CHANGELOG.rst (1 hunks)
- docker/Dockerfile (1 hunks)
- examples/llm_eval/README.md (3 hunks)
- examples/llm_ptq/hf_ptq.py (6 hunks)
- examples/llm_ptq/modelopt_to_tensorrt_llm.py (0 hunks)
- examples/llm_ptq/run_tensorrt_llm.py (1 hunks)
- examples/llm_ptq/scripts/huggingface_example.sh (14 hunks)
- examples/llm_ptq/scripts/parser.sh (2 hunks)
- examples/llm_sparsity/README.md (1 hunks)
- examples/vlm_eval/README.md (0 hunks)
- examples/vlm_eval/convert_gqa_for_eval.py (0 hunks)
- examples/vlm_eval/gqa.sh (0 hunks)
- examples/vlm_eval/model_gqa_loader.py (0 hunks)
- examples/vlm_eval/requirements-vila.txt (0 hunks)
- examples/vlm_eval/vlm_eval_utils.py (0 hunks)
- examples/vlm_ptq/scripts/huggingface_example.sh (3 hunks)
- examples/vlm_ptq/vlm_run.py (0 hunks)
- examples/vlm_ptq/vlm_visual_engine.py (0 hunks)
- modelopt/deploy/llm/generate.py (8 hunks)
- tests/examples/llm_ptq/test_llm_ptq.py (0 hunks)
💤 Files with no reviewable changes (10)
- examples/vlm_eval/requirements-vila.txt
- examples/vlm_ptq/vlm_visual_engine.py
- tests/examples/llm_ptq/test_llm_ptq.py
- examples/vlm_eval/vlm_eval_utils.py
- examples/vlm_eval/convert_gqa_for_eval.py
- examples/vlm_eval/README.md
- examples/vlm_eval/gqa.sh
- examples/vlm_eval/model_gqa_loader.py
- examples/llm_ptq/modelopt_to_tensorrt_llm.py
- examples/vlm_ptq/vlm_run.py
🧰 Additional context used
🧬 Code graph analysis (2)
examples/llm_ptq/run_tensorrt_llm.py (1)
modelopt/deploy/llm/generate.py (1)
generate_context_logits
(264-292)
modelopt/deploy/llm/generate.py (1)
examples/llm_eval/lm_eval_tensorrt_llm.py (1)
TRTLLM
(41-188)
🪛 Shellcheck (0.10.0)
examples/llm_ptq/scripts/huggingface_example.sh
[warning] 196-196: Remove quotes from right-hand side of =~ to match as a regex rather than literally.
(SC2076)
[warning] 288-288: Quote this to prevent word splitting.
(SC2046)
🔇 Additional comments (6)
examples/llm_ptq/hf_ptq.py (1)
585-614
: Confirm export routing rules.For “t5/bart/whisper or sparsity!=dense” you still export TRT-LLM checkpoints. Ensure docs/tutorials reflect that these models require TRT-LLM export (not unified HF) and that CI covers this path.
examples/vlm_ptq/scripts/huggingface_example.sh (1)
373-375
: Confirm trtllm-bench flag usage (--model vs --model_path)Script passes both
--model $MODEL_PATH
and--model_path $SAVE_PATH
; verification failed (trtllm-bench not available). Runtrtllm-bench --help
locally to confirm which flag expects the model/engine vs a save/checkpoint path and update the script to use only the correct flag.
Location: examples/vlm_ptq/scripts/huggingface_example.sh:373-375modelopt/deploy/llm/generate.py (1)
24-36
: ImportError message LGTM.Clear guidance on required tensorrt-llm version.
examples/llm_ptq/scripts/huggingface_example.sh (3)
236-246
: Output paths refactor to SAVE_PATH looks good.
317-322
: Good: results consolidated under SAVE_PATH with stable naming.
342-375
: Benchmark path updates look consistent with SAVE_PATH and Torch backend.
--pyt_ckpt_path=$MODEL_PATH \
--export_path=$SAVE_PATH \
--sparsity_fmt=$SPARSITY_FMT \
--qformat=$QFORMAT \
`MODEL_FULL_NAME` includes the deprecated `${EXPORT_FORMAT}` and produces a trailing underscore. Remove `${EXPORT_FORMAT}` and optionally fail fast if it is set.

```diff
-MODEL_FULL_NAME=${MODEL_NAME}_${QFORMAT_MODIFIED}${KV_CACHE_QUANT:+_kv_${KV_CACHE_QUANT}}_${EXPORT_FORMAT}
+MODEL_FULL_NAME=${MODEL_NAME}_${QFORMAT_MODIFIED}${KV_CACHE_QUANT:+_kv_${KV_CACHE_QUANT}}
+# Optional guard for legacy env:
+[ -n "${EXPORT_FORMAT:-}" ] && { echo "EXPORT_FORMAT is deprecated and unused"; exit 1; }
```
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In examples/vlm_ptq/scripts/huggingface_example.sh around line 92,
MODEL_FULL_NAME is built using the deprecated ${EXPORT_FORMAT}, producing a
trailing underscore; remove ${EXPORT_FORMAT} from the MODEL_FULL_NAME assembly
and update any variable expansions to avoid leaving stray underscores, and add a
fast-fail check near script start that exits with an error if EXPORT_FORMAT is
set (or non-empty) so the deprecated variable is not used unintentionally.
```python
        kv_cache_config: dict[str, int | float] = {},
        medusa_choices: Any = None,
        tp: int = 0,
        trust_remote_code: bool = False,
        max_batch_size: int = 0,
    ):
```
Avoid mutable default for kv_cache_config.
Use None and initialize inside to prevent cross-instance state leaks.
```diff
-        kv_cache_config: dict[str, int | float] = {},
+        kv_cache_config: dict[str, int | float] | None = None,
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-        kv_cache_config: dict[str, int | float] = {},
+        kv_cache_config: dict[str, int | float] | None = None,
         medusa_choices: Any = None,
         tp: int = 0,
         trust_remote_code: bool = False,
         max_batch_size: int = 0,
     ):
```
🤖 Prompt for AI Agents
In modelopt/deploy/llm/generate.py around lines 61 to 66, the function parameter
kv_cache_config uses a mutable default ({}), which can leak state across calls;
change the signature to accept None (e.g., kv_cache_config: Optional[dict[str,
int | float]] = None) and inside the function set kv_cache_config = {} if it's
None, preserving the same type and behavior while avoiding shared mutable
defaults.
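For reference, the usual idiom looks roughly like this (a minimal sketch of the None-default pattern, not the actual TRT-LLM `LLM` wrapper):

```python
class LLM:
    """Minimal sketch; names are illustrative only."""

    def __init__(self, checkpoint_dir: str, kv_cache_config: dict | None = None):
        # Create the dict per instance so callers cannot share mutable state.
        if kv_cache_config is None:
            kv_cache_config = {}
        self.checkpoint_dir = checkpoint_dir
        self.kv_cache_config = kv_cache_config
```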
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (8)
tests/examples/speculative_decoding/test_medusa.py (1)
32-44
: HF helper and CLI out of sync — hf_ptq.py still exposes --export_fmt; sync required.
- examples/llm_ptq/hf_ptq.py: parser.add_argument("--export_path") at line 655; "--export_fmt" still present (~line 691) with deprecation warning (~line 751).
- Verification did not detect --pyt_ckpt_path or --qformat in the check. Either remove the deprecated --export_fmt from hf_ptq.py to complete the HF-only change, or update the helper to pass --export_fmt and ensure --pyt_ckpt_path and --qformat are exposed.
examples/speculative_decoding/example.ipynb (4)
295-343: Remove TRT‑LLM conversion/build sections; they contradict this PR’s deprecation. Notebook still converts to TRT‑LLM and calls `trtllm-build`. Per PR: “remove usage of TRT‑LLM’s TRT backend in repository examples” and “delete trtllm-build and related wrappers.” Keep this example fully HF/vLLM/torch-only. Apply this diff to drop the deprecated flow and replace with a simple HF inference example:

````diff
-Then convert the Unified ckeckpoint to TRTLLM checkpoint.
-
-!python TensorRT-LLM/examples/eagle/convert_checkpoint.py --model_dir /tmp/hf_ckpt --output_dir /tmp/trtllm_ckpt --num_eagle_layers 5 --max_non_leaves_per_layer 4 --max_draft_len 25 --dtype float16
-
-Last, build a TensorRT-LLM engine.
-
-!trtllm-build --checkpoint_dir /tmp/trtllm_ckpt --output_dir /tmp/trtllm_engine --gemm_plugin float16 --use_paged_context_fmha enable --speculative_decoding_mode eagle --max_batch_size 4
-
-To run the EAGLE engine, please refer to [TensorRT-LLM/examples/eagle](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/eagle):
-
-!python ../run.py --engine_dir /tmp/trtllm_engine \
-    --tokenizer_dir /tmp/eagle_fp8_qat \
-    --max_output_len=100 \
-    --eagle_choices="[[0],[1],[2],[3],[0,0],[0,1],[0,2],[1,0],[1,1],[2,0],[2,1],[3,0],[0,0,0],[0,0,1],[0,0,2],[0,1,0],[0,1,1],[0,2,0],[0,2,1],[1,0,0],[0,0,0,0],[0,0,0,1],[0,0,0,2],[0,0,0,0,0],[0,0,0,0,1]]" \
-    --temperature 1.0 \
-    --input_text "Once upon"
+Run a quick HF inference from the quantized (or QAT) checkpoint:
+
+```python
+import torch, transformers
+tok = transformers.AutoTokenizer.from_pretrained("/tmp/eagle_fp8_qat")
+mdl = transformers.AutoModelForCausalLM.from_pretrained("/tmp/eagle_fp8_qat", torch_dtype="auto", device_map="auto")
+out = mdl.generate(**tok("Once upon", return_tensors="pt").to(mdl.device), max_new_tokens=100, temperature=1.0)
+print(tok.decode(out[0], skip_special_tokens=True))
+```
````

If you plan to showcase TRT‑LLM’s torch backend later, add a “TODO” cell pointing to its docs, without invoking `trtllm-build`.
213-219: Avoid mutating global default quant config. `FP8_DEFAULT_CFG` is likely shared; in-place edits can leak globally. Deep‑copy before modifying. Apply this diff:

```diff
+import copy
 ...
-quant_cfg = getattr(mtq, "FP8_DEFAULT_CFG")
+quant_cfg = copy.deepcopy(getattr(mtq, "FP8_DEFAULT_CFG"))
 quant_cfg["quant_cfg"]["*output_quantizer"] = {
     "num_bits": (4, 3),
     "axis": None,
     "enable": True,
 }
```
220-221: Duplicate argument passed to create_forward_loop. Likely signature is `create_forward_loop(dataloader, ...)`; passing it positionally and by name can raise “multiple values for argument ‘dataloader’.” Apply this diff:

```diff
-calibrate_loop = dataset_utils.create_forward_loop(calib_dataloader, dataloader=calib_dataloader)
+calibrate_loop = dataset_utils.create_forward_loop(calib_dataloader)
```
295-343: Sweep and replace legacy TRT‑LLM / trtllm-build references (repo‑wide). Multiple files still reference trtllm-build / TRT‑LLM / --engine_dir / speculative_decoding_mode; update or mark as legacy and point to the current export/deploy workflow. Affected files found:
- examples/speculative_decoding/example.ipynb (convert_checkpoint.py, trtllm-build, run.py --engine_dir)
- examples/speculative_decoding/README.md (external TRT‑LLM doc link)
- examples/llm_ptq/README.md, run_tensorrt_llm.py, scripts/huggingface_example.sh (trtllm-build guidance, --engine_dir calls, sparse-quantization message)
- examples/llm_eval/run_fastchat.sh and examples/llm_eval/README.md (--engine_dir usage)
- examples/windows/accuracy_benchmark/trtllm_utils.py and README.md (--engine_dir / trt-llm examples)
- examples/vlm_ptq/utils.py (--engine_dir)
- examples/llm_sparsity/README.md (trtllm-build guidance)
Action: remove or update these references to the current recommended workflow (or clearly label them as legacy and link to migration steps).
examples/llm_ptq/README.md (2)
83-104: Remove trtllm-build and legacy TRT backend export from examples docs. This README still instructs users to export legacy TRT‑LLM checkpoints and run `trtllm-build`, which contradicts this PR’s goal to deprecate the TRT backend in examples. Recommend deleting these legacy sections or moving them to a clearly marked deprecated appendix, and pointing users to unified HF export + TRT‑LLM PyTorch LLM API instead. Apply this doc diff to excise legacy guidance and add a short deprecation note:

```diff
@@
-#### (Legacy) TensorRT-LLM Checkpoints
-...
-After the TensorRT-LLM checkpoint export, you can use the `trtllm-build` build command to build the engines from the exported checkpoints. Please check the [TensorRT-LLM Build API](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/architecture/workflow.md#build-apis) documentation for reference.
+> Deprecated: Legacy TensorRT-LLM (C++ backend) checkpoint export and engine build are no longer covered by these examples. Use unified Hugging Face checkpoints and deploy via TRT-LLM's PyTorch backend (LLM API), vLLM, or SGLang.
```

Also applies to: 369-391
346-368: Standardize on "NVFP4" and update the TensorRT‑LLM version note to v0.17.0. Replace all "FP4" occurrences with "NVFP4" (or define "FP4 = NVFP4" at first mention) in examples/llm_ptq/README.md (lines 346–368; also 110–121, 132–133). Update the footnote to: "NVFP4 inference requires Blackwell GPUs and TensorRT‑LLM v0.17.0 or later." Verify the vLLM/SGLang support-matrix rows match vendor docs (vendor docs use "NVFP4" and NVFP4 support was added in TRT‑LLM v0.17.0).
examples/llm_qat/notebooks/QAT_QAD_Walkthrough.ipynb (1)
687-695: Fix shell working-dir bug and use absolute model path in conversion step. `cd ..` will not persist across `%%sh` cells and likely breaks the relative path. Also prefer the explicit container‑mounted path for the model. Apply:

```diff
 %%sh
 # [run in TensorRT-LLM container]
 # set export path for converted checkpoints. The script saves the converted checkpoint in ${ROOT_SAVE_PATH}/saved_models_${MODEL_FULL_NAME}
 export ROOT_SAVE_PATH=/app/tensorrt_llm
-# run conversion script
-cd ..
-bash TensorRT-Model-Optimizer/examples/llm_ptq/scripts/huggingface_example.sh --model $(pwd)/qat/checkpoint-450/ --quant nvfp4
+# run conversion script (no cd; use absolute, volume-mounted model path)
+bash TensorRT-Model-Optimizer/examples/llm_ptq/scripts/huggingface_example.sh \
+    --model /app/tensorrt_llm/qat/checkpoint-450 \
+    --quant nvfp4
```
♻️ Duplicate comments (1)
tests/examples/llm_ptq/test_llm_ptq.py (1)
113-114
: Set TP explicitly so multi‑GPU is actually exercised (tp=2).Defaults often leave TP=1; make intent explicit.
Apply this diff:
- PTQCommand(quant="nvfp4", min_gpu=2), + PTQCommand(quant="nvfp4", min_gpu=2, tp=2),
🧹 Nitpick comments (21)
tests/examples/llm_ptq/test_llm_ptq.py (1)
94-96: Auto‑quant combos include nvfp4; gate on SM100 or split matrix. Without gating, CI on SM89 may select nvfp4 paths depending on heuristics. Add `min_sm=100` to nvfp4‑containing cases (or split into SM‑specific param sets). Apply this diff:
PTQCommand( - quant="nvfp4,fp8", + quant="nvfp4,fp8", min_sm=100, kv_cache_quant="fp8", calib_batch_size=4, auto_quantize_bits=6.4, ), PTQCommand( - quant="nvfp4,fp8", + quant="nvfp4,fp8", min_sm=100, kv_cache_quant="nvfp4", calib_batch_size=4, auto_quantize_bits=6.4, ),If auto‑quant logic already filters by capability, feel free to skip—but please confirm by pointing to the selection code.
Also applies to: 100-104
tests/examples/speculative_decoding/test_medusa.py (2)
68-68
: Add a quick artifact smoke‑check after PTQ.Assert the HF export actually produced quantized files; catches silent no‑ops and future CLI drift.
Apply:
- _run_hf_ptq(medusa_path, tmp_path / "medusa-tinyllama-hf", "fp8") + hf_export = tmp_path / "medusa-tinyllama-hf" + _run_hf_ptq(medusa_path, hf_export, "fp8") + assert hf_export.exists(), "HF PTQ export directory not created" + assert any(hf_export.glob("*.safetensors")), "Expected .safetensors in HF export"
22-28
: Avoid session‑wide pip installs in tests; prefer skip/xfail guard.The autouse fixture mutates the env and can break offline CI. Recommend skipping when transformers≥4.50 instead of installing a different version at runtime.
Replace the fixture with a module‑level skip:
-# TODO: Medusa QAT FSDP test hangs if transformers>=4.50 -@pytest.fixture(scope="session", autouse=True) -def install_transformers_lt_4_50(): - subprocess.run( - ["pip", "install", "transformers<4.50"], - check=True, - ) +# TODO: Medusa QAT FSDP test hangs if transformers>=4.50 — skip to keep CI hermetic +try: + import transformers # type: ignore + _ver = tuple(int(x) for x in transformers.__version__.split(".")[:2]) +except Exception: + _ver = None +pytestmark = pytest.mark.skipif(_ver is None or _ver >= (4, 50), + reason="Known hang with transformers>=4.50; skip until fixed")examples/speculative_decoding/example.ipynb (6)
17-18
: Fix wording to match new default (HF), not “unified export format”.The markdown still says “export to unified export format,” which conflicts with the PR objective “export to Hugging Face (HF) format by default.”
Apply this diff to align the text:
-First, quantize the base model (Llama-3.2-1B-Instruct) into FP8 and export to unified export format. +First, quantize the base model (Llama-3.2-1B-Instruct) into FP8 and export to Hugging Face (HF) format (default).
58-59
: Don’t hardcode API key in example.Remove
--api-key token-abc123
or document using an env var. Hardcoding encourages bad practice.Apply this diff:
-!vllm serve /tmp/llama3.2_1B_fp8 --api-key token-abc123 --port 8000 --tensor-parallel-size 1 --quantization=modelopt +!vllm serve /tmp/llama3.2_1B_fp8 --port 8000 --tensor-parallel-size 1 --quantization=modelopt +# If auth is desired: +# VLLM_API_KEY=$YOUR_TOKEN vllm serve ... --api-key "$VLLM_API_KEY"
129-136
: Import path robustness for local utilities.
from speculative_decoding.eagle_utils ...
assumes notebook CWD makes the package importable. Add a short path guard to avoid ImportError when run from repo root.Apply this diff at the top of the cell:
+import os, sys +nb_dir = os.path.dirname(os.getcwd()) if os.path.basename(os.getcwd()) == "speculative_decoding" else os.getcwd() +if os.path.isdir(os.path.join(nb_dir, "speculative_decoding")) and nb_dir not in sys.path: + sys.path.insert(0, nb_dir) from speculative_decoding.eagle_utils import DataCollatorWithPadding, LazySupervisedDataset
187-188
: Grammar: “an EAGLE model”.Minor copy edit.
-Now we have a EAGLE model in BF16 format. Next, we quantize this model into FP8 (PTQ). +Now we have an EAGLE model in BF16 format. Next, we quantize this model into FP8 (PTQ).
231-232
: Hyphenate “fine-tune”.Minor copy edit.
-To maintain the accuracy, we need to finetune the model (QAT). +To maintain accuracy, we need to fine-tune the model (QAT).
295-296
: Typo: “checkpoint”.If you keep the section title (even after removing TRT‑LLM steps), fix the typo.
-Then convert the Unified ckeckpoint to TRTLLM checkpoint. +Then convert the unified checkpoint...examples/llm_ptq/README.md (5)
35-38
: Clarify TRT‑LLM dependency as optional and prefer PyTorch backend wording.Reword to emphasize TRT‑LLM install is optional and, when used, the PyTorch backend (LLM API) is the supported path in examples. Avoid implying C++ backend is required here.
-If you want to deploy the quantized model on TRT-LLM, you will also need to install the TRT-LLM dependencies as per the ... -For NeMo models, use the NeMo container `nvcr.io/nvidia/nemo:25.04` or later which has all the dependencies including TRT-LLM installed. +If you plan to deploy via TRT‑LLM, optionally install TRT‑LLM and use its PyTorch backend (LLM API) as per the ... +For NeMo models, use the NeMo container `nvcr.io/nvidia/nemo:25.04` or later; install TRT‑LLM only if you deploy with its PyTorch backend.
151-153
: Fix duplicated word.“the the final …” → “the final …”.
-... so that -the the final mixed precision quantized model has ... +... so that +the final mixed precision quantized model has ...
280-286
: Grammar: “supports provide two paths”.Tighten the sentence.
-Model Optimizer supports provide two paths to export the quantized model: +Model Optimizer provides two paths to export the quantized model:
270-276
: Broaden section title; remove TRT‑specific framing.These benchmarks are framework‑agnostic. Rename to “Accuracy Validation” and avoid implying TRT‑only.
-### TensorRT-LLM Validation +### Accuracy Validation
241-247
: Nice practical notes; minor casing consistency.Use “Hugging Face” consistently (not “Huggingface”).
-> *If the Huggingface model calibration fails ... +> *If the Hugging Face model calibration fails ... -> *Huggingface models trained with ... +> *Hugging Face models trained with ... -> *... for the Huggingface tokenizer ... +> *... for the Hugging Face tokenizer ...Also applies to: 248-253
examples/llm_qat/notebooks/QAT_QAD_Walkthrough.ipynb (7)
387-390
: Run calibration without grads and in eval mode.Saves memory and prevents autograd graph buildup.
Apply:
-def forward_loop(model): - for data in data_loader: - model(**data) +def forward_loop(model): + model.eval() + with torch.inference_mode(): + for data in data_loader: + model(**data)
16-19
: Fix QAT definition and grammar.QAT learns quantization effects during training (not “post‑training”).
Apply:
-**Quantization Aware Training (QAT)** is a method that learn the effects of quantization during neural network post-training to preserve accuracy when deploying models in very-low-precision formats. +**Quantization Aware Training (QAT)** is a method that learns the effects of quantization during training to preserve accuracy when deploying models in low‑precision formats. @@ -This notebook demonstrates how to apply Quantization Aware Training (QAT) to an LLM, Qwen3-8b in this example, with NVIDIA's TensorRT Model Optimizer (ModelOpt) QAT toolkit. We walk through downloading and loading the model, calibrates on a small eval subset, applying NVFP4 quantization and finally deploying the quantized model to TensorRT-LLM. +This notebook demonstrates how to apply QAT to an LLM (Qwen3‑8B here) with NVIDIA’s TensorRT Model Optimizer (ModelOpt) QAT toolkit. We walk through downloading and loading the model, calibrating on a small eval subset, applying NVFP4 quantization, and finally deploying the quantized model with TensorRT‑LLM.
26-27
: Typo and wording fixes (user‑facing text).Polish for readability and consistency; aligns with PR copy edits.
Apply:
-## Installing Prerequisites and Dependancies +## Installing Prerequisites and Dependencies @@ -If you haven't already, install the required dependencies for this notebook. Key dependancies include: +If you haven't already, install the required dependencies for this notebook. Key dependencies include: @@ -This repo contains a `examples/llm_qat/notebooks/requirements.txt` file that can be used to install all required dependancies. +This repo contains an `examples/llm_qat/notebooks/requirements.txt` file that can be used to install all required dependencies. @@ -## Setting HuggingFace Token and Model for Download (Optional) +## Setting Hugging Face token and model for download (optional) @@ -include you token +include your token @@ -These can be controled by +These can be controlled by @@ -# Deploying the QAT Model with TensorRT-LLM +# Deploying the QAT model with TensorRT‑LLM (Torch backend) @@ -## Exporting Quantized Model for deployment -Before deploying the model with TensorRT-LLM you will need to export the model checkpoint files. This is similar to the step you take for a quantized PTQ Model. To export the unified Hugging Face checkpoints, which can be deployed on TensorRT-LLM Pytorch, vLLM and SGLang you will need to run the [huggingface_example.sh](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/examples/llm_ptq/scripts/huggingface_example.sh) script found in the TensorRT Model Optimizer repo. +## Exporting quantized model for deployment +Before deploying with TensorRT‑LLM you need to export the model checkpoint files. To export unified Hugging Face checkpoints (deployable on TensorRT‑LLM Torch, vLLM, and SGLang), run the [huggingface_example.sh](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/examples/llm_ptq/scripts/huggingface_example.sh) script from the TensorRT Model Optimizer repo. @@ -## Stop the TensorRT-LLM Docker contrainer +## Stop the TensorRT‑LLM Docker containerAlso applies to: 34-41, 61-61, 69-69, 417-417, 574-575, 625-627, 802-802
73-83
: Avoid embedding secrets in notebooks; prefer interactive Hugging Face login.Safer UX and fewer accidental leaks.
Apply:
-%env HF_TOKEN=<YOUR_HUGGINGFACE_TOKEN> +python - <<'PY' +from huggingface_hub import login +login() # Prompts securely; stores token in keyring or HF cache +PY
724-725
: TP size should reflect available GPUs; make it adaptive.Hardcoding
--tp_size 8
can fail on smaller hosts.Apply:
- --max_seq_len 4096 --tp_size 8 --pp_size 1 \ + --max_seq_len 4096 --tp_size "${TP_SIZE:-$(nvidia-smi -L | wc -l)}" --pp_size 1 \And document overriding
TP_SIZE
when needed.
720-727
: Avoid hardcoding the model output directory — use runtime discovery (CLI flags confirmed)Replace the hardcoded path with a discovery step; confirmed trtllm-serve flags are --max_seq_len, --max_batch_size, --max_num_tokens (TensorRT‑LLM 1.1.x, Torch backend).
Apply:
%%sh # [run in TensorRT-LLM container] -trtllm-serve /app/tensorrt_llm/saved_models_checkpoint-450_nvfp4_hf/ \ +# discover the most recent NVFP4 export +MODEL_OUT_DIR="$(ls -dt /app/tensorrt_llm/saved_models_*_nvfp4* 2>/dev/null | head -n 1)" +test -n "$MODEL_OUT_DIR" || { echo "No exported model found under /app/tensorrt_llm"; exit 1; } +trtllm-serve "$MODEL_OUT_DIR" \ --max_batch_size 1 --max_num_tokens 1024 \ --max_seq_len 4096 --tp_size 8 --pp_size 1 \ --host 0.0.0.0 --port 8000 \ --kv_cache_free_gpu_memory_fraction 0.95
612-618: Parameterize TRT‑LLM container tag; don't hard‑code an RC. Replace the hard‑coded image in examples/llm_qat/notebooks/QAT_QAD_Walkthrough.ipynb (lines 612–618) with an env var (TRTLLM_IMAGE) so users can override; as of 2025-09-17 the latest NGC tag for TensorRT‑LLM (Torch backend) is 1.1.0rc3 (modified 2025-09-04) — update the notebook/docs to recommend the preferred tag or leave no default.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- .github/CODEOWNERS (0 hunks)
- docs/source/getting_started/_installation_for_Linux.rst (1 hunks)
- examples/llm_ptq/README.md (3 hunks)
- examples/llm_ptq/hf_ptq.py (5 hunks)
- examples/llm_qat/notebooks/QAT_QAD_Walkthrough.ipynb (1 hunks)
- examples/speculative_decoding/example.ipynb (1 hunks)
- examples/vlm_ptq/README.md (1 hunks)
- tests/examples/llm_ptq/test_llm_ptq.py (1 hunks)
- tests/examples/speculative_decoding/test_medusa.py (2 hunks)
- tests/gpu/torch/export/test_unified_hf_export_and_check_safetensors.py (0 hunks)
💤 Files with no reviewable changes (2)
- .github/CODEOWNERS
- tests/gpu/torch/export/test_unified_hf_export_and_check_safetensors.py
✅ Files skipped from review due to trivial changes (2)
- docs/source/getting_started/_installation_for_Linux.rst
- examples/vlm_ptq/README.md
🚧 Files skipped from review as they are similar to previous changes (1)
- examples/llm_ptq/hf_ptq.py
🧰 Additional context used
🧬 Code graph analysis (1)
tests/examples/llm_ptq/test_llm_ptq.py (1)
- tests/_test_utils/ptq_utils.py (3): `PTQCommand` (28-87), `param_str` (82-87), `run` (44-80)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: linux
- GitHub Check: wait-checks / wait
- GitHub Check: code-quality
- GitHub Check: build-docs
🔇 Additional comments (9)
tests/examples/llm_ptq/test_llm_ptq.py (4)
109-109
: SM‑bounded case looks good.
w4a8_awq
constrained to SM89 only is clear and safe.
110-110
: Correct: nvfp4 gated to SM100. This avoids accidental runs on unsupported GPUs.
117-118
: LGTM: simpler invocation via TINY_LLAMA_PATH. Fixture removal and direct constant use are fine.
79-80
: Keep explicit 'none' — code expects that literal to mean "no KV-cache quant." examples/llm_ptq/hf_ptq.py checks args.kv_cache_qformat != 'none' (around lines 246 and 454), so tests using kv_cache_quant='none' are correct; do not remove.

examples/speculative_decoding/example.ipynb (3)
26-26
: LGTM: dropping --export_fmt aligns with new default HF export. This change matches the PR's "HF by default" direction. No functional concerns.
9-11
: Fix typos in markdown.
- “medule” → “module”.
[ suggest_nitpick_refactor ]
Apply this diff:-The speculative decoding medule needs to learn to predict tokens from the base model. Therefore, we need to prepare the data generated from the base model. +The speculative decoding module needs to learn to predict tokens from the base model. Therefore, we need to prepare the data generated from the base model.
75-77
: Path/availability note for prepare_data.sh.Ensure this script exists in repo or provide a link; otherwise, the cell will fail.
If it’s in this example folder, consider making the call relative:
-!bash prepare_data.sh --data_path /tmp/Daring-Anteater/train.jsonl --output_path /tmp/finetune/data.jsonl --max_token 2048 +!bash examples/speculative_decoding/prepare_data.sh --data_path /tmp/Daring-Anteater/train.jsonl --output_path /tmp/finetune/data.jsonl --max_token 2048examples/llm_ptq/README.md (2)
256-259
: Avoid steering users to legacy export in DeepSeek example.Change “export to TensorRT‑LLM” to “export unified HF checkpoints and deploy with TRT‑LLM/vLLM/SGLang”.
[suggest_minor_refactor]
-[PTQ for DeepSeek](../deepseek/README.md) shows how to quantize the DeepSeek model with FP4 and export to TensorRT-LLM. +[PTQ for DeepSeek](../deepseek/README.md) shows how to quantize the DeepSeek model with FP4 and export unified HF checkpoints for deployment with TRT‑LLM/vLLM/SGLang.
206-206
: Update README to match parser's task list and lm_eval defaults. examples/llm_ptq/scripts/parser.sh defines VALID_TASKS=("quant" "mmlu" "mtbench" "lm_eval" "livecodebench" "simple_eval"), default TASKS="quant", and LM_EVAL_TASKS="mmlu,gsm8k" (--lm_eval_tasks is supported). README (examples/llm_ptq/README.md, around line 206) currently lists only "quant,mmlu,lm_eval,livecodebench" — add "mtbench" and "simple_eval" and optionally document the LM_EVAL_TASKS default.

Likely an incorrect or invalid review comment.
```diff
 > *If a GPU OOM error occurs during model quantization despite sufficient memory, setting the --use_seq_device_map flag can help. This enforces sequential device mapping, distributing the model across GPUs and utilizing up to 80% of each GPU's memory.*
-> *You can now add `--low_memory_mode` to the command when setting `--export_fmt=hf` to lower the memory requirements of the PTQ process. With this mode, the script will compress model weights to low precision before calibration. This mode is only supported for FP8 and NVFP4 with max calibration.*
+> *You can add `--low_memory_mode` to the command to lower the memory requirements of the PTQ process. With this mode, the script will compress model weights to low precision before calibration. This mode is only supported for FP8 and NVFP4 with max calibration.*
```
🧩 Analysis chain
Ensure --low_memory_mode is implemented and gated as documented.
README states it’s only supported for FP8/NVFP4 with max calibration. Verify hf_ptq.py
enforces this and that weight compression happens before calibration.
🏁 Script executed:
#!/usr/bin/env bash
# Confirm flag definition and usage constraints
rg -n -C2 --type=py 'low_memory_mode|--low_memory_mode' examples/llm_ptq/hf_ptq.py
rg -n -C3 --type=py 'calibrate\(|mtq\.calibrate|init_quantized_weights|NVFP4|FP8' examples/llm_ptq/hf_ptq.py
Length of output: 2446
Enforce --low_memory_mode gating and compress weights before calibration
README says low_memory_mode is only supported for FP8 / NVFP4 with max calibration. The flag is defined and referenced but there is no explicit enforcement or obvious compression-before-calibration:
- examples/llm_ptq/hf_ptq.py:730–731 (parser.add_argument for --low_memory_mode) and 232 (if not args.low_memory_mode).
- Calibration calls: 138–142 (mtq.calibrate(..., algorithm="max")) and 186–189 (mtq.calibrate / mtq.quantize branch).
- init_quantized_weights context: 252–255 (used around model loading/calibration).
Action items:
- Validate and error out if args.low_memory_mode is set while the chosen quant config is not FP8 or NVFP4, or the calibration algorithm is not "max".
- Ensure weight compression path for low_memory_mode runs before any mtq.calibrate / mtq.quantize calls (move compression into the HF checkpoint load path used when args.low_memory_mode is True).
- Add a small test or assertion exercising low_memory_mode to fail fast on unsupported configs.
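Building on the first action item above, a guard along these lines would fail fast (a minimal sketch only; the exact flag names and calibration plumbing in `hf_ptq.py` may differ):

```python
# Hypothetical guard, to run right after argument parsing in hf_ptq.py.
SUPPORTED_LOW_MEM_QFORMATS = {"fp8", "nvfp4"}

def validate_low_memory_mode(low_memory_mode: bool, qformat: str, calib_algorithm: str = "max"):
    """Reject --low_memory_mode when combined with an unsupported configuration."""
    if not low_memory_mode:
        return
    if qformat not in SUPPORTED_LOW_MEM_QFORMATS or calib_algorithm != "max":
        raise ValueError(
            "--low_memory_mode is only supported for FP8/NVFP4 with max calibration"
        )
```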
### Quantize and Export

```diff
-python hf_ptq.py --pyt_ckpt_path <huggingface_model_card> --qformat fp8 --export_fmt hf --export_path <quantized_ckpt_path> --trust_remote_code
+python hf_ptq.py --pyt_ckpt_path <huggingface_model_card> --qformat fp8 --export_path <quantized_ckpt_path> --trust_remote_code
```
🧩 Analysis chain
Command examples: confirm defaults now export unified HF without extra flags.
Looks good; verify hf_ptq.py
and huggingface_example.sh
default to unified HF format and no longer accept --export_fmt
.
Also applies to: 311-313
🏁 Script executed:
#!/usr/bin/env bash
# Check CLI surfaces for --export_fmt remnants and default export path
rg -n --type=sh --type=py 'export_fmt|export_hf_checkpoint|export_tensorrt_llm_checkpoint' examples/llm_ptq/{hf_ptq.py,scripts/huggingface_example.sh}
Length of output: 474
Remove deprecated --export_fmt
from hf_ptq CLI and enforce HF-only export
examples/llm_ptq/hf_ptq.py still defines/uses --export_fmt
(arg at L691; forced/warned at L750–751) and has related references at L40–41 and call sites at L594/L607 — remove the CLI flag and deprecation handling, and update docs/examples accordingly (examples/llm_ptq/scripts/huggingface_example.sh: no matches).
🤖 Prompt for AI Agents
In examples/llm_ptq/README.md around lines 301–305, update the docs and code to
remove the deprecated --export_fmt CLI flag and enforce HuggingFace-only export:
remove the --export_fmt argument definition in examples/llm_ptq/hf_ptq.py
(around L691), delete any deprecation warnings/handling (around L750–751) and
references at the top of the file (L40–41), update or remove call sites that
pass --export_fmt (L594 and L607), and adjust
examples/llm_ptq/scripts/huggingface_example.sh and README usage lines to call
hf_ptq without --export_fmt and to document only HF export (e.g., keep
--export_path and --trust_remote_code). Ensure any code paths or conditionals
that previously branched on export_fmt are simplified to HF-only export and
remove related tests or grep matches.
Signed-off-by: Keval Morabia <[email protected]>
force-pushed from 6a47146 to 48440c8
…o chenjiel/deprecate_trtllm_build_2 Signed-off-by: Chenjie Luo <[email protected]>
Actionable comments posted: 0
♻️ Duplicate comments (2)
tests/examples/llm_ptq/test_llm_ptq.py (1)
113-114
: Enable multi‑GPU semantics: set TP explicitly. Default inference_tensor_parallel is 1; this won't exercise TP across GPUs. Pass tp=2 to actually test multi‑GPU.
- PTQCommand(quant="nvfp4", min_gpu=2), + PTQCommand(quant="nvfp4", min_gpu=2, tp=2),examples/llm_ptq/hf_ptq.py (1)
214-229
: Fix weakened qformat validation (always‑true because of kv‑cache clause). The condition "... or args.kv_cache_qformat in KV_QUANT_CFG_CHOICES" makes the assert pass even for unsupported qformats (kv_cache_qformat always valid due to argparse choices). Validate qformat independently.
- if not args.auto_quantize_bits: - assert ( - args.qformat - in [ - "int4_awq", - "fp8", - "nvfp4", - "nvfp4_awq", - "w4a8_awq", - "fp8_pb_wo", - "w4a8_mxfp4_fp8", - "nvfp4_mlp_only", - ] - or args.kv_cache_qformat in KV_QUANT_CFG_CHOICES - ), f"Quantization format {args.qformat} not supported for HF export path" + if not args.auto_quantize_bits: + assert args.qformat in QUANT_CFG_CHOICES, ( + f"Quantization format {args.qformat} not supported for HF export path" + )
🧹 Nitpick comments (7)
examples/gpt-oss/requirements.txt (1)
6-6
: Align Torch lower bound with code using torch.compiler.set_stance or guard the call. hf_ptq.py calls torch.compiler.set_stance("force_eager"), which may not exist in some 2.7.x releases. Either:
- raise lower bound to torch>=2.8.0, or
- keep >2.7.1 but guard the call with hasattr to avoid AttributeError.
Apply one of:
- torch>2.7.1 + torch>=2.8.0Or in hf_ptq.py (if you prefer keeping >2.7.1):
- torch.compiler.set_stance("force_eager") + if hasattr(torch.compiler, "set_stance"): + torch.compiler.set_stance("force_eager")tests/examples/speculative_decoding/test_medusa.py (1)
32-41
: Drop export_fmt usage: good; but avoid session‑global pip install side‑effects.
- The _run_hf_ptq change removing --export_fmt is correct and aligns with unified HF export. LGTM.
- The session‑autouse fixture that pip installs transformers<4.50 affects the entire test session and can conflict with other tests’ version pins. Scope the requirement to this test (e.g., WithRequirements pattern) or switch to a function/class‑scoped fixture.
Here’s a scoped alternative:
-@pytest.fixture(scope="session", autouse=True) -def install_transformers_lt_4_50(): - subprocess.run( - ["pip", "install", "transformers<4.50"], - check=True, - ) +class TestMedusa(WithRequirements): + requirements = [("transformers", "<4.50")] + + def test_llama_medusa_fp8_qat(...): + ...Also applies to: 68-68
examples/llm_ptq/hf_ptq.py (5)
92-106
: Avoid duplicating allowed qformats; derive from QUANT_CFG_CHOICES to prevent drift.The hard‑coded list can diverge from QUANT_CFG_CHOICES. Use its keys.
- assert all( - qformat - in [ - "fp8", - "int4_awq", - "nvfp4", - "nvfp4_awq", - "w4a8_awq", - "fp8_pb_wo", - "w4a8_mxfp4_fp8", - "nvfp4_mlp_only", - ] - for qformat in qformat_list - ), "One or more quantization formats provided are not supported for unified checkpoint export" + allowed = set(QUANT_CFG_CHOICES.keys()) + assert all(fmt in allowed for fmt in qformat_list), ( + "One or more quantization formats provided are not supported for unified checkpoint export" + )
126-143
: auto_quantize leaks CLI globals (args); pass needed flags explicitly. Using args.kv_cache_qformat inside auto_quantize couples it to the CLI flow and breaks reuse/imports. Pass kv_cache_qformat (and any other needed fields) as parameters.
-def auto_quantize( - model, qformat, auto_quantize_bits, calib_dataloader, calibrate_loop, batch_size=1 -): +def auto_quantize( + model, qformat, auto_quantize_bits, calib_dataloader, calibrate_loop, batch_size=1, kv_cache_qformat="none" +): @@ - enable_quant_kv_cache = args.kv_cache_qformat != "none" + enable_quant_kv_cache = kv_cache_qformat != "none" @@ - kv_cache_quant_cfg = getattr(mtq, KV_QUANT_CFG_CHOICES[args.kv_cache_qformat])["quant_cfg"] + kv_cache_quant_cfg = getattr(mtq, KV_QUANT_CFG_CHOICES[kv_cache_qformat])["quant_cfg"]And update the call site in quantize_model:
- model = auto_quantize( + model = auto_quantize( model, args.qformat, args.auto_quantize_bits, calib_dataloader, calibrate_loop, - args.batch_size, + args.batch_size, + kv_cache_qformat=args.kv_cache_qformat, )
586-590
: Message clarity: call out “TRT‑LLM checkpoint export (no engine build)” explicitly.To avoid confusion with the deprecated TRT engine path, clarify the warning text that this exports a TensorRT‑LLM checkpoint only (for torch runtime unsupported models), not engines.
- warnings.warn( - "Still exporting TensorRT-LLM checkpoints for models not supported by the TensorRT-LLM torch runtime." - ) + warnings.warn( + "Exporting a TensorRT-LLM CHECKPOINT (no engine build) for models not yet supported by the TRT-LLM torch runtime." + )
601-606
: Redundant sparsity assert.This branch already implies dense sparsity. The explicit assert is redundant; drop to reduce noise.
- assert args.sparsity_fmt == "dense", ( - f"Sparsity format {args.sparsity_fmt} not supported by unified export api." - )
693-696
: Use DeprecationWarning category for --export_fmt.So users can filter warnings; current UserWarning is noisier.
- help="Deprecated. Please avoid using this argument.", + help="Deprecated; ignored. Will be removed in a future release.",- if args.export_fmt != "hf": - warnings.warn("Deprecated. --export_fmt forced to hf.") + if args.export_fmt != "hf": + warnings.warn("Deprecated. --export_fmt is ignored and forced to 'hf'.", DeprecationWarning)Also applies to: 750-752
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
- .github/CODEOWNERS (0 hunks)
- docs/source/getting_started/_installation_for_Linux.rst (1 hunks)
- examples/gpt-oss/requirements.txt (1 hunks)
- examples/llm_ptq/README.md (3 hunks)
- examples/llm_ptq/hf_ptq.py (5 hunks)
- examples/llm_qat/notebooks/QAT_QAD_Walkthrough.ipynb (1 hunks)
- examples/speculative_decoding/example.ipynb (1 hunks)
- examples/vlm_ptq/README.md (1 hunks)
- tests/examples/llm_ptq/test_llm_ptq.py (1 hunks)
- tests/examples/speculative_decoding/test_medusa.py (2 hunks)
- tests/gpu/torch/export/test_unified_hf_export_and_check_safetensors.py (0 hunks)
💤 Files with no reviewable changes (2)
- .github/CODEOWNERS
- tests/gpu/torch/export/test_unified_hf_export_and_check_safetensors.py
✅ Files skipped from review due to trivial changes (1)
- docs/source/getting_started/_installation_for_Linux.rst
🚧 Files skipped from review as they are similar to previous changes (4)
- examples/speculative_decoding/example.ipynb
- examples/llm_qat/notebooks/QAT_QAD_Walkthrough.ipynb
- examples/vlm_ptq/README.md
- examples/llm_ptq/README.md
🧰 Additional context used
🧬 Code graph analysis (2)
examples/llm_ptq/hf_ptq.py (1)
modelopt/torch/export/model_config_export.py (1)
export_tensorrt_llm_checkpoint
(443-553)
tests/examples/llm_ptq/test_llm_ptq.py (1)
tests/_test_utils/ptq_utils.py (3)
PTQCommand
(28-87)param_str
(82-87)run
(44-80)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: linux
- GitHub Check: wait-checks / wait
- GitHub Check: code-quality
- GitHub Check: build-docs
Internal CI with latest container running: https://gitlab-master.nvidia.com/omniml/modelopt/-/pipelines/35039819 |
Signed-off-by: Chenjie Luo <[email protected]>
Actionable comments posted: 2
♻️ Duplicate comments (2)
tests/examples/llm_ptq/test_llm_ptq.py (2)
113-113
: Explicit TP=2 for multi‑GPU: good fix and matches prior feedback.Keeps behavior unambiguous across runners.
Run to confirm parser wiring of
--tp
:#!/bin/bash rg -n -C2 --hidden '\b(tp|tensor[_-]?parallel)\b' examples/llm_ptq
93-104
: Make autoquant_kv_cache deterministic per arch.Ensure fp8 path runs on sm89 and nvfp4 path is skipped unless sm100.
-PTQCommand( - quant="nvfp4,fp8", - kv_cache_quant="fp8", - calib_batch_size=4, - auto_quantize_bits=6.4, -), -PTQCommand( - quant="nvfp4,fp8", - kv_cache_quant="nvfp4", - calib_batch_size=4, - auto_quantize_bits=6.4, -), +PTQCommand( + quant="fp8", + kv_cache_quant="fp8", + calib_batch_size=4, + auto_quantize_bits=6.4, +), +PTQCommand( + quant="nvfp4", + kv_cache_quant="nvfp4", + calib_batch_size=4, + auto_quantize_bits=6.4, + min_sm=100, +),
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- tests/examples/llm_ptq/test_llm_ptq.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/examples/llm_ptq/test_llm_ptq.py (1)
tests/_test_utils/ptq_utils.py (3)
PTQCommand
(28-87)param_str
(82-87)run
(44-80)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: wait-checks / wait
- GitHub Check: build-docs
- GitHub Check: code-quality
🔇 Additional comments (2)
tests/examples/llm_ptq/test_llm_ptq.py (2)
108-111
: SM‑targeted smoke tests look good.w4a8_awq pinned to sm89 and nvfp4 to sm100 aligns with hardware support.
117-118
: LGTM: using TINY_LLAMA_PATH directly simplifies the test.
```diff
 > *If a GPU OOM error occurs during model quantization despite sufficient memory, setting the --use_seq_device_map flag can help. This enforces sequential device mapping, distributing the model across GPUs and utilizing up to 80% of each GPU's memory.*
-> *You can now add `--low_memory_mode` to the command when setting `--export_fmt=hf` to lower the memory requirements of the PTQ process. With this mode, the script will compress model weights to low precision before calibration. This mode is only supported for FP8 and NVFP4 with max calibration.*
+> *You can add `--low_memory_mode` to the command to lower the memory requirements of the PTQ process. With this mode, the script will compress model weights to low precision before calibration. This mode is only supported for FP8 and NVFP4 with max calibration.*
```
Quick question: I remember we had some accuracy degradation with --low_memory_mode
- is that correct?
Looks great, I did an overall review - I might have missed some details.
force-pushed from 070063c to f4b3f73
Signed-off-by: Chenjie Luo <[email protected]>
force-pushed from f4b3f73 to 1885e81
Actionable comments posted: 1
♻️ Duplicate comments (10)
pyproject.toml (1)
119-119
: Re‑enabling pytest defaults (strict markers, durations, coverage) — was this intentional? Commenting out addopts disables strict marker enforcement, slow‑test timing, and coverage reports; CI expecting coverage.xml may break. Prior feedback said this was reverted.
Option A — restore previous behavior (add explicit --cov target to avoid no-op):
-#addopts = "-ra --cov-report=term-missing --cov-report=html --cov-report=xml:coverage.xml --cov-config=pyproject.toml --durations=20 --strict-markers" +addopts = "-ra --cov=modelopt --cov-report=term-missing --cov-report=xml:coverage.xml --cov-config=pyproject.toml --durations=20 --strict-markers"Option B — minimal local defaults; move coverage flags to CI:
-#addopts = "-ra --cov-report=term-missing --cov-report=html --cov-report=xml:coverage.xml --cov-config=pyproject.toml --durations=20 --strict-markers" +addopts = "-ra --strict-markers --durations=20"Run to verify CI coupling before choosing:
#!/bin/bash # Find CI steps/artifacts depending on coverage.xml or pytest-cov flags fd -HI --type f '.github' '.gitlab' | xargs -I{} rg -n --hidden -C2 -e 'coverage\.xml|--cov(=|\b)|pytest.+--cov' {} # Also check for Codecov/Sonar references rg -n -C2 -e 'codecov|sonar.*coverage' .github .gitlab || truetests/examples/llm_ptq/test_llm_ptq.py (3)
83-85
: Gate NVFP4 variants to Blackwell (SM100). These entries will run on SM89 runners and fail.
- PTQCommand(quant="nvfp4"), - PTQCommand(quant="nvfp4_awq"), + PTQCommand(quant="nvfp4", min_sm=100), + PTQCommand(quant="nvfp4_awq", min_sm=100),
86-92
: Split auto‑quant by arch; keep NVFP4 behind SM100. Prevents mixed list from breaking on SM89.
- PTQCommand( - quant="int4_awq,nvfp4,fp8,w4a8_awq", - calib_batch_size=4, - auto_quantize_bits=6.4, - kv_cache_quant="none", - ), + PTQCommand( + quant="int4_awq,fp8,w4a8_awq", + calib_batch_size=4, + auto_quantize_bits=6.4, + kv_cache_quant="none", + ), + PTQCommand( + quant="int4_awq,nvfp4", + calib_batch_size=4, + auto_quantize_bits=6.4, + kv_cache_quant="none", + min_sm=100, + ),
113-114
: Set TP explicitly for multi‑GPU cases to actually engage tensor‑parallel. min_gpu=2 alone won't enable TP.
- PTQCommand(quant="fp8", min_gpu=2, min_sm=89), - PTQCommand(quant="nvfp4", min_gpu=2, min_sm=100), + PTQCommand(quant="fp8", min_gpu=2, min_sm=89, tp=2), + PTQCommand(quant="nvfp4", min_gpu=2, min_sm=100, tp=2),examples/llm_ptq/scripts/huggingface_example.sh (3)
131-138
: Bug: using $qformat outside its loop; use $QFORMAT and quote paths. Otherwise the bf16/fp16 fast‑path may never trigger; also fix brittle symlink loop.
- if [ "$qformat" == "bf16" ] || [ "$qformat" == "fp16" ]; then + if [ "$QFORMAT" = "bf16" ] || [ "$QFORMAT" = "fp16" ]; then if [ -d "$MODEL_PATH" ]; then MODEL_CONFIG_EXIST=true MODEL_CONFIG=$MODEL_PATH/config.json - for file in $MODEL_PATH/*; do ln -sf "$file" $SAVE_PATH/; done + for file in "$MODEL_PATH"/*; do ln -sf "$file" "$SAVE_PATH"/; doneAlso make the “empty dir” check robust under set -e:
-if [[ $TASKS =~ "quant" ]] || [[ ! -d "$SAVE_PATH" ]] || [[ ! $(ls -A $SAVE_PATH) ]]; then +if [[ $TASKS =~ "quant" ]] || [[ ! -d "$SAVE_PATH" ]] || ! ls -A "$SAVE_PATH" >/dev/null 2>&1; then
167-170
: Message still references deprecated trtllm-build.Align with Torch backend guidance.
- echo "Sparse quantization detected (SPARSITY_FMT=$SPARSITY_FMT). Please deploy with the TRT-LLM using trtllm-build. Checkpoint export_path: $SAVE_PATH" + echo "Sparse quantization detected (SPARSITY_FMT=$SPARSITY_FMT). This HF example does not deploy sparse models; use the TensorRT‑LLM Torch backend sparse deployment flow. Checkpoint export_path: $SAVE_PATH"
181-184
: Fix SC2076 and support comma‑separated QFORMAT values. Current regex‑in‑string check is wrong and fails for lists.
- if [[ ! " fp8 nvfp4 bf16 fp16 " =~ " ${QFORMAT} " ]]; then - echo "Quant $QFORMAT specified. Please read TensorRT-LLM quantization support matrix https://nvidia.github.io/TensorRT-LLM/features/quantization.html#quantization-in-tensorrt-llm and use TensorRT-LLM for deployment. Checkpoint export_path: $SAVE_PATH" - exit 0 - fi + unsupported="" + IFS=','; for q in $QFORMAT; do + case "$q" in fp8|nvfp4|bf16|fp16) : ;; *) unsupported+="${unsupported:+,}$q" ;; esac + done; IFS=' ' + if [ -n "$unsupported" ]; then + echo "Quant(s) [$unsupported] not supported by the TensorRT‑LLM Torch runtime; please consult the support matrix and deploy with TensorRT‑LLM as appropriate. Checkpoint export_path: $SAVE_PATH" + exit 0 + fiexamples/vlm_ptq/scripts/huggingface_example.sh (2)
89-103
: Make empty‑dir test robust and standardize on python3.Prevents brittle failures under set -e and mismatched interpreters.
-if [[ $TASKS =~ "quant" ]] || [[ ! -d "$SAVE_PATH" ]] || [[ ! $(ls -A $SAVE_PATH) ]]; then +if [[ $TASKS =~ "quant" ]] || [[ ! -d "$SAVE_PATH" ]] || ! ls -A "$SAVE_PATH" >/dev/null 2>&1; then @@ - python ../llm_ptq/hf_ptq.py \ + python3 ../llm_ptq/hf_ptq.py \ --pyt_ckpt_path=$MODEL_PATH \ - --export_path=$SAVE_PATH \ - --qformat=$QFORMAT \ + --export_path="$SAVE_PATH" \ + --qformat="$QFORMAT" \ --calib_size=$CALIB_SIZE \ --batch_size=$CALIB_BATCH_SIZE \ $PTQ_ARGS
104-116
: Reorder NVFP4 Blackwell guard before early exit.Currently unreachable if QFORMAT != fp8.
-if [[ "$QFORMAT" != "fp8" ]]; then - echo "For quant format $QFORMAT, please refer to the TensorRT-LLM documentation for deployment. Checkpoint saved to $SAVE_PATH." - exit 0 -fi - -if [[ "$QFORMAT" == *"nvfp4"* ]] || [[ "$KV_CACHE_QUANT" == *"nvfp4"* ]]; then +if [[ "$QFORMAT" == *"nvfp4"* ]] || [[ "${KV_CACHE_QUANT:-}" == *"nvfp4"* ]]; then cuda_major=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader -i 0 | cut -d. -f1) if [ "$cuda_major" -lt 10 ]; then echo "Please deploy the NVFP4 checkpoint on a Blackwell GPU. Checkpoint export_path: $SAVE_PATH" exit 0 fi fi + +if [[ "$QFORMAT" != "fp8" ]]; then + echo "For quant format $QFORMAT, please refer to the TensorRT-LLM documentation for deployment. Checkpoint saved to $SAVE_PATH." + exit 0 +fiexamples/llm_ptq/hf_ptq.py (1)
214-229
: Assertion is ineffective; it always passes because kv_cache_qformat is always valid. This lets unsupported qformats through to export.
- if not args.auto_quantize_bits: - assert ( - args.qformat - in [ - "int4_awq", - "fp8", - "nvfp4", - "nvfp4_awq", - "w4a8_awq", - "fp8_pb_wo", - "w4a8_mxfp4_fp8", - "nvfp4_mlp_only", - ] - or args.kv_cache_qformat in KV_QUANT_CFG_CHOICES - ), f"Quantization format {args.qformat} not supported for HF export path" + if not args.auto_quantize_bits: + assert args.qformat in ALLOWED_UNIFIED_HF_QFORMATS, ( + f"Quantization format {args.qformat} not supported for unified HF export" + )Add near the top (after QUANT_CFG_CHOICES) to centralize:
+ALLOWED_UNIFIED_HF_QFORMATS = { + "int4_awq", + "fp8", + "nvfp4", + "nvfp4_awq", + "w4a8_awq", + "fp8_pb_wo", + "w4a8_mxfp4_fp8", + "nvfp4_mlp_only", +}
🧹 Nitpick comments (9)
tests/examples/llm_ptq/test_llm_ptq.py (2)
80-81
: Add min_gpu=2 when requesting tp=2 to avoid single‑GPU CI failures. Without gating, this will try TP=2 on 1 GPU and fail.
- PTQCommand(quant="int8_sq", kv_cache_quant="none", tp=2, pp=2), + PTQCommand(quant="int8_sq", kv_cache_quant="none", tp=2, pp=2, min_gpu=2),
108-111
: Fix misleading comment or gate accordingly. Comment says "sm100" but min_sm=89.
- PTQCommand(quant="fp8", kv_cache_quant="none", min_sm=89), # sm100 + PTQCommand(quant="fp8", kv_cache_quant="none", min_sm=89), # sm89examples/llm_ptq/scripts/huggingface_example.sh (2)
275-276
: Quote command substitution to avoid word splitting; pick one JSONL.Prevents SC2046 and multi‑file expansion issues.
- JSONL_PATH=$(readlink -f $(find $SAVE_PATH -type f -name '*.jsonl')) + JSONL_PATH=$(readlink -f "$(find "$SAVE_PATH" -type f -name '*.jsonl' | head -n1)")
52-62
: Avoid hard‑coding qformat allow‑lists in multiple places. Source a single quant_formats.sh to keep HF/Torch lists in sync.
+# quant_formats.sh exports HF_EXPORT_QFORMATS and TORCH_LLMAPI_QFORMATS +. "$script_dir/quant_formats.sh" -#Iterate over list of qformats provided and check if they are valid -IFS="," -for qformat in $QFORMAT; do - case $qformat in - fp8 | fp8_pc_pt | fp8_pb_wo | int8_sq | int4_awq | w4a8_awq | fp16 | bf16 | nvfp4 | nvfp4_awq | w4a8_nvfp4_fp8 | w4a8_mxfp4_fp8) ;; - *) - echo "Unknown quant argument: Expected one of: [fp8, fp8_pc_pt, fp8_pb_wo, int8_sq, int4_awq, w4a8_awq, fp16, bf16, nvfp4, nvfp4_awq, w4a8_nvfp4_fp8, w4a8_mxfp4_fp8]" >&2 - exit 1 - ;; - esac -done +IFS=','; for q in $QFORMAT; do + case " $HF_EXPORT_QFORMATS " in *" $q "*) : ;; *) echo "Unknown quant: $q"; exit 1 ;; esac +done; IFS=' 'examples/llm_ptq/hf_ptq.py (3)
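The sourced file could be as small as this (hypothetical content for the proposed `quant_formats.sh`; the variable names match the diff above, but the file does not exist in the repo yet):

```bash
# quant_formats.sh -- single source of truth for allowed quantization formats.
# Formats accepted by the unified HF export path:
HF_EXPORT_QFORMATS="fp8 fp8_pc_pt fp8_pb_wo int8_sq int4_awq w4a8_awq fp16 bf16 nvfp4 nvfp4_awq w4a8_nvfp4_fp8 w4a8_mxfp4_fp8"
# Formats the TensorRT-LLM torch runtime can serve directly:
TORCH_LLMAPI_QFORMATS="fp8 nvfp4 bf16 fp16"
export HF_EXPORT_QFORMATS TORCH_LLMAPI_QFORMATS
```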
92-105
: Deduplicate allowed‑format checks; reuse a single constant.Keeps auto‑quant validation consistent with main arg validation.
- assert all( - qformat - in [ - "fp8", - "int4_awq", - "nvfp4", - "nvfp4_awq", - "w4a8_awq", - "fp8_pb_wo", - "w4a8_mxfp4_fp8", - "nvfp4_mlp_only", - ] - for qformat in qformat_list - ), "One or more quantization formats provided are not supported for unified checkpoint export" + assert all(q in ALLOWED_UNIFIED_HF_QFORMATS for q in qformat_list), ( + "One or more quantization formats are not supported for unified checkpoint export" + )
86-93
: Avoid hidden dependency on global args inside auto_quantize.Pass needed values explicitly to prevent surprises on import.
-def auto_quantize( - model, qformat, auto_quantize_bits, calib_dataloader, calibrate_loop, batch_size=1 -): +def auto_quantize( + model, + qformat, + auto_quantize_bits, + calib_dataloader, + calibrate_loop, + kv_cache_qformat, + batch_size=1, +): @@ - enable_quant_kv_cache = args.kv_cache_qformat != "none" + enable_quant_kv_cache = kv_cache_qformat != "none" @@ - kv_cache_quant_cfg = getattr(mtq, KV_QUANT_CFG_CHOICES[args.kv_cache_qformat])["quant_cfg"] + kv_cache_quant_cfg = getattr(mtq, KV_QUANT_CFG_CHOICES[kv_cache_qformat])["quant_cfg"]And at the call site:
- model = auto_quantize( + model = auto_quantize( model, args.qformat, args.auto_quantize_bits, calib_dataloader, calibrate_loop, + args.kv_cache_qformat, args.batch_size, )Also applies to: 178-185
545-548
: Typo: “quantizaton”.Minor doc fix.
- assert model_type != "dbrx", f"Does not support export {model_type} without quantizaton" + assert model_type != "dbrx", f"Does not support export {model_type} without quantization"examples/llm_ptq/scripts/parser.sh (1)
135-141
: Remove stale prints for unset variables (BUILD_MAX_INPUT_LEN, NUM_SAMPLES).These were dropped from parsing; printing them is confusing.
- echo "input: $BUILD_MAX_INPUT_LEN" echo "output: $BUILD_MAX_OUTPUT_LEN" echo "batch: $BUILD_MAX_BATCH_SIZE" @@ - echo "num_sample: $NUM_SAMPLES"CHANGELOG.rst (1)
9-12
: Clarify deprecation wording.Tighten language and fix grammar.
-- TRT-LLM's TRT backend in ``examples/llm_ptq`` and ``examples/vlm_ptq``. Tasks ``build`` and ``benchmark`` support are removed and replaced with ``quant``. For performance evaluation, please use ``trtllm-bench`` directly. -- ``--export_fmt`` flag in ``examples/llm_ptq`` is removed. By default we export to the unified Hugging Face checkpoint format. -- ``int8_sq`` quantization format is deprecated from the ``examples/vlm_ptq`` respect to the TensorRT-LLM's torch backend switch. Please refer to the previous releases if this quantization format is needed. -- ``examples/vlm_eval`` as it depends on the deprecated TRT-LLM's TRT backend. +- Deprecate the TensorRT‑LLM TRT backend in ``examples/llm_ptq`` and ``examples/vlm_ptq``. The ``build`` and ``benchmark`` tasks are removed; use ``quant`` plus ``trtllm-bench`` for performance evaluation. +- Remove ``--export_fmt`` in ``examples/llm_ptq``. Export defaults to the unified Hugging Face checkpoint format. +- Deprecate ``int8_sq`` in ``examples/vlm_ptq`` due to the switch to the TensorRT‑LLM Torch backend. Refer to previous releases if needed. +- Deprecate ``examples/vlm_eval`` (it depends on the removed TRT backend).
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
CHANGELOG.rst
(1 hunks)examples/llm_ptq/hf_ptq.py
(5 hunks)examples/llm_ptq/scripts/huggingface_example.sh
(9 hunks)examples/llm_ptq/scripts/parser.sh
(3 hunks)examples/vlm_ptq/scripts/huggingface_example.sh
(4 hunks)pyproject.toml
(1 hunks)tests/examples/llm_ptq/test_llm_ptq.py
(2 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
examples/llm_ptq/hf_ptq.py (1)
modelopt/torch/export/model_config_export.py (1)
export_tensorrt_llm_checkpoint
(443-553)
tests/examples/llm_ptq/test_llm_ptq.py (1)
tests/_test_utils/ptq_utils.py (3)
PTQCommand
(28-87)param_str
(82-87)run
(44-80)
🪛 Shellcheck (0.10.0)
examples/llm_ptq/scripts/huggingface_example.sh
[warning] 181-181: Remove quotes from right-hand side of =~ to match as a regex rather than literally.
(SC2076)
[warning] 275-275: Quote this to prevent word splitting.
(SC2046)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: linux
- GitHub Check: wait-checks / wait
|
||
# Parse command-line options | ||
ARGS=$(getopt -o "" -l "type:,model:,quant:,kv_cache_quant:,tp:,calib_tp:,pp:,sparsity:,awq_block_size:,calib:,calib_batch_size:,auto_quantize_bits:,input:,output:,batch:,tasks:,export_fmt:,lm_eval_tasks:,lm_eval_limit:,simple_eval_tasks:,trust_remote_code,use_seq_device_map,gpu_max_mem_percentage:,kv_cache_free_gpu_memory_fraction:,low_memory_mode,no-verbose,calib_dataset:" -n "$0" -- "$@") | ||
ARGS=$(getopt -o "" -l "model:,quant:,kv_cache_quant:,tp:,pp:,sparsity:,awq_block_size:,calib:,calib_batch_size:,auto_quantize_bits:,output:,batch:,tasks:,lm_eval_tasks:,lm_eval_limit:,simple_eval_tasks:,trust_remote_code,use_seq_device_map,gpu_max_mem_percentage:,kv_cache_free_gpu_memory_fraction:,low_memory_mode,no-verbose,calib_dataset:" -n "$0" -- "$@") |
🧩 Analysis chain
Optional: add --num_samples back or delete related code across scripts.
If still referenced externally, re‑add to getopt; else ensure nothing depends on it.
🏁 Script executed:
#!/bin/bash
# Find any remaining references to --num_samples / NUM_SAMPLES
rg -n --hidden -C2 '\b(--num_samples|NUM_SAMPLES)\b' || true
Length of output: 420
Re-add --num_samples to getopt or remove NUM_SAMPLES usage
examples/llm_ptq/scripts/parser.sh still references NUM_SAMPLES (line 141: echo "num_sample: $NUM_SAMPLES"
); either add --num_samples
to the ARGS getopt long options (around line 39) or remove the NUM_SAMPLES variable, its echo, and any dependent code.
🤖 Prompt for AI Agents
In examples/llm_ptq/scripts/parser.sh around line 39, the getopt long options
list is missing --num_samples while the script later references NUM_SAMPLES; fix
by adding "num_samples:" to the long options string (so getopt recognizes and
parses --num_samples) and ensure the subsequent case/assignment logic maps that
option into the NUM_SAMPLES variable, or alternatively remove all references to
NUM_SAMPLES and its echo if you intend to drop the feature.
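If the option is kept, the wiring would look roughly like this (a minimal, self-contained sketch; whether `--num_samples` should stay is the maintainers' call):

```bash
#!/usr/bin/env bash
# Minimal sketch: recognize --num_samples alongside two existing options.
ARGS=$(getopt -o "" -l "model:,quant:,num_samples:" -n "$0" -- "$@")
eval set -- "$ARGS"
while true; do
    case "$1" in
        --model) MODEL="$2"; shift 2 ;;
        --quant) QFORMAT="$2"; shift 2 ;;
        --num_samples) NUM_SAMPLES="$2"; shift 2 ;;
        --) shift; break ;;
    esac
done
echo "num_sample: ${NUM_SAMPLES:-unset}"
```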
Signed-off-by: Keval Morabia <[email protected]>
Signed-off-by: Chenjie Luo <[email protected]>
Signed-off-by: Chenjie Luo <[email protected]> Signed-off-by: Chenjie Luo <[email protected]> Signed-off-by: Keval Morabia <[email protected]> Co-authored-by: Chenjie Luo <[email protected]> Co-authored-by: Keval Morabia <[email protected]> Signed-off-by: Ye Yu <[email protected]>
What does this PR do?
Clean up
Overview: ?
As TRT-LLM is moving to the torch backend, we also removed all usage of TRT-LLM's TRT backend from our examples.
It includes
Summary by CodeRabbit
New Features
Refactor
Deprecations/Bug Fixes
Documentation
Chores