2 changes: 1 addition & 1 deletion CHANGELOG.rst
@@ -8,7 +8,7 @@ Model Optimizer Changelog (Linux)

- Deprecated ModelOpt's custom docker images. Please use the PyTorch, TensorRT-LLM or TensorRT docker image directly or refer to the `installation guide <https://nvidia.github.io/TensorRT-Model-Optimizer/getting_started/2_installation.html>`_ for more details.
- Deprecated ``quantize_mode`` argument in ``examples/onnx_ptq/evaluate.py`` to support strong typing. Use ``engine_precision`` instead.
- Deprecated TRT-LLM's TRT backend in ``examples/llm_ptq`` and ``examples/vlm_ptq``. Support for the ``build`` and ``benchmark`` tasks is removed and replaced with ``quant``. For performance evaluation, please use ``trtllm-bench`` directly.
- Deprecated TRT-LLM's TRT backend in ``examples/llm_ptq`` and ``examples/vlm_ptq``. Support for the ``build`` and ``benchmark`` tasks is removed and replaced with ``quant``. ``engine_dir`` is replaced with ``checkpoint_dir`` in ``examples/llm_ptq`` and ``examples/vlm_ptq``. For performance evaluation, please use ``trtllm-bench`` directly.
- ``--export_fmt`` flag in ``examples/llm_ptq`` is removed. By default we export to the unified Hugging Face checkpoint format.
- Deprecated ``examples/vlm_eval`` as it depends on the deprecated TRT-LLM's TRT backend.

8 changes: 4 additions & 4 deletions examples/llm_eval/README.md
@@ -93,7 +93,7 @@ If `trust_remote_code` needs to be true, please append the command with the `--t
### TensorRT-LLM

```sh
python lm_eval_tensorrt_llm.py --model trt-llm --model_args tokenizer=<HF model folder>,engine_dir=<Quantized checkpoint dir> --tasks <comma separated tasks> --batch_size <engine batch size>
python lm_eval_tensorrt_llm.py --model trt-llm --model_args tokenizer=<HF model folder>,checkpoint_dir=<Quantized checkpoint dir> --tasks <comma separated tasks> --batch_size <max batch size>
```
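
For reference, a filled-in version of that command might look like the following; the model name, checkpoint path, and task list here are illustrative placeholders, not values from this PR:

```sh
python lm_eval_tensorrt_llm.py --model trt-llm \
    --model_args tokenizer=meta-llama/Meta-Llama-3-8B-Instruct,checkpoint_dir=/ckpts/llama3-8b-fp8 \
    --tasks mmlu,gsm8k \
    --batch_size 8
```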

## MMLU
@@ -137,10 +137,10 @@ python mmlu.py --model_name causal --model_path <HF model folder or model card>
python mmlu.py --model_name causal --model_path <HF model folder or model card> --quant_cfg $MODELOPT_QUANT_CFG_TO_SEARCH --auto_quantize_bits $EFFECTIVE_BITS --batch_size 4
```

### Evaluate the TensorRT-LLM engine
### Evaluate with TensorRT-LLM

```bash
python mmlu.py --model_name causal --model_path <HF model folder or model card> --engine_dir <Quantized checkpoint dir>
python mmlu.py --model_name causal --model_path <HF model folder or model card> --checkpoint_dir <Quantized checkpoint dir>
```

## MT-Bench
@@ -160,7 +160,7 @@ bash run_fastchat.sh -h <HF model folder or model card>
bash run_fastchat.sh -h <HF model folder or model card> --quant_cfg MODELOPT_QUANT_CFG
```

### Evaluate the TensorRT-LLM engine
### Evaluate with TensorRT-LLM

```bash
bash run_fastchat.sh -h <HF model folder or model card> <Quantized checkpoint dir>
24 changes: 12 additions & 12 deletions examples/llm_eval/gen_model_answer.py
@@ -118,7 +118,7 @@ def run_eval(
max_gpu_memory,
dtype,
revision,
engine_dir,
checkpoint_dir,
nim_model,
args,
):
Comment on lines +121 to 124

🛠️ Refactor suggestion | 🟠 Major

🧩 Analysis chain

Rename to checkpoint_dir in run_eval: looks good

Parameter rename is consistent.

Run to confirm no stale uses remain:


🏁 Script executed:

#!/bin/bash
# Find lingering engine_dir references (flags, kwargs, vars)
rg -n -C2 -g '!**/venv/**' -g '!**/build/**' -g '!**/dist/**' -e '\bengine_dir\b' -e '--engine-dir'

Length of output: 148


🏁 Script executed:

#!/bin/bash
# Re-run search including all files and excluding venv, build, dist
rg -n -C2 -g '*' -g '!venv/**' -g '!build/**' -g '!dist/**' -e '\bengine_dir\b' -e '--engine-dir'

Length of output: 5278


Stale engine_dir references in Windows accuracy benchmark examples

  • examples/windows/accuracy_benchmark/README.md (lines 165, 177): replace --engine_dir with --checkpoint-dir
  • examples/windows/accuracy_benchmark/mmlu_benchmark.py (line 523): rename CLI arg and variable from engine_dir to checkpoint_dir
  • examples/windows/accuracy_benchmark/trtllm_utils.py (around line 317): update parser.add_argument("--engine_dir",…) and all helper functions to use checkpoint_dir instead
🤖 Prompt for AI Agents
In examples/llm_eval/gen_model_answer.py around lines 121-124, ensure any
remaining references to the old CLI/variable name engine_dir are replaced with
checkpoint_dir; across the Windows accuracy benchmark examples update the README
(lines 165,177) to use --checkpoint-dir, rename CLI args and variables in
examples/windows/accuracy_benchmark/mmlu_benchmark.py (line ~523) from
engine_dir to checkpoint_dir, and in
examples/windows/accuracy_benchmark/trtllm_utils.py update
parser.add_argument("--engine_dir", ...) to --checkpoint-dir and rename all
helper function parameters and internal variables from engine_dir to
checkpoint_dir so names and docs are consistent.
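
One backwards-compatible way to carry out that rename in the Windows benchmark parser is an argparse alias. This is only a sketch of the idea, not a change in this PR, and the exact option strings are assumptions:

```python
import argparse

parser = argparse.ArgumentParser()
# Accept the new flag and keep the old one as a deprecated alias; both are
# stored in args.checkpoint_dir, so downstream code only sees the new name.
parser.add_argument(
    "--checkpoint_dir",
    "--engine_dir",
    dest="checkpoint_dir",
    type=str,
    help="Path to the quantized checkpoint directory (formerly --engine_dir).",
)
args = parser.parse_args(["--engine_dir", "/ckpts/llama3-8b-fp8"])
print(args.checkpoint_dir)  # -> /ckpts/llama3-8b-fp8
```

Whether to keep `--engine_dir` as a deprecated alias or drop it outright is a separate decision; the point of the finding above is only that flags, variables, and docs should agree.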

@@ -150,7 +150,7 @@ def run_eval(
revision=revision,
top_p=top_p,
temperature=temperature,
engine_dir=engine_dir,
checkpoint_dir=checkpoint_dir,
nim_model=nim_model,
)
Comment on lines +153 to 155

🛠️ Refactor suggestion | 🟠 Major

Pass explicit flags to remote worker; avoid relying on globals inside Ray

When use_ray is true, get_model_answers runs out-of-process. Pass required options explicitly to avoid NameError on globals.

Apply:

             checkpoint_dir=checkpoint_dir,
+            trust_remote_code=args.trust_remote_code,
+            quant_cfg=args.quant_cfg,
+            calib_batch_size=args.calib_batch_size,
+            calib_size=args.calib_size,
+            auto_quantize_bits=args.auto_quantize_bits,
             nim_model=nim_model,
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-            checkpoint_dir=checkpoint_dir,
-            nim_model=nim_model,
-        )
+            checkpoint_dir=checkpoint_dir,
+            trust_remote_code=args.trust_remote_code,
+            quant_cfg=args.quant_cfg,
+            calib_batch_size=args.calib_batch_size,
+            calib_size=args.calib_size,
+            auto_quantize_bits=args.auto_quantize_bits,
+            nim_model=nim_model,
+        )
🤖 Prompt for AI Agents
In examples/llm_eval/gen_model_answer.py around lines 153-155, the call that
runs get_model_answers out-of-process with use_ray=True relies on module globals
and will raise NameError inside the remote worker; update the remote invocation
to pass all required options explicitly (e.g., flags/variables previously
referenced as globals such as
model/config/checkpoint_dir/nim_model/tokenizer/device/seed/other runtime
options) and update get_model_answers' remote-compatible signature to accept
those parameters so the worker receives everything it needs; ensure only
serializable types are passed and remove reliance on globals inside the function
body.

for i in range(0, len(questions), chunk_size)
@@ -174,25 +174,25 @@ def get_model_answers(
revision,
top_p=None,
temperature=None,
engine_dir=None,
checkpoint_dir=None,
nim_model=None,
):
# Model Optimizer modification
if engine_dir:
if checkpoint_dir:
tokenizer = get_tokenizer(model_path, trust_remote_code=args.trust_remote_code)
if engine_dir:
if checkpoint_dir:
# get model type
last_part = os.path.basename(engine_dir)
last_part = os.path.basename(checkpoint_dir)
model_type = last_part.split("_")[0]
# Some models require to set pad_token and eos_token based on external config (e.g., qwen)
if model_type == "qwen":
tokenizer.pad_token = tokenizer.convert_ids_to_tokens(151643)
tokenizer.eos_token = tokenizer.convert_ids_to_tokens(151643)

assert LLM is not None, "tensorrt_llm APIs could not be imported."
model = LLM(engine_dir, tokenizer=tokenizer)
model = LLM(checkpoint_dir, tokenizer=tokenizer)
else:
raise ValueError("engine_dir is required for TensorRT LLM inference.")
raise ValueError("checkpoint_dir is required for TensorRT LLM inference.")
elif not nim_model:
model, _ = load_model(
model_path,
@@ -259,7 +259,7 @@ def get_model_answers(

# some models may error out when generating long outputs
try:
if not engine_dir:
if not checkpoint_dir:
output_ids = model.generate(
torch.as_tensor(input_ids).cuda(),
do_sample=do_sample,
@@ -427,9 +427,9 @@ def reorg_answer_file(answer_file):
help="The model revision to load.",
)
parser.add_argument(
"--engine-dir",
"--checkpoint-dir",
type=str,
help="The path to the TensorRT LLM engine directory.",
help="The path to the model checkpoint directory.",
)
parser.add_argument(
"--nim-model",
@@ -502,7 +502,7 @@ def reorg_answer_file(answer_file):
max_gpu_memory=args.max_gpu_memory,
dtype=str_to_torch_dtype(args.dtype),
revision=args.revision,
engine_dir=args.engine_dir,
checkpoint_dir=args.checkpoint_dir,
nim_model=args.nim_model,
args=args,
)
8 changes: 4 additions & 4 deletions examples/llm_eval/lm_eval_tensorrt_llm.py
@@ -42,7 +42,7 @@ class TRTLLM(TemplateAPI):
def __init__(
self,
tokenizer: str,
engine_dir: str,
checkpoint_dir: str,
batch_size: int = 1,
**kwargs,
):
@@ -56,11 +56,11 @@ def __init__(
if self.tokenizer.pad_token_id is None:
self.tokenizer.pad_token_id = self.tokenizer.eos_token_id

assert isinstance(engine_dir, str)
assert isinstance(checkpoint_dir, str)

self.llm = LLM(checkpoint_dir=engine_dir, tokenizer=self.tokenizer)
self.llm = LLM(checkpoint_dir=checkpoint_dir, tokenizer=self.tokenizer)
self.max_length = self.llm.max_seq_len - 1
logger.info("Loaded TRT-LLM engine")
logger.info("Loaded TRT-LLM")

def model_call(
self,
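
As context for the constructor change above: the comma-separated `--model_args` string shown in the README is what ends up as these keyword arguments. A rough sketch of that plumbing, for illustration only (lm-evaluation-harness performs this parsing internally, and the paths are made up):

```python
def parse_model_args(model_args: str) -> dict:
    """Split a 'k1=v1,k2=v2' style --model_args string into keyword arguments."""
    return dict(item.split("=", 1) for item in model_args.split(",") if item)

kwargs = parse_model_args(
    "tokenizer=/models/Meta-Llama-3-8B-Instruct,checkpoint_dir=/ckpts/llama3-8b-fp8"
)
# TRTLLM(**kwargs) now receives checkpoint_dir=... rather than the old engine_dir=...
print(kwargs)
```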
8 changes: 5 additions & 3 deletions examples/llm_eval/mmlu.py
@@ -252,9 +252,9 @@ def main(
mto.enable_huggingface_checkpointing()
model_path = kwargs["model_path"]
tokenizer = get_tokenizer(model_path, trust_remote_code=kwargs.get("trust_remote_code", False))
if kwargs.get("engine_dir"):
if kwargs.get("checkpoint_dir"):
# get model type
last_part = os.path.basename(kwargs["engine_dir"])
last_part = os.path.basename(kwargs["checkpoint_dir"])
model_type = last_part.split("_")[0]
# Some models require to set pad_token and eos_token based on external config (e.g., qwen)
if model_type == "qwen":
@@ -264,7 +264,9 @@
assert LLM is not None, "tensorrt_llm APIs could not be imported."
medusa_choices = kwargs.get("medusa_choices")
model = LLM(
checkpoint_dir=kwargs["engine_dir"], tokenizer=tokenizer, medusa_choices=medusa_choices
checkpoint_dir=kwargs["checkpoint_dir"],
tokenizer=tokenizer,
medusa_choices=medusa_choices,
)
else:
model = select_model(
20 changes: 10 additions & 10 deletions examples/llm_eval/run_fastchat.sh
@@ -20,18 +20,18 @@
# If you are using NIM, ensure that you export the NIM API key using:
# export OPENAI_API_KEY=<NIM_API_KEY>
#
# Usage: bash run_fastchat.sh -h <HF model folder or model card> -e <engine_dir> -n <NIM model model card>
# Usage: bash run_fastchat.sh -h <HF model folder or model card> -e <checkpoint_dir> -n <NIM model model card>
# model_name: The HuggingFace handle or folder of the model to evaluate.
# engine_dir: The directory where the TRT-LLM engine is stored.
# checkpoint_dir: The directory where the checkpoint is stored.
# nim_model_name: The handle of the NIM model to be used for evaluation.
#
# Example commands:
#
# Evaluate "meta-llama/Meta-Llama-3-8B-Instruct" HF model:
# bash run_fastchat.sh -h meta-llama/Meta-Llama-3-8B-Instruct
#
# Evaluate "meta-llama/Meta-Llama-3-8B-Instruct" HF model with TRT-LLM engine:
# bash run_fastchat.sh -h meta-llama/Meta-Llama-3-8B-Instruct -e /path/to/engine_dir
# Evaluate "meta-llama/Meta-Llama-3-8B-Instruct" HF model with TRT-LLM:
# bash run_fastchat.sh -h meta-llama/Meta-Llama-3-8B-Instruct -e /path/to/checkpoint_dir
#
# Evaluate "meta-llama/Meta-Llama-3-8B-Instruct" HF model with NIM:
# bash run_fastchat.sh -h meta-llama/Meta-Llama-3-8B-Instruct -n meta-llama/Meta-Llama-3-8B-Instruct
@@ -41,7 +41,7 @@ set -e
set -x

hf_model_name=""
engine_dir=""
checkpoint_dir=""
nim_model_name=""
answer_file=""
quant_cfg=""
@@ -56,9 +56,9 @@ while [[ "$1" != "" ]]; do
shift
hf_model_name=$1
;;
-e | --engine_dir )
-e | --checkpoint_dir )
shift
engine_dir=$1
checkpoint_dir=$1
;;
-n | --nim_model_name )
shift
@@ -96,8 +96,8 @@ if [ "$hf_model_name" == "" ]; then
exit 1
fi

if [ "$engine_dir" != "" ]; then
engine_dir=" --engine-dir $engine_dir "
if [ "$checkpoint_dir" != "" ]; then
checkpoint_dir=" --checkpoint-dir $checkpoint_dir "
fi

if [ "$nim_model_name" != "" ]; then
@@ -143,7 +143,7 @@ PYTHONPATH=FastChat:$PYTHONPATH python gen_model_answer.py \
--model-id $hf_model_name \
--temperature 0.0001 \
--top-p 0.0001 \
$engine_dir \
$checkpoint_dir \
$nim_model_name \
$answer_file \
$quant_args
10 changes: 0 additions & 10 deletions examples/llm_ptq/example_utils.py
@@ -36,16 +36,6 @@ def is_speculative(hf_config):
)


def get_mode_type_from_engine_dir(engine_dir_str):
# Split the path by '/' and get the last part
last_part = os.path.basename(engine_dir_str)

# Split the last part by '_' and get the first segment
model_type = last_part.split("_")[0]

return model_type


def get_tokenizer(ckpt_path, trust_remote_code=False, **kwargs):
print(f"Initializing tokenizer from {ckpt_path}")

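
The removed helper's behavior survives inline in gen_model_answer.py and mmlu.py, which now derive the model type from the checkpoint directory name directly. A quick illustration of what both compute (the directory name is a made-up example of the expected `<model_type>_<quant_format>` layout):

```python
import os

checkpoint_dir = "/ckpts/qwen_int4_awq"  # hypothetical path following <model_type>_... naming
model_type = os.path.basename(checkpoint_dir).split("_")[0]
print(model_type)  # -> "qwen"
```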
10 changes: 5 additions & 5 deletions examples/llm_ptq/run_tensorrt_llm.py
@@ -13,7 +13,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.

"""An example script to run the tensorrt_llm engine."""
"""An example script to run the tensorrt_llm inference."""

import argparse

@@ -28,7 +28,7 @@ def parse_arguments():
parser = argparse.ArgumentParser()
parser.add_argument("--tokenizer", type=str, default="")
parser.add_argument("--max_output_len", type=int, default=100)
parser.add_argument("--engine_dir", type=str, default="/tmp/modelopt")
parser.add_argument("--checkpoint_dir", type=str)
parser.add_argument(
"--input_texts",
type=str,
@@ -49,8 +49,8 @@

def run(args):
if not args.tokenizer:
# Assume the tokenizer files are saved in the engine_dr.
args.tokenizer = args.engine_dir
# Assume the tokenizer files are saved in the checkpoint_dir.
args.tokenizer = args.checkpoint_dir

if isinstance(args.tokenizer, PreTrainedTokenizerBase):
tokenizer = args.tokenizer
@@ -66,7 +66,7 @@

print("TensorRT-LLM example outputs:")

llm = LLM(args.engine_dir, tokenizer=tokenizer, max_batch_size=len(input_texts))
llm = LLM(args.checkpoint_dir, tokenizer=tokenizer, max_batch_size=len(input_texts))
torch.cuda.cudart().cudaProfilerStart()
outputs = llm.generate_text(input_texts, args.max_output_len)
torch.cuda.cudart().cudaProfilerStop()
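
Since `--checkpoint_dir` no longer has a default value, a minimal smoke test of the updated script could look like this (hypothetical checkpoint path; the tokenizer falls back to the checkpoint directory as shown above):

```sh
python run_tensorrt_llm.py --checkpoint_dir /ckpts/llama3-8b-fp8 --max_output_len 64
```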
8 changes: 4 additions & 4 deletions examples/llm_ptq/scripts/huggingface_example.sh
@@ -158,7 +158,7 @@ if [[ $TASKS =~ "quant" ]] || [[ ! -d "$SAVE_PATH" ]] || [[ ! $(ls -A $SAVE_PATH
echo "Quantized model config $MODEL_CONFIG exists, skipping the quantization stage"
fi

# for enc-dec model, users need to refer TRT-LLM example to build engines and deployment
# for enc-dec models, users need to refer to the TRT-LLM example for deployment
if [[ -f "$SAVE_PATH/encoder/config.json" && -f "$SAVE_PATH/decoder/config.json" && ! -f $MODEL_CONFIG ]]; then
echo "Please continue to deployment with the TRT-LLM enc_dec example, https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/enc_dec. Checkpoint export_path: $SAVE_PATH"
exit 0
@@ -187,7 +187,7 @@ if [[ $TASKS =~ "quant" ]] || [[ ! -d "$SAVE_PATH" ]] || [[ ! $(ls -A $SAVE_PATH
RUN_ARGS+=" --trust_remote_code "
fi

python run_tensorrt_llm.py --engine_dir=$SAVE_PATH $RUN_ARGS
python run_tensorrt_llm.py --checkpoint_dir=$SAVE_PATH $RUN_ARGS
fi

if [[ -d "${MODEL_PATH}" ]]; then
@@ -229,7 +229,7 @@ if [[ $TASKS =~ "lm_eval" ]]; then

python lm_eval_tensorrt_llm.py \
--model trt-llm \
--model_args tokenizer=$MODEL_PATH,engine_dir=$SAVE_PATH,max_gen_toks=$BUILD_MAX_OUTPUT_LEN \
--model_args tokenizer=$MODEL_PATH,checkpoint_dir=$SAVE_PATH,max_gen_toks=$BUILD_MAX_OUTPUT_LEN \
--tasks $LM_EVAL_TASKS \
--batch_size $BUILD_MAX_BATCH_SIZE $lm_eval_flags | tee $LM_EVAL_RESULT

@@ -259,7 +259,7 @@ if [[ $TASKS =~ "mmlu" ]]; then
python mmlu.py \
--model_name causal \
--model_path $MODEL_ABS_PATH \
--engine_dir $SAVE_PATH \
--checkpoint_dir $SAVE_PATH \
--data_dir $MMLU_DATA_PATH | tee $MMLU_RESULT
popd

2 changes: 1 addition & 1 deletion examples/vlm_ptq/README.md
@@ -56,7 +56,7 @@ Please refer to the [llm_ptq/README.md](../llm_ptq/README.md#current-out-of-the-

Please refer to the [llm_ptq/README.md](../llm_ptq/README.md) about the details of model quantization.

The following scripts provide an all-in-one and step-by-step model quantization example for Llava, VILA, Phi-3-vision and Qwen2.5-VL models. The quantization format and the number of GPUs will be supplied as inputs to these scripts. By default, we build the engine for the fp8 format and 1 GPU.
The following scripts provide an all-in-one and step-by-step model quantization example for the supported Hugging Face multi-modal models. The quantization format and the number of GPUs will be supplied as inputs to these scripts.

### Hugging Face Example [Script](./scripts/huggingface_example.sh)
