Deprecate TRTLLM-build in examples #297
Changes from 16 commits
File: examples/llm_ptq README
@@ -203,7 +203,7 @@ scripts/huggingface_example.sh --type llama --model $HF_PATH --quant w4a8_awq,fp

  The above example perform `AutoQuantize` where the less quantization accuracy sensitive layers are quantized with `w4a8_awq` (specified by `--quant w4a8_awq`) and the more sensitive layers
  are kept un-quantized such that the effective bits is 4.8 (specified by `--auto_quantize_bits 4.8`).

- The example scripts above also have an additional flag `--tasks`, where the actual tasks run in the script can be customized. The allowed tasks are `build,mmlu,benchmark,lm_eval,livecodebench` specified in the script [parser](./scripts/parser.sh). The tasks combo can be specified with a comma-separated task list. Some tasks like mmlu can take a long time to run. To run lm_eval tasks, please also specify the `--lm_eval_tasks` flag with comma separated lm_eval tasks [here](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks).
+ The example scripts above also have an additional flag `--tasks`, where the actual tasks run in the script can be customized. The allowed tasks are `quant,mmlu,lm_eval,livecodebench` specified in the script [parser](./scripts/parser.sh). The tasks combo can be specified with a comma-separated task list. Some tasks like mmlu can take a long time to run. To run lm_eval tasks, please also specify the `--lm_eval_tasks` flag with comma separated lm_eval tasks [here](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks).

  > *If GPU out-of-memory error is reported running the scripts, please try editing the scripts and reducing the max batch size to save GPU memory.*
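For reference, a sketch of how the renamed task list might be invoked after this change; the model path and the lm_eval task names are placeholders, not taken from the PR:

```bash
# Quantize and run only the MMLU evaluation, skipping the slower tasks
scripts/huggingface_example.sh --model $HF_PATH --quant fp8 --tasks quant,mmlu

# Run lm_eval tasks; --lm_eval_tasks takes a comma-separated list of harness task names
scripts/huggingface_example.sh --model $HF_PATH --quant fp8 --tasks quant,lm_eval --lm_eval_tasks arc_challenge,hellaswag
```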
@@ -251,7 +251,7 @@ scripts/huggingface_example.sh --model $HF_PATH --quant [fp8|nvfp4|int8_sq|int4_

  > *If a GPU OOM error occurs during model quantization despite sufficient memory, setting the --use_seq_device_map flag can help. This enforces sequential device mapping, distributing the model across GPUs and utilizing up to 80% of each GPU's memory.*

- > *You can now add `--low_memory_mode` to the command when setting `--export_fmt=hf` to lower the memory requirements of the PTQ process. With this mode, the script will compress model weights to low precision before calibration. This mode is only supported for FP8 and NVFP4 with max calibration.*
+ > *You can add `--low_memory_mode` to the command to lower the memory requirements of the PTQ process. With this mode, the script will compress model weights to low precision before calibration. This mode is only supported for FP8 and NVFP4 with max calibration.*

  #### Deepseek R1

Reviewer comment on the `--low_memory_mode` line: Quick question: I remember we had some accuracy degradation with …
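As documented, the low-memory path could be exercised with something like the following sketch; the model card is a placeholder, and the flag is assumed to be forwarded by the framework script as the README implies:

```bash
# FP8 PTQ with weights compressed to low precision before calibration to reduce peak memory
scripts/huggingface_example.sh --model <huggingface_model_card> --quant fp8 --low_memory_mode
```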
@@ -301,15 +301,15 @@ with torch.inference_mode():

  ### Quantize and Export

  ```bash
- python hf_ptq.py --pyt_ckpt_path <huggingface_model_card> --qformat fp8 --export_fmt hf --export_path <quantized_ckpt_path> --trust_remote_code
+ python hf_ptq.py --pyt_ckpt_path <huggingface_model_card> --qformat fp8 --export_path <quantized_ckpt_path> --trust_remote_code
  ```
CodeRabbit comment on lines 301 to 305:

🧩 Analysis chain: Command examples: confirm defaults now export unified HF without extra flags. Looks good; verify. Also applies to: 311-313.

🏁 Script executed:

    #!/usr/bin/env bash
    # Check CLI surfaces for --export_fmt remnants and default export path
    rg -n --type=sh --type=py 'export_fmt|export_hf_checkpoint|export_tensorrt_llm_checkpoint' examples/llm_ptq/{hf_ptq.py,scripts/huggingface_example.sh}

Length of output: 474

Remove deprecated `--export_fmt`: examples/llm_ptq/hf_ptq.py still defines/uses it.
  ### Hugging Face framework [Script](./scripts/huggingface_example.sh)

  Alternatively, the framework script `huggingface_example.sh` also supports quantize and export:

  ```bash
- scripts/huggingface_example.sh --model <huggingface_model_card> --quant fp8 --export_fmt hf
+ scripts/huggingface_example.sh --model <huggingface_model_card> --quant fp8
  ```

  ### Deployment
File: examples/llm_ptq/hf_ptq.py
@@ -89,28 +89,20 @@ def auto_quantize(

      qformat_list = qformat.split(",")
      assert qformat_list, "No quantization formats provided"
      # Check if all provided quantization formats are supported
-     if args.export_fmt == "hf":
-         assert all(
-             qformat
-             in [
-                 "fp8",
-                 "int4_awq",
-                 "nvfp4",
-                 "nvfp4_awq",
-                 "w4a8_awq",
-                 "fp8_pb_wo",
-                 "w4a8_mxfp4_fp8",
-                 "nvfp4_mlp_only",
-             ]
-             for qformat in qformat_list
-         ), (
-             "One or more quantization formats provided are not supported for unified checkpoint export"
-         )
-     else:
-         assert all(
-             qformat in ["fp8", "int8_sq", "int4_awq", "w4a8_awq", "nvfp4", "nvfp4_awq"]
-             for qformat in qformat_list
-         ), "One or more quantization formats provided are not supported for tensorrt llm export"
+     assert all(
+         qformat
+         in [
+             "fp8",
+             "int4_awq",
+             "nvfp4",
+             "nvfp4_awq",
+             "w4a8_awq",
+             "fp8_pb_wo",
+             "w4a8_mxfp4_fp8",
+             "nvfp4_mlp_only",
+         ]
+         for qformat in qformat_list
+     ), "One or more quantization formats provided are not supported for unified checkpoint export"

  def loss_func(output, data):
      # For transformers AutoModelForCausalLM models, the outputs are wrapped in `CausalLMOutputWithPast`

Reviewer comment on the supported-qformat list: do you think we can pull this list of supported qformats as a variable and reuse in other places (in the auto quantize section)?

Reply: ACK. This PR is pretty large. I hope we can move the improvements to a follow-up.
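A minimal sketch of what the reviewer is suggesting; the constant name `ALLOWED_UNIFIED_HF_QFORMATS` is borrowed from a later review suggestion in this thread, and the helper function is hypothetical rather than code from the PR:

```python
# Hoist the supported formats into one module-level constant so auto_quantize()
# and main() validate against the same list (sketch, not code from this PR).
ALLOWED_UNIFIED_HF_QFORMATS = [
    "fp8",
    "int4_awq",
    "nvfp4",
    "nvfp4_awq",
    "w4a8_awq",
    "fp8_pb_wo",
    "w4a8_mxfp4_fp8",
    "nvfp4_mlp_only",
]


def check_unified_hf_qformats(qformat: str) -> None:
    """Validate a comma-separated qformat string for unified HF checkpoint export."""
    qformat_list = qformat.split(",")
    assert qformat_list, "No quantization formats provided"
    assert all(q in ALLOWED_UNIFIED_HF_QFORMATS for q in qformat_list), (
        "One or more quantization formats provided are not supported for unified checkpoint export"
    )
```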
@@ -219,27 +211,21 @@ def main(args):

          "Quantization supports only one quantization format."
      )

-     # Check arguments for unified_hf export format and set to default if unsupported arguments are provided
-     if args.export_fmt == "hf":
-         assert args.sparsity_fmt == "dense", (
-             f"Sparsity format {args.sparsity_fmt} not supported by unified export api."
-         )
-
-         if not args.auto_quantize_bits:
-             assert (
-                 args.qformat
-                 in [
-                     "int4_awq",
-                     "fp8",
-                     "nvfp4",
-                     "nvfp4_awq",
-                     "w4a8_awq",
-                     "fp8_pb_wo",
-                     "w4a8_mxfp4_fp8",
-                     "nvfp4_mlp_only",
-                 ]
-                 or args.kv_cache_qformat in KV_QUANT_CFG_CHOICES
-             ), f"Quantization format {args.qformat} not supported for HF export path"
+     if not args.auto_quantize_bits:
+         assert (
+             args.qformat
+             in [
+                 "int4_awq",
+                 "fp8",
+                 "nvfp4",
+                 "nvfp4_awq",
+                 "w4a8_awq",
+                 "fp8_pb_wo",
+                 "w4a8_mxfp4_fp8",
+                 "nvfp4_mlp_only",
+             ]
+             or args.kv_cache_qformat in KV_QUANT_CFG_CHOICES
+         ), f"Quantization format {args.qformat} not supported for HF export path"

      # If low memory mode is enabled, we compress the model while loading the HF checkpoint.
      calibration_only = False

CodeRabbit comment on lines +214 to 229: Assertion is ineffective; this currently lets unsupported qformats through. Tighten the check.

- if not args.auto_quantize_bits:
-     assert (
-         args.qformat
-         in [
-             "int4_awq",
-             "fp8",
-             "nvfp4",
-             "nvfp4_awq",
-             "w4a8_awq",
-             "fp8_pb_wo",
-             "w4a8_mxfp4_fp8",
-             "nvfp4_mlp_only",
-         ]
-         or args.kv_cache_qformat in KV_QUANT_CFG_CHOICES
-     ), f"Quantization format {args.qformat} not supported for HF export path"
+ if not args.auto_quantize_bits:
+     assert args.qformat in ALLOWED_UNIFIED_HF_QFORMATS, (
+         f"Quantization format {args.qformat} not supported for HF export path"
+     )

Note: If you intended to allow "KV-cache-only" quant, handle that as a separate branch instead of weakening this assert.
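A sketch of the "separate branch" alternative mentioned in the note; `ALLOWED_UNIFIED_HF_QFORMATS` follows the review's suggested constant, and treating an empty `--qformat` as the KV-cache-only case is an assumption, not behavior confirmed by the PR:

```python
# Sketch only: validate weight quantization and KV-cache-only quantization separately
# instead of OR-ing the two conditions into a single assert.
if not args.auto_quantize_bits:
    if args.qformat:
        assert args.qformat in ALLOWED_UNIFIED_HF_QFORMATS, (
            f"Quantization format {args.qformat} not supported for HF export path"
        )
    else:
        # KV-cache-only quantization: no weight format given, so only the
        # KV-cache format needs to be a known configuration.
        assert args.kv_cache_qformat in KV_QUANT_CFG_CHOICES, (
            f"KV cache quantization format {args.kv_cache_qformat} is not supported"
        )
```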
@@ -253,9 +239,6 @@ def main(args):

              attn_implementation=args.attn_implementation,
          )
      else:
-         assert args.export_fmt == "hf", (
-             "Low memory mode is only supported for exporting HF checkpoint."
-         )
          assert args.qformat in QUANT_CFG_CHOICES, (
              f"Quantization format is not supported for low memory mode. Supported formats: {QUANT_CFG_CHOICES.keys()}"
          )
@@ -600,34 +583,41 @@ def output_decode(generated_ids, input_shape):

          setattr(model.config, "architectures", full_model_config.architectures)

      start_time = time.time()
-     if args.export_fmt == "tensorrt_llm":
+     if (
+         model_type in ["t5", "bart", "whisper"]
+         or args.sparsity_fmt != "dense"
+         or "int8_sq" in args.qformat
+     ):
+         warnings.warn(
+             "Still exporting TensorRT-LLM checkpoints for models not supported by the TensorRT-LLM torch runtime."
+         )
+
          # Move meta tensor back to device before exporting.
          remove_hook_from_module(model, recurse=True)

          dtype = None
          if "w4a8_awq" in args.qformat:
              # TensorRT-LLM w4a8 only support fp16 as the dtype.
              dtype = torch.float16

          # For Gemma2-27B, TRT-LLM only works with bfloat16 as the dtype.
          if model_type == "gemma2":
              dtype = torch.bfloat16

          export_tensorrt_llm_checkpoint(
              model,
              model_type,
              dtype=dtype,
              export_dir=export_path,
              inference_tensor_parallel=args.inference_tensor_parallel,
              inference_pipeline_parallel=args.inference_pipeline_parallel,
          )
-     elif args.export_fmt == "hf":
+     else:
+         # Check arguments for unified_hf export format and set to default if unsupported arguments are provided
+         assert args.sparsity_fmt == "dense", (
+             f"Sparsity format {args.sparsity_fmt} not supported by unified export api."
+         )
+
+         if args.inference_tensor_parallel != 1 or args.inference_pipeline_parallel != 1:
+             warnings.warn(
+                 "Unified HF export format does not specify inference tensor parallel or pipeline parallel. "
+                 "They will be set at deployment time."
+             )
+
          export_hf_checkpoint(
              full_model,
              export_dir=export_path,
          )
-     else:
-         raise NotImplementedError(f"{args.export_fmt} not supported")

      # Restore default padding and export the tokenizer as well.
      if tokenizer is not None:
@@ -710,9 +700,9 @@ def output_decode(generated_ids, input_shape):

      parser.add_argument(
          "--export_fmt",
          required=False,
-         default="tensorrt_llm",
+         default="hf",
          choices=["tensorrt_llm", "hf"],
-         help=("Checkpoint export format"),
+         help="Deprecated. Please avoid using this argument.",
      )
      parser.add_argument(
          "--trust_remote_code",
@@ -767,6 +757,9 @@ def output_decode(generated_ids, input_shape):

      args = parser.parse_args()

+     if args.export_fmt != "hf":
+         warnings.warn("Deprecated. --export_fmt forced to hf.")
+
      args.dataset = args.dataset.split(",") if args.dataset else None
      args.calib_size = [int(num_sample) for num_sample in args.calib_size.split(",")]
      main(args)
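The added warning says the flag is "forced to hf", but the lines above only warn; if the value were actually to be normalized, a sketch (an assumption, not code from this PR) could look like:

```python
if args.export_fmt != "hf":
    warnings.warn(
        "--export_fmt is deprecated and ignored; the unified HF checkpoint is always exported.",
        DeprecationWarning,
    )
    args.export_fmt = "hf"  # normalize the deprecated flag to the only supported value
```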
CodeRabbit comment:

🧩 Analysis chain: Ensure --low_memory_mode is implemented and gated as documented. The README states it is only supported for FP8/NVFP4 with max calibration. Verify that hf_ptq.py enforces this and that weight compression happens before calibration.

🏁 Script executed. Length of output: 2446

Enforce --low_memory_mode gating and compress weights before calibration: the README says low_memory_mode is only supported for FP8 / NVFP4 with max calibration, but the flag is defined and referenced without explicit enforcement or an obvious compression-before-calibration step. Action items:
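A minimal sketch of the gating being requested; apart from `--low_memory_mode` and `--qformat`, the argument names are assumptions and the calibration-method check is left as a placeholder:

```python
def check_low_memory_mode(args) -> None:
    """Gate --low_memory_mode as the README documents (sketch; not verified against hf_ptq.py)."""
    if not args.low_memory_mode:
        return
    # README: only FP8 and NVFP4 are supported in low memory mode.
    assert args.qformat in ("fp8", "nvfp4"), (
        "--low_memory_mode is only supported for FP8 and NVFP4 quantization."
    )
    # README: only max calibration is supported; a check on the calibration
    # algorithm argument would go here once its exact name is confirmed.
    # Weight compression must then happen while loading the checkpoint,
    # i.e. before any calibration forward passes run.
```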