Description
The generic post-training quantization script, `examples/llm_ptq/hf_ptq.py`, fails when attempting to process and export a checkpoint for the meta-llama/Llama-4-Scout-17B-16E-Instruct model. The script works correctly for previous-generation models such as Llama 3, but it encounters a `TypeError` with Llama 4.
Root cause: the script expects the model's `config.json` to have an `architectures` key, which it uses to identify the model type. The Llama 4 Scout model's config file does not contain this key, so the script crashes when it tries to access `model.config.architectures[0]`.
This is a blocker, as it prevents the standard post-training quantization workflow from being used for a major new model release.
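For reference, the failing condition can be checked directly from the checkpoint, independent of the quantization flow. This is a minimal sketch assuming Transformers is installed and the model has been downloaded to the path used in the repro steps below:
```python
# Minimal check of the condition that crashes hf_ptq.py (assumes transformers
# is available and the checkpoint is at the path below).
from transformers import AutoConfig

config = AutoConfig.from_pretrained("/path/to/Llama-4-Scout-17B-16E-Instruct")
print(config.architectures)  # None for this checkpoint, so architectures[0] raises TypeError
```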
Error Log:
```
Traceback (most recent call last):
File "/mnt/TensorRT-Model-Optimizer/examples/llm_ptq/hf_ptq.py", line 772, in
main(args)
File "/mnt/TensorRT-Model-Optimizer/examples/llm_ptq/hf_ptq.py", line 616, in main
export_tensorrt_llm_checkpoint(
File "/mnt/TensorRT-Model-Optimizer/modelopt/torch/export/model_config_export.py", line 553, in export_tensorrt_llm_checkpoint
raise e
File "/mnt/TensorRT-Model-Optimizer/modelopt/torch/export/model_config_export.py", line 487, in export_tensorrt_llm_checkpoint
for (
File "/mnt/TensorRT-Model-Optimizer/modelopt/torch/export/model_config_export.py", line 154, in torch_to_tensorrt_llm_checkpoint
architecture = model.config.architectures[0]
~~~~~~~~~~~~~~~~~~~~~~~~~~^^^
TypeError: 'NoneType' object is not subscriptable
```
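A possible stopgap (not an official ModelOpt fix) is to write the missing key back into `config.json` before running the script. The architecture class name below is an assumption and should be verified against the Transformers modeling code for Llama 4:
```python
# Hypothetical workaround, not an official fix: add the missing "architectures"
# entry to config.json so model.config.architectures[0] resolves during export.
import json

config_path = "/path/to/Llama-4-Scout-17B-16E-Instruct/config.json"
with open(config_path) as f:
    cfg = json.load(f)

if "architectures" not in cfg:
    # Assumed class name for Llama 4 Scout; verify against the installed Transformers version.
    cfg["architectures"] = ["Llama4ForConditionalGeneration"]
    with open(config_path, "w") as f:
        json.dump(cfg, f, indent=2)
```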
Steps/Code to reproduce bug
- Set up an environment using an NVIDIA TensorRT-LLM container.
- Download the Llama 4 Scout model from Hugging Face:
```bash
huggingface-cli download meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --local-dir /path/to/Llama-4-Scout-17B-16E-Instruct
```
- Run the PTQ script:
```bash
python3 examples/llm_ptq/hf_ptq.py \
  --model_dir /path/to/Llama-4-Scout-17B-16E-Instruct \
  --output_dir /tmp/quantized_llama4 \
  --calib_dataset cnn_dailymail \
  --num_calib_size 32 \
  --dtype float16 \
  --qformat fp8
```
Expected behavior
The `hf_ptq.py` script should successfully quantize the Llama 4 model and export a TensorRT-LLM-compatible checkpoint, just as it does for Llama 3 models.
System information
- Container used (if applicable): Official NVIDIA TensorRT-LLM container
- OS (e.g., Ubuntu 22.04, CentOS 7, Windows 10): Ubuntu 24.04
- CPU architecture (x86_64, aarch64): x86_64
- GPU name (e.g. H100, A100, L40S): H100
- GPU memory size: 80 GB
- Number of GPUs: 8
- Library versions (if applicable):
- Python: 3.12
- ModelOpt version or commit hash: latest inside container
- CUDA: 12.9
- PyTorch: 2.8.0a0+5228986c39.nv25.5
- Transformers: (container version)
- TensorRT-LLM: 0.21.0
- ONNXRuntime: (container version)
- TensorRT: (container version)
- Any other details that may help: Bug only occurs with Llama 4 Scout models (`config.json` is missing the `architectures` key).