
Post-training quantization script (hf_ptq.py) fails on Llama-4-Scout-17B due to missing architectures key in config.json #340

@rajulshakya777

Description

The generic post-training quantization script, examples/llm_ptq/hf_ptq.py, fails when attempting to process and export a checkpoint for the meta-llama/Llama-4-Scout-17B-16E-Instruct model.

The script works correctly for previous-generation models such as Llama 3, but encounters a TypeError with Llama 4.

Root cause: the script expects the model's config.json to contain an architectures key, which it uses to identify the model type.
The Llama 4 Scout checkpoint's config.json does not include this key, so model.config.architectures resolves to None and the access model.config.architectures[0] raises a TypeError.

This is a blocker as it prevents the use of the standard post-training quantization workflow for a major new model release.
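
As a possible interim workaround (a sketch only, untested), the missing key can be added to the downloaded config.json before running hf_ptq.py. The class name below is an assumption; replace it with whatever class transformers actually resolves for this checkpoint:

```python
import json
from pathlib import Path

config_path = Path("/path/to/Llama-4-Scout-17B-16E-Instruct") / "config.json"
config = json.loads(config_path.read_text())

# "Llama4ForConditionalGeneration" is an assumption; use the model class that
# transformers loads for this checkpoint if it differs.
config.setdefault("architectures", ["Llama4ForConditionalGeneration"])
config_path.write_text(json.dumps(config, indent=2))
```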

Error Log:
```
Traceback (most recent call last):
  File "/mnt/TensorRT-Model-Optimizer/examples/llm_ptq/hf_ptq.py", line 772, in <module>
    main(args)
  File "/mnt/TensorRT-Model-Optimizer/examples/llm_ptq/hf_ptq.py", line 616, in main
    export_tensorrt_llm_checkpoint(
  File "/mnt/TensorRT-Model-Optimizer/modelopt/torch/export/model_config_export.py", line 553, in export_tensorrt_llm_checkpoint
    raise e
  File "/mnt/TensorRT-Model-Optimizer/modelopt/torch/export/model_config_export.py", line 487, in export_tensorrt_llm_checkpoint
    for (
  File "/mnt/TensorRT-Model-Optimizer/modelopt/torch/export/model_config_export.py", line 154, in torch_to_tensorrt_llm_checkpoint
    architecture = model.config.architectures[0]
                   ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^
TypeError: 'NoneType' object is not subscriptable
```
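
A possible direction for a fix (a sketch only, not the existing ModelOpt implementation) would be for torch_to_tensorrt_llm_checkpoint to tolerate a missing architectures entry and fall back to the loaded model's class name:

```python
def resolve_architecture(model) -> str:
    """Hypothetical helper: return the architecture name even when the
    checkpoint's config.json carries no "architectures" entry."""
    architectures = getattr(model.config, "architectures", None)
    if architectures:  # normal path, e.g. ["LlamaForCausalLM"] for Llama 3
        return architectures[0]
    # Fallback for checkpoints such as Llama-4-Scout-17B-16E-Instruct whose
    # config leaves the field unset: use the Python class of the loaded model.
    return type(model).__name__
```

The failing line `architecture = model.config.architectures[0]` would then become `architecture = resolve_architecture(model)`.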


Steps/Code to reproduce bug

  1. Set up an environment using an NVIDIA TensorRT-LLM container.
  2. Download the Llama 4 Scout model from Hugging Face:
    ```bash
    huggingface-cli download meta-llama/Llama-4-Scout-17B-16E-Instruct \
        --local-dir /path/to/Llama-4-Scout-17B-16E-Instruct
    ```
  3. Run the PTQ script:
    ```bash
    python3 examples/llm_ptq/hf_ptq.py \
        --model_dir /path/to/Llama-4-Scout-17B-16E-Instruct \
        --output_dir /tmp/quantized_llama4 \
        --calib_dataset cnn_dailymail \
        --num_calib_size 32 \
        --dtype float16 \
        --qformat fp8
    ```

Expected behavior

The hf_ptq.py script should successfully quantize the Llama 4 model and export a TensorRT-LLM compatible checkpoint, just as it does for Llama 3 models.


System information

  • Container used (if applicable): Official NVIDIA TensorRT-LLM container
  • OS (e.g., Ubuntu 22.04, CentOS 7, Windows 10): Ubuntu 24.04
  • CPU architecture (x86_64, aarch64): x86_64
  • GPU name (e.g. H100, A100, L40S): H100
  • GPU memory size: 80 GB
  • Number of GPUs: 8
  • Library versions (if applicable):
    • Python: 3.12
    • ModelOpt version or commit hash: latest inside container
    • CUDA: 12.9
    • PyTorch: 2.8.0a0+5228986c39.nv25.5
    • Transformers: (container version)
    • TensorRT-LLM: 0.21.0
    • ONNXRuntime: (container version)
    • TensorRT: (container version)
  • Any other details that may help: the bug only occurs with the Llama 4 Scout model; its config.json does not contain the architectures key.
