[Bug]: Granite 4.0 h small FP8 quantization fails (model can't load) #2339

@mramendi

⚙️ Your current environment

The output of python collect_env.py
### Environment Information ###
Operating System: `Linux-6.8.0-94-generic-x86_64-with-glibc2.39`
Python Version: `3.12.3 (main, Nov  6 2025, 13:44:16) [GCC 13.3.0]`
llm-compressor Version: `0.9.0.1`
compressed-tensors Version: `0.13.0`
transformers Version: `4.57.3`
torch Version: `2.9.1`
CUDA Devices: `['NVIDIA RTX PRO 6000 Blackwell Workstation Edition']`
AMD Devices: `None`
NPU Devices: `None`

🐛 Describe the bug

I have this script, which as far as I can see is a straightforward implementation of the Granite 4 example in the README:

test_fp8_no_exclusion.py

Quantization completes successfully, but the resulting model cannot be loaded. When I try to serve it with vLLM, the engine crashes:

(EngineCore_DP0 pid=14675) ERROR 02-07 03:59:09 [core.py:946] EngineCore failed to start.
(EngineCore_DP0 pid=14675) ERROR 02-07 03:59:09 [core.py:946] Traceback (most recent call last):                                                                                                                                                                                                                            
(EngineCore_DP0 pid=14675) ERROR 02-07 03:59:09 [core.py:946]   File "/opt/venv/datasci/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 937, in run_engine_core                                                                                                                                                  
(EngineCore_DP0 pid=14675) ERROR 02-07 03:59:09 [core.py:946]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)                         
(EngineCore_DP0 pid=14675) ERROR 02-07 03:59:09 [core.py:946]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                         
(EngineCore_DP0 pid=14675) ERROR 02-07 03:59:09 [core.py:946]   File "/opt/venv/datasci/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 691, in __init__                                                                                                                                                         
(EngineCore_DP0 pid=14675) ERROR 02-07 03:59:09 [core.py:946]     super().__init__(                                                                           
(EngineCore_DP0 pid=14675) ERROR 02-07 03:59:09 [core.py:946]   File "/opt/venv/datasci/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 105, in __init__                                                                                                                                                         
(EngineCore_DP0 pid=14675) ERROR 02-07 03:59:09 [core.py:946]     self.model_executor = executor_class(vllm_config)                    
(EngineCore_DP0 pid=14675) ERROR 02-07 03:59:09 [core.py:946]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=14675) ERROR 02-07 03:59:09 [core.py:946]   File "/opt/venv/datasci/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 101, in __init__                                                                                                                                                   
(EngineCore_DP0 pid=14675) ERROR 02-07 03:59:09 [core.py:946]     self._init_executor()                                                                       
(EngineCore_DP0 pid=14675) ERROR 02-07 03:59:09 [core.py:946]   File "/opt/venv/datasci/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 48, in _init_executor                                                                                                                                      
(EngineCore_DP0 pid=14675) ERROR 02-07 03:59:09 [core.py:946]     self.driver_worker.load_model()     
(EngineCore_DP0 pid=14675) ERROR 02-07 03:59:09 [core.py:946]   File "/opt/venv/datasci/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 275, in load_model
(EngineCore_DP0 pid=14675) ERROR 02-07 03:59:09 [core.py:946]     self.model_runner.load_model(eep_scale_up=eep_scale_up)
(EngineCore_DP0 pid=14675) ERROR 02-07 03:59:09 [core.py:946]   File "/opt/venv/datasci/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4052, in load_model
(EngineCore_DP0 pid=14675) ERROR 02-07 03:59:09 [core.py:946]     self.model = model_loader.load_model(
(EngineCore_DP0 pid=14675) ERROR 02-07 03:59:09 [core.py:946]                  ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=14675) ERROR 02-07 03:59:09 [core.py:946]   File "/opt/venv/datasci/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 58, in load_model
(EngineCore_DP0 pid=14675) ERROR 02-07 03:59:09 [core.py:946]     self.load_weights(model, model_config)
(EngineCore_DP0 pid=14675) ERROR 02-07 03:59:09 [core.py:946]   File "/opt/venv/datasci/lib/python3.12/site-packages/vllm/model_executor/model_loader/default_loader.py", line 288, in load_weights
(EngineCore_DP0 pid=14675) ERROR 02-07 03:59:09 [core.py:946]     loaded_weights = model.load_weights(self.get_all_weights(model_config, model))
(EngineCore_DP0 pid=14675) ERROR 02-07 03:59:09 [core.py:946]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=14675) ERROR 02-07 03:59:09 [core.py:946]   File "/opt/venv/datasci/lib/python3.12/site-packages/vllm/model_executor/models/granitemoehybrid.py", line 709, in load_weights
(EngineCore_DP0 pid=14675) ERROR 02-07 03:59:09 [core.py:946]     return loader.load_weights(weights)
(EngineCore_DP0 pid=14675) ERROR 02-07 03:59:09 [core.py:946]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=14675) ERROR 02-07 03:59:09 [core.py:946]   File "/opt/venv/datasci/lib/python3.12/site-packages/vllm/model_executor/model_loader/online_quantization.py", line 173, in patched_model_load_weights
(EngineCore_DP0 pid=14675) ERROR 02-07 03:59:09 [core.py:946]     return original_load_weights(auto_weight_loader, weights, mapper=mapper)
(EngineCore_DP0 pid=14675) ERROR 02-07 03:59:09 [core.py:946]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=14675) ERROR 02-07 03:59:09 [core.py:946]   File "/opt/venv/datasci/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 342, in load_weights
(EngineCore_DP0 pid=14675) ERROR 02-07 03:59:09 [core.py:946]     autoloaded_weights = set(self._load_module("", self.module, weights))
(EngineCore_DP0 pid=14675) ERROR 02-07 03:59:09 [core.py:946]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=14675) ERROR 02-07 03:59:09 [core.py:946]   File "/opt/venv/datasci/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 290, in _load_module
(EngineCore_DP0 pid=14675) ERROR 02-07 03:59:09 [core.py:946]     yield from self._load_module(
(EngineCore_DP0 pid=14675) ERROR 02-07 03:59:09 [core.py:946]   File "/opt/venv/datasci/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 263, in _load_module
(EngineCore_DP0 pid=14675) ERROR 02-07 03:59:09 [core.py:946]     loaded_params = module_load_weights(weights)
(EngineCore_DP0 pid=14675) ERROR 02-07 03:59:09 [core.py:946]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=14675) ERROR 02-07 03:59:09 [core.py:946]   File "/opt/venv/datasci/lib/python3.12/site-packages/vllm/model_executor/models/granitemoehybrid.py", line 577, in load_weights
(EngineCore_DP0 pid=14675) ERROR 02-07 03:59:09 [core.py:946]     _load(n, p)
(EngineCore_DP0 pid=14675) ERROR 02-07 03:59:09 [core.py:946]   File "/opt/venv/datasci/lib/python3.12/site-packages/vllm/model_executor/models/granitemoehybrid.py", line 444, in _load
(EngineCore_DP0 pid=14675) ERROR 02-07 03:59:09 [core.py:946]     param = params_dict[n]
(EngineCore_DP0 pid=14675) ERROR 02-07 03:59:09 [core.py:946]             ~~~~~~~~~~~^^^
(EngineCore_DP0 pid=14675) ERROR 02-07 03:59:09 [core.py:946] KeyError: 'layers.0.block_sparse_moe.router.layer.weight_scale'
(EngineCore_DP0 pid=14675) Process EngineCore_DP0:
(EngineCore_DP0 pid=14675) Traceback (most recent call last):
(EngineCore_DP0 pid=14675)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=14675)     self.run()
(EngineCore_DP0 pid=14675)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=14675)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=14675)   File "/opt/venv/datasci/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 950, in run_engine_core
(EngineCore_DP0 pid=14675)     raise e
(EngineCore_DP0 pid=14675)   File "/opt/venv/datasci/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 937, in run_engine_core
(EngineCore_DP0 pid=14675)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=14675)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=14675)   File "/opt/venv/datasci/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 691, in __init__
(EngineCore_DP0 pid=14675)     super().__init__(
(EngineCore_DP0 pid=14675)   File "/opt/venv/datasci/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 105, in __init__
(EngineCore_DP0 pid=14675)     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=14675)                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=14675)   File "/opt/venv/datasci/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=14675)     self._init_executor()
(EngineCore_DP0 pid=14675)   File "/opt/venv/datasci/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 48, in _init_executor
(EngineCore_DP0 pid=14675)     self.driver_worker.load_model()
(EngineCore_DP0 pid=14675)   File "/opt/venv/datasci/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 275, in load_model
(EngineCore_DP0 pid=14675)     self.model_runner.load_model(eep_scale_up=eep_scale_up)
(EngineCore_DP0 pid=14675)   File "/opt/venv/datasci/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4052, in load_model
(EngineCore_DP0 pid=14675)     self.model = model_loader.load_model(
(EngineCore_DP0 pid=14675)                  ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=14675)   File "/opt/venv/datasci/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 58, in load_model
(EngineCore_DP0 pid=14675)     self.load_weights(model, model_config)
(EngineCore_DP0 pid=14675)   File "/opt/venv/datasci/lib/python3.12/site-packages/vllm/model_executor/model_loader/default_loader.py", line 288, in load_weights
(EngineCore_DP0 pid=14675)     loaded_weights = model.load_weights(self.get_all_weights(model_config, model))
(EngineCore_DP0 pid=14675)                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=14675)   File "/opt/venv/datasci/lib/python3.12/site-packages/vllm/model_executor/models/granitemoehybrid.py", line 709, in load_weights
(EngineCore_DP0 pid=14675)     return loader.load_weights(weights)
(EngineCore_DP0 pid=14675)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=14675)   File "/opt/venv/datasci/lib/python3.12/site-packages/vllm/model_executor/model_loader/online_quantization.py", line 173, in patched_model_load_weights
(EngineCore_DP0 pid=14675)     return original_load_weights(auto_weight_loader, weights, mapper=mapper)
(EngineCore_DP0 pid=14675)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=14675)   File "/opt/venv/datasci/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 342, in load_weights
(EngineCore_DP0 pid=14675)     autoloaded_weights = set(self._load_module("", self.module, weights))
(EngineCore_DP0 pid=14675)                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=14675)   File "/opt/venv/datasci/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 290, in _load_module
(EngineCore_DP0 pid=14675)     yield from self._load_module(
(EngineCore_DP0 pid=14675)   File "/opt/venv/datasci/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 263, in _load_module
(EngineCore_DP0 pid=14675)     loaded_params = module_load_weights(weights)
(EngineCore_DP0 pid=14675)                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=14675)   File "/opt/venv/datasci/lib/python3.12/site-packages/vllm/model_executor/models/granitemoehybrid.py", line 577, in load_weights
(EngineCore_DP0 pid=14675)     _load(n, p)
(EngineCore_DP0 pid=14675)   File "/opt/venv/datasci/lib/python3.12/site-packages/vllm/model_executor/models/granitemoehybrid.py", line 444, in _load
(EngineCore_DP0 pid=14675)     param = params_dict[n]
(EngineCore_DP0 pid=14675)             ~~~~~~~~~~~^^^
(EngineCore_DP0 pid=14675) KeyError: 'layers.0.block_sparse_moe.router.layer.weight_scale'
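
The `KeyError` suggests the saved checkpoint contains a `weight_scale` tensor for the MoE router that the `granitemoehybrid` loader has no parameter for. A minimal sketch for confirming this from the checkpoint itself, assuming the output directory is a sharded checkpoint with the usual `model.safetensors.index.json` file; the helper names `load_weight_map` and `find_router_scales` are made up for this example:

```python
import json
from pathlib import Path


def load_weight_map(checkpoint_dir):
    """Read the tensor-name -> shard-file mapping from the checkpoint index.

    Assumes a sharded checkpoint; a single-file checkpoint has no index file.
    """
    index_path = Path(checkpoint_dir) / "model.safetensors.index.json"
    with index_path.open() as f:
        return json.load(f)["weight_map"]


def find_router_scales(weight_map):
    """Return sorted tensor names that look like quantization scales on router modules."""
    return sorted(
        name
        for name in weight_map
        if ".router." in name and name.endswith("weight_scale")
    )


# Usage, with the output directory from the reproduction steps:
# for name in find_router_scales(load_weight_map("granite-4.0-h-small-fp8")):
#     print(name)
```

A non-empty result would mean the router linear layers were quantized along with everything else, which would be consistent with the missing `weight_scale` parameter in the traceback.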

Note: originally I excluded more layers from quantization (attention, embeddings, Mamba in/out, MoE router). That model loaded, but its only output was `!!!!!!!`.
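
To make the difference between the two runs easier to compare, here is a rough sketch of how an exclusion list can be checked against module names before quantizing. It assumes the `re:` prefix convention for regex entries in llm-compressor `ignore` lists; `is_ignored` is a stand-in written for illustration, not the library's matching code, and the module names are hypothetical apart from the router path, which mirrors the `KeyError` above with an assumed `model.` prefix.

```python
import re


def is_ignored(module_name, ignore):
    """Approximate ignore-list matching: exact names, or regexes marked with 're:'."""
    for entry in ignore:
        if entry.startswith("re:"):
            # Regex entries are matched from the start of the module name.
            if re.match(entry[3:], module_name):
                return True
        elif entry == module_name:
            return True
    return False


# Hypothetical module names for illustration only.
modules = [
    "lm_head",
    "model.layers.0.block_sparse_moe.router.layer",
    "model.layers.0.mamba.in_proj",
    "model.layers.0.shared_mlp.input_linear",
]

# Hypothetical exclusion list covering the head, MoE router, and Mamba in/out layers.
ignore = [
    "lm_head",
    r"re:.*block_sparse_moe\.router.*",
    r"re:.*mamba\.(in|out)_proj$",
]

for name in modules:
    status = "skipped" if is_ignored(name, ignore) else "quantized"
    print(f"{name}: {status}")
```

Listing the skipped and quantized modules for each configuration shows exactly which layers changed between the run that fails to load and the run that loads but produces garbage.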

🛠️ Steps to reproduce

$ python test_fp8_no_exclusion.py --model-name ibm-granite/granite-4.0-h-small --output granite-4.0-h-small-fp8
$ cd granite-4.0-h-small-fp8
$ vllm serve . --port 8080
