Support quantizing native FP8 models #536

@mratsim

Description

New models are coming in native FP8 form, for example Minimax-M2.1

However, trying to quantize them is a game of whack-a-mole with unsupported torch features in both compressed-tensors and llm-compressor.

In llm-compressor I'm hit by an unsupported Float8 dtype promotion:

  File "[...]/.venv/lib/python3.12/site-packages/compressed_tensors/quantization/lifecycle/forward.py", line 471, in _quantize
    scaled = x / scale
             ~~^~~~~~~
RuntimeError: Promotion for Float8 Types is not supported, attempted to promote Float8_e4m3fn and Float

def _quantize(
    x: torch.Tensor,
    scale: torch.Tensor,
    zero_point: torch.Tensor,
    q_min: torch.Tensor,
    q_max: torch.Tensor,
    args: QuantizationArgs,
    dtype: Optional[torch.dtype] = None,
    global_scale: Optional[torch.Tensor] = None,
) -> torch.Tensor:
    # if a global scale is optionally provided, use it
    # to further scale the local `scale` parameter
    if global_scale is not None:
        scale = scale / global_scale

    scaled = x / scale

Unimplemented min/max/abs kernels

  File "[...]/.venv/lib/python3.12/site-packages/compressed_tensors/quantization/utils/helpers.py", line 432, in generate_gparam
    min_vals = torch.min(updated_min_val, torch.zeros_like(updated_min_val))
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
NotImplementedError: "min_elementwise_cuda" not implemented for 'Float8_e4m3fn'
  File "[...]/.venv/lib/python3.12/site-packages/compressed_tensors/quantization/utils/helpers.py", line 95, in calculate_qparams
    max_val_pos = torch.max(torch.abs(min_vals), torch.abs(max_vals))
                            ^^^^^^^^^^^^^^^^^^^
NotImplementedError: "abs_cuda" not implemented for 'Float8_e4m3fn'

Downstream issue: vllm-project/llm-compressor#2172 (comment)
