Support quantizing native FP8 models #536

@mratsim

Description

New models are coming in native FP8 form, for example Minimax-M2.1

However, trying to quantize them is a game of whack-a-mole with unsupported torch features in both compressed-tensors and llm-compressor.

In llm-compressor I'm hit by an unsupported Float8 dtype promotion:

  File "[...]/.venv/lib/python3.12/site-packages/compressed_tensors/quantization/lifecycle/forward.py", line 471, in _quantize
    scaled = x / scale
             ~~^~~~~~~
RuntimeError: Promotion for Float8 Types is not supported, attempted to promote Float8_e4m3fn and Float

def _quantize(
    x: torch.Tensor,
    scale: torch.Tensor,
    zero_point: torch.Tensor,
    q_min: torch.Tensor,
    q_max: torch.Tensor,
    args: QuantizationArgs,
    dtype: Optional[torch.dtype] = None,
    global_scale: Optional[torch.Tensor] = None,
) -> torch.Tensor:
    # if a global scale is optionally provided, use it
    # to further scale the local `scale` parameter
    if global_scale is not None:
        scale = scale / global_scale

    scaled = x / scale

Unimplemented min/max/abs kernels

  File "[...]/.venv/lib/python3.12/site-packages/compressed_tensors/quantization/utils/helpers.py", line 432, in generate_gparam
    min_vals = torch.min(updated_min_val, torch.zeros_like(updated_min_val))
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
NotImplementedError: "min_elementwise_cuda" not implemented for 'Float8_e4m3fn'
  File "[...]/.venv/lib/python3.12/site-packages/compressed_tensors/quantization/utils/helpers.py", line 95, in calculate_qparams
    max_val_pos = torch.max(torch.abs(min_vals), torch.abs(max_vals))
                            ^^^^^^^^^^^^^^^^^^^
NotImplementedError: "abs_cuda" not implemented for 'Float8_e4m3fn'

Downstream issue: vllm-project/llm-compressor#2172 (comment)
