
[Feature Request] datafree pipeline with layer-by-layer quantization #1970

@mratsim

Description


I am trying to quantize GLM-4.5-Air, GLM-4.5V, and finetunes to NVFP4A16 (or FP8) on 2x RTX Pro 6000 (96GB each).

Unfortunately, GLM-4.5-Air with its 110B parameters requires about 220GB of memory in BF16, but I only have 192GB of VRAM.
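For context, a minimal sketch of the data-free flow I mean, following the llm-compressor NVFP4A16 examples (`MODEL_ID` stands for the GLM checkpoint):

```python
from transformers import AutoModelForCausalLM
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")

# Data-free: NVFP4 weight scales are computed from the weights alone,
# no calibration dataset is passed to oneshot.
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4A16", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)
```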

  • The default datafree pipeline fails with a CUDA OOM.
  • Adding device_map='auto' to AutoModelForCausalLM.from_pretrained still fails, presumably because huggingface/transformers doesn't leave enough headroom for the extra memory needed during quantization.
  • infer_auto_device_map(MODEL_ID, max_memory={0: "85GiB", 1: "85GiB", "cpu": "50GiB"}) fails with AttributeError: 'str' object has no attribute 'named_parameters', since infer_auto_device_map expects a model instance, not a model ID string (see the sketch after this list).
  • Using pipeline=sequential fails because no calibration dataset is set.
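For the third bullet, the fix on the accelerate side seems to be building the device map from an empty-weights model rather than from the model ID string:

```python
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained(MODEL_ID)
with init_empty_weights():
    # Instantiate on the meta device: no weights are allocated,
    # but the module tree exists so the device map can be computed.
    meta_model = AutoModelForCausalLM.from_config(config)

device_map = infer_auto_device_map(
    meta_model,
    max_memory={0: "85GiB", 1: "85GiB", "cpu": "50GiB"},
)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map=device_map)
```

Even then, the resulting map only budgets for the weights themselves, not for the temporaries needed during quantization, so it may still OOM.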

There might be a way to make it work by explicitly picking which layers go to which GPU and which stay on CPU, but it's quite cumbersome, for example:
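(Module names and the layer split below are guesses for GLM-4.5-Air and would need to be adapted to the real module tree.)

```python
# Hypothetical hand-written placement: first half of the decoder stack on GPU 0,
# second half on GPU 1, with embeddings and head pinned alongside.
device_map = {
    "model.embed_tokens": 0,
    **{f"model.layers.{i}": 0 for i in range(0, 23)},
    **{f"model.layers.{i}": 1 for i in range(23, 46)},
    "model.norm": 1,
    "lm_head": 1,
}
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map=device_map)
```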

Potential solutions:

  1. A calculate_offload_device_map equivalent for the datafree pipeline
  2. A datafree_sequential pipeline that does the quantization layer by layer (rough sketch below)
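For (2), a rough sketch of what I imagine (datafree_sequential and quantize_layer_weights are hypothetical names, not existing llm-compressor APIs):

```python
import torch

def datafree_sequential(model, quantize_layer_weights):
    """Hypothetical data-free layer-by-layer pipeline: only one decoder
    layer lives on GPU at a time, so peak VRAM stays near one layer's size."""
    for layer in model.model.layers:
        layer.to("cuda")               # move a single layer on-device
        quantize_layer_weights(layer)  # data-free: scales come from weights only
        layer.to("cpu")                # offload the now-quantized layer
        torch.cuda.empty_cache()
```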


Labels: enhancement (New feature or request), nvfp4 (For any PR / issue related to NVFP4 support)
