
[Feature Request] datafree pipeline with layer-by-layer quantization #1970

@mratsim

Description


I am trying to quantize GLM-4.5-Air, GLM-4.5V, and finetunes to NVFP4A16 (or FP8) on 2x RTX Pro 6000 (96GB each).

Unfortunately, GLM-4.5-Air with its 110B parameters requires about 220GB of memory in BF16, but I only have 192GB of VRAM.
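For context, a minimal sketch of the data-free flow I mean, following the llm-compressor NVFP4A16 examples (`MODEL_ID` stands for the GLM checkpoint):

```python
from transformers import AutoModelForCausalLM
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")

# Data-free: NVFP4 weight scales are computed from the weights alone,
# no calibration dataset is passed to oneshot.
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4A16", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)
```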

  • The default datafree pipeline fails with a CUDA OOM.
  • Adding device_map='auto' to AutoModelForCausalLM.from_pretrained still fails, presumably because huggingface/transformers doesn't leave enough headroom for the extra memory needed during quantization.
  • infer_auto_device_map(MODEL_ID, max_memory={0: "85GiB", 1: "85GiB", "cpu": "50GiB"}) fails with AttributeError: 'str' object has no attribute 'named_parameters', since infer_auto_device_map expects a model instance, not a model ID string (see the sketch after this list).
  • Using pipeline=sequential fails because no calibration dataset is set.
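For the third bullet, the fix on the accelerate side seems to be building the device map from an empty-weights model rather than from the model ID string:

```python
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained(MODEL_ID)
with init_empty_weights():
    # Instantiate on the meta device: no weights are allocated,
    # but the module tree exists so the device map can be computed.
    meta_model = AutoModelForCausalLM.from_config(config)

device_map = infer_auto_device_map(
    meta_model,
    max_memory={0: "85GiB", 1: "85GiB", "cpu": "50GiB"},
)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map=device_map)
```

Even then, the resulting map only budgets for the weights themselves, not for the temporaries needed during quantization, so it may still OOM.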

There might be a way to make it work by explicitly picking which layers go to which GPU and which stay on CPU, but it's quite cumbersome, for example:
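(Module names and the layer split below are guesses for GLM-4.5-Air and would need to be adapted to the real module tree.)

```python
# Hypothetical hand-written placement: first half of the decoder stack on GPU 0,
# second half on GPU 1, with embeddings and head pinned alongside.
device_map = {
    "model.embed_tokens": 0,
    **{f"model.layers.{i}": 0 for i in range(0, 23)},
    **{f"model.layers.{i}": 1 for i in range(23, 46)},
    "model.norm": 1,
    "lm_head": 1,
}
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map=device_map)
```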

Potential solutions:

  1. A calculate_offload_device_map equivalent for the datafree pipeline
  2. A datafree_sequential pipeline that does the quantization layer by layer (rough sketch below)
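For (2), a rough sketch of what I imagine (datafree_sequential and quantize_layer_weights are hypothetical names, not existing llm-compressor APIs):

```python
import torch

def datafree_sequential(model, quantize_layer_weights):
    """Hypothetical data-free layer-by-layer pipeline: only one decoder
    layer lives on GPU at a time, so peak VRAM stays near one layer's size."""
    for layer in model.model.layers:
        layer.to("cuda")               # move a single layer on-device
        quantize_layer_weights(layer)  # data-free: scales come from weights only
        layer.to("cpu")                # offload the now-quantized layer
        torch.cuda.empty_cache()
```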


Labels: enhancement (New feature or request), nvfp4 (For any PR / issue related to NVFP4 support)
