Status: Open
Labels: enhancement (New feature or request), nvfp4 (For any PR / issue related to NVFP4 support)
Description
I am trying to quantize GLM-4.5-Air, GLM-4.5V, and finetunes to NVFP4A16 (or FP8) on 2x RTX Pro 6000.
Unfortunately, GLM-4.5-Air with its 110B parameters requires about 220GB of memory at 16-bit precision, while I have 192GB of VRAM.
- The default `datafree` pipeline fails with a CUDA OOM.
- Adding `device_map='auto'` to `AutoModelForCausalLM.from_pretrained` still fails, I assume because `huggingface/transformers` doesn't leave enough space for the extra memory needed during quantization.
- `infer_auto_device_map(MODEL_ID, max_memory={0: "85GiB", 1: "85GiB", "cpu": "50GiB"})` fails with `AttributeError: 'str' object has no attribute 'named_parameters'`; the helper expects a model object, not the ID string (see the sketch after this list).
- Using `pipeline="sequential"` fails because no dataset is set.
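For reference, a minimal sketch of the corrected `infer_auto_device_map` call, built against an empty-weight copy of the model so nothing is actually allocated. `zai-org/GLM-4.5-Air` is assumed as the checkpoint ID; the memory budgets are the ones from above:

```python
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

MODEL_ID = "zai-org/GLM-4.5-Air"  # assumed checkpoint ID

# Build the architecture on the meta device so no weight memory is allocated.
config = AutoConfig.from_pretrained(MODEL_ID)
with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_config(config)

# Pass the model object (not the string) together with per-device budgets.
# _no_split_modules is the (private) attribute accelerate examples use to
# keep each decoder layer on a single device.
device_map = infer_auto_device_map(
    empty_model,
    max_memory={0: "85GiB", 1: "85GiB", "cpu": "50GiB"},
    no_split_module_classes=empty_model._no_split_modules,
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map=device_map, torch_dtype="auto"
)
```

Even with a valid map, this reserves no headroom for the scratch memory quantization needs, which is the gap the proposals below are meant to close.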
There might be a way to make it work by explicitly picking which layers go to which GPU and CPU, but it's quite cumbersome; a sketch of that route follows below.
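This assumes the usual transformers module layout for GLM (`model.embed_tokens`, `model.layers.N`, `model.norm`, `lm_head`) and an illustrative 46-layer split; the real names and counts should be checked against `model.named_modules()`:

```python
from transformers import AutoModelForCausalLM

MODEL_ID = "zai-org/GLM-4.5-Air"  # assumed checkpoint ID

# Pin layer ranges to devices by module name. The names, layer count, and
# split points are illustrative assumptions, not measured values.
device_map = {"model.embed_tokens": 0, "model.norm": 1, "lm_head": 1}
device_map.update({f"model.layers.{i}": 0 for i in range(0, 20)})       # GPU 0
device_map.update({f"model.layers.{i}": 1 for i in range(20, 40)})      # GPU 1
device_map.update({f"model.layers.{i}": "cpu" for i in range(40, 46)})  # offload

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map=device_map, torch_dtype="auto"
)
```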
Potential solutions:
- A `calculate_offload_device_map` for datafree (sketched after this list)
- A `datafree_sequential` pipeline that does the quantization layer by layer
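To make (1) concrete, a hypothetical sketch of the desired datafree flow. `calculate_offload_device_map` appears in older llm-compressor examples, but whether it reserves the right headroom for a datafree NVFP4A16 run is exactly what this issue asks for, so treat the call below as illustrative rather than current API:

```python
import torch
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers.compression.helpers import (
    calculate_offload_device_map,  # from older examples; availability may vary
)
from transformers import AutoModelForCausalLM

MODEL_ID = "zai-org/GLM-4.5-Air"  # assumed checkpoint ID

# Hypothetical: a device map that leaves room for quantization scratch memory.
device_map = calculate_offload_device_map(
    MODEL_ID, num_gpus=2, torch_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map=device_map, torch_dtype=torch.bfloat16
)

# Weight-only NVFP4 (NVFP4A16) is data-free: no calibration dataset needed.
recipe = QuantizationModifier(
    targets="Linear", scheme="NVFP4A16", ignore=["lm_head"]
)
oneshot(model=model, recipe=recipe)
model.save_pretrained("GLM-4.5-Air-NVFP4A16", save_compressed=True)
```

For (2), the same recipe would apply, with the pipeline loading, quantizing, and offloading one decoder layer at a time so peak GPU usage stays near a single layer's footprint.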