
[Bug]: peak_ram is high during first-time quantization #1619

Description

@xin3he

Problem Description

With a cached (hashed) dataset, peak_ram stays below 3GB; when hashing fails and the dataset must be re-tokenized, peak_ram rises to about 8GB.

  • Dataset hashing currently fails on A100; this needs to be fixed.
  • Reduce the first-time peak_ram.
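To compare against the peak_ram numbers reported below, here is a minimal way to read a process's peak resident memory on Linux (a sketch using the stdlib resource module; auto-round's own peak_ram reporting may be implemented differently):

```python
import resource

def peak_ram_gb() -> float:
    """Return this process's peak resident set size in GB."""
    # ru_maxrss is reported in kilobytes on Linux (bytes on macOS)
    kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return kb / (1024 ** 2)

print(f"peak_ram: {peak_ram_gb():.2f}GB")
```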

Reproduction Steps

auto_round Qwen/Qwen3-0.6B

Environment Information

A100

Error Logs

################################# Hash failure #############################
2026-03-26 13:07:47 INFO base.py L1800: start to cache block inputs
Parameter 'function'=<function get_tokenizer_function.<locals>.default_tokenizer_function at 0x7ae2a39919e0> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only shown once. Subsequent hashing failures won't be shown.
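The warning above comes from datasets' fingerprinting: the tokenizer function is a closure defined inside get_tokenizer_function, and local functions generally cannot be serialized by reference. A minimal sketch of the failure mode using stdlib pickle (datasets actually tries dill first, so the exact A100 failure may differ; the names and seqlen below are illustrative, not auto-round's real code):

```python
import pickle

def get_tokenizer_function():
    seqlen = 2048  # captured state makes the inner function a closure

    def default_tokenizer_function(example):
        return {"ids": example["text"].split()[:seqlen]}

    return default_tokenizer_function

fn = get_tokenizer_function()
try:
    pickle.dumps(fn)  # local objects can't be pickled by reference
    hashable = True
except (AttributeError, pickle.PicklingError):
    hashable = False
# hashable is False, so datasets falls back to a random fingerprint
# and re-tokenizes the dataset on every run
```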


################################# First time #############################
2026-03-26 13:20:10 INFO base.py L1800: start to cache block inputs
README.md: 100%| 373/373 [00:00<00:00, 1.85MB/s]
dataset_infos.json: 100%| 921/921 [00:00<00:00, 3.68MB/s]
data/train-00000-of-00001-4746b8785c874c(??): 100%| 33.3M/33.3M [00:03<00:00, 11.0MB/s]
Generating train split: 100%| 10000/10000 [00:00<00:00, 27910.62 examples/s]
Map: 100%| 10000/10000 [00:37<00:00, 264.02 examples/s]
Filter: 100%| 10000/10000 [00:05<00:00, 1687.67 examples/s]
Casting the dataset: 100%| 1216/1216 [00:04<00:00, 299.25 examples/s]
2026-03-26 13:21:16 INFO base.py L1817: caching done
Quantizing model.layers.0:   0%| 0/28 [00:00<?, ?it/s]
/home/xinhe/auto-round/.venv/lib/python3.12/site-packages/torch/autograd/graph.py:865: UserWarning: Flash Attention defaults to a non-deterministic algorithm. To explicitly enable determinism call torch.use_deterministic_algorithms(True, warn_only=False). (Triggered internally at /pytorch/aten/src/ATen/native/transformers/cuda/attention_backward.cu:114.)
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
# first time
quantized 7/7 layers in the block, loss iter 0: 0.001116 -> iter 175: 0.000317, 'peak_ram': 8.85GB, 'peak_vram': 3.51GB
################################# Second time #############################
2026-03-26 13:22:16 INFO base.py L1800: start to cache block inputs
2026-03-26 13:22:24 INFO base.py L1817: caching done
Quantizing model.layers.0:   0%| 0/28 [00:00<?, ?it/s]
/home/xinhe/auto-round/.venv/lib/python3.12/site-packages/torch/autograd/graph.py:865: UserWarning: Flash Attention defaults to a non-deterministic algorithm. To explicitly enable determinism call torch.use_deterministic_algorithms(True, warn_only=False). (Triggered internally at /pytorch/aten/src/ATen/native/transformers/cuda/attention_backward.cu:114.)
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
quantized 7/7 layers in the block, loss iter 0: 0.001116 -> iter 175: 0.000315,'peak_ram': 2.01GB, 'peak_vram': 3.51GB
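The second run confirms that a usable cache brings peak_ram down to ~2GB. One possible direction for the hash fix (an assumption, not the project's actual patch): replace the closure with a module-level function whose parameters are bound via functools.partial, which serializes by reference and should give datasets a stable fingerprint:

```python
import pickle
from functools import partial

def tokenizer_function(example, seqlen):
    # module-level functions pickle by reference, so fingerprinting works
    return {"ids": example["text"].split()[:seqlen]}

# bind the captured parameter explicitly instead of closing over it
fn = partial(tokenizer_function, seqlen=2048)
payload = pickle.dumps(fn)   # succeeds, unlike the closure version
restored = pickle.loads(payload)
```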

Additional Context

No response

Metadata

Labels: bug (Something isn't working)
