
[Bug]: peak_ram is high during first-time quantization #1619

Description

@xin3he

Problem Description

With a cached (hashed) dataset, peak_ram stays below 3GB; when hashing fails and the dataset must be re-tokenized, peak_ram rises to about 8GB.

  • Dataset hashing currently fails on A100; this needs to be fixed.
  • Reduce the first-time peak_ram.
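To compare against the peak_ram numbers reported below, here is a minimal way to read a process's peak resident memory on Linux (a sketch using the stdlib resource module; auto-round's own peak_ram reporting may be implemented differently):

```python
import resource

def peak_ram_gb() -> float:
    """Return this process's peak resident set size in GB."""
    # ru_maxrss is reported in kilobytes on Linux (bytes on macOS)
    kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return kb / (1024 ** 2)

print(f"peak_ram: {peak_ram_gb():.2f}GB")
```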

Reproduction Steps

auto_round Qwen/Qwen3-0.6B

Environment Information

A100

Error Logs

################################# Hash failure #############################
2026-03-26 13:07:47 INFO base.py L1800: start to cache block inputs
Parameter 'function'=<function get_tokenizer_function.<locals>.default_tokenizer_function at 0x7ae2a39919e0> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only shown once. Subsequent hashing failures won't be shown.
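The warning above comes from datasets' fingerprinting: the tokenizer function is a closure defined inside get_tokenizer_function, and local functions generally cannot be serialized by reference. A minimal sketch of the failure mode using stdlib pickle (datasets actually tries dill first, so the exact A100 failure may differ; the names and seqlen below are illustrative, not auto-round's real code):

```python
import pickle

def get_tokenizer_function():
    seqlen = 2048  # captured state makes the inner function a closure

    def default_tokenizer_function(example):
        return {"ids": example["text"].split()[:seqlen]}

    return default_tokenizer_function

fn = get_tokenizer_function()
try:
    pickle.dumps(fn)  # local objects can't be pickled by reference
    hashable = True
except (AttributeError, pickle.PicklingError):
    hashable = False
# hashable is False, so datasets falls back to a random fingerprint
# and re-tokenizes the dataset on every run
```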


################################# First time #############################
2026-03-26 13:20:10 INFO base.py L1800: start to cache block inputs
README.md: 100%| 373/373 [00:00<00:00, 1.85MB/s]
dataset_infos.json: 100%| 921/921 [00:00<00:00, 3.68MB/s]
data/train-00000-of-00001-4746b8785c874c(??): 100%| 33.3M/33.3M [00:03<00:00, 11.0MB/s]
Generating train split: 100%| 10000/10000 [00:00<00:00, 27910.62 examples/s]
Map: 100%| 10000/10000 [00:37<00:00, 264.02 examples/s]
Filter: 100%| 10000/10000 [00:05<00:00, 1687.67 examples/s]
Casting the dataset: 100%| 1216/1216 [00:04<00:00, 299.25 examples/s]
2026-03-26 13:21:16 INFO base.py L1817: caching done
Quantizing model.layers.0:   0%| 0/28 [00:00<?, ?it/s]
/home/xinhe/auto-round/.venv/lib/python3.12/site-packages/torch/autograd/graph.py:865: UserWarning: Flash Attention defaults to a non-deterministic algorithm. To explicitly enable determinism call torch.use_deterministic_algorithms(True, warn_only=False). (Triggered internally at /pytorch/aten/src/ATen/native/transformers/cuda/attention_backward.cu:114.)
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
# first time
quantized 7/7 layers in the block, loss iter 0: 0.001116 -> iter 175: 0.000317, 'peak_ram': 8.85GB, 'peak_vram': 3.51GB
################################# Second time #############################
2026-03-26 13:22:16 INFO base.py L1800: start to cache block inputs
2026-03-26 13:22:24 INFO base.py L1817: caching done
Quantizing model.layers.0:   0%| 0/28 [00:00<?, ?it/s]
/home/xinhe/auto-round/.venv/lib/python3.12/site-packages/torch/autograd/graph.py:865: UserWarning: Flash Attention defaults to a non-deterministic algorithm. To explicitly enable determinism call torch.use_deterministic_algorithms(True, warn_only=False). (Triggered internally at /pytorch/aten/src/ATen/native/transformers/cuda/attention_backward.cu:114.)
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
quantized 7/7 layers in the block, loss iter 0: 0.001116 -> iter 175: 0.000315,'peak_ram': 2.01GB, 'peak_vram': 3.51GB
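The second run confirms that a usable cache brings peak_ram down to ~2GB. One possible direction for the hash fix (an assumption, not the project's actual patch): replace the closure with a module-level function whose parameters are bound via functools.partial, which serializes by reference and should give datasets a stable fingerprint:

```python
import pickle
from functools import partial

def tokenizer_function(example, seqlen):
    # module-level functions pickle by reference, so fingerprinting works
    return {"ids": example["text"].split()[:seqlen]}

# bind the captured parameter explicitly instead of closing over it
fn = partial(tokenizer_function, seqlen=2048)
payload = pickle.dumps(fn)   # succeeds, unlike the closure version
restored = pickle.loads(payload)
```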

Additional Context

No response

Metadata

Labels: bug (Something isn't working)
