Conversation
Add a --dry-run flag to the CLI that estimates VRAM usage, output file size, and approximate quantization time without running the full quantization process. Uses AutoConfig to load model architecture metadata without downloading weights. New module: auto_round/estimation.py with estimation functions for parameter count, peak VRAM, output size, and time. Relates to intel#1551 and intel#1584 Fixes intel#1591 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
Refactor _count_parameters into smaller helpers to reduce local variable count. Convert dry_run_estimate to use **kwargs and extract helpers for config loading and result building. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
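The refactor described above can be sketched roughly as follows. This is a hypothetical outline, not the actual contents of `auto_round/estimation.py`: the helper names, heuristic constants, and result fields are all assumptions.

```python
from types import SimpleNamespace


def _load_config(model_name, **kwargs):
    """Stand-in for the config-loading helper; the real module would call
    AutoConfig.from_pretrained(model_name) here."""
    return kwargs.get("config")


def _build_result(param_count, vram_gb, size_gb, minutes):
    """Collect the individual estimates into one result dict."""
    return {
        "param_count": param_count,
        "peak_vram_gb": vram_gb,
        "output_size_gb": size_gb,
        "est_minutes": minutes,
    }


def dry_run_estimate(model_name, **kwargs):
    """Top-level entry point: load config, run each estimator, build the result."""
    config = _load_config(model_name, **kwargs)
    hidden = getattr(config, "hidden_size", 0)
    layers = getattr(config, "num_hidden_layers", 0)
    # Assumed heuristic: ~12 * hidden^2 parameters per transformer layer
    # (attention + MLP); the real estimator may count differently.
    param_count = 12 * hidden * hidden * layers
    vram_gb = param_count * 2 / 1e9   # fp16 weights resident on GPU
    size_gb = param_count * 0.5 / 1e9  # ~4-bit quantized output
    minutes = layers * kwargs.get("iters", 200) * 0.12 / 60
    return _build_result(param_count, vram_gb, size_gb, minutes)
```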
Force-pushed the branch from d042f9e to 8ce019d.
for more information, see https://pre-commit.ci
```python
    hidden_size^2 * num_layers heuristic when fields are missing.
    """
    hidden = getattr(config, "hidden_size", None)
    num_layers = getattr(config, "num_hidden_layers", None)
```
We typically perform block-wise tuning. By a "block," we mean a decoder layer, which usually contains 6–7 linear layers for non-MoE models.
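Following this comment, the per-layer linears can be counted directly when only architecture fields are available. A rough sketch of such a fallback; the multipliers and the default vocabulary size are assumptions, not the module's actual constants:

```python
def estimate_params_from_shape(hidden, num_layers, vocab_size=32000):
    """Rough transformer parameter count from architecture fields alone.

    Per decoder layer (the 6-7 linears the comment refers to):
      - attention: q, k, v, o projections -> 4 * hidden^2
      - MLP: up, gate, down projections   -> 3 * hidden * intermediate
    Assuming intermediate_size ~ 8/3 * hidden (LLaMA-style), the MLP
    contributes ~8 * hidden^2, so each layer is ~12 * hidden^2 in total.
    """
    per_layer = 12 * hidden * hidden
    embeddings = 2 * vocab_size * hidden  # input embedding + LM head
    return per_layer * num_layers + embeddings
```

For a 7B-class shape (hidden 4096, 32 layers) this lands near 6.7B parameters, which is the right ballpark for the heuristic to be useful.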
```python
    - CUDA overhead and fragmentation (~20% buffer)
    """
    # Model weights
    model_bytes = param_count * model_dtype_bytes
```
We need to cache some input data for the block when `low_gpu_mem_usage` is not enabled.
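This caching term can be approximated from the calibration settings. A minimal sketch, assuming the cache holds one hidden-state tensor per calibration sample (function name and the simplification to hidden states only are assumptions):

```python
def block_cache_bytes(nsamples, seqlen, hidden, dtype_bytes=2):
    """Memory needed to cache the calibration inputs for one block.

    Without low_gpu_mem_usage, the inputs to the block under tuning are
    kept on the GPU for every calibration sample, so the cache grows as
    nsamples * seqlen * hidden * dtype_bytes (hidden states only;
    attention masks and position ids add a small extra amount).
    """
    return nsamples * seqlen * hidden * dtype_bytes
```

For example, 512 samples at sequence length 2048 with hidden size 4096 in fp16 is already about 8.6 GB, so omitting this term would understate peak VRAM noticeably.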
```python
# Rough seconds per layer per iteration, measured on A100 for a 7B-class model.
# Actual speed varies widely by hardware and model architecture.
_SECS_PER_LAYER_PER_ITER = 0.12
```
Can we use a dummy block to measure the real performance of the current machine?
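One way to act on this suggestion: time a few tuning steps on a single dummy block and extrapolate, instead of using a fixed constant. A framework-agnostic sketch, where `run_block_step` is a placeholder callable (e.g. forward + backward on one decoder layer with dummy inputs), not part of the actual codebase:

```python
import time


def calibrate_secs_per_iter(run_block_step, warmup=3, iters=10):
    """Measure real per-iteration tuning time of one block on this machine.

    run_block_step performs one tuning step on a dummy block; warmup runs
    are discarded to exclude one-time costs (kernel compilation, caches).
    """
    for _ in range(warmup):
        run_block_step()
    start = time.perf_counter()
    for _ in range(iters):
        run_block_step()
    return (time.perf_counter() - start) / iters
```

The total-time estimate then becomes `num_layers * iters * measured_secs`, replacing the hard-coded `_SECS_PER_LAYER_PER_ITER` with a number specific to the current hardware.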
```python
# Optimizer state: roughly 2x one block's parameters (momentum + variance for Adam)
# Approximate one block as total_params / num_layers
block_overhead = model_bytes * 0.05  # ~5% of model for one block's optimizer state
```
See `auto_round/utils/device.py`, line 1204 (d02b2ed):

```python
card_0_used_memory = block_input_output_memory + layer_activation_memory + additional_memory
```

I have summarized the key points regarding `block_overhead` here, and I hope this proves insightful for you.
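The real accounting lives in `auto_round/utils/device.py`; a simplified sketch combining the terms raised in this thread (function name, default fractions, and the 2x/1x optimizer-state split are assumptions):

```python
def estimate_peak_vram_bytes(model_bytes, block_bytes, cache_bytes,
                             overhead_frac=0.20):
    """Peak VRAM ~ weights + per-block tuning state + cached block inputs.

    - model_bytes: full model weights resident on the GPU
    - block_bytes: one block's parameters; Adam keeps ~2x this in
      momentum/variance, plus ~1x for gradients
    - cache_bytes: cached inputs for the block under tuning
    - overhead_frac: CUDA context + fragmentation buffer (~20%)
    """
    optimizer_state = 2 * block_bytes
    gradients = block_bytes
    subtotal = model_bytes + optimizer_state + gradients + cache_bytes
    return int(subtotal * (1 + overhead_frac))
```

Compared to the flat `model_bytes * 0.05` above, this makes the block-sized terms explicit, so they can be sized from `total_params / num_layers` rather than a fixed fraction of the whole model.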
```python
    hidden_size^2 * num_layers heuristic when fields are missing.
    """
    hidden = getattr(config, "hidden_size", None)
    num_layers = getattr(config, "num_hidden_layers", None)
```
`num_hidden_layers` may not cover all model cases; Claude could help refine it.
By the way, we may need special handling for MoE models.
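A possible refinement along these lines: probe several common config field names, and scale the per-layer MLP count by the expert count for MoE models. The field names below cover common architectures but are an assumption, not an exhaustive list:

```python
def get_num_layers(config):
    """Try the layer-count field names used by common architectures."""
    for name in ("num_hidden_layers", "num_layers", "n_layer", "n_layers"):
        value = getattr(config, name, None)
        if value is not None:
            return value
    return None


def moe_param_multiplier(config):
    """For MoE models, each layer's MLP weights are replicated per expert."""
    experts = getattr(config, "num_local_experts", None) or getattr(
        config, "num_experts", None
    )
    return experts if experts else 1
```

Note the multiplier only applies to the MLP portion of each layer (attention is shared across experts), so a real estimator would scale just that term.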
Thanks for the detailed feedback on the estimation approach.

@wenhuach21 Good point on block-wise tuning and the input caching overhead. I'll update the estimation to account for per-block input/output caching when `low_gpu_mem_usage` is not enabled.

@xin3he The dummy-block idea for real-machine benchmarking is interesting; that would give more accurate estimates than extrapolation. I'll look into it.
Summary
Adds a `--dry-run` flag to the CLI that estimates VRAM usage, output file size, and approximate quantization time without running the full quantization process. Model architecture metadata is read via `AutoConfig.from_pretrained()` (no weights downloaded).

Motivation

Users quantizing large models (70B+) need to know resource requirements before committing GPU hours. This is relevant to #1551 (reduce quant cost) and #1584 (peak VRAM tracking).
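The short-circuit behavior can be wired up roughly as below. This is an argparse sketch, not the actual CLI code in `auto_round/__main__.py`; the return values are placeholders for illustration:

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="auto-round")
    # Accept both spellings; argparse stores either under args.dry_run
    parser.add_argument("--dry_run", "--dry-run", dest="dry_run",
                        action="store_true",
                        help="estimate VRAM/disk/time without quantizing")
    parser.add_argument("model", nargs="?")
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    if args.dry_run:
        # Short-circuit here, before any model weights are loaded
        return "dry-run"
    return "quantize"
```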
Example output
Changes
- `auto_round/estimation.py` - VRAM, disk, and time estimation functions
- `auto_round/__main__.py` - `--dry_run`/`--dry-run` CLI flag; short-circuits before model loading
- `test/test_cpu/core/test_estimation.py` - unit tests for all estimation functions

Testing
All estimation unit tests pass (parameter counting, VRAM estimation, output size calculation, time estimation, format helpers). Tests use stub configs to avoid model downloads.
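The stub-config approach can look like this. A hypothetical sketch: the stub factory and the heuristic under test are illustrations, not the actual test file's contents.

```python
from types import SimpleNamespace


def make_stub_config(hidden=4096, layers=32):
    """Stub standing in for an AutoConfig object -- no model download needed."""
    return SimpleNamespace(hidden_size=hidden, num_hidden_layers=layers)


def rough_param_count(config):
    """The hidden_size^2 * num_layers fallback exercised by the tests."""
    return config.hidden_size**2 * config.num_hidden_layers


def test_rough_param_count_scale():
    cfg = make_stub_config()
    # a 7B-class shape should land in the hundreds of millions or more
    assert rough_param_count(cfg) > 100_000_000
```

Because the stub exposes only the attributes the estimators read, the tests stay fast and offline while still covering the fallback paths.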
Fixes #1591
This contribution was developed with AI assistance (Claude Code).