
[QUESTION] Estimate measurements and quantization peak VRAM use in advance #792

@ThomasBaruzier

Description


Hi turbo!

I was wondering if there is a straightforward way to predict, in advance, the peak VRAM usage during measurement or quantization.
I am parallelizing my quant processes, and being able to do this would make my workflow a lot more efficient.

2.5 Pro gave the following proposal, given config.json and the relevant code (a rough sketch of these steps is included after the list):


  1. Parse config.json: Extract all relevant dimensions and architecture details.
  2. Parse measurement.json: Get the list of QParams options for each module type.
  3. Simulate optimize: For each quantizable module type (attn_q, attn_k, attn_v, attn_o, mlp_gate, mlp_up, mlp_down, lm_head), select the QParams that best matches args.bits (or args.head_bits). This forms your job["strategy"].
  4. Initialize max_peak_vram_for_any_layer = 0.
  5. Calculate static_vram: Sum of sizes for non-linear weights (embeddings, norms) that are always loaded. Add PyTorch overhead (e.g., 512MB).
  6. Iterate through each ExLlamaV2Linear module that will be quantized:
    a. Get R (in_features) and C (out_features).
    b. Get the chosen QParams for this layer from your simulated strategy.
    c. Calculate num_groups based on R and QParams.group_size.
    d. Calculate current_layer_peak_vram from the following buffers:
    i. size_original_weights_fp16 = C * R * 2 (original layer loaded)
    ii. size_weights_arg_fp32 = R * C * 4 (FP32 weights for kernel)
    iii. size_hessian_inv_fp32 = R * R * 4 (unless it's lm_head and rtn=True due to size)
    iv. size_quant_fp16 = R * C * 2
    v. size_qweight_int16 = R * C * 2
    vi. size_error_fp32 = R * C * 4
    vii. Sum these, considering which are truly concurrent. A safe bet is to sum all of them if unsure about precise lifetimes within the CUDA kernel and AdaptiveGPTQ methods.
    e. max_peak_vram_for_any_layer = max(max_peak_vram_for_any_layer, current_layer_peak_vram)
  7. Final Estimate: max_peak_vram_for_any_layer + static_vram.
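
To make the arithmetic concrete, here is a minimal sketch of steps 5 and 6 in Python, assuming a Llama-style config.json, that all six per-layer buffers are resident at once (the "safe bet" from step 6d.vii), and a flat 512 MB PyTorch overhead. The helper names (`layer_peak_bytes`, `estimate_peak_vram`) are hypothetical and not part of exllamav2's API; the static term only counts the embeddings.

```python
# Rough upper-bound estimate of quantization peak VRAM, following the steps above.
# Assumptions: all six per-layer buffers are concurrent, 512 MB framework overhead,
# Hessian inverse skipped for lm_head (rtn case), norms ignored as negligible.

import json

FP16, FP32, INT16 = 2, 4, 2  # bytes per element

def layer_peak_bytes(rows: int, cols: int, with_hessian: bool = True) -> int:
    """Worst-case bytes while quantizing one linear layer with in_features=rows, out_features=cols."""
    total = rows * cols * FP16       # i.   original fp16 weights
    total += rows * cols * FP32      # ii.  fp32 weights passed to the kernel
    if with_hessian:
        total += rows * rows * FP32  # iii. Hessian inverse (skipped for rtn lm_head)
    total += rows * cols * FP16      # iv.  quantized/dequantized fp16 result
    total += rows * cols * INT16     # v.   packed int16 qweight
    total += rows * cols * FP32      # vi.  fp32 error buffer
    return total

def estimate_peak_vram(config_path: str, overhead_bytes: int = 512 << 20) -> float:
    """Return an upper-bound estimate in GiB for a Llama-style config.json."""
    with open(config_path) as f:
        cfg = json.load(f)

    h = cfg["hidden_size"]
    heads = cfg["num_attention_heads"]
    kv = cfg.get("num_key_value_heads", heads)
    head_dim = h // heads
    inter = cfg["intermediate_size"]
    vocab = cfg["vocab_size"]

    # (in_features, out_features) of each quantizable projection in one block
    shapes = [
        (h, h),              # attn_q
        (h, kv * head_dim),  # attn_k
        (h, kv * head_dim),  # attn_v
        (h, h),              # attn_o
        (h, inter),          # mlp_gate
        (h, inter),          # mlp_up
        (inter, h),          # mlp_down
    ]
    worst_layer = max(layer_peak_bytes(r, c) for r, c in shapes)
    # lm_head: very wide, Hessian often skipped (step 6d.iii)
    worst_layer = max(worst_layer, layer_peak_bytes(h, vocab, with_hessian=False))

    static = vocab * h * FP16  # embeddings kept resident
    return (worst_layer + static + overhead_bytes) / (1 << 30)

if __name__ == "__main__":
    print(f"~{estimate_peak_vram('config.json'):.2f} GiB peak (upper bound)")
```

Since it assumes all buffers overlap, this should overestimate; the actual peak depends on buffer lifetimes inside the CUDA kernels and AdaptiveGPTQ, which is exactly what I am unsure about.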

I was wondering if there is a simpler way, or if this strategy would be viable, at least for the quantization process.

I would be grateful for an opinion on this before trying to implement such a solution.

Have a nice day!

Acknowledgements

  • I have looked for similar requests before submitting this one.
  • I understand that the developers have lives and my issue will be answered when possible.
  • I understand the developers of this program are human, and I will make my requests politely.
