
[QUESTION] Estimate measurements and quantization peak VRAM use in advance #792

@ThomasBaruzier

Description


Hi turbo!

I was wondering if there is a straightforward way to predict, in advance, the peak VRAM usage during measurement or quantization.
I am parallelizing my quant processes, and being able to do this would make my workflow a lot more efficient.

2.5 Pro gave the following proposal, given config.json and the relevant code (a rough sketch of these steps is included after the list):


  1. Parse config.json: Extract all relevant dimensions and architecture details.
  2. Parse measurement.json: Get the list of QParams options for each module type.
  3. Simulate optimize: For each quantizable module type (attn_q, attn_k, attn_v, attn_o, mlp_gate, mlp_up, mlp_down, lm_head), select the QParams that best matches args.bits (or args.head_bits). This forms your job["strategy"].
  4. Initialize max_peak_vram_for_any_layer = 0.
  5. Calculate static_vram: Sum of sizes for non-linear weights (embeddings, norms) that are always loaded. Add PyTorch overhead (e.g., 512MB).
  6. Iterate through each ExLlamaV2Linear module that will be quantized:
    a. Get R (in_features) and C (out_features).
    b. Get the chosen QParams for this layer from your simulated strategy.
    c. Calculate num_groups based on R and QParams.group_size.
    d. Calculate current_layer_peak_vram from the following buffers:
    i. size_original_weights_fp16 = C * R * 2 (original layer loaded)
    ii. size_weights_arg_fp32 = R * C * 4 (FP32 weights for kernel)
    iii. size_hessian_inv_fp32 = R * R * 4 (unless it's lm_head and rtn=True due to size)
    iv. size_quant_fp16 = R * C * 2
    v. size_qweight_int16 = R * C * 2
    vi. size_error_fp32 = R * C * 4
    vii. Sum these, considering which are truly concurrent. A safe bet is to sum all of them if unsure about precise lifetimes within the CUDA kernel and AdaptiveGPTQ methods.
    e. max_peak_vram_for_any_layer = max(max_peak_vram_for_any_layer, current_layer_peak_vram)
  7. Final Estimate: max_peak_vram_for_any_layer + static_vram.
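
To make the arithmetic concrete, here is a minimal sketch of steps 5 and 6 in Python, assuming a Llama-style config.json, that all six per-layer buffers are resident at once (the "safe bet" from step 6d.vii), and a flat 512 MB PyTorch overhead. The helper names (`layer_peak_bytes`, `estimate_peak_vram`) are hypothetical and not part of exllamav2's API; the static term only counts the embeddings.

```python
# Rough upper-bound estimate of quantization peak VRAM, following the steps above.
# Assumptions: all six per-layer buffers are concurrent, 512 MB framework overhead,
# Hessian inverse skipped for lm_head (rtn case), norms ignored as negligible.

import json

FP16, FP32, INT16 = 2, 4, 2  # bytes per element

def layer_peak_bytes(rows: int, cols: int, with_hessian: bool = True) -> int:
    """Worst-case bytes while quantizing one linear layer with in_features=rows, out_features=cols."""
    total = rows * cols * FP16       # i.   original fp16 weights
    total += rows * cols * FP32      # ii.  fp32 weights passed to the kernel
    if with_hessian:
        total += rows * rows * FP32  # iii. Hessian inverse (skipped for rtn lm_head)
    total += rows * cols * FP16      # iv.  quantized/dequantized fp16 result
    total += rows * cols * INT16     # v.   packed int16 qweight
    total += rows * cols * FP32      # vi.  fp32 error buffer
    return total

def estimate_peak_vram(config_path: str, overhead_bytes: int = 512 << 20) -> float:
    """Return an upper-bound estimate in GiB for a Llama-style config.json."""
    with open(config_path) as f:
        cfg = json.load(f)

    h = cfg["hidden_size"]
    heads = cfg["num_attention_heads"]
    kv = cfg.get("num_key_value_heads", heads)
    head_dim = h // heads
    inter = cfg["intermediate_size"]
    vocab = cfg["vocab_size"]

    # (in_features, out_features) of each quantizable projection in one block
    shapes = [
        (h, h),              # attn_q
        (h, kv * head_dim),  # attn_k
        (h, kv * head_dim),  # attn_v
        (h, h),              # attn_o
        (h, inter),          # mlp_gate
        (h, inter),          # mlp_up
        (inter, h),          # mlp_down
    ]
    worst_layer = max(layer_peak_bytes(r, c) for r, c in shapes)
    # lm_head: very wide, Hessian often skipped (step 6d.iii)
    worst_layer = max(worst_layer, layer_peak_bytes(h, vocab, with_hessian=False))

    static = vocab * h * FP16  # embeddings kept resident
    return (worst_layer + static + overhead_bytes) / (1 << 30)

if __name__ == "__main__":
    print(f"~{estimate_peak_vram('config.json'):.2f} GiB peak (upper bound)")
```

Since it assumes all buffers overlap, this should overestimate; the actual peak depends on buffer lifetimes inside the CUDA kernels and AdaptiveGPTQ, which is exactly what I am unsure about.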

I was wondering if there is a simpler way, or if this strategy would be viable, at least for the quantization process.

I would be grateful for an opinion on this before trying to implement such a solution.

Have a nice day!

Acknowledgements

  • I have looked for similar requests before submitting this one.
  • I understand that the developers have lives and my issue will be answered when possible.
  • I understand the developers of this program are human, and I will make my requests politely.
