
Errors when quantizing PrimeIntellect-INTELLECT-3-FP16 #112


Description

@blackcat1402

I can successfully quantize GLM4.5 Air. However, when I tried to quantize one of its variants, PrimeIntellect-INTELLECT-3-FP16, the conversion failed with the errors below:

Microsoft Windows [version 10.0.19045.6466]
(c) Microsoft Corporation

C:\Users\blackcat1402>cd exllamav3

C:\Users\blackcat1402\exllamav3>python convert.py ^
More?   -i C:\Users\blackcat1402\PrimeIntellect-INTELLECT-3-FP16 ^
More?   -o C:\Users\blackcat1402\PrimeIntellect-INTELLECT-3-exl3-5.76bpw ^
More?   -w C:\Users\blackcat1402\exl3_working ^
More?   -b 5.76 ^
More?   -d 2
Detected Windows operating system. Triton does not have an official Windows release, thus FLA will not be adapted for Windows, and any potential errors will not be fixed. Please consider using a Linux environment for compatibility.
 -- Creating new job
    Input directory: C:\Users\blackcat1402\PrimeIntellect-INTELLECT-3-FP16
    Output directory: C:\Users\blackcat1402\PrimeIntellect-INTELLECT-3-exl3-5.76bpw
    Working directory: C:\Users\blackcat1402\exl3_working
    Calibration size: 250 rows, 2048 columns
    Target bitrate: 5.76 (decoder), 6 (head)
    Output scales: auto
    Codebook: mcg
 -- Loaded model config
    Architecture: Glm4MoeForCausalLM
 -- Created model instance:
     - Glm4MoeModel
         - Embedding
         - TransformerBlock
             - RMSNorm
             - Attention
                 - [4x] Linear
             - RMSNorm
             - GatedMLP
                 - [3x] Linear
         - [45x] TransformerBlock
             - RMSNorm
             - Attention
                 - [4x] Linear
             - RMSNorm
             - BlockSparseMLP
                 - [385x] Linear
                 - GatedMLP
                     - [3x] Linear
         - RMSNorm
         - Linear
 -- Loaded tokenizer
    Vocab size: 151367
 -- Preparing input state
 -- Loading unquantized module: model.embed_tokens
 -- Quantized: model.embed_tokens                                           bpw: 16.00  rfn: 0.000000  cos: 0.000000  sqnr: 0.000000  [6.28 s]
 -- Loading unquantized module: model.layers.0
 -- Captured: model.layers.0
 -- Quantized: model.layers.0.self_attn.q_proj                              bpw:  6.00  proxy_err: 0.000025  .  g_sc: 0.405454  [9.13 s]
 -- Quantized: model.layers.0.self_attn.k_proj                              bpw:  7.00  proxy_err: 0.000006  .  g_sc: 0.467450  [1.35 s]
 -- Quantized: model.layers.0.self_attn.v_proj                              bpw:  7.00  proxy_err: 0.000008  .  g_sc: 0.629755  [1.32 s]
 -- Quantized: model.layers.0.self_attn.o_proj                              bpw:  6.00  proxy_err: 0.000024  o  g_sc: 0.691751  [10.49 s]
 -- Quantized: model.layers.0.mlp.up_proj                                   bpw:  5.00  proxy_err: 0.000160  o  g_sc: 0.859647  [8.23 s]
 -- Quantized: model.layers.0.mlp.gate_proj                                 bpw:  5.00  proxy_err: 0.000123  o  g_sc: 0.863102  [8.14 s]
 -- Quantized: model.layers.0.mlp.down_proj                                 bpw:  6.00  proxy_err: 0.000032  o  g_sc: 0.806696  [8.69 s]
 -- Quantized: model.layers.0                                               bpw:  5.67  rfn: 0.005075  cos: 0.000013  sqnr: 46.547755  [80.01 s]
 -- Estimated remaining time: 1 hour, 4 minutes
 -- Loading unquantized module: model.layers.1
 -- Captured: model.layers.1
 !! Warning: block.mlp.0.down state has 0 inf values and 720,896,000 NaN values (out of 720,896,000)
 !! Warning: block.mlp.1.down state has 0 inf values and 720,896,000 NaN values (out of 720,896,000)
 !! Warning: block.mlp.10.down state has 0 inf values and 720,896,000 NaN values (out of 720,896,000)
 !! Warning: block.mlp.100.down state has 0 inf values and 720,896,000 NaN values (out of 720,896,000)
 !! Warning: block.mlp.101.down state has 0 inf values and 720,896,000 NaN values (out of 720,896,000)
 !! Warning: block.mlp.102.down state has 0 inf values and 720,896,000 NaN values (out of 720,896,000)
 -- Quantized: model.layers.1.self_attn.q_proj                              bpw:  6.00  proxy_err: 0.000017  .  g_sc: 0.376184  [8.21 s]
 -- Quantized: model.layers.1.self_attn.k_proj                              bpw:  7.00  proxy_err: 0.000002  .  g_sc: 0.525990  [1.31 s]
 -- Quantized: model.layers.1.self_attn.v_proj                              bpw:  7.00  proxy_err: 0.000007  .  g_sc: 0.644391  [1.34 s]
 -- Quantized: model.layers.1.self_attn.o_proj                              bpw:  6.00  proxy_err: 0.000031  o  g_sc: 0.786471  [10.38 s]
 -- Quantized: model.layers.1.mlp.experts.0.up_proj                         bpw:  6.00  proxy_err: (OoM)     o  g_sc: 1.895478  [1.48 s]
 -- Quantized: model.layers.1.mlp.experts.0.gate_proj                       bpw:  5.00  proxy_err: 0.000178  o  g_sc: 0.859647  [1.38 s]
Traceback (most recent call last):
  File "C:\Users\blackcat1402\exllamav3\convert.py", line 11, in <module>
    main(_in_args, _job_state)
  File "C:\Users\blackcat1402\AppData\Local\Programs\Python\Python312\Lib\site-packages\torch\utils\_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\blackcat1402\exllamav3\exllamav3\conversion\convert_model.py", line 520, in main
    proxy_err = linear.convert_exl3(
                ^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\blackcat1402\exllamav3\exllamav3\modules\linear.py", line 299, in convert_exl3
    weight_q, proxy_err, out_tensors = quantize_exl3(
                                       ^^^^^^^^^^^^^^
  File "C:\Users\blackcat1402\exllamav3\exllamav3\modules\quant\exl3_lib\quantize.py", line 793, in quantize_exl3
    H, L, su, H_diag = finalize_capture_H(H_data, quant_args, verbose)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\blackcat1402\exllamav3\exllamav3\modules\quant\exl3_lib\quantize.py", line 488, in finalize_capture_H
    L, H = block_ldl(H, 16, verbose)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\blackcat1402\exllamav3\exllamav3\modules\quant\exl3_lib\quantize.py", line 291, in block_ldl
    raise e
  File "C:\Users\blackcat1402\exllamav3\exllamav3\modules\quant\exl3_lib\quantize.py", line 278, in block_ldl
    L = torch.linalg.cholesky(H)
        ^^^^^^^^^^^^^^^^^^^^^^^^
torch._C._LinAlgError: linalg.cholesky: The factorization could not be completed because the input is not positive-definite (the leading minor of order 1 is not positive-definite).
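
If I read the error right, the "leading minor of order 1" wording fits the all-NaN warnings above: the underlying factorization routine treats a NaN at the very first pivot as a non-positive-definite minor, so a Hessian captured from those all-NaN states can never factorize. A minimal PyTorch repro of just the error (this is not exllamav3 code, only an illustration):

import torch

# An all-NaN "Hessian", like the captured down-proj states in the warnings.
H = torch.full((4, 4), float("nan"))
try:
    torch.linalg.cholesky(H)
except torch.linalg.LinAlgError as e:
    # Prints the same "leading minor of order 1 is not positive-definite" error.
    print(e)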

Retrying with -d 1 instead of -d 2 hit exactly the same NaN warnings and the same Cholesky failure:
C:\Users\blackcat1402\exllamav3>python convert.py ^
More?   -i C:\Users\blackcat1402\PrimeIntellect-INTELLECT-3-FP16 ^
More?   -o C:\Users\blackcat1402\PrimeIntellect-INTELLECT-3-exl3-5.76bpw ^
More?   -w C:\Users\blackcat1402\exl3_working ^
More?   -b 5.76 ^
More?   -d 1
Detected Windows operating system. Triton does not have an official Windows release, thus FLA will not be adapted for Windows, and any potential errors will not be fixed. Please consider using a Linux environment for compatibility.
 -- Creating new job
    Input directory: C:\Users\blackcat1402\PrimeIntellect-INTELLECT-3-FP16
    Output directory: C:\Users\blackcat1402\PrimeIntellect-INTELLECT-3-exl3-5.76bpw
    Working directory: C:\Users\blackcat1402\exl3_working
    Calibration size: 250 rows, 2048 columns
    Target bitrate: 5.76 (decoder), 6 (head)
    Output scales: auto
    Codebook: mcg
 -- Loaded model config
    Architecture: Glm4MoeForCausalLM
 -- Created model instance:
     - Glm4MoeModel
         - Embedding
         - TransformerBlock
             - RMSNorm
             - Attention
                 - [4x] Linear
             - RMSNorm
             - GatedMLP
                 - [3x] Linear
         - [45x] TransformerBlock
             - RMSNorm
             - Attention
                 - [4x] Linear
             - RMSNorm
             - BlockSparseMLP
                 - [385x] Linear
                 - GatedMLP
                     - [3x] Linear
         - RMSNorm
         - Linear
 -- Loaded tokenizer
    Vocab size: 151367
 -- Preparing input state
 -- Loading unquantized module: model.embed_tokens
 -- Quantized: model.embed_tokens                                           bpw: 16.00  rfn: 0.000000  cos: 0.000000  sqnr: 0.000000  [4.49 s]
 -- Loading unquantized module: model.layers.0
 -- Captured: model.layers.0
 -- Quantized: model.layers.0.self_attn.q_proj                              bpw:  6.00  proxy_err: 0.000025  .  g_sc: 0.405454  [8.48 s]
 -- Quantized: model.layers.0.self_attn.k_proj                              bpw:  7.00  proxy_err: 0.000006  .  g_sc: 0.467450  [1.36 s]
 -- Quantized: model.layers.0.self_attn.v_proj                              bpw:  7.00  proxy_err: 0.000008  .  g_sc: 0.629755  [1.33 s]
 -- Quantized: model.layers.0.self_attn.o_proj                              bpw:  6.00  proxy_err: 0.000024  o  g_sc: 0.691751  [10.13 s]
 -- Quantized: model.layers.0.mlp.up_proj                                   bpw:  5.00  proxy_err: 0.000160  o  g_sc: 0.859647  [8.20 s]
 -- Quantized: model.layers.0.mlp.gate_proj                                 bpw:  5.00  proxy_err: 0.000123  o  g_sc: 0.863102  [8.10 s]
 -- Quantized: model.layers.0.mlp.down_proj                                 bpw:  6.00  proxy_err: 0.000032  o  g_sc: 0.806696  [8.64 s]
 -- Quantized: model.layers.0                                               bpw:  5.67  rfn: 0.005075  cos: 0.000013  sqnr: 46.547755  [78.07 s]
 -- Estimated remaining time: 1 hour, 2 minutes
 -- Loading unquantized module: model.layers.1
 -- Captured: model.layers.1
 !! Warning: block.mlp.0.down state has 0 inf values and 720,896,000 NaN values (out of 720,896,000)
 !! Warning: block.mlp.1.down state has 0 inf values and 720,896,000 NaN values (out of 720,896,000)
 !! Warning: block.mlp.10.down state has 0 inf values and 720,896,000 NaN values (out of 720,896,000)
 !! Warning: block.mlp.100.down state has 0 inf values and 720,896,000 NaN values (out of 720,896,000)
 !! Warning: block.mlp.101.down state has 0 inf values and 720,896,000 NaN values (out of 720,896,000)
 !! Warning: block.mlp.102.down state has 0 inf values and 720,896,000 NaN values (out of 720,896,000)
 -- Quantized: model.layers.1.self_attn.q_proj                              bpw:  6.00  proxy_err: 0.000017  .  g_sc: 0.376184  [8.21 s]
 -- Quantized: model.layers.1.self_attn.k_proj                              bpw:  7.00  proxy_err: 0.000002  .  g_sc: 0.525990  [1.33 s]
 -- Quantized: model.layers.1.self_attn.v_proj                              bpw:  7.00  proxy_err: 0.000007  .  g_sc: 0.644391  [1.30 s]
 -- Quantized: model.layers.1.self_attn.o_proj                              bpw:  6.00  proxy_err: 0.000031  o  g_sc: 0.786471  [10.21 s]
 -- Quantized: model.layers.1.mlp.experts.0.up_proj                         bpw:  6.00  proxy_err: (OoM)     o  g_sc: 1.895478  [1.46 s]
 -- Quantized: model.layers.1.mlp.experts.0.gate_proj                       bpw:  5.00  proxy_err: 0.000178  o  g_sc: 0.859647  [1.38 s]
Traceback (most recent call last):
  File "C:\Users\blackcat1402\exllamav3\convert.py", line 11, in <module>
    main(_in_args, _job_state)
  File "C:\Users\blackcat1402\AppData\Local\Programs\Python\Python312\Lib\site-packages\torch\utils\_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\blackcat1402\exllamav3\exllamav3\conversion\convert_model.py", line 520, in main
    proxy_err = linear.convert_exl3(
                ^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\blackcat1402\exllamav3\exllamav3\modules\linear.py", line 299, in convert_exl3
    weight_q, proxy_err, out_tensors = quantize_exl3(
                                       ^^^^^^^^^^^^^^
  File "C:\Users\blackcat1402\exllamav3\exllamav3\modules\quant\exl3_lib\quantize.py", line 793, in quantize_exl3
    H, L, su, H_diag = finalize_capture_H(H_data, quant_args, verbose)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\blackcat1402\exllamav3\exllamav3\modules\quant\exl3_lib\quantize.py", line 488, in finalize_capture_H
    L, H = block_ldl(H, 16, verbose)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\blackcat1402\exllamav3\exllamav3\modules\quant\exl3_lib\quantize.py", line 291, in block_ldl
    raise e
  File "C:\Users\blackcat1402\exllamav3\exllamav3\modules\quant\exl3_lib\quantize.py", line 278, in block_ldl
    L = torch.linalg.cholesky(H)
        ^^^^^^^^^^^^^^^^^^^^^^^^
torch._C._LinAlgError: linalg.cholesky: The factorization could not be completed because the input is not positive-definite (the leading minor of order 1 is not positive-definite).   
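
One more observation: 720,896,000 = 250 rows x 2048 tokens x 1408 features (1408 presumably being the experts' intermediate size), so every single element of the captured down-proj input is NaN, for every expert. That points at the NaNs coming from the FP16 source checkpoint itself (or an FP16 overflow in the forward pass) rather than from the quantizer. A quick sanity check of the source shards for NaN/Inf, assuming standard safetensors files (the script is just a sketch):

import glob
import torch
from safetensors import safe_open

# Scan every safetensors shard of the FP16 checkpoint for NaN/Inf weights.
model_dir = r"C:\Users\blackcat1402\PrimeIntellect-INTELLECT-3-FP16"
for shard in sorted(glob.glob(model_dir + r"\*.safetensors")):
    with safe_open(shard, framework="pt", device="cpu") as f:
        for name in f.keys():
            t = f.get_tensor(name)
            n_nan = torch.isnan(t).sum().item()
            n_inf = torch.isinf(t).sum().item()
            if n_nan or n_inf:
                print(f"{shard}  {name}: {n_nan} NaN, {n_inf} Inf")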
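
And in case it helps triage: if the source weights turn out to be clean, the generic GPTQ-style workaround for a non-positive-definite Hessian is to add a small diagonal damping term before factorizing. This is only a sketch of the idea, not exllamav3's actual code path (and it cannot rescue an all-NaN H; the NaNs would have to be fixed first):

import torch

def damped_cholesky(H: torch.Tensor, rel_damp: float = 1e-2) -> torch.Tensor:
    # Add rel_damp * mean(diag(H)) to the diagonal and retry with a 10x
    # larger damping factor until the factorization succeeds.
    damp = rel_damp * H.diagonal().mean()
    eye = torch.eye(H.shape[0], dtype=H.dtype, device=H.device)
    for _ in range(5):
        try:
            return torch.linalg.cholesky(H + damp * eye)
        except torch.linalg.LinAlgError:
            damp *= 10
    raise RuntimeError("H not positive-definite even after damping")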
