I can successfully quantize GLM4.5 Air. However, when I tried to quantize one of its variants, PrimeIntellect-INTELLECT-3-FP16, the conversion failed with the errors below:
Microsoft Windows [version 10.0.19045.6466]
(c) Microsoft Corporation
C:\Users\blackcat1402>cd exllamav3
C:\Users\blackcat1402\exllamav3>python convert.py ^
More? -i C:\Users\blackcat1402\PrimeIntellect-INTELLECT-3-FP16 ^
More? -o C:\Users\blackcat1402\PrimeIntellect-INTELLECT-3-exl3-5.76bpw ^
More? -w C:\Users\blackcat1402\exl3_working ^
More? -b 5.76 ^
More? -d 2
Detected Windows operating system. Triton does not have an official Windows release, thus FLA will not be adapted for Windows, and any potential errors will not be fixed. Please consider using a Linux environment for compatibility.
-- Creating new job
Input directory: C:\Users\blackcat1402\PrimeIntellect-INTELLECT-3-FP16
Output directory: C:\Users\blackcat1402\PrimeIntellect-INTELLECT-3-exl3-5.76bpw
Working directory: C:\Users\blackcat1402\exl3_working
Calibration size: 250 rows, 2048 columns
Target bitrate: 5.76 (decoder), 6 (head)
Output scales: auto
Codebook: mcg
-- Loaded model config
Architecture: Glm4MoeForCausalLM
-- Created model instance:
- Glm4MoeModel
- Embedding
- TransformerBlock
- RMSNorm
- Attention
- [4x] Linear
- RMSNorm
- GatedMLP
- [3x] Linear
- [45x] TransformerBlock
- RMSNorm
- Attention
- [4x] Linear
- RMSNorm
- BlockSparseMLP
- [385x] Linear
- GatedMLP
- [3x] Linear
- RMSNorm
- Linear
-- Loaded tokenizer
Vocab size: 151367
-- Preparing input state
-- Loading unquantized module: model.embed_tokens
-- Quantized: model.embed_tokens bpw: 16.00 rfn: 0.000000 cos: 0.000000 sqnr: 0.000000 [6.28 s]
-- Loading unquantized module: model.layers.0
-- Captured: model.layers.0
-- Quantized: model.layers.0.self_attn.q_proj bpw: 6.00 proxy_err: 0.000025 . g_sc: 0.405454 [9.13 s]
-- Quantized: model.layers.0.self_attn.k_proj bpw: 7.00 proxy_err: 0.000006 . g_sc: 0.467450 [1.35 s]
-- Quantized: model.layers.0.self_attn.v_proj bpw: 7.00 proxy_err: 0.000008 . g_sc: 0.629755 [1.32 s]
-- Quantized: model.layers.0.self_attn.o_proj bpw: 6.00 proxy_err: 0.000024 o g_sc: 0.691751 [10.49 s]
-- Quantized: model.layers.0.mlp.up_proj bpw: 5.00 proxy_err: 0.000160 o g_sc: 0.859647 [8.23 s]
-- Quantized: model.layers.0.mlp.gate_proj bpw: 5.00 proxy_err: 0.000123 o g_sc: 0.863102 [8.14 s]
-- Quantized: model.layers.0.mlp.down_proj bpw: 6.00 proxy_err: 0.000032 o g_sc: 0.806696 [8.69 s]
-- Quantized: model.layers.0 bpw: 5.67 rfn: 0.005075 cos: 0.000013 sqnr: 46.547755 [80.01 s]
-- Estimated remaining time: 1 hour, 4 minutes
-- Loading unquantized module: model.layers.1
-- Captured: model.layers.1
!! Warning: block.mlp.0.down state has 0 inf values and 720,896,000 NaN values (out of 720,896,000)
!! Warning: block.mlp.1.down state has 0 inf values and 720,896,000 NaN values (out of 720,896,000)
!! Warning: block.mlp.10.down state has 0 inf values and 720,896,000 NaN values (out of 720,896,000)
!! Warning: block.mlp.100.down state has 0 inf values and 720,896,000 NaN values (out of 720,896,000)
!! Warning: block.mlp.101.down state has 0 inf values and 720,896,000 NaN values (out of 720,896,000)
!! Warning: block.mlp.102.down state has 0 inf values and 720,896,000 NaN values (out of 720,896,000)
-- Quantized: model.layers.1.self_attn.q_proj bpw: 6.00 proxy_err: 0.000017 . g_sc: 0.376184 [8.21 s]
-- Quantized: model.layers.1.self_attn.k_proj bpw: 7.00 proxy_err: 0.000002 . g_sc: 0.525990 [1.31 s]
-- Quantized: model.layers.1.self_attn.v_proj bpw: 7.00 proxy_err: 0.000007 . g_sc: 0.644391 [1.34 s]
-- Quantized: model.layers.1.self_attn.o_proj bpw: 6.00 proxy_err: 0.000031 o g_sc: 0.786471 [10.38 s]
-- Quantized: model.layers.1.mlp.experts.0.up_proj bpw: 6.00 proxy_err: (OoM) o g_sc: 1.895478 [1.48 s]
-- Quantized: model.layers.1.mlp.experts.0.gate_proj bpw: 5.00 proxy_err: 0.000178 o g_sc: 0.859647 [1.38 s]
Traceback (most recent call last):
File "C:\Users\blackcat1402\exllamav3\convert.py", line 11, in <module>
main(_in_args, _job_state)
File "C:\Users\blackcat1402\AppData\Local\Programs\Python\Python312\Lib\site-packages\torch\utils\_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\blackcat1402\exllamav3\exllamav3\conversion\convert_model.py", line 520, in main
proxy_err = linear.convert_exl3(
^^^^^^^^^^^^^^^^^^^^
File "C:\Users\blackcat1402\exllamav3\exllamav3\modules\linear.py", line 299, in convert_exl3
weight_q, proxy_err, out_tensors = quantize_exl3(
^^^^^^^^^^^^^^
File "C:\Users\blackcat1402\exllamav3\exllamav3\modules\quant\exl3_lib\quantize.py", line 793, in quantize_exl3
H, L, su, H_diag = finalize_capture_H(H_data, quant_args, verbose)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\blackcat1402\exllamav3\exllamav3\modules\quant\exl3_lib\quantize.py", line 488, in finalize_capture_H
L, H = block_ldl(H, 16, verbose)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\blackcat1402\exllamav3\exllamav3\modules\quant\exl3_lib\quantize.py", line 291, in block_ldl
raise e
File "C:\Users\blackcat1402\exllamav3\exllamav3\modules\quant\exl3_lib\quantize.py", line 278, in block_ldl
L = torch.linalg.cholesky(H)
^^^^^^^^^^^^^^^^^^^^^^^^
torch._C._LinAlgError: linalg.cholesky: The factorization could not be completed because the input is not positive-definite (the leading minor of order 1 is not positive-definite).
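The NaN warnings above seem to explain the crash directly: once the captured state for the expert down projections is entirely NaN, the Hessian accumulated from it cannot be positive-definite, so torch.linalg.cholesky fails exactly as shown. A minimal sketch of that failure mode (the matrix here is synthetic, not exllamav3's actual H):

```python
import torch

# A symmetric positive-definite matrix factorizes without complaint.
a = torch.randn(16, 16)
h = a @ a.T + 16 * torch.eye(16)
torch.linalg.cholesky(h)  # ok

# Poison a single entry with NaN, as the all-NaN down_proj states would,
# and cholesky should raise the same error seen in the traceback.
h[0, 0] = float("nan")
try:
    torch.linalg.cholesky(h)
except torch.linalg.LinAlgError as e:
    print(e)  # "... the leading minor of order 1 is not positive-definite"
```

Retrying with -d 1 instead of -d 2 reproduces the same warnings and the same failure: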
C:\Users\blackcat1402\exllamav3>python convert.py ^
More? -i C:\Users\blackcat1402\PrimeIntellect-INTELLECT-3-FP16 ^
More? -o C:\Users\blackcat1402\PrimeIntellect-INTELLECT-3-exl3-5.76bpw ^
More? -w C:\Users\blackcat1402\exl3_working ^
More? -b 5.76 ^
More? -d 1
Detected Windows operating system. Triton does not have an official Windows release, thus FLA will not be adapted for Windows, and any potential errors will not be fixed. Please consider using a Linux environment for compatibility.
-- Creating new job
Input directory: C:\Users\blackcat1402\PrimeIntellect-INTELLECT-3-FP16
Output directory: C:\Users\blackcat1402\PrimeIntellect-INTELLECT-3-exl3-5.76bpw
Working directory: C:\Users\blackcat1402\exl3_working
Calibration size: 250 rows, 2048 columns
Target bitrate: 5.76 (decoder), 6 (head)
Output scales: auto
Codebook: mcg
-- Loaded model config
Architecture: Glm4MoeForCausalLM
-- Created model instance:
- Glm4MoeModel
- Embedding
- TransformerBlock
- RMSNorm
- Attention
- [4x] Linear
- RMSNorm
- GatedMLP
- [3x] Linear
- [45x] TransformerBlock
- RMSNorm
- Attention
- [4x] Linear
- RMSNorm
- BlockSparseMLP
- [385x] Linear
- GatedMLP
- [3x] Linear
- RMSNorm
- Linear
-- Loaded tokenizer
Vocab size: 151367
-- Preparing input state
-- Loading unquantized module: model.embed_tokens
-- Quantized: model.embed_tokens bpw: 16.00 rfn: 0.000000 cos: 0.000000 sqnr: 0.000000 [4.49 s]
-- Loading unquantized module: model.layers.0
-- Captured: model.layers.0
-- Quantized: model.layers.0.self_attn.q_proj bpw: 6.00 proxy_err: 0.000025 . g_sc: 0.405454 [8.48 s]
-- Quantized: model.layers.0.self_attn.k_proj bpw: 7.00 proxy_err: 0.000006 . g_sc: 0.467450 [1.36 s]
-- Quantized: model.layers.0.self_attn.v_proj bpw: 7.00 proxy_err: 0.000008 . g_sc: 0.629755 [1.33 s]
-- Quantized: model.layers.0.self_attn.o_proj bpw: 6.00 proxy_err: 0.000024 o g_sc: 0.691751 [10.13 s]
-- Quantized: model.layers.0.mlp.up_proj bpw: 5.00 proxy_err: 0.000160 o g_sc: 0.859647 [8.20 s]
-- Quantized: model.layers.0.mlp.gate_proj bpw: 5.00 proxy_err: 0.000123 o g_sc: 0.863102 [8.10 s]
-- Quantized: model.layers.0.mlp.down_proj bpw: 6.00 proxy_err: 0.000032 o g_sc: 0.806696 [8.64 s]
-- Quantized: model.layers.0 bpw: 5.67 rfn: 0.005075 cos: 0.000013 sqnr: 46.547755 [78.07 s]
-- Estimated remaining time: 1 hour, 2 minutes
-- Loading unquantized module: model.layers.1
-- Captured: model.layers.1
!! Warning: block.mlp.0.down state has 0 inf values and 720,896,000 NaN values (out of 720,896,000)
!! Warning: block.mlp.1.down state has 0 inf values and 720,896,000 NaN values (out of 720,896,000)
!! Warning: block.mlp.10.down state has 0 inf values and 720,896,000 NaN values (out of 720,896,000)
!! Warning: block.mlp.100.down state has 0 inf values and 720,896,000 NaN values (out of 720,896,000)
!! Warning: block.mlp.101.down state has 0 inf values and 720,896,000 NaN values (out of 720,896,000)
!! Warning: block.mlp.102.down state has 0 inf values and 720,896,000 NaN values (out of 720,896,000)
-- Quantized: model.layers.1.self_attn.q_proj bpw: 6.00 proxy_err: 0.000017 . g_sc: 0.376184 [8.21 s]
-- Quantized: model.layers.1.self_attn.k_proj bpw: 7.00 proxy_err: 0.000002 . g_sc: 0.525990 [1.33 s]
-- Quantized: model.layers.1.self_attn.v_proj bpw: 7.00 proxy_err: 0.000007 . g_sc: 0.644391 [1.30 s]
-- Quantized: model.layers.1.self_attn.o_proj bpw: 6.00 proxy_err: 0.000031 o g_sc: 0.786471 [10.21 s]
-- Quantized: model.layers.1.mlp.experts.0.up_proj bpw: 6.00 proxy_err: (OoM) o g_sc: 1.895478 [1.46 s]
-- Quantized: model.layers.1.mlp.experts.0.gate_proj bpw: 5.00 proxy_err: 0.000178 o g_sc: 0.859647 [1.38 s]
Traceback (most recent call last):
File "C:\Users\blackcat1402\exllamav3\convert.py", line 11, in <module>
main(_in_args, _job_state)
File "C:\Users\blackcat1402\AppData\Local\Programs\Python\Python312\Lib\site-packages\torch\utils\_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\blackcat1402\exllamav3\exllamav3\conversion\convert_model.py", line 520, in main
proxy_err = linear.convert_exl3(
^^^^^^^^^^^^^^^^^^^^
File "C:\Users\blackcat1402\exllamav3\exllamav3\modules\linear.py", line 299, in convert_exl3
weight_q, proxy_err, out_tensors = quantize_exl3(
^^^^^^^^^^^^^^
File "C:\Users\blackcat1402\exllamav3\exllamav3\modules\quant\exl3_lib\quantize.py", line 793, in quantize_exl3
H, L, su, H_diag = finalize_capture_H(H_data, quant_args, verbose)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\blackcat1402\exllamav3\exllamav3\modules\quant\exl3_lib\quantize.py", line 488, in finalize_capture_H
L, H = block_ldl(H, 16, verbose)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\blackcat1402\exllamav3\exllamav3\modules\quant\exl3_lib\quantize.py", line 291, in block_ldl
raise e
File "C:\Users\blackcat1402\exllamav3\exllamav3\modules\quant\exl3_lib\quantize.py", line 278, in block_ldl
L = torch.linalg.cholesky(H)
^^^^^^^^^^^^^^^^^^^^^^^^
torch._C._LinAlgError: linalg.cholesky: The factorization could not be completed because the input is not positive-definite (the leading minor of order 1 is not positive-definite).
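If it helps narrow things down, a quick scan of the source shards for non-finite values would show whether the NaNs are already present in the FP16 checkpoint or only appear during capture. A minimal sketch (the down_proj name filter is an assumption based on the "block.mlp.N.down" warnings; the path is the input directory from the log):

```python
import glob
import os

import torch
from safetensors import safe_open

# Input directory taken from the log above.
model_dir = r"C:\Users\blackcat1402\PrimeIntellect-INTELLECT-3-FP16"

for shard in sorted(glob.glob(os.path.join(model_dir, "*.safetensors"))):
    with safe_open(shard, framework="pt") as f:
        for name in f.keys():
            # Filter on expert down projections, per the warnings; adjust as needed.
            if "down_proj" not in name:
                continue
            t = f.get_tensor(name).float()
            nan = torch.isnan(t).sum().item()
            inf = torch.isinf(t).sum().item()
            if nan or inf:
                print(f"{os.path.basename(shard)} :: {name}: {nan:,} NaN, {inf:,} inf")
```

If the weights come back clean, the NaNs are presumably being produced during the layer-1 forward pass, which would point at the BlockSparseMLP capture path rather than the checkpoint itself.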