Hi turboderp,

Under Windows 10 22H2, I can quantize FP16 models (e.g. GLM4.5 Air) with a single A100 SXM2 32G GPU. However, with an FP8 model such as cerebras-MiniMax-M2-REAP-139B-A10B-FP8, the conversion always fails with the error below:
```
More? -i C:\Users\blackcat1402\cerebras-MiniMax-M2-REAP-139B-A10B-FP8 ^
More? -o C:\Users\blackcat1402\cerebras-MiniMax-M2-REAP-139B-A10B-exl3-5.25bpw ^
More? -w C:\Users\blackcat1402\exl3_working ^
More? -b 5.25 ^
More? -d 0
Detected Windows operating system. Triton does not have an official Windows release, thus FLA will not be adapted for Windows, and any potential errors will not be fixed. Please consider using a Linux environment for compatibility.
-- Creating new job
Input directory: C:\Users\blackcat1402\cerebras-MiniMax-M2-REAP-139B-A10B-FP8
Output directory: C:\Users\blackcat1402\cerebras-MiniMax-M2-REAP-139B-A10B-exl3-5.25bpw
Working directory: C:\Users\blackcat1402\exl3_working
Calibration size: 250 rows, 2048 columns
Target bitrate: 5.25 (decoder), 6 (head)
Output scales: auto
Codebook: mcg
-- Loaded model config
Architecture: MiniMaxM2ForCausalLM
-- Created model instance:
- MiniMaxM2Model
- Embedding
- [62x] TransformerBlock
- RMSNorm
- Attention
- [4x] Linear
- [2x] RMSNorm
- RMSNorm
- BlockSparseMLP
- [463x] Linear
- RMSNorm
- Linear
-- Loaded tokenizer
Vocab size: 200054
-- Preparing input state
-- Loading unquantized module: model.embed_tokens
-- Quantized: model.embed_tokens bpw: 16.00 rfn: 0.000000 cos: 0.000000 sqnr: 0.000000 [2.91 s]
-- Loading unquantized module: model.layers.0
Traceback (most recent call last):
File "C:\Users\blackcat1402\exllamav3\convert.py", line 11, in <module>
main(_in_args, _job_state)
File "C:\Users\blackcat1402\AppData\Roaming\Python\Python311\site-packages\torch\utils\_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\blackcat1402\exllamav3\exllamav3\conversion\convert_model.py", line 372, in main
rs = module.forward(rs, params)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\blackcat1402\exllamav3\exllamav3\modules\transformer.py", line 81, in forward
y = self.mlp.forward(y, params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\blackcat1402\exllamav3\exllamav3\modules\block_sparse_mlp.py", line 516, in forward
selected_experts, routing_weights = self.routing_fn(bsz, self.routing_cfg, y, params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\blackcat1402\exllamav3\exllamav3\modules\block_sparse_mlp.py", line 129, in routing_dots
ext.routing_ds3_nogroup(
RuntimeError: num_experts must be a multiple of 32
Exception raised from routing_ds3_nogroup at D:\a\exllamav3\exllamav3\exllamav3\exllamav3_ext\routing.cu:347 (most recent call first):
00007FF8D3CE2C24 00007FF8D3CE2B80 c10.dll!c10::Error::Error [<unknown file> @ <unknown line number>]
00007FF8D3CE169A 00007FF8D3CE1640 c10.dll!c10::detail::torchCheckFail [<unknown file> @ <unknown line number>]
00007FFF2EA245B3 00007FFF2E993F50 exllamav3_ext.cp311-win_amd64.pyd!PyInit_exllamav3_ext [<unknown file> @ <unknown line number>]
00007FFF2E9A8EE1 00007FFF2E993F50 exllamav3_ext.cp311-win_amd64.pyd!PyInit_exllamav3_ext [<unknown file> @ <unknown line number>]
00007FFF2E9A8F64 00007FFF2E993F50 exllamav3_ext.cp311-win_amd64.pyd!PyInit_exllamav3_ext [<unknown file> @ <unknown line number>]
00007FFF2E990793 00007FFF2E985D80 exllamav3_ext.cp311-win_amd64.pyd!c10::ivalue::Object::operator= [<unknown file> @ <unknown line number>]
00007FF8D1AA82BE 00007FF8D1AA8060 python311.dll!PyObject_MakeTpCall [<unknown file> @ <unknown line number>]
00007FF8D1AABDDF 00007FF8D1AABB50 python311.dll!PyObject_Vectorcall [<unknown file> @ <unknown line number>]
00007FF8D1AAD423 00007FF8D1AACC20 python311.dll!PyEval_EvalFrameDefault [<unknown file> @ <unknown line number>]
00007FF8D1AAAE64 00007FF8D1AAACC0 python311.dll!PyFunction_Vectorcall [<unknown file> @ <unknown line number>]
00007FF8D1B54CF3 00007FF8D1B548EC python311.dll!PyObject_CallObject [<unknown file> @ <unknown line number>]
00007FF8D1AB1F7F 00007FF8D1AACC20 python311.dll!PyEval_EvalFrameDefault [<unknown file> @ <unknown line number>]
00007FF8D1A87FEB 00007FF8D1A87EF0 python311.dll!PyType_CalculateMetaclass [<unknown file> @ <unknown line number>]
00007FF8D1B52533 00007FF8D1B5249C python311.dll!PyEval_EvalCode [<unknown file> @ <unknown line number>]
00007FF8D1B527CA 00007FF8D1B5249C python311.dll!PyEval_EvalCode [<unknown file> @ <unknown line number>]
00007FF8D1B52746 00007FF8D1B5249C python311.dll!PyEval_EvalCode [<unknown file> @ <unknown line number>]
00007FF8D1C4ED26 00007FF8D1BF9C78 python311.dll!PyThread_tss_is_created [<unknown file> @ <unknown line number>]
00007FF8D1BBF0C9 00007FF8D1BBEFAC python311.dll!PyRun_SimpleFileObject [<unknown file> @ <unknown line number>]
00007FF8D1BBEE38 00007FF8D1BBEDE4 python311.dll!PyRun_AnyFileObject [<unknown file> @ <unknown line number>]
00007FF8D1BBEA87 00007FF8D1BBDDB0 python311.dll!PyDict_Values [<unknown file> @ <unknown line number>]
00007FF8D1BBE943 00007FF8D1BBDDB0 python311.dll!PyDict_Values [<unknown file> @ <unknown line number>]
00007FF8D1A7A4EC 00007FF8D1A7A368 python311.dll!Py_RunMain [<unknown file> @ <unknown line number>]
00007FF8D1A7A37D 00007FF8D1A7A368 python311.dll!Py_RunMain [<unknown file> @ <unknown line number>]
00007FF8D1A7880D 00007FF8D1A787E8 python311.dll!Py_Main [<unknown file> @ <unknown line number>]
00007FF688601230 <unknown symbol address> python.exe!<unknown symbol> [<unknown file> @ <unknown line number>]
00007FF8FE1D7374 00007FF8FE1D7360 KERNEL32.DLL!BaseThreadInitThunk [<unknown file> @ <unknown line number>]
00007FF8FE31CC91 00007FF8FE31CC70 ntdll.dll!RtlUserThreadStart [<unknown file> @ <unknown line number>]
```
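For what it's worth, the assertion seems to be about the router's expert count rather than the FP8 dtype itself. Here is a minimal sketch to read the expert count straight out of the model's config.json; the key names are guesses on my part, since MoE configs name this field differently per architecture:

```python
import json
from pathlib import Path

# Local model directory from the convert command above.
model_dir = Path(r"C:\Users\blackcat1402\cerebras-MiniMax-M2-REAP-139B-A10B-FP8")
cfg = json.loads((model_dir / "config.json").read_text())

# Key name is a guess -- MoE configs variously use "num_local_experts",
# "num_experts", or "n_routed_experts" depending on the architecture.
for key in ("num_local_experts", "num_experts", "n_routed_experts"):
    if key in cfg:
        n = cfg[key]
        print(f"{key} = {n}, multiple of 32: {n % 32 == 0}")
```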
Still, I am wondering whether this is a limitation of the A100 (as far as I know, SM80 has no native FP8 support), or whether ExLlamaV3 does not support quantizing from FP8 models at all.
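In case it is relevant, this is how I would confirm the card's compute capability (torch.cuda.get_device_capability is a standard PyTorch call; the FP8 note in the comment reflects my understanding, not anything from the exllamav3 docs):

```python
import torch

# A100 should report (8, 0). As far as I know, native FP8 tensor-core
# support only arrives with SM89 (Ada) / SM90 (Hopper).
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: sm_{major}{minor}")
```

Thanks in advance.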