how to test Q4 models with the backend: "AMXInt8" or "AMXBF16" #1371
Right now (as of v0.32) you cannot load a 4-bit quantized GGUF model and use AMX; you can only load a BF16 GGUF. In the optimize rule (Qwen3Moe-serve-amx.yaml), set the backend to "AMXInt8". The engine will then perform online quantization from BF16 to Int8 as it loads each layer into RAM. To verify that AMX is used, run perf while sending prompts to the model:
You will see only a few AMX counters, since AMX is used only for decode operations on the CPU-offloaded experts. If you want to exercise AMX further, you can move all experts to the CPU (this will be slower than GPU+CPU). Change this in the optimize rule:
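A sketch of what such a rule might look like, assuming ktransformers' match/replace optimize-rule format; the regex, class path, and kwarg names here are illustrative, so check the shipped Qwen3Moe-serve-amx.yaml for the exact values:

```yaml
# Illustrative fragment only; field names and the match pattern
# may differ in the real Qwen3Moe-serve-amx.yaml.
- match:
    name: "^model\\.layers\\..*\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      generate_device: "cpu"   # keep experts on CPU so decode hits AMX
      backend: "AMXInt8"       # online BF16 -> Int8 quantization at load time
```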
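The online Int8 quantization mentioned above can be sketched as symmetric per-channel quantization. This is a conceptual illustration of the BF16 → Int8 step, not ktransformers' actual kernel (float32 stands in for BF16 here):

```python
import numpy as np

def quantize_int8_per_channel(w: np.ndarray):
    """Symmetric per-output-channel Int8 quantization of a weight matrix."""
    # One scale per output row: map the largest |w| in the row to 127.
    scales = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero rows
    q = np.clip(np.round(w / scales), -128, 127).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scales

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
q, s = quantize_int8_per_channel(w)
w_hat = dequantize(q, s)
# Round-trip error is bounded by half a quantization step per row.
print(np.abs(w - w_hat).max())
```

Per-channel scales keep the quantization error proportional to each row's magnitude, which is why an Int8 path can stay close to BF16 quality for weight matrices.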
Hello,
I have recently tested ktransformers' AMX support and the speedup for prefill is nice.
The documentation shows a figure with test results for Qwen3-30B-A3B (4-bit), but it also says: "Qwen3MoE running with AMX can only read BF16 GGUF". So it seems a quantized GGUF like Q4_K_M cannot be used (or I'm doing something wrong).
How can I try the 4-bit version with the AMX optimizations? Am I missing something?
This is how I run it (this works):
python -m ktransformers.server.main --architectures Qwen3MoeForCausalLM --model_path /root/models/Qwen/Qwen3-30B-A3B --gguf_path /root/models/unsloth/Qwen3-30B-A3B-GGUF/BF16 --optimize_config_path /root/ktransformers/ktransformers/optimize/optimize_rules/Qwen3Moe-serve-amx.yaml --backend_type balance_serve --cache_lens 32768 --chunk_size 512 --max_batch_size 8 --model_name "unsloth/Qwen3-30B-A3B"
And this fails with the error assert self.gate_type == GGMLQuantizationType.BF16 (so I guess it needs the BF16 format, but then how do I load a 4-bit quantized model?):
python -m ktransformers.server.main --architectures Qwen3MoeForCausalLM --model_path /root/models/unsloth/Qwen3-30B-A3B/ --gguf_path /mnt/models/Qwen/Qwen3-30B-A3B-Q4_K_M.gguf --optimize_config_path /root/ktransformers/ktransformers/optimize/optimize_rules/Qwen3Moe-serve-amx.yaml --backend_type balance_serve --model_name "unsloth/Qwen3-30B-A3B"