how to test Q4 models with the backend: "AMXInt8" or "AMXBF16" #1371
Right now (as of v0.32) you cannot load a 4-bit quantized GGUF model and use AMX; you can only load a BF16 GGUF. In the optimize rule (Qwen3Moe-serve-amx.yaml), set the backend to "AMXInt8". The engine will then perform online quantization from BF16 to Int8 as it loads each layer into RAM. To verify that AMX is used, run perf while sending prompts to the model:
You will see only a few AMX counters, since AMX is used only for decode operations on the CPU-offloaded experts. If you want to exercise AMX further, you can move all experts to the CPU (this will be slower than GPU+CPU). Change this in the optimize rule:
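A sketch of what such a rule might look like, assuming ktransformers' match/replace optimize-rule format; the regex, class path, and kwarg names here are illustrative, so check the shipped Qwen3Moe-serve-amx.yaml for the exact values:

```yaml
# Illustrative fragment only; field names and the match pattern
# may differ in the real Qwen3Moe-serve-amx.yaml.
- match:
    name: "^model\\.layers\\..*\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      generate_device: "cpu"   # keep experts on CPU so decode hits AMX
      backend: "AMXInt8"       # online BF16 -> Int8 quantization at load time
```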
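The online Int8 quantization mentioned above can be sketched as symmetric per-channel quantization. This is a conceptual illustration of the BF16 → Int8 step, not ktransformers' actual kernel (float32 stands in for BF16 here):

```python
import numpy as np

def quantize_int8_per_channel(w: np.ndarray):
    """Symmetric per-output-channel Int8 quantization of a weight matrix."""
    # One scale per output row: map the largest |w| in the row to 127.
    scales = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero rows
    q = np.clip(np.round(w / scales), -128, 127).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scales

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
q, s = quantize_int8_per_channel(w)
w_hat = dequantize(q, s)
# Round-trip error is bounded by half a quantization step per row.
print(np.abs(w - w_hat).max())
```

Per-channel scales keep the quantization error proportional to each row's magnitude, which is why an Int8 path can stay close to BF16 quality for weight matrices.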
Hello,
I have recently tested ktransformers' AMX support and the speedup for prefill is nice.
The documentation shows a figure with test results for Qwen3-30B-A3B (4-bit), but it also says: "Qwen3MoE running with AMX can only read BF16 GGUF". So it seems a quantized GGUF like Q4_K_M cannot be used (or I'm doing something wrong).
How can I try the 4-bit version with the AMX optimizations? Am I missing something?
This is how I run it (this works):
python -m ktransformers.server.main --architectures Qwen3MoeForCausalLM --model_path /root/models/Qwen/Qwen3-30B-A3B --gguf_path /root/models/unsloth/Qwen3-30B-A3B-GGUF/BF16 --optimize_config_path /root/ktransformers/ktransformers/optimize/optimize_rules/Qwen3Moe-serve-amx.yaml --backend_type balance_serve --cache_lens 32768 --chunk_size 512 --max_batch_size 8 --model_name "unsloth/Qwen3-30B-A3B"
And this fails with the error assert self.gate_type == GGMLQuantizationType.BF16 (so I guess it needs the BF16 format, but then how do I load a 4-bit quantized model?):
python -m ktransformers.server.main --architectures Qwen3MoeForCausalLM --model_path /root/models/unsloth/Qwen3-30B-A3B/ --gguf_path /mnt/models/Qwen/Qwen3-30B-A3B-Q4_K_M.gguf --optimize_config_path /root/ktransformers/ktransformers/optimize/optimize_rules/Qwen3Moe-serve-amx.yaml --backend_type balance_serve --model_name "unsloth/Qwen3-30B-A3B"