AMX QWEN support #1356
Unanswered

voipmonitor asked this question in Q&A
            Replies: 1 comment 3 replies
-
Hey @voipmonitor, I have been struggling with this as well; I have been unable to find this information.
-
Hello,
Here: https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/AMX.md there is a note saying "At present, Qwen3MoE running with AMX can only read BF16 GGUF; support for loading from safetensor will be added later.", which confuses me. Does it mean that we cannot run 4-bit quantized versions of Qwen using the AMX feature? Is it possible to use Qwen3-235B-A22B-GGUF with AMX? The BF16 version is around 450 GB. Can anyone point me to a repo with a GGUF that I can run with AMX via --optimize_config_path ktransformers/optimize/optimize_rules/Qwen3Moe-serve-amx.yaml, please?
edit: I have figured it out. At the moment the AMX backend can only read the BF16 model variant, but the backend engine can be switched to "AMXInt8" or "AMXBF16" in the file ktransformers/optimize/optimize_rules/Qwen3Moe-serve-amx.yaml.
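For anyone else landing here: the backend switch described above amounts to changing one value in the optimize-rules YAML. The fragment below is an illustrative sketch, not the actual contents of Qwen3Moe-serve-amx.yaml; the match pattern and class path are assumptions based on the general shape of ktransformers injection rules, and only the `backend` value is the change being described.

```yaml
# Illustrative sketch of a ktransformers optimize-rule entry.
# The match regex and class path below are assumptions for illustration;
# check the real Qwen3Moe-serve-amx.yaml for the exact values.
- match:
    name: "^model\\.layers\\..*\\.mlp\\.experts$"   # MoE expert modules (assumed pattern)
  replace:
    class: ktransformers.operators.experts.KTransformersExperts  # assumed class path
    kwargs:
      backend: "AMXInt8"   # switch to "AMXBF16" to use the BF16 AMX kernel instead
```

Note that this selects which AMX compute kernel is used at runtime; per the linked doc, the weights loaded from disk still have to be the BF16 GGUF.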