AMX Improved Performance #2
trilog-inc started this conversation in General
Replies: 2 comments
- Thanks! Nice to see such an improvement from just running the MoE on the CPU while keeping everything else on the GPU. Thanks again for testing this out.
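For readers skimming the thread, that split comes down to two of the flags used in the full command further down. A minimal sketch (placeholder model path; the flags themselves are the ones used later in this thread) would be:
./build/bin/llama-server --model /path/to/moe-model.gguf --n-gpu-layers 99 --cpu-moe
Here --n-gpu-layers 99 offloads all layers to the GPU, and --cpu-moe then keeps the MoE expert weights in system RAM, so only the attention/dense tensors occupy VRAM.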
- @Gadflyii Hi! Can you sync up with mainline llama.cpp? I would like to test the new GLM4.6 model.
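For anyone wanting to try that sync themselves while waiting, the usual fork-update flow is roughly the following (the URL is mainline llama.cpp; the remote name and target branch are assumptions about this fork's layout, not details from the thread):
git remote add upstream https://github.com/ggml-org/llama.cpp.git
git fetch upstream
git merge upstream/master
followed by rebuilding with the cmake flags shown below.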
- Hi, I went ahead and rebuilt the project with the build parameters you listed:
cmake -B build -DGGML_NATIVE=ON -DGGML_CUDA=ON -DGGML_AMX_TILE=ON -DGGML_AMX_INT8=ON -DGGML_AMX_BF16=ON
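As a quick sanity check that the CPU actually exposes AMX before crediting these build flags, the instruction flags Linux reports on AMX-capable Xeons (amx_tile, amx_int8, amx_bf16) can be listed with something like:
lscpu | grep -o 'amx[^ ]*' | sort -u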
The TG (token generation) rate increased significantly, from 7.5 to 10.74 t/s (~43% increase). Prompt eval didn't really budge. (See the llama-bench sketch at the end of this post for one way to reproduce these numbers.)
Using this command:
./build/bin/llama-server --model /mnt/home_extend/models/unsloth_DeepSeek-V3.1-GGUF/UD-Q4_K_XL/DeepSeek-V3.1-UD-Q4_K_XL-00001-of-00008.gguf --alias ds3.1 --threads 44 --ctx-size 100000 --n-gpu-layers 99 --cpu-moe --temp 0.6 --top-p 0.95 -fa 1 --host 0.0.0.0 --jinja --port 8099 --threads 44 --amx -ub 8192 -b 8192
Very interesting! Going to keep testing. Maybe this can be merged with ik_llama to further increase performance of MoE models.
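To reproduce the prompt-eval/TG figures above in a more controlled way, llama-bench from the same build is the usual tool. This is only a sketch, since it assumes this fork's llama-bench accepts --cpu-moe the same way llama-server does (if it does not, the timing summary that llama-server prints per request gives comparable numbers):
./build/bin/llama-bench -m /mnt/home_extend/models/unsloth_DeepSeek-V3.1-GGUF/UD-Q4_K_XL/DeepSeek-V3.1-UD-Q4_K_XL-00001-of-00008.gguf -ngl 99 -t 44 -fa 1 -p 8192 -n 128 --cpu-moe
llama-bench reports pp (prompt processing) and tg (token generation) in tokens per second, which map directly onto the prompt-eval and TG numbers quoted above.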