## gadflyii/llama.cpp

This fork enables Intel AMX acceleration for 4th, 5th, and 6th generation Xeon / Xeon W processors in CPU / GPU hybrids. Upstream llama.cpp disables AMX when a GPU is detected, which slows performance on layers / experts offloaded to the CPU.

The default behavior for CPU-only operation is unchanged. When a GPU is present and the CLI / server / bench is started with the "--amx" flag, the CPU's extra buffer types are exposed and preferred, enabling weight repacking and AMX acceleration on the CPU.
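
Before using "--amx", you can confirm that the host CPU and kernel actually expose AMX. A minimal check, assuming Linux (supported parts report the amx_tile, amx_int8, and amx_bf16 flags):

```bash
# List the AMX-related CPU flags advertised by the kernel.
# Expect amx_bf16, amx_int8, and amx_tile on 4th-gen and newer Xeons.
grep -o 'amx_[a-z0-9]*' /proc/cpuinfo | sort -u
```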

# Initial testing results (Xeon 8592+):

## llama-bench
## No AMX
```
numactl -N 2 -m 2 llama-bench -m /Qwen3-30B-A3B-Thinking-2507-Q4_0.gguf -t 32 --numa numactl -ngl 10 -nopo 1 -b 512 -ub 512 -pg 512,512 --repetitions 3
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl | threads | n_batch | nopo |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------: | ---: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q4_0          |  16.18 GiB |    30.53 B | CUDA       |  10 |      32 |     512 |    1 |           pp512 |        214.45 ± 0.11 |
| qwen3moe 30B.A3B Q4_0          |  16.18 GiB |    30.53 B | CUDA       |  10 |      32 |     512 |    1 |           tg128 |         45.67 ± 0.03 |
| qwen3moe 30B.A3B Q4_0          |  16.18 GiB |    30.53 B | CUDA       |  10 |      32 |     512 |    1 |     pp512+tg512 |         65.27 ± 0.13 |
```

## With AMX

```
numactl -N 2 -m 2 llama-bench -m /Qwen3-30B-A3B-Thinking-2507-Q4_0.gguf -t 32 --numa numactl -ngl 10 --amx -nopo 1 -b 512 -ub 512 -pg 512,512 --repetitions 3
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl | threads | n_batch |       amx | nopo |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------: | --------: | ---: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q4_0          |  16.18 GiB |    30.53 B | CUDA       |  10 |      32 |     512 |         1 |    1 |           pp512 |        284.08 ± 0.26 |
| qwen3moe 30B.A3B Q4_0          |  16.18 GiB |    30.53 B | CUDA       |  10 |      32 |     512 |         1 |    1 |           tg128 |         55.55 ± 0.26 |
| qwen3moe 30B.A3B Q4_0          |  16.18 GiB |    30.53 B | CUDA       |  10 |      32 |     512 |         1 |    1 |     pp512+tg512 |         77.62 ± 0.26 |
```

## PP512: +69.62 t/s (+32.47%)
## TG128: +9.88 t/s (+21.63%)
## PP512+TG512: +12.35 t/s (+18.92%)
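
For reference, the percentage gains quoted here and below are simple ratios against the corresponding no-AMX baseline; a quick recomputation of the first two:

```bash
# Recompute the pp512 and tg128 gains from the llama-bench tables above.
awk 'BEGIN {
  printf "pp512: +%.2f%%\n", (284.08 - 214.45) / 214.45 * 100;  # -> +32.47%
  printf "tg128: +%.2f%%\n", (55.55 - 45.67) / 45.67 * 100;     # -> +21.63%
}'
```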

## CLI performance:

## No AMX

```
numactl -N 2 -m 2 /llama-cli -m /Qwen3-30B-A3B-Thinking-2507-Q4_0.gguf -ngl 10 -t 32 -b 4096 -c 4096 -n 512 --numa numactl -p "10 facts about birds" -no-cnv

llama_perf_sampler_print: sampling time = 62.16 ms / 517 runs ( 0.12 ms per token, 8316.84 tokens per second)
llama_perf_context_print: load time = 1327.17 ms
llama_perf_context_print: prompt eval time = 58.17 ms / 5 tokens ( 11.63 ms per token, 85.96 tokens per second)
llama_perf_context_print: eval time = 12675.00 ms / 511 runs ( 24.80 ms per token, 40.32 tokens per second)
llama_perf_context_print: total time = 13012.05 ms / 516 tokens
llama_perf_context_print: graphs reused = 508
```

## With AMX

```
numactl -N 2 -m 2 /llama-cli -m /Qwen3-30B-A3B-Thinking-2507-Q4_0.gguf -ngl 10 --amx -t 32 -b 4096 -c 4096 -n 512 --numa numactl -p "10 facts about birds" -no-cnv

llama_perf_sampler_print: sampling time = 56.16 ms / 517 runs ( 0.11 ms per token, 9205.18 tokens per second)
llama_perf_context_print: load time = 9817.13 ms
llama_perf_context_print: prompt eval time = 51.53 ms / 5 tokens ( 10.31 ms per token, 97.03 tokens per second)
llama_perf_context_print: eval time = 10416.81 ms / 511 runs ( 20.39 ms per token, 49.06 tokens per second)
llama_perf_context_print: total time = 10670.73 ms / 516 tokens
llama_perf_context_print: graphs reused = 508
```

## Decode (generation): +8.74 t/s (+21.68%)
## Prompt (prefill): +11.07 t/s (+12.88%)
## Overall throughput: +8.77 t/s (+21.64%)

# Instructions:

Build with all the normal AMX flags (unchanged from upstream), then add the new "--amx" flag to your run commands. You can use "--amx" with all executables, including llama-bench.

## Copy-and-paste clone and build (bash):

```bash
set -euo pipefail

# 1) System packages (compiler toolchain, CMake, Ninja, perf tools, Python venv)
sudo apt-get update
sudo apt-get install -y \
  build-essential cmake ninja-build git pkg-config \
  python3-venv python3-pip python3-dev \
  linux-tools-common linux-tools-$(uname -r)

# 2) Python virtual environment
mkdir -p ~/venvs
python3 -m venv ~/venvs/amxllama
source ~/venvs/amxllama/bin/activate
python -m pip install -U pip

# 3) Clone this fork
mkdir -p ~/src
git clone https://github.com/Gadflyii/llama.cpp.git ~/src/amx-llama.cpp
cd ~/src/amx-llama.cpp

# 4) Configure CMake (AMX on, CUDA on)
#    - GGML_NATIVE=ON             : enable host-specific CPU optimizations
#    - GGML_CUDA=ON               : enable CUDA backend (requires CUDA/cuBLAS installed)
#    - GGML_AMX_TILE/INT8/BF16=ON : enable AMX paths
cmake -S . -B build -G Ninja \
  -DCMAKE_BUILD_TYPE=Release \
  -DGGML_NATIVE=ON \
  -DGGML_CUDA=ON \
  -DGGML_AMX_TILE=ON \
  -DGGML_AMX_INT8=ON \
  -DGGML_AMX_BF16=ON

# 5) Build
cmake --build build -j"$(nproc)"
```
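
After building, you can sanity-check that AMX support was compiled in. A minimal check, assuming the usual llama.cpp system_info banner that lists CPU features (AMX_INT8, etc.) at startup; the model path is a placeholder:

```bash
# Run one token of generation and grep the startup output for AMX features.
./build/bin/llama-cli -m /path-to-your-model.gguf -p "hi" -n 1 -no-cnv 2>&1 | grep -i "amx"
```

If the build is correct, the printed feature list should include the AMX entries.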

## Example startup and benchmark commands (we recommend running under numactl and adjusting the thread count to match your NUMA node):

```bash
# Bench (hybrid GPU+CPU AMX, no warmup)
./build/bin/llama-bench \
  --amx \
  -m /path-to-your-model.gguf \
  -t 32 -ngl 10 -b 256 -ub 256 -pg 1024,1024 --no-warmup

# CLI (hybrid) quick generation
./build/bin/llama-cli \
  --amx \
  -m /path-to-your-model.gguf \
  -t 32 -ngl 10 -c 4096 -n 64 -p "10 facts about birds" --no-warmup

# Server (hybrid) – default port 8080
./build/bin/llama-server --amx \
  -m /path-to-your-model.gguf
```
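
Once llama-server is running, a quick smoke test from another shell (this assumes the default port 8080 and the OpenAI-compatible chat endpoint exposed by upstream llama-server):

```bash
# Send a single chat request to the local server and print the JSON response.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [ { "role": "user", "content": "10 facts about birds" } ],
        "max_tokens": 128
      }'
```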

## Thanks for helping me test!

# llama.cpp
|