Commit 340e038 ("Update README.md"), parent 485db7d

1 file changed: README.md (138 additions, 5 deletions)

## gadflyii/llama.cpp

This fork enables Intel AMX acceleration for 4th, 5th, and 6th generation Xeon / Xeon-W processors in CPU / GPU hybrids. Upstream llama.cpp disables AMX when a GPU is detected, which slows performance on layers / experts offloaded to the CPU.

The default behavior for CPU-only operation is unchanged. When a GPU is present and the CLI / server / bench is started with the "--amx" flag, the CPU's extra buffer types are exposed and preferred, enabling weight repacking and AMX acceleration on the CPU.
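
As a quick sanity check before using the flag (standard Linux tooling, not part of this fork), you can confirm the host CPU actually exposes AMX:

```
# Should list amx_bf16, amx_int8, amx_tile on AMX-capable Xeons
grep -o 'amx_[a-z0-9]*' /proc/cpuinfo | sort -u
```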

# Initial testing results (Xeon 8592+):

## llama-bench
## No AMX
```
numactl -N 2 -m 2 llama-bench -m /Qwen3-30B-A3B-Thinking-2507-Q4_0.gguf -t 32 --numa numactl -ngl 10 -nopo 1 -b 512 -ub 512 -pg 512,512 --repetitions 3
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl | threads | n_batch | nopo |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------: | ---: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q4_0          |  16.18 GiB |    30.53 B | CUDA       |  10 |      32 |     512 |    1 |           pp512 |        214.45 ± 0.11 |
| qwen3moe 30B.A3B Q4_0          |  16.18 GiB |    30.53 B | CUDA       |  10 |      32 |     512 |    1 |           tg128 |         45.67 ± 0.03 |
| qwen3moe 30B.A3B Q4_0          |  16.18 GiB |    30.53 B | CUDA       |  10 |      32 |     512 |    1 |     pp512+tg512 |         65.27 ± 0.13 |
```

## With AMX

```
numactl -N 2 -m 2 llama-bench -m /Qwen3-30B-A3B-Thinking-2507-Q4_0.gguf -t 32 --numa numactl -ngl 10 --amx -nopo 1 -b 512 -ub 512 -pg 512,512 --repetitions 3
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl | threads | n_batch |       amx | nopo |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------: | --------: | ---: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q4_0          |  16.18 GiB |    30.53 B | CUDA       |  10 |      32 |     512 |         1 |    1 |           pp512 |        284.08 ± 0.26 |
| qwen3moe 30B.A3B Q4_0          |  16.18 GiB |    30.53 B | CUDA       |  10 |      32 |     512 |         1 |    1 |           tg128 |         55.55 ± 0.26 |
| qwen3moe 30B.A3B Q4_0          |  16.18 GiB |    30.53 B | CUDA       |  10 |      32 |     512 |         1 |    1 |     pp512+tg512 |         77.62 ± 0.26 |
```

## PP512: +69.62 t/s (+32.47%)
## TG128: +9.88 t/s (+21.63%)
## PP512+TG512: +12.35 t/s (+18.92%)

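The deltas above come straight from the two llama-bench tables; as a quick way to recompute them from the rounded t/s values (tiny rounding differences are expected), something like this works:

```
# Recompute gains from the table values above (no AMX vs. --amx)
awk 'BEGIN {
  split("pp512 tg128 pp512+tg512", name, " ");
  base[1]=214.45; amx[1]=284.08;
  base[2]=45.67;  amx[2]=55.55;
  base[3]=65.27;  amx[3]=77.62;
  for (i = 1; i <= 3; i++)
    printf "%-12s +%.2f t/s (+%.2f%%)\n", name[i], amx[i]-base[i], (amx[i]-base[i])/base[i]*100;
}'
```
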
## CLI performance:
## No AMX

```
numactl -N 2 -m 2 /llama-cli -m /Qwen3-30B-A3B-Thinking-2507-Q4_0.gguf -ngl 10 -t 32 -b 4096 -c 4096 -n 512 --numa numactl -p "10 facts about birds" -no-cnv

llama_perf_sampler_print: sampling time = 62.16 ms / 517 runs ( 0.12 ms per token, 8316.84 tokens per second)
llama_perf_context_print: load time = 1327.17 ms
llama_perf_context_print: prompt eval time = 58.17 ms / 5 tokens ( 11.63 ms per token, 85.96 tokens per second)
llama_perf_context_print: eval time = 12675.00 ms / 511 runs ( 24.80 ms per token, 40.32 tokens per second)
llama_perf_context_print: total time = 13012.05 ms / 516 tokens
llama_perf_context_print: graphs reused = 508
```

## With AMX

```
numactl -N 2 -m 2 /llama-cli -m /Qwen3-30B-A3B-Thinking-2507-Q4_0.gguf -ngl 10 --amx -t 32 -b 4096 -c 4096 -n 512 --numa numactl -p "10 facts about birds" -no-cnv

llama_perf_sampler_print: sampling time = 56.16 ms / 517 runs ( 0.11 ms per token, 9205.18 tokens per second)
llama_perf_context_print: load time = 9817.13 ms
llama_perf_context_print: prompt eval time = 51.53 ms / 5 tokens ( 10.31 ms per token, 97.03 tokens per second)
llama_perf_context_print: eval time = 10416.81 ms / 511 runs ( 20.39 ms per token, 49.06 tokens per second)
llama_perf_context_print: total time = 10670.73 ms / 516 tokens
llama_perf_context_print: graphs reused = 508
```

## Decode (generation): +8.74 t/s (+21.68%)
## Prompt (prefill): +11.07 t/s (+12.88%)
## Overall throughput: +8.77 t/s (+21.64%)

# Instructions:

Build with all the normal AMX flags (unchanged from upstream), then add the new "--amx" flag to your run commands. You can use "--amx" with all executables, including llama-bench.

## Copy and paste pull and build (bash):

```
set -euo pipefail

# 1) System packages (compiler toolchain, cmake, Ninja optional, perf tools, Python venv)
sudo apt-get update
sudo apt-get install -y \
  build-essential cmake ninja-build git pkg-config \
  python3-venv python3-pip python3-dev \
  linux-tools-common linux-tools-$(uname -r)

# 2) Python virtual environment
mkdir -p ~/venvs
python3 -m venv ~/venvs/amxllama
source ~/venvs/amxllama/bin/activate
python -m pip install -U pip

# 3) Clone this fork
mkdir -p ~/src
git clone https://github.com/Gadflyii/llama.cpp.git ~/src/amx-llama.cpp
cd ~/src/amx-llama.cpp

# 4) Configure CMake (AMX on, CUDA on)
#  - GGML_NATIVE=ON             : enable host-specific CPU optimizations
#  - GGML_CUDA=ON               : enable CUDA backend (requires CUDA/cuBLAS installed)
#  - GGML_AMX_TILE/INT8/BF16=ON : enable AMX paths
cmake -S . -B build -G Ninja \
  -DCMAKE_BUILD_TYPE=Release \
  -DGGML_NATIVE=ON \
  -DGGML_CUDA=ON \
  -DGGML_AMX_TILE=ON \
  -DGGML_AMX_INT8=ON \
  -DGGML_AMX_BF16=ON

# 5) Build
cmake --build build -j"$(nproc)"
```
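
After the build finishes, an optional sanity check is to confirm AMX was compiled in; llama.cpp normally prints the CPU feature set in a system_info line at startup (exact wording can vary by version, and the model path below is the same placeholder used in the examples that follow):

```
# Expect AMX-related entries (e.g. AMX_INT8 = 1) in the startup system info
./build/bin/llama-cli -m /path-to-your-model.gguf -p "hi" -n 1 -no-cnv 2>&1 | grep -i amx
```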

## Example startup and benchmark commands (recommended: run under numactl and match the thread count to your NUMA node; see the quick topology check below):
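
To pick the node number and the -t value, it helps to look at the machine's NUMA layout first (standard numactl / lscpu tooling, not specific to this fork):

```
# Show NUMA nodes, their CPUs, and memory; match -N/-m and -t to one node
numactl --hardware
lscpu | grep -i numa
```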

```
# Bench (hybrid GPU+CPU AMX, no warmup)
./build/bin/llama-bench \
  --amx \
  -m /path-to-your-model.gguf \
  -t 32 -ngl 10 -b 256 -ub 256 -pg 1024 --no-warmup

# CLI (hybrid) quick generation
./build/bin/llama-cli \
  --amx \
  -m /path-to-your-model.gguf \
  -t 32 -ngl 10 -c 4096 -n 64 -p "10 facts about birds" --no-warmup

# Server (hybrid) – default port 8080
./build/bin/llama-server --amx \
  -m /path-to-your-model.gguf
```
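
Once the server is running, a minimal smoke test (assuming the default host and port and llama-server's OpenAI-compatible chat endpoint) looks like this:

```
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"10 facts about birds"}],"max_tokens":64}'
```
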
## Thanks for helping me test!

# llama.cpp
