Commit 56f5295: Update README.md
1 parent 3b6e008 commit 56f5295

1 file changed: README.md (19 additions, 18 deletions)
@@ -1,13 +1,13 @@
-## gadflyii/llama.cpp
+# gadflyii/llama.cpp

 This fork enables Intel AMX acceleration for 4th, 5th, and 6th generation Xeon / Xeon-W processors in CPU / GPU hybrids. Upstream llama.cpp disables AMX if a GPU is detected, which slows performance on the layers / experts left on the CPU.

 The default behavior for CPU-only operation is unchanged. When a GPU is present and the cli/server/bench is started with the "--amx" flag, the CPU's extra buffers are exposed and preferred, enabling repacking and AMX acceleration on the CPU.

-# Initial testing results (Xeon 8592+):
+## Initial testing results (Xeon 8592+):

 ## llama-bench
-## No AMX
+### No AMX

 numactl -N 2 -m 2 llama-bench -m /Qwen3-30B-A3B-Thinking-2507-Q4_0.gguf -t 32 --numa numactl -ngl 10 -nopo 1 -b 512 -ub 512 -pg 512,512 --repetitions 3
 ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
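Before relying on the "--amx" path described in the hunk above, it is worth confirming that the host CPU actually advertises AMX. The check below is not part of the commit; it assumes a Linux host with a kernel new enough to expose the AMX feature flags in /proc/cpuinfo:

```bash
# AMX-capable Xeons (4th gen Scalable and newer) list amx_bf16 / amx_tile /
# amx_int8 among the CPU flags; no output means --amx will not help here.
grep -o 'amx_[a-z0-9]*' /proc/cpuinfo | sort -u
```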
@@ -21,7 +21,7 @@ ggml_cuda_init: found 1 CUDA devices:
 | qwen3moe 30B.A3B Q4_0 | 16.18 GiB | 30.53 B | CUDA | 10 | 32 | 512 | 1 | tg128 | 45.67 ± 0.03 |
 | qwen3moe 30B.A3B Q4_0 | 16.18 GiB | 30.53 B | CUDA | 10 | 32 | 512 | 1 | pp512+tg512 | 65.27 ± 0.13 |

-## With AMX
+### With AMX

 numactl -N 2 -m 2 llama-bench -m /Qwen3-30B-A3B-Thinking-2507-Q4_0.gguf -t 32 --numa numactl -ngl 10 --amx -nopo 1 -b 512 -ub 512 -pg 512,512 --repetitions 3
 ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
@@ -35,13 +35,13 @@ ggml_cuda_init: found 1 CUDA devices:
 | qwen3moe 30B.A3B Q4_0 | 16.18 GiB | 30.53 B | CUDA | 10 | 32 | 512 | 1 | 1 | tg128 | 55.55 ± 0.26 |
 | qwen3moe 30B.A3B Q4_0 | 16.18 GiB | 30.53 B | CUDA | 10 | 32 | 512 | 1 | 1 | pp512+tg512 | 77.62 ± 0.26 |

-## PP512 + 69.62 t/s (+32.47%)
-## TG128 + 9.88 t/s (+21.63%)
-## PP512+TG512 + 12.35 t/s (+18.92%)
+### PP512 + 69.62 t/s (+32.47%)
+### TG128 + 9.88 t/s (+21.63%)
+### PP512+TG512 + 12.35 t/s (+18.92%)

 ## CLI performance:

-## No AMX
+### No AMX

 numactl -N 2 -m 2 /llama-cli -m /Qwen3-30B-A3B-Thinking-2507-Q4_0.gguf -ngl 10 -t 32 -b 4096 -c 4096 -n 512 --numa numactl -p "10 facts about birds" -no-cnv
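The speedup figures quoted in the hunk above follow directly from the two llama-bench tables; as a quick sanity check (values copied from the tables, nothing re-measured):

```bash
# Recompute the tg128 delta: 45.67 t/s without AMX vs. 55.55 t/s with AMX.
# The same formula reproduces the pp512+tg512 line (65.27 -> 77.62).
awk 'BEGIN {
  no_amx = 45.67; with_amx = 55.55
  delta = with_amx - no_amx
  printf "TG128: +%.2f t/s (+%.2f%%)\n", delta, 100 * delta / no_amx
}'
# prints: TG128: +9.88 t/s (+21.63%)
```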

@@ -53,7 +53,7 @@ llama_perf_context_print: total time = 13012.05 ms / 516 tokens
 llama_perf_context_print: graphs reused = 508


-## With AMX
+### With AMX

 numactl -N 2 -m 2 /llama-cli -m /Qwen3-30B-A3B-Thinking-2507-Q4_0.gguf -ngl 10 --amx -t 32 -b 4096 -c 4096 -n 512 --numa numactl -p "10 facts about birds" -no-cnv

@@ -64,18 +64,18 @@ llama_perf_context_print: eval time = 10416.81 ms / 511 runs ( 20
 llama_perf_context_print: total time = 10670.73 ms / 516 tokens
 llama_perf_context_print: graphs reused = 508

-## Decode (generation): +8.74 t/s (+21.68%)
-## Prompt (prefill): +11.07 t/s (+12.88%)
-## Overall throughput: +8.77 t/s (+21.64%)
+### Decode (generation): +8.74 t/s (+21.68%)
+### Prompt (prefill): +11.07 t/s (+12.88%)
+### Overall throughput: +8.77 t/s (+21.64%)


-# Instructions:
+## Instructions:

 Build with all the normal AMX flags (unchanged from upstream), then pass the new "--amx" flag in your run commands. "--amx" works with all executables, including llama-bench.

-## Copy and paste pull and build (bash):
+## Copy and paste / pull and build (bash):

-'''
+```
 set -euo pipefail

 sudo apt-get update
@@ -102,9 +102,9 @@ cmake -S . -B build -G Ninja \
 -DGGML_AMX_BF16=ON

 cmake --build build -j"$(nproc)"
-'''
+```
 # Example Commands
-'''
+```
 # Bench (hybrid GPU+CPU AMX, no warmup)
 ./build/bin/llama-bench \
 --amx \
@@ -120,10 +120,11 @@ cmake --build build -j"$(nproc)"
 # Server (hybrid) – default port 8080
 ./build/bin/llama-server --amx \
 -m /path-to-your-model.gguf
-'''
+```

 ## Thanks for helping me test!

+---

 # llama.cpp

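For completeness: the Example Commands section of the README covers llama-bench and llama-server, and llama-cli accepts the same "--amx" flag (it is what the CLI test above uses). A minimal smoke-test sketch along those lines, with the model path, -ngl, and -t values as placeholders rather than anything from the commit:

```bash
# Hybrid GPU+CPU generation with AMX-accelerated CPU layers; adjust -ngl and -t
# to match your GPU memory and core count.
./build/bin/llama-cli --amx \
  -m /path-to-your-model.gguf \
  -ngl 10 -t 32 -n 128 \
  -p "10 facts about birds" -no-cnv
```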