
Commit 0a28796

Update README.md
1 parent 56f5295 commit 0a28796


README.md

Lines changed: 19 additions & 14 deletions
@@ -8,7 +8,7 @@ The default behavior for CPU only operations is unchanged. When a GPU is present

## llama-bench
### No AMX
-
+```
numactl -N 2 -m 2 llama-bench -m /Qwen3-30B-A3B-Thinking-2507-Q4_0.gguf -t 32 --numa numactl -ngl 10 -nopo 1 -b 512 -ub 512 -pg 512,512 --repetitions 3
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
@@ -20,9 +20,10 @@ ggml_cuda_init: found 1 CUDA devices:
| qwen3moe 30B.A3B Q4_0 | 16.18 GiB | 30.53 B | CUDA | 10 | 32 | 512 | 1 | pp512 | 214.45 ± 0.11 |
| qwen3moe 30B.A3B Q4_0 | 16.18 GiB | 30.53 B | CUDA | 10 | 32 | 512 | 1 | tg128 | 45.67 ± 0.03 |
| qwen3moe 30B.A3B Q4_0 | 16.18 GiB | 30.53 B | CUDA | 10 | 32 | 512 | 1 | pp512+tg512 | 65.27 ± 0.13 |
+```

### With AMX
-
+```
numactl -N 2 -m 2 llama-bench -m /Qwen3-30B-A3B-Thinking-2507-Q4_0.gguf -t 32 --numa numactl -ngl 10 --amx -nopo 1 -b 512 -ub 512 -pg 512,512 --repetitions 3
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
@@ -34,15 +35,16 @@ ggml_cuda_init: found 1 CUDA devices:
| qwen3moe 30B.A3B Q4_0 | 16.18 GiB | 30.53 B | CUDA | 10 | 32 | 512 | 1 | 1 | pp512 | 284.08 ± 0.26 |
| qwen3moe 30B.A3B Q4_0 | 16.18 GiB | 30.53 B | CUDA | 10 | 32 | 512 | 1 | 1 | tg128 | 55.55 ± 0.26 |
| qwen3moe 30B.A3B Q4_0 | 16.18 GiB | 30.53 B | CUDA | 10 | 32 | 512 | 1 | 1 | pp512+tg512 | 77.62 ± 0.26 |
+```
+**PP512: +69.62 t/s (+32.47%)**
+**TG128: +9.88 t/s (+21.63%)**
+**PP512+TG512: +12.35 t/s (+18.92%)**

-### PP512 + 69.62 t/s (+32.47%)
-### TG128 + 9.88 t/s (+21.63%)
-### PP512+TG512 + 12.35 t/s (+18.92%)
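
These gains are straight arithmetic on the two llama-bench tables; a quick sanity check (a sketch, with the t/s values copied from the tables above; the results match the quoted figures to within ±0.01 of rounding):

```
# Recompute the llama-bench deltas from the table values above
awk 'BEGIN {
  n[1]="pp512";       base[1]=214.45; amx[1]=284.08;
  n[2]="tg128";       base[2]=45.67;  amx[2]=55.55;
  n[3]="pp512+tg512"; base[3]=65.27;  amx[3]=77.62;
  for (i = 1; i <= 3; i++)
    printf "%-12s +%.2f t/s (+%.2f%%)\n", n[i], amx[i]-base[i], (amx[i]/base[i]-1)*100;
}'
```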

## CLI performance:

### No AMX
-
+```
numactl -N 2 -m 2 /llama-cli -m /Qwen3-30B-A3B-Thinking-2507-Q4_0.gguf -ngl 10 -t 32 -b 4096 -c 4096 -n 512 --numa numactl -p "10 facts about birds" -no-cnv

llama_perf_sampler_print: sampling time = 62.16 ms / 517 runs ( 0.12 ms per token, 8316.84 tokens per second)
@@ -51,10 +53,10 @@ llama_perf_context_print: prompt eval time = 58.17 ms / 5 tokens ( 11
llama_perf_context_print: eval time = 12675.00 ms / 511 runs ( 24.80 ms per token, 40.32 tokens per second)
llama_perf_context_print: total time = 13012.05 ms / 516 tokens
llama_perf_context_print: graphs reused = 508
-
+```

### With AMX
-
+```
numactl -N 2 -m 2 /llama-cli -m /Qwen3-30B-A3B-Thinking-2507-Q4_0.gguf -ngl 10 --amx -t 32 -b 4096 -c 4096 -n 512 --numa numactl -p "10 facts about birds" -no-cnv

llama_perf_sampler_print: sampling time = 56.16 ms / 517 runs ( 0.11 ms per token, 9205.18 tokens per second)
@@ -63,17 +65,17 @@ llama_perf_context_print: prompt eval time = 51.53 ms / 5 tokens ( 10
llama_perf_context_print: eval time = 10416.81 ms / 511 runs ( 20.39 ms per token, 49.06 tokens per second)
llama_perf_context_print: total time = 10670.73 ms / 516 tokens
llama_perf_context_print: graphs reused = 508
-
-### Decode (generation): +8.74 t/s (+21.68%)
-### Prompt (prefill): +11.07 t/s (+12.88%)
-### Overall throughput: + 8.77 t/s (+21.64%)
+```
+**Decode (generation): +8.74 t/s (+21.68%)**
+**Prompt (prefill): +11.07 t/s (+12.88%)**
+**Overall throughput: +8.77 t/s (+21.64%)**
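
The decode and prefill gains follow the same arithmetic on the perf lines (a sketch; 85.96 and 97.03 t/s are the prefill rates implied by the quoted prompt-eval times of 5 tokens in 58.17 ms and 51.53 ms):

```
# Recompute decode/prefill gains from the llama-cli perf output above
awk 'BEGIN {
  printf "decode:  +%.2f t/s (+%.2f%%)\n", 49.06 - 40.32, (49.06/40.32 - 1)*100;
  # prefill t/s: 5 tokens / 58.17 ms = 85.96; 5 tokens / 51.53 ms = 97.03
  printf "prefill: +%.2f t/s (+%.2f%%)\n", 97.03 - 85.96, (97.03/85.96 - 1)*100;
}'
```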

## Instructions:

Build with all the normal AMX flags (unchanged from upstream), then add the new flag "--amx" to your run commands. "--amx" works with all executables, including llama-bench.
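
For example (an illustrative sketch: the model path and the surrounding flags are placeholders; only "--amx" is new):

```
# Hypothetical invocations; everything except --amx is a placeholder
./build/bin/llama-cli    -m ./models/Qwen3-30B-A3B-Thinking-2507-Q4_0.gguf -ngl 10 --amx -p "hello"
./build/bin/llama-server -m ./models/Qwen3-30B-A3B-Thinking-2507-Q4_0.gguf -ngl 10 --amx
```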

-## Copy and paste / pull and build (bash):
+### Copy and paste / pull and build (bash):

```
set -euo pipefail
@@ -103,7 +105,7 @@ cmake -S . -B build -G Ninja \

cmake --build build -j"$(nproc)"
```
-# Example Commands
+## Example Commands
```
# Bench (hybrid GPU+CPU AMX, no warmup)
./build/bin/llama-bench \
@@ -124,6 +126,9 @@ cmake --build build -j"$(nproc)"

## Thanks for helping me test!

+
+
+
---

# llama.cpp
