@@ -6,7 +6,7 @@ The default behavior for CPU only operations is unchanged. When a GPU is present
 
 ## Initial testing results (Xeon 8592+):
 
-## llama-bench
+## llama-bench:
 ### No AMX
 ```
 numactl -N 2 -m 2 llama-bench -m /Qwen3-30B-A3B-Thinking-2507-Q4_0.gguf -t 32 --numa numactl -ngl 10 -nopo 1 -b 512 -ub 512 -pg 512,512 --repetitions 3
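
For readers less familiar with llama-bench, here is the same invocation restated with each flag annotated. The annotations reflect my reading of the llama-bench help text rather than anything in the diff; in particular, `-nopo` is assumed to be the short form of `--no-op-offload`.

```sh
# Annotated restatement of the benchmark command above (not part of the diff).
#
#   numactl -N 2 -m 2   bind CPU execution and memory allocation to NUMA node 2
#   -m <model>.gguf     the Q4_0-quantized Qwen3-30B-A3B model under test
#   -t 32               32 CPU threads
#   --numa numactl      defer NUMA placement to the external numactl wrapper
#   -ngl 10             offload 10 layers to the CUDA device; the rest run on the CPU
#   -nopo 1             assumed to be --no-op-offload, keeping individual ops on the CPU side
#   -b 512 -ub 512      logical batch size and physical micro-batch size
#   -pg 512,512         add a combined 512-token prompt + 512-token generation test
#   --repetitions 3     run each test 3 times and report mean ± stddev
numactl -N 2 -m 2 llama-bench \
  -m /Qwen3-30B-A3B-Thinking-2507-Q4_0.gguf \
  -t 32 --numa numactl -ngl 10 -nopo 1 \
  -b 512 -ub 512 -pg 512,512 --repetitions 3
```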
@@ -36,9 +36,9 @@ ggml_cuda_init: found 1 CUDA devices:
 | qwen3moe 30B.A3B Q4_0 | 16.18 GiB | 30.53 B | CUDA | 10 | 32 | 512 | 1 | 1 | tg128 | 55.55 ± 0.26 |
 | qwen3moe 30B.A3B Q4_0 | 16.18 GiB | 30.53 B | CUDA | 10 | 32 | 512 | 1 | 1 | pp512+tg512 | 77.62 ± 0.26 |
 ```
-** PP512 + 69.62 t/s (+32.47%)**
-** TG128 + 9.88 t/s (+21.63%)**
-** PP512+TG512 + 12.35 t/s (+18.92%)**
+- ** PP512 + 69.62 t/s (+32.47%)**
+- ** TG128 + 9.88 t/s (+21.63%)**
+- ** PP512+TG512 + 12.35 t/s (+18.92%)**
 
 
 ## CLI performance:
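
The llama-bench summary bullets in the hunk above can be sanity-checked against the table: each percentage is simply the gain divided by the no-AMX baseline. The baseline rows fall outside the hunk shown, so the sketch below infers the tg128 baseline from the stated gain; the 55.55 t/s AMX figure comes from the table.

```sh
# Consistency check for the TG128 bullet (not part of the diff).
awk 'BEGIN {
  amx   = 55.55           # tg128 with AMX, from the table above
  delta = 9.88            # stated absolute gain
  base  = amx - delta     # implied no-AMX baseline (its row is outside this hunk)
  printf "TG128: +%.2f t/s (+%.2f%%)\n", delta, 100 * delta / base
}'
# prints: TG128: +9.88 t/s (+21.63%), matching the bullet above
```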
@@ -66,9 +66,9 @@ llama_perf_context_print: eval time = 10416.81 ms / 511 runs ( 20
 llama_perf_context_print: total time = 10670.73 ms / 516 tokens
 llama_perf_context_print: graphs reused = 508
 ```
-** Decode (generation): +8.74 t/s (+21.68%)**
-** Prompt (prefill): +11.07 t/s (+12.88%)**
-** Overall throughput: + 8.77 t/s (+21.64%)**
+- ** Decode (generation): +8.74 t/s (+21.68%)**
+- ** Prompt (prefill): +11.07 t/s (+12.88%)**
+- ** Overall throughput: + 8.77 t/s (+21.64%)**
 
 
 ## Instructions:
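
The decode bullet can be cross-checked the same way against the llama_perf_context_print context line in the hunk header above (511 eval runs in 10416.81 ms, which appears to be the AMX run): that works out to roughly 49.05 t/s, and subtracting the stated +8.74 t/s gain recovers the implied no-AMX decode rate.

```sh
# Consistency check for the decode bullet (not part of the diff).
awk 'BEGIN {
  amx   = 511 / (10416.81 / 1000)   # eval time line above: 511 runs in 10416.81 ms -> ~49.05 t/s
  delta = 8.74                      # stated absolute gain
  base  = amx - delta               # implied no-AMX decode rate
  printf "Decode: +%.2f t/s (+%.2f%%)\n", delta, 100 * delta / base
}'
# prints: Decode: +8.74 t/s (+21.68%), matching the bullet above
```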