Worse performance with --amx #3
Replies: 2 comments · 5 replies
-
Tried with FA off, still slower.
Without AMX, FA off:
With AMX, FA off:
-
Can you run with verbose (-v)? Are you running in HBM-only mode, or as level 4 cache?
-
HBM-only mode, don't even have any DIMMs installed. Here's the complete run with --amx -v:
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
system_info: n_threads = 28 (n_threads_batch = 28) / 112 | CUDA : ARCHS = 890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | AMX_INT8 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
n_ctx: 8192, add_bos: 0
embd_inp.size(): 10, n_consumed: 0
eval: [ '. ':382 ]
eval: [ '. ':382 ]
eval: [ ' ':4710 ]
eval: [ ' ':4710 ]
eval: [ '. ':382 ]
eval: [ '. ':382 ]
eval: [ '. ':382 ]
eval: [ ' ':4710 ]
llama_perf_sampler_print: sampling time = 62.15 ms / 522 runs ( 0.12 ms per token, 8398.63 tokens per second)
-
Switched to a smaller model so I could use your new command as-is and still fit in GPU:
numactl -N 2 -m 2 /root/llama.cpp/build/bin/llama-cli -m /mnt/vm100/quants/Ling-lite-1.5-2507.i1-Q4_K_M.gguf -ngl 99 --amx --cpu-moe -t 14 -b 4096 -c 4096 -n 512 --numa numactl -p "The quick brown fox jumps over the lazy dog many times. A curious cat watches carefully from the garden wall nearby. Birds sing softly in the morning air, while the sun rises gently above the hills. Children walk slowly to school carrying bright backpacks filled with books, pencils, and small notes. The teacher greets them warmly at the classroom door. Lessons begin with stories about science, history, art, and music. Ideas flow clearly and simply, creating a calm rhythm of learning. Friends share smiles, trade sandwiches, and laugh during the short break. The day continues peacefully until the afternoon bell finally rings." -no-cnv
Without --amx:
llama_perf_context_print: prompt eval time = 961.61 ms / 121 tokens ( 7.95 ms per token, 125.83 tokens per second)
With --amx:
llama_perf_context_print: prompt eval time = 1291.65 ms / 121 tokens ( 10.67 ms per token, 93.68 tokens per second)
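For repeatability, here is a minimal back-to-back version of that comparison (a sketch, not part of the original reply); the model path, thread count, and flags are the ones quoted in the run above, and the short prompt placeholder would be replaced with the full prompt used there:
MODEL=/mnt/vm100/quants/Ling-lite-1.5-2507.i1-Q4_K_M.gguf
PROMPT="The quick brown fox jumps over the lazy dog many times."   # substitute the full prompt from the run above
for AMX in "" "--amx"; do
  echo "=== --amx: ${AMX:-off} ==="
  numactl -N 2 -m 2 /root/llama.cpp/build/bin/llama-cli -m "$MODEL" \
    -ngl 99 --cpu-moe -t 14 -b 4096 -c 4096 -n 512 --numa numactl $AMX \
    -p "$PROMPT" -no-cnv 2>&1 | grep llama_perf   # keep only the perf summary lines
done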
-
Also, in a new terminal instance, run this while the tests are running:
sudo perf stat -a -e exe.amx_busy,cycles -- sleep 30
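A rough way to combine the two in one script (a sketch; the paths, model, and flags are taken from the reply above, and the prompt is a hypothetical stand-in): a near-zero exe.amx_busy count during the --amx run would suggest the AMX tiles are not actually being exercised.
numactl -N 2 -m 2 /root/llama.cpp/build/bin/llama-cli \
  -m /mnt/vm100/quants/Ling-lite-1.5-2507.i1-Q4_K_M.gguf \
  -ngl 99 --amx --cpu-moe -t 14 -b 4096 -c 4096 -n 512 --numa numactl \
  -p "benchmark prompt" -no-cnv &                     # run under test in the background
sleep 5                                               # let prompt processing start
sudo perf stat -a -e exe.amx_busy,cycles -- sleep 30  # system-wide 30 s sample
wait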
-
Hmm, try this:
numactl -N 1 -m 1 ~/src/llama.cpp/build/bin/llama-bench -m /XXXX.gguf -t 16 --amx --numa numactl -ngl 10 -nopo 1 -b 512 -ub 512 -pg 512,512 --repetitions 3
Then run it again, but without "-nopo 1".
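The same two runs can be scripted as one loop (a sketch; /XXXX.gguf stays as the placeholder from the command above):
for NOPO in "-nopo 1" ""; do
  echo "=== extra flags: ${NOPO:-none} ==="
  numactl -N 1 -m 1 ~/src/llama.cpp/build/bin/llama-bench -m /XXXX.gguf \
    -t 16 --amx --numa numactl -ngl 10 $NOPO -b 512 -ub 512 -pg 512,512 --repetitions 3
done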
-
root@wen:~# numactl -N 1 -m 1 llama.cpp-20250915-AMX/build/bin/llama-bench -m quants/Ling-lite-1.5-2507.i1-Q4_K_M.gguf -t 14 --amx --numa numactl -ngl 10 -nopo 1 -b 512 -ub 512 -pg 512,512 --repetitions 3
build: 71cc890 (6461)
build: 71cc890 (6461)
-
Gen 4 (Sapphire Rapids) Xeon CPU MAX 9480 w/ 64GB HBM + RTX 4070 Ti Super
Build script (probably redundant stuff in there):
CUDACXX=/usr/local/cuda/bin/nvcc cmake -B build \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_NATIVE=ON \
-DGGML_AVX512=ON \
-DGGML_AVX512_BF16=ON \
-DGGML_AVX512_VBMI=ON \
-DGGML_AVX512_VNNI=ON \
-DGGML_AMX=ON \
-DGGML_AMX_TILE=ON \
-DGGML_AMX_INT8=ON \
-DGGML_AMX_BF16=ON \
-DGGML_CUDA=ON \
-DGGML_CUDA_ARCH=89 \
-DCMAKE_CXX_FLAGS="-O3 -march=sapphirerapids -mtune=sapphirerapids"
cmake --build build --config Release -j 56
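Before comparing runs, a hypothetical sanity check (not part of the original post) that both the kernel and this binary expose AMX:
lscpu | grep -o 'amx[_a-z0-9]*' | sort -u   # should list amx_bf16, amx_int8, amx_tile
build/bin/llama-cli --version               # record the exact build under test
The system_info banner printed at startup should then report AMX_INT8 = 1, as it does in the log below.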
Launch script (without and with --amx):
echo 3 > /proc/sys/vm/drop_caches
numactl --interleave=0,1,2,3 \
build/bin/llama-cli --jinja \
-m /quants/GLM-4.5-Air-Q4_K_S-00001-of-00002.gguf \
-ngl 999 --n-cpu-moe 40 --amx \
-c 16384 -fa on --numa distribute -t 56 -n 512 -p "Write a complete novel about the AI Takeover." -no-cnv
system_info: n_threads = 56 (n_threads_batch = 56) / 112 | CUDA : ARCHS = 890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | AMX_INT8 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
Without AMX:
load_tensors: offloaded 48/48 layers to GPU
load_tensors: CUDA0 model buffer size = 11458.75 MiB
load_tensors: CPU_Mapped model buffer size = 46976.56 MiB
load_tensors: CPU_Mapped model buffer size = 6183.31 MiB
llama_perf_context_print: prompt eval time = 77.53 ms / 4 tokens ( 19.38 ms per token, 51.59 tokens per second)
llama_perf_context_print: eval time = 14752.42 ms / 511 runs ( 28.87 ms per token, 34.64 tokens per second)
With AMX:
load_tensors: offloaded 48/48 layers to GPU
load_tensors: CUDA0 model buffer size = 11458.75 MiB
load_tensors: CPU_REPACK model buffer size = 30888.00 MiB
load_tensors: CPU_Mapped model buffer size = 46976.56 MiB
load_tensors: CPU_Mapped model buffer size = 4523.67 MiB
llama_perf_context_print: prompt eval time = 576.45 ms / 4 tokens ( 144.11 ms per token, 6.94 tokens per second)
llama_perf_context_print: eval time = 21230.37 ms / 511 runs ( 41.55 ms per token, 24.07 tokens per second)
I did a few different runs, and it always seems slower for some reason.
Benchmark command gave this error:
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 Ti SUPER, compute capability 8.9, VMM: yes