forked from ggml-org/llama.cpp
Tested on Windows 11 with 512 GB DDR5 and a Xeon Platinum 8592+ (Emerald Rapids).
Of the thread counts tested, 32 threads was the fastest; those results are shown below.
============================================================================
TEST 1: AMX Repository Settings (32 threads)
============================================================================
[1A] WITH AMX Acceleration:
| model | size | params | backend | threads | n_batch | amx | nopo | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------: | --------------: | ---: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q4_0 | 16.11 GiB | 30.53 B | CPU | 32 | 512 | 1 | 1 | pp512 | 242.22 ± 2.75 |
| qwen3moe 30B.A3B Q4_0 | 16.11 GiB | 30.53 B | CPU | 32 | 512 | 1 | 1 | tg128 | 29.26 ± 0.15 |
| qwen3moe 30B.A3B Q4_0 | 16.11 GiB | 30.53 B | CPU | 32 | 512 | 1 | 1 | pp512+tg512 | 45.34 ± 0.28 |
build: unknown (0)
[1B] WITHOUT AMX (baseline):
| model | size | params | backend | threads | n_batch | nopo | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------: | ---: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q4_0 | 16.11 GiB | 30.53 B | CPU | 32 | 512 | 1 | pp512 | 238.76 ± 3.48 |
| qwen3moe 30B.A3B Q4_0 | 16.11 GiB | 30.53 B | CPU | 32 | 512 | 1 | tg128 | 28.98 ± 0.11 |
| qwen3moe 30B.A3B Q4_0 | 16.11 GiB | 30.53 B | CPU | 32 | 512 | 1 | pp512+tg512 | 45.44 ± 0.16 |
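To read the two tables side by side, the relative difference can be computed directly from the reported means. The sketch below is illustrative only: it hard-codes the mean t/s values from runs 1A and 1B above, ignores the ± spread, and `relative_gain` is a hypothetical helper name, not part of llama.cpp.

```python
# Mean tokens/second copied from the 1A and 1B tables above (32 threads, Q4_0).
WITH_AMX = {"pp512": 242.22, "tg128": 29.26, "pp512+tg512": 45.34}
WITHOUT_AMX = {"pp512": 238.76, "tg128": 28.98, "pp512+tg512": 45.44}


def relative_gain(amx: float, baseline: float) -> float:
    """Return the AMX throughput change relative to the baseline, in percent."""
    return (amx - baseline) / baseline * 100.0


for test, amx_tps in WITH_AMX.items():
    gain = relative_gain(amx_tps, WITHOUT_AMX[test])
    print(f"{test:<12} {gain:+.2f}%")
```

For these runs the differences come out to roughly +1.45% (pp512), +0.97% (tg128), and -0.22% (pp512+tg512), so the AMX and non-AMX results are within about 1.5% of each other.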
Batch script used for the benchmark:
@echo off
title Qwen3-MoE 30B Comprehensive Benchmark - Intel Xeon Platinum 8592+
echo ============================================================================
echo Qwen3-MoE 30B.A3B Q4_0 - Comprehensive Performance Benchmark
echo Intel Xeon Platinum 8592+ Single Socket (64 cores, 128 threads) + AMX Acceleration
echo ============================================================================
echo.
echo This benchmark will test multiple configurations:
echo 1. AMX enabled vs disabled comparison
echo 2. Different thread counts (32, 48, 64, 128 threads)
echo 3. Repository-recommended settings
echo 4. Single-socket Xeon optimized settings
echo.
echo Expected runtime: 15-30 minutes
echo.
pause
set MODEL_PATH="D:\99_LLM_MODELS\qwen3moe_30B.A3B_Q4_0\qwen3-30b-a3b-q4_0.gguf"
set LLAMA_BENCH="D:\01_PROJECTS\AI\qwen3moe_llamacpp\llama.cpp\build\bin\Release\llama-bench.exe"
echo.
echo ============================================================================
echo TEST 1: AMX Repository Settings (32 threads)
echo ============================================================================
echo.
echo [1A] WITH AMX Acceleration:
%LLAMA_BENCH% ^
--amx ^
-m %MODEL_PATH% ^
-t 32 ^
-ngl 0 ^
-nopo 1 ^
-b 512 ^
-ub 512 ^
-pg 512,512 ^
--repetitions 3
echo.
echo [1B] WITHOUT AMX (baseline):
%LLAMA_BENCH% ^
-m %MODEL_PATH% ^
-t 32 ^
-ngl 0 ^
-nopo 1 ^
-b 512 ^
-ub 512 ^
-pg 512,512 ^
--repetitions 3
echo.
echo ============================================================================
echo TEST 2: Intel Xeon Optimized Settings (48 threads - optimal for single socket)
echo ============================================================================
echo.
echo [2A] WITH AMX - 48 threads (optimal for single socket):
%LLAMA_BENCH% ^
--amx ^
-m %MODEL_PATH% ^
-t 48 ^
-ngl 0 ^
-nopo 1 ^
-b 1024 ^
-ub 1024 ^
-pg 1024,1024 ^
--repetitions 3
echo.
echo [2B] WITHOUT AMX - 48 threads:
%LLAMA_BENCH% ^
-m %MODEL_PATH% ^
-t 48 ^
-ngl 0 ^
-nopo 1 ^
-b 1024 ^
-ub 1024 ^
-pg 1024,1024 ^
--repetitions 3
echo.
echo ============================================================================
echo TEST 3: Full CPU Utilization (64 threads - all physical cores)
echo ============================================================================
echo.
echo [3A] WITH AMX - 64 threads (full CPU):
%LLAMA_BENCH% ^
--amx ^
-m %MODEL_PATH% ^
-t 64 ^
-ngl 0 ^
-nopo 1 ^
-b 2048 ^
-ub 2048 ^
-pg 2048,2048 ^
--repetitions 2
echo.
echo [3B] WITHOUT AMX - 64 threads:
%LLAMA_BENCH% ^
-m %MODEL_PATH% ^
-t 64 ^
-ngl 0 ^
-nopo 1 ^
-b 2048 ^
-ub 2048 ^
-pg 2048,2048 ^
--repetitions 2
echo.
echo ============================================================================
echo TEST 4: Hyperthreading Test (128 logical threads)
echo ============================================================================
echo.
echo [4A] WITH AMX - 128 threads (with hyperthreading):
%LLAMA_BENCH% ^
--amx ^
-m %MODEL_PATH% ^
-t 128 ^
-ngl 0 ^
-nopo 1 ^
-b 1024 ^
-ub 1024 ^
-pg 1024,1024 ^
--repetitions 2
echo.
echo [4B] WITHOUT AMX - 128 threads:
%LLAMA_BENCH% ^
-m %MODEL_PATH% ^
-t 128 ^
-ngl 0 ^
-nopo 1 ^
-b 1024 ^
-ub 1024 ^
-pg 1024,1024 ^
--repetitions 2
echo.
echo ============================================================================
echo TEST 5: Quick Generation Test (Real-world usage)
echo ============================================================================
echo.
echo [5A] AMX Generation Test - "10 facts about birds":
"D:\01_PROJECTS\AI\qwen3moe_llamacpp\llama.cpp\build\bin\Release\llama-cli.exe" ^
--amx ^
-m %MODEL_PATH% ^
-ngl 0 ^
-t 48 ^
-b 4096 ^
-c 4096 ^
-n 512 ^
-p "10 facts about birds" ^
--color
echo.
echo [5B] Non-AMX Generation Test - "10 facts about birds":
"D:\01_PROJECTS\AI\qwen3moe_llamacpp\llama.cpp\build\bin\Release\llama-cli.exe" ^
-m %MODEL_PATH% ^
-ngl 0 ^
-t 48 ^
-b 4096 ^
-c 4096 ^
-n 512 ^
-p "10 facts about birds" ^
--color
echo.
echo ============================================================================
echo BENCHMARK COMPLETE!
echo ============================================================================
echo.
echo Summary of tested configurations:
echo - AMX vs Non-AMX at 32, 48, 64, and 128 threads
echo - Repository settings vs Xeon-optimized settings
echo - Partial cores (48T) vs All cores (64T) vs Hyperthreading (128T)
echo - Real-world generation performance
echo.
echo Optimal configuration recommendations will be visible in the results above.
echo Look for the highest tokens/second in each category.
echo.
echo For your Intel Xeon Platinum 8592+ (Single Socket):
echo - 48 threads typically optimal for balanced performance/efficiency
echo - 64 threads for maximum throughput (all physical cores)
echo - 128 threads tests hyperthreading benefit
echo - AMX should show 2-4x performance improvement for matrix ops
echo.
pause
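The eight llama-bench runs in the script differ only in thread count, batch size, repetition count, and whether `--amx` is passed. A minimal sketch of that matrix expressed as data, assuming the flags behave as they are used in the script above (`--amx` is taken from the script as-is; `BenchConfig`, `build_args`, and the placeholder paths are hypothetical names for illustration):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchConfig:
    threads: int
    batch: int        # reused for -b, -ub and both halves of -pg, as in the script
    repetitions: int


# Tests 1-4 from the script above.
CONFIGS = [
    BenchConfig(threads=32, batch=512, repetitions=3),
    BenchConfig(threads=48, batch=1024, repetitions=3),
    BenchConfig(threads=64, batch=2048, repetitions=2),
    BenchConfig(threads=128, batch=1024, repetitions=2),
]


def build_args(bench_exe: str, model: str, cfg: BenchConfig, amx: bool) -> list[str]:
    """Build one llama-bench argument list mirroring the script's flag order."""
    args = [bench_exe]
    if amx:
        args.append("--amx")  # flag as it appears in the script; an assumption here
    args += [
        "-m", model,
        "-t", str(cfg.threads),
        "-ngl", "0",
        "-nopo", "1",
        "-b", str(cfg.batch),
        "-ub", str(cfg.batch),
        "-pg", f"{cfg.batch},{cfg.batch}",
        "--repetitions", str(cfg.repetitions),
    ]
    return args


if __name__ == "__main__":
    for cfg in CONFIGS:
        for amx in (True, False):
            # Placeholder paths; pass the list to subprocess.run(...) to execute.
            print(" ".join(build_args("llama-bench.exe", "model.gguf", cfg, amx)))
```

Keeping the configurations in one list makes it harder for the AMX and non-AMX variants of a test to drift apart, since both are generated from the same entry.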
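Finding the highest tokens/second per test, as the script's closing message suggests, can be automated by parsing the pipe-delimited tables that llama-bench prints. The following is a sketch under the assumption that the output keeps the column layout shown in the results at the top; `parse_rows` and `best_by_test` are hypothetical helper names, and the sample reuses four rows from those results with shortened separator rows.

```python
def parse_rows(text: str) -> list[dict[str, str]]:
    """Parse pipe-delimited benchmark tables into one dict per data row."""
    rows, header = [], None
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            header = None  # any non-table line ends the current table
            continue
        cells = [c.strip() for c in line.strip("|").split("|")]
        if header is None:
            header = cells  # first table line is the header
        elif set(cells[0]) <= set("-: "):
            continue  # separator row such as | --- | ---: |
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def best_by_test(rows: list[dict[str, str]]) -> dict[str, tuple[float, dict[str, str]]]:
    """Return (mean t/s, row) with the highest mean t/s for each test name."""
    best: dict[str, tuple[float, dict[str, str]]] = {}
    for row in rows:
        mean_tps = float(row["t/s"].split("±")[0])
        current = best.get(row["test"])
        if current is None or mean_tps > current[0]:
            best[row["test"]] = (mean_tps, row)
    return best


SAMPLE = """\
| model | size | params | backend | threads | n_batch | amx | nopo | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | ---: | ---: | ---: | ---: |
| qwen3moe 30B.A3B Q4_0 | 16.11 GiB | 30.53 B | CPU | 32 | 512 | 1 | 1 | pp512 | 242.22 ± 2.75 |
| qwen3moe 30B.A3B Q4_0 | 16.11 GiB | 30.53 B | CPU | 32 | 512 | 1 | 1 | tg128 | 29.26 ± 0.15 |
[1B] WITHOUT AMX (baseline):
| model | size | params | backend | threads | n_batch | nopo | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | ---: | ---: | ---: |
| qwen3moe 30B.A3B Q4_0 | 16.11 GiB | 30.53 B | CPU | 32 | 512 | 1 | pp512 | 238.76 ± 3.48 |
| qwen3moe 30B.A3B Q4_0 | 16.11 GiB | 30.53 B | CPU | 32 | 512 | 1 | tg128 | 28.98 ± 0.11 |
"""

for test, (tps, row) in best_by_test(parse_rows(SAMPLE)).items():
    # Baseline tables have no "amx" column, so default to "0" for display.
    print(test, tps, "threads=" + row["threads"], "amx=" + row.get("amx", "0"))
```

Comparing means alone ignores the ± spread, so two configurations whose intervals overlap are better treated as a tie than as a winner and a loser.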