Intel Xeon Platinum 8592 #5

@spongioblast

Tested on Windows 11 with 512 GB DDR5 and a Xeon 8592 (Emerald Rapids).
(32 threads was the fastest of the thread counts tested.)

============================================================================
TEST 1: AMX Repository Settings (32 threads) 
============================================================================

[1A] WITH AMX Acceleration:
| model                          |       size |     params | backend    | threads | n_batch |             amx | nopo |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------: | --------------: | ---: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q4_0          |  16.11 GiB |    30.53 B | CPU        |      32 |     512 |               1 |    1 |           pp512 |        242.22 ± 2.75 |
| qwen3moe 30B.A3B Q4_0          |  16.11 GiB |    30.53 B | CPU        |      32 |     512 |               1 |    1 |           tg128 |         29.26 ± 0.15 |
| qwen3moe 30B.A3B Q4_0          |  16.11 GiB |    30.53 B | CPU        |      32 |     512 |               1 |    1 |     pp512+tg512 |         45.34 ± 0.28 |

build: unknown (0)

[1B] WITHOUT AMX (baseline):
| model                          |       size |     params | backend    | threads | n_batch | nopo |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------: | ---: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q4_0          |  16.11 GiB |    30.53 B | CPU        |      32 |     512 |    1 |           pp512 |        238.76 ± 3.48 |
| qwen3moe 30B.A3B Q4_0          |  16.11 GiB |    30.53 B | CPU        |      32 |     512 |    1 |           tg128 |         28.98 ± 0.11 |
| qwen3moe 30B.A3B Q4_0          |  16.11 GiB |    30.53 B | CPU        |      32 |     512 |    1 |     pp512+tg512 |         45.44 ± 0.16 |
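
To put the two tables side by side, here is a minimal Python sketch that computes the relative change from enabling AMX for each test, using only the numbers posted above. It assumes the `±` values are standard deviations over the repetitions and that the two runs are independent, so their spreads are combined in quadrature. Both are assumptions for illustration, and three repetitions is a small sample.

```python
import math

# (test, AMX t/s, AMX std dev, baseline t/s, baseline std dev)
# Values copied from tables [1A] and [1B] above.
RESULTS = [
    ("pp512", 242.22, 2.75, 238.76, 3.48),
    ("tg128", 29.26, 0.15, 28.98, 0.11),
    ("pp512+tg512", 45.34, 0.28, 45.44, 0.16),
]


def compare(amx, amx_sd, base, base_sd):
    """Return (percent change vs. baseline, difference in combined std devs)."""
    pct = (amx - base) / base * 100.0
    # Assumption: independent runs, so the spreads add in quadrature.
    combined_sd = math.hypot(amx_sd, base_sd)
    return pct, (amx - base) / combined_sd


for name, *values in RESULTS:
    pct, z = compare(*values)
    print(f"{name:12s} {pct:+.2f}%  ({z:+.2f} combined std devs)")
```

On these numbers the change is about +1.45% for pp512, +0.97% for tg128, and -0.22% for pp512+tg512, each within two combined standard deviations under the assumptions above.
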

Batch script used to run the benchmarks:

@echo off
title Qwen3-MoE 30B Comprehensive Benchmark - Intel Xeon Platinum 8592
echo ============================================================================
echo Qwen3-MoE 30B.A3B Q4_0 - Comprehensive Performance Benchmark
echo Intel Xeon Platinum 8592+ Single Socket (64 cores, 128 threads) + AMX Acceleration
echo ============================================================================
echo.
echo This benchmark will test multiple configurations:
echo 1. AMX enabled vs disabled comparison
echo 2. Different thread counts (32, 48, 64, 128 threads)
echo 3. Repository-recommended settings
echo 4. Single-socket Xeon optimized settings
echo.
echo Expected runtime: 15-30 minutes
echo.
pause

set MODEL_PATH="D:\99_LLM_MODELS\qwen3moe_30B.A3B_Q4_0\qwen3-30b-a3b-q4_0.gguf"
set LLAMA_BENCH="D:\01_PROJECTS\AI\qwen3moe_llamacpp\llama.cpp\build\bin\Release\llama-bench.exe"

echo.
echo ============================================================================
echo TEST 1: AMX Repository Settings (32 threads)
echo ============================================================================
echo.
echo [1A] WITH AMX Acceleration:
%LLAMA_BENCH% ^
    --amx ^
    -m %MODEL_PATH% ^
    -t 32 ^
    -ngl 0 ^
    -nopo 1 ^
    -b 512 ^
    -ub 512 ^
    -pg 512,512 ^
    --repetitions 3

echo.
echo [1B] WITHOUT AMX (baseline):
%LLAMA_BENCH% ^
    -m %MODEL_PATH% ^
    -t 32 ^
    -ngl 0 ^
    -nopo 1 ^
    -b 512 ^
    -ub 512 ^
    -pg 512,512 ^
    --repetitions 3

echo.
echo ============================================================================
echo TEST 2: Intel Xeon Optimized Settings (48 threads - optimal for single socket)
echo ============================================================================
echo.
echo [2A] WITH AMX - 48 threads (optimal for single socket):
%LLAMA_BENCH% ^
    --amx ^
    -m %MODEL_PATH% ^
    -t 48 ^
    -ngl 0 ^
    -nopo 1 ^
    -b 1024 ^
    -ub 1024 ^
    -pg 1024,1024 ^
    --repetitions 3

echo.
echo [2B] WITHOUT AMX - 48 threads:
%LLAMA_BENCH% ^
    -m %MODEL_PATH% ^
    -t 48 ^
    -ngl 0 ^
    -nopo 1 ^
    -b 1024 ^
    -ub 1024 ^
    -pg 1024,1024 ^
    --repetitions 3

echo.
echo ============================================================================
echo TEST 3: Full CPU Utilization (64 threads - all physical cores)
echo ============================================================================
echo.
echo [3A] WITH AMX - 64 threads (full CPU):
%LLAMA_BENCH% ^
    --amx ^
    -m %MODEL_PATH% ^
    -t 64 ^
    -ngl 0 ^
    -nopo 1 ^
    -b 2048 ^
    -ub 2048 ^
    -pg 2048,2048 ^
    --repetitions 2

echo.
echo [3B] WITHOUT AMX - 64 threads:
%LLAMA_BENCH% ^
    -m %MODEL_PATH% ^
    -t 64 ^
    -ngl 0 ^
    -nopo 1 ^
    -b 2048 ^
    -ub 2048 ^
    -pg 2048,2048 ^
    --repetitions 2

echo.
echo ============================================================================
echo TEST 4: Hyperthreading Test (128 logical threads)
echo ============================================================================
echo.
echo [4A] WITH AMX - 128 threads (with hyperthreading):
%LLAMA_BENCH% ^
    --amx ^
    -m %MODEL_PATH% ^
    -t 128 ^
    -ngl 0 ^
    -nopo 1 ^
    -b 1024 ^
    -ub 1024 ^
    -pg 1024,1024 ^
    --repetitions 2

echo.
echo [4B] WITHOUT AMX - 128 threads:
%LLAMA_BENCH% ^
    -m %MODEL_PATH% ^
    -t 128 ^
    -ngl 0 ^
    -nopo 1 ^
    -b 1024 ^
    -ub 1024 ^
    -pg 1024,1024 ^
    --repetitions 2

echo.
echo ============================================================================
echo TEST 5: Quick Generation Test (Real-world usage)
echo ============================================================================
echo.
echo [5A] AMX Generation Test - "10 facts about birds":
"D:\01_PROJECTS\AI\qwen3moe_llamacpp\llama.cpp\build\bin\Release\llama-cli.exe" ^
    --amx ^
    -m %MODEL_PATH% ^
    -ngl 0 ^
    -t 48 ^
    -b 4096 ^
    -c 4096 ^
    -n 512 ^
    -p "10 facts about birds" ^
    --color

echo.
echo [5B] Non-AMX Generation Test - "10 facts about birds":
"D:\01_PROJECTS\AI\qwen3moe_llamacpp\llama.cpp\build\bin\Release\llama-cli.exe" ^
    -m %MODEL_PATH% ^
    -ngl 0 ^
    -t 48 ^
    -b 4096 ^
    -c 4096 ^
    -n 512 ^
    -p "10 facts about birds" ^
    --color

echo.
echo ============================================================================
echo BENCHMARK COMPLETE!
echo ============================================================================
echo.
echo Summary of tested configurations:
echo - AMX vs Non-AMX at 32, 48, 64, and 128 threads
echo - Repository settings vs Xeon-optimized settings
echo - Partial cores (48T) vs All cores (64T) vs Hyperthreading (128T)
echo - Real-world generation performance
echo.
echo Optimal configuration recommendations will be visible in the results above.
echo Look for the highest tokens/second in each category.
echo.
echo For your Intel Xeon Platinum 8592+ (Single Socket):
echo - 48 threads typically optimal for balanced performance/efficiency
echo - 64 threads for maximum throughput (all physical cores)
echo - 128 threads tests hyperthreading benefit
echo - AMX should show 2-4x performance improvement for matrix ops
echo.
pause
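
The eight `llama-bench` invocations in the batch script above differ only in thread count, batch size, repetition count, and whether `--amx` is passed. As a rough sketch, the same command lines could be generated from a small table in Python. The flags are copied verbatim from the script (`--amx` and `-nopo` are assumed to be specific to this repository's build). The function names and the executable and model paths in the usage lines are hypothetical placeholders.

```python
# (threads, batch size, repetitions) for TEST 1-4, as in the batch script.
CONFIGS = [(32, 512, 3), (48, 1024, 3), (64, 2048, 2), (128, 1024, 2)]


def bench_command(bench_exe, model, threads, batch, reps, amx):
    """Build one llama-bench argument list using the flags from the script."""
    cmd = [bench_exe]
    if amx:
        cmd.append("--amx")  # flag taken from the script; assumed fork-specific
    cmd += [
        "-m", model,
        "-t", str(threads),
        "-ngl", "0",
        "-nopo", "1",
        "-b", str(batch),
        "-ub", str(batch),
        "-pg", f"{batch},{batch}",
        "--repetitions", str(reps),
    ]
    return cmd


def all_commands(bench_exe, model):
    """AMX run first, then baseline, for each configuration."""
    return [
        bench_command(bench_exe, model, t, b, r, amx)
        for (t, b, r) in CONFIGS
        for amx in (True, False)
    ]


if __name__ == "__main__":
    # Placeholder paths; substitute the real executable and model locations.
    for cmd in all_commands("llama-bench.exe", "model.gguf"):
        print(" ".join(cmd))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` to execute it. Keeping the configurations in one table makes it harder for the AMX and baseline runs to drift apart.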
