NUMA and CPU selection #646
-
NUMA is a topic that I want to do something about, but nothing has been done at this point. I'm not sure if the high core-count CPUs can be configured as a single NUMA node. Can one not simply try in a cloud instance? Here is a comment from someone using a 9355 EPYC (12 memory channels) and getting very decent CPU-only TG performance with it.
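(On EPYC, how many NUMA nodes a single socket presents also depends on the BIOS "NUMA nodes per socket" (NPS) setting, so a cloud test only shows one configuration. For checking what the OS actually sees, here is a minimal sketch using libnuma, roughly equivalent to `numactl --hardware`; assumes Linux with libnuma installed, and is my illustration rather than anything from this repo:)

```cpp
// Print the NUMA topology the OS exposes, e.g. on a cloud instance.
// Build with: g++ -O2 numa_check.cpp -lnuma
#include <cstdio>
#include <numa.h>

int main() {
    if (numa_available() < 0) {
        std::printf("NUMA is not available on this system\n");
        return 1;
    }
    std::printf("configured NUMA nodes: %d\n", numa_num_configured_nodes());
    for (int node = 0; node <= numa_max_node(); ++node) {
        long long free_b  = 0;
        long long total_b = numa_node_size64(node, &free_b);  // -1 if node absent
        if (total_b > 0) {
            std::printf("node %d: %lld MiB total, %lld MiB free\n",
                        node, total_b >> 20, free_b >> 20);
        }
    }
    return 0;
}
```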
-
I haven't seen the Reddit thread, but what I would try first when I get access to a NUMA system is to delay tensor data loading until the warm-up graph computation, and there have each thread load the tensor portions it will be working on. One could also multi-thread tensor data loading, but then one needs to make sure that the correct tensor portions are loaded by the respective threads, which happens automatically if data loading is done within the warm-up graph computation. I have done something along these lines in the past at a $former_job, in the context of a large-scale optimization problem where the system matrix had to be distributed that way to (nearly) double the performance on the dual-socket production system. I haven't tried any of it yet because I want to test on a real NUMA system. This and the Vulkan back-end are competing for being the next thing to focus on when I come back from vacation.
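To make the idea concrete, here is a minimal sketch of first-touch loading on Linux (my illustration, not code from this repo): each compute thread fills exactly the slice of the weight matrix it will later multiply, so the default first-touch policy places those pages on the thread's local NUMA node. The pinning helper and the memset standing in for reading tensor rows from the model file are assumptions for the sketch.

```cpp
// First-touch NUMA placement sketch. Build with: g++ -O2 -pthread first_touch.cpp
#include <cstdlib>
#include <cstring>
#include <pthread.h>
#include <sched.h>
#include <thread>
#include <vector>

// Pin the calling thread to one CPU so load-time and compute-time
// placement stay consistent (core numbering here is simplified).
static void pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main() {
    const size_t n_rows = 16384, row_bytes = 8192;  // 128 MiB dummy weight
    const int    n_threads = 8;

    // A large malloc is typically mmap-backed: no physical pages are
    // placed until the first write, which is what first-touch relies on.
    char *weights = static_cast<char *>(std::malloc(n_rows * row_bytes));

    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t) {
        workers.emplace_back([=] {
            pin_to_cpu(t);  // same pinning the compute threads would use
            const size_t r0 = n_rows * t / n_threads;
            const size_t r1 = n_rows * (t + 1) / n_threads;
            // Real code would read rows [r0, r1) of the tensor from the
            // model file here; memset stands in for that load.
            std::memset(weights + r0 * row_bytes, 0, (r1 - r0) * row_bytes);
        });
    }
    for (auto &w : workers) w.join();

    // ... run the mat-mul with the same row partitioning and pinning ...
    std::free(weights);
    return 0;
}
```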
-
This is a great topic that I've been trying to understand too - and the same question I asked myself before shopping. I've seen people on the Level1Techs forum blaming low CCD count for poor memory bandwidth, with some suggesting the EPYC 9175F as a great CPU because it has 16 CCDs (1 per core). But then I came across a paper explaining cross-CCD memory latency and bandwidth, and another with memory bandwidth benchmarks for various 9004/9005 CPUs. I can't access them now for some reason (everything I touch gets nuked lately 😅), but try googling "Fujitsu Genoa Turin memory performance white paper" - you might find them. Here's a chart of all AMD EPYC 9005 Turin CPUs that I found extremely useful.
-
@joshuakoh1 One more thing: apart from any single vs dual socket considerations, if it was me, I would select a CPU from the 9005 series rather than the 9004 series for 2 reasons:
1. Higher theoretical memory bandwidth (12 channels of DDR5-6000 vs DDR5-4800 for the 9004 series)
2. The newer Zen5 core instead of the Zen4 core

The 1st point may or may not be important, as we are not able to come even close to saturating the 9004 theoretical memory bandwidth during TG with the big MoE models; but in case we figure out where the bottleneck is, the 9005 series will give better TG performance (independently of single vs dual socket, NUMA, number of CCDs, etc.). The second point is definitely important. The 9004 series uses the Zen4 core, which has a fairly comprehensive AVX-512 feature set¹ but executes 512-bit instructions as two 256-bit halves, while the Zen5 core in the 9005 series has a full 512-bit data path.

¹ There are some relatively minor performance gains when one can make use of the additional AVX-512 instructions.
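For a rough sense of what the first point means numerically (my arithmetic, assuming 12 channels at 8 bytes per transfer):

$$
B_{9004} = 12 \times 8\,\mathrm{B} \times 4.8\,\mathrm{GT/s} \approx 461\,\mathrm{GB/s},
\qquad
B_{9005} = 12 \times 8\,\mathrm{B} \times 6.0\,\mathrm{GT/s} = 576\,\mathrm{GB/s}
$$

Since TG is memory-bandwidth-bound, tokens/s is bounded above by roughly $B / (\text{bytes of weights read per token})$; for MoE models only the active experts count toward those bytes.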
-
I've mentioned the Intel / sglang work here and there recently. It seems worth repeating here: Cost Effective Deployment of DeepSeek R1 with Intel® Xeon® 6 CPU on SGLang. In particular, the NUMA design with respect to tensor & expert parallelism seems significant.
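For illustration only, here is a hypothetical sketch of what binding experts to NUMA nodes could look like with libnuma (not sglang's implementation or anything in this repo; `load_expert`, the expert count, and sizes are made up):

```cpp
// Round-robin placement of MoE expert weights across NUMA nodes.
// Build with: g++ -O2 expert_bind.cpp -lnuma
#include <cstdio>
#include <vector>
#include <numa.h>

struct Expert {
    void  *weights;
    size_t bytes;
    int    node;  // NUMA node this expert is resident on
};

// Stand-in: real code would read this expert's tensors from the model file.
static void load_expert(int /*expert_id*/, void * /*dst*/, size_t /*bytes*/) {}

int main() {
    if (numa_available() < 0) return 1;

    const int    n_experts    = 256;        // e.g. a DeepSeek-scale MoE
    const size_t expert_bytes = 64ull << 20; // dummy size per expert
    const int    n_nodes      = numa_num_configured_nodes();

    std::vector<Expert> experts(n_experts);
    for (int e = 0; e < n_experts; ++e) {
        int   node = e % n_nodes;  // round-robin experts over NUMA nodes
        void *buf  = numa_alloc_onnode(expert_bytes, node);
        load_expert(e, buf, expert_bytes);
        experts[e] = {buf, expert_bytes, node};
    }

    // At inference time, the thread pool of the owning node would run each
    // expert's mat-muls, so weight reads stay on local memory channels.
    std::printf("placed %d experts across %d NUMA nodes\n", n_experts, n_nodes);

    for (auto &ex : experts) numa_free(ex.weights, ex.bytes);
    return 0;
}
```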
-
Hello,
A great fan of this repo. Main llama.cpp was definitely getting too bloated.
I'm currently shopping for a new CPU and would like to clarify some crucial information. I've had bad experiences with NUMA so far, moving from a single-socket EPYC to dual and back to single.
My understanding from my research so far is that the higher core-count EPYC CPUs will run into NUMA issues even on a single socket as the CCD count grows.
My primary use case is to load MoE models like Kimi and DeepSeek, and my understanding is that there is still no way to bind specific experts per GPU/NUMA domain.
Am I right to say that I should be avoiding the higher CCD-count CPUs like the 9654/9754 for the foreseeable future?
Thanks!