NUMA and CPU selection #646
-
NUMA is a topic that I want to do something about, but nothing has been done at this point. I'm not sure if the high core-count CPUs can be configured as a single NUMA node. Can one not simply try in a cloud instance? Here is a comment from someone using a 9355 EPYC (12 memory channels) and getting very decent CPU-only TG performance with it.
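(On EPYC, how many NUMA nodes a single socket presents also depends on the BIOS "NUMA nodes per socket" (NPS) setting, so a cloud test only shows one configuration. For checking what the OS actually sees, here is a minimal sketch using libnuma, roughly equivalent to `numactl --hardware`; assumes Linux with libnuma installed, and is my illustration rather than anything from this repo:)

```cpp
// Print the NUMA topology the OS exposes, e.g. on a cloud instance.
// Build with: g++ -O2 numa_check.cpp -lnuma
#include <cstdio>
#include <numa.h>

int main() {
    if (numa_available() < 0) {
        std::printf("NUMA is not available on this system\n");
        return 1;
    }
    std::printf("configured NUMA nodes: %d\n", numa_num_configured_nodes());
    for (int node = 0; node <= numa_max_node(); ++node) {
        long long free_b  = 0;
        long long total_b = numa_node_size64(node, &free_b);  // -1 if node absent
        if (total_b > 0) {
            std::printf("node %d: %lld MiB total, %lld MiB free\n",
                        node, total_b >> 20, free_b >> 20);
        }
    }
    return 0;
}
```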
-
I haven't seen the Reddit thread, but what I would try first when I get access to a NUMA system is to delay tensor data loading until the warm-up graph computation, and there have each thread load the tensor portions it will be working on. One could also multi-thread tensor data loading, but then one needs to make sure that the correct tensor portions are loaded by the respective threads, which happens automatically if data loading is done within the warm-up graph computation. I have done something along these lines in the past at a $former_job, in the context of a large-scale optimization problem where the system matrix had to be distributed that way to (nearly) double the performance on the dual-socket production system. I haven't tried any of it yet because I want to test on a real NUMA system. This and the Vulkan back-end are competing for being the next thing to focus on when I come back from vacation.
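To make the idea concrete, here is a minimal sketch of first-touch loading on Linux (my illustration, not code from this repo): each compute thread fills exactly the slice of the weight matrix it will later multiply, so the default first-touch policy places those pages on the thread's local NUMA node. The pinning helper and the memset standing in for reading tensor rows from the model file are assumptions for the sketch.

```cpp
// First-touch NUMA placement sketch. Build with: g++ -O2 -pthread first_touch.cpp
#include <cstdlib>
#include <cstring>
#include <pthread.h>
#include <sched.h>
#include <thread>
#include <vector>

// Pin the calling thread to one CPU so load-time and compute-time
// placement stay consistent (core numbering here is simplified).
static void pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main() {
    const size_t n_rows = 16384, row_bytes = 8192;  // 128 MiB dummy weight
    const int    n_threads = 8;

    // A large malloc is typically mmap-backed: no physical pages are
    // placed until the first write, which is what first-touch relies on.
    char *weights = static_cast<char *>(std::malloc(n_rows * row_bytes));

    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t) {
        workers.emplace_back([=] {
            pin_to_cpu(t);  // same pinning the compute threads would use
            const size_t r0 = n_rows * t / n_threads;
            const size_t r1 = n_rows * (t + 1) / n_threads;
            // Real code would read rows [r0, r1) of the tensor from the
            // model file here; memset stands in for that load.
            std::memset(weights + r0 * row_bytes, 0, (r1 - r0) * row_bytes);
        });
    }
    for (auto &w : workers) w.join();

    // ... run the mat-mul with the same row partitioning and pinning ...
    std::free(weights);
    return 0;
}
```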
-
This is a great topic that I've been trying to understand too - and the same question I asked myself before shopping. I've seen people on the Level1Techs forum blaming low CCD count for poor memory bandwidth, with some suggesting the EPYC 9175F as a great CPU because it has 16 CCDs (1 per core). But then I came across a paper explaining cross-CCD memory latency and bandwidth, and another with memory bandwidth benchmarks for various 9004/9005 CPUs. I can't access them now for some reason (everything I touch gets nuked lately 😅), but try googling "Fujitsu Genoa Turin memory performance white paper" - you might find them. Here's a chart of all AMD EPYC 9005 Turin CPUs that I found extremely useful.
-
@joshuakoh1 One more thing: apart from any single vs dual socket considerations, if it was me, I would select a CPU from the 9005 series rather than the 9004 series for 2 reasons:
1. Higher theoretical memory bandwidth (12 channels of DDR5-6000 vs DDR5-4800 for the 9004 series)
2. The newer Zen5 core instead of the Zen4 core

The 1st point may or may not be important, as we are not able to come even close to saturating the 9004 theoretical memory bandwidth during TG with the big MoE models; but in case we figure out where the bottleneck is, the 9005 series will give better TG performance (independently of single vs dual socket, NUMA, number of CCDs, etc.). The second point is definitely important. The 9004 series uses the Zen4 core, which has a fairly comprehensive AVX-512 feature set¹ but executes 512-bit instructions as two 256-bit halves, while the Zen5 core in the 9005 series has a full 512-bit data path.

¹ There are some relatively minor performance gains when one can make use of the additional AVX-512 instructions.
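For a rough sense of what the first point means numerically (my arithmetic, assuming 12 channels at 8 bytes per transfer):

$$
B_{9004} = 12 \times 8\,\mathrm{B} \times 4.8\,\mathrm{GT/s} \approx 461\,\mathrm{GB/s},
\qquad
B_{9005} = 12 \times 8\,\mathrm{B} \times 6.0\,\mathrm{GT/s} = 576\,\mathrm{GB/s}
$$

Since TG is memory-bandwidth-bound, tokens/s is bounded above by roughly $B / (\text{bytes of weights read per token})$; for MoE models only the active experts count toward those bytes.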
-
I've mentioned the Intel / sglang work here and there recently. It seems worth repeating here: Cost Effective Deployment of DeepSeek R1 with Intel® Xeon® 6 CPU on SGLang. In particular, the NUMA design with respect to tensor & expert parallelism seems significant.
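For illustration only, here is a hypothetical sketch of what binding experts to NUMA nodes could look like with libnuma (not sglang's implementation or anything in this repo; `load_expert`, the expert count, and sizes are made up):

```cpp
// Round-robin placement of MoE expert weights across NUMA nodes.
// Build with: g++ -O2 expert_bind.cpp -lnuma
#include <cstdio>
#include <vector>
#include <numa.h>

struct Expert {
    void  *weights;
    size_t bytes;
    int    node;  // NUMA node this expert is resident on
};

// Stand-in: real code would read this expert's tensors from the model file.
static void load_expert(int /*expert_id*/, void * /*dst*/, size_t /*bytes*/) {}

int main() {
    if (numa_available() < 0) return 1;

    const int    n_experts    = 256;        // e.g. a DeepSeek-scale MoE
    const size_t expert_bytes = 64ull << 20; // dummy size per expert
    const int    n_nodes      = numa_num_configured_nodes();

    std::vector<Expert> experts(n_experts);
    for (int e = 0; e < n_experts; ++e) {
        int   node = e % n_nodes;  // round-robin experts over NUMA nodes
        void *buf  = numa_alloc_onnode(expert_bytes, node);
        load_expert(e, buf, expert_bytes);
        experts[e] = {buf, expert_bytes, node};
    }

    // At inference time, the thread pool of the owning node would run each
    // expert's mat-muls, so weight reads stay on local memory channels.
    std::printf("placed %d experts across %d NUMA nodes\n", n_experts, n_nodes);

    for (auto &ex : experts) numa_free(ex.weights, ex.bytes);
    return 0;
}
```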
-
Hello,
A great fan of this repo. Main llama.cpp was definitely getting too bloated.
I'm currently shopping for a new CPU and would like to clarify some crucial information. I've had bad experiences with NUMA so far, moving from a single-socket EPYC to dual and back to single.
My understanding from my research so far is that the higher core-count EPYC CPUs will run into NUMA issues even on a single socket as the CCD count grows.
My primary use case is to load MoE models like Kimi and DeepSeek, and my understanding is that there is still no way to bind specific experts per GPU/NUMA domain.
Am I right to say that I should be avoiding the higher CCD-count CPUs like the 9654/9754 for the foreseeable future?
Thanks!