Implementation of GGML_NUMA_MIRROR for inferencing performance gain on numa systems #14969
Conversation

I think you mean  This is just my stock settings and not this PR though - I probably won't have a chance to run that until tomorrow or Monday now. According to: https://en.wikichip.org/wiki/intel/xeon_gold/6248

No, I mean run llama-server with  Looks like you are getting symmetric bandwidth usage on both nodes though. I think that's what I would expect...
          
Oh, sorry

Anyway, I think I'll add a very detailed log at the start of memory allocation with:

Then it will be very clear and easy to debug. There could be strange things like numa sub-clustering going on; I read that's a thing.
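
For reference, a minimal sketch of what such a log could query through libnuma at allocation time (the helper name is made up, this isn't the PR's code):

```cpp
// Hypothetical logging helper: dump the NUMA topology the allocator sees,
// so things like sub-clustering (SNC) show up immediately in the log.
#include <numa.h>
#include <cstdio>

static void log_numa_topology() {
    if (numa_available() < 0) {
        std::printf("NUMA: not available on this system\n");
        return;
    }
    const int nodes = numa_num_configured_nodes();
    const int cpus  = numa_num_configured_cpus();
    std::printf("NUMA: %d node(s), %d cpu(s)\n", nodes, cpus);
    for (int n = 0; n < nodes; ++n) {
        long long free_b = 0;
        long long size_b = numa_node_size64(n, &free_b);
        std::printf("  node %d: %lld MiB total, %lld MiB free\n",
                    n, size_b >> 20, free_b >> 20);
    }
}
```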
    
It also occurs to me that PCIe slots are always attached to a single socket, so numa-aware allocation might impact that; that would be an interesting side effect. I'm learning more about memory every day... 😄
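
For anyone who wants to check that on their own box, a small sketch that reads the NUMA node the kernel reports for a PCIe device via sysfs (the device address below is just a placeholder):

```cpp
// Read /sys/bus/pci/devices/<BDF>/numa_node for a device such as a GPU.
// A result of -1 means the kernel reports no NUMA affinity for that slot.
#include <fstream>
#include <iostream>
#include <string>

int pci_numa_node(const std::string & bdf) {   // e.g. "0000:3b:00.0" (placeholder)
    std::ifstream f("/sys/bus/pci/devices/" + bdf + "/numa_node");
    int node = -1;
    f >> node;                                 // stays -1 if the attribute can't be read
    return node;
}

int main() {
    std::cout << "device NUMA node: " << pci_numa_node("0000:3b:00.0") << "\n";
}
```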
    
I tried this out on my dual Xeon 4216 system (no GPU) with Cohere Command-A on RHEL 8. I had to make changes to the  Unfortunately, I didn't see any change to performance on my system. Here's the command I used:

Edit: I tried allocating hugepages with a script similar to what you shared above, except with 49152 2048k hugepages per node. Still no performance change.
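
As an aside, the per-node 2 MiB hugepage reservation can also be done through sysfs rather than a shell script; a rough sketch, assuming root and using the per-node counts mentioned above:

```cpp
// Write the desired 2 MiB hugepage count into each node's sysfs attribute.
// 49152 pages per node is roughly 96 GiB per node.
#include <fstream>
#include <string>

bool reserve_hugepages_on_node(int node, long count) {
    std::ofstream f("/sys/devices/system/node/node" + std::to_string(node) +
                    "/hugepages/hugepages-2048kB/nr_hugepages");
    if (!f) return false;
    f << count;        // kernel tries to reserve (or release) pages to hit this count
    f.flush();
    return f.good();   // false if the write failed, e.g. not enough free memory
}

int main() {
    reserve_hugepages_on_node(0, 49152);
    reserve_hugepages_on_node(1, 49152);
}
```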
    
          
Thinking about this more today, for offloading shared-experts only;  If the sampling frequency is high enough, then I might be able to hack
    
          
It might be worth trying without the  Do you find that using
    
          
IIRC, the current CUDA offloading code only uses a single GPU for the offloaded calculations, so having 2 copies won't really help it. I do think there is a bottleneck somewhere as
    
If the threads doing the offloading are located on socket 1, but the GPU is on a PCIe slot attached to socket 2, maybe that would be sending the traffic over the UPI link? Might be worth investigating. I'll try to get better visibility of thread/numa assignments in soon.
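
A tiny diagnostic along those lines (an assumed helper, not part of the PR) that reports which CPU and NUMA node the calling thread is on, for comparison against the GPU's node:

```cpp
// Print where the calling thread is currently running; compare the node number
// against the PCIe device's numa_node to spot cross-socket traffic over UPI.
#include <numa.h>
#include <sched.h>
#include <cstdio>

void log_thread_location(const char * tag) {
    const int cpu  = sched_getcpu();          // CPU this thread last ran on
    const int node = numa_node_of_cpu(cpu);   // NUMA node that CPU belongs to
    std::printf("[%s] cpu=%d node=%d\n", tag, cpu, node);
}
```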
    
          
Without this PR, I had a slight speedup from using HyperThreading (i.e. --threads 64 instead of --threads 32). Removing -np 2 had no impact on performance (for a single request, nothing running concurrently). However, I noticed that with -np 2 the generated tokens were gibberish, while without -np 2 it was giving valid/correct outputs. Looks like a bug. With this PR, switching from --threads 64 to --threads 32 had the same slowdown I had without this PR.
    
Installed libnuma and pulled the latest changes (9d66473). Had to disable building RPC to build successfully. Tried to run it with Qwen3-235B-A22B-Instruct-2507-UD-Q4_K_XL. I set vm.nr_hugepages to 160000 to make sure the model and context had enough space. The first time I ran it, it took much longer than regular llama.cpp to load - I didn't time it, but it felt like 5 minutes, whereas regular llama.cpp takes a minute or less. Subsequent loads were very quick, much quicker than llama.cpp. I haven't been able to get any output. Prompt processing takes forever even on short six-word prompts (ex: write a pong game in C++). In htop, I see only two cores (on CPU0) at 100%, while all others are at 0%. The cores are the first and the 24th in htop. The system is a dual 24-core Xeon (ES QQ89, with HT enabled). I think there's a bug in thread pinning. The 24th core would have been the first core of the second CPU if HT was disabled. All threads get pinned to those two cores regardless of whether I set -t or not in llama-server. Tried using numactl with --physcpubind=$(seq -s, 1 2 95), which usually pins one worker to each physical core, but all threads get mapped to the same two cores (0 and 24). Waited a couple of minutes on that pong prompt to see if I get any output, but not a single token.

EDIT: Got my dual Epyc back online, and can confirm the same behavior as on the dual Xeon. Compiled the branch and ran with --threads 96. I can see all threads get crammed onto cpuid 00 and 48 in the log output, as well as in htop. Can also confirm what @aifartist mentioned about SMT threads not being the same as on Intel consumer. Running
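
For context on what the expected pinning would look like, a generic sketch that gives each worker thread its own distinct CPU via pthread affinity (round-robin over cores here is an assumption, not the PR's actual scheduling logic):

```cpp
// Generic example of spreading worker threads across distinct CPUs with
// pthread affinity, instead of letting them all pile onto cores 0 and 24.
#include <pthread.h>
#include <sched.h>
#include <thread>
#include <vector>

static void pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main() {
    const int n_threads = 32;                  // e.g. one per physical core
    std::vector<std::thread> workers;
    for (int i = 0; i < n_threads; ++i) {
        workers.emplace_back([i] {
            pin_to_cpu(i);                     // a distinct core per worker
            // ... do the actual compute work here ...
        });
    }
    for (auto & w : workers) w.join();
}
```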
    
          
Personally, if that's an issue, I'd set GPU support aside while the patch is being developed.
    
I've done quite a bit of testing and code deep-diving over the weekend. What I've realised is that:

All of this said, now I can see what needs to be done to get this over the line. Each socket needs its own threadpool and the matrix operations need to be divvied up between the numa nodes / sockets, then we can leverage data parallelism. I am iterating on this locally at the moment and will update when I have something to test.
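
To make the data-parallel idea concrete, a rough sketch of one way it could look (my reading of the comment above, not the PR's code): each node binds its own workers and computes its own slice of the output rows from its local mirror of the weights.

```cpp
// Each NUMA node computes its own slice of output rows from its local copy
// of the weight matrix: y = W * x, rows [r0, r1) handled by node n.
#include <numa.h>
#include <cstddef>
#include <thread>
#include <vector>

static void matvec_slice(const float * W, const float * x, float * y,
                         int cols, int r0, int r1) {
    for (int r = r0; r < r1; ++r) {
        float acc = 0.0f;
        for (int c = 0; c < cols; ++c) {
            acc += W[(std::size_t) r * cols + c] * x[c];
        }
        y[r] = acc;
    }
}

void matvec_numa(const std::vector<const float *> & W_mirrors,  // one copy per node
                 const float * x, float * y, int rows, int cols) {
    const int n_nodes = (int) W_mirrors.size();
    std::vector<std::thread> pools;
    for (int n = 0; n < n_nodes; ++n) {
        const int r0 = rows *  n      / n_nodes;
        const int r1 = rows * (n + 1) / n_nodes;
        pools.emplace_back([&W_mirrors, x, y, cols, r0, r1, n] {
            numa_run_on_node(n);   // keep this worker on its own socket
            matvec_slice(W_mirrors[n], x, y, cols, r0, r1);
        });
    }
    for (auto & t : pools) t.join();
}
```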
    
Some information that may be useful.
    
Looking at the code architecture, COSMA really needs to be its own new backend, and just throw away ggml-cpu. This could be good or bad, I'm not sure :D I like the idea. As a pedagogical exercise, I'll carry on with the framework I've created up to now, and maybe attempt that as a new PR when I feel more confident.
    
Hmm... according to this PCM test I am only using 55 GB/s of my 220 GB/s bandwidth - about 10-15 GB/s more than with a single proc/numa node. The UPI link utilization isn't constant, but it does sometimes get saturated. I'm also feeding 4x GPUs over PCIe 3.0, however, so maybe that's capping it? On a single proc/node I get 9 t/s and on dual it's 11-ish. The GPU is on only one node, but speeds from the opposite node are only a hair slower, so that must not be it. You can look at fastllm - it supports CPU numa, or claims to. It could give you some ideas to implement here. I want to compare its speeds with llama.cpp to see what, if anything, I'm leaving on the table. A real head-to-head with Qwen-235B, and now I'll be keeping an eye on pcm-memory behavior.
    
I've got a local implementation mostly working now. I'll merge it into this PR next week. Key changes upcoming:

I'm quite excited about this and will get it into this PR once I finish local testing.
    
That's great! How much uplift did you get? Also, was there benefit to more than one numa node per socket?

Edit: trying it now; it only masks for a single socket and requires mmap with the HP enabled. Attempting to see if it will run with the latter.
    
There are bugs in what's checked into my iteration branch that I've only fixed locally, let me finish lol.
    
Just as an update: I promised a release last week, but I didn't want to release something broken.

I think this is roughly the level of performance gain we can expect. I will start migrating the major kernels like MUL_MAT, ROPE, etc. that are used most often, then do a performance benchmark against real inferencing. This is looking very promising now. I have a real working solution.

Edit: Oh, and no more hugepages; I just use regular malloc() but page-fault to each numa node individually. Even found a workaround for the broken allocation in docker containers :)
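
A sketch of how that first-touch approach could look (my interpretation, not the PR's actual allocator): each node gets its own plain malloc'd copy, faulted in from a thread bound to that node so the pages land in local memory.

```cpp
// Allocate a plain malloc'd copy per node and fault its pages in from a thread
// bound to that node; with the default first-touch policy the kernel places
// those pages in that node's local memory. No hugepages involved.
#include <numa.h>
#include <cstdlib>
#include <cstring>
#include <thread>
#include <vector>

std::vector<void *> mirror_first_touch(size_t bytes, int n_nodes) {
    std::vector<void *> copies(n_nodes, nullptr);
    std::vector<std::thread> touchers;
    for (int n = 0; n < n_nodes; ++n) {
        touchers.emplace_back([&copies, bytes, n] {
            numa_run_on_node(n);              // run this thread on node n
            void * p = std::malloc(bytes);    // regular malloc, no hugepages
            if (p != nullptr) {
                std::memset(p, 0, bytes);     // first touch => pages land on node n
            }
            copies[n] = p;
        });
    }
    for (auto & t : touchers) t.join();
    return copies;                            // one local copy per NUMA node
}
```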
    
I see not everyone is on vacation :)
    
There's also transparent huge pages. In IK there's a switch, but from monitoring their size, the kernel seems to do it automatically.
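
For completeness, a buffer can also be opted into transparent huge pages explicitly with an madvise hint; a minimal sketch (not something this PR does):

```cpp
// Hint to the kernel that this buffer should be backed by transparent huge
// pages; the kernel is free to ignore it, and often promotes pages on its own.
#include <sys/mman.h>
#include <cstdlib>

void * alloc_with_thp_hint(size_t bytes) {
    void * p = nullptr;
    // 2 MiB alignment so whole regions are eligible for 2 MiB THP
    if (posix_memalign(&p, 2 * 1024 * 1024, bytes) != 0) {
        return nullptr;
    }
    madvise(p, bytes, MADV_HUGEPAGE);   // advisory only
    return p;
}
```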
    
Any news on this PR? Seems exciting! Would love to try it on our 4x CPU and 2x CPU DDR4 servers!
    
          
It's clear that dbsanfte has invested a great deal of time and effort into this work, so much so that code reviews alone would be a daunting task.
    
Halfway through I realized I can get most of the benefits of

Closing this in favour of:
    
Just a draft for now. Uses code from the fork by @wkgcass, with added cleanup, merged with a recent cut of master.

This strategy mirrors the model in the local memory of each numa node on your system to eliminate the slow UPI link bottleneck.
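
Conceptually, the mirroring looks something like the following sketch (an illustration using libnuma, not the actual implementation in this PR):

```cpp
// Give every NUMA node its own copy of the weights, so each socket's threads
// read from local DRAM instead of pulling tensor data across the UPI link.
#include <numa.h>
#include <cstddef>
#include <cstring>
#include <vector>

std::vector<void *> mirror_on_all_nodes(const void * src, std::size_t bytes) {
    const int n_nodes = numa_num_configured_nodes();
    std::vector<void *> mirrors(n_nodes, nullptr);
    for (int n = 0; n < n_nodes; ++n) {
        void * dst = numa_alloc_onnode(bytes, n);   // memory physically on node n
        if (dst != nullptr) {
            std::memcpy(dst, src, bytes);           // replicate the tensor data
        }
        mirrors[n] = dst;
    }
    return mirrors;   // threads running on node n read from mirrors[n]
}
```
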
Headline Improvements
Test system is a dual Xeon Gold 6240 with 768 GB of DDR4 @ 2933 MHz, 6 channels per socket.
I see a 64.6% performance improvement during inferencing on my system:
I see both memory banks being fully utilised during inference using the Intel pcm-memory tool:
Instructions
1. Install libnuma: `sudo apt-get install -y libnuma-dev`
2. Check out the source and build with `-DGGML_NUMA_MIRROR=ON`.
3. Make sure you run as a user with the ability to write to `/dev/hugepages`.
4. Allocate some hugepages on your system. This allocates about 80 GB, enough for 2x Qwen3-32B: `sudo sysctl -w vm.nr_hugepages=40000`
5. Run with `-ngl 0` (or whatever) and with `--numa distribute`. You should see the following: