
Conversation

@dbsanfte

@dbsanfte dbsanfte commented Jul 30, 2025

Just a draft for now. It uses code from the fork by @wkgcass, with cleanup added, merged with a recent cut of master.

This strategy mirrors the model into the local memory of each NUMA node on your system to eliminate the slow cross-socket UPI link bottleneck.
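Conceptually, the approach looks something like the sketch below (illustrative only, not the actual code in this PR; the struct and helper names are made up): each NUMA node gets its own copy of a read-only weight buffer, and a worker thread reads whichever copy is local to the node it is running on.

```c
// Sketch only: keep one copy of a read-only buffer in each NUMA node's local
// memory, and let each worker read the copy on the node it is running on.
// mirror_buffer and these helpers are illustrative names, not code from this PR.
#define _GNU_SOURCE
#include <numa.h>      // link with -lnuma
#include <sched.h>     // sched_getcpu
#include <string.h>

#define MAX_MIRROR_NODES 8

typedef struct {
    void * per_node[MAX_MIRROR_NODES]; // one copy of the data per NUMA node
    size_t size;
    int    n_nodes;
} mirror_buffer;

// Allocate and fill a copy of `src` in the local memory of every NUMA node.
static int mirror_buffer_init(mirror_buffer * mb, const void * src, size_t size) {
    if (numa_available() < 0) return -1;
    mb->size    = size;
    mb->n_nodes = numa_num_configured_nodes();
    if (mb->n_nodes > MAX_MIRROR_NODES) mb->n_nodes = MAX_MIRROR_NODES;
    for (int node = 0; node < mb->n_nodes; ++node) {
        mb->per_node[node] = numa_alloc_onnode(size, node); // pages placed on `node`
        if (!mb->per_node[node]) return -1;
        memcpy(mb->per_node[node], src, size);              // fill this node's mirror
    }
    return 0;
}

// Called from a worker thread: return the mirror local to the node the thread
// is currently running on, so weight reads never cross the UPI link.
static const void * mirror_buffer_local(const mirror_buffer * mb) {
    int node = numa_node_of_cpu(sched_getcpu());
    if (node < 0 || node >= mb->n_nodes) node = 0;
    return mb->per_node[node];
}
```

The obvious trade-off is RAM: every node holds a full copy of the weights, which is why the hugepage allocation in the instructions below has to cover roughly 2x the model size on a dual-socket system.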

Headline Improvements

The test system is a dual Xeon Gold 6240 with 768 GB of DDR4 @ 2933 MHz, 6 channels per socket.

I see a 64.6% performance improvement during inference on my system:

root@xeon:/home/dbsanfte/llama-cpp-dbsanfte# /home/dbsanfte/llama-cpp-dbsanfte/build/bin/llama-bench -m /home/dbsanfte/models/Qwen3-32B-Q6_K.gguf -ngl 0 -ot ".*=CPU" --numa distribute
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
| model                          |       size |     params | backend    | ngl | ot                    |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------------- | --------------: | -------------------: |
| qwen3 32B Q6_K                 |  25.03 GiB |    32.76 B | ROCm       |   0 | .*=CPU                |           pp512 |         64.55 ± 0.01 |
| qwen3 32B Q6_K                 |  25.03 GiB |    32.76 B | ROCm       |   0 | .*=CPU                |           tg128 |          2.43 ± 0.00 |

build: fa72aa39 (6010)


root@xeon:/home/dbsanfte/llama-cpp-dbsanfte# /home/dbsanfte/llama.cpp-rocm/build/bin/llama-bench -m /home/dbsanfte/models/Qwen3-32B-Q6_K.gguf -ngl 0 -ot ".*=CPU" --numa distribute
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
| model                          |       size |     params | backend    | ngl | ot                    |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------------- | --------------: | -------------------: |
| qwen3 32B Q6_K                 |  25.03 GiB |    32.76 B | ROCm       |   0 | .*=CPU                |           pp512 |         64.22 ± 0.11 |
| qwen3 32B Q6_K                 |  25.03 GiB |    32.76 B | ROCm       |   0 | .*=CPU                |           tg128 |          1.57 ± 0.00 |

build: a86f52b2 (5973)

Using the Intel pcm-memory tool, I can see both sockets' memory banks being fully utilised during inference:

|---------------------------------------||---------------------------------------|
|--            System DRAM Read Throughput(MB/s):      61880.60                --|
|--           System DRAM Write Throughput(MB/s):       2604.99                --|
|--             System PMM Read Throughput(MB/s):          0.00                --|
|--            System PMM Write Throughput(MB/s):          0.00                --|
|--                 System Read Throughput(MB/s):      61880.60                --|
|--                System Write Throughput(MB/s):       2604.99                --|
|--               System Memory Throughput(MB/s):      64485.59                --|
|---------------------------------------||---------------------------------------|
|---------------------------------------||---------------------------------------|
|--             Socket  0             --||--             Socket  1             --|
|---------------------------------------||---------------------------------------|
|--     Memory Channel Monitoring     --||--     Memory Channel Monitoring     --|
|---------------------------------------||---------------------------------------|
|-- Mem Ch  0: Reads (MB/s):  4990.48 --||-- Mem Ch  0: Reads (MB/s):  5244.65 --|
|--            Writes(MB/s):    37.25 --||--            Writes(MB/s):   400.60 --|
|--      PMM Reads(MB/s)   :     0.00 --||--      PMM Reads(MB/s)   :     0.00 --|
|--      PMM Writes(MB/s)  :     0.00 --||--      PMM Writes(MB/s)  :     0.00 --|
|-- Mem Ch  1: Reads (MB/s):  4989.23 --||-- Mem Ch  1: Reads (MB/s):  5198.51 --|
|--            Writes(MB/s):    35.61 --||--            Writes(MB/s):   333.43 --|
|--      PMM Reads(MB/s)   :     0.00 --||--      PMM Reads(MB/s)   :     0.00 --|
|--      PMM Writes(MB/s)  :     0.00 --||--      PMM Writes(MB/s)  :     0.00 --|
|-- Mem Ch  2: Reads (MB/s):  4988.42 --||-- Mem Ch  2: Reads (MB/s):  5291.83 --|
|--            Writes(MB/s):    34.97 --||--            Writes(MB/s):   480.57 --|
|--      PMM Reads(MB/s)   :     0.00 --||--      PMM Reads(MB/s)   :     0.00 --|
|--      PMM Writes(MB/s)  :     0.00 --||--      PMM Writes(MB/s)  :     0.00 --|
|-- Mem Ch  3: Reads (MB/s):  4990.57 --||-- Mem Ch  3: Reads (MB/s):  5259.35 --|
|--            Writes(MB/s):    32.57 --||--            Writes(MB/s):   438.86 --|
|--      PMM Reads(MB/s)   :     0.00 --||--      PMM Reads(MB/s)   :     0.00 --|
|--      PMM Writes(MB/s)  :     0.00 --||--      PMM Writes(MB/s)  :     0.00 --|
|-- Mem Ch  4: Reads (MB/s):  4991.53 --||-- Mem Ch  4: Reads (MB/s):  5245.06 --|
|--            Writes(MB/s):    33.23 --||--            Writes(MB/s):   399.35 --|
|--      PMM Reads(MB/s)   :     0.00 --||--      PMM Reads(MB/s)   :     0.00 --|
|--      PMM Writes(MB/s)  :     0.00 --||--      PMM Writes(MB/s)  :     0.00 --|
|-- Mem Ch  5: Reads (MB/s):  4992.82 --||-- Mem Ch  5: Reads (MB/s):  5233.07 --|
|--            Writes(MB/s):    34.20 --||--            Writes(MB/s):   389.41 --|
|--      PMM Reads(MB/s)   :     0.00 --||--      PMM Reads(MB/s)   :     0.00 --|
|--      PMM Writes(MB/s)  :     0.00 --||--      PMM Writes(MB/s)  :     0.00 --|
|-- SKT  0 Mem Read (MB/s) : 29943.05 --||-- SKT  1 Mem Read (MB/s) : 31472.47 --|
|-- SKT  0 Mem Write(MB/s) :   207.83 --||-- SKT  1 Mem Write(MB/s) :  2442.21 --|
|-- SKT  0 PMM Read (MB/s):      0.00 --||-- SKT  1 PMM Read (MB/s):      0.00 --|
|-- SKT  0 PMM Write(MB/s):      0.00 --||-- SKT  1 PMM Write(MB/s):      0.00 --|
|-- SKT  0.0 NM read hit rate :  1.00 --||-- SKT  1.0 NM read hit rate :  1.03 --|
|-- SKT  0.1 NM read hit rate :  1.00 --||-- SKT  1.1 NM read hit rate :  1.03 --|
|-- SKT  0.2 NM read hit rate :  0.00 --||-- SKT  1.2 NM read hit rate :  0.00 --|
|-- SKT  0.3 NM read hit rate :  0.00 --||-- SKT  1.3 NM read hit rate :  0.00 --|
|-- SKT  0.4 NM read hit rate :  0.00 --||-- SKT  1.4 NM read hit rate :  0.00 --|
|-- SKT  0.5 NM read hit rate :  0.00 --||-- SKT  1.5 NM read hit rate :  0.00 --|
|-- SKT  0.6 NM read hit rate :  0.00 --||-- SKT  1.6 NM read hit rate :  0.00 --|
|-- SKT  0.7 NM read hit rate :  0.00 --||-- SKT  1.7 NM read hit rate :  0.00 --|
|-- SKT  0.8 NM read hit rate :  0.00 --||-- SKT  1.8 NM read hit rate :  0.00 --|
|-- SKT  0.9 NM read hit rate :  0.00 --||-- SKT  1.9 NM read hit rate :  0.00 --|
|-- SKT  0.10 NM read hit rate :  0.00 --||-- SKT  1.10 NM read hit rate :  0.00 --|
|-- SKT  0.11 NM read hit rate :  0.00 --||-- SKT  1.11 NM read hit rate :  0.00 --|
|-- SKT  0.12 NM read hit rate :  0.00 --||-- SKT  1.12 NM read hit rate :  0.00 --|
|-- SKT  0.13 NM read hit rate :  0.00 --||-- SKT  1.13 NM read hit rate :  0.00 --|
|-- SKT  0.14 NM read hit rate :  0.00 --||-- SKT  1.14 NM read hit rate :  0.00 --|
|-- SKT  0.15 NM read hit rate :  0.00 --||-- SKT  1.15 NM read hit rate :  0.00 --|
|-- SKT  0.16 NM read hit rate :  0.00 --||-- SKT  1.16 NM read hit rate :  0.00 --|
|-- SKT  0.17 NM read hit rate :  0.00 --||-- SKT  1.17 NM read hit rate :  0.00 --|
|-- SKT  0.18 NM read hit rate :  0.00 --||-- SKT  1.18 NM read hit rate :  0.00 --|
|-- SKT  0.19 NM read hit rate :  0.00 --||-- SKT  1.19 NM read hit rate :  0.00 --|
|-- SKT  0.20 NM read hit rate :  0.00 --||-- SKT  1.20 NM read hit rate :  0.00 --|
|-- SKT  0.21 NM read hit rate :  0.00 --||-- SKT  1.21 NM read hit rate :  0.00 --|
|-- SKT  0.22 NM read hit rate :  0.00 --||-- SKT  1.22 NM read hit rate :  0.00 --|
|-- SKT  0.23 NM read hit rate :  0.00 --||-- SKT  1.23 NM read hit rate :  0.00 --|
|-- SKT  0.24 NM read hit rate :  0.00 --||-- SKT  1.24 NM read hit rate :  0.00 --|
|-- SKT  0.25 NM read hit rate :  0.00 --||-- SKT  1.25 NM read hit rate :  0.00 --|
|-- SKT  0.26 NM read hit rate :  0.00 --||-- SKT  1.26 NM read hit rate :  0.00 --|
|-- SKT  0.27 NM read hit rate :  0.00 --||-- SKT  1.27 NM read hit rate :  0.00 --|
|-- SKT  0.28 NM read hit rate :  0.00 --||-- SKT  1.28 NM read hit rate :  0.00 --|
|-- SKT  0.29 NM read hit rate :  0.00 --||-- SKT  1.29 NM read hit rate :  0.00 --|
|-- SKT  0.30 NM read hit rate :  0.00 --||-- SKT  1.30 NM read hit rate :  0.00 --|
|-- SKT  0.31 NM read hit rate :  0.00 --||-- SKT  1.31 NM read hit rate :  0.00 --|
|-- SKT  0 Memory (MB/s):    30150.88 --||-- SKT  1 Memory (MB/s):    33914.68 --|
|---------------------------------------||---------------------------------------|
|---------------------------------------||---------------------------------------|
|--            System DRAM Read Throughput(MB/s):      61415.52                --|
|--           System DRAM Write Throughput(MB/s):       2650.04                --|
|--             System PMM Read Throughput(MB/s):          0.00                --|
|--            System PMM Write Throughput(MB/s):          0.00                --|
|--                 System Read Throughput(MB/s):      61415.52                --|
|--                System Write Throughput(MB/s):       2650.04                --|
|--               System Memory Throughput(MB/s):      64065.56                --|
|---------------------------------------||---------------------------------------|

Instructions

  1. sudo apt-get install -y libnuma-dev

  2. Check out the source and build with -DGGML_NUMA_MIRROR=ON.

  3. Make sure you run as a user with the ability to write to /dev/hugepages.

  4. Allocate some hugepages on your system. This allocates about 80 GB (40000 default 2 MiB pages), enough for 2x Qwen3-32B:

sudo sysctl -w vm.nr_hugepages=40000

  5. Run llama-server with CPU offload (-ngl 0 or whatever) and with --numa distribute.

You should see the following:

Aug 01 14:54:52 xeon bash[52251]: load_tensors: tensor 'token_embd.weight' (q4_K) (and 174 others) cannot be used with preferred buffer type ROCm_Host, using CPU instead
Aug 01 14:54:52 xeon bash[52251]: Creating unified NUMA mapping for 6 multi-part GGUF files
Aug 01 14:54:52 xeon bash[52251]: Detected 2 NUMA nodes for unified multi-part mapping
Aug 01 14:54:52 xeon bash[52251]: Total unified model size: 272915622592 bytes across 6 files
Aug 01 14:54:52 xeon bash[52251]: Creating unified mapping: 255 hugepages (273804165120 bytes total) for 272915622592 bytes across 6 files
Aug 01 14:54:52 xeon bash[52251]: numa_set_preferred(0) - creating single unified mapping
Aug 01 14:56:20 xeon bash[52251]: mmap(/dev/hugepages/llama-unified-node0-0) desire=0x200000000000 size=273804165120 result=0x200000000000 is_new_mem[0]=yes
Aug 01 14:57:33 xeon bash[52251]: numa_set_preferred(1) - creating single unified mapping
Aug 01 14:58:18 xeon bash[52251]: mmap(/dev/hugepages/llama-unified-node1-0) desire=0x400000000000 size=273804165120 result=0x400000000000 is_new_mem[1]=yes
Aug 01 14:58:54 xeon bash[52251]: begin to copy unified model data from disk to mem...
Aug 01 14:58:54 xeon bash[52251]: copying file data at offset 0, size 49913955424
Aug 01 14:59:31 xeon bash[52251]: copying file data at offset 49913955424, size 48440972736
Aug 01 15:00:05 xeon bash[52251]: copying file data at offset 98354928160, size 49385348672
Aug 01 15:00:41 xeon bash[52251]: copying file data at offset 147740276832, size 49200580288
Aug 01 15:01:16 xeon bash[52251]: copying file data at offset 196940857120, size 49567581888
Aug 01 15:01:51 xeon bash[52251]: copying file data at offset 246508439008, size 26407183584
Aug 01 15:02:10 xeon bash[52251]: begin to copy unified model from numa0 to numa1...
Aug 01 15:03:04 xeon bash[52251]: load_tensors: offloading 61 repeating layers to GPU
Aug 01 15:03:04 xeon bash[52251]: load_tensors: offloading output layer to GPU
Aug 01 15:03:04 xeon bash[52251]: load_tensors: offloaded 62/62 layers to GPU
Aug 01 15:03:04 xeon bash[52251]: load_tensors:        ROCm0 model buffer size =  9562.48 MiB
Aug 01 15:03:04 xeon bash[52251]: load_tensors:   CPU_Mapped model buffer size = 46857.20 MiB
Aug 01 15:03:04 xeon bash[52251]: load_tensors:   CPU_Mapped model buffer size = 46189.46 MiB
Aug 01 15:03:04 xeon bash[52251]: load_tensors:   CPU_Mapped model buffer size = 46985.99 MiB
Aug 01 15:03:04 xeon bash[52251]: load_tensors:   CPU_Mapped model buffer size = 46810.22 MiB
Aug 01 15:03:04 xeon bash[52251]: load_tensors:   CPU_Mapped model buffer size = 47160.22 MiB
Aug 01 15:03:04 xeon bash[52251]: load_tensors:   CPU_Mapped model buffer size = 25175.97 MiB
Aug 01 15:03:06 xeon bash[52251]: ...................................................................................................
Aug 01 15:03:06 xeon bash[52251]: .
Aug 01 15:03:06 xeon bash[52251]: llama_context: constructing llama_context
Aug 01 15:03:06 xeon bash[52251]: llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
Aug 01 15:03:06 xeon bash[52251]: llama_context: n_seq_max     = 1
Aug 01 15:03:06 xeon bash[52251]: llama_context: n_ctx         = 131072
Aug 01 15:03:06 xeon bash[52251]: llama_context: n_ctx_per_seq = 131072
Aug 01 15:03:06 xeon bash[52251]: llama_context: n_batch       = 2048
Aug 01 15:03:06 xeon bash[52251]: llama_context: n_ubatch      = 512
Aug 01 15:03:06 xeon bash[52251]: llama_context: causal_attn   = 1
Aug 01 15:03:06 xeon bash[52251]: llama_context: flash_attn    = 1
Aug 01 15:03:06 xeon bash[52251]: llama_context: kv_unified    = true
Aug 01 15:03:06 xeon bash[52251]: llama_context: freq_base     = 10000.0
Aug 01 15:03:06 xeon bash[52251]: llama_context: freq_scale    = 0.025
Aug 01 15:03:06 xeon bash[52251]: llama_context: n_ctx_per_seq (131072) < n_ctx_train (163840) -- the full capacity of the model will not be utilized
Aug 01 15:03:06 xeon bash[52251]: set_abort_callback: call
Aug 01 15:03:06 xeon bash[52251]: llama_context:  ROCm_Host  output buffer size =     0.49 MiB
Aug 01 15:03:06 xeon bash[52251]: create_memory: n_ctx = 131072 (padded)
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer   0: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer   1: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer   2: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer   3: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer   4: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer   5: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer   6: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer   7: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer   8: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer   9: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  10: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  11: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  12: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  13: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  14: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  15: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  16: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  17: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  18: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  19: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  20: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  21: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  22: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  23: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  24: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  25: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  26: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  27: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  28: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  29: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  30: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  31: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  32: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  33: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  34: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  35: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  36: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  37: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  38: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  39: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  40: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  41: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  42: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  43: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  44: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  45: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  46: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  47: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  48: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  49: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  50: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  51: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  52: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  53: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  54: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  55: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  56: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  57: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  58: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  59: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: layer  60: dev = ROCm0
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified:      ROCm0 KV buffer size = 16592.00 MiB
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: size = 16592.00 MiB (131072 cells,  61 layers,  1/ 1 seqs), K (f16): 8784.00 MiB, V (f16): 7808.00 MiB
Aug 01 15:03:06 xeon bash[52251]: llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
Aug 01 15:03:06 xeon bash[52251]: llama_context: enumerating backends
Aug 01 15:03:06 xeon bash[52251]: llama_context: backend_ptrs.size() = 2
Aug 01 15:03:06 xeon bash[52251]: llama_context: max_nodes = 8688
Aug 01 15:03:06 xeon bash[52251]: llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 0
Aug 01 15:03:06 xeon bash[52251]: graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  1, n_outputs =  512
Aug 01 15:03:06 xeon bash[52251]: graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
Aug 01 15:03:06 xeon bash[52251]: graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  1, n_outputs =  512
Aug 01 15:03:06 xeon bash[52251]: llama_context:      ROCm0 compute buffer size =  2077.50 MiB
Aug 01 15:03:06 xeon bash[52251]: llama_context:  ROCm_Host compute buffer size =   784.01 MiB
Aug 01 15:03:06 xeon bash[52251]: llama_context: graph nodes  = 4907
Aug 01 15:03:06 xeon bash[52251]: llama_context: graph splits = 298 (with bs=512), 240 (with bs=1)
Aug 01 15:03:06 xeon bash[52251]: clear_adapter_lora: call
Aug 01 15:03:06 xeon bash[52251]: common_init_from_params: added <|end▁of▁sentence|> logit bias = -inf
Aug 01 15:03:06 xeon bash[52251]: common_init_from_params: setting dry_penalty_last_n to ctx_size = 131072
Aug 01 15:03:06 xeon bash[52251]: common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
Aug 01 15:03:06 xeon bash[52251]: set_warmup: value = 1
Aug 01 15:03:06 xeon bash[52251]: thread_id = 00, target_node = 0, actual_node = 0, cpuid = 00, n_threads = 1
Aug 01 15:03:06 xeon bash[52251]: thread_id = 18, target_node = 0, actual_node = 0, cpuid = 00, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 15, target_node = 1, actual_node = 1, cpuid = 18, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 34, target_node = 0, actual_node = 0, cpuid = 00, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 03, target_node = 1, actual_node = 1, cpuid = 18, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 04, target_node = 0, actual_node = 0, cpuid = 00, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 07, target_node = 1, actual_node = 1, cpuid = 18, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 02, target_node = 0, actual_node = 0, cpuid = 00, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 25, target_node = 1, actual_node = 1, cpuid = 18, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 26, target_node = 0, actual_node = 0, cpuid = 00, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 01, target_node = 1, actual_node = 1, cpuid = 18, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 24, target_node = 0, actual_node = 0, cpuid = 00, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 17, target_node = 1, actual_node = 1, cpuid = 18, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 32, target_node = 0, actual_node = 0, cpuid = 00, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 05, target_node = 1, actual_node = 1, cpuid = 18, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 20, target_node = 0, actual_node = 0, cpuid = 00, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 29, target_node = 1, actual_node = 1, cpuid = 18, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 12, target_node = 0, actual_node = 0, cpuid = 00, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 09, target_node = 1, actual_node = 1, cpuid = 18, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 10, target_node = 0, actual_node = 0, cpuid = 00, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 27, target_node = 1, actual_node = 1, cpuid = 18, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 06, target_node = 0, actual_node = 0, cpuid = 00, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 11, target_node = 1, actual_node = 1, cpuid = 18, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 22, target_node = 0, actual_node = 0, cpuid = 00, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 23, target_node = 1, actual_node = 1, cpuid = 18, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 08, target_node = 0, actual_node = 0, cpuid = 00, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 31, target_node = 1, actual_node = 1, cpuid = 18, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 16, target_node = 0, actual_node = 0, cpuid = 00, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 21, target_node = 1, actual_node = 1, cpuid = 18, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 14, target_node = 0, actual_node = 0, cpuid = 00, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 35, target_node = 1, actual_node = 1, cpuid = 18, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 30, target_node = 0, actual_node = 0, cpuid = 00, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 19, target_node = 1, actual_node = 1, cpuid = 18, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 28, target_node = 0, actual_node = 0, cpuid = 00, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 13, target_node = 1, actual_node = 1, cpuid = 18, n_threads = 36
Aug 01 15:03:06 xeon bash[52251]: thread_id = 33, target_node = 1, actual_node = 1, cpuid = 18, n_threads = 36
Aug 01 15:03:14 xeon bash[52251]: set_warmup: value = 0
Aug 01 15:03:14 xeon bash[52251]: srv          init: initializing slots, n_slots = 1
Aug 01 15:03:14 xeon bash[52251]: slot         init: id  0 | task -1 | new slot n_ctx_slot = 131072
Aug 01 15:03:14 xeon bash[52251]: slot        reset: id  0 | task -1 |
Aug 01 15:03:14 xeon bash[52251]: main: model loaded

@jukofyork
Collaborator

And can you also make sure you run with -v and post the log here, and let's see what numa nodes it's putting all those threads on.

I think you mean -u?

|---------------------------------------||---------------------------------------|
|--             Socket  0             --||--             Socket  1             --|
|---------------------------------------||---------------------------------------|
|--     Memory Channel Monitoring     --||--     Memory Channel Monitoring     --|
|---------------------------------------||---------------------------------------|
|-- Mem Ch  0: Reads (MB/s):  6724.50 --||-- Mem Ch  0: Reads (MB/s):  6727.73 --|
|--            Writes(MB/s):   271.34 --||--            Writes(MB/s):   249.71 --|
|--      PMM Reads(MB/s)   :     0.00 --||--      PMM Reads(MB/s)   :     0.00 --|
|--      PMM Writes(MB/s)  :     0.00 --||--      PMM Writes(MB/s)  :     0.00 --|
|-- Mem Ch  1: Reads (MB/s):  6702.11 --||-- Mem Ch  1: Reads (MB/s):  6734.47 --|
|--            Writes(MB/s):   244.82 --||--            Writes(MB/s):   258.78 --|
|--      PMM Reads(MB/s)   :     0.00 --||--      PMM Reads(MB/s)   :     0.00 --|
|--      PMM Writes(MB/s)  :     0.00 --||--      PMM Writes(MB/s)  :     0.00 --|
|-- Mem Ch  2: Reads (MB/s):  6706.66 --||-- Mem Ch  2: Reads (MB/s):  6682.26 --|
|--            Writes(MB/s):   250.18 --||--            Writes(MB/s):   181.19 --|
|--      PMM Reads(MB/s)   :     0.00 --||--      PMM Reads(MB/s)   :     0.00 --|
|--      PMM Writes(MB/s)  :     0.00 --||--      PMM Writes(MB/s)  :     0.00 --|
|-- Mem Ch  3: Reads (MB/s):  6724.16 --||-- Mem Ch  3: Reads (MB/s):  6687.68 --|
|--            Writes(MB/s):   270.95 --||--            Writes(MB/s):   185.54 --|
|--      PMM Reads(MB/s)   :     0.00 --||--      PMM Reads(MB/s)   :     0.00 --|
|--      PMM Writes(MB/s)  :     0.00 --||--      PMM Writes(MB/s)  :     0.00 --|
|-- Mem Ch  4: Reads (MB/s):  6707.93 --||-- Mem Ch  4: Reads (MB/s):  6737.81 --|
|--            Writes(MB/s):   246.61 --||--            Writes(MB/s):   261.74 --|
|--      PMM Reads(MB/s)   :     0.00 --||--      PMM Reads(MB/s)   :     0.00 --|
|--      PMM Writes(MB/s)  :     0.00 --||--      PMM Writes(MB/s)  :     0.00 --|
|-- Mem Ch  5: Reads (MB/s):  6697.25 --||-- Mem Ch  5: Reads (MB/s):  6717.85 --|
|--            Writes(MB/s):   231.48 --||--            Writes(MB/s):   225.57 --|
|--      PMM Reads(MB/s)   :     0.00 --||--      PMM Reads(MB/s)   :     0.00 --|
|--      PMM Writes(MB/s)  :     0.00 --||--      PMM Writes(MB/s)  :     0.00 --|
|-- SKT  0 Mem Read (MB/s) : 40262.61 --||-- SKT  1 Mem Read (MB/s) : 40287.79 --|
|-- SKT  0 Mem Write(MB/s) :  1515.37 --||-- SKT  1 Mem Write(MB/s) :  1362.54 --|
|-- SKT  0 PMM Read (MB/s):      0.00 --||-- SKT  1 PMM Read (MB/s):      0.00 --|
|-- SKT  0 PMM Write(MB/s):      0.00 --||-- SKT  1 PMM Write(MB/s):      0.00 --|
|-- SKT  0.0 NM read hit rate :  1.00 --||-- SKT  1.0 NM read hit rate :  1.01 --|
|-- SKT  0.1 NM read hit rate :  1.00 --||-- SKT  1.1 NM read hit rate :  1.01 --|
|-- SKT  0.2 NM read hit rate :  0.00 --||-- SKT  1.2 NM read hit rate :  0.00 --|
|-- SKT  0.3 NM read hit rate :  0.00 --||-- SKT  1.3 NM read hit rate :  0.00 --|
|-- SKT  0.4 NM read hit rate :  0.00 --||-- SKT  1.4 NM read hit rate :  0.00 --|
|-- SKT  0.5 NM read hit rate :  0.00 --||-- SKT  1.5 NM read hit rate :  0.00 --|
|-- SKT  0.6 NM read hit rate :  0.00 --||-- SKT  1.6 NM read hit rate :  0.00 --|
|-- SKT  0.7 NM read hit rate :  0.00 --||-- SKT  1.7 NM read hit rate :  0.00 --|
|-- SKT  0.8 NM read hit rate :  0.00 --||-- SKT  1.8 NM read hit rate :  0.00 --|
|-- SKT  0.9 NM read hit rate :  0.00 --||-- SKT  1.9 NM read hit rate :  0.00 --|
|-- SKT  0.10 NM read hit rate :  0.00 --||-- SKT  1.10 NM read hit rate :  0.00 --|
|-- SKT  0.11 NM read hit rate :  0.00 --||-- SKT  1.11 NM read hit rate :  0.00 --|
|-- SKT  0.12 NM read hit rate :  0.00 --||-- SKT  1.12 NM read hit rate :  0.00 --|
|-- SKT  0.13 NM read hit rate :  0.00 --||-- SKT  1.13 NM read hit rate :  0.00 --|
|-- SKT  0.14 NM read hit rate :  0.00 --||-- SKT  1.14 NM read hit rate :  0.00 --|
|-- SKT  0.15 NM read hit rate :  0.00 --||-- SKT  1.15 NM read hit rate :  0.00 --|
|-- SKT  0.16 NM read hit rate :  0.00 --||-- SKT  1.16 NM read hit rate :  0.00 --|
|-- SKT  0.17 NM read hit rate :  0.00 --||-- SKT  1.17 NM read hit rate :  0.00 --|
|-- SKT  0.18 NM read hit rate :  0.00 --||-- SKT  1.18 NM read hit rate :  0.00 --|
|-- SKT  0.19 NM read hit rate :  0.00 --||-- SKT  1.19 NM read hit rate :  0.00 --|
|-- SKT  0.20 NM read hit rate :  0.00 --||-- SKT  1.20 NM read hit rate :  0.00 --|
|-- SKT  0.21 NM read hit rate :  0.00 --||-- SKT  1.21 NM read hit rate :  0.00 --|
|-- SKT  0.22 NM read hit rate :  0.00 --||-- SKT  1.22 NM read hit rate :  0.00 --|
|-- SKT  0.23 NM read hit rate :  0.00 --||-- SKT  1.23 NM read hit rate :  0.00 --|
|-- SKT  0.24 NM read hit rate :  0.00 --||-- SKT  1.24 NM read hit rate :  0.00 --|
|-- SKT  0.25 NM read hit rate :  0.00 --||-- SKT  1.25 NM read hit rate :  0.00 --|
|-- SKT  0.26 NM read hit rate :  0.00 --||-- SKT  1.26 NM read hit rate :  0.00 --|
|-- SKT  0.27 NM read hit rate :  0.00 --||-- SKT  1.27 NM read hit rate :  0.00 --|
|-- SKT  0.28 NM read hit rate :  0.00 --||-- SKT  1.28 NM read hit rate :  0.00 --|
|-- SKT  0.29 NM read hit rate :  0.00 --||-- SKT  1.29 NM read hit rate :  0.00 --|
|-- SKT  0.30 NM read hit rate :  0.00 --||-- SKT  1.30 NM read hit rate :  0.00 --|
|-- SKT  0.31 NM read hit rate :  0.00 --||-- SKT  1.31 NM read hit rate :  0.00 --|
|-- SKT  0 Memory (MB/s):    41777.98 --||-- SKT  1 Memory (MB/s):    41650.33 --|
|---------------------------------------||---------------------------------------|
|---------------------------------------||---------------------------------------|
|--            System DRAM Read Throughput(MB/s):      80550.40                --|
|--           System DRAM Write Throughput(MB/s):       2877.91                --|
|--             System PMM Read Throughput(MB/s):          0.00                --|
|--            System PMM Write Throughput(MB/s):          0.00                --|
|--                 System Read Throughput(MB/s):      80550.40                --|
|--                System Write Throughput(MB/s):       2877.91                --|
|--               System Memory Throughput(MB/s):      83428.31                --|
|---------------------------------------||---------------------------------------|
slot update_slots: id  0 | task 1087 | new prompt, n_ctx_slot = 65536, n_keep = 0, n_prompt_tokens = 16
slot update_slots: id  0 | task 1087 | need to evaluate at least 1 token for each active slot, n_past = 16, n_prompt_tokens = 16
slot update_slots: id  0 | task 1087 | kv cache rm [15, end)
slot update_slots: id  0 | task 1087 | prompt processing progress, n_past = 16, n_tokens = 1, progress = 0.062500
slot update_slots: id  0 | task 1087 | prompt done, n_past = 16, n_tokens = 1
slot      release: id  0 | task 1087 | stop processing: n_past = 1570, truncated = 0
slot print_timing: id  0 | task 1087 | 
prompt eval time =     244.06 ms /     1 tokens (  244.06 ms per token,     4.10 tokens per second)
       eval time =  230359.52 ms /  1555 tokens (  148.14 ms per token,     6.75 tokens per second)
      total time =  230603.58 ms /  1556 tokens

These are just my stock settings and not this PR though; I probably won't have a chance to run that until tomorrow or Monday now.

According to:

https://en.wikichip.org/wiki/intel/xeon_gold/6248

(2699/2933) × 131.13 ≈ 120 GB/s per socket is the theoretical maximum bandwidth, so it looks like I'm getting about a third of this.

@dbsanfte
Author

dbsanfte commented Aug 1, 2025

No, I mean run llama-server with -v and post its startup log so we can see where my NUMA code is putting your threads.

Looks like you are getting symmetric bandwidth usage on both nodes though. I think that's what I would expect...

@jukofyork
Collaborator

No, I mean run llama-server with -v and post its startup log so we can see where my NUMA code is putting your threads.

Oh, sorry -u is to make pcm-memory act like you used watch on it! :)

@dbsanfte
Author

dbsanfte commented Aug 1, 2025

Anyway, I think I'll add very detailed logging at the start of memory allocation covering:

  • what numa topology it sees

  • what sockets and cores it sees

  • what kind of cores they are (perf/efficiency/hyper-threaded)

  • where it plans to put the threads: on which nodes

Then it will be very clear and easy to debug.
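For reference, the kind of probe that logging could be built on looks roughly like this (a sketch assuming libnuma; dump_numa_topology is a made-up name):

```c
// Rough sketch of a NUMA topology dump using libnuma. SMT sibling detection
// would additionally need /sys/devices/system/cpu/cpu*/topology/thread_siblings_list.
#include <numa.h>
#include <stdio.h>

static void dump_numa_topology(void) {
    if (numa_available() < 0) {
        printf("numa: not available on this system\n");
        return;
    }
    const int n_nodes = numa_num_configured_nodes();
    const int n_cpus  = numa_num_configured_cpus();
    printf("numa: %d node(s), %d logical cpu(s)\n", n_nodes, n_cpus);
    for (int node = 0; node < n_nodes; ++node) {
        long long free_b = 0;
        long long size_b = numa_node_size64(node, &free_b);
        printf("  node %d: %lld MiB total, %lld MiB free, cpus:",
               node, size_b >> 20, free_b >> 20);
        for (int cpu = 0; cpu < n_cpus; ++cpu) {
            if (numa_node_of_cpu(cpu) == node) {
                printf(" %d", cpu);
            }
        }
        printf("\n");
    }
}
```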

There could be strange things like sub-NUMA clustering going on; I've read that's a thing.

@dbsanfte
Author

dbsanfte commented Aug 1, 2025

It also occurs to me that PCIe slots are always attached to a single socket, so NUMA-aware allocation might impact that; that would be an interesting side effect. I'm learning more about memory every day... 😄

@sultanqasim

sultanqasim commented Aug 2, 2025

I tried this out on my dual Xeon 4216 system (no GPU) with Cohere Command-A on RHEL 8. I had to change the gettid and getcpu calls (replacing them with raw syscalls) because those wrappers were only added in glibc 2.30/2.29, while RHEL 8 uses glibc 2.28. I got it to build, and it appears to allocate mirrored copies of the model for the two sockets.
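For anyone else on an older glibc, the workaround looks roughly like this (a sketch, not necessarily the exact change I made):

```c
// Raw-syscall fallbacks for gettid()/getcpu(); the glibc wrappers only appeared
// in glibc 2.30 and 2.29 respectively, so RHEL 8 (glibc 2.28) doesn't have them.
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/types.h>

static pid_t compat_gettid(void) {
    return (pid_t) syscall(SYS_gettid);
}

static int compat_getcpu(unsigned * cpu, unsigned * node) {
    return (int) syscall(SYS_getcpu, cpu, node, NULL);
}
```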

Unfortunately, I didn't see any change to performance on my system. Here's the command I used:

sudo sysctl -w vm.nr_hugepages=120000
echo 3 | sudo tee /proc/sys/vm/drop_caches
sudo ~/llama.cpp/build/bin/llama-server -m ~/llm/c4ai-command-a-03-2025-Q4_K_M-00001-of-00002.gguf --numa distribute --threads 64 -c 65536 -np 2 --temp 0.6 -fa -ctk q8_0 -ctv q8_0 -ot ".*=CPU" --chat-template-file ~/llm/c4ai-command-a-template-notool.txt

Edit: I tried allocating hugepages with a script similar to what you shared above, except with 49152 2048k hugepages per node. Still no performance change.

@jukofyork
Collaborator

According to:

https://en.wikichip.org/wiki/intel/xeon_gold/6248

(2699/2933) × 131.13 ≈ 120 GB/s per socket is the theoretical maximum bandwidth, so it looks like I'm getting about a third of this.

Thinking about this more today: for offloading shared experts only, pcm-memory is probably giving quite deceptive throughput statistics, as it really depends on what fraction of the time is spent in the non-offloaded calculation vs the offloaded calculation...

If the sampling frequency is high enough, then I might be able to hack pcm-memory or, failing that, add some timing stats to llama.cpp to track this.

@jukofyork
Collaborator

Here's the command I used:

sudo sysctl -w vm.nr_hugepages=120000
echo 3 | sudo tee /proc/sys/vm/drop_caches
sudo ~/llama.cpp/build/bin/llama-server -m ~/llm/c4ai-command-a-03-2025-Q4_K_M-00001-of-00002.gguf --numa distribute --threads 64 -c 65536 -np 2 --temp 0.6 -fa -ctk q8_0 -ctv q8_0 -ot ".*=CPU" --chat-template-file ~/llm/c4ai-command-a-template-notool.txt

It might be worth trying without the -np 2 in case llama.cpp does anything different with that.

Do you find that using --threads 64 (ie: using your hyperthreading threads) gives better performance than using --threads 32 without this PR?

@jukofyork
Collaborator

jukofyork commented Aug 2, 2025

It also occurs to me that PCIe slots are always attached to a single socket, so NUMA-aware allocation might impact that; that would be an interesting side effect. I'm learning more about memory every day... 😄

IIRC, the current CUDA offloading code only uses a single GPU for the offloaded calculations, so having 2 copies won't really help it.

I do think there is a bottleneck somewhere, as PCIe 3.0 x16 has a max bandwidth of ~16 GB/s, yet watching nvtop during large batch processing it often only gets around a third of this (but, as with the pcm-memory test, I can't be sure whether this is due to averaging a bimodal set of timings).

@dbsanfte
Author

dbsanfte commented Aug 2, 2025

If the threads doing the offloading are located on socket 1, but the GPU is on a PCIe slot attached to socket 2, maybe that would be sending the traffic over the UPI link? Might be worth investigating. I'll try to get better visibility of thread/NUMA assignments in soon.
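One quick way to check which socket a GPU hangs off is to read the device's numa_node entry in sysfs; a rough sketch (the PCI address is just a placeholder):

```c
// Sketch: read /sys/bus/pci/devices/<addr>/numa_node to see which NUMA node a
// PCIe device (e.g. the GPU) is attached to. "0000:03:00.0" is only an example
// address; -1 means the kernel reports no NUMA affinity for the device.
#include <stdio.h>

static int pci_device_numa_node(const char * pci_addr) {
    char path[128];
    snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/numa_node", pci_addr);
    FILE * f = fopen(path, "r");
    if (!f) return -1;
    int node = -1;
    if (fscanf(f, "%d", &node) != 1) node = -1;
    fclose(f);
    return node;
}

// e.g.: int gpu_node = pci_device_numa_node("0000:03:00.0");
```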

@sultanqasim

It might be worth trying without the -np 2 in case llama.cpp does anything different with that.

Do you find that using --threads 64 (ie: using your hyperthreading threads) gives better performance than using --threads 32 without this PR?

Without this PR, I had a slight speedup from using HyperThreading (i.e. --threads 64 instead of --threads 32).

Removing -np 2 had no impact on performance (for a single request, nothing running concurrently). However, I noticed that with -np 2 the generated tokens were gibberish, while without -np 2 it was giving valid/correct outputs. Looks like a bug.

With this PR, switching from --threads 64 to --threads 32 had the same slowdown I had without this PR.

GGML_NUMA_MIRROR, with 64 threads

prompt eval time =   70104.59 ms /   321 tokens (  218.39 ms per token,     4.58 tokens per second)
       eval time =   85848.91 ms /   165 tokens (  520.30 ms per token,     1.92 tokens per second)
      total time =  155953.50 ms /   486 tokens

GGML_NUMA_MIRROR, with 32 threads

prompt eval time =   75922.60 ms /   321 tokens (  236.52 ms per token,     4.23 tokens per second)
       eval time =  118967.32 ms /   201 tokens (  591.88 ms per token,     1.69 tokens per second)
      total time =  194889.92 ms /   522 tokens

@FullstackSensei

FullstackSensei commented Aug 3, 2025

Installed libnuma and pulled the latest changes (9d66473). I had to disable building RPC to build successfully.

Tried to run it with Qwen3-235B-A22B-Instruct-2507-UD-Q4_K_XL. I set vm.nr_hugepages to 160000 to make sure the model and context had enough space. The first time I ran it, it took much longer than regular llama.cpp to load; I didn't time it, but it felt like 5 minutes, whereas regular llama.cpp takes a minute or less. Subsequent loads were very quick, much quicker than llama.cpp.

I haven't been able to get any output. Prompt processing takes forever even on short six-word prompts (e.g. "write a pong game in C++"). In htop, I see only two cores (on CPU0) at 100%, while all others are at 0%. The cores are the first and the 24th in htop.

The system is a dual 24-core Xeon (ES QQ89, with HT enabled). I think there's a bug in thread pinning. The 24th core would have been the first core of the second CPU if HT were disabled. All threads get pinned to those two cores regardless of whether I set -t or not in llama-server.

Tried using numactl with --physcpubind=$(seq -s, 1 2 95), which usually pins one worker to each physical core, but all threads get mapped to the same two cores (0 and 24). Waited a couple of minutes on that pong prompt to see if I get any output, but not a single token.

EDIT: Got my dual Epyc back online, and can confirm the same behaviour as on the dual Xeon. Compiled the branch and ran with --threads 96. I can see all threads get crammed onto cpuid 00 and 48 in the log output, as well as in htop. I can also confirm what @aifartist mentioned about SMT threads not being enumerated the same way as on Intel consumer parts. Running cat /sys/devices/system/cpu/cpu{0..NN}/topology/thread_siblings_list (where NN is the total number of cores/threads reported in, for example, htop) on my dual Xeon, dual Epyc, and single Epyc all report the same pairing: physical cores come first, then SMT ones.
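For reference, explicit pinning of a worker to a single logical CPU normally looks something like the sketch below (generic code, not this PR's; mapping worker index to physical core would still need the thread_siblings_list info above):

```c
// Generic sketch of pinning the calling thread to one logical CPU with
// pthread_setaffinity_np. A correct worker-index -> physical-core mapping
// would come from /sys/devices/system/cpu/cpu*/topology/thread_siblings_list.
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static int pin_current_thread_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set); // 0 on success
}
```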

@ptempier

ptempier commented Aug 3, 2025

It also occurs to me that PCIe slots are always attached to a single socket, so numa-aware allocation might impact that, that would be an interesting side effect. I'm learning more about memory every day... 😄

IIRC, the current CUDA offloading code only uses a single GPU for the offloaded calculations, so having 2 copies won't really help it.

I do think there is a bottleneck somewhere as PCIe 3.0 x16 has a max bandwidth of ~16GB/s, yet watching nvtop during large batch processing it often only gets around 1/3rd of this (but like for the pcm-memory test; I can't be sure if this is due to averaging a bimodal set of timings).

Personally, if that's an issue, I'd set GPU support aside while the patch is being developed.
Datacenters are riddled with old ESXi hosts with a lot of memory.
They are often more powerful than personal computers, but with no GPU and sometimes no space to put one.
If the patch works on this type of older machine with slower memory, that would already be nice.

@dbsanfte dbsanfte changed the title from "Implementation of GGML_NUMA_MIRROR for 64% inferencing performance gain on numa systems" to "Implementation of GGML_NUMA_MIRROR for inferencing performance gain on numa systems" on Aug 5, 2025
@dbsanfte
Author

dbsanfte commented Aug 5, 2025

I've done quite a bit of testing and code deep diving over the weekend. What I've realised is that:

  1. My performance gain in the original PR post was illusory: it was real, but only because the NUMA_MIRROR code was undoing the effects of kernel NUMA balancing, which I had forgotten to disable before testing... (facepalm). With it disabled, I see no speedup over master.

  2. The NUMA mirroring is a valid strategy and works in theory, but the threadpool in ggml-cpu.c does not split the matrix work up between sockets, just between threads, and they all wait on each other at the barrier. So this solves the issue of cross-NUMA memory access but does not provide any real speedup yet; all NUMA nodes are still bound to each other and to a single threadpool.

All of this said, I can now see what needs to be done to get this over the line. Each socket needs its own threadpool, and the matrix operations need to be divided up between the NUMA nodes/sockets; then we can leverage data parallelism. I am iterating on this locally at the moment and will update when I have something to test.
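To make that concrete, here's a rough sketch of the kind of per-node split I mean: each node works on its own slice of an elementwise op and reads from its own mirror of the inputs, with one worker per node standing in for a per-node threadpool. This is illustrative pthreads code, not what's in my branch, and the struct and helper names are made up.

```c
// Sketch: split an elementwise ADD across NUMA nodes, each node reading its
// own mirror of the inputs and writing its slice of the shared output.
#include <pthread.h>
#include <stddef.h>

typedef struct {
    const float * src0;   // this node's mirror of input 0
    const float * src1;   // this node's mirror of input 1
    float       * dst;    // shared output
    size_t        i0, i1; // [i0, i1) slice of elements owned by this node
} add_slice;

static void * add_worker(void * arg) {
    add_slice * s = (add_slice *) arg;
    for (size_t i = s->i0; i < s->i1; ++i) {
        s->dst[i] = s->src0[i] + s->src1[i];
    }
    return NULL;
}

static void numa_parallel_add(float * dst,
                              const float * const * src0_mirror,
                              const float * const * src1_mirror,
                              size_t n, int n_nodes) {
    pthread_t tid[16];      // assumes n_nodes <= 16
    add_slice slice[16];
    size_t chunk = (n + (size_t) n_nodes - 1) / (size_t) n_nodes;
    for (int node = 0; node < n_nodes; ++node) {
        size_t i0 = (size_t) node * chunk;
        size_t i1 = (i0 + chunk > n) ? n : i0 + chunk;
        slice[node] = (add_slice) {
            .src0 = src0_mirror[node],
            .src1 = src1_mirror[node],
            .dst  = dst,
            .i0   = i0,
            .i1   = i1,
        };
        // a real version would pin this worker to `node` and fan out across its cores
        pthread_create(&tid[node], NULL, add_worker, &slice[node]);
    }
    for (int node = 0; node < n_nodes; ++node) {
        pthread_join(tid[node], NULL);
    }
}
```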

@FullstackSensei

@dbsanfte take a look at COSMA. I've read the paper and it's supposed to solve all these issues, including distributing the workload across several nodes.

I have 56 Gb InfiniBand on my nodes and can test the dual Xeon with the dual Epyc, and can even add a single Epyc as a 3rd node.

@rankaiyx

rankaiyx commented Aug 5, 2025

Some information that may be useful.
https://lmsys.org/blog/2025-07-14-intel-xeon-optimization/#multi-numa-parallelism
https://arxiv.org/abs/2502.10923
https://lwn.net/Articles/221885/

@dbsanfte
Author

dbsanfte commented Aug 5, 2025

Looking at the code architecture, COSMA would really need to be its own new backend, and would just throw away ggml-cpu.

This could be good or bad, I'm not sure :D I like the idea.

As a pedagogical exercise, I'll carry on with the framework I've created up to now, and maybe attempt that as a new PR when I feel more confident.

@Ph0rk0z

Ph0rk0z commented Aug 10, 2025

Hmm... according to this PCM test I am only using 55 GB/s of my 220 GB/s bandwidth, about 10-15 GB/s more than with a single proc/NUMA node. The UPI link utilisation isn't constant, but it does sometimes get saturated. I'm also feeding 4x GPUs over PCIe 3.0, however; maybe that's capping it?

On a single proc/node I get 9 t/s and on dual it's 11-ish. The GPU is on only one node, but speeds from the opposite node are only a hair slower, so that must not be it.

You can look at fastllm; it supports CPU NUMA, or claims to. It could give you some ideas to implement here. I want to compare its speeds with llama.cpp to see what, if anything, I'm leaving on the table. A real head-to-head with Qwen-235Bs, and now I'll be keeping an eye on pcm-memory behaviour.

@dbsanfte
Author

I've got a local implementation mostly working now. I'll merge it into this PR next week.

Key changes upcoming:

  • Data-parallel implementation that actually splits up matrix operations between NUMA nodes, which, combined with tensor data mirroring, gives a linear speedup as you add more nodes

  • NUMA-aware KV cache: each data-parallel NUMA node keeps its own local cache. This was missing from the existing implementation.

  • A brand new "NUMA-aware coordinator" to run the data-parallel implementation. It doesn't use OpenMP, so there's no fighting with OpenMP's own internal threading logic; we manage our own threads and make sure everything is placed where we want it to be. This gives us full control over which cores to use, whether to enable hyperthreaded cores, etc.

  • Enable mirroring and full data parallelism with --numa mirror. The other options will still be supported, but you won't see the same speedups.

I'm quite excited about this and will get it into this PR once I finish local testing.

@Ph0rk0z

Ph0rk0z commented Aug 10, 2025

That's great! How much uplift did you get? Also, was there any benefit to more than one NUMA node per socket?

Edit: trying it now; it only masks for a single socket and requires mmap with hugepages enabled. Attempting to see if it will run with the latter.

@dbsanfte
Author

There are bugs in what's checked into my iteration branch that I've only fixed locally; let me finish, lol.

@dbsanfte
Author

dbsanfte commented Aug 22, 2025

Just as an update, I promised a release last week but I didn't want to release something broken.

  1. All kernels need to be reimplemented to do data slicing against each numa node and its tensor data mirror. I now have the framework to do that.

  2. I have implemented a data-parallel kernel for ADD as my first go on my iteration branch. It shows the following results:

TENSOR_DATA DEBUG: node=0, tensor=0x7aca9e800020, __data[0]=0x7aca9e8001b0, __data[0]=0x7aca9e8001b0
NUMA MIRROR DEBUG: checking should_mirror: initialized=1, numa_enabled=1, numa_nodes=2, strategy=4 (MIRROR=4)
NUMA MIRROR DEBUG: should_mirror result = TRUE
TENSOR_DATA DEBUG: node=0, tensor=0x7acaae80NUMA Executor: Successfully completed ADD using NUMA ADD Direct (Data-Parallel/Multi)
✅ NUMA Executor: All 1 operations completed successfully
01d0, __data[0]=0x7acaae800360, __data[0]=0x7acaae800360
NUMA MIRROR DEBUG: checking should_mirror: initialized=1, numa_enabled=1, numa_nodes=2, strategy=4 (MIRROR=4)
NUMA MIRROR DEBUG: should_mirror result = TRUE
TENSOR_DATA DEBUG: node=0, tensor=0x7acabe800380, __data[0]=0x7acabe800510, __data[0]=0x7acabe800510
🔗 NUMA ADD KERNEL: Using mirrored data pointers: src0=0x7aca9e8001b0, src1=0x7acaae800360, dst=0x7acabe800510
✅ NUMA ADD KERNEL: Completed 33554432 elements in 42.562 ms (9.46 GB/s)
DEBUG: Dispatch thread 0 completed work with status 0 (42.606ms)
✅ NUMA ADD KERNEL: Completed 33554432 elements in 43.521 ms (9.25 GB/s)
--
Improvement:      69.2%
GOOD: NUMA shows 69.2% improvement

I think this is roughly the level of performance gain we can expect.

I will start migrating the major kernels like MUL_MAT, ROPE, etc. that are used most often, then do a performance benchmark against real inference.

This is looking very promising now. I have a real working solution.

Edit: Oh, and no more hugepages; I just use regular malloc(), but I page-fault the memory in on each NUMA node individually. I even found a workaround for the broken allocation in Docker containers :)
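For the curious, the generic first-touch technique this relies on looks roughly like the sketch below (it uses mmap to guarantee fresh pages; my actual code sticks with malloc, so treat this as an illustration of the idea rather than the implementation):

```c
// Sketch of first-touch placement: map anonymous pages, bind the calling thread
// to the target node, and touch every page so the kernel backs it with memory
// local to that node. No hugepages required.
#include <numa.h>       // link with -lnuma
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

static void * alloc_on_node_first_touch(size_t size, int node) {
    void * buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) return NULL;
    numa_run_on_node(node);     // run on the target node's CPUs
                                // (a real version would restore the old affinity after)
    size_t page = (size_t) sysconf(_SC_PAGESIZE);
    for (size_t off = 0; off < size; off += page) {
        ((volatile char *) buf)[off] = 0;   // first touch faults the page onto `node`
    }
    return buf;
}
```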

@ptempier

I see not everyone is on vacation :)

@Ph0rk0z

Ph0rk0z commented Aug 22, 2025

There are also transparent hugepages. In IK there's a switch, but from monitoring their size, the kernel seems to do it automatically.

@fernandaspets

Any news on this PR? Seems exciting! Would love to try it on our 4x CPU and 2x CPU DDR4 servers!

@rankaiyx

Any news on this PR? Seems exciting! Would love to try it on our 4x CPU and 2x CPU DDR4 servers!

It's clear that dbsanfte has invested a great deal of time and effort into this work—so much so that code reviews alone would be a daunting task.
https://github.com/dbsanfte/llama.cpp/commits/numa-improvements-take2-iteration/

@dbsanfte
Author

Halfway through I realized I can get most of the benefits of --numa mirror without all the data slicing.

closing this in favour of:

#16000

@dbsanfte dbsanfte closed this Sep 15, 2025