Commit c665d3c

1) fix CPU detection of physical cores 2) fix tensor_data() access in CUDA
1 parent 435f095 commit c665d3c

File tree

15 files changed, +346 −769 lines changed

.github/instructions/numa-mirroring-implementation.md

Lines changed: 28 additions & 12 deletions
@@ -48,7 +48,6 @@ Clean integration point during model loading where NUMA mirrors are established
 #### `ggml/include/ggml.h`
 **Purpose**: Core tensor data access with NUMA-aware routing
 **Key additions**:
-- `#ifdef GGML_NUMA_MIRROR` conditional compilation blocks
 - NUMA mirror data structures in `ggml_tensor`
 - `tensor_set_data_with_numa_mirrors()` function declaration
 - Optimized `tensor_data()` function with fast path for non-NUMA tensors
@@ -94,9 +93,14 @@ Clean integration point during model loading where NUMA mirrors are established
 ## Build Configuration
 
 ### CMake Configuration
-Enable NUMA mirroring during build:
+Enable OpenMP during build:
 ```bash
-cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_NUMA_MIRROR=ON -DCMAKE_C_FLAGS="-march=native" -DCMAKE_CXX_FLAGS="-march=native"
+# Debug config (for debugging, obviously)
+cmake -B build -DCMAKE_BUILD_TYPE=Debug -DGGML_OPENMP=ON
+
+# Release config (for performance testing)
+cmake -B build -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_FLAGS="-march=native" -DCMAKE_CXX_FLAGS="-march=native" -DGGML_OPENMP=ON
+
 cmake --build build --parallel
 ```
 
@@ -106,7 +110,6 @@ cmake --build build --parallel
 - **C++17 compiler**: Modern C++ standard support
 
 ### Compilation Flags
-- `GGML_NUMA_MIRROR=ON`: Enables NUMA mirroring functionality
 - `-march=native`: CPU-specific optimizations (recommended for maximum performance)
 - `CMAKE_BUILD_TYPE=Release`: Optimized release build
 
@@ -127,17 +130,17 @@ cmake --build build --parallel
 ## Implementation Details
 
 ### Tensor Data Access Optimization
+The `ggml_tensor` struct in `ggml.h` no longer has a `data` field. It has been renamed to a `__data[]` array that holds pointers to multiple memory locations, with the index corresponding to the index of a local NUMA node.
+
+Instead of directly addressing `tensor->data`, you now use `tensor_data(tensor)`; setting is done with `tensor_set_data()`. These are two new macros in `ggml.h`.
+
 The `tensor_data()` function in `ggml.h` has been optimized with a fast path:
 ```c
 static inline void * tensor_data(const struct ggml_tensor * tensor) {
-#ifdef GGML_NUMA_MIRROR
     if (tensor->numa_mirror_data == NULL) {
        return tensor->data; // Fast path: no NUMA mirrors
     }
     return ggml_numa_get_tensor_data(tensor); // NUMA-aware routing
-#else
-    return tensor->data;
-#endif
 }
 ```
 
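(Editor's aside, not part of the commit: a minimal sketch of how a call site uses the accessors described above. `tensor_data(tensor)` appears verbatim in the diff; the exact `tensor_set_data()` signature is an assumption based on the surrounding text.)

```c
#include "ggml.h"

// Illustrative only: read through the NUMA-aware accessor instead of tensor->data.
static float read_first_f32(const struct ggml_tensor * t) {
    const float * src = (const float *) tensor_data(t); // routes to the node-local mirror when present
    return src[0];
}

// Assumed setter shape: rebind the tensor to a caller-provided buffer.
static void rebind_tensor(struct ggml_tensor * t, void * buf) {
    tensor_set_data(t, buf);
}
```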
@@ -163,6 +166,19 @@ Use `llama-bench` to measure NUMA benefits:
 ./llama-bench -m model.gguf --numa mirror
 ```
 
+There are models you can use for testing in our .devcontainer folder:
+
+.devcontainer/DeepSeek-R1-0528-UD-IQ3_XXS.gguf
+.devcontainer/gpt-oss-20b-UD-Q4_K_XL.gguf
+.devcontainer/qwen2.5-0.5b-instruct-q8_0.gguf
+.devcontainer/Qwen3-30B-A3B-UD-Q4_K_XL.gguf
+.devcontainer/Qwen3-32B-Q6_K.gguf
+
+Use qwen2.5-0.5b-instruct-q8_0.gguf for a quick verification run, while a bigger, dense model like Qwen3-32B-Q6_K.gguf will be good to test relative speed gains.
+
+If testing with `llama-cli`, always be sure to use the `--no-cnv` switch to prevent it from starting an interactive conversation.
+
+
 ### System Requirements Check
 Verify NUMA topology:
 ```bash
@@ -175,6 +191,7 @@ numactl --hardware
 Future versions may include:
 - Selective tensor mirroring policies
 - Custom NUMA node mapping
+- Limiting GGML threadpools to non-hyperthreaded cores
 
 ## Technical Notes
 
@@ -202,10 +219,9 @@ Future versions may include:
 
 ### Verification
 Confirm NUMA mirroring is working:
-1. Build with `GGML_NUMA_MIRROR=ON`
-2. Run `numactl --hardware` to verify multiple NUMA nodes
-3. Test with `GGML_NUMA_DEBUG=1` for debug output
-4. Compare performance with and without `--numa mirror`
+1. Run `numactl --hardware` to verify multiple NUMA nodes
+2. Test with `--verbose` for debug output
+3. Compare performance with and without `--numa mirror`
 
 ## Conclusion
 
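(Editor's aside, not part of the commit: the testing workflow described in the documentation above, condensed into commands. The binary paths assume the default CMake output directory; prompt and token count are illustrative.)

```bash
# Quick functional check with the small model
./build/bin/llama-cli -m .devcontainer/qwen2.5-0.5b-instruct-q8_0.gguf \
    -p "Hello" -n 32 --no-cnv --numa mirror

# Relative speed comparison on a larger dense model, with and without mirroring
./build/bin/llama-bench -m .devcontainer/Qwen3-32B-Q6_K.gguf --numa mirror
./build/bin/llama-bench -m .devcontainer/Qwen3-32B-Q6_K.gguf
```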
.gitignore

Lines changed: 6 additions & 0 deletions
@@ -148,3 +148,9 @@ poetry.toml
 /run-vim.sh
 /run-chat.sh
 .ccache/
+.devcontainer/devcontainer.json
+.devcontainer/Dockerfile
+.devcontainer/launch.json
+.devcontainer/README.md
+.devcontainer/tasks.json
+.devcontainer/zscaler.crt

common/arg.cpp

Lines changed: 10 additions & 0 deletions
@@ -1517,6 +1517,16 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
             params.cpuparams.strict_cpu = std::stoul(value);
         }
     ));
+    add_opt(common_arg(
+        {"--cpu-use-hyperthreading"},
+        "use both physical CPU cores and their hyperthread siblings (default: physical cores only)",
+        [](common_params & params) {
+            params.cpuparams.mask_valid = true;
+            if (!cpu_mask_set_physical_cores_with_hyperthreading(params.cpuparams.cpumask)) {
+                LOG_WRN("Failed to detect CPU topology, using all available CPUs\n");
+            }
+        }
+    ));
     add_opt(common_arg(
         {"--prio"}, "N",
         string_format("set process/thread priority : low(-1), normal(0), medium(1), high(2), realtime(3) (default: %d)\n", params.cpuparams.priority),
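(Editor's aside, not part of the commit: a hypothetical invocation of the new option; the model path is a placeholder. Per the help text above, the default CPU mask covers physical cores only and this flag adds the hyperthread siblings.)

```bash
# Default: thread count and affinity derived from physical cores only
./build/bin/llama-cli -m model.gguf -p "test" -n 16 --no-cnv

# Opt in to hyperthread siblings as well
./build/bin/llama-cli -m model.gguf -p "test" -n 16 --no-cnv --cpu-use-hyperthreading
```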

common/common.cpp

Lines changed: 220 additions & 1 deletion
@@ -22,6 +22,7 @@
 #include <iostream>
 #include <iterator>
 #include <regex>
+#include <set>
 #include <sstream>
 #include <string>
 #include <thread>
@@ -116,10 +117,92 @@ int32_t cpu_get_num_physical_cores() {
 
     return num_physical_cores > 0 ? num_physical_cores : default_threads;
 #endif
+    // Try to use accurate topology detection first
+    int32_t topology_cores = cpu_detect_physical_cores_topology();
+    if (topology_cores > 0) {
+        return topology_cores;
+    }
+
+    // Fallback to heuristic if topology detection failed
     unsigned int n_threads = std::thread::hardware_concurrency();
     return n_threads > 0 ? (n_threads <= 4 ? n_threads : n_threads / 2) : 4;
 }
 
+int32_t cpu_detect_physical_cores_topology() {
+    std::vector<int> physical_cores;
+    if (cpu_get_physical_cores_topology(physical_cores)) {
+        return static_cast<int32_t>(physical_cores.size());
+    }
+    return 0; // Indicate detection failed
+}
+
+bool cpu_get_physical_cores_topology(std::vector<int> & physical_cores) {
+    physical_cores.clear();
+
+#if defined(__linux__) && !defined(__ANDROID__)
+    // Use Linux sysfs topology detection for accurate physical core detection
+    int num_cpus = std::thread::hardware_concurrency();
+    if (num_cpus <= 0) {
+        return false;
+    }
+
+    std::set<int> processed_cpus;
+
+    for (int cpu = 0; cpu < num_cpus; cpu++) {
+        // Skip if we've already processed this CPU as part of another core's siblings
+        if (processed_cpus.count(cpu) > 0) {
+            continue;
+        }
+
+        std::string thread_siblings_path = "/sys/devices/system/cpu/cpu" + std::to_string(cpu) + "/topology/thread_siblings_list";
+        std::ifstream siblings_file(thread_siblings_path);
+
+        if (!siblings_file.is_open()) {
+            // If we can't read topology for this CPU, skip it but don't mark as physical
+            continue;
+        }
+
+        std::string siblings_str;
+        if (std::getline(siblings_file, siblings_str)) {
+            // Parse the comma-separated list of sibling threads
+            std::vector<int> siblings;
+            std::stringstream ss(siblings_str);
+            std::string cpu_str;
+
+            while (std::getline(ss, cpu_str, ',')) {
+                try {
+                    int sibling_cpu = std::stoi(cpu_str);
+                    siblings.push_back(sibling_cpu);
+                } catch (const std::exception &) {
+                    // Skip invalid entries
+                }
+            }
+
+            if (!siblings.empty()) {
+                // Sort siblings to ensure we always pick the lowest-numbered one as primary
+                std::sort(siblings.begin(), siblings.end());
+                int primary_cpu = siblings[0];
+
+                // Only count this as a physical core if it's the current CPU (the lowest-numbered sibling)
+                if (primary_cpu == cpu) {
+                    physical_cores.push_back(primary_cpu);
+                }
+
+                // Mark all siblings as processed so we don't consider them again
+                for (int sibling : siblings) {
+                    processed_cpus.insert(sibling);
+                }
+            }
+        }
+    }
+
+    return !physical_cores.empty();
+#else
+    // Not supported on this platform
+    return false;
+#endif
+}
+
 #if defined(__x86_64__) && defined(__linux__) && !defined(__ANDROID__)
 #include <pthread.h>
 
@@ -269,12 +352,148 @@ void postprocess_cpu_params(cpu_params& cpuparams, const cpu_params* role_model)
         }
     }
 
-    if (n_set && n_set < cpuparams.n_threads) {
+    // If a CPU mask is set, use the number of set CPUs as the thread count
+    if (cpuparams.mask_valid && n_set > 0) {
+        cpuparams.n_threads = n_set;
+    } else if (n_set && n_set < cpuparams.n_threads) {
         // Not enough set bits, may experience performance issues.
         LOG_WRN("Not enough set bits in CPU mask (%d) to satisfy requested thread count: %d\n", n_set, cpuparams.n_threads);
     }
 }
 
+bool cpu_mask_set_physical_cores_only(bool (&boolmask)[GGML_MAX_N_THREADS]) {
+#ifdef _WIN32
+    // Windows implementation would require different approach
+    LOG_WRN("Physical core detection is not supported on Windows\n");
+    return false;
+#else
+    std::memset(boolmask, false, sizeof(bool) * GGML_MAX_N_THREADS);
+
+    // Use the common topology detection logic
+    std::vector<int> physical_cores;
+    if (!cpu_get_physical_cores_topology(physical_cores)) {
+        // Fallback: if we couldn't detect topology, just use all CPUs
+        int num_cpus = std::thread::hardware_concurrency();
+        for (int cpu = 0; cpu < num_cpus && cpu < GGML_MAX_N_THREADS; cpu++) {
+            boolmask[cpu] = true;
+        }
+        LOG_WRN("Could not detect CPU topology, using all CPUs\n");
+        return false;
+    }
+
+    // Set the mask for detected physical cores
+    for (int core_id : physical_cores) {
+        if (core_id < GGML_MAX_N_THREADS) {
+            boolmask[core_id] = true;
+        }
+    }
+
+    LOG("Detected %zu physical cores (excluding hyperthreads): ", physical_cores.size());
+    for (size_t i = 0; i < physical_cores.size(); i++) {
+        if (i > 0) LOG(", ");
+        LOG("%d", physical_cores[i]);
+    }
+    LOG("\n");
+
+    return true;
+#endif
+}
+
+bool cpu_mask_set_physical_cores_with_hyperthreading(bool (&boolmask)[GGML_MAX_N_THREADS]) {
+#ifdef _WIN32
+    // Windows implementation would require different approach
+    LOG_WRN("--cpu-use-hyperthreading is not supported on Windows\n");
+    return false;
+#else
+    std::memset(boolmask, false, sizeof(bool) * GGML_MAX_N_THREADS);
+
+    int num_cpus = std::thread::hardware_concurrency();
+    if (num_cpus <= 0) {
+        return false;
+    }
+
+    // Use the common topology detection logic to get all CPU sibling relationships
+    std::set<int> processed_cpus;
+    std::vector<int> all_cores_and_siblings;
+
+    for (int cpu = 0; cpu < num_cpus; cpu++) {
+        // Skip if we've already processed this CPU as part of another core's siblings
+        if (processed_cpus.count(cpu) > 0) {
+            continue;
+        }
+
+        std::string thread_siblings_path = "/sys/devices/system/cpu/cpu" + std::to_string(cpu) + "/topology/thread_siblings_list";
+        std::ifstream siblings_file(thread_siblings_path);
+
+        if (!siblings_file.is_open()) {
+            // If we can't read topology for this CPU, include it anyway
+            all_cores_and_siblings.push_back(cpu);
+            processed_cpus.insert(cpu);
+            continue;
+        }
+
+        std::string siblings_str;
+        if (std::getline(siblings_file, siblings_str)) {
+            // Parse the comma-separated list of sibling threads
+            std::vector<int> siblings;
+            std::stringstream ss(siblings_str);
+            std::string cpu_str;
+
+            while (std::getline(ss, cpu_str, ',')) {
+                try {
+                    int sibling_cpu = std::stoi(cpu_str);
+                    siblings.push_back(sibling_cpu);
+                } catch (const std::exception &) {
+                    // Skip invalid entries
+                }
+            }
+
+            if (!siblings.empty()) {
+                // Include ALL siblings (both physical core and hyperthreads)
+                for (int sibling : siblings) {
+                    all_cores_and_siblings.push_back(sibling);
+                    processed_cpus.insert(sibling);
+                }
+            } else {
+                // Fallback: include this CPU if no siblings found
+                all_cores_and_siblings.push_back(cpu);
+                processed_cpus.insert(cpu);
+            }
+        } else {
+            // Fallback: include this CPU if we can't read the file
+            all_cores_and_siblings.push_back(cpu);
+            processed_cpus.insert(cpu);
+        }
+    }
+
+    if (all_cores_and_siblings.empty()) {
+        // Fallback: if we couldn't detect topology, just use all CPUs
+        for (int cpu = 0; cpu < num_cpus && cpu < GGML_MAX_N_THREADS; cpu++) {
+            boolmask[cpu] = true;
+        }
+        LOG_WRN("Could not detect CPU topology, using all CPUs\n");
+        return false;
+    }
+
+    // Set the mask for all detected cores and their hyperthread siblings
+    for (int cpu_id : all_cores_and_siblings) {
+        if (cpu_id < GGML_MAX_N_THREADS) {
+            boolmask[cpu_id] = true;
+        }
+    }
+
+    LOG("Using %zu CPU cores including hyperthreads: ", all_cores_and_siblings.size());
+    std::sort(all_cores_and_siblings.begin(), all_cores_and_siblings.end());
+    for (size_t i = 0; i < all_cores_and_siblings.size(); i++) {
+        if (i > 0) LOG(", ");
+        LOG("%d", all_cores_and_siblings[i]);
+    }
+    LOG("\n");
+
+    return true;
+#endif
+}
+
 bool parse_cpu_range(const std::string & range, bool (&boolmask)[GGML_MAX_N_THREADS]) {
     size_t dash_loc = range.find('-');
     if (dash_loc == std::string::npos) {
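(Editor's aside, not part of the commit: a short sketch of the behavioral change in `postprocess_cpu_params()`. When a valid CPU mask is present, the thread count now follows the number of set bits instead of merely warning. The 8-core figure below is an assumed example value.)

```cpp
#include "common.h"

// Illustrative only: assumes the helper finds 8 physical cores on this machine.
void example_thread_count() {
    cpu_params p;
    p.n_threads  = 32;                                            // user- or default-requested value
    p.mask_valid = cpu_mask_set_physical_cores_only(p.cpumask);   // e.g. 8 bits end up set
    postprocess_cpu_params(p);
    // Old behavior: warn that 8 < 32. New behavior: p.n_threads == 8 (the set-bit count).
}
```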

common/common.h

Lines changed: 5 additions & 0 deletions
@@ -65,6 +65,10 @@ struct cpu_params {
 
 int32_t cpu_get_num_physical_cores();
 int32_t cpu_get_num_math();
+int32_t cpu_detect_physical_cores_topology(); // Detect actual physical cores using CPU topology
+bool cpu_get_physical_cores_topology(std::vector<int> & physical_cores); // Get list of physical core IDs
+bool cpu_mask_set_physical_cores_only(bool(&boolmask)[GGML_MAX_N_THREADS]);
+bool cpu_mask_set_physical_cores_with_hyperthreading(bool(&boolmask)[GGML_MAX_N_THREADS]); // Set mask to include physical cores + hyperthread siblings
 
 //
 // Common params
@@ -513,6 +517,7 @@ std::string common_params_get_system_info(const common_params & params);
 
 bool parse_cpu_range(const std::string & range, bool(&boolmask)[GGML_MAX_N_THREADS]);
 bool parse_cpu_mask(const std::string & mask, bool(&boolmask)[GGML_MAX_N_THREADS]);
+bool cpu_mask_set_physical_cores_only(bool(&boolmask)[GGML_MAX_N_THREADS]);
 void postprocess_cpu_params(cpu_params & cpuparams, const cpu_params * role_model = nullptr);
 bool set_process_priority(enum ggml_sched_priority prio);
 
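(Editor's aside, not part of the commit: a minimal caller-side sketch of the topology helpers declared above, falling back to the existing heuristic when sysfs topology is unavailable.)

```cpp
#include <cstdio>
#include <vector>
#include "common.h"

int main() {
    std::vector<int> cores;
    if (cpu_get_physical_cores_topology(cores)) {
        // Topology-based count: one entry per physical core, hyperthreads excluded
        printf("physical cores (topology): %zu\n", cores.size());
    } else {
        // Non-Linux or unreadable sysfs: heuristic path inside cpu_get_num_physical_cores()
        printf("physical cores (heuristic): %d\n", cpu_get_num_physical_cores());
    }
    return 0;
}
```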