Commit 3e13f6b

iterate - CPY kernel with extensive testing
1 parent 2300a3c commit 3e13f6b

11 files changed (+1535 / -330 lines)

.github/copilot-instructions.md

Lines changed: 2 additions & 4 deletions
@@ -198,9 +198,6 @@ NUMA_GET_SOURCE_POINTER(src_data, tensor->src[0], float);
 
 // 5. Synchronization - Essential for correctness
 NUMA_BARRIER_AUTO();
-
-// 6. Early exit handling - Performance optimization
-NUMA_EARLY_EXIT_IF_NO_WORK(ctx);
 ```
 
 **🏗️ Composed Templates (Recommended for Common Patterns):**
@@ -536,10 +533,11 @@ cp tests/test-numa-mathematical-correctness-template.cpp tests/test-numa-mathema
 
 **Required tests:**
 - Multi-dimensional: TINY → GIGANTIC_16GB tensor sizes (now includes GB-scale support)
-- Multi-threading: 1, 2, 4, 6, 8, 15, 16, 31, 32, 64, 128 threads
+- Multi-strategy (use Executor methods to force the strategy): Single-thread/Single-node, Multi-thread/Single-Node, and Multi-thread/Multi-Node (data parallel)
 - Hardware-specific Data Parallel: Data parallel tests with all numas available on the machine using max thread counts per numa node
 - Mathematical equivalence: Exact comparison with reference
 - Add to CMake and verify with `cmake --build build --target test-numa-mathematical-correctness-YOUR_OPERATION`
+- Optionally, use `--filter <regex>` to filter on and run specific tests, and `--summary-only` to just get a final test run summary.
 
 ## 🏗️ Current Architecture Status
 
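The "Mathematical equivalence" and multi-strategy requirements above come down to one check: however rows are split across workers, the result must be bit-identical to a single-threaded reference. Below is a minimal C sketch of that check for a plain F32 copy. It assumes a ggml-style `ith`/`nth` ceil-division row split. `rows_for_thread`, `copy_rows_partitioned`, and `check_partitioned_copy` are hypothetical names used only for illustration and are not part of this commit or the test template. The workers are simulated sequentially rather than run on real threads or NUMA nodes.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: rows [*r0, *r1) owned by thread ith of nth, using a
 * ceil-division split. Threads past the last row get an empty range. */
static void rows_for_thread(int64_t nrows, int ith, int nth, int64_t *r0, int64_t *r1) {
    assert(nth > 0 && ith >= 0 && ith < nth);
    const int64_t per_thread = (nrows + nth - 1) / nth;
    const int64_t start = per_thread * ith;
    *r0 = start < nrows ? start : nrows;
    *r1 = start + per_thread < nrows ? start + per_thread : nrows;
}

/* Hypothetical helper: copy a contiguous nrows x ncols F32 matrix by running
 * each thread's row slice in turn (a sequential stand-in for nth workers). */
static void copy_rows_partitioned(float *dst, const float *src,
                                  int64_t nrows, int64_t ncols, int nth) {
    for (int ith = 0; ith < nth; ++ith) {
        int64_t r0, r1;
        rows_for_thread(nrows, ith, nth, &r0, &r1);
        if (r1 > r0) {
            memcpy(dst + r0 * ncols, src + r0 * ncols,
                   (size_t)((r1 - r0) * ncols) * sizeof(float));
        }
    }
}

/* Returns 1 when the partitioned copy is bit-identical to a plain reference
 * copy. The output is poisoned first so rows that no thread wrote are caught. */
int check_partitioned_copy(int64_t nrows, int64_t ncols, int nth) {
    const size_t n = (size_t)(nrows * ncols);
    float *src = malloc(n * sizeof(float));
    float *ref = malloc(n * sizeof(float));
    float *out = malloc(n * sizeof(float));
    int ok = 0;
    if (src && ref && out) {
        for (size_t i = 0; i < n; ++i) src[i] = (float)i * 0.5f - 100.0f;
        memcpy(ref, src, n * sizeof(float));           /* single-thread reference */
        memset(out, 0xFF, n * sizeof(float));          /* poison: all-ones bit pattern */
        copy_rows_partitioned(out, src, nrows, ncols, nth);
        ok = memcmp(out, ref, n * sizeof(float)) == 0; /* exact, bitwise comparison */
    }
    free(src);
    free(ref);
    free(out);
    return ok;
}
```

For example, `check_partitioned_copy(37, 19, 64)` covers the case of more threads than rows, where most slices are empty. Comparing with `memcmp` rather than a float tolerance is what makes this an exact check. It also distinguishes `-0.0f` from `0.0f` and flags any leftover poison.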
ggml/src/ggml-cpu/CMakeLists.txt

Lines changed: 2 additions & 0 deletions
@@ -43,6 +43,8 @@ function(ggml_add_cpu_backend_variant_impl tag_name)
 ggml-cpu/numa-kernels/numa-kernels.h
 ggml-cpu/numa-kernels/add.c
 ggml-cpu/numa-kernels/add.h
+ggml-cpu/numa-kernels/cpy.c
+ggml-cpu/numa-kernels/cpy.h
 ggml-cpu/numa-kernels/mul.c
 ggml-cpu/numa-kernels/mul.h
 ggml-cpu/numa-kernels/div.c

ggml/src/ggml-cpu/numa-kernels/cpy.c

Lines changed: 421 additions & 0 deletions
Large diffs are not rendered by default.

ggml/src/ggml-cpu/numa-kernels/cpy.h

Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
+/**
+ * @file cpy.h
+ * @brief NUMA-aware CPY/DUP kernel header with type conversion support
+ * @author David Sanftenberg
+ */
+
+#pragma once
+
+#include "ggml.h"
+#include "ggml-numa-shared.h"
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * @brief Execute CPY operation using NUMA kernels with type conversion support
+ *
+ * Handles tensor copying with optional type conversion between:
+ * - Same types (optimized memcpy path)
+ * - F32 ↔ F16 conversion
+ * - F32 ↔ BF16 conversion
+ * - Quantized → F32 dequantization
+ *
+ * @param work_context Tensor context (struct ggml_tensor*)
+ * @param params Compute parameters with threading info
+ * @return GGML_STATUS_SUCCESS on completion, GGML_STATUS_FAILED on error
+ */
+enum ggml_status ggml_numa_kernel_cpy_execute(void * work_context, struct ggml_compute_params * params);
+
+/**
+ * @brief Query execution strategy for CPY operations
+ * @param tensor Target tensor
+ * @return Recommended NUMA execution strategy
+ */
+ggml_numa_execution_strategy_t ggml_numa_kernel_cpy_query(const struct ggml_tensor * tensor);
+
+/**
+ * @brief Register CPY kernel with metadata
+ * @return Kernel registration information
+ */
+ggml_numa_kernel_registration_info_t ggml_numa_kernel_cpy_register(void);
+
+/**
+ * @brief Calculate work buffer size for CPY operations (unused - CPY doesn't need work buffers)
+ * @param tensor Target tensor
+ * @param total_numa_nodes Number of NUMA nodes
+ * @param total_threads Total number of threads
+ * @return Work buffer size (always 0 for CPY)
+ */
+size_t ggml_numa_kernel_cpy_work_buffer_calc(const struct ggml_tensor * tensor, int total_numa_nodes, int total_threads);
+
+#ifdef __cplusplus
+}
+#endif
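`ggml_numa_kernel_cpy_execute` is documented as handling F32 ↔ BF16 among its conversions. The implementation in `cpy.c` is not rendered here, so what follows is only a sketch of the per-element conversion such a path typically needs, assuming contiguous rows. BF16 keeps the upper 16 bits of an IEEE-754 `float`. Exact ties round to nearest-even, and NaNs are forced to stay quiet, which mirrors the rounding used by ggml's scalar FP32→BF16 helper. All `sketch_*` names are hypothetical and are not part of this commit.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: F32 -> BF16 with round-to-nearest-even. */
static uint16_t sketch_fp32_to_bf16(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof(u));             /* reinterpret bits without aliasing UB */
    if ((u & 0x7fffffffu) > 0x7f800000u) { /* NaN: keep sign/payload, force the quiet bit */
        return (uint16_t)((u >> 16) | 0x40u);
    }
    /* Add 0x7fff plus the LSB of the kept half, so exact ties round to even. */
    return (uint16_t)((u + 0x7fffu + ((u >> 16) & 1u)) >> 16);
}

/* Hypothetical helper: BF16 -> F32 is exact (zero-fill the low 16 bits). */
static float sketch_bf16_to_fp32(uint16_t h) {
    const uint32_t u = (uint32_t)h << 16;
    float f;
    memcpy(&f, &u, sizeof(f));
    return f;
}

/* Hypothetical row-level copies for contiguous rows of n elements. */
void sketch_cpy_row_f32_to_bf16(uint16_t *dst, const float *src, int64_t n) {
    assert(dst != NULL && src != NULL && n >= 0);
    for (int64_t i = 0; i < n; ++i) dst[i] = sketch_fp32_to_bf16(src[i]);
}

void sketch_cpy_row_bf16_to_f32(float *dst, const uint16_t *src, int64_t n) {
    assert(dst != NULL && src != NULL && n >= 0);
    for (int64_t i = 0; i < n; ++i) dst[i] = sketch_bf16_to_fp32(src[i]);
}
```

Only the F32 → BF16 direction needs tie-breaking logic. Widening BF16 back to F32 is lossless, so values that are already representable in BF16 (such as `1.0f`, `-2.0f`, or `0.5f`) survive a round trip unchanged.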

ggml/src/ggml-cpu/numa-kernels/cpy.old.c

Lines changed: 0 additions & 254 deletions
This file was deleted.
