Skip to content

Commit 1a9cb88

Browse files
committed
iterate - improve the numa kernel shared macro system, tweaks to kernels
1 parent f0a2fd0 commit 1a9cb88

File tree

11 files changed

+1035
-1620
lines changed

11 files changed

+1035
-1620
lines changed
Lines changed: 92 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,92 @@
1+
# Composable Macro System Update - September 9, 2025
2+
3+
## 🏆 Major Achievement: Composable Macro Architecture
4+
5+
### Overview
6+
Successfully updated `copilot-instructions.md` to reflect the revolutionary **composable macro system** with atomic building blocks that provides Lego-like flexibility for NUMA kernel development.
7+
8+
### Key Updates Made
9+
10+
#### 1. **Architecture Documentation Updates**
11+
- **Replaced "Shared Macro System"** with **"Composable Macro System"** throughout
12+
- **Added atomic building blocks section** with comprehensive documentation
13+
- **Documented hybrid approach** for complex operations like ROPE
14+
- **Updated all code examples** to use new composable patterns
15+
16+
#### 2. **New Composable Macro Categories**
17+
18+
**🧱 Atomic Building Blocks (Direct Use):**
19+
- `NUMA_INIT_CONTEXT` - Context initialization
20+
- `NUMA_VALIDATE_INPUTS` - Input validation
21+
- `NUMA_SLICE_ROWS_ATOMIC` - Thread slicing
22+
- `NUMA_GET_TYPED_POINTER` - Type-safe data access
23+
- `NUMA_BARRIER_AUTO` - Synchronization
24+
- `NUMA_EARLY_EXIT_IF_NO_WORK` - Performance optimization
25+
26+
**🏗️ Composed Templates (Common Patterns):**
27+
- `NUMA_ROWWISE_KERNEL_SETUP` - Complete one-line setup for row-wise operations
28+
- `NUMA_ELEMENTWISE_KERNEL_SETUP` - Element-wise operations
29+
- `NUMA_CUSTOM_KERNEL_SETUP` - Custom requirements
30+
31+
#### 3. **Implementation Approach Documentation**
32+
33+
**Full Composable Approach (80% of cases):**
34+
- Simple operations: ADD, MUL, RMS_NORM
35+
- One-line setup with `NUMA_ROWWISE_KERNEL_SETUP`
36+
- Proven pattern with 100% test success rates
37+
38+
**Hybrid Approach (Complex operations):**
39+
- Operations requiring specialized logic: ROPE, matrix operations
40+
- Basic composable macros for setup/validation
41+
- Custom mathematical logic preserved for correctness
42+
- Example: ROPE kernel with 32/32 tests passed (100% success rate)
43+
44+
#### 4. **Success Story Integration**
45+
- **Added ROPE migration case study** demonstrating hybrid approach success
46+
- **Documented 100% test success rates** for all implemented kernels
47+
- **Proven validation** of composable architecture effectiveness
48+
49+
#### 5. **Updated Implementation Patterns**
50+
- **Modernized all code examples** to use composable macros
51+
- **Updated implementation checklist** with new approaches
52+
- **Enhanced AI agent guidelines** for composable macro usage
53+
- **Updated kernel status** to reflect new architecture
54+
55+
#### 6. **Performance and Maintenance Benefits**
56+
- **Lego-like Composability**: Mix atomic building blocks for any complexity
57+
- **Zero Maintenance**: Changes propagate automatically to all kernels
58+
- **Mathematical Correctness**: Proven with ROPE's complex sequence processing
59+
- **Performance**: Compile-time expansion with zero runtime overhead
60+
- **Consistent Behavior**: All composable components use identical logic
61+
62+
### Impact
63+
64+
#### **For AI Agents/Developers:**
65+
- **Clear guidance** on choosing between full composable vs hybrid approaches
66+
- **Proven patterns** for both simple and complex kernel development
67+
- **Reduced development time** through template-based approach
68+
- **Consistent behavior** across all NUMA kernels
69+
70+
#### **For System Architecture:**
71+
- **Scalable foundation** for adding new kernels with minimal effort
72+
- **Maintainable codebase** with centralized logic in atomic building blocks
73+
- **Proven validation** through successful complex kernel migrations
74+
- **Future-ready architecture** supporting various operation complexities
75+
76+
### Files Updated
77+
- `/workspaces/llama-cpp-dbsanfte-dev/.github/copilot-instructions.md` - Comprehensive update with new composable macro architecture
78+
79+
### Validation
80+
- All existing kernels continue to work with new architecture
81+
- ROPE kernel successfully migrated using hybrid approach (32/32 tests passed)
82+
- RMS_NORM kernel successfully using full composable approach (21/21 tests passed)
83+
- Complete test suite passes with 100% success rate
84+
85+
### Next Steps
86+
The composable macro system is now ready for:
87+
1. **Expanding to remaining operations** (CPY, SOFT_MAX, GLU priority candidates)
88+
2. **Training new AI agents** on the composable architecture patterns
89+
3. **Scaling NUMA kernel development** with proven building blocks
90+
4. **Maintaining mathematical correctness** across all complexity levels
91+
92+
This update establishes the composable macro system as the **standard architecture** for NUMA kernel development, providing both simplicity for common cases and flexibility for complex mathematical operations.
Lines changed: 71 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,71 @@
1+
# ROPE Kernel Performance Optimization - September 9, 2025
2+
3+
## Summary
4+
Optimized ROPE (Rotary Position Embedding) kernel parameter extraction by replacing inefficient `memcpy()` operations with direct GGML helper function calls.
5+
6+
## Problem
7+
The ROPE kernel was using multiple `memcpy()` calls in the hot path to extract float parameters from the tensor's `op_params` array:
8+
9+
```c
10+
// OLD: Inefficient memcpy operations
11+
memcpy(&rope_params.freq_base, (int32_t *) dst->op_params + 5, sizeof(float));
12+
memcpy(&rope_params.freq_scale, (int32_t *) dst->op_params + 6, sizeof(float));
13+
memcpy(&rope_params.ext_factor, (int32_t *) dst->op_params + 7, sizeof(float));
14+
memcpy(&rope_params.attn_factor, (int32_t *) dst->op_params + 8, sizeof(float));
15+
memcpy(&rope_params.beta_fast, (int32_t *) dst->op_params + 9, sizeof(float));
16+
memcpy(&rope_params.beta_slow, (int32_t *) dst->op_params + 10, sizeof(float));
17+
memcpy(&rope_params.sections, (int32_t *) dst->op_params + 11, sizeof(int)*4);
18+
```
19+
20+
This approach was:
21+
- Performance-inefficient for single value extraction
22+
- Called in the hot path of every ROPE operation
23+
- Using memcpy for 4-byte values (overkill)
24+
- Not leveraging existing GGML infrastructure
25+
26+
## Solution
27+
Replaced all memcpy operations with existing GGML helper functions that handle type conversion properly:
28+
29+
```c
30+
// NEW: Direct helper function calls
31+
rope_params.freq_base = ggml_get_op_params_f32(dst, 5);
32+
rope_params.freq_scale = ggml_get_op_params_f32(dst, 6);
33+
rope_params.ext_factor = ggml_get_op_params_f32(dst, 7);
34+
rope_params.attn_factor = ggml_get_op_params_f32(dst, 8);
35+
rope_params.beta_fast = ggml_get_op_params_f32(dst, 9);
36+
rope_params.beta_slow = ggml_get_op_params_f32(dst, 10);
37+
38+
// Sections array (4 integers) extracted efficiently
39+
rope_params.sections[0] = ggml_get_op_params_i32(dst, 11);
40+
rope_params.sections[1] = ggml_get_op_params_i32(dst, 12);
41+
rope_params.sections[2] = ggml_get_op_params_i32(dst, 13);
42+
rope_params.sections[3] = ggml_get_op_params_i32(dst, 14);
43+
```
44+
45+
## Benefits
46+
1. **Performance Improvement**: Direct pointer access vs. memory copying overhead
47+
2. **Code Clarity**: More readable and intention-revealing
48+
3. **Type Safety**: Uses GGML's established parameter extraction pattern
49+
4. **Consistency**: Matches usage patterns found elsewhere in the codebase
50+
5. **Maintainability**: Leverages existing, tested GGML infrastructure
51+
52+
## Files Modified
53+
- `ggml/src/ggml-cpu/numa-kernels/rope.c`: Updated both F32 and F16 implementations
54+
55+
## Validation
56+
- ✅ All 32 ROPE mathematical correctness tests pass (100% success rate)
57+
- ✅ Build succeeds with only minor warnings
58+
- ✅ Integration test passes with real model inference
59+
- ✅ No functional changes - purely performance optimization
60+
61+
## Performance Impact
62+
- Reduced function call overhead in ROPE hot path
63+
- Eliminated unnecessary memory copying for single value extraction
64+
- More cache-friendly parameter access pattern
65+
- Integration test shows ROPE operations working correctly: 1152 operations (960 single_multi, 192 data_parallel)
66+
67+
## Context
68+
This optimization was identified during code review and addresses the user's concern about "lots of memcpy is going to really hurt performance." The fix demonstrates that the GGML library already provides proper helper functions for parameter extraction, making the memcpy approach unnecessary.
69+
70+
## Author
71+
David Sanftenberg

0 commit comments

Comments
 (0)