NUMA mirroring implementation with inference performance boost
- Achieved 5% inference speed improvement (14.6 -> 15.3 t/s)
- Clean explicit NUMA setup during model loading
- Ultra-minimal hot path with thread-local NUMA node access
- Working NUMA mirrors for all model weights
- Performance: text generation improved, prompt processing needs optimization
Performance Results (Qwen3-30B-A3B):
- Text Generation: 14.6 -> 15.3 t/s (+5% improvement)
- Prompt Processing: 176 -> 152 t/s (14% regression - needs investigation)
Technical Implementation:
- tensor_data(): O(1) NUMA-aware access via the thread-local ggml_current_numa_node (see the first sketch below)
- tensor_set_data_with_numa_mirrors(): Explicit NUMA setup for model weights at load time (see the second sketch below)
- NUMA coordinator: Thread binding and memory locality (also covered in the second sketch)
- Clean separation: model loading (explicit setup) vs inference (fast access)
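A minimal sketch of the hot-path idea (plain C, not the actual repository code): each tensor holds one data pointer per NUMA node, and tensor_data() indexes that array with a thread-local node id, so the per-access cost is one thread-local load plus one array index. The numa_tensor struct, MAX_NUMA_NODES, and the mirror-array layout here are illustrative assumptions.

```c
#include <stddef.h>
#include <stdio.h>

#define MAX_NUMA_NODES 8

/* Set once per worker thread during binding; read on every tensor access. */
static _Thread_local int ggml_current_numa_node = 0;

typedef struct {
    /* One mirror of the weight buffer per NUMA node; slot 0 is primary. */
    void *data[MAX_NUMA_NODES];
} numa_tensor;

/* O(1) hot-path accessor: no locks, no syscalls, just an indexed load. */
static inline void *tensor_data(const numa_tensor *t) {
    return t->data[ggml_current_numa_node];
}

int main(void) {
    float node0_copy[4] = {0};
    float node1_copy[4] = {0};
    numa_tensor t = { .data = { node0_copy, node1_copy } };

    ggml_current_numa_node = 1;  /* as if this thread were bound to node 1 */
    printf("reading mirror at %p\n", tensor_data(&t));
    return 0;
}
```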
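And the load-time side under the same assumptions: during model loading each weight buffer is replicated onto every node, and worker threads are pinned to a node before the thread-local node id is set. The libnuma calls (numa_alloc_onnode, numa_run_on_node, numa_free) are real, but the helper names and structure are hypothetical, not the repository's actual tensor_set_data_with_numa_mirrors() implementation. Build with `cc -std=c11 sketch.c -lnuma`.

```c
#include <numa.h>
#include <stdio.h>
#include <string.h>

#define MAX_NUMA_NODES 8

static _Thread_local int ggml_current_numa_node = 0;

typedef struct {
    void  *data[MAX_NUMA_NODES]; /* one mirror per NUMA node */
    size_t size;
} numa_tensor;

/* Explicit load-time setup: copy the weights onto every node once. */
static int tensor_set_data_with_numa_mirrors(numa_tensor *t,
                                             const void *weights, size_t size) {
    int max_node = numa_max_node();
    t->size = size;
    for (int node = 0; node <= max_node && node < MAX_NUMA_NODES; ++node) {
        void *mirror = numa_alloc_onnode(size, node); /* pages bound to node */
        if (!mirror)
            return -1;
        memcpy(mirror, weights, size);
        t->data[node] = mirror;
    }
    return 0;
}

/* Coordinator side: pin the worker to a node, then record the node id for
 * the thread-local fast path used by tensor_data(). */
static void bind_worker_to_node(int node) {
    numa_run_on_node(node); /* restrict this thread's scheduling to node */
    ggml_current_numa_node = node;
}

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not supported on this system\n");
        return 1;
    }
    float weights[1024] = {0};
    numa_tensor t = {0};
    if (tensor_set_data_with_numa_mirrors(&t, weights, sizeof weights) != 0)
        return 1;

    bind_worker_to_node(0);
    printf("node 0 mirror at %p\n", t.data[ggml_current_numa_node]);

    for (int node = 0; node < MAX_NUMA_NODES; ++node)
        if (t.data[node])
            numa_free(t.data[node], t.size);
    return 0;
}
```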