- 
                Notifications
    You must be signed in to change notification settings 
- Fork 5
Hexagon NPU FastRPC Backend Overview
The Hexagon NPU FastRPC backend provides hardware acceleration for GGML operations using Qualcomm's Hexagon NPU capabilities. This backend is built entirely from scratch using Qualcomm's FastRPC framework, providing a low-level alternative to QNN SDK-based implementations. By bypassing high-level abstractions and programming directly with HVX intrinsics, it is designed to offload compute-intensive operations from the CPU to Qualcomm's specialized Hexagon NPU hardware, enabling maximum performance and power efficiency for machine learning workloads on Snapdragon platforms.
- Built from Scratch: Complete custom implementation using Qualcomm's FastRPC framework for direct hardware control
- Minimal Abstraction Layer: Lightweight, efficient abstraction objects on the host side for seamless GGML graph/tensor/buffer offloading
- Zero-Copy Communication: FastRPC and ION buffers enable efficient data sharing between CPU and NPU domains
- 
Raw HVX Intrinsics: Critical operations like mul_matandaddimplemented using direct Hexagon Vector Extensions
- Custom Thread Pool: 4-thread parallel execution matching NPU hardware capabilities with intelligent load balancing
- VTCM Management: Thread-specific VTCM operations that fully leverage high-speed VTCM to reduce memory bandwidth pressure
- L2 Cache Optimization: Prefetching and cache-aware memory access patterns for maximum bandwidth utilization
- Hardware-Accelerated Formats: Q4_0, Q8_0, Q4_K quantized data types with custom HVX dequantization functions
- Mixed Precision Operations: Support for quantized (Q4_0, Q8_0, Q4_K) and FP32 mixed operations with efficient conversion tables
- 
Matrix Multiplication (mul_mat): Custom HVX implementation with 4-thread parallelization for transformer workloads
- Element-wise Operations: Add, multiply operations with broadcasting support using direct HVX intrinsics
- RMS Normalization: Hand-optimized kernels for layer normalization operations common in modern LLMs
- Graph-level Execution: Entire computation graphs executed on NPU to minimize CPU-NPU memory transfers and maximize efficiency
- Android: Primary target platform with Snapdragon SoCs featuring Hexagon NPU (8 Gen 1+, 8cx Gen 3+)
- Linux: Development and testing support for Hexagon-enabled platforms
- Windows on ARM: Cross-platform compatibility for Snapdragon-based Windows devices
- Minimal Overhead: FastRPC implementation provides direct hardware access with virtually no abstraction penalties
- Maximum Hardware Utilization: Custom implementation leverages all 4 hardware threads and HVX units on the Hexagon NPU
- Memory Bandwidth Optimization: Graph-level execution and zero-copy transfers reduce bottlenecks between CPU and NPU
- Power Efficiency: Direct NPU execution typically provides 2-5x better performance-per-watt compared to CPU execution
- Scalable Architecture: Designed to efficiently handle both small and large model workloads
- 
Host Device Management: host_device.cpp- NPU device interface and lifecycle management
- 
Host Graph Coordination: graph.cpp- graph creation, caching, and execution coordination
- 
Buffer Management: buffer.cpp- RPC memory buffers, ION allocation, and zero-copy data transfer
- Type Conversion Utilities: Efficient host-device data format conversion and RPC interface helpers
- 
Device Runtime: device.cpp- core NPU-side runtime executing on Hexagon hardware
- 
Graph Execution Engine: graph.cpp- NPU-side graph computation with 4-thread parallelization
- 
HVX Operation Kernels: op_impl.cpp- hand-optimized HVX intrinsic implementations
- 
Quantization Kernels: quants.cpp- HVX-optimized dequantization for Q4_0, Q8_0, Q4_K
- 
Thread Management: thread_pool.hpp- custom 4-thread pool using QURT primitives
- 
Memory Management: vtcm_mem.hpp- VTCM allocation with RAII semantics
- 
Interface Definition: hexagon_npu.idl- defines the RPC contract between host CPU and Hexagon NPU
We conducted some performance testing to evaluate the Hexagon NPU FastRPC backend against CPU-only execution. The benchmarks focus on large matrix multiplication operations—a critical compute bottleneck in transformer-based models.
We extended the test-backend-ops to include large matrix multiplication scenarios that represent typical LLM inference patterns:
diff --git tests/test-backend-ops.cpp tests/test-backend-ops.cpp
index 9ec24d9f23c5bc93b1b1e98e890e1186632358f7..584150154eee761f2d300504c525d38265fe3eb0 100644
--- tests/test-backend-ops.cpp
+++ tests/test-backend-ops.cpp
@@ -4239,6 +4239,8 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_eval() {
             test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16,  1, 1024, {3, 2}, {1, 1}));
             test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16,  8, 1024, {3, 2}, {1, 1}));
             test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 16, 1024, {3, 2}, {1, 1}));
+            test_cases.emplace_back(new test_mul_mat(type_a, type_b, 8192, 1, 8192, {1, 1}, {1, 1}));
+            test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16384, 1, 16384, {1, 1}, {1, 1}));
         }
     }
     for (ggml_type type_a : other_types) {The following table compares execution time (lower is better) across different precision formats for a large 16384×16384 matrix multiplied by a 16384×1 vector on some devices:
| devices | commit | type | src0_dim | src1_dim | cpu_time(us) | host_total(us) | host_param_update(us) | device_total(us) | device_dequant(us) | device_compute(us) | 
|---|---|---|---|---|---|---|---|---|---|---|
| 8gen2 | 8409dd1e9 | F32 | 16384x16384 | 16384x1 | 38935 | 285529 | 296 | 89518 | 0 | 89518 | 
| 8gen2 | 8409dd1e9 | Q8_0 | 16384x16384 | 16384x1 | 8930 | 327385 | 774 | 255894 | 245178 | 10716 | 
| 8gen2 | 8409dd1e9 | Q4_0 | 16384x16384 | 16384x1 | 12503 | 143390 | 735 | 96932 | 86927 | 10005 | 
The benchmark results reveal several important insights about the Hexagon NPU FastRPC implementation:
- 
Dequantization Bottleneck: - For quantized formats, 90-96% of NPU time is spent on dequantization, making this our primary optimization target
 
- 
Compute Efficiency: - When data resides in VTCM memory, the NPU shows excellent computational performance (~10,000 μs) for matrix multiplication operations
- This represents a ~4× improvement over CPU FP32 performance when comparing pure computation time
- However, the overall NPU performance is currently limited by memory transfers and dequantization overhead
 
- 
Memory Access Patterns: - Pure F32 computation on NPU (~90,000 μs) is slower than CPU (~39,000 μs) due to memory access patterns
- The significant performance difference between F32 and post-dequantization Q4_0/Q8_0 execution suggests optimization potential through better VTCM utilization
 
- 
Communication Overhead: - The host_param_update time (700-800 μs) represents FastRPC communication overhead
- While minimal for large operations, this overhead becomes proportionally significant for smaller tensor operations
- Batching operations into larger computation graphs would help amortize these costs
 
- Extended Operation Coverage: Additional custom HVX implementations for more GGML operations (attention, softmax, etc.)
- Dynamic Thread Scheduling: Runtime load balancing and work-stealing across the 4 hardware threads
- Performance Profiling Suite: Custom profiling tools for FastRPC execution paths and bottleneck analysis
- Advanced Graph Fusion: Sophisticated operation fusion techniques to minimize memory transfers and maximize NPU utilization