Hexagon NPU FastRPC Backend Overview

Overview

The Hexagon NPU FastRPC backend provides hardware acceleration for GGML operations on Qualcomm's Hexagon NPU. It is built from scratch on Qualcomm's FastRPC framework as a low-level alternative to QNN SDK-based implementations: instead of going through high-level abstractions, it programs the NPU directly with HVX intrinsics and offloads compute-intensive operations from the CPU, improving performance and power efficiency for machine learning workloads on Snapdragon platforms.

Key Features

Custom FastRPC Implementation

  • Built from Scratch: Complete custom implementation using Qualcomm's FastRPC framework for direct hardware control
  • Minimal Abstraction Layer: Lightweight, efficient abstraction objects on the host side for seamless GGML graph/tensor/buffer offloading
  • Zero-Copy Communication: FastRPC and ION buffers enable efficient data sharing between CPU and NPU domains
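
The zero-copy path relies on ION/DMA-BUF backed allocations that both the CPU and the NPU can map. Below is a minimal sketch, assuming the Hexagon SDK's rpcmem API (rpcmem.h); the wrapper type itself is illustrative, not the backend's actual class in host/buffer.cpp:

```cpp
#include <cstddef>
#include "rpcmem.h"   // rpcmem_alloc / rpcmem_free / rpcmem_to_fd (Hexagon SDK)

// Sketch of a CPU/NPU-shared buffer backed by ION/DMA-BUF memory.
struct shared_buffer {
    void * data = nullptr;   // CPU-visible address, also mappable by the DSP
    size_t size = 0;

    explicit shared_buffer(size_t n) : size(n) {
        // Cacheable system-heap allocation that FastRPC can map into the
        // NPU address space without copying.
        data = rpcmem_alloc(RPCMEM_HEAP_ID_SYSTEM, RPCMEM_DEFAULT_FLAGS, (int) n);
    }

    // File descriptor of the underlying ION/DMA-BUF allocation, used when the
    // buffer is referenced in FastRPC calls.
    int fd() const { return data ? rpcmem_to_fd(data) : -1; }

    ~shared_buffer() {
        if (data) rpcmem_free(data);
    }
};
```

Because both sides map the same physical pages, tensor data written by the CPU is visible to the NPU kernels without an extra memcpy; only small parameter structures travel through the RPC call itself.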

Direct Hardware Programming

  • Raw HVX Intrinsics: Critical operations such as mul_mat and add are implemented directly with Hexagon Vector Extensions (HVX) intrinsics (a minimal sketch follows this list)
  • Custom Thread Pool: 4-thread parallel execution matched to the NPU's hardware thread count
  • VTCM Management: Per-thread use of the NPU's Vector Tightly Coupled Memory (VTCM) to keep hot data close to the vector units and reduce memory bandwidth pressure
  • L2 Cache Optimization: Prefetching and cache-aware memory access patterns to improve bandwidth utilization
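
To illustrate the intrinsics-level style, here is a minimal sketch of an element-wise f32 add over 128-byte HVX vectors (32 floats per iteration). It assumes a v68+ HVX target with the qf32 float intrinsics, 128-byte-aligned pointers, and an element count that is a multiple of 32; the actual kernels in op_impl.cpp additionally handle tails, broadcasting, and multi-threading.

```cpp
#include <hexagon_types.h>    // HVX_Vector
#include <hexagon_protos.h>   // Q6_* intrinsics

// Element-wise f32 add, one 128-byte HVX vector (32 floats) per iteration.
static void hvx_add_f32(const float * src0, const float * src1, float * dst, int n) {
    const HVX_Vector * va = (const HVX_Vector *) src0;
    const HVX_Vector * vb = (const HVX_Vector *) src1;
    HVX_Vector *       vd = (HVX_Vector *) dst;
    for (int i = 0; i < n / 32; ++i) {
        HVX_Vector acc = Q6_Vqf32_vadd_VsfVsf(va[i], vb[i]);  // add in qf32 format
        vd[i]          = Q6_Vsf_equals_Vqf32(acc);            // convert back to IEEE f32
    }
}
```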

Advanced Quantization Support

  • Hardware-Accelerated Formats: Q4_0, Q8_0, and Q4_K quantized data types with custom HVX dequantization functions (the Q8_0 block layout is sketched below)
  • Mixed-Precision Operations: operations mixing quantized (Q4_0, Q8_0, Q4_K) and FP32 operands, backed by efficient conversion tables
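
For reference, the struct below mirrors GGML's Q8_0 block layout that the dequantization kernels consume: 32 weights share one fp16 scale, and each weight dequantizes as d * qs[i]. The scalar loop only documents the format; the fp16_to_fp32 helper is a stand-in for whatever conversion table or HVX path the backend actually uses, and the real kernels process many blocks per vector.

```cpp
#include <cstdint>

// GGML Q8_0 block: one fp16 scale plus 32 signed 8-bit quants.
struct block_q8_0 {
    uint16_t d;       // scale as an IEEE fp16 bit pattern
    int8_t   qs[32];  // quantized weights
};

// Scalar reference: dst must have room for nblocks * 32 floats.
static void dequantize_row_q8_0_ref(const block_q8_0 * blocks, float * dst, int nblocks,
                                    float (*fp16_to_fp32)(uint16_t)) {
    for (int b = 0; b < nblocks; ++b) {
        const float d = fp16_to_fp32(blocks[b].d);
        for (int i = 0; i < 32; ++i) {
            dst[b * 32 + i] = d * (float) blocks[b].qs[i];
        }
    }
}
```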

Supported Operations

FastRPC Custom Kernels

  • Matrix Multiplication (mul_mat): Custom HVX implementation with 4-thread parallelization for transformer workloads
  • Element-wise Operations: Add, multiply operations with broadcasting support using direct HVX intrinsics
  • RMS Normalization: Hand-optimized kernels for the RMS normalization layers common in modern LLMs (a scalar reference follows this list)
  • Graph-level Execution: Entire computation graphs executed on NPU to minimize CPU-NPU memory transfers and maximize efficiency
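
As a reference for the RMS normalization kernel, the computation is y[i] = x[i] / sqrt(mean(x²) + eps). A scalar sketch is shown below; the HVX kernel vectorizes the sum-of-squares reduction and the final scaling, and eps comes from the GGML op parameters.

```cpp
#include <cmath>

// Scalar reference for RMS normalization over one row of n floats.
static void rms_norm_ref(const float * x, float * y, int n, float eps) {
    float sumsq = 0.0f;
    for (int i = 0; i < n; ++i) {
        sumsq += x[i] * x[i];
    }
    const float scale = 1.0f / std::sqrt(sumsq / (float) n + eps);
    for (int i = 0; i < n; ++i) {
        y[i] = x[i] * scale;
    }
}
```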

Platform Support

  • Android: Primary target platform with Snapdragon SoCs featuring Hexagon NPU (8 Gen 1+, 8cx Gen 3+)
  • Linux: Development and testing support for Hexagon-enabled platforms
  • Windows on ARM: Cross-platform compatibility for Snapdragon-based Windows devices

Key Advantages

  • Minimal Overhead: Direct hardware access through FastRPC with very little abstraction overhead compared to higher-level SDK paths
  • Maximum Hardware Utilization: Custom implementation leverages all 4 hardware threads and HVX units on the Hexagon NPU
  • Memory Bandwidth Optimization: Graph-level execution and zero-copy transfers reduce bottlenecks between CPU and NPU
  • Power Efficiency: Direct NPU execution typically provides 2-5x better performance-per-watt compared to CPU execution
  • Scalable Architecture: Designed to efficiently handle both small and large model workloads

Technical Implementation

Based on FastRPC Framework

Host Side (ggml/src/ggml-qnn/npu/host/)

  • Host Device Management: host_device.cpp - NPU device interface and lifecycle management
  • Host Graph Coordination: graph.cpp - graph creation, caching, and execution coordination
  • Buffer Management: buffer.cpp - RPC memory buffers, ION allocation, and zero-copy data transfer
  • Type Conversion Utilities: Efficient host-device data format conversion and RPC interface helpers
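
As a purely hypothetical illustration of what the host side communicates per tensor, the sketch below packs shape, strides, type, and a buffer offset into a plain struct; the actual wire format and naming are defined by hexagon_npu.idl, not by this example.

```cpp
#include <cstdint>
#include "ggml.h"   // ggml_tensor, GGML_MAX_DIMS

// Hypothetical per-tensor descriptor sent alongside a FastRPC call.
struct npu_tensor_desc {
    int64_t  ne[GGML_MAX_DIMS];   // element counts per dimension
    uint64_t nb[GGML_MAX_DIMS];   // strides in bytes per dimension
    int32_t  type;                // ggml_type of the tensor data
    uint64_t buffer_offset;       // offset of the data within the shared buffer
};

static npu_tensor_desc pack_tensor_desc(const ggml_tensor * t, uint64_t offset) {
    npu_tensor_desc d{};
    for (int i = 0; i < GGML_MAX_DIMS; ++i) {
        d.ne[i] = t->ne[i];
        d.nb[i] = (uint64_t) t->nb[i];
    }
    d.type          = (int32_t) t->type;
    d.buffer_offset = offset;
    return d;
}
```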

Device Side (ggml/src/ggml-qnn/npu/device/)

  • Device Runtime: device.cpp - core NPU-side runtime executing on Hexagon hardware
  • Graph Execution Engine: graph.cpp - NPU-side graph computation with 4-thread parallelization
  • HVX Operation Kernels: op_impl.cpp - hand-optimized HVX intrinsic implementations
  • Quantization Kernels: quants.cpp - HVX-optimized dequantization for Q4_0, Q8_0, Q4_K
  • Thread Management: thread_pool.hpp - custom 4-thread pool using QURT primitives
  • Memory Management: vtcm_mem.hpp - VTCM allocation with RAII semantics
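
In the same spirit as vtcm_mem.hpp, here is a minimal sketch of RAII-scoped VTCM ownership, assuming the Hexagon SDK's HAP VTCM manager (HAP_vtcm_mgr.h); the class name and interface are illustrative only.

```cpp
#include <cstddef>
#include "HAP_vtcm_mgr.h"   // HAP_request_VTCM / HAP_release_VTCM (Hexagon SDK)

// Owns a VTCM window for the lifetime of the object and releases it on scope exit.
class vtcm_block {
  public:
    vtcm_block(size_t size, bool single_page)
        : _ptr(HAP_request_VTCM((unsigned int) size, single_page ? 1 : 0)), _size(size) {}

    ~vtcm_block() {
        if (_ptr) HAP_release_VTCM(_ptr);
    }

    vtcm_block(const vtcm_block &)             = delete;   // sole owner of the window
    vtcm_block & operator=(const vtcm_block &) = delete;

    void * get() const { return _ptr; }
    bool   valid() const { return _ptr != nullptr; }
    size_t size() const { return _size; }

  private:
    void * _ptr  = nullptr;
    size_t _size = 0;
};
```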

FastRPC Interface (ggml/src/ggml-qnn/npu/idl/)

  • Interface Definition: hexagon_npu.idl - defines the RPC contract between host CPU and Hexagon NPU

Benchmark Results

We ran performance tests comparing the Hexagon NPU FastRPC backend against CPU-only execution. The benchmarks focus on large matrix multiplications, a major compute bottleneck in transformer-based models.

Testing Methodology

We extended the test-backend-ops tool with large matrix multiplication cases that represent typical LLM inference shapes:

```diff
diff --git tests/test-backend-ops.cpp tests/test-backend-ops.cpp
index 9ec24d9f23c5bc93b1b1e98e890e1186632358f7..584150154eee761f2d300504c525d38265fe3eb0 100644
--- tests/test-backend-ops.cpp
+++ tests/test-backend-ops.cpp
@@ -4239,6 +4239,8 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_eval() {
             test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16,  1, 1024, {3, 2}, {1, 1}));
             test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16,  8, 1024, {3, 2}, {1, 1}));
             test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 16, 1024, {3, 2}, {1, 1}));
+            test_cases.emplace_back(new test_mul_mat(type_a, type_b, 8192, 1, 8192, {1, 1}, {1, 1}));
+            test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16384, 1, 16384, {1, 1}, {1, 1}));
         }
     }
    for (ggml_type type_a : other_types) {
```

Performance Results

The following table compares execution time (lower is better, in microseconds) across precision formats for a 16384×16384 matrix multiplied by a 16384×1 vector on a Snapdragon 8 Gen 2 device:

| device | commit | type | src0_dim | src1_dim | cpu_time (us) | host_total (us) | host_param_update (us) | device_total (us) | device_dequant (us) | device_compute (us) |
|---|---|---|---|---|---|---|---|---|---|---|
| 8gen2 | 8409dd1e9 | F32 | 16384x16384 | 16384x1 | 38935 | 285529 | 296 | 89518 | 0 | 89518 |
| 8gen2 | 8409dd1e9 | Q8_0 | 16384x16384 | 16384x1 | 8930 | 327385 | 774 | 255894 | 245178 | 10716 |
| 8gen2 | 8409dd1e9 | Q4_0 | 16384x16384 | 16384x1 | 12503 | 143390 | 735 | 96932 | 86927 | 10005 |

Key Observations

The benchmark results reveal several important insights about the Hexagon NPU FastRPC implementation:

  • Dequantization Bottleneck: For the quantized formats, 90-96% of NPU time is spent on dequantization (e.g. 245,178 μs of the 255,894 μs Q8_0 device total), making this our primary optimization target
  • Compute Efficiency:
    • Once data resides in VTCM, the NPU's matrix multiplication compute time is excellent (~10,000 μs)
    • That is a ~4× improvement over CPU FP32 (~39,000 μs) when comparing pure computation time
    • Overall NPU performance, however, is currently limited by memory transfers and dequantization overhead
  • Memory Access Patterns:
    • Pure F32 computation on the NPU (~90,000 μs) is slower than on the CPU (~39,000 μs) due to memory access patterns
    • The large gap between the F32 path and the post-dequantization Q4_0/Q8_0 compute time points to further optimization potential through better VTCM utilization
  • Communication Overhead:
    • The host_param_update time (700-800 μs) is FastRPC communication overhead
    • It is negligible for large operations but becomes proportionally significant for smaller tensor operations
    • Batching operations into larger computation graphs would amortize these costs

Future Developments

  • Extended Operation Coverage: Additional custom HVX implementations for more GGML operations (attention, softmax, etc.)
  • Dynamic Thread Scheduling: Runtime load balancing and work-stealing across the 4 hardware threads
  • Performance Profiling Suite: Custom profiling tools for FastRPC execution paths and bottleneck analysis
  • Advanced Graph Fusion: Sophisticated operation fusion techniques to minimize memory transfers and maximize NPU utilization