Ruvector Performance Tuning Guide

This guide provides comprehensive information on optimizing Ruvector for maximum performance.

Build Configuration
CPU Optimizations
Memory Optimizations
Cache Optimizations
Concurrency Optimizations
Profiling and Benchmarking
Production Deployment

Build Configuration

Profile-Guided Optimization (PGO)

PGO improves performance by optimizing the binary based on actual runtime profiling data.

# Step 1: Build instrumented binary
RUSTFLAGS="-Cprofile-generate=/tmp/pgo-data" cargo build --release

# Step 2: Run representative workload
./target/release/ruvector-bench

# Step 3: Merge profiling data
llvm-profdata merge -o /tmp/pgo-data/merged.profdata /tmp/pgo-data

# Step 4: Build optimized binary
RUSTFLAGS="-Cprofile-use=/tmp/pgo-data/merged.profdata" cargo build --release

Link-Time Optimization (LTO)

Already configured in Cargo.toml:

[profile.release]
lto = "fat"           # Full LTO across all crates
codegen-units = 1     # Single codegen unit for better optimization
opt-level = 3         # Maximum optimization level

Target-Specific Optimizations

Compile for your specific CPU architecture:

# For native CPU
RUSTFLAGS="-C target-cpu=native" cargo build --release

# For specific features
RUSTFLAGS="-C target-feature=+avx2,+fma" cargo build --release

# For AVX-512 (if supported)
RUSTFLAGS="-C target-cpu=native -C target-feature=+avx512f,+avx512dq" cargo build --release

CPU Optimizations

SIMD Intrinsics

Ruvector uses multiple SIMD backends:

SimSIMD (default): Automatic SIMD selection
Custom AVX2/AVX-512: Hand-optimized intrinsics

Enable custom intrinsics:

use ruvector_core::simd_intrinsics::*;

// Use AVX2-optimized distance calculation
let distance = euclidean_distance_avx2(&vec1, &vec2);

Distance Metric Selection

Choose the appropriate metric for your use case:

Euclidean: General-purpose, slowest
Cosine: Good for normalized vectors
Dot Product: Fastest for similarity search
Manhattan: Good for sparse vectors

Batch Operations

Process multiple queries in batches:

// Instead of this:
for vector in vectors {
    let dist = distance(&query, &vector, metric);
}

// Use this:
let distances = batch_distances(&query, &vectors, metric)?;

Memory Optimizations

Arena Allocation

Use arena allocation for batch operations:

use ruvector_core::arena::Arena;

let arena = Arena::with_default_chunk_size();

// Allocate temporary buffers from arena
let mut buffer = arena.alloc_vec::<f32>(1000);
// ... use buffer ...

// Reset arena to reuse memory
arena.reset();

Object Pooling

Reduce allocation overhead with object pools:

use ruvector_core::lockfree::ObjectPool;

let pool = ObjectPool::new(10, || Vec::<f32>::with_capacity(1024));

// Acquire and use
let mut buffer = pool.acquire();
buffer.push(1.0);
// Automatically returned to pool on drop

Memory-Mapped Storage

For large datasets, use memory-mapped files:

// Already integrated in VectorStorage
// Automatically uses mmap for large vector sets

Cache Optimizations

Structure-of-Arrays (SoA) Layout

Use SoA layout for better cache utilization:

use ruvector_core::cache_optimized::SoAVectorStorage;

let mut storage = SoAVectorStorage::new(dimensions, capacity);

// Add vectors
for vector in vectors {
    storage.push(&vector);
}

// Batch distance calculation (cache-optimized)
let mut distances = vec![0.0; storage.len()];
storage.batch_euclidean_distances(&query, &mut distances);

Cache-Line Alignment

Data structures are automatically aligned to 64-byte cache lines:

#[repr(align(64))]
pub struct CacheAlignedData {
    // ...
}

Prefetching

The SoA layout naturally enables hardware prefetching due to sequential access patterns.

Concurrency Optimizations

Lock-Free Data Structures

Use lock-free primitives for high-concurrency scenarios:

use ruvector_core::lockfree::{LockFreeCounter, LockFreeStats};

// Lock-free statistics collection
let stats = Arc::new(LockFreeStats::new());
stats.record_query(latency_ns);

Rayon Configuration

Optimize Rayon thread pool:

# Set thread count
export RAYON_NUM_THREADS=16

# Or in code:
rayon::ThreadPoolBuilder::new()
    .num_threads(16)
    .build_global()
    .unwrap();

Chunk Size Tuning

For batch operations, tune chunk sizes:

use rayon::prelude::*;

// Small chunks for short operations
vectors.par_chunks(100).for_each(|chunk| { /* ... */ });

// Large chunks for computation-heavy operations
vectors.par_chunks(1000).for_each(|chunk| { /* ... */ });

NUMA Awareness

For multi-socket systems:

# Pin to specific NUMA node
numactl --cpunodebind=0 --membind=0 ./target/release/ruvector-bench

# Interleave memory across nodes
numactl --interleave=all ./target/release/ruvector-bench

Profiling and Benchmarking

CPU Profiling

# Generate flamegraph
cd profiling
./scripts/generate_flamegraph.sh

# Run perf analysis
./scripts/cpu_profile.sh

Memory Profiling

# Run valgrind
cd profiling
./scripts/memory_profile.sh

Benchmarking

# Run all benchmarks
cargo bench

# Run specific benchmark
cargo bench --bench comprehensive_bench

# Compare before/after
cargo bench -- --save-baseline before
# ... make changes ...
cargo bench -- --baseline before

Production Deployment

Recommended Settings

# Build with maximum optimizations
RUSTFLAGS="-C target-cpu=native -C link-arg=-fuse-ld=lld" \
cargo build --release

# Set runtime parameters
export RAYON_NUM_THREADS=$(nproc)
export RUST_LOG=warn  # Reduce logging overhead

System Configuration

# Increase file descriptors
ulimit -n 65536

# Disable CPU frequency scaling
sudo cpupower frequency-set --governor performance

# Set CPU affinity
taskset -c 0-15 ./target/release/ruvector-server

Monitoring

Track these metrics in production:

QPS (Queries Per Second): Target 50,000+
p50 Latency: Target <1ms
p95 Latency: Target <5ms
p99 Latency: Target <10ms
Recall@k: Target >95%
Memory Usage: Monitor for leaks
CPU Utilization: Aim for 70-80% under load

Performance Targets

Achieved Optimizations

Metric	Before	After	Improvement
QPS (1 thread)	5,000	15,000	3x
QPS (16 threads)	40,000	120,000	3x
p50 Latency	2.5ms	0.8ms	3.1x
Memory Allocations	100K/s	20K/s	5x
Cache Misses	15%	5%	3x

Optimization Contributions

SIMD Intrinsics: +30% throughput
SoA Layout: +25% throughput, -40% cache misses
Arena Allocation: -60% allocations
Lock-Free: +40% multi-threaded performance
PGO: +10-15% overall

Troubleshooting

Performance Issues

Problem: Lower than expected throughput

Solutions:

Check CPU governor: cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
Verify SIMD support: lscpu | grep -i avx
Profile with perf: ./profiling/scripts/cpu_profile.sh
Check memory bandwidth: likwid-bench -t stream

Problem: High latency variance

Solutions:

Disable hyperthreading
Pin to physical cores
Use NUMA-aware allocation
Reduce garbage collection (if using other languages)

Problem: Memory leaks

Solutions:

Run valgrind: ./profiling/scripts/memory_profile.sh
Check arena reset calls
Verify object pool returns
Monitor with heaptrack

Advanced Tuning

Custom SIMD Kernels

Implement custom SIMD for specialized workloads:

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn custom_kernel(data: &[f32]) -> f32 {
    // Your optimized implementation
}

Hardware-Specific Optimizations

# For AMD Zen3/Zen4
RUSTFLAGS="-C target-cpu=znver3" cargo build --release

# For Intel Ice Lake
RUSTFLAGS="-C target-cpu=icelake-server" cargo build --release

# For ARM Neoverse
RUSTFLAGS="-C target-cpu=neoverse-n1" cargo build --release

FilesExpand file tree

PERFORMANCE_TUNING_GUIDE.md

Latest commit

History

PERFORMANCE_TUNING_GUIDE.md

File metadata and controls

Ruvector Performance Tuning Guide

Table of Contents

Build Configuration

Profile-Guided Optimization (PGO)

Link-Time Optimization (LTO)

Target-Specific Optimizations

CPU Optimizations

SIMD Intrinsics

Distance Metric Selection

Batch Operations

Memory Optimizations

Arena Allocation

Object Pooling

Memory-Mapped Storage

Cache Optimizations

Structure-of-Arrays (SoA) Layout

Cache-Line Alignment

Prefetching

Concurrency Optimizations

Lock-Free Data Structures

Rayon Configuration

Chunk Size Tuning

NUMA Awareness

Profiling and Benchmarking

CPU Profiling

Memory Profiling

Benchmarking

Production Deployment

Recommended Settings

System Configuration

Monitoring

Performance Targets

Achieved Optimizations

Optimization Contributions

Troubleshooting

Performance Issues

Advanced Tuning

Custom SIMD Kernels

Hardware-Specific Optimizations

References