ruvector-sparse-inference-wasm

WebAssembly bindings for a PowerInfer-style sparse inference engine.

Overview

This crate provides WASM bindings for the RuVector sparse inference engine, enabling efficient neural network inference in web browsers and Node.js environments with:

  • Sparse Activation: PowerInfer-style neuron prediction for 2-3x speedup
  • GGUF Support: Load quantized models in GGUF format
  • Streaming Loading: Fetch large models incrementally
  • Multiple Backends: Embedding models and LLM text generation

Building

For Web Browsers

wasm-pack build --target web --release

For Node.js

wasm-pack build --target nodejs --release

For Bundlers (webpack, rollup, etc.)

wasm-pack build --target bundler --release

Installation

npm install ruvector-sparse-inference-wasm

Or build locally:

wasm-pack build --target web
cd pkg && npm link

Usage

Basic Inference Engine

import init, { SparseInferenceEngine } from 'ruvector-sparse-inference-wasm';

// Initialize WASM module
await init();

// Load model
const modelBytes = await fetch('/models/llama-2-7b.gguf').then(r => r.arrayBuffer());
const config = {
  sparsity: {
    enabled: true,
    threshold: 0.1  // 10% neuron activation
  },
  temperature: 0.7,
  top_k: 40
};

const engine = new SparseInferenceEngine(
  new Uint8Array(modelBytes),
  JSON.stringify(config)
);

// Run inference
const input = new Float32Array(4096);  // Your input embedding
const output = engine.infer(input);

console.log('Sparsity stats:', engine.sparsity_stats());
console.log('Model metadata:', engine.metadata());

Streaming Model Loading

For large models (>1GB), use streaming:

const engine = await SparseInferenceEngine.load_streaming(
  'https://example.com/large-model.gguf',
  JSON.stringify(config)
);

Embedding Models

For sentence transformers and embedding generation:

import { EmbeddingModel } from 'ruvector-sparse-inference-wasm';

const modelBytes = await fetch('/models/all-MiniLM-L6-v2.gguf').then(r => r.arrayBuffer());
const embedder = new EmbeddingModel(new Uint8Array(modelBytes));

// Encode single sequence (requires tokenization first)
const inputIds = new Uint32Array([101, 2023, 2003, ...]);  // Tokenized input
const embedding = embedder.encode(inputIds);

console.log('Embedding dimension:', embedder.dimension());

// Batch encoding
const batchIds = new Uint32Array([/* token IDs of all sequences, concatenated */]);
const lengths = new Uint32Array([10, 15, 12]);  // Length of each sequence
const embeddings = embedder.encode_batch(batchIds, lengths);
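Embeddings returned by encode are plain Float32Array vectors, so they can be compared with standard cosine similarity. A minimal helper for illustration (not part of this crate's API):

```javascript
// Cosine similarity between two equal-length vectors, e.g. embeddings
// returned by EmbeddingModel.encode. Illustrative helper only -- this
// function is not exported by the crate.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Usage: cosineSimilarity(embeddingA, embeddingB) -> value in [-1, 1]
```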

LLM Text Generation

For autoregressive language models:

import { LLMModel } from 'ruvector-sparse-inference-wasm';

const modelBytes = await fetch('/models/llama-2-7b-chat.gguf').then(r => r.arrayBuffer());
const config = {
  sparsity: { enabled: true, threshold: 0.1 },
  temperature: 0.7,
  top_k: 40
};

const llm = new LLMModel(new Uint8Array(modelBytes), JSON.stringify(config));

// Generate tokens one at a time
let prompt = new Uint32Array([1, 4321, 1234, ...]); // Tokenized prompt (let: reassigned below)
let generatedTokens = [];

for (let i = 0; i < 100; i++) {
  const nextToken = llm.next_token(prompt);
  generatedTokens.push(nextToken);

  // Append to prompt for next iteration
  prompt = new Uint32Array([...prompt, nextToken]);
}

// Or generate multiple tokens at once
const tokens = llm.generate(prompt, 100);

console.log('Generation stats:', llm.stats());

// Reset for new conversation
llm.reset_cache();

Calibration

Improve predictor accuracy with sample data:

// Collect representative samples
const samples = new Float32Array([
  ...embedding1,  // 512 dims
  ...embedding2,  // 512 dims
  ...embedding3,  // 512 dims
]);

engine.calibrate(samples, 512);  // 512 = dimension of each sample

Dynamic Sparsity Control

Adjust sparsity threshold at runtime:

// More sparse = faster, less accurate
engine.set_sparsity(0.2);  // 20% activation

// Less sparse = slower, more accurate
engine.set_sparsity(0.05);  // 5% activation

Performance Measurement

import { measure_inference_time } from 'ruvector-sparse-inference-wasm';

const input = new Float32Array(4096);
const avgTime = measure_inference_time(engine, input, 100);  // 100 iterations

console.log(`Average inference time: ${avgTime.toFixed(2)}ms`);

Configuration Options

interface InferenceConfig {
  sparsity: {
    enabled: boolean;      // Enable sparse inference
    threshold: number;     // Activation threshold (0.0-1.0)
  };
  temperature: number;     // Sampling temperature (0.0-2.0)
  top_k: number;          // Top-k sampling (1-100)
  top_p?: number;         // Nucleus sampling (0.0-1.0)
  max_tokens?: number;    // Max generation length
}
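For reference, here is a config object exercising every field of the interface above, including the optional ones. The values are illustrative, not recommendations; the engine constructors take the config serialized with JSON.stringify:

```javascript
// Illustrative InferenceConfig covering all fields; values are examples.
const fullConfig = {
  sparsity: {
    enabled: true,   // enable sparse inference
    threshold: 0.1   // 10% neuron activation
  },
  temperature: 0.7,
  top_k: 40,
  top_p: 0.9,        // optional: nucleus sampling
  max_tokens: 256    // optional: cap generation length
};

// Engines accept the config as a JSON string:
const configJson = JSON.stringify(fullConfig);
```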

Browser Compatibility

  • Chrome/Edge 91+ (WebAssembly SIMD)
  • Firefox 89+
  • Safari 15+
  • Node.js 16+

For older browsers, build without SIMD:

wasm-pack build --target web -- --no-default-features
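If you ship both SIMD and non-SIMD builds, you can pick between them at runtime by feature-detecting SIMD support: validate a tiny module containing a SIMD instruction. The byte sequence below follows the approach used by the wasm-feature-detect project:

```javascript
// Feature-detect WebAssembly SIMD: WebAssembly.validate returns true only
// if the runtime accepts this minimal module, which uses a SIMD opcode.
// (Byte sequence per the wasm-feature-detect project.)
const simdSupported = WebAssembly.validate(new Uint8Array([
  0, 97, 115, 109, 1, 0, 0, 0, 1, 5, 1, 96, 0, 1, 123,
  3, 2, 1, 0, 10, 10, 1, 8, 0, 65, 0, 253, 15, 253, 98, 11
]));

// Then load the matching build, e.g.:
// const pkg = simdSupported ? simdBuild : fallbackBuild;
```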

Performance Tips

  1. Enable SIMD: Build with the WebAssembly SIMD feature enabled for a 2-4x speedup
  2. Quantization: Use 4-bit or 8-bit quantized GGUF models
  3. Sparsity: Tune threshold based on accuracy/speed tradeoff
  4. Calibration: Run calibration with representative data
  5. Batch Processing: Use batch encoding for multiple inputs
  6. Worker Threads: Run inference in Web Workers to avoid blocking UI

Example: Web Worker Integration

// worker.js
import init, { SparseInferenceEngine } from 'ruvector-sparse-inference-wasm';

let engine;

self.onmessage = async (e) => {
  if (e.data.type === 'init') {
    await init();
    engine = new SparseInferenceEngine(e.data.modelBytes, e.data.config);
    self.postMessage({ type: 'ready' });
  } else if (e.data.type === 'infer') {
    const output = engine.infer(e.data.input);
    self.postMessage({ type: 'result', output });
  }
};

// main.js
const worker = new Worker('worker.js', { type: 'module' });

worker.postMessage({
  type: 'init',
  modelBytes: new Uint8Array(modelBytes),
  config: JSON.stringify(config)
});

worker.onmessage = (e) => {
  if (e.data.type === 'ready') {
    worker.postMessage({
      type: 'infer',
      input: new Float32Array([...])
    });
  } else if (e.data.type === 'result') {
    console.log('Inference result:', e.data.output);
  }
};

Benchmarks

On Apple M1 Pro (browser):

Model        Size   Sparsity  Speed      Memory
Llama-2-7B   3.8GB  10%       45 tok/s   1.2GB
MiniLM-L6    90MB   15%       120 emb/s  180MB
Mistral-7B   4.1GB  12%       38 tok/s   1.4GB

Error Handling

try {
  const engine = new SparseInferenceEngine(new Uint8Array(modelBytes), JSON.stringify(config));
  const output = engine.infer(input);
} catch (error) {
  if (error.message.includes('parse')) {
    console.error('Invalid GGUF model format');
  } else if (error.message.includes('config')) {
    console.error('Invalid configuration');
  } else {
    console.error('Inference failed:', error);
  }
}

Development

Run Tests

wasm-pack test --headless --chrome
wasm-pack test --headless --firefox

Build Documentation

cargo doc --open --target wasm32-unknown-unknown

Size Optimization

# Optimize for size (-Z build-std requires a nightly Rust toolchain)
wasm-pack build --target web --release -- -Z build-std=std,panic_abort -Z build-std-features=panic_immediate_abort

# Further compression with wasm-opt
wasm-opt -Oz -o optimized.wasm pkg/ruvector_sparse_inference_wasm_bg.wasm

License

Same as parent RuVector project.

Related Crates

  • ruvector-sparse-inference - Core Rust implementation
  • ruvector-core - Main RuVector library
  • rvlite - Lightweight WASM vector database

Contributing

See main RuVector repository for contribution guidelines.