WebAssembly bindings for PowerInfer-style sparse inference engine.
This crate provides WASM bindings for the RuVector sparse inference engine, enabling efficient neural network inference in web browsers and Node.js environments with:
- Sparse Activation: PowerInfer-style neuron prediction for 2-3x speedup
- GGUF Support: Load quantized models in GGUF format
- Streaming Loading: Fetch large models incrementally
- Multiple Backends: Embedding models and LLM text generation
wasm-pack build --target web --releasewasm-pack build --target nodejs --releasewasm-pack build --target bundler --releasenpm install ruvector-sparse-inference-wasmOr build locally:
wasm-pack build --target web
cd pkg && npm linkimport init, { SparseInferenceEngine } from 'ruvector-sparse-inference-wasm';
// Initialize WASM module
await init();
// Load model
const modelBytes = await fetch('/models/llama-2-7b.gguf').then(r => r.arrayBuffer());
const config = {
sparsity: {
enabled: true,
threshold: 0.1 // 10% neuron activation
},
temperature: 0.7,
top_k: 40
};
const engine = new SparseInferenceEngine(
new Uint8Array(modelBytes),
JSON.stringify(config)
);
// Run inference
const input = new Float32Array(4096); // Your input embedding
const output = engine.infer(input);
console.log('Sparsity stats:', engine.sparsity_stats());
console.log('Model metadata:', engine.metadata());For large models (>1GB), use streaming:
const engine = await SparseInferenceEngine.load_streaming(
'https://example.com/large-model.gguf',
JSON.stringify(config)
);For sentence transformers and embedding generation:
import { EmbeddingModel } from 'ruvector-sparse-inference-wasm';
const modelBytes = await fetch('/models/all-MiniLM-L6-v2.gguf').then(r => r.arrayBuffer());
const embedder = new EmbeddingModel(new Uint8Array(modelBytes));
// Encode single sequence (requires tokenization first)
const inputIds = new Uint32Array([101, 2023, 2003, ...]); // Tokenized input
const embedding = embedder.encode(inputIds);
console.log('Embedding dimension:', embedder.dimension());
// Batch encoding
const batchIds = new Uint32Array([...all tokenized sequences...]);
const lengths = new Uint32Array([10, 15, 12]); // Length of each sequence
const embeddings = embedder.encode_batch(batchIds, lengths);For autoregressive language models:
import { LLMModel } from 'ruvector-sparse-inference-wasm';
const modelBytes = await fetch('/models/llama-2-7b-chat.gguf').then(r => r.arrayBuffer());
const config = {
sparsity: { enabled: true, threshold: 0.1 },
temperature: 0.7,
top_k: 40
};
const llm = new LLMModel(new Uint8Array(modelBytes), JSON.stringify(config));
// Generate tokens one at a time
const prompt = new Uint32Array([1, 4321, 1234, ...]); // Tokenized prompt
let generatedTokens = [];
for (let i = 0; i < 100; i++) {
const nextToken = llm.next_token(prompt);
generatedTokens.push(nextToken);
// Append to prompt for next iteration
prompt = new Uint32Array([...prompt, nextToken]);
}
// Or generate multiple tokens at once
const tokens = llm.generate(prompt, 100);
console.log('Generation stats:', llm.stats());
// Reset for new conversation
llm.reset_cache();Improve predictor accuracy with sample data:
// Collect representative samples
const samples = new Float32Array([
...embedding1, // 512 dims
...embedding2, // 512 dims
...embedding3, // 512 dims
]);
engine.calibrate(samples, 512); // 512 = dimension of each sampleAdjust sparsity threshold at runtime:
// More sparse = faster, less accurate
engine.set_sparsity(0.2); // 20% activation
// Less sparse = slower, more accurate
engine.set_sparsity(0.05); // 5% activationimport { measure_inference_time } from 'ruvector-sparse-inference-wasm';
const input = new Float32Array(4096);
const avgTime = measure_inference_time(engine, input, 100); // 100 iterations
console.log(`Average inference time: ${avgTime.toFixed(2)}ms`);interface InferenceConfig {
sparsity: {
enabled: boolean; // Enable sparse inference
threshold: number; // Activation threshold (0.0-1.0)
};
temperature: number; // Sampling temperature (0.0-2.0)
top_k: number; // Top-k sampling (1-100)
top_p?: number; // Nucleus sampling (0.0-1.0)
max_tokens?: number; // Max generation length
}- Chrome/Edge 91+ (WebAssembly SIMD)
- Firefox 89+
- Safari 15+
- Node.js 16+
For older browsers, build without SIMD:
wasm-pack build --target web -- --no-default-features- Enable SIMD: Ensure
wasm32-simdis enabled for 2-4x speedup - Quantization: Use 4-bit or 8-bit quantized GGUF models
- Sparsity: Tune threshold based on accuracy/speed tradeoff
- Calibration: Run calibration with representative data
- Batch Processing: Use batch encoding for multiple inputs
- Worker Threads: Run inference in Web Workers to avoid blocking UI
// worker.js
import init, { SparseInferenceEngine } from 'ruvector-sparse-inference-wasm';
let engine;
self.onmessage = async (e) => {
if (e.data.type === 'init') {
await init();
engine = new SparseInferenceEngine(e.data.modelBytes, e.data.config);
self.postMessage({ type: 'ready' });
} else if (e.data.type === 'infer') {
const output = engine.infer(e.data.input);
self.postMessage({ type: 'result', output });
}
};
// main.js
const worker = new Worker('worker.js', { type: 'module' });
worker.postMessage({
type: 'init',
modelBytes: new Uint8Array(modelBytes),
config: JSON.stringify(config)
});
worker.onmessage = (e) => {
if (e.data.type === 'ready') {
worker.postMessage({
type: 'infer',
input: new Float32Array([...])
});
} else if (e.data.type === 'result') {
console.log('Inference result:', e.data.output);
}
};On Apple M1 Pro (browser):
| Model | Size | Sparsity | Speed | Memory |
|---|---|---|---|---|
| Llama-2-7B | 3.8GB | 10% | 45 tok/s | 1.2GB |
| MiniLM-L6 | 90MB | 15% | 120 emb/s | 180MB |
| Mistral-7B | 4.1GB | 12% | 38 tok/s | 1.4GB |
try {
const engine = new SparseInferenceEngine(modelBytes, config);
const output = engine.infer(input);
} catch (error) {
if (error.message.includes('parse')) {
console.error('Invalid GGUF model format');
} else if (error.message.includes('config')) {
console.error('Invalid configuration');
} else {
console.error('Inference failed:', error);
}
}wasm-pack test --headless --chrome
wasm-pack test --headless --firefoxcargo doc --open --target wasm32-unknown-unknown# Optimize for size
wasm-pack build --target web --release -- -Z build-std=std,panic_abort -Z build-std-features=panic_immediate_abort
# Further compression with wasm-opt
wasm-opt -Oz -o optimized.wasm pkg/ruvector_sparse_inference_wasm_bg.wasmSame as parent RuVector project.
ruvector-sparse-inference- Core Rust implementationruvector-core- Main RuVector libraryrvlite- Lightweight WASM vector database
See main RuVector repository for contribution guidelines.