Critical Gaps in RuVector-PGlite Implementation Plan

🚨 Major Architectural Flaws Discovered

Research into how PGlite extensions are actually built shows that the original implementation plan has critical flaws that must be addressed.


❌ What Was WRONG in the Original Plan

1. pgrx Does NOT Support WASM Compilation

Original Assumption: Use pgrx with wasm32-unknown-unknown target

# ❌ THIS DOESN'T WORK
[lib]
crate-type = ["cdylib"]

[target.wasm32-unknown-unknown]
# pgrx is not designed for WASM target

Reality:

  • pgrx is designed to build native PostgreSQL extensions (.so, .dylib, .dll)
  • pgrx is used to build extensions that run WebAssembly (via Extism), not extensions compiled to WebAssembly
  • No evidence of pgrx supporting wasm32 as a compilation target

2. Wrong Build Toolchain

Original Plan: cargo pgrx package --target wasm32

Reality: PGlite extensions require:

# ✅ CORRECT: Emscripten toolchain
# Extensions are loaded dynamically by PGlite (the main module), so they link as side modules
emcc -o extension.wasm extension.c \
  -I$POSTGRES_INCLUDE \
  -s SIDE_MODULE=1 \
  -s ASYNCIFY

Required Tools:

  • ✅ Emscripten SDK (emsdk)
  • ✅ PostgreSQL headers for WASM
  • ✅ Tar packaging for .tar.gz bundles
  • ❌ NOT cargo pgrx

3. Misunderstood Extension Structure

Original Plan: Build a standalone .wasm file

Reality: PGlite extensions are .tar.gz tarballs containing:

vector.tar.gz
└── extension/
    ├── vector.so.wasm    # WASM compiled extension
    ├── vector.control    # Extension metadata
    ├── vector--*.sql     # SQL install scripts
    └── data/             # Any data files

Actual pgvector Implementation:

// packages/pglite/src/vector/index.ts
const setup = async (_pg: PGliteInterface, emscriptenOpts: any) => {
  return {
    emscriptenOpts,
    bundlePath: new URL('../../release/vector.tar.gz', import.meta.url),
  }
}

export const vector = { name: 'pgvector', setup }

Source: PGlite vector extension source

4. Missing Build Process Details

What Was Missing:

  • How to clone PGlite with submodules
  • How to add extension to postgres-pglite/pglite/Makefile
  • How to build within PGlite's build system
  • Emscripten compilation flags (SIDE_MODULE, ASYNCIFY)
  • Tarball packaging steps

Actual Process (source):

# 1. Clone PGlite
git clone --recurse-submodules git@github.com:electric-sql/pglite.git
cd pglite && pnpm i

# 2. Add extension as submodule
cd postgres-pglite/pglite
git submodule add <extension_url>

# 3. Register in Makefile
echo "SUBDIRS += ruvector" >> Makefile

# 4. Build (creates .tar.gz)
pnpm build:all
# Output: packages/pglite/release/ruvector.tar.gz

5. No Rust-Specific Guidance

Gap: How to write a Rust extension that compiles with Emscripten?

Missing Details:

  • Rust → C FFI interface layer
  • #[no_mangle] exports for PostgreSQL API
  • Memory management (Emscripten vs Rust allocator)
  • Build script for emcc + rustc
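One of the missing details above, memory management across the C/Rust boundary, can be sketched as follows. This is a minimal illustration under the assumption that any buffer allocated by Rust is also released by Rust. The function names `ruvector_alloc_f32` and `ruvector_free_f32` are hypothetical and do not come from the original plan.

```rust
use std::ptr;

/// Allocates a zero-filled buffer of `len` f32 values and hands ownership
/// to the caller. Hypothetical helper: the name is an assumption.
#[no_mangle]
pub extern "C" fn ruvector_alloc_f32(len: usize) -> *mut f32 {
    let buf: Box<[f32]> = vec![0.0f32; len].into_boxed_slice();
    // Leak the box; the caller must hand it back to ruvector_free_f32.
    Box::into_raw(buf) as *mut f32
}

/// Releases a buffer produced by `ruvector_alloc_f32`.
///
/// # Safety
/// `ptr` must come from `ruvector_alloc_f32` called with the same `len`,
/// and must not be used after this call.
#[no_mangle]
pub unsafe extern "C" fn ruvector_free_f32(ptr: *mut f32, len: usize) {
    if ptr.is_null() {
        return;
    }
    // Rebuild the boxed slice so that Rust's allocator frees it.
    let raw = ptr::slice_from_raw_parts_mut(ptr, len);
    drop(unsafe { Box::from_raw(raw) });
}
```

Pairing every allocation with a Rust-side free means the C wrapper never has to assume that both sides share one allocator.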

Possible Approaches:

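Option A: C Extension Calling a Rust Library

// ruvector_pglite.c
#include "postgres.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

// Implemented in the Rust static library (see Option B)
extern float rust_cosine_distance(const float *a, const float *b, size_t len);

PG_FUNCTION_INFO_V1(vector_cosine_distance);
Datum vector_cosine_distance(PG_FUNCTION_ARGS) {
    // Call Rust library via FFI; the arguments depend on the vector type's layout
    float4 result = rust_cosine_distance(...);
    PG_RETURN_FLOAT4(result);
}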

Then compile:

emcc -o ruvector.wasm ruvector_pglite.c libruvector_core.a \
  -I$PG_INCLUDE -s SIDE_MODULE=1

Option B: Rust with C Wrapper

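// ruvector_core/src/ffi.rs
use std::slice;

/// # Safety
/// `a` and `b` must each point to `len` readable f32 values.
#[no_mangle]
pub unsafe extern "C" fn rust_cosine_distance(
    a: *const f32,
    b: *const f32,
    len: usize,
) -> f32 {
    // Borrow the C buffers as slices; the rest is safe Rust
    let (a, b) = unsafe { (slice::from_raw_parts(a, len), slice::from_raw_parts(b, len)) };
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    1.0 - dot / (norm_a * norm_b) // NaN if either vector has zero norm
}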

Then build:

# Build Rust to WASM staticlib (requires crate-type = ["staticlib"] in Cargo.toml)
cargo build --target wasm32-unknown-emscripten --release

# Link with C wrapper
emcc -o ruvector.wasm wrapper.c \
  target/wasm32-unknown-emscripten/release/libruvector_core.a \
  -I$PG_INCLUDE -s SIDE_MODULE=1

6. Size Targets May Be Unrealistic

Original Target: 500KB-1MB WASM

Reality Check:

  • pgvector (minimal, C-based): ~200KB compiled to WASM
  • Full ruvector features (even stripped): likely 2-5MB
  • Rust std library adds ~100-300KB
  • PostgreSQL runtime overhead: varies

Revised Targets:

  • Minimal (types + distances): ~500KB-1MB ✅
  • With HNSW index: ~1-2MB
  • With quantization: ~2-3MB
  • Full features: 5-10MB (defeats purpose)

7. No TypeScript Plugin API Consideration

Missing Alternative: PGlite's custom plugin API

Instead of a PostgreSQL extension, we could build a TypeScript plugin that provides vector operations via PGlite's namespace API:

// Hybrid approach: TypeScript + WASM compute kernel
import type { Extension, PGliteInterface } from '@electric-sql/pglite'
// wasm-pack emits a JS glue module alongside the .wasm file; this path is illustrative
import init, { cosineDistance } from './ruvector_core.js'

const setup = async (pg: PGliteInterface) => {
  await init() // Initialize WASM

  return {
    namespaceObj: {
      vector: {
        cosineDistance: (a: Float32Array, b: Float32Array) =>
          cosineDistance(a, b), // WASM function
        // Other vector operations...
      }
    }
  }
}

export const ruvector = { name: 'ruvector', setup } satisfies Extension

Usage:

const db = await PGlite.create({ extensions: { ruvector } })

// Use via JavaScript API (not SQL)
const dist = db.ruvector.vector.cosineDistance(vec1, vec2)

Pros:

  • ✅ No Emscripten/PostgreSQL build complexity
  • ✅ Direct WASM (no PostgreSQL FFI overhead)
  • ✅ Easier to build and maintain
  • ✅ Can still use Rust → wasm-bindgen

Cons:

  • ❌ Not SQL-compatible (no SELECT ... ORDER BY embedding <=> $1)
  • ❌ Can't use PostgreSQL indexes
  • ❌ Not a drop-in pgvector replacement

✅ What's ACTUALLY Needed

Corrected Architecture Options

Option 1: Full PostgreSQL Extension (Complex but SQL-compatible)

┌─────────────────────────────────────────┐
│  ruvector-core (Rust library)           │
│  - Vector types, distances, HNSW        │
│  - Compiles to: libruvector_core.a      │
│  - Target: wasm32-unknown-emscripten    │
└─────────────────────────────────────────┘
              ▲
              │ C FFI
              │
┌─────────────┴───────────────────────────┐
│  ruvector_pglite_wrapper.c              │
│  - PostgreSQL extension entry points    │
│  - PG_FUNCTION_INFO_V1 macros           │
│  - Calls Rust via FFI                   │
└─────────────────────────────────────────┘
              │ Emscripten
              ▼
┌─────────────────────────────────────────┐
│  ruvector.tar.gz                        │
│  ├── ruvector.so.wasm                   │
│  ├── ruvector.control                   │
│  └── ruvector--0.1.0.sql                │
└─────────────────────────────────────────┘
              │
              ▼
┌─────────────────────────────────────────┐
│  @ruvector/pglite (TypeScript)          │
│  - Extension loader                     │
│  - Minimal wrapper (like pgvector)      │
└─────────────────────────────────────────┘

Build Process:

  1. Fork PGlite repo
  2. Add ruvector as submodule in postgres-pglite/pglite/
  3. Create Makefile with Emscripten rules
  4. Build Rust core to WASM staticlib
  5. Link with C wrapper
  6. Package to .tar.gz
  7. Create TypeScript loader

Pros: ✅ Full SQL compatibility, ✅ PostgreSQL indexes

Cons: ❌ Complex build, ❌ Large size, ❌ Tight coupling to PGlite

Option 2: Hybrid TypeScript Plugin (Simpler, WASM-native)

┌─────────────────────────────────────────┐
│  ruvector-core (Rust library)           │
│  - Vector operations only               │
│  - No PostgreSQL dependencies           │
│  - wasm-bindgen for JS interop          │
│  - Target: wasm32-unknown-unknown       │
└─────────────────────────────────────────┘
              │ wasm-pack
              ▼
┌─────────────────────────────────────────┐
│  ruvector_core_bg.wasm + .js glue       │
└─────────────────────────────────────────┘
              │
              ▼
┌─────────────────────────────────────────┐
│  @ruvector/pglite (TypeScript plugin)   │
│  - PGlite Extension interface           │
│  - Namespace API for vector ops         │
│  - SQL function wrappers (via exec)     │
└─────────────────────────────────────────┘

Build Process:

# 1. Build Rust to WASM
cd ruvector-core
wasm-pack build --target web

# 2. Create TypeScript wrapper
cd ../npm/packages/pglite
pnpm build

# 3. Publish
pnpm publish

Pros: ✅ Simple build, ✅ Small size, ✅ Easy maintenance

Cons: ❌ Limited SQL integration, ❌ No native indexes
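For the Rust side of Option 2, a minimal sketch of the exported kernel might look like the following. It assumes the crate depends on `wasm-bindgen` when built for wasm32. The `js_name` attribute is what would expose the function to TypeScript as `cosineDistance`. Returning NaN on mismatched lengths is one possible choice, not a decision taken from the plan.

```rust
// Assumption: the crate lists `wasm-bindgen` as a dependency for wasm32 builds.
#[cfg(target_arch = "wasm32")]
use wasm_bindgen::prelude::*;

/// Cosine distance between two equal-length vectors.
/// Exported to JavaScript as `cosineDistance` on wasm32 builds;
/// on other targets it is an ordinary Rust function.
/// Returns NaN when the lengths differ or either vector has zero norm.
#[cfg_attr(target_arch = "wasm32", wasm_bindgen(js_name = cosineDistance))]
pub fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() {
        return f32::NAN;
    }
    let mut dot = 0.0f32;
    let mut norm_a = 0.0f32;
    let mut norm_b = 0.0f32;
    for (x, y) in a.iter().zip(b.iter()) {
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    1.0 - dot / (norm_a.sqrt() * norm_b.sqrt())
}
```

Gating the attribute on the target keeps the kernel testable with a plain `cargo test` on the host, with no WASM toolchain involved.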

Option 3: Minimal SQL Extension (Best Balance)

Start with Option 1 but with minimal features:

  • ✅ Core vector type
  • ✅ Distance operators (<->, <=>, <#>)
  • ❌ Skip HNSW (use flat scan)
  • ❌ Skip quantization
  • ❌ Skip advanced features

Target Size: ~200-500KB (comparable to pgvector)
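The minimal feature set above can be sketched in the Rust core. This sketch assumes the three operators keep pgvector's meanings (`<->` Euclidean distance, `<=>` cosine distance, `<#>` negative inner product) and that a flat scan simply scores every row. `flat_scan` and the other function names are illustrative, not an existing ruvector API.

```rust
/// Euclidean (L2) distance, the metric behind `<->`.
pub fn l2_distance(a: &[f32], b: &[f32]) -> f32 {
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y) * (x - y))
        .sum::<f32>()
        .sqrt()
}

/// Negative inner product, the metric behind `<#>`.
pub fn neg_inner_product(a: &[f32], b: &[f32]) -> f32 {
    -a.iter().zip(b).map(|(x, y)| x * y).sum::<f32>()
}

/// Cosine distance (1 - cosine similarity), the metric behind `<=>`.
pub fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    1.0 - dot / (norm_a * norm_b)
}

/// Flat scan: score every row against the query and return the
/// indices of the `k` nearest rows, closest first.
pub fn flat_scan(
    rows: &[Vec<f32>],
    query: &[f32],
    k: usize,
    dist: fn(&[f32], &[f32]) -> f32,
) -> Vec<usize> {
    let mut scored: Vec<(f32, usize)> = rows
        .iter()
        .enumerate()
        .map(|(i, row)| (dist(row, query), i))
        .collect();
    // total_cmp gives a total order even if a distance is NaN.
    scored.sort_by(|a, b| a.0.total_cmp(&b.0));
    scored.into_iter().take(k).map(|(_, i)| i).collect()
}
```

A flat scan costs one distance evaluation per row, which is the trade-off Option 3 accepts in exchange for skipping HNSW.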


📋 Revised Implementation Checklist

Prerequisites

  • Clone PGlite repo with submodules
  • Install Emscripten SDK (emsdk)
  • Study pgvector's PGlite implementation
  • Understand PGlite's build system

Development

  • Create ruvector-core (no PostgreSQL deps)
  • Add C FFI layer (ffi.rs with #[no_mangle])
  • Write C wrapper (ruvector_wrapper.c)
  • Create extension control file (ruvector.control)
  • Write SQL install script (ruvector--0.1.0.sql)

Building

  • Add as submodule to PGlite
  • Configure Emscripten Makefile
  • Build Rust to WASM staticlib
  • Link with C wrapper using emcc
  • Package to .tar.gz

Testing

  • Write Vitest tests
  • Test in browser environment
  • Benchmark against pgvector
  • Validate SQL compatibility

Publishing

  • Create TypeScript loader
  • Add to PGlite extensions catalog
  • Publish to npm
  • Write documentation

🎯 Recommended Path Forward

Start with Option 2 (TypeScript Plugin) for these reasons:

  1. Immediate Value: Can ship in 1-2 weeks vs 6+ weeks
  2. Learning Path: Understand PGlite before committing to Option 1
  3. Proof of Concept: Validate demand for ruvector-pglite
  4. Simpler: No Emscripten complexity
  5. Upgradeable: Can migrate to Option 1 later if SQL is critical

Then, if SQL compatibility is required, upgrade to Option 1 (Full Extension).


📚 Additional Research Needed

  1. Emscripten + Rust: Best practices for compiling Rust to WASM with emcc
  2. PGlite Build System: Deep dive into their Makefile and build scripts
  3. PostgreSQL C API: Required functions for minimal extension
  4. Memory Management: Emscripten's memory model vs Rust
  5. Size Optimization: Dead code elimination, LTO for WASM


Next Step: Choose an implementation option and update the plan accordingly.