High-Performance Text Embedding Engine for Longbow Vector Database
Fletcher is a pure Go transformer-based embedding engine designed for maximum throughput on commodity hardware. It converts text into dense vector embeddings using state-of-the-art transformer models, with native hardware acceleration for both Apple Silicon (Metal GPU) and x86 CPUs (BLAS).
Fletcher is the vector engine that feeds Longbow, a high-performance distributed vector database. While Longbow handles vector storage, indexing (HNSW), and search, Fletcher focuses exclusively on one thing: converting text to vectors as fast as possible.
- Multi-Model Support: BERT, Nomic-Embed-Text, and custom transformer architectures
- Metal GPU Acceleration: Hand-optimized FP16 kernels for Apple Silicon achieving 24,000+ vec/s
- CGO BLAS Backend: Hardware-accelerated CPU inference via Accelerate (macOS) or OpenBLAS (Linux)
- Modern Transformer Operations: RoPE (Rotary Positional Embeddings), SwiGLU, LayerNorm (see the RoPE sketch after this list)
- Pure Go Implementation: zero Python dependencies anywhere in the inference pipeline
- Apache Arrow Integration: Native Flight protocol for seamless Longbow communication
- Production-Ready: Built-in admission control, concurrent request batching, OpenTelemetry support
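As a rough illustration of one of these operations, here is a minimal, self-contained sketch of RoPE in Go (not Fletcher's actual kernel code; names are illustrative): each consecutive pair of dimensions in a query/key vector is rotated by a position-dependent angle, so that attention scores end up depending only on relative positions.

```go
package main

import (
	"fmt"
	"math"
)

// applyRoPE rotates consecutive dimension pairs (2i, 2i+1) of a
// query/key vector by a position-dependent angle, following the
// standard 10000^(-2i/d) frequency schedule.
func applyRoPE(vec []float32, pos int) []float32 {
	dim := len(vec) // assumed even
	out := make([]float32, dim)
	for i := 0; i < dim; i += 2 {
		theta := float64(pos) * math.Pow(10000, -float64(i)/float64(dim))
		sin, cos := math.Sincos(theta)
		out[i] = vec[i]*float32(cos) - vec[i+1]*float32(sin)
		out[i+1] = vec[i]*float32(sin) + vec[i+1]*float32(cos)
	}
	return out
}

func main() {
	q := []float32{1, 0, 1, 0}
	fmt.Println(applyRoPE(q, 3)) // rotated query at position 3
}
```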
Fletcher significantly outperforms standard PyTorch/SentenceTransformer implementations:
| Metric | Fletcher (Metal) | PyTorch (MPS) | Speedup |
|---|---|---|---|
| Peak Throughput | ~24,200 vec/s | 14,800 vec/s | 1.6x |
| Sustained (500K) | ~21,000 vec/s | 8,200 vec/s | 2.5x |
| Single-Request Latency | 0.48 ms | 4.77 ms | 9.9x |
Benchmark: Apple M3 Pro (12-core), prajjwal1/bert-tiny model, batch size 32.
For detailed benchmarks including memory usage and CPU performance, see Performance Documentation.
Fletcher operates in three modes:
Convert text files or generate embeddings for analysis:
```bash
./fletcher --model nomic-embed-text --gpu --text "Hello world"
```
Serve embeddings via HTTP with concurrent request batching:

```bash
./fletcher --listen :8080 --model bert-tiny --gpu --max-concurrent 16384
```
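The HTTP request shape is not documented in this README; as a purely hypothetical illustration (endpoint path and JSON fields are assumptions, see the API Reference for the actual contract), a request might look like:

```bash
# Hypothetical route and payload; consult the API Reference for the real contract
curl -s -X POST http://localhost:8080/embed \
  -H 'Content-Type: application/json' \
  -d '{"text": "Hello world"}'
```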
High-performance gRPC endpoint using Apache Arrow Flight:

```bash
./fletcher --flight :9090 --model nomic-embed-text --gpu
```

Building from source requires platform-specific toolchains.

macOS: Xcode Command Line Tools
```bash
xcode-select --install
```

Linux: OpenBLAS development libraries

```bash
# Debian/Ubuntu
sudo apt-get install libopenblas-dev

# RHEL/CentOS
sudo yum install openblas-devel
```

Then build the binary:
```bash
# CGO enabled by default for optimal performance
CGO_ENABLED=1 go build -o bin/fletcher ./cmd/fletcher
```

Multi-architecture builds with Metal, CUDA, and CPU backends:
```bash
# CPU-optimized (OpenBLAS)
docker build -f Dockerfile -t fletcher:cpu .

# Metal (Apple Silicon)
docker build -f Dockerfile.metal -t fletcher:metal .

# CUDA (NVIDIA GPUs)
docker build -f Dockerfile.cuda -t fletcher:cuda .
```
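A built image can then be run like the binary, passing Fletcher's flags through (this assumes the image's entrypoint is the fletcher binary, which is not stated in this README):

```bash
# Assumes the image entrypoint is the fletcher binary
docker run --rm -p 8080:8080 fletcher:cpu --listen :8080 --model bert-tiny \
  --vocab vocab.txt --weights bert.bin
```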
Example CLI invocations:

```bash
# Embed text with GPU acceleration
./fletcher --model bert-tiny --gpu --vocab vocab.txt --weights bert.bin --text "Machine learning is fascinating"

# Generate 1000 Lorem Ipsum test embeddings
./fletcher --vocab vocab.txt --lorem 1000 --gpu

# Send embeddings to Longbow database
./fletcher --vocab vocab.txt --lorem 100 --server localhost:3000 --dataset my_vectors
```
```bash
# Start HTTP server
./fletcher --listen :8080 --model nomic-embed-text --gpu --max-vram 4GB

# Start Flight server for Arrow RPC
./fletcher --flight :9090 --vocab vocab.txt --weights nomic.bin
```
```bash
# Run sustained load test for 10 minutes
./fletcher --duration 10m --lorem 10000 --gpu
```

Available flags:

| Flag | Default | Description |
|---|---|---|
| `--model` | `bert-tiny` | Model architecture (`bert-tiny`, `nomic-embed-text`) |
| `--gpu` | `false` | Enable Metal GPU acceleration (macOS only) |
| `--vocab` | `vocab.txt` | Path to BERT-style WordPiece vocabulary |
| `--weights` | (required) | Path to model weights binary |
| `--precision` | `fp32` | Compute precision (`fp32`, `fp16`) |
| `--listen` | (none) | HTTP server address (e.g., `:8080`) |
| `--flight` | (none) | Flight server address (e.g., `:9090`) |
| `--server` | (none) | Longbow server endpoint |
| `--dataset` | `fletcher_dataset` | Target dataset name in Longbow |
| `--max-concurrent` | `16384` | Max concurrent embeddings in flight |
| `--max-vram` | `4GB` | VRAM admission control limit |
| `--transport-fmt` | `fp32` | Transport format (`fp32`, `fp16`) |
| `--otel` | `false` | Enable OpenTelemetry tracing |
Fletcher and Longbow communicate via Apache Arrow Flight for zero-copy data transfer:
```bash
# Terminal 1: Start Longbow server
longbow serve --port 3000

# Terminal 2: Generate and stream embeddings
./fletcher --vocab vocab.txt --lorem 10000 --server localhost:3000 --dataset documents
```

Fletcher outputs Apache Arrow record batches with the following schema:
```
{
  "text": string,
  "embedding": fixed_size_list<float32>[dim]
}
```
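For illustration, here is a minimal sketch (not Fletcher's actual code) that builds one record batch with this schema using the Apache Arrow Go library; the dimension of 384 is an assumption, since the real width depends on the model:

```go
package main

import (
	"fmt"

	"github.com/apache/arrow/go/v15/arrow"
	"github.com/apache/arrow/go/v15/arrow/array"
	"github.com/apache/arrow/go/v15/arrow/memory"
)

func main() {
	const dim = 384 // assumed embedding width; Fletcher's depends on the model

	// Mirror the schema above: text string, embedding fixed_size_list<float32>[dim].
	schema := arrow.NewSchema([]arrow.Field{
		{Name: "text", Type: arrow.BinaryTypes.String},
		{Name: "embedding", Type: arrow.FixedSizeListOf(dim, arrow.PrimitiveTypes.Float32)},
	}, nil)

	b := array.NewRecordBuilder(memory.DefaultAllocator, schema)
	defer b.Release()

	textB := b.Field(0).(*array.StringBuilder)
	embB := b.Field(1).(*array.FixedSizeListBuilder)
	valB := embB.ValueBuilder().(*array.Float32Builder)

	// One row: the text plus a zero vector standing in for a real embedding.
	textB.Append("Hello world")
	embB.Append(true)
	valB.AppendValues(make([]float32, dim), nil)

	rec := b.NewRecord()
	defer rec.Release()
	fmt.Println("rows:", rec.NumRows())
}
```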
- Usage Guide - Detailed CLI and server usage
- Model Support - Supported architectures and weights format
- GPU Acceleration - Metal kernel implementation details
- Performance Benchmarks - Comprehensive throughput analysis
- API Reference - HTTP and Flight API specifications
Run the test suite:

```bash
# Unit tests
go test -tags metal ./...

# With race detection
go test -tags metal -race ./...

# Coverage
go test -tags metal -coverprofile=coverage.out ./...
```
```bash
# CPU profiling
./fletcher --cpuprofile cpu.pprof --lorem 10000
go tool pprof cpu.pprof

# Memory profiling with pprof server
./fletcher --listen :8080 --gpu
# Visit http://localhost:8080/debug/pprof
```
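The /debug/pprof endpoints are Go's standard net/http/pprof handlers; a server exposes them simply by importing the package for its side effects. This is a generic sketch of that standard mechanism, not Fletcher's source:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)

func main() {
	// Any server using DefaultServeMux now serves the profiling endpoints.
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```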
```
longbow-fletcher/
├── cmd/fletcher/       # CLI entry point
├── internal/
│   ├── embeddings/     # Embedding engine core
│   ├── device/         # Metal/CPU backend abstraction
│   ├── tokenizer/      # WordPiece tokenizer
│   ├── model/          # Transformer architecture
│   ├── client/         # Arrow Flight client
│   └── server/         # HTTP/Flight servers
├── scripts/            # Benchmark and test scripts
├── helm/               # Kubernetes deployment
└── docs/               # Documentation
```
MIT License - See LICENSE for details.
If you find this project useful, please consider sponsoring to support continued development.
- Longbow - Distributed vector database
- Longbow-Archer - HNSW index implementation
- Longbow-Quarrel - LLM inference engine with Metal backend