Description
We want to use Harald to call on shared knowledge and on the other Heralds. To get there, Harald should help with:
Rust code using hnsw_rs for Approximate Nearest Neighbors (ANN)
Crafting JSONL datasets from the Marvel API
Building a repeatable data‑training pipeline so all Heralds share the same Marvel‑characters knowledge
🎯 Objectives
Fetch & Format Data
Write a Rust module to call the Marvel API and emit JSONL entries (one JSON object per line)
Generate Embeddings & Index
Convert each JSONL entry into a vector with an embedding model, then use hnsw_rs to build an HNSW index over those vectors
Orchestrate with Harald
Create CLI commands or scripts in Harald that:
Run the fetch/JSONL step
Build or update the ANN index
Trigger training tasks for the other Herald personas
Document & Teach
Include a plain‑language guide on:
What “ANN” means (Approximate Nearest Neighbors)
How HNSW works under the hood
Common distance metrics (L1, L2, Cosine, Jaccard, etc.)
🔍 Background
What is ANN?
“ANN” stands for Approximate Nearest Neighbors. Instead of exhaustively comparing the query against every data point, an ANN index uses approximate, sub‑linear search to quickly find close vectors in high‑dimensional space, trading a small amount of accuracy for a large speedup.
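For reference, the exhaustive comparison that ANN avoids can be sketched in a few lines. This is only an illustrative baseline: the function names `squared_l2` and `exact_knn` are made up for this issue and are not part of hnsw_rs.

```rust
/// Squared Euclidean distance between two equal-length vectors.
fn squared_l2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Exact k-nearest-neighbour search by linear scan (hypothetical helper).
/// Every stored point is compared with the query, so the cost grows
/// linearly with the number of points. Returns indices, nearest first.
fn exact_knn(points: &[Vec<f32>], query: &[f32], k: usize) -> Vec<usize> {
    let mut scored: Vec<(usize, f32)> = points
        .iter()
        .enumerate()
        .map(|(i, p)| (i, squared_l2(p, query)))
        .collect();
    // Sort by distance, smallest first.
    scored.sort_by(|a, b| a.1.total_cmp(&b.1));
    scored.into_iter().take(k).map(|(i, _)| i).collect()
}
```

An ANN index aims to return (almost) the same indices as `exact_knn` while visiting only a small fraction of the points.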
How does hnsw_rs work?
Implements the Hierarchical Navigable Small World (HNSW) graph from Malkov & Yashunin. Sparse upper layers give long jumps across the data and the dense bottom layer holds every point, letting a search “zoom in” on the query’s neighborhood layer by layer.
Supports these built‑in distances:
Numeric vectors: L1, L2, Cosine, Jaccard, Hamming
Specialized: Levenshtein on u16, Hellinger, Jeffreys divergence (symmetrized KL), Jensen‑Shannon (metric)
Custom: Implement your own distance via the crate’s Distance trait (for any element type T: Serialize + Clone + Send + Sync)
Has SIMD optimizations for common cases
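To make the numeric distances concrete, here is a minimal plain-Rust sketch of L1, L2, and cosine distance. These are textbook definitions written for illustration, not the hnsw_rs implementations; the function names are hypothetical, and returning 1.0 for a zero-length vector in `cosine_distance` is an assumption made here.

```rust
/// L1 (Manhattan) distance: sum of absolute coordinate differences.
fn l1(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).sum()
}

/// L2 (Euclidean) distance: square root of the summed squared differences.
fn l2(a: &[f32], b: &[f32]) -> f32 {
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y) * (x - y))
        .sum::<f32>()
        .sqrt()
}

/// Cosine distance, defined here as 1 - cosine similarity.
/// Assumption: if either vector has zero norm, return 1.0.
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 1.0;
    }
    1.0 - dot / (norm_a * norm_b)
}
```

L1 and L2 depend on vector magnitude, while cosine distance only depends on direction, which is why cosine is a common choice for text embeddings.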
🛠️ Tasks
Marvel Data Fetcher
Create src/marvel_fetch.rs that:
Reads MARVEL_PUBLIC_KEY & MARVEL_PRIVATE_KEY
Paginates through character endpoints
Writes out characters.jsonl
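The two mechanical pieces of the fetcher, paging and JSONL output, could look roughly like the sketch below. It assumes offset/limit-style paging and uses only the standard library; the helper names and the three fields (`id`, `name`, `description`) are illustrative choices, and a real module would more likely use an HTTP client and `serde_json` than hand-rolled escaping.

```rust
/// Offsets to request when paging through `total` records, `limit` at a time.
/// (Hypothetical helper; assumes the API accepts `offset` and `limit`.)
fn page_offsets(total: usize, limit: usize) -> Vec<usize> {
    assert!(limit > 0, "limit must be positive");
    (0..total).step_by(limit).collect()
}

/// Escape a string so it can sit inside a JSON string literal.
fn escape_json(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters become \u00XX escapes.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Render one character as a single JSONL line (no trailing newline).
/// Newlines inside fields are escaped, so the record stays on one line.
fn character_to_jsonl(id: u64, name: &str, description: &str) -> String {
    format!(
        "{{\"id\":{},\"name\":\"{}\",\"description\":\"{}\"}}",
        id,
        escape_json(name),
        escape_json(description)
    )
}
```

Writing `characters.jsonl` is then a loop over the fetched pages that writes each `character_to_jsonl(...)` result followed by `\n`.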
Embedding & Index Builder
In src/indexer.rs:
Load characters.jsonl
Use your favorite embedding model (local or API)
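Insert vectors into hnsw_rs::Hnsw, persist the index with file_dump(), and reload it on later runs (the reload entry point depends on the crate version, so check the hnsw_rs docs)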
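Until an embedding model is chosen, the indexer can be developed against a stand-in. The sketch below is a toy feature-hashing embedding, not a real language model: it only captures word overlap, and both `fnv1a` and `toy_embed` are hypothetical names. Its purpose is to produce deterministic, fixed-length `Vec<f32>` values that can be fed to the index while the rest of the pipeline is built.

```rust
/// 64-bit FNV-1a hash of a byte string (deterministic across runs).
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for b in bytes {
        h ^= *b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// Toy embedding (placeholder for a real model): hash each lowercased word
/// into one of `dim` buckets, count occurrences, then L2-normalise.
fn toy_embed(text: &str, dim: usize) -> Vec<f32> {
    assert!(dim > 0, "dim must be positive");
    let mut v = vec![0.0f32; dim];
    let words = text
        .split(|c: char| !c.is_alphanumeric())
        .filter(|w| !w.is_empty());
    for word in words {
        let idx = (fnv1a(word.to_lowercase().as_bytes()) % dim as u64) as usize;
        v[idx] += 1.0;
    }
    // Normalise to unit length; an empty text stays the zero vector.
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in &mut v {
            *x /= norm;
        }
    }
    v
}
```

Swapping in a real model later should only require replacing `toy_embed` with a function of the same shape (text in, fixed-length vector out).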
Harald Orchestration Scripts
Add commands to Harald’s CLI:
herald run fetch-marvel # generates JSONL
herald run build-index # creates/updates HNSW index
herald run train-heralds # kicks off the Herald training pipeline
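One possible shape for the dispatch behind these commands, using only the standard library. The `Task` enum and `parse_command` function are hypothetical names; an actual CLI might use an argument-parsing crate instead.

```rust
/// The three pipeline steps listed above (hypothetical type).
#[derive(Debug, PartialEq)]
enum Task {
    FetchMarvel,
    BuildIndex,
    TrainHeralds,
}

/// Map the arguments after the program name to a task.
/// `args` is expected to look like ["run", "fetch-marvel"].
fn parse_command(args: &[&str]) -> Result<Task, String> {
    match args {
        ["run", "fetch-marvel"] => Ok(Task::FetchMarvel),
        ["run", "build-index"] => Ok(Task::BuildIndex),
        ["run", "train-heralds"] => Ok(Task::TrainHeralds),
        ["run", other] => Err(format!("unknown task: {other}")),
        _ => Err("usage: herald run <fetch-marvel|build-index|train-heralds>".to_string()),
    }
}
```

In a binary, `std::env::args().skip(1)` would be collected and passed to `parse_command`, with each `Task` variant calling into the matching module.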
Plain‑English Guide
Write a new docs/ANN-and-hnsw_rs.md covering:
ANN fundamentals
Overview of HNSW layers and graph navigation
Breakdown of supported distance metrics
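For the graph-navigation part of the guide, a simplified single-layer greedy walk may be a useful illustration. This is a teaching sketch, not the hnsw_rs code: full HNSW repeats a search like this from the top layer down and keeps a list of several candidates rather than one. The names `greedy_search`, `points`, and `neighbours` are made up here.

```rust
/// Squared Euclidean distance between two equal-length vectors.
fn squared_l2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Greedy walk on one graph layer.
/// `neighbours[i]` lists the nodes linked to node `i`.
/// Starting at `entry`, move to the linked node closest to `query`
/// and stop when no linked node is closer than the current one.
fn greedy_search(
    points: &[Vec<f32>],
    neighbours: &[Vec<usize>],
    entry: usize,
    query: &[f32],
) -> usize {
    let mut current = entry;
    let mut best = squared_l2(&points[current], query);
    loop {
        let mut improved = false;
        for &n in &neighbours[current] {
            let d = squared_l2(&points[n], query);
            if d < best {
                best = d;
                current = n;
                improved = true;
            }
        }
        if !improved {
            // Local minimum: no neighbour is closer to the query.
            return current;
        }
    }
}
```

Only the neighbours along the walk are compared with the query, which is where the speedup over a full scan comes from. The walk can stop at a local minimum that is not the true nearest point, which is the "approximate" part of ANN.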
Automated Tests
Add unit tests to verify:
JSONL format validity
Vectors round‑trip through the index (querying with a character’s own vector returns that character)
Performance benchmarks on a sample subset
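For the round-trip and benchmark checks, one common quality measure is recall: the fraction of the true nearest neighbours that the index actually returns. A minimal sketch, assuming both inputs are lists of distinct character ids (the function name `recall_at_k` is illustrative):

```rust
/// Fraction of the exact nearest-neighbour ids that also appear in the
/// approximate result. Assumes each slice holds distinct ids.
/// Returns 1.0 when there is nothing to find.
fn recall_at_k(exact: &[usize], approx: &[usize]) -> f32 {
    if exact.is_empty() {
        return 1.0;
    }
    let hits = approx.iter().filter(|&&id| exact.contains(&id)).count();
    hits as f32 / exact.len() as f32
}
```

A test could compute the exact neighbours by brute force on the sample subset, query the HNSW index with the same vectors, and assert that the average recall stays above a chosen threshold.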
💡 Future Ideas
Marvel API Query + Retrain
Give Harald the ability to periodically call the Marvel API, detect new or updated characters, and refresh the knowledge base automatically.
Pop‑Culture Expansion
Enable Harald to pull data from IMDb or Amazon Music to teach the HeraldStack about movies, shows, and music that align with Bryan’s interests and pop‑culture AI references.
🔗 References
hnsw_rs Cargo.toml on docs.rs: https://docs.rs/crate/hnsw_rs/latest/source/Cargo.toml
HNSW ANN paper (arXiv): https://arxiv.org/abs/1603.09320
Pinecone HNSW overview: https://www.pinecone.io/learn/series/faiss/hnsw/