HeraldStack Hierarchical Navigable Small World graphs Expertise #9

@BryanChasko

We want Harald to be able to call on shared knowledge and on the other Heralds. To get there, this issue covers:

Rust code using hnsw_rs for Approximate Nearest Neighbors (ANN)

Crafting JSONL datasets from the Marvel API

Building a repeatable data‑training pipeline so all Heralds share the same Marvel‑characters knowledge

🎯 Objectives
Fetch & Format Data

Write a Rust module to call the Marvel API and emit JSONL entries (one JSON object per line)

Generate Embeddings & Index

Convert each JSONL entry into a vector with an embedding model, then use hnsw_rs to build an HNSW index over those vectors

Orchestrate with Harald

Create CLI commands or scripts in Harald that:

Run the fetch/JSONL step

Build or update the ANN index

Trigger training tasks for the other Herald personas

Document & Teach

Include a plain‑language guide on:

What “ANN” means (Approximate Nearest Neighbors)

How HNSW works under the hood

Common distance metrics (L1, L2, Cosine, Jaccard, etc.)

🔍 Background
What is ANN?
“ANN” stands for Approximate Nearest Neighbors. Instead of exhaustively comparing a query against every data point, an ANN index uses search structures whose query cost grows sub-linearly with dataset size. It finds the closest vectors in high-dimensional space quickly, trading a small loss in recall (an occasional true neighbor is missed) for much faster lookups.
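
For reference, here is a minimal sketch of the exhaustive search that ANN avoids. It scans every stored vector, so cost grows linearly with the dataset. The function names are made up for this issue and nothing here comes from hnsw_rs; it could also serve as the ground truth when checking ANN results.

```rust
/// Squared Euclidean (L2) distance between two equal-length vectors.
fn l2_squared(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Exact k-nearest-neighbour search by scanning every stored vector.
/// Returns (index, squared distance) pairs, closest first.
fn brute_force_knn(data: &[Vec<f32>], query: &[f32], k: usize) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = data
        .iter()
        .enumerate()
        .map(|(i, v)| (i, l2_squared(v, query)))
        .collect();
    // total_cmp gives a total order on f32, so a NaN cannot panic the sort.
    scored.sort_by(|a, b| a.1.total_cmp(&b.1));
    scored.truncate(k);
    scored
}
```

Calling `brute_force_knn(&data, &query, 5)` returns the five closest stored vectors; an ANN index aims to return mostly the same ids without visiting every vector.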

How does hnsw_rs work?
Implements the Hierarchical Navigable Small World graph from Malkov & Yashunin. Sparse upper layers allow long hops across the dataset, and each lower layer is denser, letting you “zoom in” on your query’s neighborhood layer by layer.

Supports these built‑in distances:

Numeric vectors: L1, L2, Cosine, Jaccard, Hamming

Specialized: Levenshtein on u16, Hellinger, Jeffreys divergence (symmetrized KL), Jensen‑Shannon (metric)

Custom: Implement your own by implementing the crate's Distance<T> trait (for any T: Serialize + Clone + Send + Sync)

Has SIMD optimizations for common cases
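
To make the layered navigation concrete, here is a toy sketch of the search descent only. It is not hnsw_rs code and it skips index construction. It also assumes a single candidate per layer, whereas the paper's algorithm keeps a candidate list (size `ef`) at the bottom layer. The adjacency-list layout and function names are illustrative assumptions.

```rust
/// Squared L2 distance between two equal-length vectors.
fn dist(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Greedy walk on one layer: keep moving to a closer neighbour until none is closer.
/// `adj[i]` lists the neighbours of point `i` on this layer (empty if `i` is absent).
fn greedy_layer(points: &[Vec<f32>], adj: &[Vec<usize>], query: &[f32], start: usize) -> usize {
    let mut current = start;
    let mut best = dist(&points[current], query);
    loop {
        let mut improved = false;
        for &n in &adj[current] {
            let d = dist(&points[n], query);
            if d < best {
                best = d;
                current = n;
                improved = true;
            }
        }
        if !improved {
            return current;
        }
    }
}

/// Descend through the layers (ordered top to bottom), using each layer's
/// result as the starting point for the next, denser layer.
fn layered_search(
    points: &[Vec<f32>],
    layers: &[Vec<Vec<usize>>],
    query: &[f32],
    entry: usize,
) -> usize {
    layers
        .iter()
        .fold(entry, |start, adj| greedy_layer(points, adj, query, start))
}
```

The top layer does the coarse jump and the bottom layer does the fine adjustment, which is why a query touches far fewer points than a full scan.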

🛠️ Tasks
Marvel Data Fetcher

Create src/marvel_fetch.rs that:

Reads MARVEL_PUBLIC_KEY & MARVEL_PRIVATE_KEY

Paginates through character endpoints

Writes out characters.jsonl
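
A rough, std-only sketch of two pieces the fetcher needs: computing page offsets and turning one character into a JSONL line. It assumes offset/limit style paging and a record with `id`, `name`, and `description` fields; check both against the Marvel API docs. The HTTP call and request signing are left out, and a real module would more likely use serde_json than the hand-rolled escaping shown here.

```rust
/// Offsets needed to page through `total` records, `limit` records per request.
fn page_offsets(total: usize, limit: usize) -> Vec<usize> {
    assert!(limit > 0, "limit must be positive");
    (0..total).step_by(limit).collect()
}

/// Escape a string so it can sit inside a JSON string literal.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters must be \u-escaped in JSON.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// One character as a single JSONL line (no trailing newline).
/// The three fields are an assumed subset of the API's character record.
fn character_to_jsonl(id: u64, name: &str, description: &str) -> String {
    format!(
        "{{\"id\":{},\"name\":\"{}\",\"description\":\"{}\"}}",
        id,
        json_escape(name),
        json_escape(description)
    )
}
```

Because newlines inside field values are escaped, each record stays on one physical line, which is the property JSONL readers rely on.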

Embedding & Index Builder

In src/indexer.rs:

Load characters.jsonl

Use your favorite embedding model (local or API)

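Insert vectors into hnsw_rs::Hnsw, persist the index with file_dump(), and reload it through the loader in the crate's hnswio module (check the exact call for the pinned hnsw_rs version)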
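
Until an embedding model is chosen, a deterministic stand-in can help wire up and test the indexer. The sketch below is a hashed bag-of-words placeholder, not a real embedding model and not part of hnsw_rs; `toy_embed` and its `dim` parameter are hypothetical names. It only guarantees that identical text maps to the identical unit-length vector.

```rust
/// FNV-1a 64-bit hash, written out so results are stable across Rust versions.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in bytes {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// Placeholder embedding: count lowercase tokens into `dim` hashed buckets,
/// then scale the vector to unit L2 length. Empty text yields all zeros.
fn toy_embed(text: &str, dim: usize) -> Vec<f32> {
    assert!(dim > 0, "dim must be positive");
    let mut v = vec![0.0f32; dim];
    for token in text.split_whitespace() {
        let bucket = (fnv1a(token.to_lowercase().as_bytes()) % dim as u64) as usize;
        v[bucket] += 1.0;
    }
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in &mut v {
            *x /= norm;
        }
    }
    v
}
```

Swapping this out for a real model later should only change the function that produces the `Vec<f32>`; the index-building code can stay the same.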

Harald Orchestration Scripts

Add commands to Harald’s CLI:

herald run fetch-marvel # generates JSONL
herald run build-index # creates/updates HNSW index
herald run train-heralds # kicks off the Herald training pipeline
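
One possible shape for the dispatch behind those commands, as a std-only sketch. The `Task` enum and `parse_command` are hypothetical names, and it assumes the arguments arrive with the program name already stripped; a real CLI might use an argument-parsing crate instead.

```rust
/// The three pipeline steps named in this issue.
#[derive(Debug, PartialEq)]
enum Task {
    FetchMarvel,
    BuildIndex,
    TrainHeralds,
}

/// Map `run <task>` arguments onto a Task, or return a message for the user.
fn parse_command(args: &[&str]) -> Result<Task, String> {
    match args {
        ["run", "fetch-marvel"] => Ok(Task::FetchMarvel),
        ["run", "build-index"] => Ok(Task::BuildIndex),
        ["run", "train-heralds"] => Ok(Task::TrainHeralds),
        ["run", other] => Err(format!("unknown task: {other}")),
        _ => Err("usage: herald run <task>".to_string()),
    }
}
```

Keeping parsing separate from execution makes the command table easy to unit test without touching the network or the index.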
Plain‑English Guide

Write a new docs/ANN-and-hnsw_rs.md covering:

ANN fundamentals

Overview of HNSW layers and graph navigation

Breakdown of supported distance metrics
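
For the metrics breakdown, textbook definitions of a few of the distances could be shown like this. These are plain reference implementations for the guide, not the hnsw_rs ones; the crate's versions may normalise differently (for example Hamming or Jaccard), so treat the exact conventions as something to confirm against its docs.

```rust
/// L1 (Manhattan): sum of absolute coordinate differences.
fn l1(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b.iter()).map(|(&x, &y)| (x - y).abs()).sum()
}

/// L2 (Euclidean): straight-line distance.
fn l2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b.iter()).map(|(&x, &y)| (x - y) * (x - y)).sum::<f32>().sqrt()
}

/// Cosine distance: 1 - cos(angle). Ignores vector length.
/// Returns NaN if either vector is all zeros.
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b.iter()).map(|(&x, &y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    1.0 - dot / (na * nb)
}

/// Weighted Jaccard distance for non-negative vectors:
/// 1 - sum(min) / sum(max). Two all-zero vectors count as identical.
fn jaccard_distance(a: &[f32], b: &[f32]) -> f32 {
    let mins: f32 = a.iter().zip(b.iter()).map(|(&x, &y)| x.min(y)).sum();
    let maxs: f32 = a.iter().zip(b.iter()).map(|(&x, &y)| x.max(y)).sum();
    if maxs == 0.0 { 0.0 } else { 1.0 - mins / maxs }
}

/// Hamming distance: number of positions where the two sequences differ.
fn hamming(a: &[u8], b: &[u8]) -> usize {
    a.iter().zip(b.iter()).filter(|(x, y)| x != y).count()
}
```

A quick intuition for the guide: L1 and L2 care about magnitude, cosine only about direction, and Jaccard and Hamming about overlap between set-like or bit-like data.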

Automated Tests

Add unit tests to verify:

JSONL format validity

Vectors round-trip through the index (querying with a character's own vector returns that character as the top hit)

Performance benchmarks on a sample subset
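
Since ANN results are approximate, the benchmarks probably want a quality number next to the timings. A common one is recall: the share of true nearest neighbours that the index actually returned. A minimal sketch, assuming both result lists are character ids and that the exact list comes from a brute-force scan (`recall_at_k` is a made-up helper name):

```rust
use std::collections::HashSet;

/// Fraction of the exact top-k ids that also appear in the approximate result.
/// Returns 1.0 when there is nothing to find.
fn recall_at_k(approx: &[usize], exact: &[usize]) -> f32 {
    if exact.is_empty() {
        return 1.0;
    }
    let truth: HashSet<usize> = exact.iter().copied().collect();
    // Deduplicate the approximate ids so a repeated hit is not counted twice.
    let found: HashSet<usize> = approx.iter().copied().collect();
    found.intersection(&truth).count() as f32 / exact.len() as f32
}
```

A test could then assert that average recall over a sample of queries stays above a threshold the team agrees on.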

💡 Future Ideas
Marvel API Query + Retrain
Give Harald the ability to periodically call the Marvel API, detect new or updated characters, and refresh the knowledge base automatically.

Pop‑Culture Expansion
Enable Harald to pull data from IMDB or Amazon Music to teach the HeraldStack about movies, shows, and music that align with Bryan’s interests and pop‑culture AI references.

🔗 References
hnsw_rs on docs.rs: https://docs.rs/crate/hnsw_rs/latest/source/Cargo.toml

HNSW ANN paper (arXiv): https://arxiv.org/abs/1603.09320

Pinecone HNSW overview: https://www.pinecone.io/learn/series/faiss/hnsw/
