Local-first knowledge retrieval system for Obsidian notes using DuckDB with vector search. See the accompanying blog post.
Try the Live Demo → explore.ssp.sh
Browse, search, and discover hidden connections in my public Data Engineering Second Brain (source). This repo contains both the ingestion pipeline (CLI) and the web-app source code.
- Semantic search: Find notes by meaning using BGE-M3 embeddings (1024-dim)
- Backlinks: Find all notes linking to a specific note
- Graph traversal: Discover notes N hops away via wikilinks
- Hidden connections: Surface semantically similar notes that aren't directly linked
- Shared tags (Hyperedge): Find notes sharing multiple tags via hyperedge graph
- Graph-boosted search: Combine semantic similarity with graph connectivity
- Your notes never leave your machine: Everything runs locally with DuckDB and the BAAI/bge-m3 embedding model.
```bash
# Install dependencies
uv sync

# Configure your vault path
cp .env.example .env
# Edit .env and set VAULT_PATH to your Obsidian vault location
```

`.env` configuration:

```bash
# Path to your Obsidian vault
VAULT_PATH=/path/to/your/obsidian/vault

# Database file location (default: second_brain.duckdb)
DB_PATH=second_brain.duckdb
```

```bash
make ingest
# or pass the path directly:
uv run python scripts/ingest.py /path/to/obsidian/vault
```

```bash
# Semantic search
uv run python scripts/query.py semantic "data modeling best practices" --limit 10

# Find backlinks
uv run python scripts/query.py backlinks "note-slug"

# Graph connections (1-3 hops)
uv run python scripts/query.py connections "note-slug" --hops 2

# Hidden connections (similar but unlinked)
uv run python scripts/query.py hidden "search query" --seed "note-slug"

# Shared tags (hyperedge query) - find notes sharing 2+ tags
uv run python scripts/query.py shared-tags "note-slug" --min-shared 2

# Graph-boosted search - semantic + graph connectivity boost
uv run python scripts/query.py graph-boosted "search query" --seed "note-slug" --boost 1.2

# Raw SQL
uv run python scripts/query.py sql "SELECT title, slug FROM notes LIMIT 10"
```

```bash
make test-all  # Run all test queries
make stats     # Show database statistics
```

| Table | Description |
|---|---|
| notes | Note metadata, content, frontmatter |
| links | Wikilink graph edges |
| chunks | Chunked content for RAG retrieval |
| embeddings | 1024-dim vectors (BAAI/bge-m3) |
| hyperedges | Multiway relations (tags, folders) |
| hyperedge_members | Note membership in hyperedges |
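A shared-tags lookup over the `hyperedges` and `hyperedge_members` tables can be sketched in plain Python, treating each tag as one hyperedge. This is an illustration of the data model, not the repo's actual SQL:

```python
from collections import Counter

def shared_tags(members: list[tuple[str, str]], note: str,
                min_shared: int = 2) -> dict[str, int]:
    """members: (hyperedge_id, note_slug) rows from hyperedge_members.
    Return {slug: shared_count} for notes sharing at least
    min_shared hyperedges (tags) with `note`."""
    note_tags = {h for h, n in members if n == note}
    counts = Counter(n for h, n in members if h in note_tags and n != note)
    return {n: c for n, c in counts.items() if c >= min_shared}
```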
- DuckDB with VSS extension (vector similarity search)
- sentence-transformers for embeddings
- Python 3.11+
The system uses two distinct data sources that are combined in different ways:
During ingestion, the parser extracts all [[wikilinks]] from your Obsidian notes using regex:
```
[[Target Note]]         → link from current note to "target-note"
[[Target Note|display]] → same, with custom display text
[[Note#Section]]        → link with heading anchor (anchor stripped)
```

These are stored in the `links` table as directed edges:

```
source_slug: "my-note" → target_slug: "target-note"
```
This creates a graph of your knowledge base based on explicit connections you made.
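The extraction step can be sketched with a small regex. This is a simplified illustration; the exact pattern and slug rules in `scripts/ingest.py` are assumptions:

```python
import re

# Matches [[Target]], [[Target|display]], and [[Target#Section]]
# (the heading anchor is stripped by the non-capturing group).
WIKILINK_RE = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|([^\]]+))?\]\]")

def slugify(title: str) -> str:
    # Assumed slug rule: lowercase, spaces to hyphens
    return title.strip().lower().replace(" ", "-")

def extract_links(source_slug: str, markdown: str) -> list[tuple[str, str]]:
    """Return (source_slug, target_slug) edges for every wikilink."""
    return [(source_slug, slugify(m.group(1)))
            for m in WIKILINK_RE.finditer(markdown)]
```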
Each note is split into chunks (~512 tokens), and each chunk is converted to a 1024-dimensional vector using BAAI/bge-m3. These vectors capture semantic meaning: similar concepts have vectors that are close together in vector space.

```
"functional programming principles" → [0.23, -0.45, 0.12, ...] (1024 floats)
"FP best practices in Python"       → [0.21, -0.42, 0.15, ...] (similar vector)
```
DuckDB's VSS extension with HNSW index enables fast cosine similarity search across all embeddings.
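"Close together" here means high cosine similarity. A minimal stdlib sketch, with toy 4-dim vectors standing in for the real 1024-dim bge-m3 embeddings (values are illustrative):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """dot(a, b) / (|a| * |b|); 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for real embedding vectors
fp_principles = [0.23, -0.45, 0.12, 0.80]
fp_python = [0.21, -0.42, 0.15, 0.78]
unrelated = [-0.90, 0.10, 0.70, -0.20]
```

The two functional-programming vectors score much closer to each other than to the unrelated one, which is exactly what the index exploits.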
1. Backlinks (Graph only)
How it works: A pure SQL query on the links table that finds all notes whose target_slug matches your query.
Example:
```bash
uv run python scripts/query.py backlinks "functional-data-engineering"
```

```sql
SELECT source_slug, link_text FROM links
WHERE target_slug = 'functional-data-engineering';
```

Result: notes that explicitly link to Functional Data Engineering:
- Python → linked as "Functional Data Engineering"
- Maxime Beauchemin → linked as "Functional Data Engineering"
- Idempotency → linked as "Functional Data Engineering"
These are connections you made via [[Functional Data Engineering]] wikilinks.
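In memory, the same lookup is just a filter over link rows. A sketch, with the column order assumed from the schema description:

```python
def backlinks(links: list[tuple[str, str, str]], target: str) -> list[tuple[str, str]]:
    """links rows are (source_slug, target_slug, link_text);
    return (source_slug, link_text) for every note linking to `target`."""
    return [(src, text) for src, dst, text in links if dst == target]
```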
2. Graph Connections (Graph only)
How it works: Recursive graph traversal using a SQL CTE that follows wikilink edges N hops out from a starting note.
Example:
```bash
uv run python scripts/query.py connections "python" --hops 2
```

```sql
WITH RECURSIVE connected AS (
    SELECT target_slug AS slug, 1 AS hop FROM links WHERE source_slug = 'python'
    UNION
    SELECT l.target_slug, c.hop + 1 FROM connected c
    JOIN links l ON l.source_slug = c.slug WHERE c.hop < 2
)
SELECT DISTINCT slug, hop FROM connected;
```

Result:
- Hop 1: Notes that Python links to (Functional Data Engineering, Data Engineering, etc.)
- Hop 2: Notes those link to (Airflow, Dagster, etc.)
No embeddings involved - purely following your wikilink graph.
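The recursive CTE is equivalent to a breadth-first search over the link edges. A sketch (the `hop < N` cutoff semantics are assumed to match the CTE; BFS reports the minimum hop count per note):

```python
from collections import deque

def connections(links: list[tuple[str, str]], start: str, max_hops: int) -> dict[str, int]:
    """BFS over directed (source, target) wikilink edges; returns
    {slug: hop} for every note reachable within max_hops."""
    adjacency: dict[str, list[str]] = {}
    for src, dst in links:
        adjacency.setdefault(src, []).append(dst)
    seen = {start}
    hops: dict[str, int] = {}
    queue = deque([(start, 0)])
    while queue:
        node, hop = queue.popleft()
        if hop >= max_hops:
            continue
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                hops[nxt] = hop + 1
                queue.append((nxt, hop + 1))
    return hops
```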
3. Semantic Search (Embeddings only)
How it works:
- Your query text is embedded into a 1024-dim vector
- DuckDB finds chunks with highest cosine similarity via HNSW index
- Returns notes containing those chunks
Example:
```bash
uv run python scripts/query.py semantic "data modeling best practices"
```

```sql
SELECT title, 1 - array_cosine_distance(embedding, $query_vector) AS similarity
FROM embeddings e JOIN chunks c ON c.chunk_id = e.chunk_id
ORDER BY similarity DESC;
```

Result: notes semantically similar to "data modeling best practices":
- Data Modeling: The Unsung Hero (similarity: 0.75)
- Semantic Layer (similarity: 0.69)
- Kimball Dimensional Modeling (similarity: 0.65)
These notes match by meaning, not by explicit links. A note about "dimensional modeling techniques" matches even if it never mentions "best practices".
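Without the HNSW index, the same ranking is a brute-force scan over every chunk vector, which is useful for seeing what the index accelerates. A sketch with toy 2-dim vectors in place of the real embeddings:

```python
import math

def top_k(query_vec: list[float], chunk_vecs: dict[str, list[float]],
          k: int = 3) -> list[tuple[str, float]]:
    """Score every chunk by cosine similarity against the query vector
    and return the k best (chunk_id, similarity) pairs."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) *
                      math.sqrt(sum(x * x for x in b)))
    scored = [(cid, cos(query_vec, v)) for cid, v in chunk_vecs.items()]
    return sorted(scored, key=lambda t: -t[1])[:k]
```

An HNSW index returns approximately the same top-k without touching every vector, which is what makes search fast at scale.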
4. Hidden Connections (Embeddings + Graph combined)
How it works: This is the most interesting query - it combines both data sources:
- Find semantically similar notes (embeddings)
- Filter OUT notes that are already linked (graph)
- What remains = notes you should probably link but haven't
Example:
```bash
uv run python scripts/query.py hidden "Python pipelines" --seed "functional-data-engineering"
```

```sql
WITH semantic_similar AS (
    -- Find notes similar to "Python pipelines" via embeddings (simplified)
    SELECT slug, similarity FROM embeddings WHERE similarity > 0.4
),
direct_links AS (
    -- Get all notes already linked to/from the seed note
    SELECT target_slug AS slug FROM links WHERE source_slug = 'functional-data-engineering'
    UNION
    SELECT source_slug FROM links WHERE target_slug = 'functional-data-engineering'
)
-- Return similar notes that are NOT linked
SELECT * FROM semantic_similar
WHERE slug NOT IN (SELECT slug FROM direct_links);
```

Result: notes about Python pipelines that aren't linked to Functional Data Engineering:
- Orchestration (similarity: 0.63) - not linked!
- Pipeline Evolution (similarity: 0.61) - not linked!
- Python for Data Engineers (similarity: 0.59) - not linked!
These are discovery suggestions - semantically related content you might want to connect.
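The combination reduces to a set difference over the two result sets. A minimal sketch (the 0.4 threshold is taken from the hidden-connections query):

```python
def hidden_connections(similar: dict[str, float], linked: set[str],
                       threshold: float = 0.4) -> list[tuple[str, float]]:
    """similar: {slug: similarity} from the embedding search;
    linked: slugs already connected to the seed note.
    Returns unlinked, sufficiently similar notes, best first."""
    candidates = [(slug, sim) for slug, sim in similar.items()
                  if sim > threshold and slug not in linked]
    return sorted(candidates, key=lambda t: -t[1])
```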
| Query Type | Data Source | Use Case |
|---|---|---|
| Backlinks | Wikilinks (graph) | "What links to this note?" |
| Connections | Wikilinks (graph) | "What's N hops away in my graph?" |
| Semantic | Embeddings (vectors) | "Find notes about this topic" |
| Hidden | Both combined | "What should I link but haven't?" |
| Shared Tags | Hyperedges | "Notes sharing multiple tags with this one" |
| Graph-Boosted | Embeddings + Graph | "Semantic search boosted by graph proximity" |
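One plausible reading of the graph-boosted `--boost` flag is a multiplicative bonus for notes within graph reach of the seed note; the repo's exact scoring formula is an assumption here:

```python
def graph_boosted(similarities: dict[str, float], connected_to_seed: set[str],
                  boost: float = 1.2) -> list[tuple[str, float]]:
    """Assumed scoring rule: multiply semantic similarity by `boost`
    when the note is graph-connected to the seed, then rank."""
    scored = {slug: sim * (boost if slug in connected_to_seed else 1.0)
              for slug, sim in similarities.items()}
    return sorted(scored.items(), key=lambda t: -t[1])
```

With this rule, a slightly less similar note that sits near the seed in the wikilink graph can outrank a more similar but unconnected one.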
Your notes never leave your machine. Everything runs locally:
| Component | Where it runs |
|---|---|
| Embedding model | Local CPU (sentence-transformers) |
| Vector database | Local file (DuckDB) |
| All queries | Local |
Unlike OpenAI/Anthropic APIs that send your text to the cloud, the BAAI/bge-m3 embedding model (swap in any model you like) runs entirely in-process on your machine. This makes it safe for:
- Personal journals and private notes
- Confidential work documents
- Sensitive research materials
- Anything you wouldn't want uploaded to a third party
Network activity:
- One-time model download (~1.8GB from Hugging Face) on first run
- After that: zero network activity, works fully offline
```bash
# Pre-download model for offline use
python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('BAAI/bge-m3')"
```