Commit 6b8dffb

Reza Shamji committed
Add generic KG embedding module with RRF hybrid search support
New file: embed_kg.py (442 lines)

Purpose: Generic embedding-based knowledge graph search supporting three modes:
1. Embedding-only: Dense vector similarity using KaLM
2. BM25-only: Keyword matching (via ARK's GraphIndex)
3. Hybrid: RRF fusion combining both for best of both worlds

Key Components:
- clean_query(): Copied from ark/src/core/index.py to ensure embedding/BM25 consistency
  - Cleans node names identically across both search methods
  - Comment explains why copy was necessary (SLURM -m flag module import issues)
- KGEmbedder class:
  - load_nodes(): Loads KG parquet + cleans names to match BM25 index
  - embed_all_nodes(): One-time operation using KaLM encode_document()
  - embed_query(): Per-request encoding using KaLM encode_query()
  - search_embedding_only(): Dense similarity search returning top-k results
  - reciprocal_rank_fusion(): Combines embedding + BM25 via RRF formula
    - score = 1/(rrf_k + rank) for each method, then summed
    - Both rankers weighted equally (symmetric fusion)
    - Returns fused_score + metadata (embedding_rank, bm25_rank, individual scores)
  - save_embeddings() / load_embeddings(): Disk persistence for efficiency
- CLI interface:
  - --graph-path: KG directory with nodes.parquet
  - --output-path: Embeddings save location (default: graph_path/embeddings_kalm.npy)
  - --batch-size: Embedding batch size (default: 256)
  - Validates inputs, runs 3-step pipeline (load → embed → save)

Compatible with: PRIME, MAG, AMAZON, OptmusKG, any ARK-compatible KG

Impact: Enables exploring hybrid search strategies (RRF fusion) without modifying ARK core, prepares for search-mode parameter integration in Phase 1A.
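
For readers unfamiliar with RRF, the fusion step described above (score = 1/(rrf_k + rank) per method, then summed with equal weights) can be sketched roughly as follows. This is an illustrative sketch only, not the code in embed_kg.py: the function signature, the result keys other than fused_score / embedding_rank / bm25_rank, the 1-based ranks, the default rrf_k=60, and the sample node names are all assumptions made for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(embedding_ranked, bm25_ranked, rrf_k=60, top_k=10):
    """Fuse two best-first lists of node IDs with equal-weight RRF.

    Hypothetical signature for illustration. Assumes each list holds
    unique node IDs and that ranks start at 1; rrf_k=60 is a commonly
    used default, not a value stated in the commit message.
    """
    # Map each node to its 1-based position in each ranking.
    emb_rank = {node: r for r, node in enumerate(embedding_ranked, start=1)}
    bm25_rank = {node: r for r, node in enumerate(bm25_ranked, start=1)}

    # Each ranker contributes 1 / (rrf_k + rank); contributions are summed.
    fused = defaultdict(float)
    for ranks in (emb_rank, bm25_rank):
        for node, rank in ranks.items():
            fused[node] += 1.0 / (rrf_k + rank)

    # Attach per-method ranks as metadata; None means the ranker missed the node.
    results = [
        {
            "node": node,
            "fused_score": score,
            "embedding_rank": emb_rank.get(node),
            "bm25_rank": bm25_rank.get(node),
        }
        for node, score in fused.items()
    ]
    results.sort(key=lambda r: r["fused_score"], reverse=True)
    return results[:top_k]


# Made-up node names, purely for demonstration.
hits = reciprocal_rank_fusion(
    ["aspirin", "ibuprofen", "warfarin"],   # embedding order, best first
    ["ibuprofen", "heparin", "aspirin"],    # BM25 order, best first
)
for hit in hits:
    print(hit)
```

Because only ranks enter the formula, the raw cosine and BM25 scores never need to be normalized against each other, which is presumably why a symmetric fusion is workable here. A node found by only one ranker still receives a score, just a smaller one.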
1 parent b9a4c33 commit 6b8dffb

1 file changed: +441 -0 lines changed
