Commit 6b8dffb
Reza Shamji
Add generic KG embedding module with RRF hybrid search support
New file: embed_kg.py (442 lines)
Purpose: Generic embedding-based knowledge graph search supporting three modes:
1. Embedding-only: Dense vector similarity using KaLM
2. BM25-only: Keyword matching (via ARK's GraphIndex)
3. Hybrid: Reciprocal Rank Fusion (RRF) of the embedding and BM25 rankings
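For illustration, the embedding-only mode amounts to ranking nodes by vector similarity to the query. The sketch below is not the code in embed_kg.py: it assumes cosine similarity (the commit message does not name the metric), uses only the standard library, and substitutes toy 2-d vectors for KaLM embeddings. The name cosine_top_k is hypothetical.

```python
import math

def cosine_top_k(query_vec, node_vecs, k=3):
    """Return the k (node_index, cosine_similarity) pairs closest to query_vec."""
    def norm(v):
        return math.sqrt(sum(x * x for x in v))

    q_norm = norm(query_vec)
    scored = []
    for idx, vec in enumerate(node_vecs):
        denom = q_norm * norm(vec)
        # Treat zero vectors as having no similarity rather than dividing by zero.
        sim = sum(a * b for a, b in zip(query_vec, vec)) / denom if denom else 0.0
        scored.append((idx, sim))
    # Highest similarity first; ties broken by lower node index for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]

# Toy vectors standing in for node embeddings (hypothetical data).
nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
top = cosine_top_k([1.0, 0.0], nodes, k=2)
```

Here node 0 ranks first (it is identical to the query) and node 2 second. A real implementation would more likely compute all similarities in one matrix product over the saved embedding array.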
Key Components:
- clean_query(): Copied from ark/src/core/index.py to ensure embedding/BM25 consistency
  - Cleans node names identically across both search methods
  - Comment explains why copy was necessary (SLURM -m flag module import issues)
- KGEmbedder class:
  - load_nodes(): Loads KG parquet + cleans names to match BM25 index
  - embed_all_nodes(): One-time operation using KaLM encode_document()
  - embed_query(): Per-request encoding using KaLM encode_query()
  - search_embedding_only(): Dense similarity search returning top-k results
  - reciprocal_rank_fusion(): Combines embedding + BM25 via RRF formula
    - score = 1/(rrf_k + rank) for each method, then summed
    - Both rankers weighted equally (symmetric fusion)
    - Returns fused_score + metadata (embedding_rank, bm25_rank, individual scores)
  - save_embeddings() / load_embeddings(): Disk persistence for efficiency
- CLI interface:
  - --graph-path: KG directory with nodes.parquet
  - --output-path: Embeddings save location (default: graph_path/embeddings_kalm.npy)
  - --batch-size: Embedding batch size (default: 256)
  - Validates inputs, runs 3-step pipeline (load → embed → save)
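The fusion rule listed under reciprocal_rank_fusion() (score = 1/(rrf_k + rank) per method, summed with equal weights) can be sketched as follows. This is a minimal illustration, not the function in embed_kg.py: the signature, the rrf_k=60 default, and the result layout are assumptions, each input list is assumed to contain unique node ids ordered best first, and the per-method raw scores the commit mentions are left out for brevity.

```python
def reciprocal_rank_fusion(embedding_ranked, bm25_ranked, rrf_k=60):
    """Fuse two ranked lists of node ids (best first) with equal weights."""
    fused = {}
    for method, ranked in (("embedding", embedding_ranked), ("bm25", bm25_ranked)):
        for rank, node_id in enumerate(ranked, start=1):
            entry = fused.setdefault(
                node_id,
                {"fused_score": 0.0, "embedding_rank": None, "bm25_rank": None},
            )
            # Each ranker contributes 1 / (rrf_k + rank); contributions are summed.
            entry["fused_score"] += 1.0 / (rrf_k + rank)
            entry[f"{method}_rank"] = rank
    # Highest fused score first; ties broken by node id for determinism.
    return sorted(fused.items(), key=lambda item: (-item[1]["fused_score"], str(item[0])))

# Hypothetical node ids: "a" tops the embedding ranking, "b" tops BM25.
results = reciprocal_rank_fusion(["a", "b", "c"], ["b", "d", "a"])
```

In this example "b" (ranks 2 and 1) edges out "a" (ranks 1 and 3), and nodes found by only one ranker ("d", "c") still appear, with None recorded for the rank they lack. A larger rrf_k flattens the difference between adjacent ranks.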
Compatible with: PRIME, MAG, AMAZON, OptmusKG, any ARK-compatible KG
Impact: Enables exploring hybrid search strategies (RRF fusion) without modifying ARK core, and prepares for search-mode parameter integration in Phase 1A.1.
Parent: b9a4c33
1 file changed: +441, -0 lines