The spnl RAPTOR indexer uses a three-phase approach to work with leann-rs's bulk-only `LeannBuilder`:
- Phase 1 -- Build the initial HNSW index from base document fragments
- Phase 2 -- For each base fragment, search the index for similar passages, generate LLM summaries, and collect all the summaries
- Phase 3 -- Rebuild the entire HNSW index from scratch with all original passages + all RAPTOR summaries
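A minimal sketch of this flow, where `Fragment`, `Index`, `build_index()`, `search()`, `summarize_with_llm()`, and `K` are hypothetical stand-ins for the actual spnl/leann-rs types and functions:

```rust
// Hypothetical sketch of the three-phase RAPTOR indexing flow.
const K: usize = 8; // illustrative neighbor count

fn raptor_index(fragments: Vec<Fragment>) -> Result<Index> {
    // Phase 1: bulk-build the initial HNSW index from the base fragments.
    let base_index = build_index(&fragments)?;

    // Phase 2: for each base fragment, retrieve similar passages and
    // summarize them with the LLM, collecting all the summaries.
    let mut summaries = Vec::new();
    for fragment in &fragments {
        let neighbors = base_index.search(&fragment.text, K)?;
        summaries.push(summarize_with_llm(fragment, &neighbors)?);
    }

    // Phase 3: rebuild the entire index from scratch over the original
    // passages plus every RAPTOR summary.
    let all_passages: Vec<Fragment> = fragments.into_iter().chain(summaries).collect();
    build_index(&all_passages)
}
```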
In the original RAPTOR implementation (and spnl's previous lancedb-based approach), summaries were inserted into the vector database incrementally. This meant:
- Summary S1 (generated from fragments A, B, and C) is inserted into the index
- When generating S2 from fragments D, E, and F, the search may also find S1 as a relevant neighbor
- This creates a richer, more interconnected knowledge graph where summaries can reference and build upon other summaries
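Sketched with the same hypothetical names, the incremental flow interleaves insertion with search, so each summary is visible to every search that follows it:

```rust
// Hypothetical sketch of the previous incremental (lancedb-style) flow.
for fragment in &fragments {
    // The search may return earlier summaries (e.g. S1), not just base fragments.
    let neighbors = index.search(&fragment.text, K)?;
    let summary = summarize_with_llm(fragment, &neighbors)?;
    index.insert(summary)?; // immediately searchable by later iterations
}
```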
With the current batch-rebuild approach:
- All summaries are generated against the base fragments only
- No summary can influence the search results used to generate another summary
- The resulting knowledge graph is shallower (one level of summarization only)
Option A: Add an `append_to_index()` method that takes an existing index path and appends new passages to it:

```rust
impl LeannBuilder {
    pub fn append_to_index(
        &mut self,
        index_path: &Path,
        provider: &dyn EmbeddingProvider,
    ) -> Result<()> {
        // 1. Load the existing HNSW graph
        // 2. Load the existing passages
        // 3. Compute embeddings for the new chunks
        // 4. Insert the new nodes into the graph (incremental HNSW insertion)
        // 5. Rewrite the index files
        todo!()
    }
}
```
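With Option A in place, the Phase 2 loop could make each summary searchable as soon as it is generated. A hedged usage sketch; `searcher`, `add_text()`, `summarize_with_llm()`, and `K` are hypothetical stand-ins, and the real leann-rs API for queuing new chunks may differ:

```rust
// Hypothetical usage of Option A: each summary is appended to the live
// index so that later searches in the same loop can retrieve it.
for fragment in &fragments {
    let neighbors = searcher.search(&fragment.text, K)?;      // assumed search API
    let summary = summarize_with_llm(fragment, &neighbors)?;  // assumed LLM helper
    builder.add_text(&summary);                               // assumed chunk-queuing API
    builder.append_to_index(index_path, provider)?;           // the proposed method
}
```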
Option B: Add a lower-level API for inserting individual vectors into an existing graph:

```rust
impl HnswGraph {
    /// Standard HNSW insertion algorithm; returns the node ID of the
    /// inserted vector.
    pub fn insert(&mut self, vector: &[f32], level: i32) -> usize {
        todo!()
    }
}
```

This is more fundamental and would enable Option A, as well as other use cases.
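For reference, a compressed sketch of what the standard insertion algorithm (Malkov & Yashunin) would look like here. Every field and helper used (`entry_point`, `max_level`, `ef_construction`, `m`, `add_node`, `greedy_closest`, `search_layer`, `select_neighbors`, `connect`) is an illustrative stand-in, not an existing leann-rs API, and the empty-graph case is omitted:

```rust
impl HnswGraph {
    pub fn insert(&mut self, vector: &[f32], level: i32) -> usize {
        let node_id = self.add_node(vector, level); // hypothetical: allocate node storage

        // Descend greedily through the layers above `level`, keeping only the
        // single closest node at each layer as the next entry point.
        let mut entry = self.entry_point;
        for layer in ((level + 1)..=self.max_level).rev() {
            entry = self.greedy_closest(vector, entry, layer);
        }

        // From min(level, max_level) down to 0: beam-search ef_construction
        // candidates, select up to M neighbors, and link bidirectionally,
        // pruning any neighbor list that grows past its cap.
        for layer in (0..=level.min(self.max_level)).rev() {
            let candidates = self.search_layer(vector, entry, self.ef_construction, layer);
            let neighbors = self.select_neighbors(&candidates, self.m);
            self.connect(node_id, &neighbors, layer);
            entry = candidates[0]; // closest candidate seeds the next layer down
        }

        // A node drawn above the current top layer becomes the new entry point.
        if level > self.max_level {
            self.max_level = level;
            self.entry_point = node_id;
        }
        node_id
    }
}
```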
- HNSW supports incremental insertion by design (the original algorithm is incremental), so this is architecturally sound
- The current `build_hnsw()` function already does incremental insertion internally during construction; the API just doesn't expose single-node insertion after the initial build
- The CSR (compact) format would need to be recomputed after insertions, or insertions would need to work on the standard format first
- The `VectorStorage` would need to support appending new vectors
- The passages JSONL file and offset index would need append support (sketched below)
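For the last point, a minimal sketch of what JSONL append support could look like, assuming the offset index is a flat list of byte offsets into the file (an assumption about the on-disk layout):

```rust
use std::fs::OpenOptions;
use std::io::{Seek, SeekFrom, Write};
use std::path::Path;

// Hypothetical: append one passage line to the JSONL file and record its
// starting byte offset in the in-memory offset index.
fn append_passage(jsonl_path: &Path, offsets: &mut Vec<u64>, passage_json: &str) -> std::io::Result<()> {
    let mut file = OpenOptions::new().append(true).open(jsonl_path)?;
    let offset = file.seek(SeekFrom::End(0))?; // byte position of the new line
    writeln!(file, "{}", passage_json)?;
    offsets.push(offset);
    Ok(())
}
```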
- Current approach: the N base embeddings are computed twice (once for the initial build, once for the Phase 3 rebuild with summaries). Total embedding calls = 2N + S, where S is the number of summaries.
- Incremental approach: N + S embedding calls in total; each passage and each summary is embedded exactly once.
- For large corpora, the rebuild cost is significant because all original embeddings must be recomputed during Phase 3.
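For concreteness: with N = 1,000,000 base fragments and S = 50,000 summaries, the current approach makes 2N + S = 2,050,000 embedding calls versus N + S = 1,050,000 for the incremental approach, nearly a 2x difference.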
A partial mitigation would be to cache the embeddings from Phase 1 and rebuild from them in Phase 3 via `LeannBuilder::build_index_from_embeddings()`, which leann-rs already supports. This avoids recomputing the base embeddings but still requires a full graph rebuild.
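A hedged sketch of that mitigation; the exact signature of `build_index_from_embeddings()` is assumed here (pairs of text and vector), as is the `provider.embed()` call:

```rust
// Hypothetical sketch of the embedding-cache mitigation.
let mut cache: Vec<(String, Vec<f32>)> = Vec::new();

// Phase 1: embed each base fragment exactly once and remember the vectors.
for fragment in &fragments {
    cache.push((fragment.text.clone(), provider.embed(&fragment.text)?));
}
let base_index = LeannBuilder::build_index_from_embeddings(&cache)?; // assumed signature

// Phase 2: generate summaries by searching base_index (not shown).

// Phase 3: embed only the S summaries, extend the cache, and rebuild
// without re-embedding the N base fragments.
for summary in &summaries {
    cache.push((summary.clone(), provider.embed(summary)?));
}
let final_index = LeannBuilder::build_index_from_embeddings(&cache)?;
```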
Medium -- the three-phase approach is functionally correct and produces useful RAPTOR summaries. The main loss is the depth of cross-referencing between summaries, which may not significantly affect retrieval quality for most document sizes. The performance overhead of double-embedding is the more pressing concern for large corpora.