@Mpdreamz Mpdreamz commented Oct 15, 2025

This PR documents our Elasticsearch integration architecture, which implements an incremental indexing strategy for maintaining both lexical (traditional full-text) and semantic (vector embeddings) search indices.

The Problem We're Solving

Every time we rebuild our documentation, we need to:

  • Index potentially thousands of markdown documents
  • Generate expensive embeddings for semantic search
  • Keep two separate indices (lexical + semantic) in sync
  • Avoid reprocessing unchanged documents

The Solution: Hash-Based Change Detection + Incremental Sync

During Indexing (Lexical Index):

  • We compute a hash from each document's URL, body, and headings
  • If the hash matches what's already indexed → only update batch_index_date
  • If the hash changed → full document upsert with new last_updated timestamp
  • This means we track every document in the current batch, but only fully update what actually changed.

The hash checking happens on the server (Elasticsearch) through the reindex API.
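To make the decision rule concrete, here is a minimal client-side sketch in Python. It is illustrative only: the real check runs server-side in Elasticsearch. The choice of SHA-256, the in-memory `index` dict, and the names `IndexedDoc`, `content_hash`, and `upsert` are all assumptions made for this example, not the exporter's actual code.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class IndexedDoc:
    content_hash: str
    last_updated: str
    batch_index_date: str


def content_hash(url: str, body: str, headings: List[str]) -> str:
    """Hash the fields that define a document's content: URL, body, headings."""
    h = hashlib.sha256()  # assumption: the PR does not name the hash algorithm
    for part in (url, body, *headings):
        h.update(part.encode("utf-8"))
        h.update(b"\x00")  # separator so ("ab", "c") and ("a", "bc") differ
    return h.hexdigest()


def upsert(index: Dict[str, IndexedDoc], url: str, body: str,
           headings: List[str], batch_date: str) -> str:
    """Apply the rule above to a stand-in index keyed by URL."""
    new_hash = content_hash(url, body, headings)
    existing = index.get(url)
    if existing is not None and existing.content_hash == new_hash:
        # Unchanged: only record that this batch saw the document.
        existing.batch_index_date = batch_date
        return "touched"
    # New or changed: full upsert with a fresh last_updated timestamp.
    index[url] = IndexedDoc(new_hash, last_updated=batch_date,
                            batch_index_date=batch_date)
    return "updated"
```

Re-indexing an unchanged document moves `batch_index_date` forward but leaves `last_updated` alone, which is what lets the later sync phases tell changed documents from merely seen ones.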

Synchronize from lexical to semantic index

The StopAsync method orchestrates a 5-phase synchronization:

  1. Finalize lexical index - Drain buffers, refresh, apply aliases
  2. Bootstrap semantic index - Create if it doesn't exist yet
  3. Sync updates - Copy only documents where last_updated >= batch_date from lexical → semantic
     • Only changed docs get re-embedded
  4. Sync deletions - Reindex from lexical → semantic where batch_index_date < batch_date, with each operation turned into a delete
  5. Cleanup - Delete stale docs (batch_index_date < batch_date) from the lexical index itself
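As a rough sketch, phases 3 to 5 could map onto Elasticsearch request bodies like the ones built below. The range queries follow the conditions listed above, and `ctx.op = 'delete'` is the reindex-script mechanism for turning a copied document into a delete on the destination. The function names and index names are hypothetical, and the bodies the exporter actually sends may differ.

```python
def sync_updates_body(lexical: str, semantic: str, batch_date: str) -> dict:
    """Phase 3: _reindex body copying docs changed in this batch."""
    return {
        "source": {
            "index": lexical,
            "query": {"range": {"last_updated": {"gte": batch_date}}},
        },
        "dest": {"index": semantic},
    }


def sync_deletions_body(lexical: str, semantic: str, batch_date: str) -> dict:
    """Phase 4: _reindex body whose script turns each operation into a delete."""
    return {
        "source": {
            "index": lexical,
            "query": {"range": {"batch_index_date": {"lt": batch_date}}},
        },
        "dest": {"index": semantic},
        "script": {"lang": "painless", "source": "ctx.op = 'delete'"},
    }


def cleanup_body(batch_date: str) -> dict:
    """Phase 5: _delete_by_query body for stale docs in the lexical index."""
    return {"query": {"range": {"batch_index_date": {"lt": batch_date}}}}
```

Phases 3 and 4 select disjoint sets of documents: anything seen in the current batch has batch_index_date equal to batch_date, so it can never match the deletion query.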

Why This Matters

  • Massive performance gains: If only 10 out of 10,000 docs changed, we only regenerate 10 semantic embeddings
  • Cost savings: Semantic embedding generation is expensive (compute + potential API costs)
  • Zero-downtime: Time-stamped indices with alias swapping means no search interruption
  • Data consistency: Both indices stay perfectly synchronized despite being updated separately
  • Change tracking: We can use last_updated to boost recency (if needed)
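For the zero-downtime point, a minimal sketch of an alias swap: the body below is meant for the `_aliases` endpoint, which applies all listed actions atomically, so searches against the alias never see a gap. The function name and the alias and index names are placeholders chosen for this example, not the names the exporter uses.

```python
from typing import Optional


def alias_swap_body(alias: str, old_index: Optional[str], new_index: str) -> dict:
    """Build an _aliases body that repoints `alias` from old_index to new_index."""
    actions = []
    if old_index is not None:
        # Detach the alias from the previous time-stamped index, if there is one.
        actions.append({"remove": {"index": old_index, "alias": alias}})
    actions.append({"add": {"index": new_index, "alias": alias}})
    return {"actions": actions}


# Hypothetical names for illustration only.
body = alias_swap_body("docs-lexical", "docs-lexical-2025.10.14", "docs-lexical-2025.10.15")
```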

Visualization

The new docs include a step-by-step visual walkthrough showing how documents flow through the system, with color-coded states (blue = batch-tracked, green = updated, red = marked for deletion).

For Reviewers

  • ElasticsearchExporter.cs - Base exporter with channel management
  • ElasticsearchMarkdownExporter.cs - Orchestration logic and sync phases

@Mpdreamz Mpdreamz requested review from a team as code owners October 15, 2025 16:12
@Mpdreamz Mpdreamz self-assigned this Oct 15, 2025
@Mpdreamz Mpdreamz requested a review from cotti October 15, 2025 16:12
@Mpdreamz Mpdreamz changed the title feature/incremental indexing Incremental semantic index ingestion. Oct 15, 2025
@reakaleek reakaleek left a comment

Nice visualizations!

@Mpdreamz Mpdreamz merged commit a6ed98a into main Oct 16, 2025
23 checks passed
@Mpdreamz Mpdreamz deleted the feature/incremental-indexing branch October 16, 2025 07:53