feat: incremental Meilisearch indexing with HF Hub ID tracker #759
Merged
- Add `--incremental` flag to `populate-search-engine` command: only embeds new/changed chunks and deletes stale ones, uploading directly to the main index (no temp-index swap needed).
- Introduce `src/doc_builder/embeddings_tracker.py` to persist the set of indexed document IDs as JSON in the HF Hub dataset `hf-doc-build/doc-builder-embeddings-tracker`.
- Add `migrations/export_meili_ids_to_hf.py` as a one-time bootstrap script to seed the tracker from the existing Meilisearch index.
- Add `get_all_document_ids` and `delete_documents_from_db` helpers to `meilisearch_helper.py`.
- Replace all hardcoded `https://edge.meilisearch.com` references with a `--meilisearch_url` CLI argument (mirrors the existing `--meilisearch_key` pattern) across all commands and migration scripts.
- Update the CI workflow to use `--incremental` and supply the new `MEILISEARCH_URL` and `HF_TOKEN` secrets.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
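The incremental step described above comes down to a set difference between the IDs already recorded in the tracker and the IDs computed from the current docs. The following is a minimal sketch of that idea, not the code from this PR: the function name `plan_incremental_update` and the example ID strings are hypothetical, and it assumes each document ID embeds a content hash, so an edited chunk appears as one new ID plus one stale ID.

```python
from __future__ import annotations


def plan_incremental_update(
    tracked_ids: set[str], current_ids: set[str]
) -> tuple[set[str], set[str]]:
    """Return (ids_to_embed, ids_to_delete) for an incremental run.

    Hypothetical helper for illustration only.
    """
    # IDs present now but not yet indexed: new or changed chunks.
    ids_to_embed = current_ids - tracked_ids
    # IDs indexed earlier but no longer produced: stale chunks.
    ids_to_delete = tracked_ids - current_ids
    return ids_to_embed, ids_to_delete


# Made-up IDs; the real ID format is not shown in this PR description.
tracked = {"intro-a1b2", "install-c3d4", "old-page-e5f6"}
current = {"intro-a1b2", "install-9f9f", "new-page-7788"}

to_embed, to_delete = plan_incremental_update(tracked, current)
print(sorted(to_embed))
print(sorted(to_delete))
```

In this example the unchanged `intro-a1b2` chunk is neither re-embedded nor deleted, while the edited `install` chunk contributes one ID to each set.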
mishig25 commented on Mar 9, 2026
Summary
- `--incremental` flag for `populate-search-engine` that only embeds new/changed chunks and deletes stale ones; no full rebuild or temp-index swap is needed. It uses the doc ID format (which includes a content hash) to detect changes automatically.
- `src/doc_builder/embeddings_tracker.py` persists the current set of indexed document IDs as JSON in the HF Hub dataset `hf-doc-build/doc-builder-embeddings-tracker`. The set is loaded on each incremental run and diffed against the freshly computed IDs.
- `migrations/export_meili_ids_to_hf.py` is a one-time script to seed the tracker from the existing Meilisearch index.
- `--meilisearch_url` CLI arg: replaced all hardcoded `https://edge.meilisearch.com` references with a `--meilisearch_url` argument (mirrors `--meilisearch_key`) across all commands and migration scripts.
- The CI workflow now uses `--incremental` and expects `MEILISEARCH_URL` and `HF_TOKEN` secrets.

New helpers in `meilisearch_helper.py`

- `get_all_document_ids(client, index_name) -> set[str]`: paginates through all docs, fetching only the `id` field.
- `delete_documents_from_db(client, index_name, doc_ids)`: deletes a batch of doc IDs (decorated with `@wait_for_task_completion`).

Bootstrap instructions (run once)
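Conceptually, seeding the tracker means paging through the index to collect every document ID and then serializing that set as JSON. The sketch below illustrates only that logic and is neither the PR's script nor its invocation. `collect_all_ids`, `serialize_ids`, and the `fetch_page(offset, limit)` callable are hypothetical stand-ins (no Meilisearch or HF Hub calls are made), and the sorted-list JSON layout is an assumption.

```python
from __future__ import annotations

import json
from typing import Callable


def collect_all_ids(
    fetch_page: Callable[[int, int], list[dict]], page_size: int = 1000
) -> set[str]:
    """Page through an index via fetch_page(offset, limit) and gather IDs."""
    ids: set[str] = set()
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        if not page:
            break  # nothing left to read
        ids.update(doc["id"] for doc in page)
        if len(page) < page_size:
            break  # a short page means this was the last one
        offset += len(page)
    return ids


def serialize_ids(ids: set[str]) -> str:
    """Serialize IDs as a sorted JSON list (assumed layout, for stable diffs)."""
    return json.dumps(sorted(ids))


# In-memory stand-in for the search index, used only for this demo.
fake_index = [{"id": f"chunk-{i}"} for i in range(5)]


def fake_fetch(offset: int, limit: int) -> list[dict]:
    return fake_index[offset : offset + limit]


all_ids = collect_all_ids(fake_fetch, page_size=2)
payload = serialize_ids(all_ids)
print(payload)
```

Requesting only the `id` field, as the helper description says, keeps each page small; the stub here returns full dicts purely for simplicity.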
Test plan
- Run `populate-search-engine --incremental` locally with `--skip-download` on a small set of libraries and verify only new/changed chunks are embedded
- Add `MEILISEARCH_URL` and `HF_TOKEN` secrets to the repo's GitHub Actions secrets
- Trigger the CI workflow via `workflow_dispatch` and verify it completes successfully

🤖 Generated with Claude Code