
feat: incremental Meilisearch indexing with HF Hub ID tracker #759

Merged
mishig25 merged 3 commits into main from feat/incremental-meilisearch-indexing
Mar 9, 2026

@mishig25 mishig25 commented Feb 26, 2026

Summary

  • Incremental indexing: New --incremental flag for populate-search-engine that only embeds new/changed chunks and deletes stale ones — no full rebuild or temp-index swap needed. Uses the doc ID format (which includes a content hash) to detect changes automatically.
  • HF Hub ID tracker: New src/doc_builder/embeddings_tracker.py persists the current set of indexed document IDs as JSON in the HF Hub dataset hf-doc-build/doc-builder-embeddings-tracker. Loaded on each incremental run to diff against the freshly-computed IDs.
  • Bootstrap migration: New migrations/export_meili_ids_to_hf.py is a one-time script to seed the tracker from the existing Meilisearch index.
  • --meilisearch_url CLI arg: Replaced all hardcoded https://edge.meilisearch.com references with a --meilisearch_url argument (mirrors --meilisearch_key) across all commands and migration scripts.
  • CI workflow updated: Uses --incremental, and expects MEILISEARCH_URL and HF_TOKEN secrets.
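
The diff behind `--incremental` can be pictured as plain set arithmetic over document IDs. The sketch below is illustrative only: `make_doc_id` and its `<library>-<page>-<hash>` layout are assumptions, not the actual ID format used by doc-builder. The only property it relies on is the one stated above, namely that the ID embeds a content hash and therefore changes whenever the chunk content changes.

```python
import hashlib


def make_doc_id(library: str, page: str, text: str) -> str:
    """Hypothetical ID scheme: a short content hash is part of the ID."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
    return f"{library}-{page}-{digest}"


def diff_ids(tracked: set[str], current: set[str]) -> tuple[set[str], set[str]]:
    """Return (ids_to_embed, ids_to_delete).

    tracked: IDs recorded by the tracker after the previous run.
    current: IDs freshly computed from the docs on this run.
    """
    return current - tracked, tracked - current


# One chunk was edited ("intro"), one is unchanged ("api").
tracked = {
    make_doc_id("lib", "intro", "old text"),
    make_doc_id("lib", "api", "same text"),
}
current = {
    make_doc_id("lib", "intro", "new text"),
    make_doc_id("lib", "api", "same text"),
}

to_embed, to_delete = diff_ids(tracked, current)
```

An edited chunk shows up twice in the diff: its new ID lands in `to_embed` and its old ID in `to_delete`, while the unchanged chunk appears in neither set and is not re-embedded.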

New helpers in meilisearch_helper.py

  • get_all_document_ids(client, index_name) -> set[str] — paginates through all docs fetching only the id field.
  • delete_documents_from_db(client, index_name, doc_ids) — deletes a batch of doc IDs (decorated with @wait_for_task_completion).
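
As a rough sketch of the pagination and batching these helpers imply (not the code in `meilisearch_helper.py`): `fetch_page` below is a hypothetical stand-in for whatever Meilisearch call returns one page of documents restricted to the `id` field, and `in_batches` is a hypothetical helper for splitting IDs before deletion. Both names and the page/batch sizes are assumptions.

```python
from typing import Callable, Iterator

Page = list[dict]


def collect_ids(fetch_page: Callable[[int, int], Page], page_size: int = 1000) -> set[str]:
    """Walk the index page by page and gather every document ID."""
    ids: set[str] = set()
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        ids.update(doc["id"] for doc in page)
        # A short (or empty) page means the end of the index was reached.
        if len(page) < page_size:
            return ids
        offset += page_size


def in_batches(ids: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most batch_size IDs."""
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]


# In-memory stand-in for an index, so the sketch runs without a server.
fake_index = [{"id": f"doc-{i}"} for i in range(25)]


def fake_fetch(offset: int, limit: int) -> Page:
    return fake_index[offset:offset + limit]


all_ids = collect_ids(fake_fetch, page_size=10)
delete_batches = list(in_batches(sorted(all_ids), batch_size=10))
```

Fetching only the `id` field keeps each page small, which matters because the stored documents also carry embeddings.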

Bootstrap instructions (run once)

uv run python migrations/export_meili_ids_to_hf.py \
  --meilisearch_key <key> \
  --meilisearch_url <url> \
  --hf_token <token>
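
Conceptually, seeding the tracker means dumping the ID set to JSON and reading it back on the next run. The following is a minimal sketch under the assumption that the tracker file is a flat, sorted JSON array; the real layout in `embeddings_tracker.py` may differ, and the upload to and download from the HF Hub dataset are left out.

```python
import json


def ids_to_json(ids: set[str]) -> str:
    """Serialize IDs as a sorted JSON array so successive versions diff cleanly."""
    return json.dumps(sorted(ids), indent=2)


def ids_from_json(payload: str) -> set[str]:
    """Inverse of ids_to_json."""
    return set(json.loads(payload))


payload = ids_to_json({"doc-b", "doc-a"})
restored = ids_from_json(payload)
```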

Test plan

  • Run populate-search-engine --incremental locally with --skip-download on a small set of libraries and verify only new/changed chunks are embedded
  • Verify stale IDs are deleted from Meilisearch
  • Confirm tracker JSON is updated on HF Hub after the run
  • Add MEILISEARCH_URL and HF_TOKEN secrets to the repo's GitHub Actions secrets
  • Trigger the workflow manually via workflow_dispatch and verify it completes successfully

🤖 Generated with Claude Code

mishig25 and others added 2 commits February 26, 2026 15:24
- Add `--incremental` flag to `populate-search-engine` command: only
  embeds new/changed chunks and deletes stale ones, uploading directly
  to the main index (no temp-index swap needed).
- Introduce `src/doc_builder/embeddings_tracker.py` to persist the set
  of indexed document IDs as JSON in the HF Hub dataset
  `hf-doc-build/doc-builder-embeddings-tracker`.
- Add `migrations/export_meili_ids_to_hf.py` as a one-time bootstrap
  script to seed the tracker from the existing Meilisearch index.
- Add `get_all_document_ids` and `delete_documents_from_db` helpers to
  `meilisearch_helper.py`.
- Replace all hardcoded `https://edge.meilisearch.com` references with a
  `--meilisearch_url` CLI argument (mirrors the existing `--meilisearch_key`
  pattern) across all commands and migration scripts.
- Update the CI workflow to use `--incremental` and supply the new
  `MEILISEARCH_URL` and `HF_TOKEN` secrets.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@mishig25 mishig25 marked this pull request as ready for review March 9, 2026 09:14
@mishig25 mishig25 merged commit 7566b95 into main Mar 9, 2026
4 checks passed
@mishig25 mishig25 deleted the feat/incremental-meilisearch-indexing branch March 9, 2026 09:14
