feat: incremental Meilisearch indexing with HF Hub ID tracker by mishig25 · Pull Request #759 · huggingface/doc-builder

mishig25 · 2026-02-26T15:24:29Z

Summary

Incremental indexing: New --incremental flag for populate-search-engine that only embeds new/changed chunks and deletes stale ones — no full rebuild or temp-index swap needed. Uses the doc ID format (which includes a content hash) to detect changes automatically.
HF Hub ID tracker: New src/doc_builder/embeddings_tracker.py persists the current set of indexed document IDs as JSON in the HF Hub dataset hf-doc-build/doc-builder-embeddings-tracker. Loaded on each incremental run to diff against the freshly-computed IDs.
Bootstrap migration: New migrations/export_meili_ids_to_hf.py is a one-time script to seed the tracker from the existing Meilisearch index.
--meilisearch_url CLI arg: Replaced all hardcoded https://edge.meilisearch.com references with a --meilisearch_url argument (mirrors --meilisearch_key) across all commands and migration scripts.
CI workflow updated: Uses --incremental, and expects MEILISEARCH_URL and HF_TOKEN secrets.
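
As a rough illustration of the change detection described above: because a chunk's ID embeds a hash of its content, an edited chunk gets a new ID, and a plain set difference against the tracked IDs yields both the chunks to embed and the stale IDs to delete. The sketch below is not the PR's code. `make_doc_id`, its `library-page-hash` layout, and `plan_incremental_update` are hypothetical names chosen for illustration.

```python
import hashlib


def make_doc_id(library: str, page: str, text: str) -> str:
    """Hypothetical ID scheme: chunk location plus a short content hash."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
    return f"{library}-{page}-{digest}"


def plan_incremental_update(
    tracked_ids: set[str], current_ids: set[str]
) -> tuple[set[str], set[str]]:
    """Return (ids_to_embed, ids_to_delete) for one incremental run.

    tracked_ids: IDs recorded by the previous run (the tracker contents).
    current_ids: IDs computed from the freshly built docs.
    """
    ids_to_embed = current_ids - tracked_ids   # new or changed chunks
    ids_to_delete = tracked_ids - current_ids  # removed or superseded chunks
    return ids_to_embed, ids_to_delete
```

Under this scheme an edited chunk shows up twice in the plan: its new ID is embedded and its old ID is deleted, so no separate "changed" case is needed.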

New helpers in `meilisearch_helper.py`

get_all_document_ids(client, index_name) -> set[str] — paginates through all docs fetching only the id field.
delete_documents_from_db(client, index_name, doc_ids) — deletes a batch of doc IDs (decorated with @wait_for_task_completion).
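
A minimal sketch of how such helpers might look, assuming a client object shaped like the `meilisearch` Python client (`client.index(name).get_documents({...})` returning an object with a `results` list, and `index.delete_documents(ids)`). This is not the PR's implementation: it omits the `@wait_for_task_completion` decorator, and `page_size`, `batch_size`, and `delete_ids_in_batches` are hypothetical.

```python
from typing import Any, Iterable


def _doc_id(doc: Any) -> str:
    # Results may be plain dicts or attribute-style document objects.
    return doc["id"] if isinstance(doc, dict) else doc.id


def get_all_document_ids(client: Any, index_name: str, page_size: int = 1000) -> set[str]:
    """Page through an index, requesting only the `id` field of each document."""
    index = client.index(index_name)
    ids: set[str] = set()
    offset = 0
    while True:
        page = index.get_documents(
            {"offset": offset, "limit": page_size, "fields": ["id"]}
        )
        if not page.results:  # an empty page means we are past the last document
            break
        ids.update(_doc_id(doc) for doc in page.results)
        offset += len(page.results)
    return ids


def delete_ids_in_batches(
    client: Any, index_name: str, doc_ids: Iterable[str], batch_size: int = 1000
) -> int:
    """Delete IDs in fixed-size batches; returns the number of delete calls made."""
    index = client.index(index_name)
    ordered = sorted(doc_ids)
    calls = 0
    for start in range(0, len(ordered), batch_size):
        index.delete_documents(ordered[start:start + batch_size])
        calls += 1
    return calls
```

Fetching only `id` keeps each page small, which matters when the index stores embeddings alongside the text.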

Bootstrap instructions (run once)

uv run python migrations/export_meili_ids_to_hf.py \
  --meilisearch_key <key> \
  --meilisearch_url <url> \
  --hf_token <token>
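
For orientation, the tracker that this script seeds could be persisted roughly as follows. This is a sketch under assumptions, not the contents of `embeddings_tracker.py`: the file name `ids.json` and the flat sorted-list JSON layout are hypothetical, as are the function names. Only the dataset repo name comes from the description above. The Hub calls used are `HfApi.upload_file` and `hf_hub_download` from `huggingface_hub`.

```python
import json

TRACKER_REPO = "hf-doc-build/doc-builder-embeddings-tracker"  # named in the PR description
TRACKER_FILE = "ids.json"  # hypothetical file name inside the dataset repo


def serialize_ids(ids: set[str]) -> bytes:
    """Encode IDs as a sorted JSON list so identical sets produce identical files."""
    return json.dumps(sorted(ids), indent=2).encode("utf-8")


def parse_ids(payload: bytes) -> set[str]:
    """Decode the JSON list written by serialize_ids back into a set."""
    return set(json.loads(payload.decode("utf-8")))


def save_tracker(ids: set[str], token: str) -> None:
    """Upload the current ID set to the dataset repo (requires huggingface_hub)."""
    from huggingface_hub import HfApi

    HfApi(token=token).upload_file(
        path_or_fileobj=serialize_ids(ids),
        path_in_repo=TRACKER_FILE,
        repo_id=TRACKER_REPO,
        repo_type="dataset",
    )


def load_tracker(token: str) -> set[str]:
    """Download the tracker file and return the previously indexed IDs."""
    from huggingface_hub import hf_hub_download

    path = hf_hub_download(
        repo_id=TRACKER_REPO,
        filename=TRACKER_FILE,
        repo_type="dataset",
        token=token,
    )
    with open(path, "rb") as f:
        return parse_ids(f.read())
```

Sorting before serializing keeps the uploaded file stable across runs, so a run with no changes produces byte-identical content.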

Test plan

Run populate-search-engine --incremental locally with --skip-download on a small set of libraries and verify only new/changed chunks are embedded
Verify stale IDs are deleted from Meilisearch
Confirm tracker JSON is updated on HF Hub after the run
Add MEILISEARCH_URL and HF_TOKEN secrets to the repo's GitHub Actions secrets
Trigger the workflow manually via workflow_dispatch and verify it completes successfully

🤖 Generated with Claude Code

- Add `--incremental` flag to `populate-search-engine` command: only embeds new/changed chunks and deletes stale ones, uploading directly to the main index (no temp-index swap needed).
- Introduce `src/doc_builder/embeddings_tracker.py` to persist the set of indexed document IDs as JSON in the HF Hub dataset `hf-doc-build/doc-builder-embeddings-tracker`.
- Add `migrations/export_meili_ids_to_hf.py` as a one-time bootstrap script to seed the tracker from the existing Meilisearch index.
- Add `get_all_document_ids` and `delete_documents_from_db` helpers to `meilisearch_helper.py`.
- Replace all hardcoded `https://edge.meilisearch.com` references with a `--meilisearch_url` CLI argument (mirrors the existing `--meilisearch_key` pattern) across all commands and migration scripts.
- Update the CI workflow to use `--incremental` and supply the new `MEILISEARCH_URL` and `HF_TOKEN` secrets.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

.github/workflows/populate_search_engine.yml

mishig25 and others added 2 commits February 26, 2026 15:24

check

7800aff

mishig25 commented Mar 9, 2026

.github/workflows/populate_search_engine.yml (outdated)

Apply suggestion from @mishig25

2da6f15

mishig25 marked this pull request as ready for review March 9, 2026 09:14

mishig25 merged commit 7566b95 into main Mar 9, 2026
4 checks passed

mishig25 deleted the feat/incremental-meilisearch-indexing branch March 9, 2026 09:14

mishig25 mentioned this pull request Mar 10, 2026

Summary: Meilisearch Indexing Changes #722

Closed

3 tasks

mishig25 merged 3 commits into main from feat/incremental-meilisearch-indexing
