Commit 7566b95

Authored by mishig25 and claude
feat: incremental Meilisearch indexing with HF Hub ID tracker (#759)
feat: incremental Meilisearch indexing with HF Hub ID tracker (#759)

* feat: incremental Meilisearch indexing with HF Hub ID tracker

  - Add `--incremental` flag to `populate-search-engine` command: only embeds new/changed chunks and deletes stale ones, uploading directly to the main index (no temp-index swap needed).
  - Introduce `src/doc_builder/embeddings_tracker.py` to persist the set of indexed document IDs as JSON in the HF Hub dataset `hf-doc-build/doc-builder-embeddings-tracker`.
  - Add `migrations/export_meili_ids_to_hf.py` as a one-time bootstrap script to seed the tracker from the existing Meilisearch index.
  - Add `get_all_document_ids` and `delete_documents_from_db` helpers to `meilisearch_helper.py`.
  - Replace all hardcoded `https://edge.meilisearch.com` references with a `--meilisearch_url` CLI argument (mirrors the existing `--meilisearch_key` pattern) across all commands and migration scripts.
  - Update the CI workflow to use `--incremental` and supply the new `MEILISEARCH_URL` and `HF_TOKEN` secrets.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* check

* Apply suggestion from @mishig25

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent 25e4f59 commit 7566b95
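At its core, the incremental mode described in the commit message reduces to a set difference between the tracker's previously indexed document IDs and the IDs produced by the current build: new IDs get embedded and uploaded, missing IDs get deleted as stale. The sketch below illustrates that diff step only; the function name `compute_incremental_plan` is illustrative and not part of the actual doc-builder API.

```python
def compute_incremental_plan(tracked_ids: set[str], current_ids: set[str]) -> tuple[set[str], set[str]]:
    """Return (ids_to_add, ids_to_delete) relative to the tracker.

    Illustrative only: mirrors the set diff an incremental indexer performs,
    not the actual doc-builder implementation.
    """
    ids_to_add = current_ids - tracked_ids     # new/changed chunks to embed & upload
    ids_to_delete = tracked_ids - current_ids  # stale chunks to remove from the index
    return ids_to_add, ids_to_delete
```

If document IDs are derived from chunk content (an assumption about this codebase), a changed chunk shows up as one new ID plus one stale old ID, so this single diff covers additions, edits, and deletions at once.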

File tree

9 files changed: +343 -93 lines changed


.github/workflows/populate_search_engine.yml

Lines changed: 3 additions & 56 deletions
@@ -34,59 +34,6 @@ jobs:
           HF_IE_URL: ${{ secrets.HF_IE_URL }}
           HF_IE_TOKEN: ${{ secrets.HF_IE_TOKEN }}
           MEILISEARCH_KEY: ${{ secrets.MEILISEARCH_KEY }}
-        run: uv run doc-builder populate-search-engine
-
-  # gradio-job:
-  #   runs-on: ubuntu-latest
-  #   steps:
-  #     - name: Checkout doc-builder
-  #       uses: actions/checkout@v4
-
-  #     - name: Install uv
-  #       uses: astral-sh/setup-uv@v4
-  #       with:
-  #         version: "latest"
-
-  #     - name: Set up Python 3.10
-  #       run: uv python install 3.10
-
-  #     - name: Install doc-builder
-  #       run: uv sync --extra dev
-
-  #     - name: Add gradio docs to meilisearch
-  #       env:
-  #         HF_IE_URL: ${{ secrets.HF_IE_URL }}
-  #         HF_IE_TOKEN: ${{ secrets.HF_IE_TOKEN }}
-  #         MEILISEARCH_KEY: ${{ secrets.MEILISEARCH_KEY }}
-  #       run: uv run doc-builder add-gradio-docs
-
-  # cleanup-job:
-  #   needs: [process-docs, gradio-job]
-  #   runs-on: ubuntu-latest
-  #   if: always() # This ensures that the cleanup job runs regardless of the result
-  #   steps:
-  #     - name: Checkout doc-builder
-  #       uses: actions/checkout@v4
-
-  #     - name: Install uv
-  #       uses: astral-sh/setup-uv@v4
-  #       with:
-  #         version: "latest"
-
-  #     - name: Set up Python 3.10
-  #       run: uv python install 3.10
-
-  #     - name: Install doc-builder
-  #       run: uv sync --extra dev
-
-  #     - name: Success Cleanup
-  #       if: needs.process-docs.result == 'success' # Runs if job succeeded
-  #       env:
-  #         MEILISEARCH_KEY: ${{ secrets.MEILISEARCH_KEY }}
-  #       run: uv run doc-builder meilisearch-clean --swap
-
-  #     - name: Failure Cleanup
-  #       if: needs.process-docs.result == 'failure' # Runs if job failed
-  #       env:
-  #         MEILISEARCH_KEY: ${{ secrets.MEILISEARCH_KEY }}
-  #       run: uv run doc-builder meilisearch-clean
-
+          MEILISEARCH_URL: ${{ secrets.MEILISEARCH_URL }}
+          HF_TOKEN: ${{ secrets.HF_EMBED_DATASETS_TOKEN }}
+        run: uv run doc-builder populate-search-engine --incremental

migrations/clear_meili_index.py

Lines changed: 2 additions & 1 deletion
@@ -18,12 +18,13 @@
 def main():
     parser = argparse.ArgumentParser(description="Delete all documents from a Meilisearch index")
     parser.add_argument("--meilisearch_key", type=str, required=True, help="Meilisearch API key")
+    parser.add_argument("--meilisearch_url", type=str, required=True, help="Meilisearch URL")
     parser.add_argument("--temp", action="store_true", help="Clear the temp index instead of the main index")
     args = parser.parse_args()

     index_name = MEILI_INDEX_TEMP if args.temp else MEILI_INDEX

-    client = meilisearch.Client("https://edge.meilisearch.com", args.meilisearch_key)
+    client = meilisearch.Client(args.meilisearch_url, args.meilisearch_key)
     clear_embedding_db(client, index_name)
     print(f"[meilisearch] successfully cleared all documents from {index_name}")

migrations/create_meili_index.py

Lines changed: 2 additions & 1 deletion
@@ -18,12 +18,13 @@
 def main():
     parser = argparse.ArgumentParser(description="Create a Meilisearch index for docs semantic search")
     parser.add_argument("--meilisearch_key", type=str, required=True, help="Meilisearch API key")
+    parser.add_argument("--meilisearch_url", type=str, required=True, help="Meilisearch URL")
     parser.add_argument("--temp", action="store_true", help="Create the temp index instead of the main index")
     args = parser.parse_args()

     index_name = MEILI_INDEX_TEMP if args.temp else MEILI_INDEX

-    client = meilisearch.Client("https://edge.meilisearch.com", args.meilisearch_key)
+    client = meilisearch.Client(args.meilisearch_url, args.meilisearch_key)
     create_embedding_db(client, index_name)
     update_db_settings(client, index_name)
     print(f"[meilisearch] successfully created {index_name}")
migrations/export_meili_ids_to_hf.py

Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
+#!/usr/bin/env python3
+"""
+One-time migration script: export all existing Meilisearch document IDs to the
+HF Hub tracker dataset (hf-doc-build/doc-builder-embeddings-tracker).
+
+This bootstraps the tracker so that subsequent `populate-search-engine
+--incremental` runs can diff against it instead of re-indexing everything.
+
+Usage:
+    uv run python migrations/export_meili_ids_to_hf.py \
+        --meilisearch_key <key> \
+        --meilisearch_url <url> \
+        [--hf_token <token>]  # falls back to HF_TOKEN env var
+"""
+
+import argparse
+
+import meilisearch
+
+from doc_builder.build_embeddings import MEILI_INDEX
+from doc_builder.embeddings_tracker import save_tracker
+from doc_builder.meilisearch_helper import get_all_document_ids
+
+
+def main():
+    parser = argparse.ArgumentParser(description="Export all Meilisearch document IDs to the HF Hub tracker dataset.")
+    parser.add_argument("--meilisearch_key", type=str, required=True, help="Meilisearch API key")
+    parser.add_argument("--meilisearch_url", type=str, required=True, help="Meilisearch URL")
+    parser.add_argument(
+        "--hf_token",
+        type=str,
+        required=False,
+        default=None,
+        help="HuggingFace token with write access (falls back to HF_TOKEN env var)",
+    )
+    args = parser.parse_args()
+
+    client = meilisearch.Client(args.meilisearch_url, args.meilisearch_key)
+
+    print(f"Fetching all document IDs from Meilisearch index '{MEILI_INDEX}'...")
+    ids = get_all_document_ids(client, MEILI_INDEX)
+    print(f"Found {len(ids)} documents in '{MEILI_INDEX}'")
+
+    print("Pushing ID list to HF Hub tracker...")
+    save_tracker(ids, hf_token=args.hf_token)
+    print("Done. The tracker is now ready for incremental updates.")
+
+
+if __name__ == "__main__":
+    main()
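The script above relies on `get_all_document_ids`, whose implementation is not part of this diff. A plausible sketch is to page through the index collecting primary keys; the call shape (`index.get_documents` with `offset`/`limit`/`fields`, returning results plus a total count) follows the meilisearch-python client, but the helper below and its page size are assumptions, not the actual `meilisearch_helper.py` code.

```python
def get_all_document_ids_sketch(client, index_name, page_size=1000):
    """Collect every document's "id" field by paging through the index.

    Hypothetical reconstruction of doc-builder's `get_all_document_ids`.
    """
    index = client.index(index_name)
    ids, offset = [], 0
    while True:
        # Fetch only the "id" field to keep payloads small.
        batch = index.get_documents({"limit": page_size, "offset": offset, "fields": ["id"]})
        ids.extend(doc.id for doc in batch.results)
        if offset + page_size >= batch.total:
            break
        offset += page_size
    return ids
```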

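Per the commit message, the tracker itself is just the set of indexed document IDs persisted as JSON. The real `embeddings_tracker.save_tracker` pushes this to the HF Hub dataset `hf-doc-build/doc-builder-embeddings-tracker`; the local-file sketch below only illustrates the persistence shape, and the file name and sorted-list schema are assumptions.

```python
import json
from pathlib import Path


def save_tracker_local(ids, path: Path) -> None:
    """Persist a set of document IDs as a JSON list (local stand-in for the Hub push)."""
    # Sorting keeps the JSON stable and diff-friendly across runs.
    path.write_text(json.dumps(sorted(ids), indent=2))


def load_tracker_local(path: Path) -> set:
    """Load the tracked ID set; a missing tracker means nothing is indexed yet."""
    if not path.exists():
        return set()
    return set(json.loads(path.read_text()))
```

Treating a missing tracker as the empty set makes the very first incremental run equivalent to a full index, which is why the bootstrap script above only needs to run once.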
migrations/swap_meili_indexes.py

Lines changed: 2 additions & 1 deletion
@@ -17,9 +17,10 @@
 def main():
     parser = argparse.ArgumentParser(description="Swap main and temp Meilisearch indexes")
     parser.add_argument("--meilisearch_key", type=str, required=True, help="Meilisearch API key")
+    parser.add_argument("--meilisearch_url", type=str, required=True, help="Meilisearch URL")
     args = parser.parse_args()

-    client = meilisearch.Client("https://edge.meilisearch.com", args.meilisearch_key)
+    client = meilisearch.Client(args.meilisearch_url, args.meilisearch_key)
     swap_indexes(client, MEILI_INDEX, MEILI_INDEX_TEMP)
     print(f"[meilisearch] successfully swapped {MEILI_INDEX} and {MEILI_INDEX_TEMP}")

src/doc_builder/build_embeddings.py

Lines changed: 8 additions & 7 deletions
@@ -43,8 +43,8 @@
     "text source_page_url source_page_title library embedding heading1 heading2 heading3 heading4 heading5 page",
 )

-MEILI_INDEX = "docs-semantic-search-v2"
-MEILI_INDEX_TEMP = "docs-semantic-search-v2-temp"
+MEILI_INDEX = "docs-semantic-search"
+MEILI_INDEX_TEMP = "docs-semantic-search-temp"

 _re_md_anchor = re.compile(r"\[\[(.*)]]")
 _re_non_alphaneumeric = re.compile(r"[^a-z0-9\s]+", re.IGNORECASE)
@@ -778,6 +778,7 @@ def build_embeddings(
     hf_ie_url,
     hf_ie_token,
     meilisearch_key,
+    meilisearch_url,
     version="main",
     version_tag="main",
     language="en",
@@ -832,17 +833,17 @@ def build_embeddings(
     embeddings = call_embedding_inference(chunks, hf_ie_url, hf_ie_token, is_python_module)

     # Step 3: push embeddings to vector database (meilisearch)
-    client = meilisearch.Client("https://edge.meilisearch.com", meilisearch_key)
+    client = meilisearch.Client(meilisearch_url, meilisearch_key)
     ITEMS_PER_CHUNK = 5000  # a value that was found experimentally
     for chunk_embeddings in tqdm(chunk_list(embeddings, ITEMS_PER_CHUNK), desc="Uploading data to meilisearch"):
         add_embeddings_to_db(client, MEILI_INDEX_TEMP, chunk_embeddings)


-def clean_meilisearch(meilisearch_key: str, swap: bool):
+def clean_meilisearch(meilisearch_key: str, swap: bool, meilisearch_url: str):
     """
     Swap & delete temp index.
     """
-    client = meilisearch.Client("https://edge.meilisearch.com", meilisearch_key)
+    client = meilisearch.Client(meilisearch_url, meilisearch_key)
     if swap:
         swap_indexes(client, MEILI_INDEX, MEILI_INDEX_TEMP)
     delete_embedding_db(client, MEILI_INDEX_TEMP)
@@ -851,7 +852,7 @@ def clean_meilisearch(meilisearch_key: str, swap: bool):
     print("[meilisearch] successfully swapped & deleted temp index.")


-def add_gradio_docs(hf_ie_url: str, hf_ie_token: str, meilisearch_key: str):
+def add_gradio_docs(hf_ie_url: str, hf_ie_token: str, meilisearch_key: str, meilisearch_url: str):
     """Add Gradio documentation to embeddings."""
     # Step 1: download the documentation
     url = "https://huggingface.co/datasets/gradio/docs/resolve/main/docs.json"
@@ -894,7 +895,7 @@ def add_gradio_docs(hf_ie_url: str, hf_ie_token: str, meilisearch_key: str):
     embeddings.extend(batch_embeddings)

     # Step 3: push embeddings to vector database (meilisearch)
-    client = meilisearch.Client("https://edge.meilisearch.com", meilisearch_key)
+    client = meilisearch.Client(meilisearch_url, meilisearch_key)
     ITEMS_PER_CHUNK = 5000  # a value that was found experimentally
     for chunk_embeddings in tqdm(chunk_list(embeddings, ITEMS_PER_CHUNK), desc="Uploading gradio docs to meilisearch"):
         add_embeddings_to_db(client, MEILI_INDEX_TEMP, chunk_embeddings)
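The upload loops in this file batch documents in groups of 5000 via `chunk_list` before calling `add_embeddings_to_db`. The helper's implementation is not shown in this diff; a minimal equivalent (a sketch, not necessarily the doc-builder version) is:

```python
def chunk_list(items, chunk_size):
    """Yield successive chunk_size-sized slices of items."""
    for i in range(0, len(items), chunk_size):
        yield items[i:i + chunk_size]
```

Batching uploads this way keeps each Meilisearch request payload bounded regardless of how many embeddings a build produces.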
