Closed
Description
Summary of Meilisearch Indexing Changes
Workflow Changes
- New populate-search-engine command (Add populate-search-engine command for processing HF doc-build dataset #685): Replaced the old `Daily Build Embeddings` matrix job with a streamlined command that fetches pre-built docs from the hf-doc-build/doc-build dataset and chunks the markdown by heading.
- Improved document ID generation (Improve Meilisearch document ID generation #719): Changed to a readable `{library}-{page}-{hash}` format instead of full SHA1 hashes.
- Migration scripts for index management (feat: add scripts for Meilisearch index management #718): Added scripts for clearing and creating Meilisearch indexes.
- Index swap script (feat: add script to swap Meilisearch indexes #720): Added a script to swap indexes atomically.
- Simplified workflow (Remove cleanup job from populate_search_engine workflow configuration #717): Removed the cleanup job that previously handled success/failure scenarios with automatic index swapping.
- Refactored embedding inference (refactor: update embedding inference to use URL and token directly #711): Updated to use a direct URL and token approach (`HF_IE_URL`) instead of the previous name and namespace pattern.
- Incremental embeddings mode (Add incremental embeddings mode to reduce costs #737): Added an `--incremental` flag to process only new or changed documents. Tracks document IDs in the `hf-doc-build/doc-builder-embeddings-tracker` dataset. Automatically removes stale entries when pages are updated or deleted. Reduces costs by avoiding re-embedding unchanged documents.
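To make the heading-based chunking and the `{library}-{page}-{hash}` ID format concrete, here is a minimal sketch. It is not the code from #685 or #719: the function names `chunk_markdown` and `make_document_id`, the 8-character hash length, and the choice to hash the chunk content are all assumptions made for illustration. The slug step reflects the fact that Meilisearch document IDs may only contain alphanumerics, hyphens, and underscores.

```python
import hashlib
import re

# ATX-style markdown heading: 1 to 6 '#' characters followed by text.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def chunk_markdown(text):
    """Split markdown into (heading, body) chunks at each heading line.

    Hypothetical helper: lines inside fenced code blocks are never
    treated as headings, so '# comment' in code stays in the body.
    """
    chunks = []
    heading, lines = "", []
    in_fence = False
    for line in text.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else HEADING_RE.match(line)
        if match:
            body = "\n".join(lines).strip()
            if heading or body:
                chunks.append((heading, body))
            heading, lines = match.group(2), []
        else:
            lines.append(line)
    body = "\n".join(lines).strip()
    if heading or body:
        chunks.append((heading, body))
    return chunks


def make_document_id(library, page, content, hash_len=8):
    """Build a readable '{library}-{page}-{hash}' ID (assumed scheme).

    The short hash is a truncated SHA1 of the chunk content; the slug
    replaces characters that Meilisearch does not accept in IDs.
    """
    digest = hashlib.sha1(content.encode("utf-8")).hexdigest()[:hash_len]
    slug = re.sub(r"[^A-Za-z0-9_-]+", "-", f"{library}-{page}").strip("-")
    return f"{slug}-{digest}"
```

With this sketch, `make_document_id("transformers", "main_classes/trainer", body)` yields an ID that starts with `transformers-main_classes-trainer-` followed by the short hash, so an ID can be traced back to its library and page at a glance.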
TODO
- Initial populate of the docs semantic search index
- Use this new vector index for hf.co/docs embedding endpoints
- Create PR that will efficiently add vectors only to changed pages (feat: incremental Meilisearch indexing with HF Hub ID tracker #759)
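The incremental approach (tracking document IDs and embedding only what changed) can be sketched as a set difference between the tracked IDs and the IDs produced by the current build. This is an illustrative sketch, not the implementation in #737 or #759: `plan_incremental_update` is a hypothetical name, and it assumes the hash in each ID changes whenever the chunk content changes, so an edited page shows up as one new ID plus one stale ID.

```python
def plan_incremental_update(tracked_ids, current_ids):
    """Compare previously indexed IDs against the IDs of the current build.

    Hypothetical helper. Returns which documents need new embeddings,
    which stale entries should be deleted, and which can be skipped.
    """
    tracked, current = set(tracked_ids), set(current_ids)
    return {
        "to_embed": sorted(current - tracked),    # new or changed chunks
        "to_delete": sorted(tracked - current),   # removed or outdated chunks
        "unchanged": sorted(current & tracked),   # no re-embedding needed
    }


# Example: one page edited (hash changed), one page deleted, one untouched.
plan = plan_incremental_update(
    tracked_ids=["lib-intro-aaaa1111", "lib-setup-bbbb2222", "lib-old-cccc3333"],
    current_ids=["lib-intro-aaaa1111", "lib-setup-dddd4444"],
)
```

Here only `lib-setup-dddd4444` would be sent to the embedding endpoint, while `lib-setup-bbbb2222` and `lib-old-cccc3333` would be removed from both the index and the tracker.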