test: Add HF Space Dockerfile using pre-built leaderboard image #3838
Conversation
Fixes #3821

Adds a lightweight Dockerfile for HuggingFace Space deployment that uses the pre-built ghcr.io/embeddings-benchmark/mteb/leaderboard image as its base. Also adds a workflow to test the Dockerfile.

Note: the workflow has GITHUB_TOKEN, so it is able to pull the base image. That token is not visible from the HF Space, so the images need to be made public.

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
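For reference, a minimal sketch of what such a Space Dockerfile could look like (the base tag and Gradio settings below are illustrative assumptions, not the exact contents of the file in this PR):

```dockerfile
# Reuse the leaderboard image that CI already builds and pushes to GHCR,
# instead of rebuilding mteb and its dependencies inside the Space.
# The ":latest" tag is an assumption; a pinned digest would also work.
FROM ghcr.io/embeddings-benchmark/mteb/leaderboard:latest

# HF Spaces route external traffic to port 7860 by default,
# so point Gradio at that port on all interfaces.
ENV GRADIO_SERVER_NAME=0.0.0.0 \
    GRADIO_SERVER_PORT=7860
EXPOSE 7860
```

Keeping the Space Dockerfile this thin means updates ship by rebuilding the GHCR image; the Space itself only needs a restart.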
@Samoed can you add a secret in the HF space? https://huggingface.co/docs/hub/spaces-overview#managing-secrets
I don't think that token would work, because we would need to authorize somehow to pull the image. I can't find documentation for that on HF.
AFAIU from what I tried, one of these cases needs to be true to pull the image: either the package is public, or the pull is authenticated with a token that GHCR accepts.
Would you mind adding the token so we can try this out?

In CI you basically do:
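(Presumably something like the standard GHCR login step; a sketch, since the exact snippet was elided here:)

```yaml
# Authenticate to ghcr.io with the workflow's built-in token so that
# pulling ghcr.io/embeddings-benchmark/mteb/leaderboard succeeds.
- name: Log in to GitHub Container Registry
  uses: docker/login-action@v3
  with:
    registry: ghcr.io
    username: ${{ github.actor }}
    password: ${{ secrets.GITHUB_TOKEN }}
```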
Ah yes, you're right. We'll have to wait for the packages to be made public then.

public!

@isaac-chung feel free to merge that - it looks good (but do it at a time when you have time to check if the leaderboard breaks)

@KennethEnevoldsen btw I don't have write permissions to the HF LB space, so I'm not able to merge that PR yet :(

Oh, you should have write access - you should have it now

Thanks! Confirming that I now see the merge button.

Do we need this PR now?

Yes. The point is to keep a copy of the LB Dockerfile, so we won't miss updates in the future.

LB restarted and seems to be running well 🎉

Before we merge this, I'd like to include a small CI workflow that simply runs this HF Space Dockerfile.

Yeah, it would be great to confirm that this runs, so that we see the error here before we see it on the leaderboard.
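Something along these lines, perhaps (a sketch of such a smoke test; job and tag names are made up, and the real workflow's startup checks are more involved):

```yaml
# Minimal job: build the Space Dockerfile and make sure the container
# starts and stays up before the Space ever sees it.
jobs:
  test-hf-space-dockerfile:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Log in to GHCR (needed to pull the base image)
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Build image
        run: docker build -f Dockerfile.hf-space -t hf-space-test .
      - name: Smoke-test container startup
        run: |
          docker run -d --name hf-space -p 7860:7860 hf-space-test
          sleep 60
          docker ps | grep hf-space
```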
But don't we already test it in https://github.com/embeddings-benchmark/mteb/blob/main/.github/workflows/leaderboard_docker.yml?
Samoed left a comment:

I think we're basically testing the same image with the same command in https://github.com/embeddings-benchmark/mteb/blob/main/.github/workflows/leaderboard_docker.yml
The workflow's trigger section under review:

```yaml
on:
  push:
    branches: [ main ]
    paths:
      - 'Dockerfile.hf-space'
      - '.github/workflows/hf_space_docker.yml'
  pull_request:
    branches: [ main ]
    paths:
      - 'Dockerfile.hf-space'
      - '.github/workflows/hf_space_docker.yml'
```
I think this action won't be triggered
Did you check the right sections? https://github.com/embeddings-benchmark/mteb/actions/runs/20725664296/job/59500466946?pr=3838
I checked this. It will be triggered only on changes to the Spaces Dockerfile, but I'm not sure that will be enough for testing.
I think it's better to move these checks to leaderboard_docker.yml to test this more frequently and make sure that our leaderboard will work
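A sketch of what that consolidation could look like (path names are illustrative; the real leaderboard_docker.yml may organize its triggers differently):

```yaml
# leaderboard_docker.yml could watch both Dockerfiles, so every change
# to either image re-runs the same startup test.
on:
  push:
    branches: [ main ]
    paths:
      - 'Dockerfile'
      - 'Dockerfile.hf-space'
      - '.github/workflows/leaderboard_docker.yml'
  pull_request:
    branches: [ main ]
    paths:
      - 'Dockerfile'
      - 'Dockerfile.hf-space'
      - '.github/workflows/leaderboard_docker.yml'
```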
Correct, that's what the paths section of the workflow file shows. I'm not sure there's new info here.
Keeping these tests separate makes more sense.
Once again, the point of this test is to track the exact same Dockerfile used in the LB, so we can catch errors here before we see them on the leaderboard. Even though there's a lot of overlap, this test completes the LB test coverage.
Let's try to keep the nits to a minimum if possible :) going to merge now