
Conversation

@isaac-chung
Collaborator

isaac-chung commented Jan 3, 2026

Fixes #3821 ("MTEB leaderboard down?")

Adds a lightweight Dockerfile for HuggingFace Space deployment that uses the pre-built ghcr.io/embeddings-benchmark/mteb/leaderboard image as base. Also adds a workflow to test the Dockerfile.

Note: the workflow has access to GITHUB_TOKEN, so it can pull the base image in CI. That token isn't available from the HF Space, though, so the images need to be made public.


🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@isaac-chung
Collaborator Author

@Samoed can you add a secret in the HF Space? https://huggingface.co/docs/hub/spaces-overview#managing-secrets
I can remove the test.

@Samoed
Member

Samoed commented Jan 4, 2026

I don't think that token would work, because we would need some way to authorize the image pull. I can't find documentation for that on HF.

@isaac-chung
Collaborator Author

isaac-chung commented Jan 4, 2026

> I don't think that token would work, because we would need some way to authorize the image pull. I can't find documentation for that on HF.

AFAIU from what I tried, we need either of these to be true to pull the image:

  1. Make the images/packages in this repo public, OR
  2. Use GITHUB_TOKEN and pull the image normally (as done in CI — roughly the sketch below)

Would you mind adding the token and we can try this out?

@Samoed
Member

Samoed commented Jan 4, 2026

> Use GITHUB_TOKEN and pull the image normally (as done in CI)

In CI you basically do `docker login ...`, but we can't do the same on HF, because we can't run commands before the image starts.

@isaac-chung
Collaborator Author

Ah yes, you're right. We'll have to wait for the packages to be made public then.

@KennethEnevoldsen
Contributor

public!

isaac-chung requested a review from Samoed January 4, 2026 11:20
@isaac-chung
Collaborator Author

Also https://huggingface.co/spaces/mteb/leaderboard/discussions/175

@KennethEnevoldsen
Contributor

@isaac-chung feel free to merge that - it looks good (but do it at a time when you have time to check whether the leaderboard breaks)

@isaac-chung
Collaborator Author

@KennethEnevoldsen btw I don't have write permissions to the HF LB space, so I'm not able to merge that PR yet :(

@KennethEnevoldsen
Contributor

Oh, you should have write access - you should have it now

@isaac-chung
Collaborator Author

Thanks! Confirming that I now see the merge button.

@Samoed
Member

Samoed commented Jan 4, 2026

Do we need this PR now?

@isaac-chung
Collaborator Author

Yes. The point is to keep a copy of the LB Dockerfile, so we won't miss updates in the future.

@isaac-chung
Collaborator Author

LB restarted and seems to be running well 🎉

@isaac-chung
Collaborator Author

Before we merge this, I'd like to include a small CI workflow that simply runs this HF Space Dockerfile.

@KennethEnevoldsen
Contributor

Yeah, it would be great to confirm that this runs, so that we see any error here before we see it on the leaderboard.


isaac-chung changed the title from "Add HF Space Dockerfile using pre-built leaderboard image" to "test: Add HF Space Dockerfile using pre-built leaderboard image" Jan 5, 2026
Member

Samoed left a comment

I think we're basically testing the same image with the same command in https://github.com/embeddings-benchmark/mteb/blob/main/.github/workflows/leaderboard_docker.yml

Comment on lines +5 to +13

```yaml
    branches: [ main ]
    paths:
      - 'Dockerfile.hf-space'
      - '.github/workflows/hf_space_docker.yml'
  pull_request:
    branches: [ main ]
    paths:
      - 'Dockerfile.hf-space'
      - '.github/workflows/hf_space_docker.yml'
```
Member


I think this action won't be triggered


Member

Samoed commented Jan 5, 2026

I checked this. It will be triggered only on changes to the Dockerfile for Spaces, but I'm not sure that will be enough to test it then.

Member

I think it's better to move these checks to leaderboard_docker.yml to test this more frequently and make sure that our leaderboard will work

Collaborator Author

Correct, that's what the paths section of the workflow file shows. I'm not sure there's new info here.

Collaborator Author

Keeping these tests separate makes more sense.

@isaac-chung
Collaborator Author

Once again, the point of this test is to track the exact same Dockerfile used in the LB. We will then be able to:

  1. Track any code that will affect the LB in the HF Space, AND
  2. Test the exact same Dockerfile used in the HF Space

Even though there's a lot of overlap, this test completes the LB test coverage; e.g., if the images had not been made public, this test would have failed.

@isaac-chung
Collaborator Author

Let's try to keep the nits to a minimum if possible :) Going to merge now.

isaac-chung merged commit b905e27 into main Jan 5, 2026
11 checks passed
isaac-chung deleted the hf-space-dockerfile branch January 5, 2026 21:26
isaac-chung added a commit that referenced this pull request Jan 7, 2026
* feat: Added `get_benchmark_result()` to BenchmarkResults to obtain a benchmark table (#3771)

* Update BenchmarkResults to output results of benchmark

* added score column and correct TYPE_CHECKING

* address comments

* address comments

* fix import

* fix tests

* fix tests

* change BenchmarkResults to Pydantic dataclass

* change benchmark to pydantic dataclass

* fix tests

* fix model

* fix

* lint

* remove future

* fix after review

* add test

* reapply comments from review

* remove mock benchmark

* add documentation

* added actual results

* Update docs/usage/loading_results.md

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* add actual results

---------

Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* 2.5.0

Automatically generated by python-semantic-release

* fix: legacy clustering processing (#3791)

fix clustering processing

* 2.5.1

Automatically generated by python-semantic-release

* better clustering fix (#3793)

* docs: update MIEB contributing guide for MTEB v2 AbsTask structure (#3787)

* docs: update MIEB contributing guide for MTEB v2 AbsTask structure

* Update docs/mieb/readme.md

* Update docs/mieb/readme.md

* model: add octen_models (#3789)

* model: add octen_models

* add issue link for document prompt

* Add leaderboard timing logs and join_revisions() speedups (#3790)

* feat: add detailed timing logs to leaderboard initialization

Add comprehensive timing information to track performance of each step
in the leaderboard building process:
- Loading benchmark results (from cache or remote)
- Fetching and processing benchmarks
- Filtering models and generating tables
- Creating Gradio components and interface
- Prerun phase for cache population

Each step logs start and completion times with elapsed duration to help
identify performance bottlenecks during leaderboard initialization.

* perf: optimize benchmark processing with caching and vectorized operations

Implemented 3 high-impact optimizations to reduce benchmark processing time:

1. Cache get_model_metas() calls using @functools.lru_cache
   - Eliminates 59 redundant calls (once per benchmark)
   - Now called once and cached for all benchmarks

2. Replace pandas groupby().apply() with vectorized operations
   - Replaced deprecated .apply(keep_best) pattern
   - Uses sort_values() + groupby().first() instead
   - Avoids nested function calls per group

3. Cache version string parsing with @functools.lru_cache
   - Eliminates redundant parsing of same version strings
   - Uses LRU cache with 10,000 entry limit

Performance improvements:
- Benchmark processing: 131.17s → 44.73s (2.93x faster, 66% reduction)
- join_revisions(): 84.96s → 1.73s (49x faster, 98% reduction)
- Leaderboard Step 3: 121.28s → 48.23s (2.51x faster, 60% reduction)

This significantly improves leaderboard startup time by reducing the
benchmark processing bottleneck.

* Update mteb/leaderboard/app.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* fix: ensure deterministic revision grouping in join_revisions()

- Replace groupby(revision_clean) with groupby(revision)
- Remove non-deterministic iloc[0] access for revision selection
- Tasks with different original revisions (None vs external) now kept separate
- Each ModelResult has consistent revision across all its task_results

This resolves the issue where tasks with different original revisions that mapped
to the same cleaned value would be grouped together non-deterministically.

* refactor: use default lru_cache maxsize for _get_cached_model_metas

* refactor: remove optimization markers from comments

* Apply suggestion from @isaac-chung

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Optimize validate filter scores only (#3792)

* feat: add detailed timing logs to leaderboard initialization

Add comprehensive timing information to track performance of each step
in the leaderboard building process:
- Loading benchmark results (from cache or remote)
- Fetching and processing benchmarks
- Filtering models and generating tables
- Creating Gradio components and interface
- Prerun phase for cache population

Each step logs start and completion times with elapsed duration to help
identify performance bottlenecks during leaderboard initialization.

* perf: optimize benchmark processing with caching and vectorized operations

Implemented 3 high-impact optimizations to reduce benchmark processing time:

1. Cache get_model_metas() calls using @functools.lru_cache
   - Eliminates 59 redundant calls (once per benchmark)
   - Now called once and cached for all benchmarks

2. Replace pandas groupby().apply() with vectorized operations
   - Replaced deprecated .apply(keep_best) pattern
   - Uses sort_values() + groupby().first() instead
   - Avoids nested function calls per group

3. Cache version string parsing with @functools.lru_cache
   - Eliminates redundant parsing of same version strings
   - Uses LRU cache with 10,000 entry limit

Performance improvements:
- Benchmark processing: 131.17s → 44.73s (2.93x faster, 66% reduction)
- join_revisions(): 84.96s → 1.73s (49x faster, 98% reduction)
- Leaderboard Step 3: 121.28s → 48.23s (2.51x faster, 60% reduction)

This significantly improves leaderboard startup time by reducing the
benchmark processing bottleneck.

* perf: optimize validate_and_filter_scores filtering logic

* Update mteb/results/task_result.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* fix: Add model_type in model_meta for all models (#3751)

* Add model_type in model_meta for all models

* added literal for model_type

* update jina embedding model type

* Added model_type to from_cross_encoder() method

* update test

* change location in model_meta to pass test

* update late_interaction model and fix test

* update late_interaction for colnomic models

* update test

* Update mteb/models/model_meta.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* fix naming

* remove is_cross_encoder field and convert it into property

---------

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* 2.5.2

Automatically generated by python-semantic-release

* fix: Added warnings.warn when logging warnings (#3753)

* Added warnings.warn when logging warnings

* address comments

* Added depreciation warning

* made better

* address comments

* address comments

* address comments

---------

Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>

* 2.5.3

Automatically generated by python-semantic-release

* save kwargs passed to get_model in model_meta (#3785)

* save kwargs passed to get_model in model_meta

* add save_kwargs to load_model

* removed copy of meta

* Update mteb/models/model_meta.py

* try to run with kwargs

* try to move kwargs

* add tests

* change model in tests

---------

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>

* fix: add typecheck (#3550)

* add pytyped

* start typing

* finish evaluators

* add more types

* Update mteb/results/benchmark_results.py

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>

* apply comments

* continue typechecking

* fix typehint

* typechecking

* fix tests

* fix type errors again

* fix cache

* add more types

* fix method

* roll back pyproject

* activate PGH

* install more types

* almost finish

* fix search wrappers

* add ci

* fix tests

* fix 3.10 types

* rollback overload

* fixes after merge

* change to iterable

* add fixes

* remove summarization scores hint

* simplify deprecated_evaluator

* simplify model conversion

* add comment for typechecking

* remove casts

* remove duplicated function

---------

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>

* 2.5.4

Automatically generated by python-semantic-release

* Add benchmark aliases (#3767)

* add benchmark aliases

* split to aliases

* move aliases

* create aliases in separate function

* simplify a bit

* add test

* Apply suggestions from code review

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>

* add default value

* add MTEB alias

---------

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>

* Add function for creating mock images  (#3803)

* create function for creating mock tasks

* add annotations

* docs: add benchmark filtering examples (#3805)

* docs: add benchmark filtering examples

* Apply suggestion from @Samoed

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* docs: remove custom benchmarks subsection

* docs: expand filtering section with content tabs

* docs: fix code block indentation in content tabs

* build: include docs deps in dev group

---------

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* update generate_model_card with get_benchmark_result() (#3796)

* update generate_model_card with get_benchmark_result()

* add support for list of benchmarks

* split parameters

* fix type

* generate card

* add tests

* add tests

* add tabulate to test dependencies

* correct tests

---------

Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>

* Update the API of Bytedance/Seed1.6-embedding-1215 (#3814)

* update reference website of Seed1.6-embedding-1215

* update Bytedance/Seed1.6-embedding-1215 model

* fix: repo exists check (#3813)

* fix repo exists check

* add test

* 2.5.5

Automatically generated by python-semantic-release

* feat: Add leaderboard CLI command (#3802)

* feat: add leaderboard CLI command with cache-path option

* test: add comprehensive tests for leaderboard CLI command

* try to fix install

* fix: lazy-load leaderboard to avoid requiring deps for CLI

* Update mteb/cli/build_cli.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* make lint

* remove AGENTS.md

* move import to top of file

* log the default cache path

* Improve leaderboard tests to verify actual cache paths

Address PR feedback by modifying leaderboard tests to verify the actual
cache paths passed to get_leaderboard_app instead of mocking ResultCache.

- Updated test_leaderboard_custom_cache_path to create real ResultCache instances
  and verify the correct custom cache path is used
- Updated test_leaderboard_default_cache to verify the default cache path is used
- Removed ResultCache mocking in favor of testing actual cache behavior
- Used patch.dict to mock the leaderboard module import while preserving
  real cache functionality

This provides better test coverage by validating that the cache objects
passed to the leaderboard app have the correct paths, as suggested in
PR comment: #3802 (comment)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

* Combine leaderboard cache tests using pytest parametrize

Address PR feedback by combining test_leaderboard_custom_cache_path and
test_leaderboard_default_cache into a single parametrized test.

- Created test_leaderboard_cache_paths with parametrize decorator
- Tests both custom cache path and default cache path scenarios
- Each test case covers different host, port, and share configurations
- Removed redundant test_leaderboard_args as functionality is now covered
  by the parametrized test
- Improved test maintainability by reducing code duplication

This addresses PR comment: #3802 (comment)
"Can be combined with the following test using a parametrize argument"

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

* Update make run-leaderboard to use new CLI and remove app.py main block

Address PR feedback by updating the project to use the new leaderboard CLI:

- Updated Makefile run-leaderboard target to use `python -m mteb leaderboard`
  instead of `python -m mteb.leaderboard.app`
- Removed the `if __name__ == "__main__":` block from mteb/leaderboard/app.py
  as this functionality is now handled by the CLI command

This completes the integration of the new leaderboard CLI command into
the project's build system and removes deprecated direct module execution.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

* feat: add theme and head parameters to leaderboard CLI

* fix: suppress leaderboard warnings on CLI launch

* test: update leaderboard tests for theme and head params

* Revert "Update make run-leaderboard to use new CLI and remove app.py main block"

This reverts commit d4df501.

* Update mteb/cli/build_cli.py

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>

* docs: update leaderboard CLI usage

* update docs to show defaults

* fix: apply ruff formatting

---------

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: Claude Sonnet 4 <noreply@anthropic.com>
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>

* 2.6.0

Automatically generated by python-semantic-release

* Add filter for model type (#3799)

* Add filter for model type

* fix literal issue

* fix

* remove white space

* remove logic in filter_tasks

* remove info in leaderboard

* add tests

* update tests

* add default in model types

* fix model filter

---------

Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>

* add model: bflhc/Octen-Embedding-4B (#3816)

* fix: Download cached results zip from cached-data branch (#3795)

* Optimize leaderboard startup by downloading cached results from cached-data branch

- Modify _load_results() to first try downloading __cached_results.json.gz from the cached-data branch
- Only fallback to full repository clone if the direct download fails
- Add gzip decompression to handle the compressed cache file
- This reduces startup time significantly by avoiding full repo cloning when possible
- Added comprehensive logging to track download progress and fallback behavior

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

* make lint

* Fix leaderboard stability test with enhanced debugging

- Remove prevent_thread_lock=True to keep Gradio process alive
- Add comprehensive exception handling for HTTP, gzip, and file operations
- Optimize test completion with HTTP 200 health checking (300s → ~140s)
- Add detailed logging and warning suppressions for better debugging

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

* Update tests/test_leaderboard.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Add comprehensive tests for leaderboard caching exception handling

- Add 46 unit tests covering HTTP downloads, gzip decompression, file I/O, and JSON validation
- Reorganize leaderboard tests into focused modules for better maintainability
- Update Makefile with improved leaderboard test commands

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

* Increase cached results download size limit to 500MB

The cached results file has grown to ~92.7MB, exceeding the previous 50MB limit.
This change increases the limit to 500MB to accommodate current and future file sizes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

* Fix leaderboard tests by adding missing dependency to install-for-tests

GitHub Actions were failing because cachetools was not installed during CI test runs.
The leaderboard extra was already defined with cachetools>=5.2.0 but wasn't included
in the install-for-tests target used by CI.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

* Remove LogFlusher functionality from leaderboard app

Addresses PR comment feedback indicating the log flushing optimization
was unnecessary at this stage. Removes:
- LogFlusher class with batching logic
- Global _log_flusher instance
- _flush_logs() wrapper function
- All calls to _flush_logs() throughout the app
- Complete test file test_log_flushing.py

Leaderboard functionality remains unchanged and tests pass.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

* remove _validate_benchmark_json

* Refactor leaderboard caching to use ResultCache and consolidate tests

Move download_cached_results_from_branch to ResultCache class and reduce TestDownloadCachedResultsFromBranch from 23 to 13 test cases while maintaining full coverage.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

* Apply suggestions from code review

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* lint and remove unreachable code

* Move shared test fixtures to parent conftest.py

- Created tests/conftest.py with shared fixtures (mock_benchmark_json,
  mock_invalid_json, mock_gzipped_content) for use across all tests
- Removed duplicate fixtures from tests/test_leaderboard/conftest.py
- Kept leaderboard-specific fixtures in test_leaderboard/conftest.py
- Fixes TestDownloadCachedResultsFromBranch test failures by making
  fixtures accessible to test_result_cache.py

All 25 tests now passing (23 in test_result_cache.py, 2 in test_integration.py)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* make method private

* Fix content type validation test to match implementation behavior

The test_content_type_handling test was expecting warnings for unexpected
content types, but the actual implementation raises exceptions. Updated test
to use pytest.raises() for proper exception validation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

* update cache based on review comments

* type check

* Remove unused leaderboard_test_config fixture

* fix: remove unused mock_invalid_json fixture

* rm AGENTS/,d

* reduce number of excepts in app.py

---------

Co-authored-by: Claude Sonnet 4 <noreply@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* 2.6.1

Automatically generated by python-semantic-release

* ci: Switch CI to use `uv` (#3702)

* use uv to all make commands

* read the docs a bit more...

* try out system flag

* fix: remove redundant pip install uv commands from Makefile

Removes duplicate uv installations that were conflicting with the
properly configured uv from astral-sh/setup-uv GitHub Action.
The GitHub Action already installs and configures uv correctly,
so the Makefile pip installs were overwriting this configuration
and causing "No system Python installation found" errors.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

* fix: remove --system flag from uv pip install commands

The astral-sh/setup-uv GitHub Action configures uv to manage its own
Python installations, not to use system Python. The --system flag
was causing "No system Python installation found" errors because
uv expects to use its managed Python environment.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

* fix: migrate Makefile to use correct uv workflow

- Replace 'uv pip install' with 'uv sync' for dependency management
- Add proper --extra flags for all optional dependencies
- Use 'uv run' for all Python command executions
- Follow official uv GitHub Actions best practices

This aligns with uv's recommended project workflow and should resolve
the CI environment issues we were experiencing.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

* fix: update all GitHub Actions workflows to remove UV_SYSTEM_PYTHON

- Remove UV_SYSTEM_PYTHON: 1 from all workflow files
- Fix documentation.yml to use 'uv sync --group docs' instead of 'uv pip install'
- Fix leaderboard_build.yml to use 'uv sync --extra leaderboard --group dev'
- Ensures consistent uv workflow across all CI jobs

Updated workflows:
- lint.yml
- documentation.yml
- model_loading.yml
- dataset_loading.yml
- leaderboard_build.yml

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

* fix: lint workflow to use correct dependency group

- Change from 'make install' to 'uv sync --group lint' since pre-commit is in the lint group
- Add explicit pre-commit install step
- Use 'uv run' for lint commands (ruff, typos) to ensure proper environment
- Fixes "pre-commit: No such file or directory" error in lint workflow

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

* remove 3.14

* try out 3.14 again with python_full_version

* specify torch version for pylate dep

* try to skip colpali

* try split torch

* Add --no-sync flag and group/extra flags to uv run commands

Address review comments from PR #3702:

1. Add --no-sync to all uv run commands in Makefile for:
   - Faster execution (avoids re-syncing on each command)
   - pip compatibility (users can remove 'uv run' prefix)

2. Add appropriate group/extra flags to uv run commands:
   - test commands: --group test
   - docs commands: --group docs
   - typecheck: --group typing
   - leaderboard: --extra leaderboard

3. Update CI workflows to use --no-sync and appropriate groups:
   - lint.yml: Add --no-sync --group lint to all uv run commands
   - documentation.yml: Add uv run --no-sync --group docs to mkdocs gh-deploy

These changes improve performance while maintaining compatibility for
contributors who prefer using pip directly.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* try removing install block

* add back install block

* remove install block in doc CI without --no-sync

* add uv lock file

* replace install-for-test with just install

* install pre-commit with uv

* fix doc workflow

* address review comments

* remove no-sync from run-leaderboard make command

* remove --no-sync from selected make commands

* update typechecking

* fix type checking

* sync to install

* fix tests

* test pre-commit setup

* remove test file

* fix: separate install and install-for-tests with uv commands

* fix: add leaderboard extra to typecheck command for gradio imports

* fix: add faiss-cpu extra to test targets

* fix: update CI workflows for uv dependency management

* docs: update all documentation for uv migration

- Add uv installation options alongside pip in README.md
- Update installation.md with comprehensive migration guide for contributors
- Add uv context to CONTRIBUTING.md for development setup
- Update all usage docs to include uv alternatives for extras:
  - openai, leaderboard, image, xet, faiss-cpu dependencies
- Fix incorrect extra name: faiss -> faiss-cpu in retrieval_backend.md
- Ensure consistent dual-option approach (pip/uv) throughout documentation

This provides users and contributors with modern, fast uv tooling while
maintaining backward compatibility with existing pip workflows.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4 <noreply@anthropic.com>
Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>

* fix: handle git lfs content for cached zip file (#3827)

* 2.6.2

Automatically generated by python-semantic-release

* fix: Allow passing device to model (#3812)

* Allow passing device to model

* revert incorrect modification and fix typeerror

* add device to get_model and address comments

* Correct CDEWrapper

* 2.6.3

Automatically generated by python-semantic-release

* fix: Add leaderboard docker workflow (#3828)

* Add GitHub workflow to test leaderboard Dockerfile

- Add .github/workflows/leaderboard_docker.yml workflow that:
  - Builds the Docker image
  - Tests container startup with 6-minute timeout
  - Monitors for container exit codes and failures
  - Shows progress updates during long initialization
  - Validates leaderboard dependencies are available
- Include Dockerfile for leaderboard containerization
- Fix missing typer dependency in leaderboard extras to resolve ModuleNotFoundError
- Update uv.lock with typer dependency

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

* Optimize leaderboard Docker workflow with smart exit conditions

- Add intelligent startup progress detection instead of blind 5.5min wait
- Monitor key milestones: app startup, Step 1/7 completion
- Only exit early on actual completion signals (Gradio server ready, full initialization)
- Hard timeout failure at 5.5min regardless of progress
- Improved logging with 30s progress updates
- Tested with act: reduces wait time from 5.5min to ~2.5min when appropriate

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

* Remove redundant Test container dependencies step

- Dependencies already verified during Docker build process
- Runtime verification handled by smart exit conditions
- Eliminates environment-specific import failures in act testing
- Streamlines workflow to focus on essential container startup validation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

* Fix Docker image availability and add GHCR caching

- Add load: true to ensure built image is available for docker run
- Add GHCR login and push for enhanced caching across workflow runs
- Include commit-specific and latest tags for better cache utilization

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

* Fix Docker Hub push error by separating GHCR push from local testing

- Only push to GHCR to avoid Docker Hub authentication issues
- Pull GHCR image and tag locally for testing
- Maintains same local tag name for test step compatibility

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

* Add free disk space step to prevent storage errors

- Remove unused software installations (~10GB)
- Clean Docker cache before build operations
- Prevents "no space left on device" errors during image pull

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

* Fix duplicate detection messages and improve server response validation

- Add INIT_COMPLETE_DETECTED state tracking to prevent repeated "initialization complete" messages
- Remove Step 1/7 detection logic that was causing duplicate output
- Add server response test after initialization complete detection
- Clean up progress status display and exit conditions

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

* fix: remove redundant typer dependency from leaderboard extras

* fix: only push Docker images from main branch

* fix: use COPY instead of git clone in Dockerfile

---------

Co-authored-by: Claude Sonnet 4 <noreply@anthropic.com>

* 2.6.4

Automatically generated by python-semantic-release

* docs: Fix docs build strict mode errors (#3809)

* fix: resolve mkdocs strict mode errors

* fix: remove duplicate line in installation.md

* build: add --strict flag to mkdocs build

* fix: resolve invalid BibTeX keys in task citations

* feat: filter BibTeX warnings in strict docs build

* Update docs/usage/defining_the_model.md

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* fix: dynamic mkdocs path discovery for CI

* fix: improve docs build script with clear warning counts

* fix: resolve 6 real docs build warnings

* fix: remove broken PylateSearchEncoder reference

* Remove unused build scripts

* docs: wrap multimodal example as code snippet

* fix: export SklearnModelProtocol for docs API

* docs: add API reference for SklearnModelProtocol

* fix: remove SklearnModelProtocol export to avoid circular import

* feat: add SklearnModelProtocol docs with lazy import

* fix: convert Sphinx cross-references to MkDocs syntax for proper linking

Convert :class: Sphinx syntax to [Text][module.path] MkDocs syntax to ensure
cross-references are properly clickable in the generated documentation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

* chore: remove SklearnModelProtocol docs to avoid circular import

* chore: remove SklearnModelProtocol export from _evaluators

* style: fix indentation in _evaluators/__init__.py

* fix: enable mkdocs build --strict without warnings

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

* fix: rename evaluator to multilabel_classifier to avoid type conflict

* fix: use evaluator_model instead of evaluator to avoid type conflict

- Change parameter name from multilabel_classifier to evaluator_model
- Maintains SklearnModelProtocol type hint as requested in PR review
- Resolves mypy type error by using parent class's evaluator_model field
- Keeps explicit protocol reference in docstring for clarity

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

---------

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: Claude Sonnet 4 <noreply@anthropic.com>

* model: Add SauerkrautLM-ColPali visual document retrieval models (#3804)

* model: Add SauerkrautLM-ColPali visual document retrieval models

Add inference code and requirements for SauerkrautLM-ColPali visual document retrieval models.

These are multi-vector embedding models based on the ColPali architecture:
- ColQwen3 (Qwen3-VL backbone): 1.7B Turbo, 2B, 4B, 8B variants
- ColLFM2 (LFM2-VL backbone): 450M variant
- ColMinistral3 (Ministral3 backbone): 3B variant

All models produce 128-dimensional embeddings per text/image token and use MaxSim (late interaction) for retrieval scoring.

Model checkpoints:
- https://huggingface.co/VAGOsolutions/SauerkrautLM-ColQwen3-1.7b-Turbo-v0.1
- https://huggingface.co/VAGOsolutions/SauerkrautLM-ColQwen3-2b-v0.1
- https://huggingface.co/VAGOsolutions/SauerkrautLM-ColQwen3-4b-v0.1
- https://huggingface.co/VAGOsolutions/SauerkrautLM-ColQwen3-8b-v0.1
- https://huggingface.co/VAGOsolutions/SauerkrautLM-ColLFM2-450M-v0.1
- https://huggingface.co/VAGOsolutions/SauerkrautLM-ColMinistral3-3b-v0.1

* fix: Address review comments

- Remove loader functions, use classes directly in ModelMeta
- Remove unused get_fused_embeddings method
- Move model.to(device) and model.eval() to base class __init__
- Pass torch_dtype directly to ColMinistral3.from_pretrained

* Update mteb/models/model_implementations/slm_models.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* Update mteb/models/model_implementations/slm_models.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* Update mteb/models/model_implementations/slm_models.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* Update mteb/models/model_implementations/slm_models.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* Update mteb/models/model_implementations/slm_models.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* Update mteb/models/model_implementations/slm_models.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* Update pyproject.toml

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* fix: Update release_date to 2025-12-20

* fix: address review comments - remove partial, add adapted_from and training_datasets

* Update mteb/models/model_implementations/slm_models.py

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

* fix: import COLPALI_CITATION from colpali_models and add model_type

* add training datasets

* fix: remove section headers and use PyPI package instead of Git URL

* fix: resolve merge conflicts and remove section headers

* fix: use COLPALI_TRAINING_DATA for training_datasets

* fix: use exact n_parameters and memory_usage_mb values from HuggingFace

* don't build 3.14

* lint

---------

Co-authored-by: David Golchinfar <d.golchin@web.de>
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>

* fix dataset generation tags (#3835)

* fix: Extend framework annotations for `ModelMeta` (#3819)

* Update framework and filter based on them

* update ModelMeta of models

* update ModelMeta of models

* update ModelMeta of models

* update ModelMeta of models

* update ModelMeta of models

* update ModelMeta of models

* update ModelMeta

* add csv

* update ModelMeta

* added framework to ModelMeta

* update ModelMeta of models

* update ModelMeta of models

* update framework in ModelMeta of models

* update framework

* update framework

* update framework in ModelMeta

* fix tests

* Add models

* fix tests

* add tags extraction in from_hub()

* fix typecheck

* apply suggestions

* apply suggestions

* keep only static method

* delete csv and script

* 2.6.5

Automatically generated by python-semantic-release

* dataset: Vietnamese VN-MTEB TVPLRetrieval, NanoClimateFEVER-VN, NanoFEVER-VN, NanoDBPedia-VN, NanoNQ-VN, NanoHotpotQA-VN, NanoMSMARCO-VN (#3810)

* [ADD] Vietnamese VN-MTEB TVPLRetrieval, NanoClimateFEVER-VN, NanoFEVER-VN, NanoDBPedia-VN, NanoNQ-VN, NanoHotpotQA-VN, NanoMSMARCO-VN

* [UPDATE] descriptive stats

* [UPDATE] bibtext

* [UPDATE] dataset path

* [UPDATE] nano db pedia retrieval

* [UPDATE] size dataset from 1M corpus to 100k

* [ADD] add note about what's different in nano version

* [ADD] TVPLRetrieval description

* test: Add HF Space Dockerfile using pre-built leaderboard image (#3838)

* Add HF Space Dockerfile using pre-built leaderboard image

Adds a lightweight Dockerfile for HuggingFace Space deployment that uses
the pre-built ghcr.io/embeddings-benchmark/mteb/leaderboard image as base.
Also adds a workflow to test the Dockerfile.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Delete .github/workflows/hf_space_docker.yml

* test: Add CI workflow for HF Space Dockerfile validation

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: MTEB Agent <agent@example.com>

* Update uv.lock

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix lint and type errors

- Remove unused LogOnce import from _create_dataloaders.py
- Use specific type ignore codes [arg-type] in mock_tasks.py for PGH003 compliance
- Fix type annotations in classification.py to use Array type instead of np.ndarray
- Remove unused Iterable import from classification.py

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix duplicate modalities kwarg in random_baseline ModelMeta

Remove modalities from _common_mock_metadata since each ModelMeta
instance specifies its own modalities, which caused "got multiple
values for keyword argument 'modalities'" error.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix baselines

* invalidate hf cache for maeb

* temporarily skip 3.14 in tests

---------

Co-authored-by: Munot Ayush Sunil <munotayush6@kgpian.iitkgp.ac.in>
Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
Co-authored-by: semantic-release <semantic-release>
Co-authored-by: bflhc <kunka.xgw@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
Co-authored-by: Quan Yuhan <yuhan_quan@qq.com>
Co-authored-by: Claude Sonnet 4 <noreply@anthropic.com>
Co-authored-by: dgolchin <david.golchinfar@h-brs.de>
Co-authored-by: David Golchinfar <d.golchin@web.de>
Co-authored-by: Bao Loc Pham <67360122+BaoLocPham@users.noreply.github.com>
Co-authored-by: MTEB Agent <agent@example.com>