
Releases: etalab-ia/evalap

v0.4

22 May 14:00
5c12c86


[0.4] - 2025-05-21

πŸš€ Features

  • Support thinking models in the judge
  • Add nb_tool_call as an ops metric, add MCP_BRIDGE_URL, and formatting fixes
  • Parquet dataset support, OCR metrics, and a notebook demo
  • Add and handle the new with_vision and prelude_prompt attributes
  • Calculate the environmental impact of models for the response-generation step
  • Add two new environmental metrics: energy_consumption and gwp_consumption
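The new environmental metrics can be requested alongside the usual ones when defining an experiment. A minimal sketch of what such a payload might look like, assuming a JSON API; the endpoint shape and all field names other than the two metric names are illustrative, not EvalAP's actual schema:

```python
import json

# Hypothetical experiment payload -- field names besides the metric
# names are assumptions for illustration.
experiment = {
    "name": "env-impact-demo",
    "model": "albert-large",            # illustrative model name
    "dataset": "my-parquet-dataset",    # Parquet datasets are supported since v0.4
    # The two environmental metrics introduced in v0.4:
    "metrics": ["judge_precision", "energy_consumption", "gwp_consumption"],
}

payload = json.dumps(experiment)
print(payload)
```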

πŸ”§ Improvements

  • [UI] Display the environmental section in the OPS pane and in experiment-set metric results

v0.3.1

02 Apr 19:00


πŸš€ Features

  • [API] Support Anthropic, OpenAI, Mistral, and Albert providers for the judge_model parameter in experiments (models are fetched from the providers' OpenAI-compatible v1/models endpoints)
  • [SCRIPTS] Add convenient scripts to run experiments from an isolated environment (e.g. Cortex; see the tutorial)
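In practice, the judge_model parameter is just another field on the experiment definition, and the valid values are whatever the configured providers expose via their v1/models endpoints. A hedged sketch, where every name except judge_model and the endpoint path is an illustrative assumption:

```python
import json

# Hypothetical experiment using the judge_model parameter from v0.3.1.
# The provider/model identifier below is an illustrative example.
experiment = {
    "name": "judge-provider-demo",
    "dataset": "qa-baseline",
    "judge_model": "claude-3-5-sonnet",  # an Anthropic, OpenAI, Mistral, or Albert model
}

# Per the release note, the list of valid judge models is fetched from
# each provider's OpenAI-compatible endpoint:
models_endpoint = "/v1/models"

print(json.dumps(experiment), models_endpoint)
```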

πŸ”§ Improvements

  • [UI] Add a special card for orphan experiments at the bottom of the experiments list.
  • [UI] Order experiment sets newest first
  • [UI] Remove the old, confusing experiments menu in favor of the experiment sets menu alone (renamed simply "experiments")

v0.3

27 Mar 17:55


πŸš€ Features

  • Integrated MCP support and multi-step LLM generations with MCP client bridge.
  • Added experiment set with cross-validation parameters and demo notebooks.
  • Integrated multiple RAG metrics for deep evaluation.
  • Supported delete experiment route for admin users.
  • Introduced new retry and post routes with UI improvements.
  • Added 'finished' and 'failure' ratios for experiments in the overview.
  • Added new tests to increase code coverage and addressed Pydantic warnings.
  • Implemented loop limit and tool call step saving.
  • Improved sorting and metrics highlighting in the experiment set score table.
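The loop limit and tool-call step saving mentioned above guard multi-step MCP generations against runaway tool-call chains. A minimal sketch of the idea; the function and variable names are illustrative, not EvalAP's actual code:

```python
MAX_STEPS = 5  # loop limit: bound the number of tool-call iterations

def run_generation(llm_step, max_steps=MAX_STEPS):
    """Run a multi-step generation, saving each tool-call step.

    `llm_step(steps)` returns (answer, tool_call), with answer=None while
    the model keeps requesting tools. Illustrative sketch only.
    """
    steps = []  # saved tool-call steps
    for _ in range(max_steps):
        answer, tool_call = llm_step(steps)
        if tool_call is not None:
            steps.append(tool_call)
        if answer is not None:
            return answer, steps
    return None, steps  # loop limit reached without a final answer

# Fake model: requests two tool calls, then answers.
def fake_llm(steps):
    if len(steps) < 2:
        return None, {"tool": "search", "call": len(steps)}
    return "final answer", None

answer, steps = run_generation(fake_llm)
print(answer, len(steps))  # -> final answer 2
```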

πŸ› Bug Fixes

  • Enhanced error handling for missing metric inputs and in the baseline demo notebook.
  • Removed unnecessary attributes and improved schema validation.
  • Fixed various UI bugs and improved experiment view.
  • Improved notebook variable names and used public endpoints.
  • Enhanced GitHub Actions CI and addressed Alembic issues.
  • Corrected schema serialization and computation needs.
  • Improved experiment status updates and endpoint terminology.
  • Handled unknown model cases and improved dataset visibility.
  • Fixed various typos and improved model sorting and ops board status.
  • Improved schema validation and error detail return for API.
  • Addressed issues with experiment view and retry functionality.

πŸ› οΈ Code Improvements

  • Reorganized code structure (pip ready) and fixed import issues.
  • Moved API components to clients and adjusted imports accordingly.

πŸ”₯ Hotfixes

  • Addressed dataset and SQL float compatibility issues.
  • Updated configuration files for supervisord and Alembic.

βš™οΈ Operations

  • Added Docker and Streamlit configuration files.
  • Fixed supervisord path for deployment.