No due date • 1/4 issues closed

❓ Problem
Users do not intend to use the interface itself; they prefer to build their own visualizations. The usage logic envisioned in EvalAP (one experiment = many test combinations) does not seem to match the main development use case.

💡 Solution
Now that teams can define their own metrics, we propose a more comprehensive evaluation approach: experiments live in the same table, without temporal separation, following a logic of monitoring product improvement (an illustrative sketch follows this card).

🎯 Goal
- Ensure that no more time is allocated to visualization development within user teams.
- Save data scientists time when visualizing results.

⚒️ To do
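As an illustration of the "single table, monitored over time" idea, a minimal sketch: each experiment is one row appended to the same results table, so product improvement can be read directly from it. All names and values below are placeholders, not real results.

```python
# Illustrative sketch (placeholder names/values): every experiment is one row
# in a single table, so improvement over time is visible without separate views.
import pandas as pd

runs = pd.DataFrame(
    [
        # experiment,     date,         team-defined metric (placeholder)
        ("baseline",      "2025-01-10", 0.62),
        ("prompt-rework", "2025-02-03", 0.68),
    ],
    columns=["experiment", "date", "team_metric"],
)
print(runs.sort_values("date"))  # one view tracking product improvement
```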
Overdue by 2 month(s) • Due by December 22, 2025

📒 Context
Each AI system has its own characteristics, and evaluation pipelines must match them in terms of pipeline, metrics, dataset, etc. Evaluating AI systems sometimes requires different parameters than those offered by EvalAP. The incubated product teams have all either already started implementing their own evaluation system or have not yet begun their evaluations. A few team examples:
- MyCyberQuestions: uses the API but is building its own evaluation pipeline for continuous deployment/versioning, and will then visualize the results via their Metabase.
- Document AI: the team is developing its own pipeline tool to connect its own dataset and create its own metric.
- HR Assistant: in progress.

❓ Problem
Some teams do not use EvalAP because the evaluation customization options are limited. It is not easy for developers to define their own evaluator, their own datasets, or their own metrics.

💡 Solution
Give developers the ability to customize their parameters as well as their evaluators and datasets.

🎯 Goal
By allowing teams to customize their own parameters, we give them the ability to create an evaluation pipeline that is consistent with and relevant to their AI system.

⚒️ To do
- Create your own datasets within EvalAP
- Link external datasets
- Define your judge (LLM + prompt)
- Define a custom metric (see the sketch after this list)
- Clear documentation
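A minimal sketch of what team-defined judges and metrics could look like. Every name here (`register_metric`, `LLMJudge`, the example metric) is an illustrative assumption, not EvalAP's actual API.

```python
# Hypothetical sketch of custom judges/metrics; names are illustrative
# assumptions, not EvalAP's actual API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class LLMJudge:
    """An LLM-as-judge: a model plus the prompt used to grade answers."""
    model: str
    prompt_template: str

    def score(self, question: str, answer: str) -> float:
        # Placeholder: a real judge would render prompt_template with the
        # question/answer, call the model, and parse a numeric grade.
        raise NotImplementedError

METRICS: dict[str, Callable[..., float]] = {}

def register_metric(name: str) -> Callable:
    """Decorator that registers a team-defined metric under a name."""
    def wrapper(fn: Callable) -> Callable:
        METRICS[name] = fn
        return fn
    return wrapper

@register_metric("answer_cites_source")
def answer_cites_source(output: str, expected: dict) -> float:
    """Example custom metric: 1.0 if the answer cites the expected source."""
    return 1.0 if expected["source_id"] in output else 0.0
```

With a registry like this, an evaluation run only needs metric names, and the judge object keeps the model and prompt together so teams can swap either independently.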
Overdue by 2 month(s) • Due by December 15, 2025 • 5/6 issues closed

📒 Context
Most teams install EvalAP locally, for two main reasons:
- The lack of authentication, which makes it impossible to track only one's own results.
- The integration of AI products/systems into continuous-development environments that rely on scripts rather than notebooks.
The incubated product teams have all either already started implementing their own evaluation system or have not yet begun their evaluations. A few team examples:
- MyCyberQuestions: uses the API but is building its own evaluation pipeline for continuous deployment/versioning, and will then visualize the results via their Metabase.
- Document AI: the team is developing its own pipeline tool to connect its own dataset.
- HR Assistant: in progress.
- Jacepair: the team has not yet started its evaluations.
...
Since some teams have chosen not to use all of EvalAP's features, we are prioritizing the onboarding of new teams.

❓ Problem
The EvalAP packages to install and the demo notebooks include unnecessary elements that create clutter during installation. Developers waste time figuring out what is relevant to them and rewriting notebooks.

💡 Solution
Make the local installation of EvalAP cleaner and more streamlined: remove unnecessary visualizations, use understandable wording, and provide ready-to-use notebooks for launching a first test evaluation. -> Clean up the current packages in terms of datasets, web pages, and notebooks.

🎯 Goal / Impact
Our goal is that a developer can launch a ready-to-use "test" evaluation within 30 minutes of installation.

⚒️ To do
- Delete leaderboard content
- Adapt wording when pages are empty
- Create X notebooks for launching a first evaluation (a sketch of such a first run follows this list)
- Clean up documentation
- Add initial datasets
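A sketch of the kind of "first evaluation" a ready-to-use notebook could run against a local instance. The base URL, endpoints, and payload fields are assumptions for illustration, not EvalAP's documented API.

```python
# Hypothetical first-run script; endpoints and fields are illustrative
# assumptions, not EvalAP's documented API.
import requests

BASE_URL = "http://localhost:8000"  # assumed local EvalAP instance

# 1. Create an experiment against a bundled starter dataset.
experiment = requests.post(
    f"{BASE_URL}/experiments",
    json={
        "name": "first-test-run",
        "dataset": "starter-qa",       # assumed bundled initial dataset
        "model": "my-model-endpoint",  # the system under evaluation
        "metrics": ["answer_relevancy"],
    },
    timeout=30,
).json()

# 2. Fetch the scores once the run has finished.
results = requests.get(
    f"{BASE_URL}/experiments/{experiment['id']}/results",
    timeout=30,
).json()
print(results)
```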
Overdue by 2 month(s) • Due by December 8, 2025 • 6/8 issues closed

Explore new models, algorithms, heuristics, etc. Optimize the current model on selected metrics.
Overdue by 1 year(s) • Due by January 15, 2025 • 26/27 issues closed

Three identified use cases:
- For data scientists | Guide model development and selection
- For POs and managers | Steer the move to production - stage 1
- For everyone else | General leaderboard - public relations

Three stages for putting models into production:
1. Stage 1: Quantitative model evaluation | Thematic datasets and metrics
2. Stage 2: Human evaluation | Qualitative analysis, control
3. Stage 3: User feedback | A/B testing, validation
Overdue by 1 year(s) • Due by February 15, 2025 • 13/15 issues closed