diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
index 243462b3d..3daa40529 100644
--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@@ -28,3 +28,21 @@
   - local: available-tasks
     title: Available Tasks
   title: API
+- sections:
+  - sections:
+    - local: package_reference/evaluation_tracker
+      title: EvaluationTracker
+    - local: package_reference/models
+      title: Models
+    - local: package_reference/model_config
+      title: ModelConfig
+    - local: package_reference/pipeline
+      title: Pipeline
+    title: Main classes
+  - local: package_reference/metrics
+    title: Metrics
+  - local: package_reference/tasks
+    title: Tasks
+  - local: package_reference/logging
+    title: Logging
+  title: Reference
diff --git a/docs/source/adding-a-custom-task.mdx b/docs/source/adding-a-custom-task.mdx
index bcaa932ff..752c4e547 100644
--- a/docs/source/adding-a-custom-task.mdx
+++ b/docs/source/adding-a-custom-task.mdx
@@ -45,8 +45,9 @@ def prompt_fn(line, task_name: str = None):
     )
 ```
 
-Then, you need to choose a metric, you can either use an existing one (defined
-in `lighteval/metrics/metrics.py`) or [create a custom one](adding-a-new-metric)).
+Then, you need to choose a metric: you can either use an existing one (defined
+in [`lighteval.metrics.metrics.Metrics`]) or [create a custom one](adding-a-new-metric).
+[//]: # (TODO: Replace lighteval.metrics.metrics.Metrics with ~metrics.metrics.Metrics once its autodoc is added)
 
 ```python
 custom_metric = SampleLevelMetric(
@@ -59,7 +60,8 @@ custom_metric = SampleLevelMetric(
 )
 ```
 
-Then, you need to define your task. You can define a task with or without subsets.
+Then, you need to define your task using [`~tasks.lighteval_task.LightevalTaskConfig`].
+You can define a task with or without subsets.
 To define a task with no subsets:
 
 ```python
diff --git a/docs/source/adding-a-new-metric.mdx b/docs/source/adding-a-new-metric.mdx
index e8562af4f..35fc975f8 100644
--- a/docs/source/adding-a-new-metric.mdx
+++ b/docs/source/adding-a-new-metric.mdx
@@ -1,8 +1,8 @@
 # Adding a New Metric
 
 First, check if you can use one of the parametrized functions in
-[src.lighteval.metrics.metrics_corpus]() or
-[src.lighteval.metrics.metrics_sample]().
+[Corpus Metrics](package_reference/metrics#corpus-metrics) or
+[Sample Metrics](package_reference/metrics#sample-metrics).
 
 If not, you can use the `custom_task` system to register your new metric:
 
@@ -49,7 +49,8 @@ def agg_function(items):
     return score
 ```
 
-Finally, you can define your metric. If it's a sample level metric, you can use the following code:
+Finally, you can define your metric. If it's a sample level metric, you can use the following code
+with [`~metrics.utils.metric_utils.SampleLevelMetric`]:
 
 ```python
 my_custom_metric = SampleLevelMetric(
@@ -62,7 +63,8 @@ my_custom_metric = SampleLevelMetric(
 )
 ```
 
-If your metric defines multiple metrics per sample, you can use the following code:
+If your metric defines multiple metrics per sample, you can use the following code
+with [`~metrics.utils.metric_utils.SampleLevelMetricGrouping`]:
 
 ```python
 custom_metric = SampleLevelMetricGrouping(
diff --git a/docs/source/contributing-to-multilingual-evaluations.mdx b/docs/source/contributing-to-multilingual-evaluations.mdx
index 25779bc38..0d0855d75 100644
--- a/docs/source/contributing-to-multilingual-evaluations.mdx
+++ b/docs/source/contributing-to-multilingual-evaluations.mdx
@@ -51,7 +51,7 @@ Browse the list of all templates [here](https://github.com/huggingface/lighteval
 Then, when ready, to define your own task, you should:
 1. create a Python file as indicated in the above guide
 2. import the relevant templates for your task type (XNLI, Copa, Multiple choice, Question Answering, etc)
-3. define one or a list of tasks for each relevant language and evaluation formulation (for multichoice) using our parametrizable `LightevalTaskConfig` class
+3. define one or a list of tasks for each relevant language and evaluation formulation (for multichoice) using our parametrizable [`~tasks.lighteval_task.LightevalTaskConfig`] class
 
 ```python
 your_tasks = [
@@ -101,7 +101,7 @@ your_tasks = [
 4. then, you can go back to the guide to test if your task is correctly implemented!
 
 > [!TIP]
-> All `LightevalTaskConfig` parameters are strongly typed, including the inputs to the template function. Make sure to take advantage of your IDE's functionality to make it easier to correctly fill these parameters.
+> All [`~tasks.lighteval_task.LightevalTaskConfig`] parameters are strongly typed, including the inputs to the template function. Make sure to take advantage of your IDE's functionality to make it easier to correctly fill these parameters.
 
 
-Once everything is good, open a PR, and we'll be happy to review it!
\ No newline at end of file
+Once everything is good, open a PR, and we'll be happy to review it!
diff --git a/docs/source/metric-list.mdx b/docs/source/metric-list.mdx
index 0ab03afb9..c7aefefd2 100644
--- a/docs/source/metric-list.mdx
+++ b/docs/source/metric-list.mdx
@@ -69,7 +69,7 @@ These metrics need the model to generate an output. They are therefore slower.
 - `quasi_exact_match_gsm8k`: Fraction of instances where the normalized prediction matches the normalized gold (normalization done for gsm8k, where latex symbols, units, etc are removed)
 - `maj_at_8_gsm8k`: Majority choice evaluation, using the gsm8k normalisation for the predictions and gold
 
-## LLM-as-Judge:
+## LLM-as-Judge
 - `llm_judge_gpt3p5`: Can be used for any generative task, the model will be scored by a GPT3.5 model using the OpenAI API
 - `llm_judge_llama_3_405b`: Can be used for any generative task, the model will be scored by a Llama 3.405B model using the HuggingFace API
 - `llm_judge_multi_turn_gpt3p5`: Can be used for any generative task, the model will be scored by a GPT3.5 model using the OpenAI API. It is used for multiturn tasks like mt-bench.
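The metric-definition flow documented in `adding-a-new-metric.mdx` above (a sample-level scoring function, an aggregation function, and a `SampleLevelMetric` wrapper) can be sketched end to end. This is a minimal sketch, not the library's canonical example: the import path and the `MetricCategory`/`MetricUseCase` values are assumptions inferred from the autodoc targets referenced in this diff.

```python
# A minimal sketch of a custom sample-level metric; module path and enum values
# are assumptions based on the reference pages added in this PR.
import numpy as np

from lighteval.metrics.utils.metric_utils import (  # assumed module path
    MetricCategory,
    MetricUseCase,
    SampleLevelMetric,
)


def custom_metric(predictions: list[str], formatted_doc, **kwargs) -> float:
    """Score one sample: 1.0 if the first prediction equals the gold choice."""
    response = predictions[0]
    return float(response == formatted_doc.choices[formatted_doc.gold_index])


my_custom_metric = SampleLevelMetric(
    metric_name="my_custom_metric_name",
    higher_is_better=True,
    category=MetricCategory.GENERATIVE,  # assumed category for a generative task
    use_case=MetricUseCase.ACCURACY,     # assumed use case
    sample_level_fn=custom_metric,       # scores one sample at a time
    corpus_level_fn=np.mean,             # aggregates sample scores into one number
)
```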
diff --git a/docs/source/package_reference/evaluation_tracker.mdx b/docs/source/package_reference/evaluation_tracker.mdx
new file mode 100644
index 000000000..87dc90b7a
--- /dev/null
+++ b/docs/source/package_reference/evaluation_tracker.mdx
@@ -0,0 +1,3 @@
+# EvaluationTracker
+
+[[autodoc]] logging.evaluation_tracker.EvaluationTracker
diff --git a/docs/source/package_reference/logging.mdx b/docs/source/package_reference/logging.mdx
new file mode 100644
index 000000000..9fd01154e
--- /dev/null
+++ b/docs/source/package_reference/logging.mdx
@@ -0,0 +1,12 @@
+# Loggers
+
+## GeneralConfigLogger
+[[autodoc]] logging.info_loggers.GeneralConfigLogger
+## DetailsLogger
+[[autodoc]] logging.info_loggers.DetailsLogger
+## MetricsLogger
+[[autodoc]] logging.info_loggers.MetricsLogger
+## VersionsLogger
+[[autodoc]] logging.info_loggers.VersionsLogger
+## TaskConfigLogger
+[[autodoc]] logging.info_loggers.TaskConfigLogger
diff --git a/docs/source/package_reference/metrics.mdx b/docs/source/package_reference/metrics.mdx
new file mode 100644
index 000000000..57c656966
--- /dev/null
+++ b/docs/source/package_reference/metrics.mdx
@@ -0,0 +1,70 @@
+# Metrics
+
+## Metrics
+[//]: # (TODO: aenum.Enum raises error when generating docs: not supported by inspect.signature. See: https://github.com/ethanfurman/aenum/issues/44)
+[//]: # (### Metrics)
+[//]: # ([[autodoc]] metrics.metrics.Metrics)
+### Metric
+[[autodoc]] metrics.utils.metric_utils.Metric
+### CorpusLevelMetric
+[[autodoc]] metrics.utils.metric_utils.CorpusLevelMetric
+### SampleLevelMetric
+[[autodoc]] metrics.utils.metric_utils.SampleLevelMetric
+### MetricGrouping
+[[autodoc]] metrics.utils.metric_utils.MetricGrouping
+### CorpusLevelMetricGrouping
+[[autodoc]] metrics.utils.metric_utils.CorpusLevelMetricGrouping
+### SampleLevelMetricGrouping
+[[autodoc]] metrics.utils.metric_utils.SampleLevelMetricGrouping
+
+## Corpus Metrics
+### CorpusLevelF1Score
+[[autodoc]] metrics.metrics_corpus.CorpusLevelF1Score
+### CorpusLevelPerplexityMetric
+[[autodoc]] metrics.metrics_corpus.CorpusLevelPerplexityMetric
+### CorpusLevelTranslationMetric
+[[autodoc]] metrics.metrics_corpus.CorpusLevelTranslationMetric
+### matthews_corrcoef
+[[autodoc]] metrics.metrics_corpus.matthews_corrcoef
+
+## Sample Metrics
+### ExactMatches
+[[autodoc]] metrics.metrics_sample.ExactMatches
+### F1_score
+[[autodoc]] metrics.metrics_sample.F1_score
+### LoglikelihoodAcc
+[[autodoc]] metrics.metrics_sample.LoglikelihoodAcc
+### NormalizedMultiChoiceProbability
+[[autodoc]] metrics.metrics_sample.NormalizedMultiChoiceProbability
+### Probability
+[[autodoc]] metrics.metrics_sample.Probability
+### Recall
+[[autodoc]] metrics.metrics_sample.Recall
+### MRR
+[[autodoc]] metrics.metrics_sample.MRR
+### ROUGE
+[[autodoc]] metrics.metrics_sample.ROUGE
+### BertScore
+[[autodoc]] metrics.metrics_sample.BertScore
+### Extractiveness
+[[autodoc]] metrics.metrics_sample.Extractiveness
+### Faithfulness
+[[autodoc]] metrics.metrics_sample.Faithfulness
+### BLEURT
+[[autodoc]] metrics.metrics_sample.BLEURT
+### BLEU
+[[autodoc]] metrics.metrics_sample.BLEU
+### StringDistance
+[[autodoc]] metrics.metrics_sample.StringDistance
+### JudgeLLM
+[[autodoc]] metrics.metrics_sample.JudgeLLM
+### JudgeLLMMTBench
+[[autodoc]] metrics.metrics_sample.JudgeLLMMTBench
+### JudgeLLMMixEval
+[[autodoc]] metrics.metrics_sample.JudgeLLMMixEval
+### MajAtK
+[[autodoc]] metrics.metrics_sample.MajAtK
+
+## LLM-as-a-Judge
+### JudgeLM
+[[autodoc]] metrics.llm_as_judge.JudgeLM
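The task-definition step described in `adding-a-custom-task.mdx` above culminates in a `LightevalTaskConfig`, whose reference page is added further down in this diff. A minimal sketch follows, building on the prompt function and custom metric from the earlier sketch; the field names mirror the existing docs, and the repository name, splits, and suite are placeholders rather than real resources.

```python
# A sketch of a task definition, assuming the LightevalTaskConfig fields used
# in the existing docs; dataset name and splits are hypothetical placeholders.
from lighteval.tasks.lighteval_task import LightevalTaskConfig

task = LightevalTaskConfig(
    name="myothertask",
    prompt_function=prompt_fn,        # the prompt_fn defined earlier in the guide
    suite=["community"],
    hf_repo="your_org/your_dataset",  # placeholder dataset repository
    hf_subset="default",
    hf_avail_splits=["train", "test"],
    evaluation_splits=["test"],
    few_shots_split=None,
    few_shots_select=None,
    metric=[my_custom_metric],        # the metric from the sketch above
    generation_size=256,
    stop_sequence=["\n"],
)
```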
diff --git a/docs/source/package_reference/model_config.mdx b/docs/source/package_reference/model_config.mdx
new file mode 100644
index 000000000..c70258bb5
--- /dev/null
+++ b/docs/source/package_reference/model_config.mdx
@@ -0,0 +1,12 @@
+# ModelConfig
+
+[[autodoc]] models.model_config.BaseModelConfig
+
+[[autodoc]] models.model_config.AdapterModelConfig
+[[autodoc]] models.model_config.DeltaModelConfig
+[[autodoc]] models.model_config.InferenceEndpointModelConfig
+[[autodoc]] models.model_config.InferenceModelConfig
+[[autodoc]] models.model_config.TGIModelConfig
+[[autodoc]] models.model_config.VLLMModelConfig
+
+[[autodoc]] models.model_config.create_model_config
diff --git a/docs/source/package_reference/models.mdx b/docs/source/package_reference/models.mdx
new file mode 100644
index 000000000..a04c9eef9
--- /dev/null
+++ b/docs/source/package_reference/models.mdx
@@ -0,0 +1,30 @@
+# Models
+
+## Model
+### LightevalModel
+[[autodoc]] models.abstract_model.LightevalModel
+
+## Accelerate and Transformers Models
+### BaseModel
+[[autodoc]] models.base_model.BaseModel
+[//]: # (TODO: Fix import error)
+[//]: # (### AdapterModel)
+[//]: # ([[autodoc]] models.adapter_model.AdapterModel)
+### DeltaModel
+[[autodoc]] models.delta_model.DeltaModel
+
+## Inference Endpoints and TGI Models
+### InferenceEndpointModel
+[[autodoc]] models.endpoint_model.InferenceEndpointModel
+### ModelClient
+[[autodoc]] models.tgi_model.ModelClient
+
+[//]: # (TODO: Fix import error)
+[//]: # (## Nanotron Model)
+[//]: # (### NanotronLightevalModel)
+[//]: # ([[autodoc]] models.nanotron_model.NanotronLightevalModel)
+
+[//]: # (TODO: Fix import error)
+[//]: # (## VLLM Model)
+[//]: # (### VLLMModel)
+[//]: # ([[autodoc]] models.vllm_model.VLLMModel)
diff --git a/docs/source/package_reference/pipeline.mdx b/docs/source/package_reference/pipeline.mdx
new file mode 100644
index 000000000..3473abb3d
--- /dev/null
+++ b/docs/source/package_reference/pipeline.mdx
@@ -0,0 +1,13 @@
+# Pipeline
+
+## Pipeline
+
+[[autodoc]] pipeline.Pipeline
+
+## PipelineParameters
+
+[[autodoc]] pipeline.PipelineParameters
+
+## ParallelismManager
+
+[[autodoc]] pipeline.ParallelismManager
diff --git a/docs/source/package_reference/tasks.mdx b/docs/source/package_reference/tasks.mdx
new file mode 100644
index 000000000..c1a84b00a
--- /dev/null
+++ b/docs/source/package_reference/tasks.mdx
@@ -0,0 +1,38 @@
+# Tasks
+
+## LightevalTask
+### LightevalTaskConfig
+[[autodoc]] tasks.lighteval_task.LightevalTaskConfig
+### LightevalTask
+[[autodoc]] tasks.lighteval_task.LightevalTask
+
+## PromptManager
+
+[[autodoc]] tasks.prompt_manager.PromptManager
+
+## Registry
+
+[[autodoc]] tasks.registry.Registry
+
+## Requests
+
+[[autodoc]] tasks.requests.Request
+
+[[autodoc]] tasks.requests.LoglikelihoodRequest
+
+[[autodoc]] tasks.requests.LoglikelihoodSingleTokenRequest
+
+[[autodoc]] tasks.requests.LoglikelihoodRollingRequest
+
+[[autodoc]] tasks.requests.GreedyUntilRequest
+
+[[autodoc]] tasks.requests.GreedyUntilMultiTurnRequest
+
+## Datasets
+
+[[autodoc]] data.DynamicBatchDataset
+[[autodoc]] data.LoglikelihoodDataset
+[[autodoc]] data.LoglikelihoodSingleTokenDataset
+[[autodoc]] data.GenerativeTaskDataset
+[[autodoc]] data.GenerativeTaskDatasetNanotron
+[[autodoc]] data.GenDistributedSampler
diff --git a/docs/source/using-the-python-api.mdx b/docs/source/using-the-python-api.mdx
index 82238c7f1..2e160a679 100644
--- a/docs/source/using-the-python-api.mdx
+++ b/docs/source/using-the-python-api.mdx
@@ -1,8 +1,9 @@
 # Using the Python API
 
-Lighteval can be used from a custom python script. To evaluate a model you will
-need to setup an `evaluation_tracker`, `pipeline_parameters`, `model_config`
-and a `pipeline`.
+Lighteval can be used from a custom Python script. To evaluate a model, you will need to set up an
+[`~logging.evaluation_tracker.EvaluationTracker`], [`~pipeline.PipelineParameters`],
+a [`model`](package_reference/models) or a [`model_config`](package_reference/model_config),
+and a [`~pipeline.Pipeline`].
 
 After that, simply run the pipeline and save the results.
 
diff --git a/pyproject.toml b/pyproject.toml
index a779ebf4c..7d17c8d13 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -59,6 +59,7 @@ dependencies = [
     "torch>=2.0,<2.5",
    "GitPython>=3.1.41", # for logging
    "datasets>=2.14.0",
+    "numpy<2", # pinned to avoid incompatibilities
    # Prettiness
    "termcolor==2.3.0",
    "pytablewriter",
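The `using-the-python-api.mdx` change above names the four objects a script needs: an `EvaluationTracker`, `PipelineParameters`, a model or model config, and a `Pipeline`. A minimal sketch of how they might fit together is below; the constructor arguments, model name, and task string are assumptions to be checked against the reference pages added in this PR.

```python
# A sketch only: argument names follow the classes referenced above, but exact
# signatures may differ between lighteval versions.
from lighteval.logging.evaluation_tracker import EvaluationTracker
from lighteval.models.model_config import BaseModelConfig
from lighteval.pipeline import ParallelismManager, Pipeline, PipelineParameters

# Where results and per-sample details are written
evaluation_tracker = EvaluationTracker(output_dir="./results", save_details=True)

pipeline_params = PipelineParameters(
    launcher_type=ParallelismManager.ACCELERATE,
    max_samples=10,  # small cap for a quick smoke test
)

model_config = BaseModelConfig(pretrained="gpt2")  # placeholder model name

pipeline = Pipeline(
    tasks="leaderboard|truthfulqa:mc|0|0",  # example task string
    pipeline_parameters=pipeline_params,
    evaluation_tracker=evaluation_tracker,
    model_config=model_config,
)

pipeline.evaluate()
pipeline.save_and_push_results()
pipeline.show_results()
```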