diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
index 243462b3d..3daa40529 100644
--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@@ -28,3 +28,21 @@
   - local: available-tasks
     title: Available Tasks
   title: API
+- sections:
+  - sections:
+    - local: package_reference/evaluation_tracker
+      title: EvaluationTracker
+    - local: package_reference/models
+      title: Models
+    - local: package_reference/model_config
+      title: ModelConfig
+    - local: package_reference/pipeline
+      title: Pipeline
+    title: Main classes
+  - local: package_reference/metrics
+    title: Metrics
+  - local: package_reference/tasks
+    title: Tasks
+  - local: package_reference/logging
+    title: Logging
+  title: Reference
diff --git a/docs/source/adding-a-custom-task.mdx b/docs/source/adding-a-custom-task.mdx
index bcaa932ff..752c4e547 100644
--- a/docs/source/adding-a-custom-task.mdx
+++ b/docs/source/adding-a-custom-task.mdx
@@ -45,8 +45,9 @@ def prompt_fn(line, task_name: str = None):
     )
 ```
 
-Then, you need to choose a metric, you can either use an existing one (defined
-in `lighteval/metrics/metrics.py`) or [create a custom one](adding-a-new-metric)).
+Then, you need to choose a metric: you can either use an existing one (defined
+in [`lighteval.metrics.metrics.Metrics`]) or [create a custom one](adding-a-new-metric).
+[//]: # (TODO: Replace lighteval.metrics.metrics.Metrics with ~metrics.metrics.Metrics once its autodoc is added)
 
 ```python
 custom_metric = SampleLevelMetric(
@@ -59,7 +60,8 @@ custom_metric = SampleLevelMetric(
 )
 ```
 
-Then, you need to define your task. You can define a task with or without subsets.
+Then, you need to define your task using [`~tasks.lighteval_task.LightevalTaskConfig`].
+You can define a task with or without subsets.
 To define a task with no subsets:
 
 ```python
diff --git a/docs/source/adding-a-new-metric.mdx b/docs/source/adding-a-new-metric.mdx
index e8562af4f..35fc975f8 100644
--- a/docs/source/adding-a-new-metric.mdx
+++ b/docs/source/adding-a-new-metric.mdx
@@ -1,8 +1,8 @@
 # Adding a New Metric
 
 First, check if you can use one of the parametrized functions in
-[src.lighteval.metrics.metrics_corpus]() or
-[src.lighteval.metrics.metrics_sample]().
+[Corpus Metrics](package_reference/metrics#corpus-metrics) or
+[Sample Metrics](package_reference/metrics#sample-metrics).
 
 If not, you can use the `custom_task` system to register your new metric:
 
@@ -49,7 +49,8 @@ def agg_function(items):
     return score
 ```
 
-Finally, you can define your metric. If it's a sample level metric, you can use the following code:
+Finally, you can define your metric. If it's a sample level metric, you can use the following code
+with [`~metrics.utils.metric_utils.SampleLevelMetric`]:
 
 ```python
 my_custom_metric = SampleLevelMetric(
@@ -62,7 +63,8 @@ my_custom_metric = SampleLevelMetric(
 )
 ```
 
-If your metric defines multiple metrics per sample, you can use the following code:
+If your metric defines multiple metrics per sample, you can use the following code
+with [`~metrics.utils.metric_utils.SampleLevelMetricGrouping`]:
 
 ```python
 custom_metric = SampleLevelMetricGrouping(
diff --git a/docs/source/contributing-to-multilingual-evaluations.mdx b/docs/source/contributing-to-multilingual-evaluations.mdx
index 25779bc38..0d0855d75 100644
--- a/docs/source/contributing-to-multilingual-evaluations.mdx
+++ b/docs/source/contributing-to-multilingual-evaluations.mdx
@@ -51,7 +51,7 @@ Browse the list of all templates [here](https://github.com/huggingface/lighteval
 Then, when ready, to define your own task, you should:
 1. create a Python file as indicated in the above guide
 2. import the relevant templates for your task type (XNLI, Copa, Multiple choice, Question Answering, etc)
-3. define one or a list of tasks for each relevant language and evaluation formulation (for multichoice) using our parametrizable `LightevalTaskConfig` class
+3. define one or a list of tasks for each relevant language and evaluation formulation (for multichoice) using our parametrizable [`~tasks.lighteval_task.LightevalTaskConfig`] class
 
 ```python
 your_tasks = [
@@ -101,7 +101,7 @@ your_tasks = [
 4. then, you can go back to the guide to test if your task is correctly implemented!
 
 > [!TIP]
-> All `LightevalTaskConfig` parameters are strongly typed, including the inputs to the template function. Make sure to take advantage of your IDE's functionality to make it easier to correctly fill these parameters.
+> All [`~tasks.lighteval_task.LightevalTaskConfig`] parameters are strongly typed, including the inputs to the template function. Make sure to take advantage of your IDE's functionality to make it easier to correctly fill these parameters.
 
 
-Once everything is good, open a PR, and we'll be happy to review it!
\ No newline at end of file
+Once everything is good, open a PR, and we'll be happy to review it!
diff --git a/docs/source/metric-list.mdx b/docs/source/metric-list.mdx
index 0ab03afb9..c7aefefd2 100644
--- a/docs/source/metric-list.mdx
+++ b/docs/source/metric-list.mdx
@@ -69,7 +69,7 @@ These metrics need the model to generate an output. They are therefore slower.
 - `quasi_exact_match_gsm8k`: Fraction of instances where the normalized prediction matches the normalized gold (normalization done for gsm8k, where latex symbols, units, etc are removed)
 - `maj_at_8_gsm8k`: Majority choice evaluation, using the gsm8k normalisation for the predictions and gold
 
-## LLM-as-Judge:
+## LLM-as-Judge
 - `llm_judge_gpt3p5`: Can be used for any generative task, the model will be scored by a GPT3.5 model using the OpenAI API
 - `llm_judge_llama_3_405b`: Can be used for any generative task, the model will be scored by a Llama 3.405B model using the HuggingFace API
 - `llm_judge_multi_turn_gpt3p5`: Can be used for any generative task, the model will be scored by a GPT3.5 model using the OpenAI API. It is used for multiturn tasks like mt-bench.
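The metric-definition flow documented in `adding-a-new-metric.mdx` above (a sample-level scoring function, an aggregation function, and a `SampleLevelMetric` wrapper) can be sketched end to end. This is a minimal sketch, not the library's canonical example: the import path and the `MetricCategory`/`MetricUseCase` values are assumptions inferred from the autodoc targets referenced in this diff.

```python
# A minimal sketch of a custom sample-level metric; module path and enum values
# are assumptions based on the reference pages added in this PR.
import numpy as np

from lighteval.metrics.utils.metric_utils import (  # assumed module path
    MetricCategory,
    MetricUseCase,
    SampleLevelMetric,
)


def custom_metric(predictions: list[str], formatted_doc, **kwargs) -> float:
    """Score one sample: 1.0 if the first prediction equals the gold choice."""
    response = predictions[0]
    return float(response == formatted_doc.choices[formatted_doc.gold_index])


my_custom_metric = SampleLevelMetric(
    metric_name="my_custom_metric_name",
    higher_is_better=True,
    category=MetricCategory.GENERATIVE,  # assumed category for a generative task
    use_case=MetricUseCase.ACCURACY,     # assumed use case
    sample_level_fn=custom_metric,       # scores one sample at a time
    corpus_level_fn=np.mean,             # aggregates sample scores into one number
)
```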
diff --git a/docs/source/package_reference/evaluation_tracker.mdx b/docs/source/package_reference/evaluation_tracker.mdx
new file mode 100644
index 000000000..87dc90b7a
--- /dev/null
+++ b/docs/source/package_reference/evaluation_tracker.mdx
@@ -0,0 +1,3 @@
+# EvaluationTracker
+
+[[autodoc]] logging.evaluation_tracker.EvaluationTracker
diff --git a/docs/source/package_reference/logging.mdx b/docs/source/package_reference/logging.mdx
new file mode 100644
index 000000000..9fd01154e
--- /dev/null
+++ b/docs/source/package_reference/logging.mdx
@@ -0,0 +1,12 @@
+# Loggers
+
+## GeneralConfigLogger
+[[autodoc]] logging.info_loggers.GeneralConfigLogger
+## DetailsLogger
+[[autodoc]] logging.info_loggers.DetailsLogger
+## MetricsLogger
+[[autodoc]] logging.info_loggers.MetricsLogger
+## VersionsLogger
+[[autodoc]] logging.info_loggers.VersionsLogger
+## TaskConfigLogger
+[[autodoc]] logging.info_loggers.TaskConfigLogger
diff --git a/docs/source/package_reference/metrics.mdx b/docs/source/package_reference/metrics.mdx
new file mode 100644
index 000000000..57c656966
--- /dev/null
+++ b/docs/source/package_reference/metrics.mdx
@@ -0,0 +1,70 @@
+# Metrics
+
+## Metrics
+[//]: # (TODO: aenum.Enum raises error when generating docs: not supported by inspect.signature. See: https://github.com/ethanfurman/aenum/issues/44)
+[//]: # (### Metrics)
+[//]: # ([[autodoc]] metrics.metrics.Metrics)
+### Metric
+[[autodoc]] metrics.utils.metric_utils.Metric
+### CorpusLevelMetric
+[[autodoc]] metrics.utils.metric_utils.CorpusLevelMetric
+### SampleLevelMetric
+[[autodoc]] metrics.utils.metric_utils.SampleLevelMetric
+### MetricGrouping
+[[autodoc]] metrics.utils.metric_utils.MetricGrouping
+### CorpusLevelMetricGrouping
+[[autodoc]] metrics.utils.metric_utils.CorpusLevelMetricGrouping
+### SampleLevelMetricGrouping
+[[autodoc]] metrics.utils.metric_utils.SampleLevelMetricGrouping
+
+## Corpus Metrics
+### CorpusLevelF1Score
+[[autodoc]] metrics.metrics_corpus.CorpusLevelF1Score
+### CorpusLevelPerplexityMetric
+[[autodoc]] metrics.metrics_corpus.CorpusLevelPerplexityMetric
+### CorpusLevelTranslationMetric
+[[autodoc]] metrics.metrics_corpus.CorpusLevelTranslationMetric
+### matthews_corrcoef
+[[autodoc]] metrics.metrics_corpus.matthews_corrcoef
+
+## Sample Metrics
+### ExactMatches
+[[autodoc]] metrics.metrics_sample.ExactMatches
+### F1_score
+[[autodoc]] metrics.metrics_sample.F1_score
+### LoglikelihoodAcc
+[[autodoc]] metrics.metrics_sample.LoglikelihoodAcc
+### NormalizedMultiChoiceProbability
+[[autodoc]] metrics.metrics_sample.NormalizedMultiChoiceProbability
+### Probability
+[[autodoc]] metrics.metrics_sample.Probability
+### Recall
+[[autodoc]] metrics.metrics_sample.Recall
+### MRR
+[[autodoc]] metrics.metrics_sample.MRR
+### ROUGE
+[[autodoc]] metrics.metrics_sample.ROUGE
+### BertScore
+[[autodoc]] metrics.metrics_sample.BertScore
+### Extractiveness
+[[autodoc]] metrics.metrics_sample.Extractiveness
+### Faithfulness
+[[autodoc]] metrics.metrics_sample.Faithfulness
+### BLEURT
+[[autodoc]] metrics.metrics_sample.BLEURT
+### BLEU
+[[autodoc]] metrics.metrics_sample.BLEU
+### StringDistance
+[[autodoc]] metrics.metrics_sample.StringDistance
+### JudgeLLM
+[[autodoc]] metrics.metrics_sample.JudgeLLM
+### JudgeLLMMTBench
+[[autodoc]] metrics.metrics_sample.JudgeLLMMTBench
+### JudgeLLMMixEval
+[[autodoc]] metrics.metrics_sample.JudgeLLMMixEval
+### MajAtK
+[[autodoc]] metrics.metrics_sample.MajAtK
+
+## LLM-as-a-Judge
+### JudgeLM
+[[autodoc]] metrics.llm_as_judge.JudgeLM
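The task-definition step described in `adding-a-custom-task.mdx` above culminates in a `LightevalTaskConfig`, whose reference page is added further down in this diff. A minimal sketch follows, building on the prompt function and custom metric from the earlier sketch; the field names mirror the existing docs, and the repository name, splits, and suite are placeholders rather than real resources.

```python
# A sketch of a task definition, assuming the LightevalTaskConfig fields used
# in the existing docs; dataset name and splits are hypothetical placeholders.
from lighteval.tasks.lighteval_task import LightevalTaskConfig

task = LightevalTaskConfig(
    name="myothertask",
    prompt_function=prompt_fn,        # the prompt_fn defined earlier in the guide
    suite=["community"],
    hf_repo="your_org/your_dataset",  # placeholder dataset repository
    hf_subset="default",
    hf_avail_splits=["train", "test"],
    evaluation_splits=["test"],
    few_shots_split=None,
    few_shots_select=None,
    metric=[my_custom_metric],        # the metric from the sketch above
    generation_size=256,
    stop_sequence=["\n"],
)
```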
diff --git a/docs/source/package_reference/model_config.mdx b/docs/source/package_reference/model_config.mdx
new file mode 100644
index 000000000..c70258bb5
--- /dev/null
+++ b/docs/source/package_reference/model_config.mdx
@@ -0,0 +1,12 @@
+# ModelConfig
+
+[[autodoc]] models.model_config.BaseModelConfig
+
+[[autodoc]] models.model_config.AdapterModelConfig
+[[autodoc]] models.model_config.DeltaModelConfig
+[[autodoc]] models.model_config.InferenceEndpointModelConfig
+[[autodoc]] models.model_config.InferenceModelConfig
+[[autodoc]] models.model_config.TGIModelConfig
+[[autodoc]] models.model_config.VLLMModelConfig
+
+[[autodoc]] models.model_config.create_model_config
diff --git a/docs/source/package_reference/models.mdx b/docs/source/package_reference/models.mdx
new file mode 100644
index 000000000..a04c9eef9
--- /dev/null
+++ b/docs/source/package_reference/models.mdx
@@ -0,0 +1,30 @@
+# Models
+
+## Model
+### LightevalModel
+[[autodoc]] models.abstract_model.LightevalModel
+
+## Accelerate and Transformers Models
+### BaseModel
+[[autodoc]] models.base_model.BaseModel
+[//]: # (TODO: Fix import error)
+[//]: # (### AdapterModel)
+[//]: # ([[autodoc]] models.adapter_model.AdapterModel)
+### DeltaModel
+[[autodoc]] models.delta_model.DeltaModel
+
+## Inference Endpoints and TGI Models
+### InferenceEndpointModel
+[[autodoc]] models.endpoint_model.InferenceEndpointModel
+### ModelClient
+[[autodoc]] models.tgi_model.ModelClient
+
+[//]: # (TODO: Fix import error)
+[//]: # (## Nanotron Model)
+[//]: # (### NanotronLightevalModel)
+[//]: # ([[autodoc]] models.nanotron_model.NanotronLightevalModel)
+
+[//]: # (TODO: Fix import error)
+[//]: # (## VLLM Model)
+[//]: # (### VLLMModel)
+[//]: # ([[autodoc]] models.vllm_model.VLLMModel)
diff --git a/docs/source/package_reference/pipeline.mdx b/docs/source/package_reference/pipeline.mdx
new file mode 100644
index 000000000..3473abb3d
--- /dev/null
+++ b/docs/source/package_reference/pipeline.mdx
@@ -0,0 +1,13 @@
+# Pipeline
+
+## Pipeline
+
+[[autodoc]] pipeline.Pipeline
+
+## PipelineParameters
+
+[[autodoc]] pipeline.PipelineParameters
+
+## ParallelismManager
+
+[[autodoc]] pipeline.ParallelismManager
diff --git a/docs/source/package_reference/tasks.mdx b/docs/source/package_reference/tasks.mdx
new file mode 100644
index 000000000..c1a84b00a
--- /dev/null
+++ b/docs/source/package_reference/tasks.mdx
@@ -0,0 +1,38 @@
+# Tasks
+
+## LightevalTask
+### LightevalTaskConfig
+[[autodoc]] tasks.lighteval_task.LightevalTaskConfig
+### LightevalTask
+[[autodoc]] tasks.lighteval_task.LightevalTask
+
+## PromptManager
+
+[[autodoc]] tasks.prompt_manager.PromptManager
+
+## Registry
+
+[[autodoc]] tasks.registry.Registry
+
+## Requests
+
+[[autodoc]] tasks.requests.Request
+
+[[autodoc]] tasks.requests.LoglikelihoodRequest
+
+[[autodoc]] tasks.requests.LoglikelihoodSingleTokenRequest
+
+[[autodoc]] tasks.requests.LoglikelihoodRollingRequest
+
+[[autodoc]] tasks.requests.GreedyUntilRequest
+
+[[autodoc]] tasks.requests.GreedyUntilMultiTurnRequest
+
+## Datasets
+
+[[autodoc]] data.DynamicBatchDataset
+[[autodoc]] data.LoglikelihoodDataset
+[[autodoc]] data.LoglikelihoodSingleTokenDataset
+[[autodoc]] data.GenerativeTaskDataset
+[[autodoc]] data.GenerativeTaskDatasetNanotron
+[[autodoc]] data.GenDistributedSampler
diff --git a/docs/source/using-the-python-api.mdx b/docs/source/using-the-python-api.mdx
index 82238c7f1..2e160a679 100644
--- a/docs/source/using-the-python-api.mdx
+++ b/docs/source/using-the-python-api.mdx
@@ -1,8 +1,9 @@
 # Using the Python API
 
-Lighteval can be used from a custom python script. To evaluate a model you will
-need to setup an `evaluation_tracker`, `pipeline_parameters`, `model_config`
-and a `pipeline`.
+Lighteval can be used from a custom Python script. To evaluate a model, you will need to set up an
+[`~logging.evaluation_tracker.EvaluationTracker`], [`~pipeline.PipelineParameters`],
+a [`model`](package_reference/models) or a [`model_config`](package_reference/model_config),
+and a [`~pipeline.Pipeline`].
 
 After that, simply run the pipeline and save the results.
 
diff --git a/pyproject.toml b/pyproject.toml
index a779ebf4c..7d17c8d13 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -59,6 +59,7 @@ dependencies = [
     "torch>=2.0,<2.5",
    "GitPython>=3.1.41", # for logging
    "datasets>=2.14.0",
+    "numpy<2", # pinned to avoid incompatibilities
    # Prettiness
    "termcolor==2.3.0",
    "pytablewriter",
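The `using-the-python-api.mdx` change above names the four objects a script needs: an `EvaluationTracker`, `PipelineParameters`, a model or model config, and a `Pipeline`. A minimal sketch of how they might fit together is below; the constructor arguments, model name, and task string are assumptions to be checked against the reference pages added in this PR.

```python
# A sketch only: argument names follow the classes referenced above, but exact
# signatures may differ between lighteval versions.
from lighteval.logging.evaluation_tracker import EvaluationTracker
from lighteval.models.model_config import BaseModelConfig
from lighteval.pipeline import ParallelismManager, Pipeline, PipelineParameters

# Where results and per-sample details are written
evaluation_tracker = EvaluationTracker(output_dir="./results", save_details=True)

pipeline_params = PipelineParameters(
    launcher_type=ParallelismManager.ACCELERATE,
    max_samples=10,  # small cap for a quick smoke test
)

model_config = BaseModelConfig(pretrained="gpt2")  # placeholder model name

pipeline = Pipeline(
    tasks="leaderboard|truthfulqa:mc|0|0",  # example task string
    pipeline_parameters=pipeline_params,
    evaluation_tracker=evaluation_tracker,
    model_config=model_config,
)

pipeline.evaluate()
pipeline.save_and_push_results()
pipeline.show_results()
```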