
Commit 82a3c25

Merge branch 'huggingface:main' into alielfilali01-patch-1-fixTasksList
2 parents: cfd4fc7 + 76e5aff

39 files changed (+1707 / -966 lines)

docs/source/_toctree.yml

Lines changed: 2 additions & 0 deletions
````diff
@@ -9,6 +9,8 @@
 - sections:
   - local: saving-and-reading-results
     title: Save and read results
+  - local: caching
+    title: Caching
   - local: using-the-python-api
     title: Use the Python API
   - local: adding-a-custom-task
````

docs/source/caching.mdx

Lines changed: 60 additions & 0 deletions
````diff
@@ -0,0 +1,60 @@
+# Caching System
+
+Lighteval includes a caching system that can significantly speed up evaluations by storing and reusing model predictions.
+This is especially useful when running the same evaluation multiple times, or when comparing different evaluation metrics on the same model outputs.
+
+## How It Works
+
+The caching system currently caches model predictions only (tokenized-input caching will be added later).
+It stores model response objects (generations, logits, probabilities) for evaluation samples.
+
+### Cache Structure
+
+Cached data is stored on disk using HuggingFace datasets in the following structure:
+
+```
+.cache/
+└── huggingface/
+    └── lighteval/
+        └── predictions/
+            └── {model_name}/
+                └── {model_hash}/
+                    └── {task_name}.parquet
+```
+
+Where:
+- `model_name`: The model name (path on the Hub or local path)
+- `model_hash`: Hash of the model configuration, ensuring cache invalidation when parameters change
+- `task_name`: Name of the evaluation task
+
+### Cache Recreation
+
+A new cache is automatically created when:
+- Model configuration changes (different parameters, quantization, etc.)
+- Model weights change (different revision, checkpoint, etc.)
+- Generation parameters change (temperature, max_tokens, etc.)
+
+This ensures that cached results are always consistent with your current model setup.
+
+## Using Caching
+
+### Automatic Caching
+
+All built-in model classes in Lighteval automatically support caching; no additional configuration is needed.
+For custom models, you need to attach a cache to the model class and add decorators to all prediction functions.
+
+## Cache Management
+
+### Clearing Cache
+
+To clear the cache for a specific model, delete the corresponding directory:
+
+```bash
+rm -rf ~/.cache/huggingface/lighteval/predictions/{model_name}/{model_hash}/
+```
+
+To clear all caches:
+
+```bash
+rm -rf ~/.cache/huggingface/lighteval/predictions
+```
````
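Since the documented cache layout is just nested directories of parquet files, inspection and cleanup can also be scripted instead of using `rm -rf`. The snippet below is a minimal sketch, not part of this commit: it assumes the default `~/.cache/huggingface/lighteval/predictions` root shown above, assumes a hub-style `model_name` maps directly onto nested directories, and uses a model name purely for illustration.

```python
import shutil
from pathlib import Path

# Default prediction-cache root as documented above (assumption: no custom cache location).
CACHE_ROOT = Path.home() / ".cache" / "huggingface" / "lighteval" / "predictions"


def list_cached_tasks(model_name: str) -> list[Path]:
    """List every cached {task_name}.parquet file under all hashes of a model."""
    model_dir = CACHE_ROOT / model_name
    return sorted(model_dir.glob("*/*.parquet")) if model_dir.is_dir() else []


def clear_model_cache(model_name: str, model_hash: str | None = None) -> None:
    """Delete the cache for one model hash, or for every hash of the model."""
    target = CACHE_ROOT / model_name
    if model_hash is not None:
        target = target / model_hash
    if target.is_dir():
        shutil.rmtree(target)


if __name__ == "__main__":
    # Hypothetical model name, used only to illustrate the path structure.
    for path in list_cached_tasks("HuggingFaceH4/zephyr-7b-beta"):
        print(path)
```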

docs/source/evaluating-a-custom-model.mdx

Lines changed: 47 additions & 26 deletions
````diff
@@ -1,6 +1,8 @@
-# Evaluating a Custom Model
+# Custom Model
 
-Lighteval allows you to evaluate custom model implementations by creating a custom model class that inherits from `LightevalModel`. This is useful when you want to evaluate models that aren't directly supported by the standard backends (transformers, vllm, etc).
+Lighteval allows you to evaluate custom model implementations by creating a custom model class that inherits from `LightevalModel`.
+This is useful when you want to evaluate models that aren't directly supported by the standard backends and providers (transformers, vllm, etc), or
+if you want to add your own pre/post processing.
 
 ## Creating a Custom Model
 
@@ -9,28 +11,34 @@ Lighteval allows you to evaluate custom model implementations by creating a cust
 Here's a basic example:
 
 ```python
+from typing import List
 from lighteval.models.abstract_model import LightevalModel
+from lighteval.models.model_output import ModelResponse
+from lighteval.tasks.requests import Doc
+from lighteval.utils.cache_management import SampleCache, cached
 
 class MyCustomModel(LightevalModel):
     def __init__(self, config):
         super().__init__(config)
         # Initialize your model here...
 
-    def greedy_until(self, requests, max_tokens=None, stop_sequences=None):
+        # Enable caching (recommended)
+        self._cache = SampleCache(config)
+
+    @cached("predictions") # Enable caching for better performance
+    def greedy_until(self, docs: List[Doc]) -> List[ModelResponse]:
         # Implement generation logic
         pass
 
-    def loglikelihood(self, requests, log=True):
+    @cached("predictions") # Enable caching for better performance
+    def loglikelihood(self, docs: List[Doc]) -> List[ModelResponse]:
         # Implement loglikelihood computation
         pass
 
-    def loglikelihood_rolling(self, requests):
+    @cached("predictions") # Enable caching for better performance
+    def loglikelihood_rolling(self, docs: List[Doc]) -> List[ModelResponse]:
         # Implement rolling loglikelihood computation
         pass
-
-    def loglikelihood_single_token(self, requests):
-        # Implement single token loglikelihood computation
-        pass
 ```
 
 2. The custom model file should contain exactly one class that inherits from `LightevalModel`. This class will be automatically detected and instantiated when loading the model.
@@ -97,31 +105,44 @@ pipeline.save_and_push_results()
 
 Your custom model must implement these core methods:
 
-- `greedy_until`: For generating text until a stop sequence or max tokens is reached
-- `loglikelihood`: For computing log probabilities of specific continuations
-- `loglikelihood_rolling`: For computing rolling log probabilities of sequences
-- `loglikelihood_single_token`: For computing log probabilities of single tokens
+- `greedy_until`: For generating text until a stop sequence or max tokens is reached - this is used for generative evaluations
+- `loglikelihood`: For computing log probabilities of specific continuations - this is used for multiple choice logprob evaluations
+- `loglikelihood_rolling`: For computing rolling log probabilities of sequences - this is used for perplexity metrics
 
 See the `LightevalModel` base class documentation for detailed method signatures and requirements.
 
-## Best Practices
+## Enabling Caching (Recommended)
 
-1. **Error Handling**: Implement robust error handling in your model methods to gracefully handle edge cases.
+Lighteval includes a caching system that can significantly speed up evaluations by storing and reusing model predictions.
+To enable caching in your custom model:
 
-2. **Batching**: Consider implementing efficient batching in your model methods to improve performance.
+1. **Import caching components**:
+   ```python
+   from lighteval.utils.cache_management import SampleCache, cached
+   ```
 
-3. **Resource Management**: Properly manage any resources (e.g., API connections, model weights) in your model's `__init__` and `__del__` methods.
+2. **Initialize cache in constructor**:
+   ```python
+   def __init__(self, config):
+       # Your initialization code...
+       self._cache = SampleCache(config)
+   ```
 
-4. **Documentation**: Add clear docstrings to your model class and methods explaining any specific requirements or limitations.
+3. **Add cache decorators** to your prediction methods:
+   ```python
+   @cached("predictions")
+   def greedy_until(self, docs: List[Doc]) -> List[ModelResponse]:
+       # Your implementation...
+   ```
 
-## Example Use Cases
+For detailed information about the caching system, see the [Caching Documentation](./caching.mdx).
 
-Custom models are particularly useful for:
+## Best Practices
+
+1. **Error Handling**: Implement robust error handling in your model methods to gracefully handle edge cases.
+
+2. **Batching**: Consider implementing efficient batching in your model methods to improve performance.
 
-- Evaluating models accessed through custom APIs
-- Wrapping models with specialized preprocessing/postprocessing
-- Testing novel model architectures
-- Evaluating ensemble models
-- Integrating with external services or tools
+3. **Documentation**: Add clear docstrings to your model class and methods explaining any specific requirements or limitations.
 
-For a complete example of a custom model that wraps the Google Translate API, see `examples/custom_models/google_translate_model.py`.
+4. **Caching**: Enable caching to speed up repeated evaluations and development iterations.
````
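Read end to end, the hunks above amount to the following skeleton for a cached custom model. This is only a consolidated sketch of the snippets shown in the diff (the imports, `SampleCache` in the constructor, and `@cached("predictions")` on each prediction method); the method bodies remain stubs and the class name is illustrative.

```python
from typing import List

from lighteval.models.abstract_model import LightevalModel
from lighteval.models.model_output import ModelResponse
from lighteval.tasks.requests import Doc
from lighteval.utils.cache_management import SampleCache, cached


class MyCustomModel(LightevalModel):
    """Skeleton combining the caching steps scattered across the hunks above."""

    def __init__(self, config):
        super().__init__(config)
        # Initialize your model or API client here...

        # Step 2 from the doc: attach a sample cache so the decorators can store predictions.
        self._cache = SampleCache(config)

    # Step 3 from the doc: decorate every prediction method.
    @cached("predictions")
    def greedy_until(self, docs: List[Doc]) -> List[ModelResponse]:
        # Implement generation logic (used for generative evaluations).
        pass

    @cached("predictions")
    def loglikelihood(self, docs: List[Doc]) -> List[ModelResponse]:
        # Implement loglikelihood computation (used for multiple-choice logprob evaluations).
        pass

    @cached("predictions")
    def loglikelihood_rolling(self, docs: List[Doc]) -> List[ModelResponse]:
        # Implement rolling loglikelihood computation (used for perplexity metrics).
        pass
```

As the diff also notes, the custom model file should contain exactly one class inheriting from `LightevalModel`; it is detected and instantiated automatically when the model is loaded.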

docs/source/package_reference/models.mdx

Lines changed: 1 addition & 1 deletion
````diff
@@ -5,7 +5,7 @@ set in the `model-args` or in the model yaml file (see example
 [here](https://github.com/huggingface/lighteval/blob/main/examples/model_configs/vllm_model_config.yaml)).
 
 ### Base model config
-[[autodoc]] models.utils.ModelConfig
+[[autodoc]] models.abstract_model.ModelConfig
 
 ## Local Models
 
````

docs/source/use-vllm-as-backend.mdx

Lines changed: 8 additions & 8 deletions
````diff
@@ -9,8 +9,8 @@ To use, simply change the `model_args` to reflect the arguments you want to pass
 
 ```bash
 lighteval vllm \
-    "model_name=HuggingFaceH4/zephyr-7b-beta,dtype=float16" \
-    "leaderboard|truthfulqa:mc|0|0"
+    "model_name=HuggingFaceH4/zephyr-7b-beta" \
+    "extended|ifeval|0|0"
 ```
 
 `vllm` is able to distribute the model across multiple GPUs using data
@@ -21,16 +21,16 @@ For example if you have 4 GPUs you can split it across using `tensor_parallelism
 
 ```bash
 export VLLM_WORKER_MULTIPROC_METHOD=spawn && lighteval vllm \
-    "model_name=HuggingFaceH4/zephyr-7b-beta,dtype=float16,tensor_parallel_size=4" \
-    "leaderboard|truthfulqa:mc|0|0"
+    "model_name=HuggingFaceH4/zephyr-7b-beta,tensor_parallel_size=4" \
+    "extended|ifeval|0|0"
 ```
 
 Or, if your model fits on a single GPU, you can use `data_parallelism` to speed up the evaluation:
 
 ```bash
-lighteval vllm \
-    "model_name=HuggingFaceH4/zephyr-7b-beta,dtype=float16,data_parallel_size=4" \
-    "leaderboard|truthfulqa:mc|0|0"
+export VLLM_WORKER_MULTIPROC_METHOD=spawn && lighteval vllm \
+    "model_name=HuggingFaceH4/zephyr-7b-beta,data_parallel_size=4" \
+    "extended|ifeval|0|0"
 ```
 
 ## Use a config file
@@ -41,7 +41,7 @@ An example of a config file is shown below and can be found at `examples/model_c
 ```bash
 lighteval vllm \
     "examples/model_configs/vllm_model_config.yaml" \
-    "leaderboard|truthfulqa:mc|0|0"
+    "extended|ifeval|0|0"
 ```
 
 ```yaml
````

examples/custom_models/google_translate_model.py

Lines changed: 11 additions & 31 deletions
````diff
@@ -32,17 +32,12 @@
 from transformers import AutoTokenizer
 
 from lighteval.data import GenerativeTaskDataset
-from lighteval.models.abstract_model import LightevalModel, ModelInfo
+from lighteval.models.abstract_model import LightevalModel
 from lighteval.models.model_output import (
-    GenerativeResponse,
-    LoglikelihoodResponse,
-    LoglikelihoodSingleTokenResponse,
+    ModelResponse,
 )
 from lighteval.tasks.requests import (
-    GreedyUntilRequest,
-    LoglikelihoodRequest,
-    LoglikelihoodRollingRequest,
-    LoglikelihoodSingleTokenRequest,
+    Doc,
 )
 
 
@@ -53,13 +48,7 @@ class GoogleTranslateClient(LightevalModel):
     def __init__(self, config) -> None:
         self.model = config.model_name
         self.model_definition_file_path = config.model_definition_file_path
-
-        self.model_info = ModelInfo(
-            model_name=config.model_name,
-            model_sha="",
-            model_dtype=None,
-            model_size=-1,
-        )
+        self.config = config
 
         self._tokenizer = AutoTokenizer.from_pretrained("gpt2") # Use a dummy tokenizer for compatibility
 
@@ -113,8 +102,8 @@ def _translate_with_cache(self, context: str, src_lang: str, tgt_lang: str) -> s
 
     def greedy_until(
         self,
-        requests: list[GreedyUntilRequest],
-    ) -> list[GenerativeResponse]:
+        requests: list[Doc],
+    ) -> list[ModelResponse]:
         """
         Generates responses using a greedy decoding strategy until certain ending conditions are met.
         Results are cached to disk to avoid repeated translations.
@@ -124,7 +113,7 @@ def greedy_until(
             override_bs (int, optional): Override the batch size for generation. Defaults to None.
 
         Returns:
-            list[GenerativeResponse]: list of generated responses.
+            list[ModelResponse]: list of generated responses.
         """
         for request in requests:
             request.tokenized_context = self.tok_encode(request.context)
@@ -149,7 +138,7 @@ def greedy_until(
                 if result is None:
                     result = "" # Set to empty string to prevent errors in metric computation
 
-                cur_response = GenerativeResponse(
+                cur_response = ModelResponse(
                     result=result,
                     logits=None,
                     generated_tokens=[],
@@ -175,24 +164,15 @@ def max_length(self) -> int:
         """Return the maximum sequence length of the model."""
         return 4096
 
-    def loglikelihood(self, requests: list[LoglikelihoodRequest]) -> list[LoglikelihoodResponse]:
+    def loglikelihood(self, requests: list[Doc]) -> list[ModelResponse]:
         """Tokenize the context and continuation and compute the log likelihood of those
         tokenized sequences.
         """
         raise NotImplementedError
 
     def loglikelihood_rolling(
         self,
-        requests: list[LoglikelihoodRollingRequest],
-    ) -> list[LoglikelihoodResponse]:
+        requests: list[Doc],
+    ) -> list[ModelResponse]:
         """This function is used to compute the log likelihood of the context for perplexity metrics."""
         raise NotImplementedError
-
-    def loglikelihood_single_token(
-        self,
-        requests: list[LoglikelihoodSingleTokenRequest],
-    ) -> list[LoglikelihoodSingleTokenResponse]:
-        """Tokenize the context and continuation and compute the log likelihood of those
-        tokenized sequences.
-        """
-        raise NotImplementedError
````
