
Commit df3a82d

Caching samples PR (#909)
Adds a new caching system for generative evals, plus a test suite and a documentation page. The system first loads the cache indices, then runs the model only on the samples that are missing from the cache, and lastly loads the cached items as needed (the cache is not kept in memory while the model is running).
1 parent bfa6076 commit df3a82d

23 files changed, +1066 -552 lines

docs/source/_toctree.yml

Lines changed: 2 additions & 0 deletions
@@ -9,6 +9,8 @@
 - sections:
   - local: saving-and-reading-results
     title: Save and read results
+  - local: caching
+    title: Caching
   - local: using-the-python-api
     title: Use the Python API
   - local: adding-a-custom-task

docs/source/caching.mdx

Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
+# Caching System
+
+Lighteval includes a caching system that can significantly speed up evaluations by storing and reusing model predictions.
+This is especially useful when running the same evaluation multiple times, or when comparing different evaluation metrics on the same model outputs.
+
+## How It Works
+
+For now, the caching system only caches model predictions (tokenized input caching will be added later).
+It stores model response objects (generations, logits, probabilities) for evaluation samples.
+
+### Cache Structure
+
+Cached data is stored on disk using HuggingFace datasets in the following structure:
+
+```
+./cache/
+└── huggingface/
+    └── lighteval/
+        └── predictions/
+            └── {model_name}/
+                └── {model_hash}/
+                    └── {task_name}.parquet
+```
+
+Where:
+- `model_name`: The model name (path on the Hub or local path)
+- `model_hash`: Hash of the model configuration, to ensure cache invalidation when parameters change
+- `task_name`: Name of the evaluation task
+
+### Cache Recreation
+
+A new cache is automatically created when:
+- Model configuration changes (different parameters, quantization, etc.)
+- Model weights change (different revision, checkpoint, etc.)
+- Generation parameters change (temperature, max_tokens, etc.)
+
+This ensures that cached results are always consistent with your current model setup.
+
+## Using Caching
+
+### Automatic Caching
+
+All built-in model classes in Lighteval automatically support caching. No additional configuration is needed.
+For custom models, you need to add a cache to the model class and apply the caching decorator to its prediction methods.
+
+## Cache Management
+
+### Clearing Cache
+
+To clear the cache for a specific model, delete the corresponding directory:
+
+```bash
+rm -rf ./cache/huggingface/lighteval/predictions/{model_name}/{model_hash}/
+```
+
+To clear all caches:
+
+```bash
+rm -rf ./cache/huggingface/lighteval/predictions
+```
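
Since each cached task is stored as a parquet file written with HuggingFace datasets, the cache can be inspected directly from Python. The snippet below is only an illustrative sketch, not part of this commit; the cache root matches the default documented above, and the files it finds depend entirely on your local setup:

```python
# Illustrative sketch only: list and inspect cached predictions on disk.
# The directory layout follows the structure documented above.
from pathlib import Path

from datasets import load_dataset

cache_root = Path("./cache/huggingface/lighteval/predictions")

# Walk the cache and print every cached task file per model/hash.
for parquet_file in sorted(cache_root.rglob("*.parquet")):
    print(parquet_file.relative_to(cache_root))

# Load one cached task as a regular dataset to look at the stored responses.
example_file = next(cache_root.rglob("*.parquet"), None)
if example_file is not None:
    cached = load_dataset("parquet", data_files=str(example_file), split="train")
    print(cached.column_names)
```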

docs/source/evaluating-a-custom-model.mdx

Lines changed: 47 additions & 26 deletions
@@ -1,6 +1,8 @@
-# Evaluating a Custom Model
+# Custom Model
 
-Lighteval allows you to evaluate custom model implementations by creating a custom model class that inherits from `LightevalModel`. This is useful when you want to evaluate models that aren't directly supported by the standard backends (transformers, vllm, etc).
+Lighteval allows you to evaluate custom model implementations by creating a custom model class that inherits from `LightevalModel`.
+This is useful when you want to evaluate models that aren't directly supported by the standard backends and providers (transformers, vllm, etc), or
+if you want to add your own pre/post processing.
 
 ## Creating a Custom Model
 
@@ -9,28 +11,34 @@ Lighteval allows you to evaluate custom model implementations by creating a cust
 Here's a basic example:
 
 ```python
+from typing import List
 from lighteval.models.abstract_model import LightevalModel
+from lighteval.models.model_output import ModelResponse
+from lighteval.tasks.requests import Doc
+from lighteval.utils.cache_management import SampleCache, cached
 
 class MyCustomModel(LightevalModel):
     def __init__(self, config):
         super().__init__(config)
         # Initialize your model here...
 
-    def greedy_until(self, requests, max_tokens=None, stop_sequences=None):
+        # Enable caching (recommended)
+        self._cache = SampleCache(config)
+
+    @cached("predictions")  # Enable caching for better performance
+    def greedy_until(self, docs: List[Doc]) -> List[ModelResponse]:
         # Implement generation logic
         pass
 
-    def loglikelihood(self, requests, log=True):
+    @cached("predictions")  # Enable caching for better performance
+    def loglikelihood(self, docs: List[Doc]) -> List[ModelResponse]:
        # Implement loglikelihood computation
        pass
 
-    def loglikelihood_rolling(self, requests):
+    @cached("predictions")  # Enable caching for better performance
+    def loglikelihood_rolling(self, docs: List[Doc]) -> List[ModelResponse]:
        # Implement rolling loglikelihood computation
        pass
-
-    def loglikelihood_single_token(self, requests):
-        # Implement single token loglikelihood computation
-        pass
 ```
 
 2. The custom model file should contain exactly one class that inherits from `LightevalModel`. This class will be automatically detected and instantiated when loading the model.
@@ -97,31 +105,44 @@ pipeline.save_and_push_results()
 
 Your custom model must implement these core methods:
 
-- `greedy_until`: For generating text until a stop sequence or max tokens is reached
-- `loglikelihood`: For computing log probabilities of specific continuations
-- `loglikelihood_rolling`: For computing rolling log probabilities of sequences
-- `loglikelihood_single_token`: For computing log probabilities of single tokens
+- `greedy_until`: For generating text until a stop sequence or max tokens is reached - this is used for generative evaluations
+- `loglikelihood`: For computing log probabilities of specific continuations - this is used for multiple choice logprob evaluations
+- `loglikelihood_rolling`: For computing rolling log probabilities of sequences - this is used for perplexity metrics
 
 See the `LightevalModel` base class documentation for detailed method signatures and requirements.
 
-## Best Practices
+## Enabling Caching (Recommended)
 
-1. **Error Handling**: Implement robust error handling in your model methods to gracefully handle edge cases.
+Lighteval includes a caching system that can significantly speed up evaluations by storing and reusing model predictions.
+To enable caching in your custom model:
 
-2. **Batching**: Consider implementing efficient batching in your model methods to improve performance.
+1. **Import caching components**:
+   ```python
+   from lighteval.utils.cache_management import SampleCache, cached
+   ```
 
-3. **Resource Management**: Properly manage any resources (e.g., API connections, model weights) in your model's `__init__` and `__del__` methods.
+2. **Initialize cache in constructor**:
+   ```python
+   def __init__(self, config):
+       # Your initialization code...
+       self._cache = SampleCache(config)
+   ```
 
-4. **Documentation**: Add clear docstrings to your model class and methods explaining any specific requirements or limitations.
+3. **Add cache decorators** to your prediction methods:
+   ```python
+   @cached("predictions")
+   def greedy_until(self, docs: List[Doc]) -> List[ModelResponse]:
+       # Your implementation...
+   ```
 
-## Example Use Cases
+For detailed information about the caching system, see the [Caching Documentation](./caching.mdx).
 
-Custom models are particularly useful for:
+## Best Practices
+
+1. **Error Handling**: Implement robust error handling in your model methods to gracefully handle edge cases.
+
+2. **Batching**: Consider implementing efficient batching in your model methods to improve performance.
 
-- Evaluating models accessed through custom APIs
-- Wrapping models with specialized preprocessing/postprocessing
-- Testing novel model architectures
-- Evaluating ensemble models
-- Integrating with external services or tools
+3. **Documentation**: Add clear docstrings to your model class and methods explaining any specific requirements or limitations.
 
-For a complete example of a custom model that wraps the Google Translate API, see `examples/custom_models/google_translate_model.py`.
+4. **Caching**: Enable caching to speed up repeated evaluations and development iterations.
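
The commit description sums up what these decorators do: the cache index is consulted first, the model is only run on the samples that are missing, and the cached items are only loaded afterwards, so responses are never held in memory while the model runs. The sketch below is a simplified illustration of that flow, not the actual `lighteval.utils.cache_management` implementation; the `contains`/`store`/`load` helpers and the `doc.id` field are invented for the example.

```python
# Simplified, illustrative sketch of the caching flow described in the commit
# message. This is NOT the real lighteval implementation: the actual
# SampleCache and @cached decorator live in lighteval.utils.cache_management
# and persist responses as parquet datasets; the cache API used here
# (contains/store/load, doc.id) is invented for illustration.
from functools import wraps


def cached_predictions(method):
    """Wrap a prediction method so that only uncached docs hit the model."""

    @wraps(method)
    def wrapper(self, docs):
        cache = self._cache

        # 1. Check the cache index first to find which samples are missing.
        missing = [doc for doc in docs if not cache.contains(doc)]

        # 2. Run the model only on the missing samples and store the results.
        fresh = method(self, missing) if missing else []
        for doc, response in zip(missing, fresh):
            cache.store(doc, response)

        # 3. Only now load the cached items, so the cache is never kept in
        #    memory while the model is running.
        fresh_by_id = {doc.id: resp for doc, resp in zip(missing, fresh)}
        return [
            fresh_by_id[doc.id] if doc.id in fresh_by_id else cache.load(doc)
            for doc in docs
        ]

    return wrapper
```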

examples/custom_models/google_translate_model.py

Lines changed: 9 additions & 23 deletions
@@ -34,15 +34,10 @@
 from lighteval.data import GenerativeTaskDataset
 from lighteval.models.abstract_model import LightevalModel
 from lighteval.models.model_output import (
-    GenerativeResponse,
-    LoglikelihoodResponse,
-    LoglikelihoodSingleTokenResponse,
+    ModelResponse,
 )
 from lighteval.tasks.requests import (
-    GreedyUntilRequest,
-    LoglikelihoodRequest,
-    LoglikelihoodRollingRequest,
-    LoglikelihoodSingleTokenRequest,
+    Doc,
 )


@@ -107,8 +102,8 @@ def _translate_with_cache(self, context: str, src_lang: str, tgt_lang: str) -> s
 
     def greedy_until(
         self,
-        requests: list[GreedyUntilRequest],
-    ) -> list[GenerativeResponse]:
+        requests: list[Doc],
+    ) -> list[ModelResponse]:
         """
         Generates responses using a greedy decoding strategy until certain ending conditions are met.
         Results are cached to disk to avoid repeated translations.
@@ -118,7 +113,7 @@ def greedy_until(
             override_bs (int, optional): Override the batch size for generation. Defaults to None.
 
         Returns:
-            list[GenerativeResponse]: list of generated responses.
+            list[ModelResponse]: list of generated responses.
         """
         for request in requests:
             request.tokenized_context = self.tok_encode(request.context)
@@ -143,7 +138,7 @@ def greedy_until(
             if result is None:
                 result = ""  # Set to empty string to prevent errors in metric computation
 
-            cur_response = GenerativeResponse(
+            cur_response = ModelResponse(
                 result=result,
                 logits=None,
                 generated_tokens=[],
@@ -169,24 +164,15 @@ def max_length(self) -> int:
         """Return the maximum sequence length of the model."""
         return 4096
 
-    def loglikelihood(self, requests: list[LoglikelihoodRequest]) -> list[LoglikelihoodResponse]:
+    def loglikelihood(self, requests: list[Doc]) -> list[ModelResponse]:
         """Tokenize the context and continuation and compute the log likelihood of those
         tokenized sequences.
         """
         raise NotImplementedError
 
     def loglikelihood_rolling(
         self,
-        requests: list[LoglikelihoodRollingRequest],
-    ) -> list[LoglikelihoodResponse]:
+        requests: list[Doc],
+    ) -> list[ModelResponse]:
         """This function is used to compute the log likelihood of the context for perplexity metrics."""
         raise NotImplementedError
-
-    def loglikelihood_single_token(
-        self,
-        requests: list[LoglikelihoodSingleTokenRequest],
-    ) -> list[LoglikelihoodSingleTokenResponse]:
-        """Tokenize the context and continuation and compute the log likelihood of those
-        tokenized sequences.
-        """
-        raise NotImplementedError

examples/custom_models/local_mt_model.py

Lines changed: 8 additions & 25 deletions
@@ -36,15 +36,10 @@
 from lighteval.data import GenerativeTaskDataset
 from lighteval.models.abstract_model import LightevalModel, TokenSequence
 from lighteval.models.model_output import (
-    GenerativeResponse,
-    LoglikelihoodResponse,
-    LoglikelihoodSingleTokenResponse,
+    ModelResponse,
 )
 from lighteval.tasks.requests import (
-    GreedyUntilRequest,
-    LoglikelihoodRequest,
-    LoglikelihoodRollingRequest,
-    LoglikelihoodSingleTokenRequest,
+    Doc,
 )


@@ -119,9 +114,9 @@ def _convert_to_iso3(self, lang_code: str) -> str:
 
     def greedy_until(
         self,
-        requests: list[GreedyUntilRequest],
+        requests: list[Doc],
         override_bs: Optional[int] = None,
-    ) -> list[GenerativeResponse]:
+    ) -> list[ModelResponse]:
         """
         Generates responses using a greedy decoding strategy until certain ending conditions are met.
         Results are cached to disk to avoid repeated translations.
@@ -131,7 +126,7 @@ def greedy_until(
             override_bs (int, optional): Override the batch size for generation. Defaults to None.
 
         Returns:
-            list[GenerativeResponse]: list of generated responses.
+            list[ModelResponse]: list of generated responses.
         """
 
         def get_langs(task_name: str) -> tuple[str, str]:
@@ -204,7 +199,7 @@ def get_langs(task_name: str) -> tuple[str, str]:
            # Create responses for the batch
            for input_tokens, output_tokens, translation in zip(input_ids, output_ids, translations):
                results.append(
-                    GenerativeResponse(
+                    ModelResponse(
                        input_tokens=input_tokens,
                        generated_tokens=output_tokens,
                        result=translation,
@@ -256,24 +251,12 @@ def max_length(self) -> int:
         """Return the maximum sequence length of the model."""
         return 4096
 
-    def loglikelihood(
-        self, requests: list[LoglikelihoodRequest], override_bs: Optional[int] = None
-    ) -> list[LoglikelihoodResponse]:
+    def loglikelihood(self, requests: list[Doc], override_bs: Optional[int] = None) -> list[ModelResponse]:
         """Tokenize the context and continuation and compute the log likelihood of those
         tokenized sequences.
         """
         raise NotImplementedError
 
-    def loglikelihood_rolling(
-        self, requests: list[LoglikelihoodRollingRequest], override_bs: Optional[int] = None
-    ) -> list[LoglikelihoodResponse]:
+    def loglikelihood_rolling(self, requests: list[Doc], override_bs: Optional[int] = None) -> list[ModelResponse]:
         """This function is used to compute the log likelihood of the context for perplexity metrics."""
         raise NotImplementedError
-
-    def loglikelihood_single_token(
-        self, requests: list[LoglikelihoodSingleTokenRequest], override_bs: Optional[int] = None
-    ) -> list[LoglikelihoodSingleTokenResponse]:
-        """Tokenize the context and continuation and compute the log likelihood of those
-        tokenized sequences.
-        """
-        raise NotImplementedError

src/lighteval/main_accelerate.py

Lines changed: 2 additions & 0 deletions
@@ -154,8 +154,10 @@ def accelerate(  # noqa C901
     config: dict = ModelConfig._parse_args(model_args)
 
     if config.get("delta_weights", False):
+        config.pop("delta_weights")
         model_config = DeltaModelConfig(**config)
     elif config.get("adapter_weights", False):
+        config.pop("adapter_weights")
         model_config = AdapterModelConfig(**config)
     else:
         if vision_model:
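
The added `config.pop(...)` calls go hand in hand with `ModelConfig` being declared with `extra="forbid"` (see the `abstract_model.py` change below): a pydantic model with that setting rejects any keyword it does not declare, so selector-style keys have to be removed before the config dict is splatted into the constructor. A minimal illustration of that behaviour, using made-up config names rather than lighteval code:

```python
# Minimal illustration (not lighteval code) of pydantic's extra="forbid":
# any keyword that the model does not declare raises a ValidationError.
from pydantic import BaseModel, ValidationError


class ExampleConfig(BaseModel, extra="forbid"):
    model_name: str


try:
    ExampleConfig(model_name="my-model", delta_weights=True)
except ValidationError as err:
    print(err)  # complains about the undeclared 'delta_weights' key

config = {"model_name": "my-model", "delta_weights": True}
config.pop("delta_weights")     # same pattern as in the diff above
print(ExampleConfig(**config))  # now validates fine
```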

src/lighteval/models/abstract_model.py

Lines changed: 2 additions & 1 deletion
@@ -81,6 +81,7 @@ class ModelConfig(BaseModel, extra="forbid"):
 
     generation_parameters: GenerationParameters = GenerationParameters()
     system_prompt: str | None = None
+    cache_dir: str = "./cache/huggingface/lighteval"
 
     @classmethod
     def from_path(cls, path: str):
@@ -191,7 +192,7 @@ def greedy_until(
            docs (list[Doc]): List of documents containing the context for generation.
 
        Returns:
-            list[GenerativeResponse]: list of generated responses.
+            list[ModelResponse]: list of generated responses.
        """
        return NotImplemented
