
Commit 82a3c25

Merge branch 'huggingface:main' into alielfilali01-patch-1-fixTasksList
2 parents: cfd4fc7 + 76e5aff

39 files changed (+1707 / -966 lines)

docs/source/_toctree.yml

Lines changed: 2 additions & 0 deletions
````diff
@@ -9,6 +9,8 @@
 - sections:
   - local: saving-and-reading-results
     title: Save and read results
+  - local: caching
+    title: Caching
   - local: using-the-python-api
     title: Use the Python API
   - local: adding-a-custom-task
````

docs/source/caching.mdx

Lines changed: 60 additions & 0 deletions
````diff
@@ -0,0 +1,60 @@
+# Caching System
+
+Lighteval includes a caching system that can significantly speed up evaluations by storing and reusing model predictions.
+This is especially useful when running the same evaluation multiple times, or when comparing different evaluation metrics on the same model outputs.
+
+## How It Works
+
+The caching system currently caches model predictions only (tokenized-input caching will be added later).
+It stores model response objects (generations, logits, probabilities) for evaluation samples.
+
+### Cache Structure
+
+Cached data is stored on disk using HuggingFace datasets in the following structure:
+
+```
+.cache/
+└── huggingface/
+    └── lighteval/
+        └── predictions/
+            └── {model_name}/
+                └── {model_hash}/
+                    └── {task_name}.parquet
+```
+
+Where:
+- `model_name`: The model name (path on the Hub or local path)
+- `model_hash`: Hash of the model configuration, ensuring cache invalidation when parameters change
+- `task_name`: Name of the evaluation task
+
+### Cache Recreation
+
+A new cache is automatically created when:
+- Model configuration changes (different parameters, quantization, etc.)
+- Model weights change (different revision, checkpoint, etc.)
+- Generation parameters change (temperature, max_tokens, etc.)
+
+This ensures that cached results are always consistent with your current model setup.
+
+## Using Caching
+
+### Automatic Caching
+
+All built-in model classes in Lighteval automatically support caching; no additional configuration is needed.
+For custom models, you need to attach a cache to the model class and add decorators to all prediction functions.
+
+## Cache Management
+
+### Clearing Cache
+
+To clear the cache for a specific model, delete the corresponding directory:
+
+```bash
+rm -rf ~/.cache/huggingface/lighteval/predictions/{model_name}/{model_hash}/
+```
+
+To clear all caches:
+
+```bash
+rm -rf ~/.cache/huggingface/lighteval/predictions
+```
````
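Since the documented cache layout is just nested directories of parquet files, inspection and cleanup can also be scripted instead of using `rm -rf`. The snippet below is a minimal sketch, not part of this commit: it assumes the default `~/.cache/huggingface/lighteval/predictions` root shown above, assumes a hub-style `model_name` maps directly onto nested directories, and uses a model name purely for illustration.

```python
import shutil
from pathlib import Path

# Default prediction-cache root as documented above (assumption: no custom cache location).
CACHE_ROOT = Path.home() / ".cache" / "huggingface" / "lighteval" / "predictions"


def list_cached_tasks(model_name: str) -> list[Path]:
    """List every cached {task_name}.parquet file under all hashes of a model."""
    model_dir = CACHE_ROOT / model_name
    return sorted(model_dir.glob("*/*.parquet")) if model_dir.is_dir() else []


def clear_model_cache(model_name: str, model_hash: str | None = None) -> None:
    """Delete the cache for one model hash, or for every hash of the model."""
    target = CACHE_ROOT / model_name
    if model_hash is not None:
        target = target / model_hash
    if target.is_dir():
        shutil.rmtree(target)


if __name__ == "__main__":
    # Hypothetical model name, used only to illustrate the path structure.
    for path in list_cached_tasks("HuggingFaceH4/zephyr-7b-beta"):
        print(path)
```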

docs/source/evaluating-a-custom-model.mdx

Lines changed: 47 additions & 26 deletions
````diff
@@ -1,6 +1,8 @@
-# Evaluating a Custom Model
+# Custom Model
 
-Lighteval allows you to evaluate custom model implementations by creating a custom model class that inherits from `LightevalModel`. This is useful when you want to evaluate models that aren't directly supported by the standard backends (transformers, vllm, etc).
+Lighteval allows you to evaluate custom model implementations by creating a custom model class that inherits from `LightevalModel`.
+This is useful when you want to evaluate models that aren't directly supported by the standard backends and providers (transformers, vllm, etc), or
+if you want to add your own pre/post processing.
 
 ## Creating a Custom Model
 
@@ -9,28 +11,34 @@ Lighteval allows you to evaluate custom model implementations by creating a cust
 Here's a basic example:
 
 ```python
+from typing import List
 from lighteval.models.abstract_model import LightevalModel
+from lighteval.models.model_output import ModelResponse
+from lighteval.tasks.requests import Doc
+from lighteval.utils.cache_management import SampleCache, cached
 
 class MyCustomModel(LightevalModel):
     def __init__(self, config):
         super().__init__(config)
         # Initialize your model here...
 
-    def greedy_until(self, requests, max_tokens=None, stop_sequences=None):
+        # Enable caching (recommended)
+        self._cache = SampleCache(config)
+
+    @cached("predictions") # Enable caching for better performance
+    def greedy_until(self, docs: List[Doc]) -> List[ModelResponse]:
         # Implement generation logic
         pass
 
-    def loglikelihood(self, requests, log=True):
+    @cached("predictions") # Enable caching for better performance
+    def loglikelihood(self, docs: List[Doc]) -> List[ModelResponse]:
         # Implement loglikelihood computation
         pass
 
-    def loglikelihood_rolling(self, requests):
+    @cached("predictions") # Enable caching for better performance
+    def loglikelihood_rolling(self, docs: List[Doc]) -> List[ModelResponse]:
         # Implement rolling loglikelihood computation
         pass
-
-    def loglikelihood_single_token(self, requests):
-        # Implement single token loglikelihood computation
-        pass
 ```
 
 2. The custom model file should contain exactly one class that inherits from `LightevalModel`. This class will be automatically detected and instantiated when loading the model.
@@ -97,31 +105,44 @@ pipeline.save_and_push_results()
 
 Your custom model must implement these core methods:
 
-- `greedy_until`: For generating text until a stop sequence or max tokens is reached
-- `loglikelihood`: For computing log probabilities of specific continuations
-- `loglikelihood_rolling`: For computing rolling log probabilities of sequences
-- `loglikelihood_single_token`: For computing log probabilities of single tokens
+- `greedy_until`: For generating text until a stop sequence or max tokens is reached - this is used for generative evaluations
+- `loglikelihood`: For computing log probabilities of specific continuations - this is used for multiple choice logprob evaluations
+- `loglikelihood_rolling`: For computing rolling log probabilities of sequences - this is used for perplexity metrics
 
 See the `LightevalModel` base class documentation for detailed method signatures and requirements.
 
-## Best Practices
+## Enabling Caching (Recommended)
 
-1. **Error Handling**: Implement robust error handling in your model methods to gracefully handle edge cases.
+Lighteval includes a caching system that can significantly speed up evaluations by storing and reusing model predictions.
+To enable caching in your custom model:
 
-2. **Batching**: Consider implementing efficient batching in your model methods to improve performance.
+1. **Import caching components**:
+   ```python
+   from lighteval.utils.cache_management import SampleCache, cached
+   ```
 
-3. **Resource Management**: Properly manage any resources (e.g., API connections, model weights) in your model's `__init__` and `__del__` methods.
+2. **Initialize cache in constructor**:
+   ```python
+   def __init__(self, config):
+       # Your initialization code...
+       self._cache = SampleCache(config)
+   ```
 
-4. **Documentation**: Add clear docstrings to your model class and methods explaining any specific requirements or limitations.
+3. **Add cache decorators** to your prediction methods:
+   ```python
+   @cached("predictions")
+   def greedy_until(self, docs: List[Doc]) -> List[ModelResponse]:
+       # Your implementation...
+   ```
 
-## Example Use Cases
+For detailed information about the caching system, see the [Caching Documentation](./caching.mdx).
 
-Custom models are particularly useful for:
+## Best Practices
+
+1. **Error Handling**: Implement robust error handling in your model methods to gracefully handle edge cases.
+
+2. **Batching**: Consider implementing efficient batching in your model methods to improve performance.
 
-- Evaluating models accessed through custom APIs
-- Wrapping models with specialized preprocessing/postprocessing
-- Testing novel model architectures
-- Evaluating ensemble models
-- Integrating with external services or tools
+3. **Documentation**: Add clear docstrings to your model class and methods explaining any specific requirements or limitations.
 
-For a complete example of a custom model that wraps the Google Translate API, see `examples/custom_models/google_translate_model.py`.
+4. **Caching**: Enable caching to speed up repeated evaluations and development iterations.
````
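Read end to end, the hunks above amount to the following skeleton for a cached custom model. This is only a consolidated sketch of the snippets shown in the diff (the imports, `SampleCache` in the constructor, and `@cached("predictions")` on each prediction method); the method bodies remain stubs and the class name is illustrative.

```python
from typing import List

from lighteval.models.abstract_model import LightevalModel
from lighteval.models.model_output import ModelResponse
from lighteval.tasks.requests import Doc
from lighteval.utils.cache_management import SampleCache, cached


class MyCustomModel(LightevalModel):
    """Skeleton combining the caching steps scattered across the hunks above."""

    def __init__(self, config):
        super().__init__(config)
        # Initialize your model or API client here...

        # Step 2 from the doc: attach a sample cache so the decorators can store predictions.
        self._cache = SampleCache(config)

    # Step 3 from the doc: decorate every prediction method.
    @cached("predictions")
    def greedy_until(self, docs: List[Doc]) -> List[ModelResponse]:
        # Implement generation logic (used for generative evaluations).
        pass

    @cached("predictions")
    def loglikelihood(self, docs: List[Doc]) -> List[ModelResponse]:
        # Implement loglikelihood computation (used for multiple-choice logprob evaluations).
        pass

    @cached("predictions")
    def loglikelihood_rolling(self, docs: List[Doc]) -> List[ModelResponse]:
        # Implement rolling loglikelihood computation (used for perplexity metrics).
        pass
```

As the diff also notes, the custom model file should contain exactly one class inheriting from `LightevalModel`; it is detected and instantiated automatically when the model is loaded.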

docs/source/package_reference/models.mdx

Lines changed: 1 addition & 1 deletion
````diff
@@ -5,7 +5,7 @@ set in the `model-args` or in the model yaml file (see example
 [here](https://github.com/huggingface/lighteval/blob/main/examples/model_configs/vllm_model_config.yaml)).
 
 ### Base model config
-[[autodoc]] models.utils.ModelConfig
+[[autodoc]] models.abstract_model.ModelConfig
 
 ## Local Models
 
````

docs/source/use-vllm-as-backend.mdx

Lines changed: 8 additions & 8 deletions
````diff
@@ -9,8 +9,8 @@ To use, simply change the `model_args` to reflect the arguments you want to pass
 
 ```bash
 lighteval vllm \
-    "model_name=HuggingFaceH4/zephyr-7b-beta,dtype=float16" \
-    "leaderboard|truthfulqa:mc|0|0"
+    "model_name=HuggingFaceH4/zephyr-7b-beta" \
+    "extended|ifeval|0|0"
 ```
 
 `vllm` is able to distribute the model across multiple GPUs using data
@@ -21,16 +21,16 @@ For example if you have 4 GPUs you can split it across using `tensor_parallelism
 
 ```bash
 export VLLM_WORKER_MULTIPROC_METHOD=spawn && lighteval vllm \
-    "model_name=HuggingFaceH4/zephyr-7b-beta,dtype=float16,tensor_parallel_size=4" \
-    "leaderboard|truthfulqa:mc|0|0"
+    "model_name=HuggingFaceH4/zephyr-7b-beta,tensor_parallel_size=4" \
+    "extended|ifeval|0|0"
 ```
 
 Or, if your model fits on a single GPU, you can use `data_parallelism` to speed up the evaluation:
 
 ```bash
-lighteval vllm \
-    "model_name=HuggingFaceH4/zephyr-7b-beta,dtype=float16,data_parallel_size=4" \
-    "leaderboard|truthfulqa:mc|0|0"
+export VLLM_WORKER_MULTIPROC_METHOD=spawn && lighteval vllm \
+    "model_name=HuggingFaceH4/zephyr-7b-beta,data_parallel_size=4" \
+    "extended|ifeval|0|0"
 ```
 
 ## Use a config file
@@ -41,7 +41,7 @@ An example of a config file is shown below and can be found at `examples/model_c
 ```bash
 lighteval vllm \
     "examples/model_configs/vllm_model_config.yaml" \
-    "leaderboard|truthfulqa:mc|0|0"
+    "extended|ifeval|0|0"
 ```
 
 ```yaml
````

examples/custom_models/google_translate_model.py

Lines changed: 11 additions & 31 deletions
````diff
@@ -32,17 +32,12 @@
 from transformers import AutoTokenizer
 
 from lighteval.data import GenerativeTaskDataset
-from lighteval.models.abstract_model import LightevalModel, ModelInfo
+from lighteval.models.abstract_model import LightevalModel
 from lighteval.models.model_output import (
-    GenerativeResponse,
-    LoglikelihoodResponse,
-    LoglikelihoodSingleTokenResponse,
+    ModelResponse,
 )
 from lighteval.tasks.requests import (
-    GreedyUntilRequest,
-    LoglikelihoodRequest,
-    LoglikelihoodRollingRequest,
-    LoglikelihoodSingleTokenRequest,
+    Doc,
 )
 
 
@@ -53,13 +48,7 @@ class GoogleTranslateClient(LightevalModel):
     def __init__(self, config) -> None:
         self.model = config.model_name
         self.model_definition_file_path = config.model_definition_file_path
-
-        self.model_info = ModelInfo(
-            model_name=config.model_name,
-            model_sha="",
-            model_dtype=None,
-            model_size=-1,
-        )
+        self.config = config
 
         self._tokenizer = AutoTokenizer.from_pretrained("gpt2") # Use a dummy tokenizer for compatibility
 
@@ -113,8 +102,8 @@ def _translate_with_cache(self, context: str, src_lang: str, tgt_lang: str) -> s
 
     def greedy_until(
         self,
-        requests: list[GreedyUntilRequest],
-    ) -> list[GenerativeResponse]:
+        requests: list[Doc],
+    ) -> list[ModelResponse]:
         """
         Generates responses using a greedy decoding strategy until certain ending conditions are met.
         Results are cached to disk to avoid repeated translations.
@@ -124,7 +113,7 @@ def greedy_until(
             override_bs (int, optional): Override the batch size for generation. Defaults to None.
 
         Returns:
-            list[GenerativeResponse]: list of generated responses.
+            list[ModelResponse]: list of generated responses.
         """
         for request in requests:
             request.tokenized_context = self.tok_encode(request.context)
@@ -149,7 +138,7 @@ def greedy_until(
                 if result is None:
                     result = "" # Set to empty string to prevent errors in metric computation
 
-                cur_response = GenerativeResponse(
+                cur_response = ModelResponse(
                     result=result,
                     logits=None,
                     generated_tokens=[],
@@ -175,24 +164,15 @@ def max_length(self) -> int:
         """Return the maximum sequence length of the model."""
         return 4096
 
-    def loglikelihood(self, requests: list[LoglikelihoodRequest]) -> list[LoglikelihoodResponse]:
+    def loglikelihood(self, requests: list[Doc]) -> list[ModelResponse]:
         """Tokenize the context and continuation and compute the log likelihood of those
         tokenized sequences.
         """
         raise NotImplementedError
 
     def loglikelihood_rolling(
         self,
-        requests: list[LoglikelihoodRollingRequest],
-    ) -> list[LoglikelihoodResponse]:
+        requests: list[Doc],
+    ) -> list[ModelResponse]:
         """This function is used to compute the log likelihood of the context for perplexity metrics."""
         raise NotImplementedError
-
-    def loglikelihood_single_token(
-        self,
-        requests: list[LoglikelihoodSingleTokenRequest],
-    ) -> list[LoglikelihoodSingleTokenResponse]:
-        """Tokenize the context and continuation and compute the log likelihood of those
-        tokenized sequences.
-        """
-        raise NotImplementedError
````
