
Commit f9546ba

[ColossalEval] support for vllm (#6056)
* support vllm
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* modify vllm and update readme
* run pre-commit
* remove duplicated lines and refine code
* update param name
* refine code
* update readme
* refine code

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 parent 4fa6b95 commit f9546ba

File tree

19 files changed: +576 −35 lines

applications/ColossalEval/README.md

Lines changed: 40 additions & 9 deletions
@@ -154,7 +154,7 @@ inference_kwargs = {
     "calculate_loss": True,
     "all_classes": ["A", "B", "C", "D"],
     "language": "Chinese",
-    "pretrain": False,
+    "calculate_overall_loss": False,
     "max_new_tokens": 32
 }
 ```
@@ -163,7 +163,7 @@ The `inference_kwargs` currently contains 5 fields:
 - `calculate_loss` (bool, compulsory): Whether the loss on target tokens will be calculated.
 - `all_classes` (Optional[list], compulsory): Whether the subcategory consists of single-choice questions. Specify all available options in a list, or otherwise None.
 - `language` (str, compulsory): The language of the subcategory.
-- `pretrain` (bool, compulsory): Whether the dataset is a pretrain dataset or not. It is usually used for calculate perplexity when you want to evaluate a model with extended context length.
+- `calculate_overall_loss` (bool, compulsory): Whether to calculate the overall loss over whole sentences when the dataset is a pretrain dataset. It is usually used to calculate perplexity when evaluating a model with an extended context length.
 - `max_new_tokens` (int, compulsory): The number of new tokens to generate during inference.

 For example, for dataset MMLU, each subcategory consists of single-choice questions with options A, B, C and D by default, so we can assign `["A", "B", "C", "D"]` to the key `all_classes`. For dataset C-Eval, target answers aren't provided in the test split, so `calculate_loss` should be set to False. However, other datasets such as GAOKAO-Bench contain different formats of questions and lack keys or metadata that reveal the question type (single-choice or multi-choice). Before assigning inference arguments, we first parse the dataset to decide which type of questions the subcategory contains and set the inference arguments accordingly.
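The per-dataset assignment logic described in that paragraph can be sketched roughly as follows. This is a hedged illustration, not ColossalEval's actual code: the function name `build_inference_kwargs` and the `has_options` flag are hypothetical, while the field names and values mirror the README.

```python
# Hypothetical sketch of per-subcategory inference_kwargs assignment;
# field names follow the README, but this is not ColossalEval's real parser.
def build_inference_kwargs(dataset_name: str, has_options: bool) -> dict:
    kwargs = {
        "calculate_loss": True,
        "all_classes": None,
        "language": "English",
        "calculate_overall_loss": False,
        "max_new_tokens": 32,
    }
    if dataset_name == "MMLU":
        # Single-choice questions with fixed options A-D by default.
        kwargs["all_classes"] = ["A", "B", "C", "D"]
    elif dataset_name == "C-Eval":
        # Test split ships without target answers, so loss cannot be computed.
        kwargs["calculate_loss"] = False
        kwargs["all_classes"] = ["A", "B", "C", "D"]
        kwargs["language"] = "Chinese"
    elif dataset_name == "GAOKAO-Bench":
        # Question type varies per subcategory; only single-choice
        # subcategories get an all_classes list.
        kwargs["all_classes"] = ["A", "B", "C", "D"] if has_options else None
        kwargs["language"] = "Chinese"
    return kwargs
```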
@@ -230,7 +230,7 @@ Example:
 In this step, you will configure your tokenizer and model arguments to infer on the given datasets.

 A config file consists of two parts.
-1. Model config. In model config, you need to specify model name, model path, model class, tokenizer arguments and model arguments. For model class, currently we support `HuggingFaceModel`, `HuggingFaceCausalLM`, `ChatGLMModel` and `ChatGLMModel2`. `HuggingFaceModel` is for models that can be loaded with `AutoModel` and `HuggingFaceCausalLM` is for models that can be loaded with `AutoModelForCausalLM`. `ChatGLMModel` and `ChatGLMModel2` are for ChatGLM and ChatGLM2 models respectively. You can check all model classes in `colossal_eval/models/__init__.py`. If your model should set `trust_remote_code` as true, specify it in the `tokenizer_kwargs` and `model_kwargs` fields.
+1. Model config. In the model config, you need to specify the model name, model path, model class, tokenizer arguments and model arguments. For the model class, we currently support `HuggingFaceModel`, `HuggingFaceCausalLM`, `ChatGLMModel`, `ChatGLMModel2` and `vLLMModel`. `HuggingFaceModel` is for models that can be loaded with `AutoModel`, and `HuggingFaceCausalLM` is for models that can be loaded with `AutoModelForCausalLM`. `ChatGLMModel` and `ChatGLMModel2` are for ChatGLM and ChatGLM2 models respectively. `vLLMModel` is for models that can be loaded with vLLM's offline-inference `LLM` class. You can check all model classes in `colossal_eval/models/__init__.py`. If your model requires `trust_remote_code` to be set to true, specify it in the `tokenizer_kwargs` and `model_kwargs` fields.
 2. Dataset config. In the dataset config, you need to specify the dataset name, path and dataset class. Currently, we support zero-shot evaluation on MMLU, CMMLU, AGIEval, GAOKAO-Bench, GSM8K and LongBench, and few-shot evaluation on MMLU, CMMLU, AGIEval and GSM8K. If you want to enable few-shot, set `few_shot` to true. You can check all dataset classes in `colossal_eval/dataset/__init__.py`.

 Once all configs are ready, the program will run inference for all the given models on all the given datasets.
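For the new `vLLMModel` path, the config's two kwargs sections end up as a single keyword set for vLLM's `LLM` class. A minimal sketch under that assumption follows; `merge_vllm_kwargs` is a hypothetical helper (not ColossalEval's wrapper), while `LLM`/`SamplingParams` in the comment are vLLM's real offline-inference API.

```python
# Hypothetical sketch: how a vLLM-backed model class might assemble
# its keyword arguments before constructing vLLM's LLM object.
def merge_vllm_kwargs(tokenizer_kwargs: dict, model_kwargs: dict) -> dict:
    """Combine the config's tokenizer_kwargs and model_kwargs into the
    single keyword set that vLLM's LLM class accepts; on a key conflict,
    model_kwargs wins."""
    return {**tokenizer_kwargs, **model_kwargs}

kwargs = merge_vllm_kwargs({"trust_remote_code": True}, {"dtype": "float16"})

# With the vllm package installed, loading and generating would look like:
#   from vllm import LLM, SamplingParams
#   llm = LLM(model="path/to/model", **kwargs)
#   outputs = llm.generate(["prompt"], SamplingParams(max_tokens=32))
```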
@@ -272,7 +272,42 @@ An example config using model class `HuggingFaceCausalLM` and dataset class `CMMLUDataset` can be:
 }
 ```

-Currently, we support Hugging Face models. The `tokenizer_kwargs` is the arguments used in `AutoTokenizer.from_pretrained()`. The `model_kwargs` is the arguments used in `AutoModel.from_pretrained` or `AutoModelForCausalLM.from_pretrained()`. `few_shot` will be set true if you want to enable few-shot prompting for the dataset. `debug` will be set true if you want to verify whether your prompt is right or wrong.
+An example config using model class `vLLMModel` and dataset class `CMMLUDataset` can be:
+```json
+{
+    "model": [
+        {
+            "name": "model name",
+            "model_class": "vLLMModel",
+            "parameters": {
+                "path": "path to model",
+                "model_max_length": 2048,
+                "tokenizer_path": "",
+                "tokenizer_kwargs": {
+                    "trust_remote_code": true
+                },
+                "model_kwargs": {
+                    "trust_remote_code": true
+                },
+                "prompt_template": "plain",
+                "batch_size": 4
+            }
+        }
+    ],
+    "dataset": [
+        {
+            "name": "dataset name",
+            "dataset_class": "CMMLUDataset",
+            "debug": false,
+            "few_shot": true,
+            "path": "path to original dataset",
+            "save_path": "path to save converted dataset"
+        }
+    ]
+}
+```
+
+Currently, we support Hugging Face models as well as vLLM models. For Hugging Face models, `tokenizer_kwargs` holds the arguments passed to `AutoTokenizer.from_pretrained()`, and `model_kwargs` holds the arguments passed to `AutoModel.from_pretrained()` or `AutoModelForCausalLM.from_pretrained()`. For vLLM models, `tokenizer_kwargs` and `model_kwargs` are passed together to the `LLM` class. Set `few_shot` to true if you want to enable few-shot prompting for the dataset, and set `debug` to true if you want to verify whether your prompt is correct.

 > For the GSM8K dataset, you can set the additional flags `load_train` or `load_reference` to true in the dataset configuration; during inference, the program will then calculate the loss summation over all tokens for each data sample. During evaluation, you can use the metric `loss_over_all_tokens` to compute the overall loss and use it for data-leakage evaluation.
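The summed-loss metric in the note above connects to perplexity in a standard way. As a hedged illustration (both helper names are hypothetical, not ColossalEval's implementation): the overall loss is total summed loss divided by total token count, and perplexity is its exponential.

```python
import math

def loss_over_all_tokens(loss_sums: list[float], token_counts: list[int]) -> float:
    # Hypothetical helper: mean loss across every token in the split,
    # i.e. total summed loss divided by total token count.
    return sum(loss_sums) / sum(token_counts)

def perplexity(mean_loss: float) -> float:
    # Perplexity is the exponential of the mean negative log-likelihood.
    return math.exp(mean_loss)
```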
@@ -287,7 +322,7 @@ torchrun --nproc_per_node=4 inference.py \
     --inference_save_path "path to save inference results"
 ```

-You should specify the path to config file in `config`. You can run the script without specifying `load_dataset` if you already save the converted dataset or otherwise set it to first load the original dataset and save the converted dataset. You should specify the path to save inference results in `inference_save_path`. If you want to use tensor parallel inference, specify the tensor parallel size in `--tp_size` and the process will automatically calculate data parallel size.
+You should specify the path to the config file in `config`. If you have already saved the converted dataset, you can run the script without specifying `load_dataset`; otherwise, set it to first load the original dataset and save the converted one. You should specify the path for saving inference results in `inference_save_path`. If you want to use tensor-parallel inference, specify the tensor parallel size in `--tp_size` and the process will automatically calculate the data parallel size (currently not supported for `vLLMModel`).

 ### Evaluation
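The automatic data-parallel sizing mentioned above follows directly from the world size and `--tp_size`. A minimal sketch, with an illustrative function name (not the script's actual code):

```python
def data_parallel_size(world_size: int, tp_size: int) -> int:
    # With --nproc_per_node=4 and --tp_size 2, the ranks split into
    # 4 // 2 = 2 data-parallel groups of 2 tensor-parallel ranks each.
    if world_size % tp_size != 0:
        raise ValueError("world size must be divisible by tp_size")
    return world_size // tp_size
```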

@@ -530,10 +565,6 @@ class CustomizedModel(BaseModel):

 Once you have successfully added your own model, you can specify your model class in your inference config.

-## To do
-
-- [ ] Add visualization code for evaluation results on public dataset
-- [ ] Improve the way to label target tokens

 ## Citations

applications/ColossalEval/colossal_eval/dataset/agieval.py

Lines changed: 1 addition & 1 deletion
@@ -47,7 +47,7 @@
     "calculate_loss": True,
     "all_classes": None,
     "language": "Chinese",
-    "pretrain": False,
+    "calculate_overall_loss": False,
     "max_new_tokens": 32,
 }

applications/ColossalEval/colossal_eval/dataset/ceval.py

Lines changed: 1 addition & 1 deletion
@@ -70,7 +70,7 @@
     "calculate_loss": False,
     "all_classes": ["A", "B", "C", "D"],
     "language": "Chinese",
-    "pretrain": False,
+    "calculate_overall_loss": False,
     "max_new_tokens": 32,
 }

applications/ColossalEval/colossal_eval/dataset/cmmlu.py

Lines changed: 1 addition & 1 deletion
@@ -81,7 +81,7 @@
     "calculate_loss": True,
     "all_classes": ["A", "B", "C", "D"],
     "language": "Chinese",
-    "pretrain": False,
+    "calculate_overall_loss": False,
     "max_new_tokens": 32,
 }

applications/ColossalEval/colossal_eval/dataset/colossalai.py

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@
     "calculate_loss": False,
     "all_classes": None,
     "language": "Chinese",
-    "pretrain": False,
+    "calculate_overall_loss": False,
     "max_new_tokens": 256,
 }

applications/ColossalEval/colossal_eval/dataset/cvalues.py

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@
     "calculate_loss": False,
     "all_classes": ["A", "B"],
     "language": LANGUAGE,
-    "pretrain": False,
+    "calculate_overall_loss": False,
     "max_new_tokens": 32,
 }

applications/ColossalEval/colossal_eval/dataset/gaokaobench.py

Lines changed: 1 addition & 1 deletion
@@ -36,7 +36,7 @@
     "calculate_loss": True,
     "all_classes": None,
     "language": "Chinese",
-    "pretrain": False,
+    "calculate_overall_loss": False,
     "max_new_tokens": 32,
 }

applications/ColossalEval/colossal_eval/dataset/gsm.py

Lines changed: 2 additions & 2 deletions
@@ -72,7 +72,7 @@
     "calculate_loss": True,
     "all_classes": None,
     "language": "English",
-    "pretrain": False,
+    "calculate_overall_loss": False,
     "max_new_tokens": 256,
 }

@@ -114,7 +114,7 @@ def load(
             dataset[split][subject]["inference_kwargs"] = copy.deepcopy(default_inference_kwargs)

             if forward_only:
-                dataset[split][subject]["inference_kwargs"]["pretrain"] = True
+                dataset[split][subject]["inference_kwargs"]["calculate_overall_loss"] = True

             if split == "test" and few_shot:
                 dataset[split][subject]["inference_kwargs"]["few_shot_data"] = get_few_shot_data()

applications/ColossalEval/colossal_eval/dataset/longbench.py

Lines changed: 1 addition & 1 deletion
@@ -60,7 +60,7 @@
     "calculate_loss": True,
     "all_classes": None,
     "language": "Chinese",
-    "pretrain": False,
+    "calculate_overall_loss": False,
     "max_new_tokens": 32,
 }

applications/ColossalEval/colossal_eval/dataset/mmlu.py

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@
     "calculate_loss": True,
     "all_classes": ["A", "B", "C", "D"],
     "language": "English",
-    "pretrain": False,
+    "calculate_overall_loss": False,
     "max_new_tokens": 32,
 }
