Commit 3c10674

Author: Rudra
Commit message: Updated README
1 parent cb57e55

1 file changed: README.md (71 additions, 21 deletions)

# Evaluating the Instruction-following Abilities of Language Models using Knowledge Tasks

This repository is the official implementation of [Evaluating the Instruction-following Abilities of Language Models using Knowledge Tasks](https://arxiv.org/abs/2410.12972).

KCIF is a benchmark for evaluating the instruction-following capabilities of Large Language Models (LLMs). We adapt existing knowledge benchmarks and augment them with instructions that are (a) conditional on correctly answering the knowledge task or (b) make use of the space of candidate options in multiple-choice knowledge-answering tasks. KCIF allows us to study model characteristics, such as changes in performance on the knowledge tasks in the presence of answer-modifying instructions and distractor instructions.

To set up the environment, run:

```bash
conda create -n KCIF python=3.10
conda activate KCIF
pip install -r requirements.txt
```

## Supported Datasets

- [MMLU-Pro](https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro)
- [MathQA](https://huggingface.co/datasets/allenai/math_qa)
- [Winogrande](https://huggingface.co/datasets/allenai/winogrande)
- More datasets will be added soon.

## Evaluation

To evaluate any model on KCIF, please run the inference script `inference/run_inference.py`. For the full list of arguments, please run:

```bash
python inference/run_inference.py --help
```
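
As a rough illustration, an invocation could look like the sketch below. The flag names (`--model_name`, `--input_path`, `--output_path`) are assumptions for illustration only, not the script's documented interface; consult `--help` for the actual arguments.

```bash
# Hypothetical sketch only: the flag names below are assumptions, not the
# documented interface of run_inference.py; run --help for the real arguments.
# (Run from the src directory, mirroring the dataset-construction command.)
cd src
python inference/run_inference.py \
    --model_name <huggingface model id or local checkpoint> \
    --input_path <path to the KCIF dataset> \
    --output_path <folder to store generations>
```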

### Scoring Generations

The evaluation script expects a JSON configuration file containing paths to the generations for both the instruction-following (if_filepath) and non-instruction-following (noif_filepath) versions of each model.

Each entry in the config includes:

- if_filepath: Path to instruction-following generations
- noif_filepath: Path to non-instruction-following generations
- model: Name or identifier for the model.
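
The sample file at [sample/sample_lite_benchmark.json](sample/sample_lite_benchmark.json) is the authoritative reference for the exact format; the snippet below is only a minimal sketch of such a config, with invented file paths and an assumed top-level list layout:

```json
[
  {
    "if_filepath": "outputs/llama-3.1-8b/instruction_following.jsonl",
    "noif_filepath": "outputs/llama-3.1-8b/no_instruction.jsonl",
    "model": "Llama-3.1-8B"
  },
  {
    "if_filepath": "outputs/qwen2.5-7b/instruction_following.jsonl",
    "noif_filepath": "outputs/qwen2.5-7b/no_instruction.jsonl",
    "model": "Qwen2.5-7B"
  }
]
```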

### Usage

Run the metrics script on this config to compute the metrics. A sample config file is provided in [sample/sample_lite_benchmark.json](sample/sample_lite_benchmark.json).

## Results

The following tables list the performance of several LLMs on our leaderboard.

### Full Benchmark

| **Models** | **μₑₘ** | **IC Score** | **KTS Score** | **μₑₘ′** | **Average Score** |
|--------------------|---------|--------------|---------------|----------|-------------------|
| Qwen2.5-72B | 0.4488 | 0.5077 | 0.4708 | 0.6218 | 0.5123 |
| Qwen2.5-32B | 0.419 | 0.4736 | 0.4519 | 0.6351 | 0.4949 |
| Llama-3.1-70B | 0.3697 | 0.3735 | 0.3925 | 0.6109 | 0.4366 |
| Gemma-2-27B | 0.3622 | 0.3783 | 0.3984 | 0.5177 | 0.4142 |
| Qwen2.5-14B | 0.2819 | 0.3521 | 0.305 | 0.4443 | 0.3458 |
| Phi-3-medium | 0.2589 | 0.2632 | 0.2799 | 0.4897 | 0.3229 |
| Gemma-2-9B | 0.2417 | 0.2701 | 0.2688 | 0.484 | 0.3162 |
| Qwen2.5-7B | 0.1921 | 0.2393 | 0.2 | 0.4061 | 0.2594 |
| Llama-3.1-8B | 0.1646 | 0.1917 | 0.1773 | 0.3907 | 0.2311 |
| Phi-3-small | 0.1472 | 0.1474 | 0.1686 | 0.3376 | 0.2002 |
| Qwen2.5-3B | 0.1277 | 0.1341 | 0.1386 | 0.3021 | 0.1756 |
| Llama-3.2-3B | 0.0946 | 0.0874 | 0.1021 | 0.2395 | 0.1309 |
| Phi-3.5-mini | 0.0966 | 0.1179 | 0.1014 | 0.2044 | 0.1301 |
| Mistral-7B | 0.0484 | 0.059 | 0.057 | 0.2451 | 0.1024 |
| Qwen2.5-1.5B | 0.0382 | 0.0346 | 0.0435 | 0.1461 | 0.0656 |
| Llama-3.2-1B | 0.0153 | 0.012 | 0.0176 | 0.0897 | 0.0337 |

**Table**: Performance of the Small, Medium, and Large Models on our Full Benchmark. Models are ranked by the Average Score, which is the mean of μₑₘ, IC Score, KTS Score, and μₑₘ′ (higher is better).

### Lite Benchmark

| **Models** | **μₑₘ** | **IC Score** | **KTS Score** | **μₑₘ′** | **Average Score** |
|-------------------|---------|--------------|---------------|----------|-------------------|
| GPT-4o-2024-08-06 | 0.5065 | 0.5174 | 0.5874 | 0.6889 | 0.575 |
| Llama-3.1-405B | 0.4617 | 0.4888 | 0.5351 | 0.6387 | 0.5311 |
| Qwen2.5-72B | 0.4348 | 0.5035 | 0.493 | 0.5768 | 0.502 |
| Qwen2.5-32B | 0.409 | 0.4751 | 0.4755 | 0.5873 | 0.4867 |
| Llama-3.1-70B | 0.3708 | 0.4138 | 0.4319 | 0.5645 | 0.4453 |
| GPT-4o-mini | 0.394 | 0.4029 | 0.4689 | 0.4609 | 0.4317 |
| Gemma-2-27B | 0.3497 | 0.3972 | 0.4194 | 0.4505 | 0.4042 |
| Qwen2.5-14B | 0.2764 | 0.3523 | 0.3272 | 0.4084 | 0.3411 |
| Phi-3-medium | 0.2518 | 0.2869 | 0.3054 | 0.4238 | 0.317 |
| Gemma-2-9B | 0.2381 | 0.2828 | 0.292 | 0.4428 | 0.3139 |
| Qwen2.5-7B | 0.1944 | 0.2513 | 0.2275 | 0.3411 | 0.2536 |
| Llama-3.1-8B | 0.174 | 0.2203 | 0.2048 | 0.3513 | 0.2376 |
| Phi-3-small | 0.1555 | 0.1809 | 0.1921 | 0.3027 | 0.2078 |
| Mistral-7B | 0.0577 | 0.0808 | 0.0768 | 0.205 | 0.1051 |

**Table**: Performance of the Medium, Large, and Frontier Models on our Lite Benchmark. Models are ranked by the Average Score (higher is better).

## Contributions

This section provides instructions on how to contribute to the KCIF benchmark.

### Adding New Datasets and Instructions

- To add a new dataset, please follow the guidelines [here](src/construct_data/hf_to_schema/README.md)
- To add a new instruction, please follow the guidelines [here](src/construct_data/instruction/README.md)
- To create the dataset for evaluation, create a `json` file with `dataset` names as the keys and the list of `instructions` to be applied to each dataset as the values (a hypothetical sketch is shown after the command below)
- A sample `json` file is provided [here](src/construct_data/config.json)
- Then run the following command:

```bash
cd src
python construct_data/create_benchmark.py --config <path to json> --output_path <path to folder to store the dataset> --cot
```
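
The sample file at [src/construct_data/config.json](src/construct_data/config.json) defines the exact format; the snippet below is only a hypothetical sketch of the structure described above, with dataset names as keys and invented instruction names as values:

```json
{
  "MMLU-Pro": ["print_correct_answer", "use_incorrect_options"],
  "MathQA": ["print_correct_answer"]
}
```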

### Todo

- [ ] Support new tasks (BBH, etc.)
- [ ] Add test cases