# Evaluating the Instruction-following Abilities of Language Models using Knowledge Tasks
This repository is the official implementation of [Evaluating the Instruction-following Abilities of Language Models using Knowledge Tasks](https://arxiv.org/abs/2410.12972).
KCIF is a benchmark for evaluating the instruction-following capabilities of Large Language Models (LLMs). We adapt existing knowledge benchmarks and augment them with instructions that (a) are conditional on correctly answering the knowledge task or (b) use the space of candidate options in multiple-choice knowledge-answering tasks. KCIF allows us to study model characteristics, such as the change in a model's performance on the knowledge tasks in the presence of answer-modifying instructions and distractor instructions.
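
As a purely illustrative example, an instruction-augmented multiple-choice item might look like the sketch below. The field names and the instruction wording are illustrative placeholders only, not the benchmark's actual schema (the real schema and instruction set live under `src/construct_data/`).

```json
{
  "question": "Which element has atomic number 6?",
  "choices": ["Oxygen", "Carbon", "Nitrogen", "Helium"],
  "instruction": "Answer the question, then write the answer string in reverse.",
  "expected_output": "nobraC"
}
```

The idea is that a model scores well only if it both knows the answer (Carbon) and carries out the added string manipulation.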
## Evaluation
To evaluate any model on KCIF, use the `inference/run_inference.py` script. For the full list of its arguments, please run:

```bash
python inference/run_inference.py --help
```
### Scoring Generations
The evaluation script expects a JSON configuration file containing paths to generations for both the instruction-following (`if_filepath`) and non-instruction-following (`noif_filepath`) versions of each model.
Here's a sample JSON file:
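
(The paths and model name below are placeholders; [sample/sample_lite_benchmark.json](sample/sample_lite_benchmark.json) is a complete, real example.)

```json
[
  {
    "if_filepath": "results/my-model/if_generations.json",
    "noif_filepath": "results/my-model/noif_generations.json",
    "model": "my-model"
  }
]
```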
Each entry in the config includes:

- `if_filepath`: Path to instruction-following generations
- `noif_filepath`: Path to non-instruction-following generations
- `model`: Name or identifier for the model
### Usage
Run the evaluation script with this config file to compute the metrics for each model.
A sample config file is provided in [sample/sample_lite_benchmark.json](sample/sample_lite_benchmark.json).
## Results
The following tables list the performance of several LLMs on our leaderboard.

**Table**: Performance of the Small, Medium, and Large models on our Full Benchmark, ranked by average score (higher is better).

**Table**: Performance of the Medium, Large, and Frontier models on our Lite Benchmark, ranked by average score (higher is better).
## Contributions
This section provides instructions on how to contribute to the KCIF benchmark.
### Adding New Datasets and Instructions
- To add a new dataset, please follow the guidelines [here](src/construct_data/hf_to_schema/README.md)
- To add a new instruction, please follow the guidelines [here](src/construct_data/instruction/README.md)
- To create the dataset for evaluation, create a `json` file with dataset names as the keys and the list of instructions to apply to each dataset as the values (see the sketch after the commands below)
- A sample `json` file is provided [here](src/construct_data/config.json)
- Then run the following command:
```bash
cd src
python construct_data/create_benchmark.py --config <path to json> --output_path <path to folder to store the dataset> --cot
```
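
A minimal sketch of such a config is shown below. The dataset and instruction names are placeholders only; the real names are listed in the sample [config.json](src/construct_data/config.json) and in the instruction guidelines linked above.

```json
{
  "mmlu": ["reverse_answer", "capitalize_answer"],
  "boolq": ["reverse_answer"]
}
```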