Commit cb57e55

Author: Rudra
Commit message: Initial commit
1 parent 8b32888 commit cb57e55

File tree: 176 files changed (+23746 / -1 lines changed)


.flake8

Lines changed: 3 additions & 0 deletions

[flake8]
max-line-length = 88
extend-ignore = E203, E704
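
The 88-character limit matches Black's default, and E203/E704 are the pycodestyle checks commonly disabled when formatting with Black, which suggests Black is the intended formatter. Assuming flake8 is installed (e.g. via requirements.txt), running it from the repository root picks this file up automatically:

```bash
# Lint the entire repository; flake8 discovers .flake8 in the working directory.
flake8 .
```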

.nojekyll

Whitespace-only changes.

README.md

Lines changed: 112 additions & 1 deletion
# KCIF

## Evaluating the Instruction-following Abilities of Language Models using Knowledge Tasks

KCIF is a benchmark for evaluating the instruction-following capabilities of Large Language Models (LLMs). We adapt existing knowledge benchmarks and augment them with instructions that are a) conditional on correctly answering the knowledge task or b) use the space of candidate options in multiple-choice knowledge-answering tasks. KCIF allows us to study model characteristics, such as the change in their performance on the knowledge tasks in the presence of answer-modifying instructions and distractor instructions.
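
To make the augmentation concrete, here is a minimal sketch of the idea, assuming a generic multiple-choice format. It is our own illustration, not code from this repository, and the instruction text is invented; the benchmark's real instructions are defined under `src/construct_data/instruction/`.

```python
# Illustrative only: pair a multiple-choice knowledge question with an
# instruction whose required output depends on the chosen answer, so a model
# must both solve the task and follow the instruction to score.
def augment(question: str, options: list[str]) -> str:
    labels = "ABCDEFGH"
    opts = " ".join(f"({label}) {opt}" for label, opt in zip(labels, options))
    instruction = (
        "Answer the question. If the text of your chosen option contains "
        "the letter 'r', print the option text in uppercase; otherwise "
        "print it in lowercase."
    )
    return f"{question}\nOptions: {opts}\n{instruction}"


print(augment("Which planet is the largest?", ["Earth", "Jupiter", "Mars", "Venus"]))
```
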
## Getting Started

### Dependencies

* Python 3.10 or later is preferred

### Installing

```bash
conda create -n KCIF python=3.10
conda activate KCIF
pip install -r requirements.txt
```

## Supported Tasks

- [MMLU-Pro](https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro)
- [MathQA](https://huggingface.co/datasets/allenai/math_qa)
- [BoolQ](https://huggingface.co/datasets/google/boolq)
- [PiQA](https://huggingface.co/datasets/ybisk/piqa)
- [Winogrande](https://huggingface.co/datasets/allenai/winogrande)
- More tasks will be added soon.

## Usage

### Dataset Creation

- To add a new dataset, please follow the guidelines [here](src/construct_data/hf_to_schema/README.md)
- To add a new instruction, please follow the guidelines [here](src/construct_data/instruction/README.md)
- To create the dataset for evaluation, create a `json` file with `dataset` names as the keys and the `instructions` to be applied to each dataset as a list of values (a sketch follows the command below)
- A sample `json` file is provided [here](src/construct_data/config.json)
- Then run the following command:

```bash
cd src
python construct_data/create_benchmark.py --config <path to json> --output_path <path to folder to store the dataset> --cot
```
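
A minimal sketch of such a config file, where the dataset keys and instruction names are illustrative placeholders (the actual identifiers are defined by the benchmark; see the sample config linked above):

```json
{
    "mmlu_pro": ["print_correct_answer", "reverse_answer_string"],
    "boolq": ["print_correct_answer"]
}
```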

### Evaluating Models

To evaluate any model on KCIF, please run the following command:

```bash
python inference/run_inference.py --engine <engine_type> --model_name <HF model name or local checkpoint> --input_path <path to KCIF> --batch_size <batch size> --output_folder <path to output folder>
```
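
For example, a hypothetical invocation (the engine name, model, and paths below are illustrative; use the `--help` command below to see the actual options):

```bash
python inference/run_inference.py --engine vllm --model_name meta-llama/Meta-Llama-3.1-8B-Instruct --input_path data/KCIF --batch_size 8 --output_folder outputs/llama3.1-8b
```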

For the full list of arguments, please run:

```bash
python inference/run_inference.py --help
```

#### Scoring Generations

The evaluation script expects a JSON configuration file containing paths to the generations for both the instruction-following (`if_filepath`) and non-instruction-following (`noif_filepath`) versions of each model.

Here is a sample JSON file:

```json
[
    {
        "if_filepath": "test_results/Meta-Llama-3.1-8B-Instruct_instruction_follow/all_results.jsonl",
        "noif_filepath": "test_results/Meta-Llama-3.1-8B-Instruct_no_instruction_follow/all_results.jsonl",
        "model": "llama"
    },
    {
        "if_filepath": "test_results/Qwen2.5-72B-Instruct_instruction_follow/all_results.jsonl",
        "noif_filepath": "test_results/Qwen2.5-72B-Instruct_no_instruction_follow/all_results.jsonl",
        "model": "qwen_72B"
    }
]
```

Each entry in the config includes:

- `if_filepath`: Path to the instruction-following generations.
- `noif_filepath`: Path to the non-instruction-following generations.
- `model`: Name or identifier for the model.
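
As a sanity check before scoring, a small script like the following can validate such a config. This is our own sketch, not part of the repository; it assumes only the three keys documented above:

```python
import json
from pathlib import Path

# Load the scoring config and verify every referenced generations file exists.
with open("path/to/config.json") as f:
    entries = json.load(f)

for entry in entries:
    for key in ("if_filepath", "noif_filepath"):
        path = Path(entry[key])
        status = "ok" if path.exists() else "MISSING"
        print(f"{entry['model']}: {key} -> {path} [{status}]")
```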

#### Usage

Run the following command to compute metrics based on the config:

```bash
python -m evaluation.compute_metrics --config path/to/config.json --output_folder path/to/output
```

Arguments:

- `--config`: Path to the configuration JSON file.
- `--output_folder`: Directory where the computed metrics will be saved.

A sample config file is provided in [sample/sample_lite_benchmark.json](sample/sample_lite_benchmark.json).

## Todo

- [ ] Support new tasks (BBH, etc.)
- [ ] Add test cases
- [ ] Support for OpenAI API

## Citation

If you find KCIF useful, please cite it as follows in your publication:

```bibtex
@misc{murthy2024evaluatinginstructionfollowingabilitieslanguage,
      title={Evaluating the Instruction-following Abilities of Language Models using Knowledge Tasks},
      author={Rudra Murthy and Prince Kumar and Praveen Venkateswaran and Danish Contractor},
      year={2024},
      eprint={2410.12972},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.12972},
}
```

docs/Makefile

Lines changed: 20 additions & 0 deletions

# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS    ?=
SPHINXBUILD   ?= sphinx-build
SOURCEDIR     = .
BUILDDIR      = build

# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
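
Given this standard Sphinx makefile, the documentation is presumably built from the docs/ directory in make-mode, for example (assuming sphinx-build is installed):

```bash
cd docs
# The catch-all target forwards any goal, e.g. "html", to sphinx-build -M,
# writing output under docs/build/.
make html
```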

docs/README.md

Lines changed: 4 additions & 0 deletions

# KCIF

KCIF is a benchmark for evaluating the instruction-following capabilities of Large Language Models (LLMs). We adapt existing knowledge benchmarks and augment them with instructions that are a) conditional on correctly answering the knowledge task or b) use the space of candidate options in multiple-choice knowledge-answering tasks. KCIF allows us to study model characteristics, such as the change in their performance on the knowledge tasks in the presence of answer-modifying instructions and distractor instructions.

docs/build/doctrees/README.doctree (3.91 KB): binary file not shown.

docs/build/doctrees/citation.doctree (3.95 KB): binary file not shown.

Additional binary files, including one of 74.2 KB, not shown.
