# KCIF

## Evaluating the Instruction-following Abilities of Language Models using Knowledge Tasks

KCIF is a benchmark for evaluating the instruction-following capabilities of Large Language Models (LLMs). We adapt existing knowledge benchmarks and augment them with instructions that (a) are conditional on correctly answering the knowledge task or (b) use the space of candidate options in multiple-choice knowledge-answering tasks. KCIF allows us to study model characteristics, such as the change in performance on the knowledge tasks in the presence of answer-modifying instructions and distractor instructions.

## Getting Started

### Dependencies

* Python 3.10 or later is preferred

### Installing

```bash
conda create -n KCIF python=3.10
conda activate KCIF
pip install -r requirements.txt
```

## Supported Tasks

- [MMLU-Pro](https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro)
- [MathQA](https://huggingface.co/datasets/allenai/math_qa)
- [BoolQ](https://huggingface.co/datasets/google/boolq)
- [PiQA](https://huggingface.co/datasets/ybisk/piqa)
- [Winogrande](https://huggingface.co/datasets/allenai/winogrande)
- More tasks will be added soon.

## Usage

### Dataset Creation

- To add a new dataset, please follow the guidelines [here](src/construct_data/hf_to_schema/README.md)
- To add a new instruction, please follow the guidelines [here](src/construct_data/instruction/README.md)
- To create the dataset for evaluation, create a `json` config file whose keys are `dataset` names and whose values are lists of the `instructions` to apply to each dataset (a minimal sketch follows the command below)
- A sample `json` file is provided [here](src/construct_data/config.json)
- Then run the following command:

```bash
cd src
python construct_data/create_benchmark.py --config <path to json> --output_path <path to folder to store the dataset> --cot
```
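
To illustrate the shape of this config, here is a minimal sketch. The dataset keys and instruction names below are placeholders rather than the identifiers KCIF actually uses; the linked sample config and the instruction README are authoritative.

```json
{
  "<dataset_name_1>": ["<instruction_name_1>", "<instruction_name_2>"],
  "<dataset_name_2>": ["<instruction_name_1>"]
}
```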

### Evaluating Models

To evaluate any model on KCIF, please run the following command:

```bash
python inference/run_inference.py --engine <engine_type> --model_name <HF model name or local checkpoint> --input_path <path to KCIF> --batch_size <batch size> --output_folder <path to output folder>
```
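
As a concrete illustration, a run might look like the sketch below; the engine name, input path, and batch size are assumptions for this example rather than values prescribed by the repository (see `--help` for the options your install supports).

```bash
# Hypothetical invocation: evaluate Llama-3.1-8B-Instruct on a locally built copy of KCIF.
# The engine, paths, and batch size are placeholders; adjust them to your setup.
python inference/run_inference.py \
    --engine vllm \
    --model_name meta-llama/Meta-Llama-3.1-8B-Instruct \
    --input_path data/KCIF \
    --batch_size 8 \
    --output_folder test_results/Meta-Llama-3.1-8B-Instruct_instruction_follow
```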

For the full list of arguments, please run:

```bash
python inference/run_inference.py --help
```

#### Scoring Generations

The evaluation script expects a JSON configuration file containing paths to the generations for both the instruction-following (`if_filepath`) and non-instruction-following (`noif_filepath`) versions of each model.

Here is a sample JSON file:
```json
[
    {
        "if_filepath": "test_results/Meta-Llama-3.1-8B-Instruct_instruction_follow/all_results.jsonl",
        "noif_filepath": "test_results/Meta-Llama-3.1-8B-Instruct_no_instruction_follow/all_results.jsonl",
        "model": "llama"
    },
    {
        "if_filepath": "test_results/Qwen2.5-72B-Instruct_instruction_follow/all_results.jsonl",
        "noif_filepath": "test_results/Qwen2.5-72B-Instruct_no_instruction_follow/all_results.jsonl",
        "model": "qwen_72B"
    }
]
```

Each entry in the config includes:
- `if_filepath`: Path to the instruction-following generations
- `noif_filepath`: Path to the non-instruction-following generations
- `model`: Name or identifier for the model

#### Usage

Run the following command to compute metrics based on the config:

```bash
python -m evaluation.compute_metrics --config path/to/config.json --output_folder path/to/output
```

Arguments:
- `--config`: Path to the configuration JSON file.
- `--output_folder`: Directory where the computed metrics will be saved.

A sample config file is provided in [sample/sample_lite_benchmark.json](sample/sample_lite_benchmark.json).
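
For example, to score generations using that sample config (the output folder here is arbitrary and only for illustration):

```bash
python -m evaluation.compute_metrics --config sample/sample_lite_benchmark.json --output_folder results/metrics
```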

## Todo

- [ ] Support new tasks (BBH, etc.)
- [ ] Add test cases
- [ ] Support for OpenAI API

## Citation

If you find KCIF useful, please cite it as follows in your publication:

```bibtex
@misc{murthy2024evaluatinginstructionfollowingabilitieslanguage,
      title={Evaluating the Instruction-following Abilities of Language Models using Knowledge Tasks},
      author={Rudra Murthy and Prince Kumar and Praveen Venkateswaran and Danish Contractor},
      year={2024},
      eprint={2410.12972},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.12972},
}
```