Commit 649b6e6

Merge pull request #30 from OfirArviv/main
Add Safety evaluation reproducible readme
2 parents ac45165 + 14cec23 commit 649b6e6

16 files changed, with 77 additions and 7 deletions

evaluation/README.md

Lines changed: 10 additions & 5 deletions
@@ -4,8 +4,8 @@

 initialize and environment, sometimes `conda` makes it easier to handle cuda installations:
 ```bash
-conda create -n bamba_vllm python=3.11 -y
-conda activate bamba_vllm
+conda create -n bamba python=3.11 -y
+conda activate bamba
 ```

 Instal cuda toolkit:
@@ -23,14 +23,19 @@ pip install "causal-conv1d @ git+https://github.com/Dao-AILab/[email protected]
 pip install git+https://github.com/fabianlim/transformers.git@pr-draft
 ```

-To run the benchmark, clone lm-evaluation-harness, and install it along with some other dependencies
+To run the benchmark, install lm-evaluation-harness and unitxt (www.unitxt.ai), along with some other dependencies

 ```bash
-git clone git@github.com:EleutherAI/lm-evaluation-harness.git
-pip install -e path/to/lmeval/.
+pip install lm_eval
+pip install unitxt
 pip install langdetect immutabledict antlr4-python3-runtime==4.11 sacrebleu streamlit boto3 matplotlib loguru
 ```

+Link Unitxt to Lm-Eval-Harness (see https://www.unitxt.ai/en/latest/docs/lm_eval.html)
+```
+python -c 'from lm_eval.tasks.unitxt import task; import os.path; print("class: !function " + task.__file__.replace("task.py", "task.Unitxt"))' > ./unitxt_cards_for_lm_eval/unitxt
+```
+
 ### Running the benchmark

 Running the benchmark can be done using `evaluation/runner.py` that has the following signeture:
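
The `python -c` one-liner added in the linking step of this hunk is dense, so here is a minimal sketch of the string it builds, assuming only what the command itself shows. `unitxt_include_line` is a hypothetical helper name that does not exist in the repository, and the path in the usage line is made up for illustration.

```python
def unitxt_include_line(task_file: str) -> str:
    """Build the line that the one-liner prints into the `unitxt` include file.

    `task_file` stands in for `task.__file__` from the original command,
    i.e. the path of lm_eval's unitxt task module. Hypothetical helper.
    """
    # Point the YAML `class` key at the Unitxt class defined in task.py.
    return "class: !function " + task_file.replace("task.py", "task.Unitxt")


if __name__ == "__main__":
    # Illustrative path only; the real one depends on where lm_eval is installed.
    print(unitxt_include_line("/site-packages/lm_eval/tasks/unitxt/task.py"))
```

In the README command the same string is redirected into `./unitxt_cards_for_lm_eval/unitxt`, which is the file the new task YAML files reference through `include: unitxt`.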

evaluation/runner.py

Lines changed: 2 additions & 2 deletions
@@ -230,8 +230,8 @@ def run_job(model_id, task_to_run, args):
 ",".join(subtasks_to_run),
 "--output_path",
 output_path,
-"--cache_requests",
-"true",
+# "--cache_requests",
+# "true",
 "--log_samples",
 "--trust_remote_code",
 # f"--use_cache={cache_dir}",
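
This hunk disables request caching by commenting the flag out. One possible alternative, sketched here under assumptions, is to make the flag conditional so it can be switched back on without editing the list. `build_eval_flags` and its `cache_requests` parameter are hypothetical names that are not in `runner.py`; only the flag strings are taken from the diff.

```python
def build_eval_flags(output_path: str, cache_requests: bool = False) -> list[str]:
    """Assemble the tail of the argument list shown in the hunk.

    Hypothetical helper: the real runner builds this list inline in run_job.
    """
    flags = ["--output_path", output_path]
    if cache_requests:
        # Same flag and value that the commit comments out.
        flags += ["--cache_requests", "true"]
    flags += ["--log_samples", "--trust_remote_code"]
    return flags


if __name__ == "__main__":
    print(build_eval_flags("results/", cache_requests=False))
```

With the default `cache_requests=False` the result matches the behavior after this commit; passing `True` restores the previous argument list.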

evaluation/runner_tasks.py

Lines changed: 26 additions & 0 deletions
@@ -106,4 +106,30 @@
         {"task": "Winogrande", "num_fewshot": 5, "subtasks": ["winogrande"]},
         {"task": "GSM8k", "num_fewshot": 5, "subtasks": ["gsm8k"]},
     ],
+    "Safety_BBQ": [
+        {"task": "bbq_Age_temp_0_few_shot_5", "num_fewshot": None, "subtasks": ["bbq_Age_temp_0_few_shot_5"]},
+        {"task": "bbq_Disability_status_temp_0_few_shot_5", "num_fewshot": None,
+         "subtasks": ["bbq_Disability_status_temp_0_few_shot_5"]},
+        {"task": "bbq_Gender_identity_temp_0_few_shot_5", "num_fewshot": None,
+         "subtasks": ["bbq_Gender_identity_temp_0_few_shot_5"]},
+        {"task": "bbq_Nationality_temp_0_few_shot_5", "num_fewshot": None,
+         "subtasks": ["bbq_Nationality_temp_0_few_shot_5"]},
+        {"task": "bbq_Physical_appearance_temp_0_few_shot_5", "num_fewshot": None,
+         "subtasks": ["bbq_Physical_appearance_temp_0_few_shot_5"]},
+        {"task": "bbq_Race_ethnicity_temp_0_few_shot_5", "num_fewshot": None,
+         "subtasks": ["bbq_Race_ethnicity_temp_0_few_shot_5"]},
+        {"task": "bbq_Race_x_gender_temp_0_few_shot_5", "num_fewshot": None,
+         "subtasks": ["bbq_Race_x_gender_temp_0_few_shot_5"]},
+        {"task": "bbq_Race_x_SES_temp_0_few_shot_5", "num_fewshot": None,
+         "subtasks": ["bbq_Race_x_SES_temp_0_few_shot_5"]},
+        {"task": "bbq_Religion_temp_0_few_shot_5", "num_fewshot": None, "subtasks": ["bbq_Religion_temp_0_few_shot_5"]},
+        {"task": "bbq_SES_temp_0_few_shot_5", "num_fewshot": None, "subtasks": ["bbq_SES_temp_0_few_shot_5"]},
+        {"task": "bbq_Sexual_orientation_temp_0_few_shot_5", "num_fewshot": None,
+         "subtasks": ["bbq_Sexual_orientation_temp_0_few_shot_5"]},
+    ],
+    "Safety": [
+        {"task": "toxigen", "num_fewshot": 5, "subtasks": ["toxigen"]},
+        {"task": "pop_qa_temp_0_few_shot_5", "num_fewshot": None, "subtasks": ["pop_qa_temp_0_few_shot_5"]},
+        {"task": "crows_pairs_english", "num_fewshot": 5, "subtasks": ["crows_pairs_english"]}
+    ]
 }
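
The eleven `Safety_BBQ` entries differ only in the category name, so they could also be generated. The following is a sketch of that idea and not what the commit does: `BBQ_CATEGORIES`, `bbq_task_name`, and `bbq_entries` are hypothetical names, while the category list and the entry shape are copied from the hunk.

```python
# Category names exactly as they appear in the Safety_BBQ entries of the diff.
BBQ_CATEGORIES = [
    "Age", "Disability_status", "Gender_identity", "Nationality",
    "Physical_appearance", "Race_ethnicity", "Race_x_gender", "Race_x_SES",
    "Religion", "SES", "Sexual_orientation",
]


def bbq_task_name(category: str) -> str:
    """Return the task name used in the diff for one BBQ category."""
    return f"bbq_{category}_temp_0_few_shot_5"


def bbq_entries(categories=BBQ_CATEGORIES) -> list[dict]:
    """Build one entry per category, mirroring the hand-written list.

    num_fewshot is None, as in the diff; the task names suggest the five
    demonstrations come from the Unitxt recipe instead.
    """
    return [
        {"task": name, "num_fewshot": None, "subtasks": [name]}
        for name in map(bbq_task_name, categories)
    ]


if __name__ == "__main__":
    for entry in bbq_entries():
        print(entry)
```

The hand-written list in the commit has the advantage of being directly searchable for a task name; the generated form mainly guards against typos when categories are added.
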
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+include: unitxt
+recipe: card=cards.safety.bbq.Age,format=formats.empty,template_card_index=0,demos_pool_size=100,num_demos=5,demos_taken_from=test
+task: bbq_Age_temp_0_few_shot_5

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+include: unitxt
+recipe: card=cards.safety.bbq.Disability_status,format=formats.empty,template_card_index=0,demos_pool_size=100,num_demos=5,demos_taken_from=test
+task: bbq_Disability_status_temp_0_few_shot_5

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+include: unitxt
+recipe: card=cards.safety.bbq.Gender_identity,format=formats.empty,template_card_index=0,demos_pool_size=100,num_demos=5,demos_taken_from=test
+task: bbq_Gender_identity_temp_0_few_shot_5

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+include: unitxt
+recipe: card=cards.safety.bbq.Nationality,format=formats.empty,template_card_index=0,demos_pool_size=100,num_demos=5,demos_taken_from=test
+task: bbq_Nationality_temp_0_few_shot_5

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+include: unitxt
+recipe: card=cards.safety.bbq.Physical_appearance,format=formats.empty,template_card_index=0,demos_pool_size=100,num_demos=5,demos_taken_from=test
+task: bbq_Physical_appearance_temp_0_few_shot_5

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+include: unitxt
+recipe: card=cards.safety.bbq.Race_ethnicity,format=formats.empty,template_card_index=0,demos_pool_size=100,num_demos=5,demos_taken_from=test
+task: bbq_Race_ethnicity_temp_0_few_shot_5

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+include: unitxt
+recipe: card=cards.safety.bbq.Race_x_SES,format=formats.empty,template_card_index=0,demos_pool_size=100,num_demos=5,demos_taken_from=test
+task: bbq_Race_x_SES_temp_0_few_shot_5
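
Each of the task files above follows the same three-line template, with only the BBQ category varying. A minimal sketch of that template as a function follows; `render_bbq_yaml` is a hypothetical helper that is not part of the commit, and it reproduces only the fields visible in the diff.

```python
def render_bbq_yaml(category: str, num_demos: int = 5) -> str:
    """Render the three-line task file for one BBQ category.

    Hypothetical helper; the recipe fields are copied from the files in the
    diff, with the category and the demo count as the only variables.
    """
    recipe = (
        f"card=cards.safety.bbq.{category},format=formats.empty,"
        f"template_card_index=0,demos_pool_size=100,"
        f"num_demos={num_demos},demos_taken_from=test"
    )
    task = f"bbq_{category}_temp_0_few_shot_{num_demos}"
    return f"include: unitxt\nrecipe: {recipe}\ntask: {task}\n"


if __name__ == "__main__":
    print(render_bbq_yaml("Age"), end="")
```

The `task` value has to match the names listed under `Safety_BBQ` in `evaluation/runner_tasks.py`, which is why the sketch derives both the recipe and the task name from the same `category` argument.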
