Commit b5f64c0
Eval reproduce recipe using lm-evaluation-harness and our 3.1 evals datasets (meta-llama#627)
2 parents: eaded5e + e354eee
File tree: 20 files changed (+1154, -3 lines)
.github/scripts/spellcheck_conf/wordlist.txt (19 additions, 1 deletion)

```diff
@@ -1432,4 +1432,22 @@ CPUs
 modelUpgradeExample
 guardrailing
 MaaS
-MFU
+MFU
+BBH
+GPQA
+IFEVAL
+IFeval
+bos
+gpqa
+ifeval
+lighteval
+sqrt
+wis
+evals
+mmlu
+parsers
+reproducibility
+openhathi
+sarvam
+subtask
+acc
```

recipes/use_cases/multilingual/README.md (2 additions, 1 deletion)

```diff
@@ -1,7 +1,8 @@
 # Extending Llama to a new language
 Authored by : Sarvam team
 In this recipe, we will see how to add a new language to the Llama family of models. The steps are quite general and can be easily adapted to other models as well. Using this recipe, you should be able to replicate the findings of [OpenHathi](https://huggingface.co/sarvamai/OpenHathi-7B-Hi-v0.1-Base).
-Please read more about OpenHathi [here](https://analyticsindiamag.com/industry-insights/ai-startups/indian-startup-sarvam-ai-launches-hindi-llm-openhathi/)
+Please read more about OpenHathi [here](https://web.archive.org/web/20240418103408/https://www.sarvam.ai/blog/announcing-openhathi-series)
+
 ## Data
 The original OpenHathi model uses a combination of [Sangraha](https://huggingface.co/datasets/ai4bharat/sangraha) and Wikipedia as its primary data sources. If the reader is interested in using these sources, they would also have to preprocess the data: clean, filter, and deduplicate. See [Setu](https://github.com/AI4Bharat/setu) for an easy way to do this at scale.
```

requirements.txt (1 addition, 1 deletion)

```diff
@@ -28,4 +28,4 @@ langchain_openai
 langchain
 langchain_community
 sentence_transformers
-codeshield
+codeshield
```

tools/benchmarks/llm_eval_harness/README.md (4 additions, 0 deletions)

````diff
@@ -62,6 +62,10 @@ There has been an study from [IBM on efficient benchmarking of LLMs](https://arx
 python eval.py --model hf --model_args pretrained=meta-llama/Meta-Llama-3.1-8B,dtype="float",peft=../peft_output --tasks hellaswag --num_fewshot 10 --device cuda:0 --batch_size 8 --limit 100
 ```
 
+### Reproducing Meta 3.1 Evaluation Metrics Using LM-Evaluation-Harness
+
+The [meta_eval_reproduce](./meta_eval_reproduce/) folder provides a detailed guide on how to reproduce the Meta Llama 3.1 evaluation metrics reported on the [Meta Llama website](https://llama.meta.com/) using the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/main) and our [3.1 evals Huggingface collection](https://huggingface.co/collections/meta-llama/llama-31-evals-66a2c5a14c2093e58298ac7f). By following the steps outlined, users can replicate an evaluation process similar to Meta's for specific tasks and compare their results with our reported metrics. While slight variations in results are expected due to differences in implementation and model behavior, we aim to provide a transparent and reproducible method for evaluating Meta Llama 3.1 models using a third-party library. Please check the [README.md](./meta_eval_reproduce/README.md) for more details.
+
 ### Reproducing Hugging Face Open-LLM-Leaderboard
 
 Here, we provided a list of tasks from `Open-LLM-Leaderboard` which can be used by passing `--open-llm-leaderboard-tasks` instead of `tasks` to the `eval.py`.
````

tools/benchmarks/llm_eval_harness/meta_eval_reproduce/README.md (213 additions, 0 deletions)

Large diffs are not rendered by default.
Eval configuration YAML (file name not shown in this view; 32 additions, 0 deletions)

```yaml
model_name: "meta-llama/Meta-Llama-3.1-8B-Instruct" # The name of the model to evaluate. This must be a valid Meta Llama 3 based model name in the HuggingFace model hub.

evals_dataset: "meta-llama/Meta-Llama-3.1-8B-Instruct-evals" # The name of the 3.1 evals dataset to evaluate; please make sure this eval dataset corresponds to the model loaded. This must be a valid dataset name in the Llama 3.1 Evals collection.
# Must be one of the following: ["meta-llama/Meta-Llama-3.1-8B-Instruct-evals", "meta-llama/Meta-Llama-3.1-70B-Instruct-evals", "meta-llama/Meta-Llama-3.1-405B-Instruct-evals", "meta-llama/Meta-Llama-3.1-8B-evals", "meta-llama/Meta-Llama-3.1-70B-evals", "meta-llama/Meta-Llama-3.1-405B-evals"]

tasks: "meta_instruct" # Available tasks for instruct models: "meta_math_hard", "meta_gpqa", "meta_mmlu_pro_instruct", "meta_ifeval"; or just use "meta_instruct" to run all of them.
# Available tasks for pretrained models: "meta_bbh", "meta_mmlu_pro_pretrain"; or just use "meta_pretrain" to run all of them.

tensor_parallel_size: 1 # The vLLM argument that specifies the tensor parallel size for the model, i.e. how many GPUs to use for one model copy.

data_parallel_size: 4 # The vLLM argument that specifies the data parallel size, i.e. how many copies of the model will be used.

gpu_memory_utilization: 0.9 # The vLLM argument that specifies GPU memory utilization; the rest will be reserved for the KV cache.

max_model_len: 8192 # The vLLM argument that specifies the model max length; decrease this value only if you encounter GPU memory issues. Please make sure max_gen_toks in the task yaml does not exceed this length.

batch_size: "auto" # Batch size, can be 'auto', 'auto:N', or an integer. It is strongly recommended to use 'auto' with vLLM to speed up inference.

output_path: "eval_results" # The output folder to store all the eval results and samples.

#limit: 12 # Limit the number of examples per task; set to 'null' to run all.
limit: null # Limit the number of examples per task; set to 'null' to run all.

verbosity: "INFO" # Logging level: CRITICAL, ERROR, WARNING, INFO, DEBUG.

log_samples: true # If true, write out all model outputs and documents for per-sample measurement and post-hoc analysis.

work_dir: ./work_dir # The work folder where the task template yaml files will be copied and modified, and where datasets will be downloaded for math_hard and ifeval.

template_dir: ./meta_template # Path to the folder that contains all the meta task templates.

show_config: false # If true, show the full config of all tasks at the end of the evaluation.
```
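
To make the wiring concrete, below is a minimal sketch of how a config like the one above could be loaded and handed to lm-evaluation-harness from Python. The filename `eval_config.yaml`, the use of `work_dir` as the task include path, and the exact `simple_evaluate`/`TaskManager` keyword arguments are assumptions based on recent lm-eval versions; the recipe's own driver script (described in its README) may differ.

```python
# Sketch only: assumes `pip install lm-eval vllm pyyaml` and a config file named
# eval_config.yaml (assumed name) with the keys shown above.
import yaml
import lm_eval
from lm_eval.tasks import TaskManager

with open("eval_config.yaml") as f:
    cfg = yaml.safe_load(f)

# Build the model_args string in the comma-separated key=value format that
# lm-eval forwards to its vLLM backend.
model_args = (
    f"pretrained={cfg['model_name']},"
    f"tensor_parallel_size={cfg['tensor_parallel_size']},"
    f"data_parallel_size={cfg['data_parallel_size']},"
    f"gpu_memory_utilization={cfg['gpu_memory_utilization']},"
    f"max_model_len={cfg['max_model_len']},"
    "dtype=auto"
)

# Point the task manager at the folder holding the meta_* task templates
# (the config's work_dir, where the templates are copied and modified).
task_manager = TaskManager(include_path=cfg["work_dir"])

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=model_args,
    tasks=[cfg["tasks"]],          # e.g. "meta_instruct" or "meta_gpqa"
    batch_size=cfg["batch_size"],  # "auto" is recommended for vLLM
    limit=cfg["limit"],
    log_samples=cfg["log_samples"],
    task_manager=task_manager,
)
print(results["results"])
```
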
meta_bbh task template YAML (file name not shown in this view; 28 additions, 0 deletions)

```yaml
dataset_path: meta-llama/Meta-Llama-3.1-8B-evals
dataset_name: Meta-Llama-3.1-8B-evals__bbh__details
task: meta_bbh
output_type: generate_until
process_docs: !function utils.process_docs
test_split: latest
doc_to_text: !function utils.doc_to_text
doc_to_target: answer
filter_list:
  - name: "strict-match"
    filter:
      - function: "regex"
        regex_pattern: 'the answer is (.*?)\.'
      - function: "take_first"
generation_kwargs:
  until: "\n\nQ: "
  do_sample: false
  temperature: 0
  max_gen_toks: 512
num_fewshot: 0
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: true
metadata:
  version: 1.0
```
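
To illustrate what the `strict-match` filter above does, here is a small sketch that applies the same regex to an invented model completion: the regex captures the span after "the answer is", `take_first` keeps the first match, and `exact_match` then compares it to the target while ignoring case and punctuation. The completion text and the normalization helper are illustrative only, not the harness's actual implementation.

```python
import re
import string

# Invented completion; real completions come from the model under evaluation.
completion = "Let's think step by step. The list has 7 items, so the answer is 7. Q:"

# Equivalent of the "regex" + "take_first" filter chain in the task yaml.
matches = re.findall(r'the answer is (.*?)\.', completion)
prediction = matches[0] if matches else "[invalid]"  # rough stand-in for the no-match fallback

# Rough equivalent of exact_match with ignore_case / ignore_punctuation.
def normalize(text: str) -> str:
    text = text.lower().strip()
    return text.translate(str.maketrans("", "", string.punctuation))

target = "7"
print(prediction, normalize(prediction) == normalize(target))  # "7" True
```
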
utils.py for the meta_bbh task (21 additions, 0 deletions)

```python
import random
import re

import datasets


def doc_to_text(doc: dict) -> str:
    # The 3.1 evals dataset already stores the fully formatted prompt; use it as-is.
    return doc["input_final_prompts"][0]


def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
    def _process_doc(doc: dict) -> dict:
        # Keep only the fields the harness needs: the question and the reference answer.
        out_doc = {
            "problem": doc["input_question"],
            "answer": doc["input_correct_responses"][0],
        }
        return out_doc

    dataset = dataset.select_columns(
        ["input_question", "input_correct_responses", "input_final_prompts",
         "is_correct", "input_question_hash", "output_prediction_text"]
    )
    # Rename Meta's original correctness flag so it is preserved for comparison
    # without clashing with the harness's own scoring.
    dataset = dataset.rename_column("is_correct", "previously_is_correct")
    return dataset.map(_process_doc)
```
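
To see what `process_docs` produces, the self-contained sketch below builds a toy dataset with the same column names as the 3.1 evals data (all values are invented) and runs the same select/rename/map steps; it assumes a `datasets` version recent enough to provide `Dataset.select_columns`.

```python
import datasets

# Toy stand-in for the 3.1 evals split; column names match, values are invented.
toy = datasets.Dataset.from_dict({
    "input_question": ["How many letters are in 'llama'?"],
    "input_correct_responses": [["5"]],
    "input_final_prompts": [["Q: How many letters are in 'llama'?\nA: Let's think step by step."]],
    "is_correct": [True],
    "input_question_hash": ["abc123"],
    "output_prediction_text": [["the answer is 5."]],
})

processed = toy.select_columns(
    ["input_question", "input_correct_responses", "input_final_prompts",
     "is_correct", "input_question_hash", "output_prediction_text"]
).rename_column("is_correct", "previously_is_correct").map(
    lambda doc: {"problem": doc["input_question"],
                 "answer": doc["input_correct_responses"][0]}
)

print(processed[0]["problem"])  # the original question
print(processed[0]["answer"])   # "5", the reference answer used by exact_match
```
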
meta_gpqa task template YAML (file name not shown in this view; 29 additions, 0 deletions)

```yaml
dataset_path: meta-llama/Meta-Llama-3.1-8B-Instruct-evals
dataset_name: Meta-Llama-3.1-8B-Instruct-evals__gpqa__details
task: meta_gpqa
output_type: generate_until
process_docs: !function utils.process_docs
test_split: latest
doc_to_text: !function utils.doc_to_text
doc_to_target: gold
filter_list:
  - name: "strict-match"
    filter:
      - function: "regex"
        group_select: -1
        regex_pattern: 'best answer is ([A-Z])'
      - function: "take_first"
generation_kwargs:
  until: []
  do_sample: false
  temperature: 0
  max_gen_toks: 2048
num_fewshot: 0
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: true
metadata:
  version: 1.0
```
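
One difference from the BBH filter is `group_select: -1`, which keeps the last regex match rather than the first; this matters because a long chain-of-thought answer may state "best answer is ..." more than once before settling. A minimal illustration with an invented completion:

```python
import re

# Invented completion for illustration; a long chain of thought may mention
# several candidate letters before settling on the final one.
completion = (
    "At first glance the best answer is B, but reconsidering the units, "
    "the best answer is C."
)

matches = re.findall(r'best answer is ([A-Z])', completion)
print(matches)      # ['B', 'C']
print(matches[-1])  # 'C', since group_select: -1 keeps the last match
```
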
utils.py for the meta_gpqa task (20 additions, 0 deletions)

```python
import random
import re

import datasets


def doc_to_text(doc: dict) -> str:
    # The 3.1 evals dataset already stores the fully formatted prompt; use it as-is.
    return doc["input_final_prompts"][0]


def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
    def _process_doc(doc: dict) -> dict:
        # "gold" holds the correct answer letter referenced by doc_to_target in the yaml.
        out_doc = {
            "problem": doc["input_question"],
            "gold": doc["input_correct_responses"][0],
        }
        return out_doc

    dataset = dataset.select_columns(
        ["input_question", "input_correct_responses", "input_final_prompts",
         "is_correct", "input_question_hash", "input_choice_list",
         "output_prediction_text"]
    )
    dataset = dataset.rename_column("is_correct", "previously_is_correct")
    return dataset.map(_process_doc)
```
