
Commit 576e574: update readme
1 parent a7b4492

17 files changed, +26 -26 lines changed


tools/benchmarks/llm_eval_harness/README.md

Lines changed: 2 additions & 2 deletions
@@ -141,9 +141,9 @@ vLLM occasionally differs in output from Huggingface. `lm-evaluation-harness` tr

For more details about `lm-evaluation-harness`, please visit checkout their github repo [README.md](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/README.md).

-## Reproducing Meta 3.1 Evaluation Metrics Using LM-Evaluation-Harness
+## Calculating Meta 3.1 Evaluation Metrics Using LM-Evaluation-Harness

-[meta_eval_reproduce](./meta_eval_reproduce/) folder provides a detailed guide on how to reproduce the Meta Llama 3.1 evaluation metrics reported in our [Meta Llama website](https://llama.meta.com/) using the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/main) and our [3.1 evals Huggingface collection](https://huggingface.co/collections/meta-llama/llama-31-evals-66a2c5a14c2093e58298ac7f). By following the steps outlined, users can replicate a evaluation process that is similar to Meta's, for specific tasks and compare their results with our reported metrics. While slight variations in results are expected due to differences in implementation and model behavior, we aim to provide a transparent and reproducible method for evaluating Meta Llama 3 models using third party library. Please check the [README.md](./meta_eval_reproduce/README.md) for more details.
+[meta_eval](./meta_eval/) folder provides a detailed guide on how to calculate the Meta Llama 3.1 evaluation metrics reported in our [Meta Llama website](https://llama.meta.com/) using the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/main) and our [3.1 evals Huggingface collection](https://huggingface.co/collections/meta-llama/llama-31-evals-66a2c5a14c2093e58298ac7f). By following the steps outlined, users can replicate a evaluation process that is similar to Meta's, for specific tasks and compare their results with our reported metrics. While slight variations in results are expected due to differences in implementation and model behavior, we aim to provide a transparent method for evaluating Meta Llama 3 models using third party library. Please check the [README.md](./meta_eval/README.md) for more details.

## Reproducing HuggingFace Open-LLM-Leaderboard v2

tools/benchmarks/llm_eval_harness/meta_eval_reproduce/README.md renamed to tools/benchmarks/llm_eval_harness/meta_eval/README.md

Lines changed: 24 additions & 24 deletions
@@ -1,12 +1,12 @@

-# Reproducing Meta 3.1 Evaluation Metrics Using LM-Evaluation-Harness
+# Calculating Meta 3.1 Evaluation Metrics Using LM-Evaluation-Harness

-As Meta Llama models gain popularity, evaluating these models has become increasingly important. We have released all the evaluation details for Meta-Llama 3.1 models as datasets in the [3.1 evals Hugging Face collection](https://huggingface.co/collections/meta-llama/llama-31-evals-66a2c5a14c2093e58298ac7f). This recipe demonstrates how to closely reproduce the Llama 3.1 reported benchmark numbers using the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/main) library and our prompts from the 3.1 evals datasets on selected tasks.
+As Meta Llama models gain popularity, evaluating these models has become increasingly important. We have released all the evaluation details for Meta-Llama 3.1 models as datasets in the [3.1 evals Hugging Face collection](https://huggingface.co/collections/meta-llama/llama-31-evals-66a2c5a14c2093e58298ac7f). This recipe demonstrates how to calculate the Llama 3.1 reported benchmark numbers using the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/main) library and our prompts from the 3.1 evals datasets on selected tasks.

## Disclaimer


-1. **This recipe is not the official implementation** of Meta Llama evaluation. It is based on public third-party libraries, as this implementation is not mirroring Meta Llama evaluation, this may lead to minor differences in the reproduced numbers.
+1. **This recipe is not the official implementation** of Meta Llama evaluation. It is based on public third-party libraries, as this implementation is not mirroring Meta Llama evaluation, this may lead to minor differences in the produced numbers.
2. **Model Compatibility**: This tutorial is specifically for Llama 3 based models, as our prompts include Meta Llama 3 special tokens, e.g. `<|start_header_id|>user<|end_header_id|>`. It will not work with models that are not based on Llama 3.

## Insights from Our Evaluation Process
@@ -19,19 +19,19 @@ Here are our insights about the differences in terms of the eval configurations
- **Inference**: We use an internal LLM inference solution that does not apply padding, while Hugging Face leaderboard uses padding on the generative tasks (MATH and IFEVAL).
- **Tasks** We run benchmarks on BBH and MMLU-Pro only for pretrained models and Math-Hard, IFeval, GPQA, only for pretrained models.

-Given those differences, our reproduced number can not be compared to the numbers in the Hugging Face [Open LLM Leaderboard v2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard), even if the task names are the same.
+Given those differences, the numbers from this recipe can not be compared to the numbers in the Hugging Face [Open LLM Leaderboard v2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard), even if the task names are the same.

## Environment setups

Please install lm-evaluation-harness and our llama-recipe repo by following:

```
-pip install lm-eval[math,ifeval,sentencepiece,vllm]==0.4.3
git clone git@github.com:meta-llama/llama-recipes.git
cd llama-recipes
pip install -U pip setuptools
pip install -e .
-cd tools/benchmarks/llm_eval_harness/meta_eval_reproduce
+pip install lm-eval[math,ifeval,sentencepiece,vllm]==0.4.3
+cd tools/benchmarks/llm_eval_harness/meta_eval
```

To access our [3.1 evals Hugging Face collection](https://huggingface.co/collections/meta-llama/llama-31-evals-66a2c5a14c2093e58298ac7f), you must:
@@ -47,7 +47,7 @@ Given the extensive number of tasks available (12 for pretrained models and 30 f
- **Tasks for pretrained models**: BBH and MMLU-Pro
- **Tasks for instruct models**: Math-Hard, IFeval, GPQA, and MMLU-Pro

-Here, we aim to reproduce the Meta reported benchmark numbers on the aforementioned tasks using Hugging Face [leaderboard implementation](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/leaderboard). Please follow the instructions below to make necessary modifications to use our eval prompts and reproduce our reported metrics.
+Here, we aim to get the benchmark numbers on the aforementioned tasks using Hugging Face [leaderboard implementation](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/leaderboard). Please follow the instructions below to make necessary modifications to use our eval prompts and get more eval metrics.


### Run eval tasks
@@ -91,7 +91,7 @@ python prepare_meta_eval.py --config_path ./eval_config.yaml
lm_eval --model vllm --model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.9,data_parallel_size=4,max_model_len=8192,add_bos_token=True,seed=42 --tasks meta_instruct --batch_size auto --output_path eval_results --include_path ./work_dir --seed 42 --log_samples
```

-4. Then just copy the `lm_eval` command printed by [prepare_meta_eval.py](./prepare_meta_eval.py) back to your terminal and run it to get our reproduced result, which will be saved into `eval_results` folder by default.
+4. Then just copy the `lm_eval` command printed by [prepare_meta_eval.py](./prepare_meta_eval.py) back to your terminal and run it to get the result, which will be saved into `eval_results` folder by default.

**NOTE**: As for `--model vllm`, here we will use VLLM inference instead of Hugging Face inference because of the padding issue. By default, for the generative tasks, the `lm-eval --model_args="{...}" --batch_size=auto` command will use Hugging Face inference solution that uses a static batch method with [left padding](https://github.com/EleutherAI/lm-evaluation-harness/blob/8ad598dfd305ece8c6c05062044442d207279a97/lm_eval/models/huggingface.py#L773) using EOS_token for Llama models, but our internal evaluation will load python original checkpoints and handle individual generation request asynchronously without any padding. To simulate this, we will use VLLM inference solution to do dynamic batching without any padding.
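Step 4 above saves the scores under the `eval_results` folder; the following is a minimal sketch of inspecting them from Python, assuming lm-eval's usual `results_*.json` output files (the exact file layout can vary between lm-eval versions):

```python
# A minimal sketch, assuming lm-eval writes results_*.json files somewhere under
# eval_results/; the exact file layout can differ between lm-eval versions.
import glob
import json

for path in sorted(glob.glob("eval_results/**/results_*.json", recursive=True)):
    with open(path) as f:
        results = json.load(f)["results"]  # task name -> {metric: value, ...}
    for task, metrics in results.items():
        numeric = {k: v for k, v in metrics.items() if isinstance(v, (int, float))}
        print(f"{path} :: {task} :: {numeric}")
```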

@@ -115,7 +115,7 @@ Here, we will use MMLU-Pro as a example to show the steps to create a yaml confi

**1.Define the config to load datasets**

-We can use our 3.1 evals dataset as the source dataset and the corresponding subset and define the test split to latest. For example, if we want to reproduce the MMLU_Pro metric for 3.1 8B instruct, the following configs are needed as explained below:
+We can use our 3.1 evals dataset as the source dataset and the corresponding subset and define the test split to latest. For example, if we want to calculate the MMLU_Pro metric for 3.1 8B instruct, the following configs are needed as explained below:

```yaml
task: meta_mmlu_pro_instruct
@@ -124,7 +124,7 @@ dataset_name: Meta-Llama-3.1-8B-Instruct-evals__mmlu_pro__details
test_split: latest
```

-If you want to run evaluation on 70B-Instruct, then it is recommended to change the `dataset_path` and `dataset_name` from 8B to 70B, even though 70B-instruct and 8B-instruct share the same prompts, the `is_correct` column, which can be used to get the difference between current reproduced result and the reported results for each sample, is different.
+If you want to run evaluation on 70B-Instruct, then it is recommended to change the `dataset_path` and `dataset_name` from 8B to 70B, even though 70B-instruct and 8B-instruct share the same prompts, the `is_correct` column, which can be used to get the difference between current result and the reported results for each sample, is different.

**Note**: Config files for Meta-Llama-3.1-8B-Instruct are already provided in each task subfolder under [meta_template folder](./meta_template/). Remember to change the eval dataset name according to the model type and DO NOT use pretrained evals dataset on instruct models or vice versa.
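The `dataset_name` and `test_split` fields above map directly onto a `datasets.load_dataset` call. Below is a minimal sketch of pulling the same subset from Python; the repo id used as the `dataset_path` is an assumption, and access to the gated 3.1 evals collection must already be set up:

```python
# A minimal sketch; the repo id passed as dataset_path is an assumption, and access
# to the gated 3.1 evals collection (e.g. via `huggingface-cli login`) is required.
from datasets import load_dataset

ds = load_dataset(
    "meta-llama/Meta-Llama-3.1-8B-Instruct-evals",               # assumed dataset_path
    name="Meta-Llama-3.1-8B-Instruct-evals__mmlu_pro__details",  # dataset_name from the yaml above
    split="latest",                                              # test_split from the yaml above
)
print(ds.column_names)  # should include the per-sample `is_correct` column discussed above
```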

@@ -138,7 +138,7 @@ doc_to_text: !function utils.doc_to_text
doc_to_target: gold
```

-- `process_docs` : Defines the preprocess function for our datasets. In this case, we uses the `process_docs` python function that is defined in [utils.py](./meta_template/mmlu_pro/utils.py). This function will take the original dataset and output a processed dataset that has a out_doc, which contains `problem` which is the input question, `gold` which is the ground truth. We also renamed the `is_correct` column to `previously_is_correct` to allow detailed comparison for the difference of each sample between previously reported score and the reproduced score. You must use eval dataset and model with same parameters and same model type to get a valid comparison.
+- `process_docs` : Defines the preprocess function for our datasets. In this case, we uses the `process_docs` python function that is defined in [utils.py](./meta_template/mmlu_pro/utils.py). This function will take the original dataset and output a processed dataset that has a out_doc, which contains `problem` which is the input question, `gold` which is the ground truth. We also renamed the `is_correct` column to `previously_is_correct` to allow detailed comparison for the difference of each sample between previously reported score and the current score. You must use eval dataset and model with same parameters and same model type to get a valid comparison.

- `doc_to_text`: Defines the prompts. In the MMLU-Pro case, the `input_final_prompts` column always contains a list of a prompt, so we just use a python function that returns `input_final_prompts[0]`.
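To make the shape of that preprocessing concrete, below is a minimal sketch of a `process_docs` / `doc_to_text` pair matching the description above. It is not the shipped `utils.py`, and the source column names other than `input_final_prompts` and `is_correct` are assumptions:

```python
# A minimal sketch, not the shipped utils.py; source column names other than
# `input_final_prompts` and `is_correct` are assumptions for illustration.
import datasets


def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
    def _process_doc(doc):
        return {
            "problem": doc["input_question"],            # assumed column holding the raw question
            "gold": doc["input_correct_responses"][0],   # assumed column holding the ground truth
            "previously_is_correct": doc["is_correct"],  # keep Meta's reported per-sample correctness
            "input_final_prompts": doc["input_final_prompts"],
        }

    return dataset.map(_process_doc)


def doc_to_text(doc) -> str:
    # The 3.1 evals prompts come as a single-element list; the first entry is the full prompt.
    return doc["input_final_prompts"][0]
```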

@@ -178,33 +178,33 @@ metric_list:
```
Here we set the `num_fewshot` to 0 as our prompts have already been converted to 5-shots, and the model generation will only stop if the generated output tokens exceeds 1024, as stated in the [mmlu-pro eval details](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/eval_details.md#mmlu-pro). We will set the `do_sample` to false and `temperature` to 0 as stated in our `eval_config` column in the dataset. We will use metric `exact_match` for this tasks and calculate the `mean` as our task aggregated number.

-**NOTE**: While we tried our best to create the template files, those configs and functions are created based on public third-party library and are not exactly the same as our internal implementation, so there is a chance that the reproduced numbers are slightly different.
+**NOTE**: While we tried our best to create the template files, those configs and functions are created based on public third-party library and are not exactly the same as our internal implementation, so there is a chance that the eval numbers are slightly different.

## Results

-Here is the comparison between our reported numbers and the reproduced numbers in this tutorial:
+Here is the comparison between our reported numbers and the eval numbers in this tutorial:

| Model | MATH_HARD | GPQA_RAW | MMLU_PRO_RAW | IFeval |
|------------------------------|-----------|----------|--------------|---------|
-| 3.1 8B-Instruct reported | 0.254 | 0.328 | 0.47 | 0.804 |
-| 3.1 8B-Instruct reproduced | 0.2424 | 0.3259 | 0.4675 | 0.7782 |
-| 3.1 70B-Instruct reported | 0.438 | 0.467 | 0.651 | 0.875 |
-| 3.1 70B-Instruct reproduced | 0.4388 | 0.4799 | 0.6475 | 0.848 |
+| 3.1 8B-Instruct(reported) | 0.254 | 0.328 | 0.47 | 0.804 |
+| 3.1 8B-Instruct(this) | 0.2424 | 0.3259 | 0.4675 | 0.7782 |
+| 3.1 70B-Instruct(reported) | 0.438 | 0.467 | 0.651 | 0.875 |
+| 3.1 70B-Instruct(this) | 0.4388 | 0.4799 | 0.6475 | 0.848 |

| Model | BBH_RAW | MMLU_PRO_RAW |
|------------------------|---------|--------------|
-| 3.1 8B reported | 0.642 | 0.356 |
-| 3.1 8B reproduced | 0.6515 | 0.3572 |
-| 3.1 70B reported | 0.816 | 0.52 |
-| 3.1 70B reproduced | 0.8191 | 0.5225 |
+| 3.1 8B(reported) | 0.642 | 0.356 |
+| 3.1 8B(this) | 0.6515 | 0.3572 |
+| 3.1 70B(reported) | 0.816 | 0.52 |
+| 3.1 70B(this) | 0.8191 | 0.5225 |

-From the table above, we can see that most of our reproduced results are very close to our reported number in the [Meta Llama website](https://llama.meta.com/).
+From the table above, we can see that most of our results calculated from this recipe are very close to our reported number in the [Meta Llama website](https://llama.meta.com/).

**NOTE**: We used the average of `inst_level_strict_acc,none` and `prompt_level_strict_acc,none` to get the final number for `IFeval` as stated [here](https://huggingface.co/docs/leaderboards/open_llm_leaderboard/about#task-evaluations-and-parameters).
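That IFeval aggregation is a simple average of the two strict-accuracy metrics; below is a minimal sketch, with a hypothetical results file path and task key:

```python
# A minimal sketch of the IFeval aggregation described in the note above; the
# `meta_ifeval` task key and the results file path are hypothetical placeholders.
import json

with open("eval_results/results.json") as f:
    results = json.load(f)["results"]

ifeval = results["meta_ifeval"]
ifeval_final = (ifeval["inst_level_strict_acc,none"] + ifeval["prompt_level_strict_acc,none"]) / 2
print(round(ifeval_final, 4))  # e.g. 0.7782 for 3.1 8B-Instruct in the table above
```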

-**NOTE**: In the [Meta Llama website](https://llama.meta.com/), we reported the `macro_avg` metric, which is the average of all subtask average score, for `MMLU-Pro `task, but here we are reproducing the `micro_avg` metric, which is the average score for all the individual samples, and those `micro_avg` numbers can be found in the [eval_details.md](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/eval_details.md#mmlu-pro).
+**NOTE**: In the [Meta Llama website](https://llama.meta.com/), we reported the `macro_avg` metric, which is the average of all subtask average score, for `MMLU-Pro `task, but here we are calculating the `micro_avg` metric, which is the average score for all the individual samples, and those `micro_avg` numbers can be found in the [eval_details.md](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/eval_details.md#mmlu-pro).

-**NOTE**: The reproduced numbers may be slightly different, as we observed around ±0.01 differences between each reproduce run because the latest VLLM inference is not very deterministic even with temperature=0. This behavior maybe related [this issue](https://github.com/vllm-project/vllm/issues/5404).
+**NOTE**: The eval numbers may be slightly different, as we observed around ±0.01 differences between each evaluation run because the latest VLLM inference is not very deterministic even with temperature=0. This behavior maybe related [this issue](https://github.com/vllm-project/vllm/issues/5404).
or it is expected due to 16-bits inference as stated in [this comment](https://github.com/huggingface/transformers/issues/25420#issuecomment-1775317535) and [this comment](https://github.com/vllm-project/vllm/issues/4112#issuecomment-2071115725).

## Acknowledgement
