Commit 99f78f7

Merge branch 'main' into main
2 parents fb1ed28 + 05bde78 commit 99f78f7

24 files changed: +1767 -1814 lines changed


README.md

Lines changed: 7 additions & 5 deletions
@@ -1,17 +1,18 @@
 # Llama Recipes: Examples to get started using the Llama models from Meta
 <!-- markdown-link-check-disable -->
-The 'llama-recipes' repository is a companion to the [Meta Llama 2](https://github.com/meta-llama/llama) and [Meta Llama 3](https://github.com/meta-llama/llama3) models. The goal of this repository is to provide a scalable library for fine-tuning Meta Llama models, along with some example scripts and notebooks to quickly get started with using the models in a variety of use-cases, including fine-tuning for domain adaptation and building LLM-based applications with Meta Llama and other tools in the LLM ecosystem. The examples here showcase how to run Meta Llama locally, in the cloud, and on-prem.
+The 'llama-recipes' repository is a companion to the [Meta Llama 3](https://github.com/meta-llama/llama3) models. The goal of this repository is to provide a scalable library for fine-tuning Meta Llama models, along with some example scripts and notebooks to quickly get started with using the models in a variety of use-cases, including fine-tuning for domain adaptation and building LLM-based applications with Meta Llama and other tools in the LLM ecosystem. The examples here showcase how to run Meta Llama locally, in the cloud, and on-prem. [Meta Llama 2](https://github.com/meta-llama/llama) is also supported in this repository. We highly recommend everyone to utilize [Meta Llama 3](https://github.com/meta-llama/llama3) due to its enhanced capabilities.
+
 <!-- markdown-link-check-enable -->
 > [!IMPORTANT]
-> Llama 3 has a new prompt template and special tokens (based on the tiktoken tokenizer).
+> Meta Llama 3 has a new prompt template and special tokens (based on the tiktoken tokenizer).
 > | Token | Description |
 > |---|---|
 > `<\|begin_of_text\|>` | This is equivalent to the BOS token. |
 > `<\|end_of_text\|>` | This is equivalent to the EOS token. For multiturn-conversations it's usually unused. Instead, every message is terminated with `<\|eot_id\|>` instead.|
 > `<\|eot_id\|>` | This token signifies the end of the message in a turn i.e. the end of a single message by a system, user or assistant role as shown below.|
 > `<\|start_header_id\|>{role}<\|end_header_id\|>` | These tokens enclose the role for a particular message. The possible roles can be: system, user, assistant. |
 >
-> A multiturn-conversation with Llama 3 follows this prompt template:
+> A multiturn-conversation with Meta Llama 3 follows this prompt template:
 > ```
 > <|begin_of_text|><|start_header_id|>system<|end_header_id|>
 >

@@ -133,7 +134,7 @@ Contains examples are organized in folders by topic:
 [quickstart](./recipes/quickstart) | The "Hello World" of using Llama, start here if you are new to using Llama.
 [finetuning](./recipes/finetuning)|Scripts to finetune Llama on single-GPU and multi-GPU setups
 [inference](./recipes/inference)|Scripts to deploy Llama for inference locally and using model servers
-[use_cases](./recipes/use_cases)|Scripts showing common applications of Llama2
+[use_cases](./recipes/use_cases)|Scripts showing common applications of Meta Llama3
 [responsible_ai](./recipes/responsible_ai)|Scripts to use PurpleLlama for safeguarding model outputs
 [llama_api_providers](./recipes/llama_api_providers)|Scripts to run inference on Llama via hosted endpoints
 [benchmarks](./recipes/benchmarks)|Scripts to benchmark Llama models inference on various backends

@@ -159,7 +160,8 @@ Please read [CONTRIBUTING.md](CONTRIBUTING.md) for details on our code of conduc
 
 ## License
 <!-- markdown-link-check-disable -->
-See the License file for Meta Llama 2 [here](https://llama.meta.com/llama2/license/) and Acceptable Use Policy [here](https://llama.meta.com/llama2/use-policy/)
 
 See the License file for Meta Llama 3 [here](https://llama.meta.com/llama3/license/) and Acceptable Use Policy [here](https://llama.meta.com/llama3/use-policy/)
+
+See the License file for Meta Llama 2 [here](https://llama.meta.com/llama2/license/) and Acceptable Use Policy [here](https://llama.meta.com/llama2/use-policy/)
 <!-- markdown-link-check-enable -->
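The README hunk above documents the new Meta Llama 3 special tokens and multiturn prompt template. As a minimal sketch (not part of this commit), the same format can be produced with the Hugging Face `transformers` chat-template API; the model id `meta-llama/Meta-Llama-3-8B-Instruct` is an illustrative assumption:

```python
# Sketch: building a Meta Llama 3 multiturn prompt via the tokenizer's chat template.
# Assumes access to the gated meta-llama/Meta-Llama-3-8B-Instruct repo on the HF Hub.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What do the new special tokens do?"},
]

# add_generation_prompt=True appends the assistant header so the model continues the turn.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # contains <|begin_of_text|>, <|start_header_id|>...<|end_header_id|>, <|eot_id|>
```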

docs/images/messenger_llama_arch.jpg

-4.34 KB

docs/images/whatsapp_llama_arch.jpg

2.18 KB

recipes/evaluation/README.md

Lines changed: 13 additions & 13 deletions
@@ -1,6 +1,6 @@
 # Llama Model Evaluation
 
-Llama-Recipe make use of `lm-evaluation-harness` for evaluating our fine-tuned Llama2 model. It also can serve as a tool to evaluate quantized model to ensure the quality in lower precision or other optimization applied to the model that might need evaluation.
+Llama-Recipes makes use of `lm-evaluation-harness` for evaluating our fine-tuned Meta Llama3 (or Llama2) models. It can also serve as a tool to evaluate quantized models, to ensure that quality is maintained in lower precision or under other optimizations applied to the model.
 
 
 `lm-evaluation-harness` provide a wide range of [features](https://github.com/EleutherAI/lm-evaluation-harness?tab=readme-ov-file#overview):
@@ -12,7 +12,7 @@ Llama-Recipe make use of `lm-evaluation-harness` for evaluating our fine-tuned L
 - Support for evaluation on adapters (e.g. LoRA) supported in Hugging Face's PEFT library.
 - Support for local models and benchmarks.
 
-The Language Model Evaluation Harness is also the backend for 🤗 [Hugging Face's (HF) popular Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).
+The Language Model Evaluation Harness is also the backend for 🤗 [Hugging Face's (HF) popular Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).
 
 ## Setup
 
@@ -36,35 +36,35 @@ pip install -e .
 
 ### Quick Test
 
-To run evaluation for Hugging Face `Llama2 7B` model on a single GPU please run the following,
+To run evaluation for the Hugging Face `Llama3 8B` model on a single GPU, please run the following:
 
 ```bash
-python eval.py --model hf --model_args pretrained=meta-llama/Llama-2-7b-chat-hf --tasks hellaswag --device cuda:0 --batch_size 8
+python eval.py --model hf --model_args pretrained=meta-llama/Meta-Llama-3-8B --tasks hellaswag --device cuda:0 --batch_size 8
 
 ```
 Tasks can be extended by using `,` between them for example `--tasks hellaswag,arc`.
 
-To set the number of shots you can use `--num_fewshot` to set the number for few shot evaluation.
+To set the number of shots you can use `--num_fewshot` to set the number for few shot evaluation.
 
-### PEFT Fine-tuned model Evaluation
+### PEFT Fine-tuned model Evaluation
 
 In case you have fine-tuned your model using PEFT you can set the PATH to the PEFT checkpoints using PEFT as part of model_args as shown below:
 
 ```bash
-python eval.py --model hf --model_args pretrained=meta-llama/Llama-2-7b-hf,dtype="float",peft=../peft_output --tasks hellaswag --num_fewshot 10 --device cuda:0 --batch_size 8
+python eval.py --model hf --model_args pretrained=meta-llama/Meta-Llama-3-8B,dtype="float",peft=../peft_output --tasks hellaswag --num_fewshot 10 --device cuda:0 --batch_size 8
 ```
 
 ### Limit the number of examples in benchmarks
 
 There has been an study from [IBM on efficient benchmarking of LLMs](https://arxiv.org/pdf/2308.11696.pdf), with main take a way that to identify if a model is performing poorly, benchmarking on wider range of tasks is more important than the number example in each task. This means you could run the evaluation harness with fewer number of example to have initial decision if the performance got worse from the base line. To limit the number of example here, it can be set using `--limit` flag with actual desired number. But for the full assessment you would need to run the full evaluation. Please read more in the paper linked above.
 
 ```bash
-python eval.py --model hf --model_args pretrained=meta-llama/Llama-2-7b-hf,dtype="float",peft=../peft_output --tasks hellaswag --num_fewshot 10 --device cuda:0 --batch_size 8 --limit 100
+python eval.py --model hf --model_args pretrained=meta-llama/Meta-Llama-3-8B,dtype="float",peft=../peft_output --tasks hellaswag --num_fewshot 10 --device cuda:0 --batch_size 8 --limit 100
 ```
 
 ### Reproducing Hugging Face Open-LLM-Leaderboard
 
-Here, we provided a list of tasks from `Open-LLM-Leaderboard` which can be used by passing `--open-llm-leaderboard-tasks` instead of `tasks` to the `eval.py`.
+Here, we provided a list of tasks from `Open-LLM-Leaderboard` which can be used by passing `--open-llm-leaderboard-tasks` instead of `tasks` to the `eval.py`.
 
 **NOTE** Make sure to run the bash script below, that will set the `include paths` in the [config files](./open_llm_leaderboard/). The script will prompt you to enter the path to the cloned lm-evaluation-harness repo.**You would need this step only for the first time**.
 
@@ -76,7 +76,7 @@ bash open_llm_eval_prep.sh
 Now we can run the eval benchmark:
 
 ```bash
-python eval.py --model hf --model_args pretrained=meta-llama/Llama-2-7b-hf,dtype="float",peft=../peft_output --num_fewshot 10 --device cuda:0 --batch_size 8 --limit 100 --open_llm_leaderboard_tasks
+python eval.py --model hf --model_args pretrained=meta-llama/Meta-Llama-3-8B,dtype="float",peft=../peft_output --num_fewshot 10 --device cuda:0 --batch_size 8 --limit 100 --open_llm_leaderboard_tasks
 ```
 
 In the HF leaderboard, the [LLMs are evaluated on 7 benchmarks](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) from Language Model Evaluation Harness as described below:

@@ -107,7 +107,7 @@ To perform *data-parallel evaluation* (where each GPU loads a **separate full co
 ```bash
 accelerate config
 
-accelerate launch eval.py --model hf --model_args "pretrained=meta-llama/Llama-2-7b-chat-hf" --limit 100 --open-llm-leaderboard-tasks --output_path ./results.json --log_samples
+accelerate launch eval.py --model hf --model_args "pretrained=meta-llama/Meta-Llama-3-8B" --limit 100 --open-llm-leaderboard-tasks --output_path ./results.json --log_samples
 ```
 
 In case your model can fit on a single GPU, this allows you to evaluate on K GPUs K times faster than on one.

@@ -119,7 +119,7 @@ In case your model is *too large to fit on a single GPU.*
 In this setting, run the library *outside of the `accelerate` launcher*, but passing `parallelize=True` to `--model_args` as follows:
 
 ```bash
-python eval.py --model hf --model_args "pretrained=meta-llama/Llama-2-7b-chat-hf,parallelize=True" --limit 100 --open_llm_leaderboard_tasks --output_path ./results.json --log_samples
+python eval.py --model hf --model_args "pretrained=meta-llama/Meta-Llama-3-8B,parallelize=True" --limit 100 --open_llm_leaderboard_tasks --output_path ./results.json --log_samples
 ```
 
 

@@ -138,7 +138,7 @@ These two options (`accelerate launch` and `parallelize=True`) are mutually excl
 Also `lm-evaluation-harness` supports vLLM for faster inference on [supported model types](https://docs.vllm.ai/en/latest/models/supported_models.html), especially faster when splitting a model across multiple GPUs. For single-GPU or multi-GPU — tensor parallel, data parallel, or a combination of both — inference, for example:
 
 ```bash
-python eval.py --model vllm --model_args "pretrained=meta-llama/Llama-2-7b-chat-hf,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.8,data_parallel_size=2" --limit 100 --open_llm_leaderboard_tasks --output_path ./results.json --log_samples --batch_size auto
+python eval.py --model vllm --model_args "pretrained=meta-llama/Meta-Llama-3-8B,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.8,data_parallel_size=2" --limit 100 --open_llm_leaderboard_tasks --output_path ./results.json --log_samples --batch_size auto
 ```
 For a full list of supported vLLM configurations, please to [here](https://github.com/EleutherAI/lm-evaluation-harness/blob/076372ee9ee81e25c4e2061256400570354a8d1a/lm_eval/models/vllm_causallms.py#L44-L62).
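The `eval.py` commands above (quick test, PEFT adapters, `--limit`, leaderboard tasks, multi-GPU, vLLM) all go through `lm-evaluation-harness`. As a minimal sketch (not part of this commit), the same quick test can also be driven from Python via the harness's `simple_evaluate` entry point; the import path and the `meta-llama/Meta-Llama-3-8B` checkpoint are assumptions based on recent `lm_eval` releases:

```python
# Sketch: programmatic version of the single-GPU quick test above.
# Assumes a recent lm-evaluation-harness release exposing lm_eval.simple_evaluate.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                          # Hugging Face backend
    model_args="pretrained=meta-llama/Meta-Llama-3-8B",
    tasks=["hellaswag"],
    num_fewshot=0,
    batch_size=8,
    device="cuda:0",
    limit=100,  # small subset for a quick sanity check, as discussed above
)

print(results["results"]["hellaswag"])
```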

recipes/finetuning/huggingface_trainer/peft_finetuning.ipynb

Lines changed: 2 additions & 2 deletions
@@ -251,7 +251,7 @@
 " get_peft_model,\n",
 " LoraConfig,\n",
 " TaskType,\n",
-" prepare_model_for_int8_training,\n",
+" prepare_model_for_kbit_training,\n",
 " )\n",
 "\n",
 " peft_config = LoraConfig(\n",

@@ -264,7 +264,7 @@
 " )\n",
 "\n",
 " # prepare int-8 model for training\n",
-" model = prepare_model_for_int8_training(model)\n",
+" model = prepare_model_for_kbit_training(model)\n",
 " model = get_peft_model(model, peft_config)\n",
 " model.print_trainable_parameters()\n",
 " return model, peft_config\n",

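The notebook change above swaps the removed `prepare_model_for_int8_training` helper for `prepare_model_for_kbit_training` from `peft`. A minimal standalone sketch of that pattern is shown below; the model id and LoRA hyperparameters are illustrative assumptions, not values taken from the notebook:

```python
# Sketch: quantized-model + LoRA setup around prepare_model_for_kbit_training.
# Model id and LoRA hyperparameters are assumptions for illustration only.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # int-8 loading
    device_map="auto",
)

# Freezes the quantized base weights and prepares norm/embedding layers for stable k-bit training.
model = prepare_model_for_kbit_training(model)

peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()  # only the LoRA adapter weights remain trainable
```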
recipes/inference/local_inference/inference.py

Lines changed: 4 additions & 4 deletions
@@ -31,7 +31,7 @@ def main(
 temperature: float=1.0, # [optional] The value used to modulate the next token probabilities.
 top_k: int=50, # [optional] The number of highest probability vocabulary tokens to keep for top-k-filtering.
 repetition_penalty: float=1.0, #The parameter for repetition penalty. 1.0 means no penalty.
-length_penalty: int=1, #[optional] Exponential penalty to the length that is used with beam-based generation.
+length_penalty: int=1, #[optional] Exponential penalty to the length that is used with beam-based generation.
 enable_azure_content_safety: bool=False, # Enable safety check with Azure content safety api
 enable_sensitive_topics: bool=False, # Enable check for sensitive topics using AuditNLG APIs
 enable_salesforce_content_safety: bool=True, # Enable safety check with Salesforce safety flan t5

@@ -98,12 +98,12 @@ def inference(user_prompt, temperature, top_p, top_k, max_new_tokens, **kwargs,)
 top_k=top_k,
 repetition_penalty=repetition_penalty,
 length_penalty=length_penalty,
-**kwargs
+**kwargs
 )
 e2e_inference_time = (time.perf_counter()-start)*1000
 print(f"the inference time is {e2e_inference_time} ms")
 output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
-
+
 # Safety check of the model output
 safety_results = [check(output_text, agent_type=AgentType.AGENT, user_prompt=user_prompt) for check in safety_checker]
 are_safe = all([r[1] for r in safety_results])

@@ -156,7 +156,7 @@ def inference(user_prompt, temperature, top_p, top_k, max_new_tokens, **kwargs,)
 label="Output",
 )
 ],
-title="Llama2 Playground",
+title="Meta Llama3 Playground",
 description="https://github.com/facebookresearch/llama-recipes",
 ).queue().launch(server_name="0.0.0.0", share=True)
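The inference.py hunks above touch the sampling arguments that are forwarded to generation and retitle the Gradio demo. As a minimal, self-contained sketch (not the recipe script itself) of how those knobs map onto Hugging Face `generate`, with the model id as an assumption:

```python
# Sketch: passing the sampling parameters from the diff above to model.generate.
# The model id, max_new_tokens and top_p are illustrative assumptions; the other
# values echo the defaults visible in the hunk above.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")

batch = tokenizer("Tell me about PEFT fine-tuning.", return_tensors="pt").to(model.device)

start = time.perf_counter()
with torch.no_grad():
    outputs = model.generate(
        **batch,
        max_new_tokens=100,
        do_sample=True,
        temperature=1.0,          # modulates next-token probabilities
        top_k=50,                 # keep the 50 highest-probability tokens
        top_p=1.0,
        repetition_penalty=1.0,   # 1.0 means no penalty
        length_penalty=1,         # only used by beam-based generation
    )
e2e_inference_time = (time.perf_counter() - start) * 1000
print(f"the inference time is {e2e_inference_time} ms")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```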
