Commit c62ed77

Llama 3.1 update
Updating the recipes for the Llama 3.1 release.
2 parents 0d00616 + 1a0a282 commit c62ed77

44 files changed, +2516 -247 lines changed

.github/scripts/spellcheck_conf/wordlist.txt

Lines changed: 12 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -1406,3 +1406,15 @@ DLAI
14061406
agentic
14071407
containts
14081408
dlai
1409+
Prerequirements
1410+
tp
1411+
QLoRA
1412+
ntasks
1413+
srun
1414+
xH
1415+
unquantized
1416+
eom
1417+
ipython
1418+
CPUs
1419+
modelUpgradeExample
1420+
guardrailing

README.md

Lines changed: 21 additions & 12 deletions
@@ -1,32 +1,38 @@
11
# Llama Recipes: Examples to get started using the Llama models from Meta
22
<!-- markdown-link-check-disable -->
3-
The 'llama-recipes' repository is a companion to the [Meta Llama 3](https://github.com/meta-llama/llama3) models. The goal of this repository is to provide a scalable library for fine-tuning Meta Llama models, along with some example scripts and notebooks to quickly get started with using the models in a variety of use-cases, including fine-tuning for domain adaptation and building LLM-based applications with Meta Llama and other tools in the LLM ecosystem. The examples here showcase how to run Meta Llama locally, in the cloud, and on-prem. [Meta Llama 2](https://github.com/meta-llama/llama) is also supported in this repository. We highly recommend everyone to utilize [Meta Llama 3](https://github.com/meta-llama/llama3) due to its enhanced capabilities.
3+
The 'llama-recipes' repository is a companion to the [Meta Llama](https://github.com/meta-llama/llama-models) models. We support the latest version, [Llama 3.1](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md), in this repository. The goal is to provide a scalable library for fine-tuning Meta Llama models, along with some example scripts and notebooks to quickly get started with using the models in a variety of use-cases, including fine-tuning for domain adaptation and building LLM-based applications with Llama and other tools in the LLM ecosystem. The examples here showcase how to run Llama locally, in the cloud, and on-prem.
44

55
<!-- markdown-link-check-enable -->
66
> [!IMPORTANT]
7-
> Meta Llama 3 has a new prompt template and special tokens (based on the tiktoken tokenizer).
7+
> Meta Llama 3.1 has a new prompt template and special tokens.
88
> | Token | Description |
99
> |---|---|
10-
> `<\|begin_of_text\|>` | This is equivalent to the BOS token. |
11-
> `<\|end_of_text\|>` | This is equivalent to the EOS token. For multiturn-conversations it's usually unused. Instead, every message is terminated with `<\|eot_id\|>` instead.|
12-
> `<\|eot_id\|>` | This token signifies the end of the message in a turn i.e. the end of a single message by a system, user or assistant role as shown below.|
13-
> `<\|start_header_id\|>{role}<\|end_header_id\|>` | These tokens enclose the role for a particular message. The possible roles can be: system, user, assistant. |
10+
> `<\|begin_of_text\|>` | Specifies the start of the prompt. |
11+
> `<\|eot_id\|>` | This token signifies the end of a turn i.e. the end of the model's interaction either with the user or tool executor. |
12+
> `<\|eom_id\|>` | End of Message. A message represents a possible stopping point where the model can inform the execution environment that a tool call needs to be made. |
13+
> `<\|python_tag\|>` | A special tag used in the model’s response to signify a tool call. |
14+
> `<\|finetune_right_pad_id\|>` | Used for padding text sequences in a batch to the same length. |
15+
> `<\|start_header_id\|>{role}<\|end_header_id\|>` | These tokens enclose the role for a particular message. The possible roles can be: system, user, assistant and ipython. |
16+
> `<\|end_of_text\|>` | This is equivalent to the EOS token. For multiturn-conversations it's usually unused, this token is expected to be generated only by the base models. |
1417
>
15-
> A multiturn-conversation with Meta Llama 3 follows this prompt template:
18+
> A multiturn-conversation with Meta Llama 3.1 that includes tool-calling follows this structure:
1619
> ```
1720
> <|begin_of_text|><|start_header_id|>system<|end_header_id|>
1821
>
1922
> {{ system_prompt }}<|eot_id|><|start_header_id|>user<|end_header_id|>
2023
>
2124
> {{ user_message_1 }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
2225
>
23-
> {{ model_answer_1 }}<|eot_id|><|start_header_id|>user<|end_header_id|>
26+
> <|python_tag|>{{ model_tool_call_1 }}<|eom_id|><|start_header_id|>ipython<|end_header_id|>
2427
>
25-
> {{ user_message_2 }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
28+
> {{ tool_response }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
29+
>
30+
> {{model_response_based_on_tool_response}}<|eot_id|>
2631
> ```
2732
> Each message gets trailed by an `<|eot_id|>` token before a new header is started, signaling a role change.
2833
>
29-
> More details on the new tokenizer and prompt template can be found [here](https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3#special-tokens-used-with-meta-llama-3).
34+
> More details on the new tokenizer and prompt template can be found [here](https://llama.meta.com/docs/model-cards-and-prompt-formats/llama3_1).
35+
3036
>
3137
> [!NOTE]
3238
> The llama-recipes repository was recently refactored to promote a better developer experience of using the examples. Some files have been moved to new locations. The `src/` folder has NOT been modified, so the functionality of this repo and package is not impacted.
@@ -139,6 +145,7 @@ Contains examples are organized in folders by topic:
139145
[use_cases](./recipes/use_cases)|Scripts showing common applications of Meta Llama3
140146
[3p_integrations](./recipes/3p_integrations)|Partner owned folder showing common applications of Meta Llama3
141147
[responsible_ai](./recipes/responsible_ai)|Scripts to use PurpleLlama for safeguarding model outputs
148+
[experimental](./recipes/experimental)|Meta Llama implementations of experimental LLM techniques
142149

143150
### `src/`
144151

@@ -160,7 +167,9 @@ Please read [CONTRIBUTING.md](CONTRIBUTING.md) for details on our code of conduc
160167
## License
161168
<!-- markdown-link-check-disable -->
162169

163-
See the License file for Meta Llama 3 [here](https://llama.meta.com/llama3/license/) and Acceptable Use Policy [here](https://llama.meta.com/llama3/use-policy/)
170+
See the License file for Meta Llama 3.1 [here](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE) and Acceptable Use Policy [here](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/USE_POLICY.md)
171+
172+
See the License file for Meta Llama 3 [here](https://github.com/meta-llama/llama-models/blob/main/models/llama3/LICENSE) and Acceptable Use Policy [here](https://github.com/meta-llama/llama-models/blob/main/models/llama3/USE_POLICY.md)
164173

165-
See the License file for Meta Llama 2 [here](https://llama.meta.com/llama2/license/) and Acceptable Use Policy [here](https://llama.meta.com/llama2/use-policy/)
174+
See the License file for Meta Llama 2 [here](https://github.com/meta-llama/llama-models/blob/main/models/llama2/LICENSE) and Acceptable Use Policy [here](https://github.com/meta-llama/llama-models/blob/main/models/llama2/USE_POLICY.md)
166175
<!-- markdown-link-check-enable -->
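
The tool-calling structure described in the README hunk above can be sketched in Python as plain string assembly. This is only an illustration built from the token table and template shown in the diff; the helper names `header` and `build_tool_call_prompt` are hypothetical and are not part of llama-recipes or any tokenizer library.

```python
# Special tokens copied from the README table in the diff above.
BEGIN_OF_TEXT = "<|begin_of_text|>"
START_HEADER = "<|start_header_id|>"
END_HEADER = "<|end_header_id|>"
EOT = "<|eot_id|>"          # end of turn
EOM = "<|eom_id|>"          # end of message (a tool call needs to be made)
PYTHON_TAG = "<|python_tag|>"


def header(role: str) -> str:
    """Return the role header plus the blank line that precedes the message body."""
    return f"{START_HEADER}{role}{END_HEADER}\n\n"


def build_tool_call_prompt(system_prompt: str, user_message: str,
                           tool_call: str, tool_response: str) -> str:
    """Assemble the prompt up to the point where the assistant answers
    based on the tool response (hypothetical helper, for illustration)."""
    return (
        BEGIN_OF_TEXT
        + header("system") + system_prompt + EOT
        + header("user") + user_message + EOT
        + header("assistant") + PYTHON_TAG + tool_call + EOM
        + header("ipython") + tool_response + EOT
        + header("assistant")
    )
```

The returned string ends with an open assistant header, so the model's next generation would correspond to `{{model_response_based_on_tool_response}}` in the template. In practice a tokenizer's chat template would normally produce this string; the sketch only makes the token order explicit.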

docs/LLM_finetuning.md

Lines changed: 1 addition & 3 deletions
@@ -1,6 +1,6 @@
11
## LLM Fine-Tuning
22

3-
Here we discuss fine-tuning Meta Llama 3 with a couple of different recipes. We will cover two scenarios here:
3+
Here we discuss fine-tuning Meta Llama with a couple of different recipes. We will cover two scenarios here:
44

55

66
## 1. **Parameter Efficient Model Fine-Tuning**
@@ -18,8 +18,6 @@ These methods will address three aspects:
1818

1919
HF [PEFT](https://github.com/huggingface/peft) library provides an easy way of using these methods which we make use of here. Please read more [here](https://huggingface.co/blog/peft).
2020

21-
22-
2321
## 2. **Full/ Partial Parameter Fine-Tuning**
2422

2523
Full parameter fine-tuning has its own advantages, in this method there are multiple strategies that can help:

docs/multi_gpu.md

Lines changed: 3 additions & 4 deletions
@@ -6,13 +6,12 @@ To run fine-tuning on multi-GPUs, we will make use of two packages:
66

77
2. [FSDP](https://pytorch.org/tutorials/intermediate/FSDP_adavnced_tutorial.html) which helps us parallelize the training over multiple GPUs. [More details](LLM_finetuning.md/#2-full-partial-parameter-finetuning).
88

9-
Given the combination of PEFT and FSDP, we would be able to fine tune a Meta Llama 3 8B model on multiple GPUs in one node or multi-node.
9+
Given the combination of PEFT and FSDP, we would be able to fine tune a Meta Llama 8B model on multiple GPUs in one node.
10+
For big models like 405B we will need to fine-tune in a multi-node setup even if 4bit quantization is enabled.
1011

1112
## Requirements
1213
To run the examples, make sure to install the llama-recipes package and clone the github repository in order to use the provided [`finetuning.py`](../recipes/quickstart/finetuning/finetuning.py) script with torchrun (See [README.md](../README.md) for details).
1314

14-
**Please note that the llama_recipes package will install PyTorch 2.0.1 version, in case you want to run FSDP + PEFT, please make sure to install PyTorch nightlies.**
15-
1615
## How to run it
1716

1817
Get access to a machine with multiple GPUs ( in this case we tested with 4 A100 and A10s).
@@ -61,7 +60,7 @@ torchrun --nnodes 1 --nproc_per_node 8 recipes/quickstart/finetuning/finetuning
6160
This has been tested on 4 H100s GPUs.
6261

6362
```bash
64-
FSDP_CPU_RAM_EFFICIENT_LOADING=1 ACCELERATE_USE_FSDP=1 torchrun --nnodes 1 --nproc_per_node 4 finetuning.py --enable_fsdp --quantization int4 --model_name /path_of_model_folder/70B --mixed_precision False --low_cpu_fsdp --use_peft --peft_method lora --output_dir Path/to/save/PEFT/model
63+
FSDP_CPU_RAM_EFFICIENT_LOADING=1 ACCELERATE_USE_FSDP=1 torchrun --nnodes 1 --nproc_per_node 4 finetuning.py --enable_fsdp --quantization 4bit --model_name /path_of_model_folder/70B --mixed_precision False --low_cpu_fsdp --use_peft --peft_method lora --output_dir Path/to/save/PEFT/model
6564
```
6665

6766
### Fine-tuning using FSDP on 70B Model

recipes/3p_integrations/lamini/text2sql_memory_tuning/meta_lamini.ipynb

Lines changed: 9 additions & 9 deletions
@@ -145,7 +145,7 @@
145145
"class Args:\n",
146146
" def __init__(self, \n",
147147
" max_examples=100, \n",
148-
" sql_model_name=\"meta-llama/Meta-Llama-3-8B-Instruct\", \n",
148+
" sql_model_name=\"meta-llama/Meta-Llama-3.1-8B-Instruct\", \n",
149149
" gold_file_name=\"gold-test-set.jsonl\",\n",
150150
" training_file_name=\"generated_queries.jsonl\",\n",
151151
" num_to_generate=10):\n",
@@ -197,7 +197,7 @@
197197
}
198198
],
199199
"source": [
200-
"llm = lamini.Lamini(model_name=\"meta-llama/Meta-Llama-3-8B-Instruct\")\n",
200+
"llm = lamini.Lamini(model_name=\"meta-llama/Meta-Llama-3.1-8B-Instruct\")\n",
201201
"\n",
202202
"question = \"\"\"Who is the highest paid NBA player?\"\"\"\n",
203203
"system = f\"\"\"You are an NBA analyst with 15 years of experience writing complex SQL queries. Consider the nba_roster table with the following schema:\n",
@@ -418,7 +418,7 @@
418418
"class ScoreStage(GenerationNode):\n",
419419
" def __init__(self):\n",
420420
" super().__init__(\n",
421-
" model_name=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n",
421+
" model_name=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
422422
" max_new_tokens=150,\n",
423423
" )\n",
424424
"\n",
@@ -712,7 +712,7 @@
712712
"class ModelStage(GenerationNode):\n",
713713
" def __init__(self):\n",
714714
" super().__init__(\n",
715-
" model_name=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n",
715+
" model_name=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
716716
" max_new_tokens=300,\n",
717717
" )\n",
718718
"\n",
@@ -808,7 +808,7 @@
808808
"class QuestionStage(GenerationNode):\n",
809809
" def __init__(self):\n",
810810
" super().__init__(\n",
811-
" model_name=\"meta-llama/Meta-Llama-3-8B-Instruct\",\n",
811+
" model_name=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
812812
" max_new_tokens=150,\n",
813813
" )\n",
814814
"\n",
@@ -1055,7 +1055,7 @@
10551055
],
10561056
"source": [
10571057
"args = Args()\n",
1058-
"llm = lamini.Lamini(model_name=\"meta-llama/Meta-Llama-3-8B-Instruct\")\n",
1058+
"llm = lamini.Lamini(model_name=\"meta-llama/Meta-Llama-3.1-8B-Instruct\")\n",
10591059
"\n",
10601060
"dataset = get_dataset(args, make_question)\n",
10611061
"finetune_args = get_default_finetune_args()\n",
@@ -1601,7 +1601,7 @@
16011601
],
16021602
"source": [
16031603
"args = Args(training_file_name=\"archive/generated_queries_large_filtered_cleaned.jsonl\")\n",
1604-
"llm = lamini.Lamini(model_name=\"meta-llama/Meta-Llama-3-8B-Instruct\")\n",
1604+
"llm = lamini.Lamini(model_name=\"meta-llama/Meta-Llama-3.1-8B-Instruct\")\n",
16051605
"\n",
16061606
"dataset = get_dataset(args, make_question)\n",
16071607
"finetune_args = get_default_finetune_args()\n",
@@ -1798,7 +1798,7 @@
17981798
],
17991799
"source": [
18001800
"args = Args(training_file_name=\"generated_queries_v2.jsonl\")\n",
1801-
"llm = lamini.Lamini(model_name=\"meta-llama/Meta-Llama-3-8B-Instruct\")\n",
1801+
"llm = lamini.Lamini(model_name=\"meta-llama/Meta-Llama-3.1-8B-Instruct\")\n",
18021802
"\n",
18031803
"dataset = get_dataset(args, make_question)\n",
18041804
"finetune_args = get_default_finetune_args()\n",
@@ -1966,7 +1966,7 @@
19661966
],
19671967
"source": [
19681968
"args = Args(training_file_name=\"archive/generated_queries_v2_large_filtered_cleaned.jsonl\")\n",
1969-
"llm = lamini.Lamini(model_name=\"meta-llama/Meta-Llama-3-8B-Instruct\")\n",
1969+
"llm = lamini.Lamini(model_name=\"meta-llama/Meta-Llama-3.1-8B-Instruct\")\n",
19701970
"\n",
19711971
"dataset = get_dataset(args, make_question)\n",
19721972
"finetune_args = get_default_finetune_args()\n",

recipes/3p_integrations/lamini/text2sql_memory_tuning/util/parse_arguments.py

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@ def parse_arguments():
1616
parser.add_argument(
1717
"--sql-model-name",
1818
type=str,
19-
default="meta-llama/Meta-Llama-3-8B-Instruct",
19+
default="meta-llama/Meta-Llama-3.1-8B-Instruct",
2020
help="The model to use for text2sql",
2121
required=False,
2222
)
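
For context, the effect of the changed default can be sketched as a standalone script. Only the `--sql-model-name` argument is taken from the hunk above; the parser construction and the optional `argv` parameter are assumptions added so the sketch can be run on its own, and the real `parse_arguments()` may define further arguments.

```python
import argparse


def parse_arguments(argv=None):
    # Assumption: the real function builds its parser in a similar way.
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--sql-model-name",
        type=str,
        default="meta-llama/Meta-Llama-3.1-8B-Instruct",
        help="The model to use for text2sql",
        required=False,
    )
    # argv=None makes argparse fall back to sys.argv[1:].
    return parser.parse_args(argv)
```

Callers that do not pass `--sql-model-name` would now get the Llama 3.1 model; passing the flag explicitly still overrides the default.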

recipes/3p_integrations/llama_on_prem.md

Lines changed: 9 additions & 9 deletions
@@ -8,7 +8,7 @@ We'll use the Amazon EC2 instance running Ubuntu with an A10G 24GB GPU as an exa
88

99
The Colab notebook to connect via LangChain with Llama 3 hosted as the vLLM and TGI API services is [here](https://colab.research.google.com/drive/1rYWLdgTGIU1yCHmRpAOB2D-84fPzmOJg), also shown in the sections below.
1010

11-
This tutorial assumes that you you have been granted access to the Meta Llama 3 on Hugging Face - you can open a Hugging Face Meta model page [here](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) to confirm that you see "Gated model You have been granted access to this model"; if you see "You need to agree to share your contact information to access this model", simply complete and submit the form in the page.
11+
This tutorial assumes that you you have been granted access to the Meta Llama 3 on Hugging Face - you can open a Hugging Face Meta model page [here](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) to confirm that you see "Gated model You have been granted access to this model"; if you see "You need to agree to share your contact information to access this model", simply complete and submit the form in the page.
1212

1313
You'll also need your Hugging Face access token which you can get at your Settings page [here](https://huggingface.co/settings/tokens).
1414

@@ -33,7 +33,7 @@ There are two ways to deploy Llama 3 via vLLM, as a general API server or an Ope
3333
Run the command below to deploy vLLM as a general Llama 3 service:
3434

3535
```
36-
python -m vllm.entrypoints.api_server --host 0.0.0.0 --port 5000 --model meta-llama/Meta-Llama-3-8B-Instruct
36+
python -m vllm.entrypoints.api_server --host 0.0.0.0 --port 5000 --model meta-llama/Meta-Llama-3.1-8B-Instruct
3737
```
3838

3939
Then on another terminal you can run:
@@ -68,13 +68,13 @@ Also, if you have multiple GPUs, you can add the `--tensor-parallel-size` argume
6868
git clone https://github.com/vllm-project/vllm
6969
cd vllm/vllm/entrypoints
7070
conda activate llama3
71-
python api_server.py --host 0.0.0.0 --port 5000 --model meta-llama/Meta-Llama-3-8B-Instruct --tensor-parallel-size 4
71+
python api_server.py --host 0.0.0.0 --port 5000 --model meta-llama/Meta-Llama-3.1-8B-Instruct --tensor-parallel-size 4
7272
```
7373

7474
With multiple GPUs, you can also run replica of models as long as your model size can fit into targeted GPU memory. For example, if you have two A10G with 24 GB memory, you can run two Llama 3 8B models at the same time. This can be done by launching two api servers each targeting specific CUDA cores on different ports:
75-
`CUDA_VISIBLE_DEVICES=0 python api_server.py --host 0.0.0.0 --port 5000 --model meta-llama/Meta-Llama-3-8B-Instruct`
75+
`CUDA_VISIBLE_DEVICES=0 python api_server.py --host 0.0.0.0 --port 5000 --model meta-llama/Meta-Llama-3.1-8B-Instruct`
7676
and
77-
`CUDA_VISIBLE_DEVICES=1 python api_server.py --host 0.0.0.0 --port 5001 --model meta-llama/Meta-Llama-3-8B-Instruct`
77+
`CUDA_VISIBLE_DEVICES=1 python api_server.py --host 0.0.0.0 --port 5001 --model meta-llama/Meta-Llama-3.1-8B-Instruct`
7878
The benefit would be that you can balance incoming requests to both models, reaching higher batch size processing for a trade-off of generation latency.
7979

8080

@@ -83,14 +83,14 @@ The benefit would be that you can balance incoming requests to both models, reac
8383
You can also deploy the vLLM hosted Llama 3 as an OpenAI-Compatible service to easily replace code using OpenAI API. First, run the command below:
8484

8585
```
86-
python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 5000 --model meta-llama/Meta-Llama-3-8B-Instruct
86+
python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 5000 --model meta-llama/Meta-Llama-3.1-8B-Instruct
8787
```
8888

8989
Then on another terminal, run:
9090

9191
```
9292
curl http://localhost:5000/v1/completions -H "Content-Type: application/json" -d '{
93-
"model": "meta-llama/Meta-Llama-3-8B-Instruct",
93+
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
9494
"prompt": "Who wrote the book Innovators dilemma?",
9595
"max_tokens": 300,
9696
"temperature": 0
@@ -118,7 +118,7 @@ from langchain.llms import VLLMOpenAI
118118
llm = VLLMOpenAI(
119119
openai_api_key="EMPTY",
120120
openai_api_base="http://<vllm_server_ip_address>:5000/v1",
121-
model_name="meta-llama/Meta-Llama-3-8B-Instruct",
121+
model_name="meta-llama/Meta-Llama-3.1-8B-Instruct",
122122
)
123123
124124
print(llm("Who wrote the book godfather?"))
@@ -136,7 +136,7 @@ You can now use the Llama 3 instance `llm` created this way in any of the demo a
136136
The easiest way to deploy Llama 3 with TGI is using its official docker image. First, replace `<your_hugging_face_access_token>` and set the three required shell variables (you may replace the `model` value above with another Llama 3 model):
137137

138138
```
139-
model=meta-llama/Meta-Llama-3-8B-Instruct
139+
model=meta-llama/Meta-Llama-3.1-8B-Instruct
140140
volume=$PWD/data
141141
token=<your_hugging_face_access_token>
142142
```
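
The `curl` call against the OpenAI-compatible vLLM endpoint shown earlier in this file's diff can also be sketched with the Python standard library. The host, port, model name, and request fields are taken from that command; the function names are hypothetical, and the response parsing assumes the usual OpenAI completions shape (`choices[0].text`).

```python
import json
import urllib.request


def build_completion_request(prompt,
                             base_url="http://localhost:5000/v1",
                             model="meta-llama/Meta-Llama-3.1-8B-Instruct"):
    """Build (but do not send) a POST request mirroring the curl example."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": 300,
        "temperature": 0,
    }
    return urllib.request.Request(
        f"{base_url}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and return the first completion's text.
    Assumes an OpenAI-style response body with a `choices` list."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)["choices"][0]["text"]
```

With the server from the diff running, `send(build_completion_request("Who wrote the book Innovators dilemma?"))` would issue the same request as the curl command.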
