
Commit 3470eaf

Upstream merge (meta-llama#677)
2 parents ee1768d + d729f9d

File tree

23 files changed: +926 -1169 lines changed


.github/scripts/spellcheck_conf/wordlist.txt

Lines changed: 3 additions & 0 deletions
@@ -1451,4 +1451,7 @@ openhathi
 sarvam
 subtask
 acc
+OCRVQA
+OCRVQADataCollator
+ocrvqa
 langchain

README.md

Lines changed: 14 additions & 32 deletions
@@ -1,47 +1,29 @@
 # Llama Recipes: Examples to get started using the Llama models from Meta
 <!-- markdown-link-check-disable -->
-The 'llama-recipes' repository is a companion to the [Meta Llama](https://github.com/meta-llama/llama-models) models. We support the latest version, [Llama 3.1](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md), in this repository. The goal is to provide a scalable library for fine-tuning Meta Llama models, along with some example scripts and notebooks to quickly get started with using the models in a variety of use-cases, including fine-tuning for domain adaptation and building LLM-based applications with Llama and other tools in the LLM ecosystem. The examples here showcase how to run Llama locally, in the cloud, and on-prem.
+The 'llama-recipes' repository is a companion to the [Meta Llama](https://github.com/meta-llama/llama-models) models. We support the latest version, [Llama 3.2 Vision](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/MODEL_CARD_VISION.md) and [Llama 3.2 Text](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/MODEL_CARD.md), in this repository. This repository contains example scripts and notebooks to get started with the models in a variety of use-cases, including fine-tuning for domain adaptation and building LLM-based applications with Llama and other tools in the LLM ecosystem. The examples here use Llama locally, in the cloud, and on-prem.
 
 <!-- markdown-link-check-enable -->
 > [!IMPORTANT]
-> Meta Llama 3.1 has a new prompt template and special tokens.
+> Llama 3.2 follows the same prompt template as Llama 3.1, with a new special token `<|image|>` representing the input image for the multimodal models.
+>
 > | Token | Description |
 > |---|---|
 > `<\|begin_of_text\|>` | Specifies the start of the prompt. |
+> `<\|image\|>` | Represents the image tokens passed as an input to Llama. |
 > `<\|eot_id\|>` | This token signifies the end of a turn i.e. the end of the model's interaction either with the user or tool executor. |
 > `<\|eom_id\|>` | End of Message. A message represents a possible stopping point where the model can inform the execution environment that a tool call needs to be made. |
 > `<\|python_tag\|>` | A special tag used in the model’s response to signify a tool call. |
 > `<\|finetune_right_pad_id\|>` | Used for padding text sequences in a batch to the same length. |
 > `<\|start_header_id\|>{role}<\|end_header_id\|>` | These tokens enclose the role for a particular message. The possible roles can be: system, user, assistant and ipython. |
 > `<\|end_of_text\|>` | This is equivalent to the EOS token. For multiturn-conversations it's usually unused, this token is expected to be generated only by the base models. |
 >
-> A multiturn-conversation with Meta Llama 3.1 that includes tool-calling follows this structure:
-> ```
-> <|begin_of_text|><|start_header_id|>system<|end_header_id|>
->
-> {{ system_prompt }}<|eot_id|><|start_header_id|>user<|end_header_id|>
->
-> {{ user_message_1 }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
->
-> <|python_tag|>{{ model_tool_call_1 }}<|eom_id|><|start_header_id|>ipython<|end_header_id|>
->
-> {{ tool_response }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
->
-> {{model_response_based_on_tool_response}}<|eot_id|>
-> ```
-> Each message gets trailed by an `<|eot_id|>` token before a new header is started, signaling a role change.
->
-> More details on the new tokenizer and prompt template can be found [here](https://llama.meta.com/docs/model-cards-and-prompt-formats/llama3_1).
+> More details on the prompt templates for image reasoning, tool-calling and code interpreter can be found [on the documentation website](https://llama.meta.com/docs/model-cards-and-prompt-formats/llama3_2).
+
 
->
-> [!NOTE]
-> The llama-recipes repository was recently refactored to promote a better developer experience of using the examples. Some files have been moved to new locations. The `src/` folder has NOT been modified, so the functionality of this repo and package is not impacted.
->
-> Make sure you update your local clone by running `git pull origin main`
 
 ## Table of Contents
 
-- [Llama Recipes: Examples to get started using the Meta Llama models from Meta](#llama-recipes-examples-to-get-started-using-the-llama-models-from-meta)
+- [Llama Recipes: Examples to get started using the Llama models from Meta](#llama-recipes-examples-to-get-started-using-the-llama-models-from-meta)
 - [Table of Contents](#table-of-contents)
 - [Getting Started](#getting-started)
 - [Prerequisites](#prerequisites)
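Putting the special tokens from the table above together, a minimal single-turn image prompt for the Llama 3.2 multimodal models looks roughly like the following (a sketch only; the linked Llama 3.2 documentation is the authoritative reference for the format):

```
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

<|image|>Describe the contents of this image.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

```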
@@ -117,23 +99,21 @@ pip install -e .[tests,auditnlg,vllm]
 ```
 
 
-### Getting the Meta Llama models
-You can find Meta Llama models on Hugging Face hub [here](https://huggingface.co/meta-llama), **where models with `hf` in the name are already converted to Hugging Face checkpoints so no further conversion is needed**. The conversion step below is only for original model weights from Meta that are hosted on Hugging Face model hub as well.
+### Getting the Llama models
+You can find Llama models on Hugging Face hub [here](https://huggingface.co/meta-llama), **where models with `hf` in the name are already converted to Hugging Face checkpoints so no further conversion is needed**. The conversion step below is only for original model weights from Meta that are hosted on Hugging Face model hub as well.
 
 #### Model conversion to Hugging Face
-The recipes and notebooks in this folder are using the Meta Llama model definition provided by Hugging Face's transformers library.
-
-Given that the original checkpoint resides under models/7B you can install all requirements and convert the checkpoint with:
+If you have the model checkpoints downloaded from the Meta website, you can convert it to the Hugging Face format with:
 
 ```bash
 ## Install Hugging Face Transformers from source
-pip freeze | grep transformers ## verify it is version 4.31.0 or higher
+pip freeze | grep transformers ## verify it is version 4.45.0 or higher
 
 git clone git@github.com:huggingface/transformers.git
 cd transformers
 pip install protobuf
 python src/transformers/models/llama/convert_llama_weights_to_hf.py \
-    --input_dir /path/to/downloaded/llama/weights --model_size 7B --output_dir /output/path
+    --input_dir /path/to/downloaded/llama/weights --model_size 3B --output_dir /output/path
 ```
 
 
@@ -196,6 +176,8 @@ Please read [CONTRIBUTING.md](CONTRIBUTING.md) for details on our code of conduc
 ## License
 <!-- markdown-link-check-disable -->
 
+See the License file for Meta Llama 3.2 [here](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE) and Acceptable Use Policy [here](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/USE_POLICY.md)
+
 See the License file for Meta Llama 3.1 [here](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE) and Acceptable Use Policy [here](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/USE_POLICY.md)
 
 See the License file for Meta Llama 3 [here](https://github.com/meta-llama/llama-models/blob/main/models/llama3/LICENSE) and Acceptable Use Policy [here](https://github.com/meta-llama/llama-models/blob/main/models/llama3/USE_POLICY.md)
recipes/quickstart/finetuning/datasets/ocrvqa_dataset.py

Lines changed: 90 additions & 0 deletions
@@ -0,0 +1,90 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the Llama 3 Community License Agreement.


import copy
from datasets import load_dataset
import itertools
import torch

# check system prompt token seq or user prompt token seq is in the current token list
def check_header(targets, seq):
    for i in range(len(seq) - 3):
        if seq[i:i+3] in targets:
            return True
    return False

def replace_target(target, seq):
    for i in range(len(seq) - 3):
        if seq[i:i+3] == target:
            seq[i], seq[i+1], seq[i+2] = -100, -100, -100
    return seq

def tokenize_dialogs(dialogs, images, processor):
    text_prompt = processor.apply_chat_template(dialogs)
    batch = processor(images=images, text=text_prompt, padding=True, return_tensors="pt")
    label_list = []
    for i in range(len(batch["input_ids"])):
        dialog_tokens = batch["input_ids"][i].tolist()
        labels = copy.copy(dialog_tokens)
        eot_indices = [i for i, n in enumerate(labels) if n == 128009]
        last_idx = 0
        # system prompt header "<|start_header_id|>system<|end_header_id|>" has been tokenized to [128006, 9125, 128007]
        # user prompt header "<|start_header_id|>user<|end_header_id|>" has been tokenized to [128006, 882, 128007]
        prompt_header_seqs = [[128006, 9125, 128007], [128006, 882, 128007]]
        for n, idx in enumerate(eot_indices):
            current_seq = labels[last_idx:idx+1]
            if check_header(prompt_header_seqs, current_seq):
                # found prompt header, indicating that this seq should be masked
                labels[last_idx:idx+1] = [-100] * (idx - last_idx + 1)
            else:
                last_idx = idx + 1
        # Mask all the assistant header prompt <|start_header_id|>assistant<|end_header_id|>, which has been tokenized to [128006, 78191, 128007]
        assistant_header_seq = [128006, 78191, 128007]
        labels = replace_target(assistant_header_seq, labels)
        # Mask the padding token and image token 128256
        for i in range(len(labels)):
            if labels[i] == processor.tokenizer.pad_token_id or labels[i] == 128256:  # 128256 is image token index
                labels[i] = -100
        label_list.append(labels)
    batch["labels"] = torch.tensor(label_list)
    return batch


def get_custom_dataset(dataset_config, processor, split, split_ratio=0.9):
    # load_dataset will return DatasetDict that contains all the data in the train set
    dataset_dict = load_dataset("HuggingFaceM4/the_cauldron", name="ocrvqa")
    dataset = dataset_dict['train']
    # Comment out the following line to use the full dataset; for quick testing only use 2000 samples
    dataset = dataset.select(range(2000))
    dataset = dataset.train_test_split(test_size=1-split_ratio, shuffle=True, seed=42)[split]
    return dataset

class OCRVQADataCollator:
    def __init__(self, processor):
        self.processor = processor
        self.processor.tokenizer.padding_side = "right"  # during training, one always uses padding on the right
    def __call__(self, samples):
        dialogs, images = [], []
        for sample in samples:
            image_list, sample_list = sample["images"], sample["texts"]
            if len(image_list) > 1:
                raise ValueError("Only support one image per sample")
            image = image_list[0].convert("RGB")  # only use the first image
            dialog = []
            for sample_dict in sample_list:
                if not dialog:
                    # only append image to the first sentence
                    dialog += [
                        {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": sample_dict["user"].strip()}]},
                        {"role": "assistant", "content": [{"type": "text", "text": sample_dict["assistant"].strip()}]}
                    ]
                else:
                    dialog += [
                        {"role": "user", "content": [{"type": "text", "text": sample_dict["user"].strip()}]},
                        {"role": "assistant", "content": [{"type": "text", "text": sample_dict["assistant"].strip()}]}
                    ]
            dialogs.append(dialog)
            images.append([image])
        return tokenize_dialogs(dialogs, images, self.processor)

def get_data_collator(processor):
    return OCRVQADataCollator(processor)
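To see how the two entry points above fit together, here is a rough usage sketch. It is not part of the recipe itself: llama-recipes wires this up for you via `--custom_dataset.file`, and the processor checkpoint, batch size and import path are assumptions for illustration only.

```python
# Minimal sketch: load the OCRVQA split and batch it with the collator.
# Assumes ocrvqa_dataset.py is importable and the 11B Vision-Instruct processor is used.
from torch.utils.data import DataLoader
from transformers import MllamaProcessor

from ocrvqa_dataset import get_custom_dataset, get_data_collator

processor = MllamaProcessor.from_pretrained("meta-llama/Llama-3.2-11B-Vision-Instruct")
train_set = get_custom_dataset(None, processor, split="train")  # dataset_config is unused by this recipe
collator = get_data_collator(processor)

loader = DataLoader(train_set, batch_size=2, collate_fn=collator)
batch = next(iter(loader))
print(batch["input_ids"].shape, batch["labels"].shape)  # masked labels align with input_ids
```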
Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
## Fine-Tuning Meta Llama Multi Modal Models recipe
This recipe steps you through how to finetune a Llama 3.2 vision model on the OCR VQA task using the [OCRVQA](https://huggingface.co/datasets/HuggingFaceM4/the_cauldron/viewer/ocrvqa?row=0) dataset.

**Disclaimer**: As our vision models already have very good OCR ability, we use the OCRVQA dataset here only to demonstrate the steps required to fine-tune our vision models with llama-recipes.

### Fine-tuning steps

We created an example script [ocrvqa_dataset.py](./datasets/ocrvqa_dataset.py) that loads the OCRVQA dataset with a `get_custom_dataset` function and provides an `OCRVQADataCollator` class to process the image dataset.

For **full finetuning with FSDP**, we can run the following code:

```bash
torchrun --nnodes 1 --nproc_per_node 4 recipes/quickstart/finetuning/finetuning.py --enable_fsdp --lr 1e-5 --num_epochs 3 --batch_size_training 2 --model_name meta-llama/Llama-3.2-11B-Vision-Instruct --dist_checkpoint_root_folder ./finetuned_model --dist_checkpoint_folder fine-tuned --use_fast_kernels --dataset "custom_dataset" --custom_dataset.test_split "test" --custom_dataset.file "recipes/quickstart/finetuning/datasets/ocrvqa_dataset.py" --run_validation True --batching_strategy padding
```

For **LoRA finetuning with FSDP**, we can run the following code:

```bash
torchrun --nnodes 1 --nproc_per_node 4 recipes/quickstart/finetuning/finetuning.py --enable_fsdp --lr 1e-5 --num_epochs 3 --batch_size_training 2 --model_name meta-llama/Llama-3.2-11B-Vision-Instruct --dist_checkpoint_root_folder ./finetuned_model --dist_checkpoint_folder fine-tuned --use_fast_kernels --dataset "custom_dataset" --custom_dataset.test_split "test" --custom_dataset.file "recipes/quickstart/finetuning/datasets/ocrvqa_dataset.py" --run_validation True --batching_strategy padding --use_peft --peft_method lora
```
**Note**: `--batching_strategy padding` is needed because the vision model does not work with the `packing` method.

For more details about the finetuning configurations, please read the [finetuning readme](./README.md).

### How to use a custom dataset to fine-tune the vision model

In order to use a custom dataset, please follow the steps below (a minimal skeleton is sketched after this list):

1. Create a new dataset python file under the `recipes/quickstart/finetuning/datasets` folder.
2. In this python file, define a `get_custom_dataset(dataset_config, processor, split, split_ratio=0.9)` function that handles the data loading.
3. In this python file, define a `get_data_collator(processor)` function that returns a custom data collator that can be used by the PyTorch Data Loader.
4. This custom data collator class must have a `__call__(self, samples)` function that converts the image and text samples into the actual inputs the vision model expects.
5. Run the `torchrun` command from the section above, changing `--custom_dataset.file` to the new dataset python file and adjusting the learning rate accordingly.
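As a reference for steps 1-4, a minimal custom-dataset skeleton could look like the following. This is a sketch only: the file name `my_vqa_dataset.py`, the dataset id and the column names (`image`, `question`, `answer`) are placeholders to replace with your own, and the label masking is deliberately simplified compared to `ocrvqa_dataset.py`.

```python
# my_vqa_dataset.py -- hypothetical skeleton of a custom vision dataset module.
# Place it under recipes/quickstart/finetuning/datasets/ and point --custom_dataset.file at it.
from datasets import load_dataset


def get_custom_dataset(dataset_config, processor, split, split_ratio=0.9):
    # Load your own image/text dataset and return the requested split.
    dataset = load_dataset("your-org/your-vqa-dataset")["train"]  # placeholder dataset id
    return dataset.train_test_split(test_size=1 - split_ratio, shuffle=True, seed=42)[split]


class MyVQADataCollator:
    def __init__(self, processor):
        self.processor = processor
        self.processor.tokenizer.padding_side = "right"

    def __call__(self, samples):
        dialogs, images = [], []
        for sample in samples:
            # Adapt the column names ("image", "question", "answer") to your dataset.
            dialogs.append([
                {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": sample["question"]}]},
                {"role": "assistant", "content": [{"type": "text", "text": sample["answer"]}]},
            ])
            images.append([sample["image"].convert("RGB")])
        text = self.processor.apply_chat_template(dialogs)
        batch = self.processor(images=images, text=text, padding=True, return_tensors="pt")
        labels = batch["input_ids"].clone()
        # Mask padding and image tokens so the loss is computed on text tokens only
        # (see ocrvqa_dataset.py for full prompt masking).
        labels[labels == self.processor.tokenizer.pad_token_id] = -100
        labels[labels == 128256] = -100  # 128256 is the image token index
        batch["labels"] = labels
        return batch


def get_data_collator(processor):
    return MyVQADataCollator(processor)
```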

recipes/quickstart/inference/local_inference/README.md

Lines changed: 8 additions & 1 deletion
@@ -1,5 +1,12 @@
 # Local Inference
 
+For Multi-Modal inference we have added [multi_modal_infer.py](multi_modal_infer.py), which uses the transformers library.
+
+You can run it as follows:
+```
+python multi_modal_infer.py --image_path "./resources/image.jpg" --prompt_text "Describe this image" --temperature 0.5 --top_p 0.8 --model_name "meta-llama/Llama-3.2-11B-Vision-Instruct"
+```
+
 For local inference we have provided an [inference script](inference.py). Depending on the type of finetuning performed during training the [inference script](inference.py) takes different arguments.
 To finetune all model parameters the output dir of the training has to be given as --model_name argument.
 In the case of a parameter efficient method like lora the base model has to be given as --model_name and the output dir of the training has to be given as --peft_model argument.
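For example, an inference run on a LoRA-finetuned model would look roughly like this, using only the arguments described above (the angle-bracket values are placeholders):

```bash
# Placeholders: substitute your base model, PEFT training output dir and prompt file.
python inference.py --model_name <base_model_name_or_path> --peft_model <peft_training_output_dir> --prompt_file <test_prompt_file>
```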
@@ -87,4 +94,4 @@ python inference.py --model_name <training_config.output_dir> --prompt_file <tes
 
 ## Inference on large models like Meta Llama 405B
 The FP8 quantized variants of Meta Llama (i.e. meta-llama/Meta-Llama-3.1-405B-FP8 and meta-llama/Meta-Llama-3.1-405B-Instruct-FP8) can be executed on a single node with 8x80GB H100 using the scripts located in this folder.
-To run the unquantized Meta Llama 405B variants (i.e. meta-llama/Meta-Llama-3.1-405B and meta-llama/Meta-Llama-3.1-405B-Instruct) we need to use a multi-node setup for inference. The llama-recipes inference script currently does not allow multi-node inference. To run this model you can use vLLM with pipeline and tensor parallelism as shown in [this example](../../../3p_integrations/vllm/README.md).
+To run the unquantized Meta Llama 405B variants (i.e. meta-llama/Meta-Llama-3.1-405B and meta-llama/Meta-Llama-3.1-405B-Instruct) we need to use a multi-node setup for inference. The llama-recipes inference script currently does not allow multi-node inference. To run this model you can use vLLM with pipeline and tensor parallelism as shown in [this example](../../../3p_integrations/vllm/README.md).
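As a rough sketch of what that setup involves (node counts are illustrative and the nodes must already be joined into one vLLM cluster; the linked vLLM example is the authoritative reference):

```bash
# Illustrative only: tensor parallelism across the 8 GPUs of each node,
# pipeline parallelism across 2 nodes.
vllm serve meta-llama/Meta-Llama-3.1-405B-Instruct \
    --tensor-parallel-size 8 \
    --pipeline-parallel-size 2
```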
recipes/quickstart/inference/local_inference/multi_modal_infer.py

Lines changed: 66 additions & 0 deletions
@@ -0,0 +1,66 @@
import os
import sys
import argparse
from PIL import Image as PIL_Image
import torch
from transformers import MllamaForConditionalGeneration, MllamaProcessor


# Constants
DEFAULT_MODEL = "meta-llama/Llama-3.2-11B-Vision-Instruct"


def load_model_and_processor(model_name: str, hf_token: str):
    """
    Load the model and processor based on the 11B or 90B model.
    """
    model = MllamaForConditionalGeneration.from_pretrained(model_name, device_map="auto", torch_dtype=torch.bfloat16, token=hf_token)
    processor = MllamaProcessor.from_pretrained(model_name, token=hf_token)
    return model, processor


def process_image(image_path: str) -> PIL_Image.Image:
    """
    Open and convert an image from the specified path.
    """
    if not os.path.exists(image_path):
        print(f"The image file '{image_path}' does not exist.")
        sys.exit(1)
    with open(image_path, "rb") as f:
        return PIL_Image.open(f).convert("RGB")


def generate_text_from_image(model, processor, image, prompt_text: str, temperature: float, top_p: float):
    """
    Generate text from an image using the model and processor.
    """
    conversation = [
        {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": prompt_text}]}
    ]
    prompt = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
    inputs = processor(prompt, image, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, temperature=temperature, top_p=top_p, max_new_tokens=512)
    return processor.decode(output[0])[len(prompt):]


def main(image_path: str, prompt_text: str, temperature: float, top_p: float, model_name: str, hf_token: str):
    """
    Call all the functions.
    """
    model, processor = load_model_and_processor(model_name, hf_token)
    image = process_image(image_path)
    result = generate_text_from_image(model, processor, image, prompt_text, temperature, top_p)
    print("Generated Text: " + result)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Generate text from an image and prompt using the 3.2 MM Llama model.")
    parser.add_argument("--image_path", type=str, help="Path to the image file")
    parser.add_argument("--prompt_text", type=str, help="Prompt text to describe the image")
    parser.add_argument("--temperature", type=float, default=0.7, help="Temperature for generation (default: 0.7)")
    parser.add_argument("--top_p", type=float, default=0.9, help="Top p for generation (default: 0.9)")
    parser.add_argument("--model_name", type=str, default=DEFAULT_MODEL, help=f"Model name (default: '{DEFAULT_MODEL}')")
    parser.add_argument("--hf_token", type=str, required=True, help="Hugging Face token for authentication")

    args = parser.parse_args()
    main(args.image_path, args.prompt_text, args.temperature, args.top_p, args.model_name, args.hf_token)

0 commit comments
