## Fine-Tuning Meta Llama Multi Modal Models recipe

Here we discuss fine-tuning the Meta Llama 3.2 11B and 90B vision models. This recipe steps you through how to finetune a Llama 3.2 vision model on the VQA task using the [the_cauldron](https://huggingface.co/datasets/HuggingFaceM4/the_cauldron) dataset.

### Concepts

**Model Architecture**

We need to add a new processor class that handles the image processing together with the text tokenization for these multimodal models.

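As a rough illustration of what that processor does, the sketch below uses the Hugging Face `transformers` `AutoProcessor` (which should resolve to `MllamaProcessor` for these checkpoints) to turn one image and one prompt into model-ready tensors. The image path and prompt text are placeholders:

```python
from PIL import Image
from transformers import AutoProcessor

# Placeholder checkpoint and image path -- adjust to your setup.
model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example_image.jpg").convert("RGB")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    }
]

# The chat template inserts the image placeholder token into the prompt; the processor
# then tokenizes the text and preprocesses the image in a single call.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt")
print(inputs.keys())  # input_ids, attention_mask, pixel_values, aspect-ratio info, ...
```
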
### Fine-tuning steps

1. Download the dataset (an example record is shown in the sketch after this list).
2. Set up the processor (see the Concepts section above for an example).
3. Load the dataset.

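A minimal sketch of the download/load step, assuming the Hugging Face `datasets` library; the `ocrvqa` subset of the_cauldron is used here as an illustrative config name, and the field names in the comments follow the_cauldron's published schema:

```python
from datasets import load_dataset

# the_cauldron is a collection of sub-datasets; "ocrvqa" is one VQA subset.
dataset = load_dataset("HuggingFaceM4/the_cauldron", "ocrvqa", split="train")

# Each record holds a list of PIL images plus a list of question/answer turns.
sample = dataset[0]
print(sample["images"])  # [<PIL.Image.Image ...>]
print(sample["texts"])   # [{"user": "...", "assistant": "...", "source": "..."}]
```
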
For **full finetuning with FSDP**, we can run the following code:
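
A representative launch command is sketched below. This is an assumption, not the repository's exact reference command: the GPU count, hyperparameters, and the precise set of flags supported by `finetuning.py` depend on your checkout of the repo.

```bash
# Hypothetical 4-GPU FSDP run; adjust paths, flags, and hyperparameters to your setup.
torchrun --nnodes 1 --nproc_per_node 4 recipes/quickstart/finetuning/finetuning.py \
    --enable_fsdp \
    --model_name meta-llama/Llama-3.2-11B-Vision-Instruct \
    --batch_size_training 2 \
    --num_epochs 3 \
    --lr 1e-5 \
    --batching_strategy padding
```
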
**Note**: `--batching_strategy padding` is required, as the vision model does not work with the `packing` strategy.

For more details about the finetuning configurations, please read the [finetuning readme](./README.md).

### How to use a custom dataset to fine-tune the vision model

1. Create a new dataset Python file under the `recipes/quickstart/finetuning/dataset` folder (a minimal sketch of such a file is shown after this list).
2. In this Python file, define a `get_custom_dataset(dataset_config, processor, split, split_ratio=0.9)` function that handles the data loading.
3. In the same file, define a `get_data_collator(processor)` function that returns a custom data collator usable by the PyTorch DataLoader.
4. This custom data collator class must have a `__call__(self, samples)` function that converts the image and text samples into the actual inputs that the vision model expects.

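Putting those pieces together, here is a minimal, hypothetical sketch of such a dataset file. The file name, the choice of the `ocrvqa` subset of the_cauldron, and the simple label masking are illustrative assumptions, not the repository's reference implementation:

```python
# vqa_custom_dataset.py -- hypothetical example under recipes/quickstart/finetuning/dataset/
from datasets import load_dataset


def get_custom_dataset(dataset_config, processor, split, split_ratio=0.9):
    # the_cauldron subsets ship a single "train" split, so we carve out our own eval split.
    dataset = load_dataset("HuggingFaceM4/the_cauldron", "ocrvqa", split="train")
    dataset = dataset.train_test_split(test_size=1 - split_ratio, seed=42)
    return dataset["train"] if split == "train" else dataset["test"]


class VQADataCollator:
    def __init__(self, processor):
        self.processor = processor

    def __call__(self, samples):
        texts, images = [], []
        for sample in samples:
            # Each record carries a list of images and a list of question/answer turns;
            # for simplicity we only use the first image and the first turn.
            image = sample["images"][0].convert("RGB")
            turn = sample["texts"][0]
            messages = [
                {"role": "user", "content": [
                    {"type": "image"},
                    {"type": "text", "text": turn["user"]},
                ]},
                {"role": "assistant", "content": [
                    {"type": "text", "text": turn["assistant"]},
                ]},
            ]
            texts.append(self.processor.apply_chat_template(messages))
            images.append([image])

        # Tokenize the text and preprocess the images together, padding to the longest
        # sequence in the batch (which is why --batching_strategy padding is required).
        batch = self.processor(text=texts, images=images, padding=True, return_tensors="pt")

        # Naive label masking: train on every non-padding token. A fuller collator would
        # usually also mask the prompt tokens so the loss only covers the answers.
        labels = batch["input_ids"].clone()
        labels[labels == self.processor.tokenizer.pad_token_id] = -100
        batch["labels"] = labels
        return batch


def get_data_collator(processor):
    return VQADataCollator(processor)
```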