
Commit fa070a8

feat: support loading vision model (foundation-model-stack#451)
* install trl=0.13, deepspeed, update transformers
* deps: install pillow, uninstall deepspeed
* add multimodal flag, pass processor, add data collator
* load dataset directly, pass processor, fix field
* add generic data collator Signed-off-by: Anh Uong <[email protected]>
* remove load_dataset since HF support added Signed-off-by: Anh Uong <[email protected]>
* add fsdp config needed for llava models Signed-off-by: Anh Uong <[email protected]>
* feat:Use of data handlers for Vision LM support (foundation-model-stack#4)
* Changes to support vlms Signed-off-by: Abhishek <[email protected]>
* Change in kwargs Signed-off-by: Abhishek <[email protected]>
* Restructure of VisionDataCollator Signed-off-by: Abhishek <[email protected]>
* Usage of 2 handlers and modifying chat_template handler Signed-off-by: Abhishek <[email protected]>
* fix fmt+lint Signed-off-by: Abhishek <[email protected]>
* Minor Fix for unit test case Signed-off-by: Abhishek <[email protected]>
* Minor error handling Signed-off-by: Abhishek <[email protected]>

---------

Signed-off-by: Abhishek <[email protected]>

* replace text_field_name for dataset_text_field and for image Signed-off-by: Anh Uong <[email protected]>
* remove multimodal flag Signed-off-by: Anh Uong <[email protected]>
* fix formatting, remove unused fields Signed-off-by: Anh Uong <[email protected]>
* remove irrelevant unit test - in transformers v4.49 output_dir is no longer required Signed-off-by: Anh Uong <[email protected]>
* revert data loading back Signed-off-by: Anh Uong <[email protected]>
* fix:Support loading for Granite-3.2 Vision Model
* Changes to support vlms Signed-off-by: Abhishek <[email protected]>
* Change in kwargs Signed-off-by: Abhishek <[email protected]>
* Restructure of VisionDataCollator Signed-off-by: Abhishek <[email protected]>
* Usage of 2 handlers and modifying chat_template handler Signed-off-by: Abhishek <[email protected]>
* fix fmt+lint Signed-off-by: Abhishek <[email protected]>
* Minor Fix for unit test case Signed-off-by: Abhishek <[email protected]>
* Minor error handling Signed-off-by: Abhishek <[email protected]>
* Fix issues for granite vision preview model Signed-off-by: Abhishek <[email protected]>

---------

Signed-off-by: Abhishek <[email protected]>

* remove duplicate logger, fmt Signed-off-by: Anh Uong <[email protected]>
* fix unbound var, refactor tokenizer Signed-off-by: Anh Uong <[email protected]>
* changes from review comments Signed-off-by: Anh Uong <[email protected]>
* fix embedding resize and errors Signed-off-by: Anh Uong <[email protected]>
* add hack fix for vocab size for Mllama models Signed-off-by: Anh Uong <[email protected]>
* add docs on vision model usage Signed-off-by: Anh Uong <[email protected]>
* move llama vocab size, allow single image inputs Signed-off-by: Anh Uong <[email protected]>
* linter fixes Signed-off-by: Anh Uong <[email protected]>
* fix merge, add lora note Signed-off-by: Anh Uong <[email protected]>
* docs: organize sections Signed-off-by: Anh Uong <[email protected]>
* remove all dataset columns Signed-off-by: Anh Uong <[email protected]>
* only take single image for granite models Signed-off-by: Anh Uong <[email protected]>
* feat:Support Entire Vision dataset with Streaming (foundation-model-stack#6)
* Changes to support vlms Signed-off-by: Abhishek <[email protected]>
* Change in kwargs Signed-off-by: Abhishek <[email protected]>
* Restructure of VisionDataCollator Signed-off-by: Abhishek <[email protected]>
* Usage of 2 handlers and modifying chat_template handler Signed-off-by: Abhishek <[email protected]>
* fix fmt+lint Signed-off-by: Abhishek <[email protected]>
* Minor Fix for unit test case Signed-off-by: Abhishek <[email protected]>
* Minor error handling Signed-off-by: Abhishek <[email protected]>
* Fix issues for granite vision preview model Signed-off-by: Abhishek <[email protected]>
* Transformers version for running Llama model successfully Signed-off-by: Abhishek <[email protected]>
* Changes when enabling streaming
* Merge remote-tracking branch 'anh_vision_fms_hf_tuning/vision-model' into vision_support
* Merge with main Signed-off-by: Abhishek <[email protected]>
* modify apply_tokenizer_chat_template argument key Signed-off-by: Abhishek <[email protected]>
* resolve features for iterable dataset Signed-off-by: Abhishek <[email protected]>
* Add applying processor in collator and PR changes Signed-off-by: Abhishek <[email protected]>
* Rename Handler Signed-off-by: Abhishek <[email protected]>
* Add config for dataset streaming via arguments Signed-off-by: Abhishek <[email protected]>
* Fix column removal Signed-off-by: Abhishek <[email protected]>
* Convert to RGB for LlavaProcessor and model LlavaForConditionalGeneration Signed-off-by: Abhishek <[email protected]>
* PR CHANGES 1 Signed-off-by: Abhishek <[email protected]>
* PR Changes 2 Signed-off-by: Abhishek <[email protected]>
* Collator documentation Signed-off-by: Abhishek <[email protected]>
* Minor fix Signed-off-by: Abhishek <[email protected]>
* Resize input and output embeddings seperately for LLama vision model Signed-off-by: Abhishek <[email protected]>
* PR changes Signed-off-by: Abhishek <[email protected]>
* Documentation added Signed-off-by: Abhishek <[email protected]>
* Added processor to DataPreProcessor Signed-off-by: Abhishek <[email protected]>

---------

Signed-off-by: Abhishek <[email protected]>

* PR change of adding vocab size Signed-off-by: Abhishek <[email protected]>
* Added llama vision model and unit test case Signed-off-by: Abhishek <[email protected]>
* Make Jinja template work Signed-off-by: Abhishek <[email protected]>
* Fix for preprocessor_config in checkpoint folder Signed-off-by: Abhishek <[email protected]>
* fmt fix Signed-off-by: Abhishek <[email protected]>
* Moving resizing out of if block Signed-off-by: Abhishek <[email protected]>
* Test case fix and merging with main Signed-off-by: Abhishek <[email protected]>
* PR Change 1 Signed-off-by: Abhishek <[email protected]>
* PR Change 2 Signed-off-by: Abhishek <[email protected]>
* Added test_vision_data_collator Signed-off-by: Abhishek <[email protected]>
* PR Changes Signed-off-by: Abhishek <[email protected]>
* Comment change Signed-off-by: Abhishek <[email protected]>

---------

Signed-off-by: Anh Uong <[email protected]>
Signed-off-by: Abhishek <[email protected]>
Co-authored-by: Abhishek Maurya <[email protected]>
Co-authored-by: Abhishek <[email protected]>
1 parent 2d47acf commit fa070a8

26 files changed: +1254027 -52 lines changed

.pylintrc

Lines changed: 2 additions & 2 deletions
```diff
@@ -281,7 +281,7 @@ ignored-parents=
 max-args=5
 
 # Maximum number of attributes for a class (custom).
-max-attributes=10
+max-attributes=15
 
 # Maximum number of boolean expressions in an if statement (see R0916).
 max-bool-expr=5
@@ -299,7 +299,7 @@ max-parents=7
 max-public-methods=20
 
 # Maximum number of return / yield for function / method body.
-max-returns=6
+max-returns=10
 
 # Maximum number of statements in function / method body.
 max-statements=50
```

README.md

Lines changed: 31 additions & 2 deletions
````diff
@@ -13,6 +13,7 @@
 - [Fine Tuning](#fine-tuning)
 - [FMS Acceleration](#fms-acceleration)
 - [Extended Pre-Training](#extended-pre-training)
+- [Tuning Vision Language Models](#tuning-vision-language-models)
 - [Inference](#inference)
 - [Running a single example](#running-a-single-example)
 - [Running multiple examples](#running-multiple-examples)
@@ -39,15 +40,15 @@ pip install fms-hf-tuning
 ### Using FlashAttention
 
 > Note: After installing, if you wish to use [FlashAttention](https://github.com/Dao-AILab/flash-attention), then you need to install these requirements:
-```
+```sh
 pip install fms-hf-tuning[dev]
 pip install fms-hf-tuning[flash-attn]
 ```
 [FlashAttention](https://github.com/Dao-AILab/flash-attention) requires the [CUDA Toolit](https://developer.nvidia.com/cuda-toolkit) to be pre-installed.
 
 *Debug recommendation:* While training, if you encounter flash-attn errors such as `undefined symbol`, you can follow the below steps for clean installation of flash binaries. This may occur when having multiple environments sharing the pip cache directory or torch version is updated.
 
-```
+```sh
 pip uninstall flash-attn
 pip cache purge
 pip install fms-hf-tuning[flash-attn]
@@ -898,6 +899,34 @@ The `fms_acceleration.cli` can do more to search for all available configs, plug
 
 We also have support for extended pre training where users might wanna pretrain a model with large number of samples. Please refer our separate doc on [EPT Use Cases](./docs/ept.md)
 
+## Tuning Vision Language Models
+
+We also support full fine-tuning and LoRA tuning for vision language models - `Granite 3.2 Vision`, `Llama 3.2 Vision`, and `LLaVa-Next`.
+For information on supported dataset formats and how to tune a vision-language model, please see [this document](./docs/vision-language-model-tuning.md).
+
+### Supported vision model
+
+- Legend:
+
+✅ Ready and available
+
+✔️ Ready and available - compatible architecture
+
+🚫 Not supported
+
+? May be supported, but not tested
+
+Model Name & Size | Model Architecture | Full Finetuning |
+-------------------- | ---------------- | --------------- |
+Llama 3.2-11B Vision | MllamaForConditionalGeneration | ✅* |
+Llava 1.5-7B | LlavaForConditionalGeneration | ✅* |
+Granite 3.1-2B Vision | LlavaNextForConditionalGeneration | ✅* |
+Llava Mistral 1.6-7B | LlavaNextForConditionalGeneration | ✅* |
+
+(*) - Supported with `fms-hf-tuning` v2.8.0 or later.
+
+**Note**: vLLM currently does not support inference with LoRA-tuned vision models. To use a tuned LoRA adapter of vision model, please merge it with the base model before running vLLM inference.
+
 ## Inference
 Currently, we do *not* offer inference support as part of the library, but we provide a standalone script for running inference on tuned models for testing purposes. For a full list of options run `python scripts/run_inference.py --help`. Note that no data formatting / templating is applied at inference time.
````

docs/vision-language-model-tuning.md

Lines changed: 174 additions & 0 deletions (new file)

# Tuning Vision Language Models

Our library also supports full fine-tuning and LoRA tuning for vision language models.

## Supported Dataset Format

We support tuning an `image+text` dataset that includes:
- A single text field, formatted using the model's chat template.
- A single image field, which can contain either a list of images or a single image.

The text must follow the OpenAI conversational data format, which is defined as a list of message objects. Each message object must have two required fields: `role` and `content`:
- `role`: The speaker (e.g., "user" or "assistant").
- `content`: A list of dictionaries, each specifying:
  - `type`: `text` or `image`.
  - `text`: The text content (if applicable).

Example Format:

```json
[
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Who is this?"},
        {"type": "image"}
      ]
    },
    {
      "role": "assistant",
      "content": [
        {"type": "text", "text": "Barack Obama"}
      ]
    },
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What is he famous for?"}
      ]
    },
    {
      "role": "assistant",
      "content": [
        {"type": "text", "text": "He is the 44th President of the United States."}
      ]
    }
]
```
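For illustration only, a complete dataset row combining the two fields might look like the sketch below. The column names `messages` and `images` match the sample configuration later in this document; the image file name is hypothetical.

```python
# Sketch of a single "image+text" dataset row; not code from this repository.
from PIL import Image  # pillow is a dependency of the library

row = {
    # conversational text column (later referenced via dataset_text_field)
    "messages": [
        {"role": "user",
         "content": [{"type": "text", "text": "Who is this?"},
                     {"type": "image"}]},
        {"role": "assistant",
         "content": [{"type": "text", "text": "Barack Obama"}]},
    ],
    # image column (later referenced via dataset_image_field); it may hold a
    # single image or a list of images
    "images": [Image.open("obama.jpg")],  # hypothetical local file
}
```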
## Processing of dataset

First, each dataset sample is processed by applying the [chat template](https://huggingface.co/docs/transformers/main/en/chat_templating) to the raw text, which formats the conversation as required. Then, the model's [`processor`](https://huggingface.co/docs/transformers/main/en/processors) takes the formatted text and the corresponding image(s) and converts them into the final input representation (e.g., input_ids, attention masks, pixel values) that the model uses for training.

**Note**: The `Granite 3.2` and `Llava-1.5` vision models expect a single image for each dataset sample. If a list of images is provided, only the first image is used.
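As a rough illustration of these two steps, here is a sketch that uses the public Hugging Face `AutoProcessor` API rather than the library's internal data handlers; the model id is taken from the sample configuration below and the image file is hypothetical.

```python
from PIL import Image
from transformers import AutoProcessor

messages = [
    {"role": "user",
     "content": [{"type": "text", "text": "Who is this?"},
                 {"type": "image"}]},
]
image = Image.open("obama.jpg")  # hypothetical local image

processor = AutoProcessor.from_pretrained("ibm-granite/granite-vision-3.2-2b")

# Step 1: format the raw conversation with the model's chat template.
text = processor.apply_chat_template(messages, tokenize=False)

# Step 2: the processor converts the formatted text and image(s) into the
# tensors the model trains on (input_ids, attention_mask, pixel_values, ...).
inputs = processor(text=text, images=image, return_tensors="pt")
print(inputs.keys())
```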
## Tuning configurations

Two parameters must be passed to specify which dataset columns to use:
- `dataset_text_field`: The column name that contains the conversational text.
- `dataset_image_field`: The column name that contains the images.

Below is a sample configuration file:

```json
{
    "model_name_or_path": "ibm-granite/granite-vision-3.2-2b",
    "training_data_path": "HuggingFaceH4/llava-instruct-mix-vsft",
    "dataset_text_field": "messages",
    "dataset_image_field": "images",
    "output_dir": "/app/test",
    "num_train_epochs": 1.0,
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 2,
    "learning_rate": 1e-4,
    "bf16": true,
    "torch_dtype": "bfloat16",
    "use_flash_attn": true,
    "remove_unused_columns": false,
    "dataset_kwargs": {"skip_prepare_dataset": true},
    "gradient_checkpointing": true,
    "gradient_checkpointing_kwargs": {"use_reentrant": false},
    "accelerate_launch_args": {"fsdp_transformer_layer_cls_to_wrap": "GraniteDecoderLayer"}
}
```

## Running the Trainer

You can also run training by calling our trainer module directly from the command line: use `python` for a single GPU or `accelerate launch` for multiple GPUs. For example:

Command for single GPU:

```sh
python tuning/sft_trainer.py \
  --model_name_or_path $MODEL_PATH \
  --training_data_path $TRAIN_DATA_PATH \
  --output_dir $OUTPUT_PATH \
  --num_train_epochs 5 \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 1 \
  --learning_rate 1e-5 \
  --dataset_text_field "messages" \
  --dataset_image_field "images"
```

Command for multi GPU:

```sh
accelerate launch \
  --num_processes=$NUM_PROCESSORS \
  --config_file fixtures/accelerate_fsdp_defaults.yaml \
  tuning/sft_trainer.py \
  --model_name_or_path $MODEL_PATH \
  --training_data_path $TRAIN_DATA_PATH \
  --output_dir $OUTPUT_PATH \
  --num_train_epochs 5 \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 1 \
  --learning_rate 1e-5 \
  --dataset_text_field "messages" \
  --dataset_image_field "images"
```

## Tuning Considerations for vision models

Flash Attention 2.0 is not supported by `MllamaForConditionalGeneration` models, so when running tuning with the `Llama 3.2 Vision` models, set:

```json
"use_flash_attn": false
```

### Multi-GPU Tuning with FSDP

When running multi-GPU tuning with FSDP, you need to wrap specific transformer layers. Use the following setting in the FSDP config, based on your model:

Granite 3.2 Vision models:
```json
"accelerate_launch_args": {"fsdp_transformer_layer_cls_to_wrap": "GraniteDecoderLayer"}
```

Llava-Next and Llava-1.5 models:
```json
"accelerate_launch_args": {"fsdp_transformer_layer_cls_to_wrap": "LlamaDecoderLayer"}
```

Llava-1.6-Mistral model:
```json
"accelerate_launch_args": {"fsdp_transformer_layer_cls_to_wrap": "MistralDecoderLayer"}
```

Llama 3.2 Vision models: no additional configuration is required.
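When launching with `accelerate launch --config_file`, the same wrapping setting can instead be placed in the accelerate FSDP config file, mirroring the commented examples this commit adds to `fixtures/accelerate_fsdp_defaults.yaml`. A sketch for the Granite model:

```yaml
# excerpt of an accelerate FSDP config file (sketch)
fsdp_config:
  # needed for granite-3.2-vision model
  fsdp_transformer_layer_cls_to_wrap: "GraniteDecoderLayer"
```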
### Gradient Checkpointing

We recommend running with `gradient_checkpointing=True`, as enabling it greatly reduces the memory needed to load and run the model.

When running with gradient checkpointing for the `Llava` and `Granite` vision models, you also need to set `gradient_checkpointing_kwargs` so that the activation-checkpointing variant requiring reentrant autograd is not used:

```json
"gradient_checkpointing_kwargs": {"use_reentrant": false}
```

Without this setting, tuning fails with errors such as:

```sh
RuntimeError: mat2 must be a matrix, got 1-D tensor
RuntimeError: Expected weight to be of same shape as normalized_shape, but got weight of shape [0] and normalized_shape = [1152]
```

### Other arguments

To prevent default text-only processing and ensure proper handling of multimodal data, we recommend setting:

```json
"remove_unused_columns": false,
"dataset_kwargs": {"skip_prepare_dataset": true}
```

When performing LoRA tuning on vision models, you must specify the `target_modules` explicitly, as no defaults are provided.
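For example, the LoRA portion of a config for one of the Llava or Granite (`LlavaNextForConditionalGeneration`) models might look like the sketch below. It assumes the library's usual LoRA arguments (`peft_method`, `r`, `lora_alpha`, `lora_dropout`, `target_modules`), and the `target_modules` values are illustrative attention projection names, not defaults supplied by the library:

```json
"peft_method": "lora",
"r": 8,
"lora_alpha": 32,
"lora_dropout": 0.05,
"target_modules": ["q_proj", "v_proj"]
```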

fixtures/accelerate_fsdp_defaults.yaml

Lines changed: 6 additions & 0 deletions
```diff
@@ -41,6 +41,12 @@ fsdp_config:
 # not needed for HF models that have . _no_split_modules
 # the example below is for GPTBigCode
 # fsdp_transformer_layer_cls_to_wrap: "GPTBigCodeBlock"
+# needed for llava-1.5-vision + llava-next-vision models
+# fsdp_transformer_layer_cls_to_wrap: "LlamaDecoderLayer"
+# needed for llava-1.6-mistral-vision model
+# fsdp_transformer_layer_cls_to_wrap: "MistralDecoderLayer"
+# needed for granite-3.2-vision model
+# fsdp_transformer_layer_cls_to_wrap: "GraniteDecoderLayer"
 
 # for "autocast" mixed precision training, where the weights of the model are kept at higher precision, but the
 # learning products (e.g., gradients, model parameters) are kept at a lower precision. Default is 'no'. Other options
```

pyproject.toml

Lines changed: 1 addition & 0 deletions
```diff
@@ -39,6 +39,7 @@ dependencies = [
 "protobuf>=5.28.0,<6.0.0",
 "datasets>=2.15.0,<4.0",
 "simpleeval>=0.9.13,<2.0",
+"pillow>=11.0.0,<12.0",
 ]
 
 [project.optional-dependencies]
```
Lines changed: 3 additions & 0 deletions

```json
{
"chat_template": "{{- bos_token }}\n{%- if custom_tools is defined %}\n {%- set tools = custom_tools %}\n{%- endif %}\n{%- if not tools_in_user_message is defined %}\n {%- set tools_in_user_message = true %}\n{%- endif %}\n{%- if not date_string is defined %}\n {%- if strftime_now is defined %}\n {%- set date_string = strftime_now(\"%d %b %Y\") %}\n {%- else %}\n {%- set date_string = \"26 Jul 2024\" %}\n {%- endif %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n\n{#- This block extracts the system message, so we can slot it into the right place. #}\n{%- if messages[0]['role'] == 'system' %}\n {%- set system_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n {%- set user_supplied_system_message = true %}\n{%- else %}\n {%- set system_message = \"\" %}\n {%- set user_supplied_system_message = false %}\n{%- endif %}\n\n{#- Find out if there are any images #}\n{% set image_ns = namespace(has_images=false) %} \n{%- for message in messages %}\n {%- for content in message['content'] %}\n {%- if content['type'] == 'image' %}\n {%- set image_ns.has_images = true %}\n {%- endif %}\n {%- endfor %}\n{%- endfor %}\n\n{#- System message if there are no images, or if the user supplied one #}\n{%- if user_supplied_system_message or not image_ns.has_images %}\n {{- \"<|start_header_id|>system<|end_header_id|>\\n\\n\" }}\n {%- if tools is not none %}\n {{- \"Environment: ipython\\n\" }}\n {%- endif %}\n {{- \"Cutting Knowledge Date: December 2023\\n\" }}\n {{- \"Today Date: \" + date_string + \"\\n\\n\" }}\n {%- if tools is not none and not tools_in_user_message %}\n {{- \"You have access to the following functions. To call a function, please respond with JSON for a function call.\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' }}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n {%- endif %}\n {{- system_message }}\n {{- \"<|eot_id|>\" }}\n{%- endif %}\n\n{#- Custom tools are passed in a user message with some extra guidance #}\n{%- if tools_in_user_message and not tools is none %}\n {#- Extract the first user message so we can plug it in here #}\n {%- if messages | length != 0 %}\n {%- set first_user_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n {%- else %}\n {{- raise_exception(\"Cannot put tools in the first user message when there's no first user message!\") }}\n{%- endif %}\n {{- '<|start_header_id|>user<|end_header_id|>\\n\\n' -}}\n {{- \"Given the following functions, please respond with a JSON for a function call \" }}\n {{- \"with its proper arguments that best answers the given prompt.\\n\\n\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' 
}}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n {{- first_user_message + \"<|eot_id|>\"}}\n{%- endif %}\n\n{%- for message in messages %}\n {%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %}\n {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\\n\\n' }}\n {%- if message['content'] is string %}\n {{- message['content'] }}\n {%- else %}\n {%- for content in message['content'] %}\n {%- if content['type'] == 'image' %}\n {{- '<|image|>' }}\n {%- elif content['type'] == 'text' %}\n {{- content['text'] }}\n {%- endif %}\n {%- endfor %}\n {%- endif %}\n {{- '<|eot_id|>' }}\n {%- elif 'tool_calls' in message %}\n {%- if not message.tool_calls|length == 1 %}\n {{- raise_exception(\"This model only supports single tool-calls at once!\") }}\n {%- endif %}\n {%- set tool_call = message.tool_calls[0].function %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n {{- '{\"name\": \"' + tool_call.name + '\", ' }}\n {{- '\"parameters\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- \"}\" }}\n {{- \"<|eot_id|>\" }}\n {%- elif message.role == \"tool\" or message.role == \"ipython\" %}\n {{- \"<|start_header_id|>ipython<|end_header_id|>\\n\\n\" }}\n {%- if message.content is mapping or message.content is iterable %}\n {{- message.content | tojson }}\n {%- else %}\n {{- message.content }}\n {%- endif %}\n {{- \"<|eot_id|>\" }}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' }}\n{%- endif %}\n"
}
```

Lines changed: 44 additions & 0 deletions
```json
{
  "architectures": [
    "MllamaForConditionalGeneration"
  ],
  "image_token_index": 128256,
  "model_type": "mllama",
  "text_config": {
    "cross_attention_layers": [
      1
    ],
    "eos_token_id": [
      128001,
      128008,
      128009
    ],
    "hidden_size": 128,
    "intermediate_size": 768,
    "max_position_embeddings": 1024,
    "model_type": "mllama_text_model",
    "num_attention_heads": 4,
    "num_hidden_layers": 2,
    "rope_scaling": {
      "factor": 8.0,
      "high_freq_factor": 4.0,
      "low_freq_factor": 1.0,
      "original_max_position_embeddings": 512,
      "rope_type": "llama3"
    },
    "torch_dtype": "float16"
  },
  "torch_dtype": "float16",
  "transformers_version": "4.49.0",
  "vision_config": {
    "attention_heads": 4,
    "hidden_size": 128,
    "image_size": 224,
    "intermediate_size": 512,
    "model_type": "mllama_vision_model",
    "num_global_layers": 1,
    "num_hidden_layers": 2,
    "torch_dtype": "float16",
    "vision_output_dim": 256
  }
}
```
Lines changed: 11 additions & 0 deletions
```json
{
  "_from_model_config": true,
  "bos_token_id": 128000,
  "eos_token_id": [
    128001,
    128008,
    128009
  ],
  "pad_token_id": 128004,
  "transformers_version": "4.49.0"
}
```
Binary file (67.9 MB) not shown.
Lines changed: 26 additions & 0 deletions
```json
{
  "do_convert_rgb": true,
  "do_normalize": true,
  "do_pad": true,
  "do_rescale": true,
  "do_resize": true,
  "image_mean": [
    0.48145466,
    0.4578275,
    0.40821073
  ],
  "image_processor_type": "MllamaImageProcessor",
  "image_std": [
    0.26862954,
    0.26130258,
    0.27577711
  ],
  "max_image_tiles": 4,
  "processor_class": "MllamaProcessor",
  "resample": 2,
  "rescale_factor": 0.00392156862745098,
  "size": {
    "height": 560,
    "width": 560
  }
}
```
