- The LLaVA-PT data is from LLaVA.
- The Hybrid-FT data is from SViT, LVIS, LRV, and MIMIC-IT.
- The LLaVA-FT data is from LLaVA.
- Download the training annotations. You can get them from Baidu Disk, Google Disk, Peking University Disk, or Hugging Face.

We also provide the processed data as follows. The links are to Baidu Disk.
| Data group | Usage | Link |
|---|---|---|
| LLaVA-PT | Stage 1 | LLaVA 1.5-558k |
| Hybrid-FT | Stage 2 | SViT-157k, LVIS-220k, LRV-331k, MIMIC-IT-256k |
| LLaVA-FT | Stage 3 | LLaVA 1.5-mix-665k |
If you cannot easily access Baidu Disk, you can download the data from Hugging Face instead.
After downloading all of them, organize the data as follows in IMAGE_FOLDER.

```
IMAGE_FOLDER
├── llava_image
├── llava_image_tune
├── lvis_tune
├── lrv_tune
├── svit_tune
└── mimicit_tune
```

Specify your IMAGE_FOLDER and JSON_FOLDER according to the data preparation.
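To catch a mis-organized IMAGE_FOLDER before a long training run starts, a small check like the sketch below can help. The folder list comes from the layout above; the helper name `missing_folders` is an assumption, not part of the repo:

```python
from pathlib import Path

# Expected sub-folders of IMAGE_FOLDER, per the layout above.
EXPECTED = ["llava_image", "llava_image_tune", "lvis_tune",
            "lrv_tune", "svit_tune", "mimicit_tune"]

def missing_folders(image_folder: str) -> list[str]:
    """Return the expected sub-folders that are absent, in layout order."""
    root = Path(image_folder)
    return [name for name in EXPECTED if not (root / name).is_dir()]
```

Calling `missing_folders` on your IMAGE_FOLDER path returns an empty list when the layout is complete, so it is easy to assert on at the top of a launch script.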
For training at 384 resolution, we use google/siglip-so400m-patch14-384 as the image tower. Note that if you pass --image_tower google/siglip-so400m-patch14-384, you should upgrade transformers to 4.37.0.
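Since an outdated transformers install breaks the SigLIP image tower, a quick version sanity check before launching training can save a failed run. The helper below is a sketch; `meets_min` is a hypothetical name, not part of the repo:

```python
# Sketch: check a transformers version string against the 4.37.0 minimum
# required by the SigLIP image tower. meets_min is a hypothetical helper.

def meets_min(installed: str, required=(4, 37, 0)) -> bool:
    """True if a dotted version string is at least `required`."""
    parts = tuple(int(p) for p in installed.split(".")[:3] if p.isdigit())
    return parts >= required

print(meets_min("4.36.2"))  # → False: too old for SigLIP
print(meets_min("4.37.0"))  # → True
```

In practice you would feed it the installed version, e.g. `importlib.metadata.version("transformers")`, at the top of your launch script and abort early if it returns False.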
- Stage 1 pretraining script: pretrain.sh.
- Stage 2 tuning script: finetune.sh.
- Stage 3 moe-tuning script: finetune_moe.sh.
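The three stages run in order, so a minimal driver loop might look like the sketch below. The script names come from the list above, but their location inside the repo is an assumption; adjust the path before uncommenting the launch line:

```shell
# Run the three training stages in order.
# Script names are from the list above; the scripts/ path is an assumption.
set -e
for script in pretrain.sh finetune.sh finetune_moe.sh; do
  echo "launching ${script}"
  # bash "scripts/${script}"   # uncomment once the path is confirmed
done
```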