Commit 78bfd64

Author: Haozhe Qi (committed)
Merge remote-tracking branch 'origin/main' into haozhedev
2 parents 693d553 + 79ef45a, commit 78bfd64

File tree: 5 files changed (+33, -13 lines)


docs/LLaVA_OneVision_Tutorials.ipynb (9 additions, 2 deletions)

@@ -60,7 +60,11 @@
 "model_name = \"llava_qwen\"\n",
 "device = \"cuda\"\n",
 "device_map = \"auto\"\n",
-"tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map) # Add any other thing you want to pass in llava_model_args\n",
+"llava_model_args = {\n",
+"    \"multimodal\": True,\n",
+"    \"attn_implementation\": \"sdpa\",\n",
+"}\n",
+"tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map, **llava_model_args) # Add any other thing you want to pass in llava_model_args\n",
 "\n",
 "model.eval()\n",
 "\n",

@@ -322,7 +326,10 @@
 "model_name = \"llava_qwen\"\n",
 "device = \"cuda\"\n",
 "device_map = \"auto\"\n",
-"tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map, attn_implementation=\"sdpa\")\n",
+"llava_model_args = {\n",
+"    \"multimodal\": True,\n",
+"}\n",
+"tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map, attn_implementation=\"sdpa\", **llava_model_args)\n",
 "\n",
 "model.eval()\n",
 "\n",

docs/LLaVA_Video_1003.md (1 addition, 1 deletion)

@@ -84,7 +84,7 @@ print(text_outputs)
 
 ## Training
 
-[[Scripts]](/Users/zhangyuanhan/Desktop/LLaVA-NeXT/scripts/video/train): Start training models on your single-image/multi-image/video data.
+[[Scripts]](https://github.com/LLaVA-VL/LLaVA-NeXT/blob/yhzhang/video_dev/scripts/video/train/SO400M_Qwen2_72B_ov_to_video_am9_aug6.sh): Start training models on your single-image/multi-image/video data.
 
 
 ## Evaluation Guidance

scripts/train/README.md (16 additions, 5 deletions)

@@ -73,11 +73,22 @@ Here we explain some technical details on our data.
 }
 ```
 
-- single-image stage data mixture [TBD]
+- single-image stage data mixture
+
+  We have placed our single-image stage data in [single-image-yaml](https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/scripts/train/single_image.yaml) for users to review. You can download each subset from [onevision-data](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data).
+
+  Inside the data yaml, the first entry refers to the previous llava-1.6/next 790K data, which you can download from [llava-next-data](https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Data).
+
+  The naming inside the yaml differs from our paper figure due to writing considerations. Users who want to explore our dataset can check the [upload script](https://github.com/LLaVA-VL/LLaVA-NeXT/blob/0070d0ae4931c9b19d9cc57c38e16a87c270a61c/playground/upload_data.py#L175) to find the mapping from our local dataset names to the HF version.
+
 - onevision stage data mixture
 
-  - Around 800K higher-quality data re-sampled from previous stage (yes, it's data replay!).
+  Our onevision stage data is available in [onevision-yaml](https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/scripts/train/onevision.yaml). The single-image portion can be downloaded from the above Hugging Face link for the onevision data. Here's a breakdown of each part:
+
+  - Around 800K higher-quality data re-sampled from the previous stage (yes, it's data replay!).
   - [M4-Instruct Data](https://huggingface.co/datasets/lmms-lab/M4-Instruct-Data)
-  - Video Data
-    - 65595 re-annotated data. The data sources are from a collection of academic datasets, including Youcook2 (32267), Charades (19851), NextQA (7653), activitynet (5153), ego4d (671). The instruction and response are generated via GPT4o provided by AzureAI. More exquisite details are to be completed by Yuanhan's subsequent work on video specific model to introduce the data annotation pipeline. (it's brilliant, stay tuned!)
-    - [ShareGPTVideo](https://huggingface.co/ShareGPTVideo). We use a total of 255000 data from it.
+  - Video Data: We have released the video part as [llava-video-data](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K). Users can download the data; LLaVA-OneVision uses the following subsets:
+    - We include the captions and open-ended questions from the 0_30_s_academic_v0_1 split, along with 240,000 open-ended QA items and 15,000 caption entries from the LLaVA-Hound video data, for LLaVA-OneVision.
+    - 0_30_s_academic_v0_1 captions
+    - 0_30_s_academic_v0_1 open-ended QA
+    - LLaVA-Hound: Same as above.
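As a quick aid for readers of the README changes above, here is a small sketch of how one of the referenced data-mixture yaml files could be inspected before training. The key names (`datasets`, `json_path`, `sampling_strategy`) are assumptions about the yaml layout, not something this diff confirms; adjust them to match the actual files.

```python
# Hypothetical inspection of a data-mixture yaml such as scripts/train/onevision.yaml.
# The keys "datasets", "json_path", and "sampling_strategy" are assumed, not confirmed.
import yaml  # requires PyYAML

with open("scripts/train/onevision.yaml") as f:
    mixture = yaml.safe_load(f)

for entry in mixture.get("datasets", []):
    # Print each subset path and how it is sampled into the mixture.
    print(entry.get("json_path"), "->", entry.get("sampling_strategy", "all"))
```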

scripts/video/train/SO400M_Qwen2_72B_ov_to_video_am9.sh (3 additions, 2 deletions)

@@ -26,12 +26,13 @@ echo "BASE_RUN_NAME: ${BASE_RUN_NAME}"
 # Stage 2
 PROMPT_VERSION="qwen_1_5"
 MID_RUN_NAME="llavanext-${VISION_MODEL_VERSION_CLEAN}-${LLM_VERSION_CLEAN}-ov_to_video_am9"
-PREV_STAGE_CHECKPOINT="lmms-lab/llava-onevision-qwen2-72b-ov"
+PREV_STAGE_CHECKPOINT="lmms-lab/llava-onevision-qwen2-72b-ov-si"
 echo "PREV_STAGE_CHECKPOINT: ${PREV_STAGE_CHECKPOINT}"
 echo "MID_RUN_NAME: ${MID_RUN_NAME}"
 
 
-ACCELERATE_CPU_AFFINITY=1 torchrun --nproc_per_node="${ARNOLD_WORKER_GPU}" --nnodes="${ARNOLD_WORKER_NUM}" --node_rank="${ARNOLD_ID}" --master_addr="${METIS_WORKER_0_HOST}" --master_port="${port_in_cmd}" \
+# ACCELERATE_CPU_AFFINITY=1 torchrun --nproc_per_node="${ARNOLD_WORKER_GPU}" --nnodes="${ARNOLD_WORKER_NUM}" --node_rank="${ARNOLD_ID}" --master_addr="${METIS_WORKER_0_HOST}" --master_port="${port_in_cmd}" \
+deepspeed --master_port 30000 \
     llava/train/train_mem.py \
     --deepspeed scripts/zero3.json \
     --model_name_or_path $PREV_STAGE_CHECKPOINT \

scripts/video/train/SO400M_Qwen2_7B_ov_to_video_am9.sh (4 additions, 3 deletions)

@@ -26,12 +26,13 @@ echo "BASE_RUN_NAME: ${BASE_RUN_NAME}"
 # Stage 2
 PROMPT_VERSION="qwen_1_5"
 MID_RUN_NAME="llavanext-${VISION_MODEL_VERSION_CLEAN}-${LLM_VERSION_CLEAN}-ov_to_video_am9"
-PREV_STAGE_CHECKPOINT="lmms-lab/llava-onevision-qwen2-7b-ov"
+PREV_STAGE_CHECKPOINT="lmms-lab/llava-onevision-qwen2-7b-ov-si"
 echo "PREV_STAGE_CHECKPOINT: ${PREV_STAGE_CHECKPOINT}"
 echo "MID_RUN_NAME: ${MID_RUN_NAME}"
 
 
-ACCELERATE_CPU_AFFINITY=1 torchrun --nproc_per_node="${ARNOLD_WORKER_GPU}" --nnodes="${ARNOLD_WORKER_NUM}" --node_rank="${ARNOLD_ID}" --master_addr="${METIS_WORKER_0_HOST}" --master_port="${port_in_cmd}" \
+# ACCELERATE_CPU_AFFINITY=1 torchrun --nproc_per_node="${ARNOLD_WORKER_GPU}" --nnodes="${ARNOLD_WORKER_NUM}" --node_rank="${ARNOLD_ID}" --master_addr="${METIS_WORKER_0_HOST}" --master_port="${port_in_cmd}" \
+deepspeed --master_port 30000 \
     llava/train/train_mem.py \
     --deepspeed scripts/zero3.json \
     --model_name_or_path $PREV_STAGE_CHECKPOINT \

@@ -75,7 +76,7 @@ ACCELERATE_CPU_AFFINITY=1 torchrun --nproc_per_node="${ARNOLD_WORKER_GPU}" --nno
     --torch_compile True \
     --torch_compile_backend "inductor" \
     --dataloader_drop_last True \
-    --frames_upbound 110 \
+    --frames_upbound 64 \
     --mm_newline_position grid \
     --add_time_instruction True \
    --force_sample True \
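The `--frames_upbound` change above lowers the per-video frame cap from 110 to 64. As a rough illustration of what such a cap typically means (an assumption about the trainer's sampling logic, not code taken from `llava/train/train_mem.py`), a uniform sampler with an upper bound might look like this:

```python
# Illustrative sketch only: uniformly sample at most `frames_upbound` frame indices
# from a video, mirroring an assumed reading of --frames_upbound / --force_sample.
import numpy as np

def sample_frame_indices(total_frames: int, frames_upbound: int = 64, force_sample: bool = True):
    """Return uniformly spaced frame indices, capped at frames_upbound."""
    if total_frames <= frames_upbound and not force_sample:
        return list(range(total_frames))
    # Spread exactly `frames_upbound` indices across the whole video.
    return np.linspace(0, total_frames - 1, num=frames_upbound, dtype=int).tolist()

print(len(sample_frame_indices(900)))  # -> 64
```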
