
Commit e5da943

Merge branch 'yhzhang/llava_video_local' into yhzhang/llava_video_dev
2 parents: fbd6a07 + 853ac29

File tree

5 files changed: +433 additions, -92 deletions

README.md

Lines changed: 18 additions & 2 deletions
@@ -3,20 +3,36 @@
 </p>
 
 # LLaVA-NeXT: Open Large Multimodal Models
+[![Static Badge](https://img.shields.io/badge/llava_video-paper-green)](https://github.com/LLaVA-VL/LLaVA-NeXT)
 [![Static Badge](https://img.shields.io/badge/llava_onevision-paper-green)](https://arxiv.org/abs/2408.03326)
 [![llava_next-blog](https://img.shields.io/badge/llava_next-blog-green)](https://llava-vl.github.io/blog/)
 
 [![llava_onevision-demo](https://img.shields.io/badge/llava_onevision-demo-red)](https://llava-onevision.lmms-lab.com/)
+[![llava_next-video_demo](https://img.shields.io/badge/llava_video-demo-red)](https://huggingface.co/spaces/WildVision/vision-arena)
 [![llava_next-interleave_demo](https://img.shields.io/badge/llava_next-interleave_demo-red)](https://huggingface.co/spaces/lmms-lab/LLaVA-NeXT-Interleave-Demo)
-[![llava_next-video_demo](https://img.shields.io/badge/llava_next-video_demo-red)](https://huggingface.co/spaces/WildVision/vision-arena)
+[![Openbayes Demo](https://img.shields.io/static/v1?label=Demo&message=OpenBayes%E8%B4%9D%E5%BC%8F%E8%AE%A1%E7%AE%97&color=green)](https://openbayes.com/console/public/tutorials/gW0ng9jKXfO)
 
+[![llava_video-checkpoints](https://img.shields.io/badge/llava_video-checkpoints-blue)](https://huggingface.co/collections/lmms-lab/llava-next-video-661e86f5e8dabc3ff793c944)
 [![llava_onevision-checkpoints](https://img.shields.io/badge/llava_onevision-checkpoints-blue)](https://huggingface.co/collections/lmms-lab/llava-onevision-66a259c3526e15166d6bba37)
 [![llava_next-interleave_checkpoints](https://img.shields.io/badge/llava_next-interleave_checkpoints-blue)](https://huggingface.co/collections/lmms-lab/llava-next-interleave-66763c55c411b340b35873d1)
-[![llava_next-video_checkpoints](https://img.shields.io/badge/llava_next-video_checkpoints-blue)](https://huggingface.co/collections/lmms-lab/llava-next-video-661e86f5e8dabc3ff793c944)
 [![llava_next-image_checkpoints](https://img.shields.io/badge/llava_next-image_checkpoints-blue)](https://huggingface.co/lmms-lab)
 
 ## Release Notes
 
+- **[2024/10/04] 🔥 LLaVA-Video** (formerly LLaVA-NeXT-Video) has undergone a major upgrade! We are excited to release **LLaVA-Video-178K**, a high-quality synthetic dataset for video instruction tuning. This dataset includes:
+
+  - 178,510 caption entries
+  - 960,792 open-ended Q&A pairs
+  - 196,198 multiple-choice Q&A items
+
+  Along with this, we’re also releasing the **LLaVA-Video 7B/72B models**, which deliver competitive performance on the latest video benchmarks, including [Video-MME](https://video-mme.github.io/home_page.html#leaderboard), [LongVideoBench](https://longvideobench.github.io/), and [Dream-1K](https://tarsier-vlm.github.io/).
+
+  📄 **Explore more**:
+  - [LLaVA-Video-178K Dataset](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K): Download the dataset.
+  - [LLaVA-Video Models](https://huggingface.co/collections/lmms-lab/llava-video-661e86f5e8dabc3ff793c944): Access model checkpoints.
+  - [Paper](https://github.com/LLaVA-VL/LLaVA-NeXT): Detailed information about LLaVA-Video.
+  - [LLaVA-Video Documentation](https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/docs/LLaVA_Video_1003.md): Guidance on training, inference and evaluation.
+
 - [2024/09/13] 🔥 **🚀 [LLaVA-OneVision-Chat](docs/LLaVA_OneVision_Chat.md)**. The new LLaVA-OV-Chat (7B/72B) significantly improves the chat experience of LLaVA-OV. 📄
 
 ![](docs/ov_chat_images/chat_results.png)

docs/LLaVA_Video_1003.md

Lines changed: 122 additions & 0 deletions
@@ -0,0 +1,122 @@
# LLaVA Video

## Table of Contents

1. [Model Summary](#model-summary)
2. [Inference](#inference)
3. [Training](#training)
4. [Evaluation](#evaluation-guidance)
5. [Citation](#citation)

## Model Summary

The LLaVA-Video models are 7B/72B-parameter models trained on the [LLaVA-Video-178K](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K) and [LLaVA-OneVision](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data) datasets. They are built on the Qwen2 language model and support a context window of 32K tokens.
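
The annotation files for LLaVA-Video-178K are hosted on the Hugging Face Hub. A minimal download sketch, assuming a recent `huggingface_hub` CLI; the `--include` filter and target directory are illustrative placeholders, so check the dataset card for the actual file layout:

```bash
# Fetch the LLaVA-Video-178K annotations (drop --include to pull everything, which is large).
huggingface-cli download lmms-lab/LLaVA-Video-178K \
  --repo-type dataset \
  --local-dir ./LLaVA-Video-178K \
  --include "*.json"
```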

## Inference

We provide a simple generation example for using our model. For more details, please refer to the [GitHub repository](https://github.com/LLaVA-VL/LLaVA-NeXT).

```python
# pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
from llava.conversation import conv_templates, SeparatorStyle
import copy
import warnings
import numpy as np
import torch
from decord import VideoReader, cpu

warnings.filterwarnings("ignore")


def load_video(video_path, max_frames_num, fps=1, force_sample=False):
    """Decode a video and sample at most `max_frames_num` frames (~`fps` frames per second of video)."""
    if max_frames_num == 0:
        return np.zeros((1, 336, 336, 3))
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    total_frame_num = len(vr)
    video_time = total_frame_num / vr.get_avg_fps()
    # Sample roughly `fps` frames per second of video.
    stride = round(vr.get_avg_fps() / fps)
    frame_idx = list(range(0, total_frame_num, stride))
    # If that exceeds the frame budget (or force_sample is set), sample uniformly instead.
    if len(frame_idx) > max_frames_num or force_sample:
        frame_idx = np.linspace(0, total_frame_num - 1, max_frames_num, dtype=int).tolist()
    frame_time = ",".join([f"{i / vr.get_avg_fps():.2f}s" for i in frame_idx])
    frames = vr.get_batch(frame_idx).asnumpy()
    return frames, frame_time, video_time


pretrained = "lmms-lab/LLaVA-Video-7B-Qwen2"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, torch_dtype="bfloat16", device_map=device_map)  # Add any other thing you want to pass in llava_model_args
model.eval()

# Load and preprocess the video frames.
video_path = "XXXX"
max_frames_num = 64
video, frame_time, video_time = load_video(video_path, max_frames_num, 1, force_sample=True)
video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].cuda().bfloat16()
video = [video]

# Build the prompt with the image token and timing information.
conv_template = "qwen_1_5"  # Make sure you use the correct chat template for different models
time_instruction = f"The video lasts for {video_time:.2f} seconds, and {len(video[0])} frames are uniformly sampled from it. These frames are located at {frame_time}. Please answer the following questions related to this video."
question = DEFAULT_IMAGE_TOKEN + f"{time_instruction}\nPlease describe this video in detail."

conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()

input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
cont = model.generate(
    input_ids,
    images=video,
    modalities=["video"],
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)[0].strip()
print(text_outputs)
```
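
The example above assumes the `llava` package from this repo and the `decord` video reader are installed; a minimal setup sketch (a CUDA-ready PyTorch environment is assumed to be in place already):

```bash
pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
pip install decord  # used by load_video above to decode and sample frames
```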

## Training

[[Scripts]](https://github.com/LLaVA-VL/LLaVA-NeXT/tree/main/scripts/video/train): Start training models on your single-image/multi-image/video data.

## Evaluation Guidance

We use the [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval) toolkit to evaluate our models. Ensure you have installed the LLaVA-NeXT model files as per the instructions in the main README.md.

Install lmms-eval:

> pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git

### Reproducing Evaluation Results

Our models' evaluation results can be fully reproduced using the lmms-eval toolkit. After installing lmms-eval and llava, you can run the evaluation using the following commands.

Note: These commands require flash-attn. If you prefer not to install it, disable flash-attn by adding `attn_implementation=None` to `--model_args` (see the example after the command block below).

Important: Different torch versions may cause slight variations in results. By default, `lmms-eval` requires the latest torch version, while the `llava` repo pins torch to `2.1.2`. Torch `2.1.2` is stable for both `llava` and `lmms-eval`.

### Evaluating LLaVA-Video on multiple datasets

We recommend that developers and researchers evaluate the models on a broad range of datasets to get a comprehensive understanding of their performance in different scenarios. We therefore provide a comprehensive list of evaluation datasets and welcome contributions that incorporate more tasks. Please refer to [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval) for more details.

```bash
# video tasks
accelerate launch --num_processes=8 \
  -m lmms_eval \
  --model llava_vid \
  --model_args pretrained=lmms-lab/LLaVA-Video-7B-Qwen2,conv_template=qwen_1_5,max_frames_num=64,mm_spatial_pool_mode=average \
  --tasks activitynetqa,videochatgpt,nextqa_mc_test,egoschema,video_dc499,videomme,videomme_w_subtitle,perceptiontest_val_mc \
  --batch_size 1 \
  --log_samples \
  --log_samples_suffix llava_vid \
  --output_path ./logs/
```
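
For example, following the note above, a single-benchmark run without flash-attn appends `attn_implementation=None` to `--model_args`. A sketch (the lone `videomme` task is chosen purely for illustration):

```bash
accelerate launch --num_processes=8 \
  -m lmms_eval \
  --model llava_vid \
  --model_args pretrained=lmms-lab/LLaVA-Video-7B-Qwen2,conv_template=qwen_1_5,max_frames_num=64,mm_spatial_pool_mode=average,attn_implementation=None \
  --tasks videomme \
  --batch_size 1 \
  --output_path ./logs/
```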

scripts/video/train/SO400M_Qwen2_72B_ov_to_video_am9.sh

Lines changed: 15 additions & 45 deletions
@@ -1,49 +1,16 @@
 #!/bin/bash
 
-
-# You should complete the path of the following attributes:
-PROJECT_ROOT="XXXX"
-## This could be a yaml file for multiple files or a json file for a single file
-DATA_PATH="XXXX"
-IMAGE_FOLDER="XXXX"
-VIDEO_FOLDER="XXXX"
-
-
-export PYTHONWARNINGS="ignore"
-
+# Set up the data folder
+IMAGE_FOLDER="XXX"
+VIDEO_FOLDER="XXX"
+DATA_YAML="XXX" # e.g. exp.yaml
 
 ############### Prepare Envs #################
-cd $PROJECT_ROOT
-python3 -m pip install --upgrade pip
-python3 -m pip install -e ".[train]"
-
-python3 -m pip install ninja
 python3 -m pip install flash-attn --no-build-isolation
 alias python=python3
 ############### Show Envs ####################
 
 nvidia-smi
-# take the first port of worker 0
-ports=($(echo $METIS_WORKER_0_PORT | tr ',' ' '))
-port=${ports[0]}
-port_in_cmd="$(echo "${METIS_WORKER_0_PORT:-2222}" | awk -F',' '{print $1}')"
-
-echo "total workers: ${ARNOLD_WORKER_NUM}"
-echo "cur worker id: ${ARNOLD_ID}"
-echo "gpus per worker: ${ARNOLD_WORKER_GPU}"
-echo "master ip: ${METIS_WORKER_0_HOST}"
-echo "master port: ${port}"
-echo "master port in cmd: ${port_in_cmd}"
-
-export OMP_NUM_THREADS=8
-export NCCL_IB_DISABLE=0
-export NCCL_IB_GID_INDEX=3
-# export NCCL_IB_HCA=${ARNOLD_RDMA_DEVICE}
-export NCCL_SOCKET_IFNAME=eth0
-export NCCL_DEBUG=WARN
-
-PORT=26000
-GPUS="0,1,2,3,4,5,6,7"
 
 ################ Arnold Jobs ################
 
@@ -53,22 +20,25 @@ VISION_MODEL_VERSION="google/siglip-so400m-patch14-384"
 VISION_MODEL_VERSION_CLEAN="${VISION_MODEL_VERSION//\//_}"
 
 
-# Stage For video
+BASE_RUN_NAME="llavanext-google_siglip-so400m-patch14-384-Qwen_Qwen2-72B-Instruct-mlp2x_gelu-pretrain_blip558k_plain"
+echo "BASE_RUN_NAME: ${BASE_RUN_NAME}"
+
+# Stage 2
 PROMPT_VERSION="qwen_1_5"
-MID_RUN_NAME="llava_next_video-${VISION_MODEL_VERSION_CLEAN}-${LLM_VERSION_CLEAN}-ov_to_video"
-PREV_STAGE_CHECKPOINT=""
+MID_RUN_NAME="llavanext-${VISION_MODEL_VERSION_CLEAN}-${LLM_VERSION_CLEAN}-ov_to_video_am9"
+PREV_STAGE_CHECKPOINT="lmms-lab/llava-onevision-qwen2-72b-ov"
 echo "PREV_STAGE_CHECKPOINT: ${PREV_STAGE_CHECKPOINT}"
 echo "MID_RUN_NAME: ${MID_RUN_NAME}"
 
 
-ACCELERATE_CPU_AFFINITY=1 torchrun --nproc_per_node="${NUM_GPUS}" --nnodes="${NNODES}" --node_rank="${RANK}" --master_addr="${ADDR}" --master_port="${PORT}" \
+ACCELERATE_CPU_AFFINITY=1 torchrun --nproc_per_node="${ARNOLD_WORKER_GPU}" --nnodes="${ARNOLD_WORKER_NUM}" --node_rank="${ARNOLD_ID}" --master_addr="${METIS_WORKER_0_HOST}" --master_port="${port_in_cmd}" \
     llava/train/train_mem.py \
     --deepspeed scripts/zero3.json \
     --model_name_or_path $PREV_STAGE_CHECKPOINT \
     --version $PROMPT_VERSION \
-    --data_path ${DATA_PATH} \
-    --image_folder ${IMAGE_FOLDER} \
-    --video_folder ${VIDEO_FOLDER} \
+    --data_path $DATA_YAML \
+    --image_folder $IMAGE_FOLDER \
+    --video_folder $VIDEO_FOLDER \
     --mm_tunable_parts="mm_vision_tower,mm_mlp_adapter,mm_language_model" \
     --mm_vision_tower_lr=2e-6 \
     --vision_tower ${VISION_MODEL_VERSION} \
@@ -97,7 +67,7 @@ ACCELERATE_CPU_AFFINITY=1 torchrun --nproc_per_node="${NUM_GPUS}" --nnodes="${NN
     --lr_scheduler_type "cosine" \
     --logging_steps 1 \
     --tf32 True \
-    --model_max_length 12768 \
+    --model_max_length 32768 \
     --gradient_checkpointing True \
     --dataloader_num_workers 2 \
     --lazy_preprocess True \
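
`DATA_YAML` above points at a dataset-mixture file. Its exact schema is defined by the repo's data loader, so the sketch below is a hypothetical illustration only (the annotation file names and sampling strategies are made up; consult the training docs in this repo for the real format):

```bash
# Hypothetical exp.yaml: a list of annotation files with per-dataset sampling strategies.
cat > exp.yaml <<'EOF'
datasets:
  - json_path: /path/to/llava_video_178k_captions.json
    sampling_strategy: all
  - json_path: /path/to/llava_video_178k_open_ended_qa.json
    sampling_strategy: first:50%
EOF
```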

scripts/video/train/SO400M_Qwen2_7B_ov_to_video_am9.sh

Lines changed: 15 additions & 45 deletions
@@ -1,74 +1,44 @@
 #!/bin/bash
 
-
-# You should complete the path of the following attributes:
-PROJECT_ROOT="XXXX"
-## This could be a yaml file for multiple files or a json file for a single file
-DATA_PATH="XXXX"
-IMAGE_FOLDER="XXXX"
-VIDEO_FOLDER="XXXX"
-
-
-export PYTHONWARNINGS="ignore"
-
+# Set up the data folder
+IMAGE_FOLDER="XXX"
+VIDEO_FOLDER="XXX"
+DATA_YAML="XXX" # e.g. exp.yaml
 
 ############### Prepare Envs #################
-cd $PROJECT_ROOT
-python3 -m pip install --upgrade pip
-python3 -m pip install -e ".[train]"
-
-python3 -m pip install ninja
 python3 -m pip install flash-attn --no-build-isolation
 alias python=python3
 ############### Show Envs ####################
 
 nvidia-smi
-# take the first port of worker 0
-ports=($(echo $METIS_WORKER_0_PORT | tr ',' ' '))
-port=${ports[0]}
-port_in_cmd="$(echo "${METIS_WORKER_0_PORT:-2222}" | awk -F',' '{print $1}')"
-
-echo "total workers: ${ARNOLD_WORKER_NUM}"
-echo "cur worker id: ${ARNOLD_ID}"
-echo "gpus per worker: ${ARNOLD_WORKER_GPU}"
-echo "master ip: ${METIS_WORKER_0_HOST}"
-echo "master port: ${port}"
-echo "master port in cmd: ${port_in_cmd}"
-
-export OMP_NUM_THREADS=8
-export NCCL_IB_DISABLE=0
-export NCCL_IB_GID_INDEX=3
-# export NCCL_IB_HCA=${ARNOLD_RDMA_DEVICE}
-export NCCL_SOCKET_IFNAME=eth0
-export NCCL_DEBUG=WARN
-
-PORT=26000
-GPUS="0,1,2,3,4,5,6,7"
 
 ################ Arnold Jobs ################
 
 LLM_VERSION="Qwen/Qwen2-7B-Instruct"
 LLM_VERSION_CLEAN="${LLM_VERSION//\//_}"
 VISION_MODEL_VERSION="google/siglip-so400m-patch14-384"
 VISION_MODEL_VERSION_CLEAN="${VISION_MODEL_VERSION//\//_}"
+#
 
+BASE_RUN_NAME="llavanext-google_siglip-so400m-patch14-384-Qwen_Qwen2-7B-Instruct-mlp2x_gelu-pretrain_blip558k_plain"
+echo "BASE_RUN_NAME: ${BASE_RUN_NAME}"
 
-# Stage For video
+# Stage 2
 PROMPT_VERSION="qwen_1_5"
-MID_RUN_NAME="llava_next_video-${VISION_MODEL_VERSION_CLEAN}-${LLM_VERSION_CLEAN}-ov_to_video"
-PREV_STAGE_CHECKPOINT=""
+MID_RUN_NAME="llavanext-${VISION_MODEL_VERSION_CLEAN}-${LLM_VERSION_CLEAN}-ov_to_video_am9"
+PREV_STAGE_CHECKPOINT="lmms-lab/llava-onevision-qwen2-7b-ov"
 echo "PREV_STAGE_CHECKPOINT: ${PREV_STAGE_CHECKPOINT}"
 echo "MID_RUN_NAME: ${MID_RUN_NAME}"
 
 
-ACCELERATE_CPU_AFFINITY=1 torchrun --nproc_per_node="${NUM_GPUS}" --nnodes="${NNODES}" --node_rank="${RANK}" --master_addr="${ADDR}" --master_port="${PORT}" \
+ACCELERATE_CPU_AFFINITY=1 torchrun --nproc_per_node="${ARNOLD_WORKER_GPU}" --nnodes="${ARNOLD_WORKER_NUM}" --node_rank="${ARNOLD_ID}" --master_addr="${METIS_WORKER_0_HOST}" --master_port="${port_in_cmd}" \
     llava/train/train_mem.py \
     --deepspeed scripts/zero3.json \
     --model_name_or_path $PREV_STAGE_CHECKPOINT \
     --version $PROMPT_VERSION \
-    --data_path ${DATA_PATH} \
-    --image_folder ${IMAGE_FOLDER} \
-    --video_folder ${VIDEO_FOLDER} \
+    --data_path $DATA_YAML \
+    --image_folder $IMAGE_FOLDER \
+    --video_folder $VIDEO_FOLDER \
     --mm_tunable_parts="mm_vision_tower,mm_mlp_adapter,mm_language_model" \
     --mm_vision_tower_lr=2e-6 \
     --vision_tower ${VISION_MODEL_VERSION} \
@@ -97,7 +67,7 @@ ACCELERATE_CPU_AFFINITY=1 torchrun --nproc_per_node="${NUM_GPUS}" --nnodes="${NN
     --lr_scheduler_type "cosine" \
     --logging_steps 1 \
     --tf32 True \
-    --model_max_length 22768 \
+    --model_max_length 32768 \
     --gradient_checkpointing True \
     --dataloader_num_workers 2 \
     --lazy_preprocess True \
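
Both training scripts read their distributed-launch settings (`ARNOLD_WORKER_GPU`, `ARNOLD_WORKER_NUM`, `ARNOLD_ID`, `METIS_WORKER_0_HOST`, `port_in_cmd`) from the scheduler environment instead of defining them. A hypothetical single-node, 8-GPU invocation that supplies them manually (the values are illustrative, not part of the scripts):

```bash
export ARNOLD_WORKER_GPU=8 ARNOLD_WORKER_NUM=1 ARNOLD_ID=0
export METIS_WORKER_0_HOST=127.0.0.1 port_in_cmd=26000
bash scripts/video/train/SO400M_Qwen2_7B_ov_to_video_am9.sh
```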
