
Commit 2731972

Merge branch 'main' into yhzhang/video_dev
2 parents 7ee057f + 5fbcf27 commit 2731972

File tree: 5 files changed (+171 / -42 lines)

README.md

Lines changed: 10 additions & 7 deletions
@@ -17,11 +17,11 @@
 
 ## Release Notes
 
-- [2024/08/06] 🔥 **LLaVA-OneVision** is [released](https://llava-vl.github.io/blog/2024-08-05-llava-onevision/). The new 0.5/7/72B model achieves the state-of-the-art level and comparable to most powerful commercial models performance on several single-image, multi-image, and video benchmarks. We benchmarked on a total of 47 benchmarks to comprehensively reflect our model's true capabilities in diverse domains. We also release our training code, and single-image/multi-image data mixture in [LLaVA-OneVision Data](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data)! Our video part data will be released via next upgrade of video specific model, stay tuned! Our training code can be directly used to train on single-image, multi-image and video data.
-- Check our [Paper](https://arxiv.org/abs/2408.03326) for more details and to see our insights on training one model to rule them all.
-- Check our [LLaVA-OneVision Doc](https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/docs/LLaVA_OneVision.md) for inference and evaluation guidance.
-- Check our [Training Scripts](https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/scripts/train) to start training models on single-image/multi-image/video data.
-
+- [2024/08/06] 🔥 **🚀 [LLaVA-OneVision (OV)](https://llava-vl.github.io/blog/2024-08-05-llava-onevision/)!** The new LLaVA-OV models (0.5B/7B/72B) achieve new state-of-the-art performance across single-image, multi-image, and video benchmarks, sometimes rivaling top commercial models on 47 diverse benchmarks. 📄 Explore more:
+  * [[Paper]](https://arxiv.org/abs/2408.03326): In-depth insights and new emerging scenarios, i.e., strong video understanding through task transfer from images.
+  * [[LLaVA-OV Doc]](https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/docs/LLaVA_OneVision.md): Model inference and evaluation guidance.
+  * [[Scripts]](https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/scripts/train): Start training models on your single-image/multi-image/video data.
+
 - [2024/07/16] 🔥 **LLaVA-NeXT-Video** has been upgraded. The new 32B model achieves the best open-source performance on several video benchmarks, including [Video-MME](https://video-mme.github.io/home_page.html#leaderboard). Please refer to [this page](docs/LLaVA-NeXT-Video_0716.md) for details, refer to [llava_next-video_demo](https://huggingface.co/spaces/WildVision/vision-arena) for demo.
 

@@ -96,6 +96,9 @@ pip install -e ".[train]"
 ### Project Navigation
 Please checkout the following page for more inference & evaluation details.
 
+#### - **LLaVA-OneVision: Easy Task Transfer**
+- [LLaVA-OneVision](https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/docs/LLaVA_OneVision.md): for demo inference. The evaluation code is in [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval).
+
 #### - **LLaVA-NeXT: Stronger LLMs Supercharge Multimodal Capabilities in the Wild**
 - [LLaVA-NeXT-Image](./docs/LLaVA-NeXT.md): for image demo inference and evaluation of stronger LMMs using [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval).

@@ -113,9 +116,9 @@ We use [SGLang](https://github.com/sgl-project/sglang) to speed up inference and
 **Prepare Environment**:
 Following the instruction in the [sglang](https://github.com/sgl-project/sglang?tab=readme-ov-file#install)
 
-### LLaVA-NeXT (Image)
+### LLaVA-NeXT/OneVision
 
-Checkout the HTTP Post/Get and SRT usage at [sglang/examples/usage/llava](https://github.com/sgl-project/sglang/blob/main/examples/usage/llava)
+Check out the HTTP Post/Get and SRT usage at [sglang/examples/runtime/llava_onevision](https://github.com/sgl-project/sglang/tree/main/examples/runtime/llava_onevision)
 
 ### LLaVA-NeXT (Video)

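For orientation, here is a minimal, hedged sketch of querying an SGLang server's native `/generate` HTTP endpoint once it is running. The launch command, prompt format, and field values below are illustrative assumptions; the linked `sglang/examples/runtime/llava_onevision` example is the authoritative reference.

```python
# Minimal sketch (assumptions, not the official client): query a locally
# running SGLang server via its native /generate HTTP endpoint.
# Assumed launch (check the sglang docs for model-specific flags such as a chat template):
#   python -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-7b-ov --port 30000
import requests

payload = {
    # Prompt format is illustrative; the real template comes from the server/model config.
    "text": "<image>\nDescribe this image in one sentence.",
    # Hypothetical local file; a URL or base64-encoded image is also commonly accepted.
    "image_data": "example_image.png",
    "sampling_params": {"temperature": 0, "max_new_tokens": 128},
}

response = requests.post("http://127.0.0.1:30000/generate", json=payload)
print(response.json())
```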
docs/LLaVA_OneVision_Tutorials.ipynb

Lines changed: 85 additions & 24 deletions
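The hunks below make two changes to the tutorial notebook: the loading cell now threads `multimodal`/`overwrite_config` kwargs into this repo's `load_pretrained_model`, and the video cell replaces OpenCV frame extraction with decord-based uniform sampling (with a consolidated video sketch after this file's diff). De-escaped from the notebook JSON, the updated loading cell amounts to roughly the following sketch; the model name is taken from elsewhere in the tutorial and the exact kwargs belong to this repository's loader, not a general Transformers API.

```python
# Sketch of the updated loading cell (plain Python, unescaped from the .ipynb diff).
from llava.model.builder import load_pretrained_model

pretrained = "lmms-lab/llava-onevision-qwen2-0.5b-ov"  # model used earlier in the tutorial
model_name = "llava_qwen"
device_map = "auto"

llava_model_args = {
    "multimodal": True,
    # "pad" keeps the aspect ratio by padding images to a square before preprocessing
    "overwrite_config": {"image_aspect_ratio": "pad"},
}

tokenizer, model, image_processor, max_length = load_pretrained_model(
    pretrained, None, model_name, device_map=device_map, **llava_model_args
)
model.eval()
```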
@@ -176,7 +176,13 @@
 "model_name = \"llava_qwen\"\n",
 "device = \"cuda\"\n",
 "device_map = \"auto\"\n",
-"tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)\n",
+"llava_model_args = {\n",
+"    \"multimodal\": True,\n",
+"}\n",
+"overwrite_config = {}\n",
+"overwrite_config[\"image_aspect_ratio\"] = \"pad\"\n",
+"llava_model_args[\"overwrite_config\"] = overwrite_config\n",
+"tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map, **llava_model_args)\n",
 "\n",
 "model.eval()\n",
 "\n",
@@ -227,10 +233,61 @@
 },
 {
  "cell_type": "code",
- "execution_count": null,
+ "execution_count": 1,
  "metadata": {},
- "outputs": [],
+ "outputs": [
+  {
+   "name": "stderr",
+   "output_type": "stream",
+   "text": [
+    "/home/tiger/miniconda3/envs/public_llava/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
+    " from .autonotebook import tqdm as notebook_tqdm\n",
+    "/home/tiger/miniconda3/envs/public_llava/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n",
+    " warnings.warn(\n",
+    "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\n"
+   ]
+  },
+  {
+   "name": "stdout",
+   "output_type": "stream",
+   "text": [
+    "Loaded LLaVA model: lmms-lab/llava-onevision-qwen2-7b-ov\n"
+   ]
+  },
+  {
+   "name": "stderr",
+   "output_type": "stream",
+   "text": [
+    "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\n",
+    "You are using a model of type llava to instantiate a model of type llava_qwen. This is not supported for all configurations of models and can yield errors.\n"
+   ]
+  },
+  {
+   "name": "stdout",
+   "output_type": "stream",
+   "text": [
+    "Loading vision tower: google/siglip-so400m-patch14-384\n"
+   ]
+  },
+  {
+   "name": "stderr",
+   "output_type": "stream",
+   "text": [
+    "Loading checkpoint shards: 100%|██████████| 4/4 [00:08<00:00, 2.07s/it]\n"
+   ]
+  },
+  {
+   "name": "stdout",
+   "output_type": "stream",
+   "text": [
+    "Model Class: LlavaQwenForCausalLM\n",
+    "(16, 1024, 576, 3)\n",
+    "The video features a person standing on a stage, dressed in a black shirt and dark pants. A large hand appears from the background, reaching towards the person's pocket. The text 'Source: Joshua AG' is displayed at the top left corner of the frames, and 'EVAN CARMICHAEL' is shown in the top right corner. The text 'Anyone know what this pocket is for?' appears as the hand continues to reach into the pocket. The person then looks down at their pocket, and the text 'I've always wondered that' appears. The hand finally pulls out a small white device labeled 'iPod Nano'. The person holds up the iPod Nano, and the text 'is the new iPod Nano' appears. The video concludes with a close-up of the person holding the iPod Nano, showing it from different angles.\n"
+   ]
+  }
+ ],
  "source": [
+ "from operator import attrgetter\n",
  "from llava.model.builder import load_pretrained_model\n",
  "from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token\n",
  "from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX\n",
@@ -243,41 +300,39 @@
 "import requests\n",
 "import copy\n",
 "import warnings\n",
+"from decord import VideoReader, cpu\n",
 "\n",
 "warnings.filterwarnings(\"ignore\")\n",
 "# Load the OneVision model\n",
-"pretrained = \"lmms-lab/llava-onevision-qwen2-0.5b-ov\"\n",
+"pretrained = \"lmms-lab/llava-onevision-qwen2-7b-ov\"\n",
 "model_name = \"llava_qwen\"\n",
 "device = \"cuda\"\n",
 "device_map = \"auto\"\n",
-"tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)\n",
+"tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map, attn_implementation=\"sdpa\")\n",
 "\n",
 "model.eval()\n",
 "\n",
 "\n",
 "# Function to extract frames from video\n",
-"def extract_frames(video_path, num_frames=8):\n",
-"    cap = cv2.VideoCapture(video_path)\n",
-"    frames = []\n",
-"    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))\n",
-"    indices = np.linspace(0, total_frames - 1, num_frames, dtype=int)\n",
-"\n",
-"    for i in indices:\n",
-"        cap.set(cv2.CAP_PROP_POS_FRAMES, i)\n",
-"        ret, frame = cap.read()\n",
-"        if ret:\n",
-"            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)\n",
-"            frames.append(Image.fromarray(frame))\n",
-"\n",
-"    cap.release()\n",
-"    return frames\n",
+"def load_video(video_path, max_frames_num):\n",
+"    if type(video_path) == str:\n",
+"        vr = VideoReader(video_path, ctx=cpu(0))\n",
+"    else:\n",
+"        vr = VideoReader(video_path[0], ctx=cpu(0))\n",
+"    total_frame_num = len(vr)\n",
+"    uniform_sampled_frames = np.linspace(0, total_frame_num - 1, max_frames_num, dtype=int)\n",
+"    frame_idx = uniform_sampled_frames.tolist()\n",
+"    spare_frames = vr.get_batch(frame_idx).asnumpy()\n",
+"    return spare_frames # (frames, height, width, channels)\n",
 "\n",
 "\n",
 "# Load and process video\n",
 "video_path = \"jobs.mp4\"\n",
-"video_frames = extract_frames(video_path)\n",
-"image_tensors = process_images(video_frames, image_processor, model.config)\n",
-"image_tensors = [_image.to(dtype=torch.float16, device=device) for _image in image_tensors]\n",
+"video_frames = load_video(video_path, 16)\n",
+"print(video_frames.shape) # (16, 1024, 576, 3)\n",
+"image_tensors = []\n",
+"frames = image_processor.preprocess(video_frames, return_tensors=\"pt\")[\"pixel_values\"].half().cuda()\n",
+"image_tensors.append(frames)\n",
 "\n",
 "# Prepare conversation input\n",
 "conv_template = \"qwen_1_5\"\n",
@@ -299,6 +354,7 @@
 "    do_sample=False,\n",
 "    temperature=0,\n",
 "    max_new_tokens=4096,\n",
+"    modalities=[\"video\"],\n",
 ")\n",
 "text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)\n",
 "print(text_outputs[0])"
@@ -307,7 +363,7 @@
 ],
 "metadata": {
  "kernelspec": {
-  "display_name": "llava",
+  "display_name": "Python 3.9.2 64-bit",
   "language": "python",
   "name": "python3"
  },
@@ -322,6 +378,11 @@
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.14"
+ },
+ "vscode": {
+  "interpreter": {
+   "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6"
+  }
  }
 },
 "nbformat": 4,

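Putting this notebook's video-related hunks together, the updated cell amounts to roughly the sketch below: decord frame sampling, SigLIP preprocessing via `image_processor`, and `generate(..., modalities=["video"])`. The conversation-building lines and `image_sizes` are reconstructed from the image tutorial earlier in the notebook and are approximations, not a verbatim copy of the cell.

```python
# Consolidated, hedged sketch of the updated video cell.
import copy

import numpy as np
import torch
from decord import VideoReader, cpu

from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.conversation import conv_templates
from llava.mm_utils import tokenizer_image_token
from llava.model.builder import load_pretrained_model

tokenizer, model, image_processor, _ = load_pretrained_model(
    "lmms-lab/llava-onevision-qwen2-7b-ov", None, "llava_qwen",
    device_map="auto", attn_implementation="sdpa",
)
model.eval()


def load_video(video_path, max_frames_num):
    """Uniformly sample max_frames_num frames; returns a uint8 array (frames, H, W, 3)."""
    vr = VideoReader(video_path, ctx=cpu(0))
    frame_idx = np.linspace(0, len(vr) - 1, max_frames_num, dtype=int).tolist()
    return vr.get_batch(frame_idx).asnumpy()


video_frames = load_video("jobs.mp4", 16)
frames = image_processor.preprocess(video_frames, return_tensors="pt")["pixel_values"].half().cuda()
image_tensors = [frames]
# (width, height) per sampled frame (assumption; the exact value is not shown in the diff)
image_sizes = [(video_frames.shape[2], video_frames.shape[1])] * video_frames.shape[0]

conv = copy.deepcopy(conv_templates["qwen_1_5"])
conv.append_message(conv.roles[0], f"{DEFAULT_IMAGE_TOKEN}\nDescribe what happens in this video.")
conv.append_message(conv.roles[1], None)
input_ids = tokenizer_image_token(
    conv.get_prompt(), tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt"
).unsqueeze(0).to("cuda")

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensors,
        image_sizes=image_sizes,
        do_sample=False,
        temperature=0,
        max_new_tokens=4096,
        modalities=["video"],  # pass a list; llava_arch no longer coerces a bare string
    )
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])
```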
llava/model/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -3,9 +3,9 @@
 AVAILABLE_MODELS = {
     "llava_llama": "LlavaLlamaForCausalLM, LlavaConfig",
     "llava_qwen": "LlavaQwenForCausalLM, LlavaQwenConfig",
-    "llava_qwen_moe": "LlavaQwenMoeForCausalLM, LlavaQwenMoeConfig",
     "llava_mistral": "LlavaMistralForCausalLM, LlavaMistralConfig",
     "llava_mixtral": "LlavaMixtralForCausalLM, LlavaMixtralConfig",
+    # "llava_qwen_moe": "LlavaQwenMoeForCausalLM, LlavaQwenMoeConfig",
     # Add other models as needed
 }
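For context, `AVAILABLE_MODELS` maps a short model key to a `"ClassName, ConfigName"` string, so a registry entry can be resolved lazily and optional backends (such as the now-commented-out `llava_qwen_moe`) do not break package import. The sketch below is an illustration of that pattern only, not the repository's actual loader; the module path in the default argument is an assumption.

```python
# Illustration only: resolving a "ClassName, ConfigName" registry entry lazily.
import importlib

AVAILABLE_MODELS = {
    "llava_llama": "LlavaLlamaForCausalLM, LlavaConfig",
    "llava_qwen": "LlavaQwenForCausalLM, LlavaQwenConfig",
}


def resolve(model_key: str, package: str = "llava.model.language_model"):
    """Import and return the model/config classes named in the registry entry."""
    class_names = [name.strip() for name in AVAILABLE_MODELS[model_key].split(",")]
    module = importlib.import_module(f"{package}.{model_key}")  # assumed module layout
    return tuple(getattr(module, name) for name in class_names)
```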

llava/model/llava_arch.py

Lines changed: 1 addition & 9 deletions
@@ -253,9 +253,6 @@ def prepare_inputs_labels_for_multimodal(self, input_ids, position_ids, attentio
         if vision_tower is None or images is None or input_ids.shape[1] == 1:
             return input_ids, position_ids, attention_mask, past_key_values, None, labels
 
-        if isinstance(modalities, str):
-            modalities = [modalities]
-
         if type(images) is list or images.ndim == 5:
             if type(images) is list:
                 images = [x.unsqueeze(0) if x.ndim == 3 else x for x in images]
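One practical consequence of dropping the string-to-list coercion above: callers of this method should now pass `modalities` as a per-sample list. A tiny illustration (the values are examples, not taken from the repository):

```python
# modalities is expected to be a list with one entry per batch item, e.g.:
modalities = ["video"]             # a single video sample
modalities = ["image", "video"]    # mixed batch: sample 0 is an image, sample 1 a video
# A bare string such as modalities="video" is no longer normalized here.
```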
@@ -265,16 +262,13 @@ def prepare_inputs_labels_for_multimodal(self, input_ids, position_ids, attentio
                 if modalities[_] == "video":
                     video_idx_in_batch.append(_)
 
-            # print(video_idx_in_batch)
-
             images_list = []
             for image in images:
                 if image.ndim == 4:
                     images_list.append(image)
                 else:
                     images_list.append(image.unsqueeze(0))
 
-            # import pdb;pdb.set_trace()
             concat_images = torch.cat([image for image in images_list], dim=0)
             split_sizes = [image.shape[0] for image in images_list]
             encoded_image_features = self.encode_images(concat_images)
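The tail of this hunk shows the batching pattern the method relies on: stack every sample's frames or patches into one tensor, run the vision encoder once, then split the features back per sample (the split itself happens just after the lines shown). A standalone illustration with a dummy encoder standing in for the repo's vision tower:

```python
# Standalone illustration of the concat -> encode once -> split-back pattern.
import torch


def encode_images(pixel_values: torch.Tensor) -> torch.Tensor:
    # Stand-in encoder: one 8-dim feature vector per input frame/image.
    return torch.randn(pixel_values.shape[0], 8)


# e.g. sample 0 is a 16-frame video, sample 1 a single image (3x336x336 here)
images_list = [torch.randn(16, 3, 336, 336), torch.randn(1, 3, 336, 336)]

concat_images = torch.cat(images_list, dim=0)       # (17, 3, 336, 336)
split_sizes = [im.shape[0] for im in images_list]   # [16, 1]
encoded = encode_images(concat_images)              # (17, 8): a single forward pass
per_sample_features = torch.split(encoded, split_sizes, dim=0)
print([f.shape for f in per_sample_features])       # [(16, 8), (1, 8)]
```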
@@ -346,8 +340,6 @@ def prepare_inputs_labels_for_multimodal(self, input_ids, position_ids, attentio
                         new_image_features.append(image_feature.flatten(0, 1))
                     else:
                         raise ValueError(f"Unexpected mm_newline_position: {self.config.mm_newline_position}")
-
-
                 elif image_feature.shape[0] > 1:  # multi patches and multi images operations
                     # rank0_print("Single-images")
                     base_image_feature = image_feature[0]
@@ -595,4 +587,4 @@ def initialize_vision_tokenizer(self, model_args, tokenizer):
         for p in self.get_input_embeddings().parameters():
             p.requires_grad = False
         for p in self.get_output_embeddings().parameters():
-            p.requires_grad = False
\ No newline at end of file
+            p.requires_grad = False

0 commit comments

Comments
 (0)