
Commit 2731972

Merge branch 'main' into yhzhang/video_dev
2 parents 7ee057f + 5fbcf27 commit 2731972

File tree: 5 files changed (+171 / -42 lines)

README.md

Lines changed: 10 additions & 7 deletions
@@ -17,11 +17,11 @@
 
 ## Release Notes
 
-- [2024/08/06] 🔥 **LLaVA-OneVision** is [released](https://llava-vl.github.io/blog/2024-08-05-llava-onevision/). The new 0.5/7/72B model achieves the state-of-the-art level and comparable to most powerful commercial models performance on several single-image, multi-image, and video benchmarks. We benchmarked on a total of 47 benchmarks to comprehensively reflect our model's true capabilities in diverse domains. We also release our training code, and single-image/multi-image data mixture in [LLaVA-OneVision Data](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data)! Our video part data will be released via next upgrade of video specific model, stay tuned! Our training code can be directly used to train on single-image, multi-image and video data.
-- Check our [Paper](https://arxiv.org/abs/2408.03326) for more details and to see our insights on training one model to rule them all.
-- Check our [LLaVA-OneVision Doc](https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/docs/LLaVA_OneVision.md) for inference and evaluation guidance.
-- Check our [Training Scripts](https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/scripts/train) to start training models on single-image/multi-image/video data.
-
+- [2024/08/06] 🔥 **🚀 [LLaVA-OneVision (OV)](https://llava-vl.github.io/blog/2024-08-05-llava-onevision/)!** The new LLaVA-OV models (0.5B/7B/72B) achieve new state-of-the-art performance across single-image, multi-image, and video benchmarks, sometimes rivaling top commercial models on 47 diverse benchmarks. 📄 Explore more:
+  * [[Paper]](https://arxiv.org/abs/2408.03326): In-depth insights and new emerging scenarios, i.e., strong video understanding through task transfer from images.
+  * [[LLaVA-OV Doc]](https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/docs/LLaVA_OneVision.md): Model inference and evaluation guidance.
+  * [[Scripts]](https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/scripts/train): Start training models on your single-image/multi-image/video data.
+
 - [2024/07/16] 🔥 **LLaVA-NeXT-Video** has been upgraded. The new 32B model achieves the best open-source performance on several video benchmarks, including [Video-MME](https://video-mme.github.io/home_page.html#leaderboard). Please refer to [this page](docs/LLaVA-NeXT-Video_0716.md) for details, refer to [llava_next-video_demo](https://huggingface.co/spaces/WildVision/vision-arena) for demo.
 

@@ -96,6 +96,9 @@ pip install -e ".[train]"
 ### Project Navigation
 Please checkout the following page for more inference & evaluation details.
 
+#### - **LLaVA-OneVision: Easy Task Transfer**
+- [LLaVA-OneVision](https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/docs/LLaVA_OneVision.md): for demo inference. The evaluation code is in [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval).
+
 #### - **LLaVA-NeXT: Stronger LLMs Supercharge Multimodal Capabilities in the Wild**
 - [LLaVA-NeXT-Image](./docs/LLaVA-NeXT.md): for image demo inference and evaluation of stronger LMMs using [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval).

@@ -113,9 +116,9 @@ We use [SGLang](https://github.com/sgl-project/sglang) to speed up inference and
 **Prepare Environment**:
 Following the instruction in the [sglang](https://github.com/sgl-project/sglang?tab=readme-ov-file#install)
 
-### LLaVA-NeXT (Image)
+### LLaVA-NeXT/OneVision
 
-Checkout the HTTP Post/Get and SRT usage at [sglang/examples/usage/llava](https://github.com/sgl-project/sglang/blob/main/examples/usage/llava)
+Check out the HTTP Post/Get and SRT usage at [sglang/examples/runtime/llava_onevision](https://github.com/sgl-project/sglang/tree/main/examples/runtime/llava_onevision)
 
 ### LLaVA-NeXT (Video)

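For orientation, here is a minimal, hedged sketch of querying an SGLang server's native `/generate` HTTP endpoint once it is running. The launch command, prompt format, and field values below are illustrative assumptions; the linked `sglang/examples/runtime/llava_onevision` example is the authoritative reference.

```python
# Minimal sketch (assumptions, not the official client): query a locally
# running SGLang server via its native /generate HTTP endpoint.
# Assumed launch (check the sglang docs for model-specific flags such as a chat template):
#   python -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-7b-ov --port 30000
import requests

payload = {
    # Prompt format is illustrative; the real template comes from the server/model config.
    "text": "<image>\nDescribe this image in one sentence.",
    # Hypothetical local file; a URL or base64-encoded image is also commonly accepted.
    "image_data": "example_image.png",
    "sampling_params": {"temperature": 0, "max_new_tokens": 128},
}

response = requests.post("http://127.0.0.1:30000/generate", json=payload)
print(response.json())
```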
docs/LLaVA_OneVision_Tutorials.ipynb

Lines changed: 85 additions & 24 deletions
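The hunks below make two changes to the tutorial notebook: the loading cell now threads `multimodal`/`overwrite_config` kwargs into this repo's `load_pretrained_model`, and the video cell replaces OpenCV frame extraction with decord-based uniform sampling (with a consolidated video sketch after this file's diff). De-escaped from the notebook JSON, the updated loading cell amounts to roughly the following sketch; the model name is taken from elsewhere in the tutorial and the exact kwargs belong to this repository's loader, not a general Transformers API.

```python
# Sketch of the updated loading cell (plain Python, unescaped from the .ipynb diff).
from llava.model.builder import load_pretrained_model

pretrained = "lmms-lab/llava-onevision-qwen2-0.5b-ov"  # model used earlier in the tutorial
model_name = "llava_qwen"
device_map = "auto"

llava_model_args = {
    "multimodal": True,
    # "pad" keeps the aspect ratio by padding images to a square before preprocessing
    "overwrite_config": {"image_aspect_ratio": "pad"},
}

tokenizer, model, image_processor, max_length = load_pretrained_model(
    pretrained, None, model_name, device_map=device_map, **llava_model_args
)
model.eval()
```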
@@ -176,7 +176,13 @@
 "model_name = \"llava_qwen\"\n",
 "device = \"cuda\"\n",
 "device_map = \"auto\"\n",
-"tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)\n",
+"llava_model_args = {\n",
+"    \"multimodal\": True,\n",
+"}\n",
+"overwrite_config = {}\n",
+"overwrite_config[\"image_aspect_ratio\"] = \"pad\"\n",
+"llava_model_args[\"overwrite_config\"] = overwrite_config\n",
+"tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map, **llava_model_args)\n",
 "\n",
 "model.eval()\n",
 "\n",
@@ -227,10 +233,61 @@
 },
 {
  "cell_type": "code",
- "execution_count": null,
+ "execution_count": 1,
  "metadata": {},
- "outputs": [],
+ "outputs": [
+  {
+   "name": "stderr",
+   "output_type": "stream",
+   "text": [
+    "/home/tiger/miniconda3/envs/public_llava/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
+    " from .autonotebook import tqdm as notebook_tqdm\n",
+    "/home/tiger/miniconda3/envs/public_llava/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n",
+    " warnings.warn(\n",
+    "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\n"
+   ]
+  },
+  {
+   "name": "stdout",
+   "output_type": "stream",
+   "text": [
+    "Loaded LLaVA model: lmms-lab/llava-onevision-qwen2-7b-ov\n"
+   ]
+  },
+  {
+   "name": "stderr",
+   "output_type": "stream",
+   "text": [
+    "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\n",
+    "You are using a model of type llava to instantiate a model of type llava_qwen. This is not supported for all configurations of models and can yield errors.\n"
+   ]
+  },
+  {
+   "name": "stdout",
+   "output_type": "stream",
+   "text": [
+    "Loading vision tower: google/siglip-so400m-patch14-384\n"
+   ]
+  },
+  {
+   "name": "stderr",
+   "output_type": "stream",
+   "text": [
+    "Loading checkpoint shards: 100%|██████████| 4/4 [00:08<00:00, 2.07s/it]\n"
+   ]
+  },
+  {
+   "name": "stdout",
+   "output_type": "stream",
+   "text": [
+    "Model Class: LlavaQwenForCausalLM\n",
+    "(16, 1024, 576, 3)\n",
+    "The video features a person standing on a stage, dressed in a black shirt and dark pants. A large hand appears from the background, reaching towards the person's pocket. The text 'Source: Joshua AG' is displayed at the top left corner of the frames, and 'EVAN CARMICHAEL' is shown in the top right corner. The text 'Anyone know what this pocket is for?' appears as the hand continues to reach into the pocket. The person then looks down at their pocket, and the text 'I've always wondered that' appears. The hand finally pulls out a small white device labeled 'iPod Nano'. The person holds up the iPod Nano, and the text 'is the new iPod Nano' appears. The video concludes with a close-up of the person holding the iPod Nano, showing it from different angles.\n"
+   ]
+  }
+ ],
  "source": [
+ "from operator import attrgetter\n",
  "from llava.model.builder import load_pretrained_model\n",
  "from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token\n",
  "from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX\n",
@@ -243,41 +300,39 @@
 "import requests\n",
 "import copy\n",
 "import warnings\n",
+"from decord import VideoReader, cpu\n",
 "\n",
 "warnings.filterwarnings(\"ignore\")\n",
 "# Load the OneVision model\n",
-"pretrained = \"lmms-lab/llava-onevision-qwen2-0.5b-ov\"\n",
+"pretrained = \"lmms-lab/llava-onevision-qwen2-7b-ov\"\n",
 "model_name = \"llava_qwen\"\n",
 "device = \"cuda\"\n",
 "device_map = \"auto\"\n",
-"tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)\n",
+"tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map, attn_implementation=\"sdpa\")\n",
 "\n",
 "model.eval()\n",
 "\n",
 "\n",
 "# Function to extract frames from video\n",
-"def extract_frames(video_path, num_frames=8):\n",
-"    cap = cv2.VideoCapture(video_path)\n",
-"    frames = []\n",
-"    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))\n",
-"    indices = np.linspace(0, total_frames - 1, num_frames, dtype=int)\n",
-"\n",
-"    for i in indices:\n",
-"        cap.set(cv2.CAP_PROP_POS_FRAMES, i)\n",
-"        ret, frame = cap.read()\n",
-"        if ret:\n",
-"            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)\n",
-"            frames.append(Image.fromarray(frame))\n",
-"\n",
-"    cap.release()\n",
-"    return frames\n",
+"def load_video(video_path, max_frames_num):\n",
+"    if type(video_path) == str:\n",
+"        vr = VideoReader(video_path, ctx=cpu(0))\n",
+"    else:\n",
+"        vr = VideoReader(video_path[0], ctx=cpu(0))\n",
+"    total_frame_num = len(vr)\n",
+"    uniform_sampled_frames = np.linspace(0, total_frame_num - 1, max_frames_num, dtype=int)\n",
+"    frame_idx = uniform_sampled_frames.tolist()\n",
+"    spare_frames = vr.get_batch(frame_idx).asnumpy()\n",
+"    return spare_frames # (frames, height, width, channels)\n",
 "\n",
 "\n",
 "# Load and process video\n",
 "video_path = \"jobs.mp4\"\n",
-"video_frames = extract_frames(video_path)\n",
-"image_tensors = process_images(video_frames, image_processor, model.config)\n",
-"image_tensors = [_image.to(dtype=torch.float16, device=device) for _image in image_tensors]\n",
+"video_frames = load_video(video_path, 16)\n",
+"print(video_frames.shape) # (16, 1024, 576, 3)\n",
+"image_tensors = []\n",
+"frames = image_processor.preprocess(video_frames, return_tensors=\"pt\")[\"pixel_values\"].half().cuda()\n",
+"image_tensors.append(frames)\n",
 "\n",
 "# Prepare conversation input\n",
 "conv_template = \"qwen_1_5\"\n",
@@ -299,6 +354,7 @@
 "    do_sample=False,\n",
 "    temperature=0,\n",
 "    max_new_tokens=4096,\n",
+"    modalities=[\"video\"],\n",
 ")\n",
 "text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)\n",
 "print(text_outputs[0])"
@@ -307,7 +363,7 @@
 ],
 "metadata": {
  "kernelspec": {
-  "display_name": "llava",
+  "display_name": "Python 3.9.2 64-bit",
   "language": "python",
   "name": "python3"
  },
@@ -322,6 +378,11 @@
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.14"
+ },
+ "vscode": {
+  "interpreter": {
+   "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6"
+  }
  }
 },
 "nbformat": 4,

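Putting this notebook's video-related hunks together, the updated cell amounts to roughly the sketch below: decord frame sampling, SigLIP preprocessing via `image_processor`, and `generate(..., modalities=["video"])`. The conversation-building lines and `image_sizes` are reconstructed from the image tutorial earlier in the notebook and are approximations, not a verbatim copy of the cell.

```python
# Consolidated, hedged sketch of the updated video cell.
import copy

import numpy as np
import torch
from decord import VideoReader, cpu

from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.conversation import conv_templates
from llava.mm_utils import tokenizer_image_token
from llava.model.builder import load_pretrained_model

tokenizer, model, image_processor, _ = load_pretrained_model(
    "lmms-lab/llava-onevision-qwen2-7b-ov", None, "llava_qwen",
    device_map="auto", attn_implementation="sdpa",
)
model.eval()


def load_video(video_path, max_frames_num):
    """Uniformly sample max_frames_num frames; returns a uint8 array (frames, H, W, 3)."""
    vr = VideoReader(video_path, ctx=cpu(0))
    frame_idx = np.linspace(0, len(vr) - 1, max_frames_num, dtype=int).tolist()
    return vr.get_batch(frame_idx).asnumpy()


video_frames = load_video("jobs.mp4", 16)
frames = image_processor.preprocess(video_frames, return_tensors="pt")["pixel_values"].half().cuda()
image_tensors = [frames]
# (width, height) per sampled frame (assumption; the exact value is not shown in the diff)
image_sizes = [(video_frames.shape[2], video_frames.shape[1])] * video_frames.shape[0]

conv = copy.deepcopy(conv_templates["qwen_1_5"])
conv.append_message(conv.roles[0], f"{DEFAULT_IMAGE_TOKEN}\nDescribe what happens in this video.")
conv.append_message(conv.roles[1], None)
input_ids = tokenizer_image_token(
    conv.get_prompt(), tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt"
).unsqueeze(0).to("cuda")

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensors,
        image_sizes=image_sizes,
        do_sample=False,
        temperature=0,
        max_new_tokens=4096,
        modalities=["video"],  # pass a list; llava_arch no longer coerces a bare string
    )
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])
```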
llava/model/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -3,9 +3,9 @@
 AVAILABLE_MODELS = {
     "llava_llama": "LlavaLlamaForCausalLM, LlavaConfig",
     "llava_qwen": "LlavaQwenForCausalLM, LlavaQwenConfig",
-    "llava_qwen_moe": "LlavaQwenMoeForCausalLM, LlavaQwenMoeConfig",
     "llava_mistral": "LlavaMistralForCausalLM, LlavaMistralConfig",
     "llava_mixtral": "LlavaMixtralForCausalLM, LlavaMixtralConfig",
+    # "llava_qwen_moe": "LlavaQwenMoeForCausalLM, LlavaQwenMoeConfig",
     # Add other models as needed
 }
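For context, `AVAILABLE_MODELS` maps a short model key to a `"ClassName, ConfigName"` string, so a registry entry can be resolved lazily and optional backends (such as the now-commented-out `llava_qwen_moe`) do not break package import. The sketch below is an illustration of that pattern only, not the repository's actual loader; the module path in the default argument is an assumption.

```python
# Illustration only: resolving a "ClassName, ConfigName" registry entry lazily.
import importlib

AVAILABLE_MODELS = {
    "llava_llama": "LlavaLlamaForCausalLM, LlavaConfig",
    "llava_qwen": "LlavaQwenForCausalLM, LlavaQwenConfig",
}


def resolve(model_key: str, package: str = "llava.model.language_model"):
    """Import and return the model/config classes named in the registry entry."""
    class_names = [name.strip() for name in AVAILABLE_MODELS[model_key].split(",")]
    module = importlib.import_module(f"{package}.{model_key}")  # assumed module layout
    return tuple(getattr(module, name) for name in class_names)
```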

llava/model/llava_arch.py

Lines changed: 1 addition & 9 deletions
@@ -253,9 +253,6 @@ def prepare_inputs_labels_for_multimodal(self, input_ids, position_ids, attentio
         if vision_tower is None or images is None or input_ids.shape[1] == 1:
             return input_ids, position_ids, attention_mask, past_key_values, None, labels
 
-        if isinstance(modalities, str):
-            modalities = [modalities]
-
         if type(images) is list or images.ndim == 5:
             if type(images) is list:
                 images = [x.unsqueeze(0) if x.ndim == 3 else x for x in images]
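One practical consequence of dropping the string-to-list coercion above: callers of this method should now pass `modalities` as a per-sample list. A tiny illustration (the values are examples, not taken from the repository):

```python
# modalities is expected to be a list with one entry per batch item, e.g.:
modalities = ["video"]             # a single video sample
modalities = ["image", "video"]    # mixed batch: sample 0 is an image, sample 1 a video
# A bare string such as modalities="video" is no longer normalized here.
```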
@@ -265,16 +262,13 @@ def prepare_inputs_labels_for_multimodal(self, input_ids, position_ids, attentio
                 if modalities[_] == "video":
                     video_idx_in_batch.append(_)
 
-            # print(video_idx_in_batch)
-
             images_list = []
             for image in images:
                 if image.ndim == 4:
                     images_list.append(image)
                 else:
                     images_list.append(image.unsqueeze(0))
 
-            # import pdb;pdb.set_trace()
             concat_images = torch.cat([image for image in images_list], dim=0)
             split_sizes = [image.shape[0] for image in images_list]
             encoded_image_features = self.encode_images(concat_images)
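The tail of this hunk shows the batching pattern the method relies on: stack every sample's frames or patches into one tensor, run the vision encoder once, then split the features back per sample (the split itself happens just after the lines shown). A standalone illustration with a dummy encoder standing in for the repo's vision tower:

```python
# Standalone illustration of the concat -> encode once -> split-back pattern.
import torch


def encode_images(pixel_values: torch.Tensor) -> torch.Tensor:
    # Stand-in encoder: one 8-dim feature vector per input frame/image.
    return torch.randn(pixel_values.shape[0], 8)


# e.g. sample 0 is a 16-frame video, sample 1 a single image (3x336x336 here)
images_list = [torch.randn(16, 3, 336, 336), torch.randn(1, 3, 336, 336)]

concat_images = torch.cat(images_list, dim=0)       # (17, 3, 336, 336)
split_sizes = [im.shape[0] for im in images_list]   # [16, 1]
encoded = encode_images(concat_images)              # (17, 8): a single forward pass
per_sample_features = torch.split(encoded, split_sizes, dim=0)
print([f.shape for f in per_sample_features])       # [(16, 8), (1, 8)]
```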
@@ -346,8 +340,6 @@ def prepare_inputs_labels_for_multimodal(self, input_ids, position_ids, attentio
                         new_image_features.append(image_feature.flatten(0, 1))
                     else:
                         raise ValueError(f"Unexpected mm_newline_position: {self.config.mm_newline_position}")
-
-
                 elif image_feature.shape[0] > 1:  # multi patches and multi images operations
                     # rank0_print("Single-images")
                     base_image_feature = image_feature[0]
@@ -595,4 +587,4 @@ def initialize_vision_tokenizer(self, model_args, tokenizer):
         for p in self.get_input_embeddings().parameters():
             p.requires_grad = False
         for p in self.get_output_embeddings().parameters():
-            p.requires_grad = False
\ No newline at end of file
+            p.requires_grad = False

0 commit comments

Comments
 (0)