
Commit bdac3e9: add LLaMA-Adapter V2.1

1 parent e8180d1

File tree

6 files changed: +257, -4 lines


README.md

Lines changed: 1 addition & 0 deletions
@@ -13,6 +13,7 @@ This repo proposes **LLaMA-Adapter (V2)**, a lightweight adaption method for fin
 Try out the web demo 🤗 of LLaMA-Adapter: [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/csuhan/LLaMA-Adapter), [LLaMA-Adapter V2](http://llama-adapter.opengvlab.com/) and [ImageBind-LLM](http://imagebind-llm.opengvlab.com/).
 
 ## News
+- **[2023.10.11]** We release **LLaMA-Adapter V2.1**, an improved version of LLaMA-Adapter V2 with stronger multi-modal reasoning performance. Check [llama_adapter_v2_multimodal7b](llama_adapter_v2_multimodal7b) for details.
 - **[2023.08.28]** We release quantized LLMs with [OmniQuant](https://github.com/OpenGVLab/OmniQuant), an efficient, accurate, and omnibearing (even extremely low-bit) quantization algorithm. A multimodal version is coming soon. 🔥🔥🔥
 - **[2023.07.24]** We release **[LLaMA2-Accessory](https://github.com/Alpha-VLLM/LLaMA2-Accessory)**, an open-source toolkit for **pre-training**, **fine-tuning**, and **deployment** of **Large Language Models (LLMs)** and **multimodal LLMs**. Please check [Alpha-VLLM/LLaMA2-Accessory](https://github.com/Alpha-VLLM/LLaMA2-Accessory) for more details! 🔥🔥🔥
 - **[2023.07.05]** We release the pretrain/finetune code of [llama_adapter_v2_multimodal7b](https://github.com/OpenGVLab/LLaMA-Adapter/tree/main/llama_adapter_v2_multimodal7b).

llama_adapter_v2_multimodal7b/README.md

Lines changed: 5 additions & 2 deletions
@@ -1,6 +1,7 @@
 # LLaMA-Adapter-V2 Multi-modal
 
 ## News
+* [Oct 11, 2023] Release LLaMA-Adapter V2.1 and evaluation on MME.
 * [July 5, 2023] Release pre-training and fine-tuning code.
 * [May 26, 2023] Initial release.
 
@@ -37,8 +38,8 @@ device = "cuda" if torch.cuda.is_available() else "cpu"
 
 llama_dir = "/path/to/LLaMA/"
 
-# choose from BIAS-7B, LORA-BIAS-7B
-model, preprocess = llama.load("BIAS-7B", llama_dir, device)
+# choose from BIAS-7B, LORA-BIAS-7B, LORA-BIAS-7B-v21
+model, preprocess = llama.load("BIAS-7B", llama_dir, llama_type="7B", device=device)
 model.eval()
 
 prompt = llama.format_prompt("Please introduce this painting.")
@@ -55,6 +56,8 @@ The output will look like the following:
 The painting features a cute white lama, or llama, standing on a wooden floor. The llama is holding a variety of tools and accessories, such as a paintbrush, a pencil, a ruler, a pair of scissors, and a paint can. The llama is dressed in a suit, which adds a touch of sophistication to the scene. The painting is a creative and whimsical representation of a person or animal holding various tools and accessories, making it an interesting and unique piece of art.
 ```
 
+## Evaluation
+Check [eval.md](./docs/eval.md) for details.
 
 ## Online demo

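For context, here is a minimal end-to-end sketch of how the updated `load` call fits into the quick-start flow above. The image path is an illustrative placeholder, and the `generate` call follows the pattern used in `util/evaluate_mme.py` later in this commit; treat it as a sketch, not the canonical demo.

```
import cv2
import llama
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
llama_dir = "/path/to/LLaMA/"

# llama_type is now passed explicitly instead of being parsed from the checkpoint name.
model, preprocess = llama.load("LORA-BIAS-7B-v21", llama_dir, llama_type="7B", device=device)
model.eval()

prompt = llama.format_prompt("Please introduce this painting.")
img = Image.fromarray(cv2.imread("/path/to/painting.jpg"))  # illustrative image path
img = preprocess(img).unsqueeze(0).to(device)

with torch.no_grad():
    result = model.generate(img, [prompt])[0]
print(result)
```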
llama_adapter_v2_multimodal7b/docs/eval.md

Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
1+
# Evaluation on MME Benchmark
2+
3+
[MME](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation) is a comprehensive evaluation benchmark for multimodal large language models. It measures both perception and cognition abilities on a total of 14 subtasks, including existence, count, position, color, poster, celebrity, scene, landmark, artwork, OCR, commonsense reasoning, numerical calculation, text translation, and code reasoning.
4+
5+
## Setup & Evaluation
6+
7+
1. Download MME datasets and `eval_tool` from the [MME repo](https://github.com/bradyfu/awesome-multimodal-large-language-models#our-mllm-works), and put them under `MME_Benchmark_release_version`. Now the folder structure will be:
8+
```
9+
MME_Benchmark_release_version
10+
├── artwork
11+
├── celebrity
12+
├── code_reasoning
13+
├── color
14+
├── commonsense_reasoning
15+
├── count
16+
├── eval_tool
17+
│ ├── calculation.py
18+
│ ├── LaVIN
19+
│ └── Your_Results
20+
├── existence
21+
├── landmark
22+
├── numerical_calculation
23+
├── OCR
24+
├── position
25+
├── posters
26+
├── scene
27+
└── text_translation
28+
```
29+
2. Generate MME results using: `python util/evaluate_mme.py --pretrained_path [MODEL_PATH] --llama_path [LLAMA_DIR] --output_path [RESULT_FILE_PATH]`
30+
3. Evaluate LLaMA-Adapter V2.1 with MME's eval_tool: `python MME_Benchmark_release_version/eval_tool/calculation.py --results_dir [RESULT_FILE_PATH]`
31+
32+
## Results
33+
34+
* **LLaMA-Adapter V2.1**
35+
36+
```
37+
=========== Perception ===========
38+
total score: 1326.0875953396435
39+
40+
existence score: 185.0
41+
count score: 133.33333333333331
42+
position score: 56.666666666666664
43+
color score: 118.33333333333334
44+
posters score: 147.9591836734694
45+
celebrity score: 134.70588235294116
46+
scene score: 156.25
47+
landmark score: 167.8391959798995
48+
artwork score: 123.5
49+
OCR score: 102.5
50+
51+
52+
=========== Cognition ===========
53+
total score: 356.42857142857144
54+
55+
commonsense_reasoning score: 106.42857142857144
56+
numerical_calculation score: 47.5
57+
text_translation score: 112.5
58+
code_reasoning score: 90.0
59+
60+
```

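The result files produced in step 2 above are plain tab-separated text, one file per subtask, in the format written by `util/evaluate_mme.py` (shown later in this commit). A minimal sketch for inspecting one of them; the `results/existence.txt` path is assumed purely for illustration:

```
# Each line: image_path <TAB> question <TAB> ground-truth answer <TAB> model answer
with open("results/existence.txt", encoding="utf-8") as f:
    for line in f:
        image_path, question, gt_answer, answer = line.rstrip("\n").split("\t")
        print(f"{gt_answer:>4} | {answer} | {question}")
```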
llama_adapter_v2_multimodal7b/docs/train.md

Lines changed: 4 additions & 0 deletions
@@ -69,6 +69,10 @@ import os
 from llama.llama_adapter import LLaMA_adapter
 import util.misc as misc
 import util.extract_adapter_from_checkpoint as extract
+from PIL import Image
+import cv2
+import torch
+import llama
 
 device = "cuda" if torch.cuda.is_available() else "cpu"

llama_adapter_v2_multimodal7b/llama/llama_adapter.py

Lines changed: 3 additions & 2 deletions
@@ -279,14 +279,15 @@ def generate(
     "BIAS-7B": "https://github.com/OpenGVLab/LLaMA-Adapter/releases/download/v.2.0.0/7fa55208379faf2dd862565284101b0e4a2a72114d6490a95e432cf9d9b6c813_BIAS-7B.pth",
     "LORA-BIAS-7B": "https://github.com/OpenGVLab/LLaMA-Adapter/releases/download/v.2.0.0/1bcbffc43484332672092e0024a8699a6eb5f558161aebf98a7c6b1db67224d1_LORA-BIAS-7B.pth",
     "CAPTION-7B": "https://github.com/OpenGVLab/LLaMA-Adapter/releases/download/v.2.0.0/5088aeb63a89746b90bcfd5cb819e1c7411b2771b267c6d131ce73e250a8abf0_CAPTION-7B.pth",
+    "LORA-BIAS-7B-v21": "https://github.com/OpenGVLab/LLaMA-Adapter/releases/download/v.2.1.0/427dbc27bf62a3ef7a24ffd3ed2c3162_LORA-BIAS-7B-v21.pth",
     # "LORA16-7B": "",
     # "PARTIAL-7B": ""
 }
 
 def available_models():
     return list(_MODELS.keys())
 
-def load(name, llama_dir, device="cuda" if torch.cuda.is_available() else "cpu", download_root='ckpts', max_seq_len=512,
+def load(name, llama_dir, llama_type="7B", device="cuda" if torch.cuda.is_available() else "cpu", download_root='ckpts', max_seq_len=512,
          phase="finetune"):
     if name in _MODELS:
         model_path = _download(_MODELS[name], download_root)
@@ -296,7 +297,7 @@ def load(name, llama_dir, device="cuda" if torch.cuda.is_available() else "cpu",
         return RuntimeError(f"Model {name} not found; available models = {available_models()}"), None
 
     # BIAS-7B or https://xxx/sha256_BIAS-7B.pth -> 7B
-    llama_type = name.split('.')[0].split('-')[-1]
+    # llama_type = name.split('.')[0].split('-')[-1]
     llama_ckpt_dir = os.path.join(llama_dir, llama_type)
     llama_tokenzier_path = os.path.join(llama_dir, 'tokenizer.model')

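The practical effect of the new `llama_type` argument is that `load()` no longer infers the backbone size from the checkpoint name. A minimal sketch of a call under the new signature, with placeholder paths:

```
import llama

llama_dir = "/path/to/LLaMA/"  # must contain <llama_dir>/7B/ and <llama_dir>/tokenizer.model

# llama_type selects the checkpoint subdirectory, i.e. os.path.join(llama_dir, "7B");
# previously it was derived from the model name (e.g. "BIAS-7B" -> "7B").
model, preprocess = llama.load("LORA-BIAS-7B-v21", llama_dir, llama_type="7B", device="cuda")
```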
llama_adapter_v2_multimodal7b/util/evaluate_mme.py

Lines changed: 184 additions & 0 deletions
@@ -0,0 +1,184 @@
1+
import os
2+
import glob
3+
import argparse
4+
from tqdm import tqdm
5+
import PIL
6+
from PIL import Image
7+
import torch
8+
import torch.distributed as dist
9+
from torch.utils.data import Dataset
10+
import cv2
11+
from llama.llama_adapter import LLaMA_adapter
12+
13+
DATA_DIR = "./MME_Benchmark_release_version"
14+
15+
def get_image(image):
16+
if type(image) is str:
17+
try:
18+
return Image.open(image).convert("RGB")
19+
except Exception as e:
20+
print(f"Fail to read image: {image}")
21+
exit(-1)
22+
elif type(image) is Image.Image:
23+
return image
24+
elif type(image) is PIL.JpegImagePlugin.JpegImageFile:
25+
return image
26+
elif type(image) is PIL.PngImagePlugin.PngImageFile:
27+
return image
28+
elif type(image) is PIL.MpoImagePlugin.MpoImageFile:
29+
return image
30+
else:
31+
raise NotImplementedError(f"Invalid type of Image: {type(image)}")
32+
33+
34+
class MMEDataset(Dataset):
35+
def __init__(
36+
self,
37+
dataset_name
38+
):
39+
self.dataset_name = dataset_name
40+
self.dataset = []
41+
jpg_sets = ["artwork", "celebrity", "color", "count", "existence", "landmark", "OCR", "position", "posters", "scene"]
42+
png_sets = ["code_reasoning", "commonsense_reasoning", "numerical_calculation", "text_translation"]
43+
image_suffix = '.jpg' if dataset_name in jpg_sets else ".png"
44+
45+
assert (dataset_name in jpg_sets) or (dataset_name in png_sets), f"Invalid dataset name for MME benchmark: {dataset_name}"
46+
47+
if os.path.exists(f"{DATA_DIR}/{dataset_name}/images") and os.path.exists(f"{DATA_DIR}/{dataset_name}/questions_answers_YN"):
48+
question_files = os.listdir(f"{DATA_DIR}/{dataset_name}/questions_answers_YN")
49+
for question_file in question_files:
50+
image_file_name = os.path.join(DATA_DIR, dataset_name, "images", question_file.replace('.txt', image_suffix))
51+
with open(os.path.join(DATA_DIR, dataset_name, "questions_answers_YN", question_file), 'r', encoding='utf-8') as f:
52+
for line in f.readlines():
53+
try:
54+
question, gt_answer = line.replace('\n', '').split('\t')
55+
self.dataset.append({
56+
"image_path": image_file_name,
57+
"gt_answers": gt_answer,
58+
"question": question
59+
})
60+
except:
61+
pass
62+
63+
else:
64+
question_files = glob.glob(f"{DATA_DIR}/{dataset_name}/*.txt")
65+
for question_file in question_files:
66+
image_file_name = question_file.replace(".txt", image_suffix)
67+
with open(question_file, 'r', encoding='utf-8') as f:
68+
for line in f.readlines():
69+
try:
70+
question, gt_answer = line.replace('\n', '').split('\t')
71+
self.dataset.append({
72+
"image_path": image_file_name,
73+
"gt_answers": gt_answer,
74+
"question": question
75+
})
76+
except:
77+
pass
78+
79+
def __len__(self):
80+
return len(self.dataset)
81+
82+
def __getitem__(self, idx):
83+
return self.dataset[idx]
84+
85+
86+
def get_args_parser():
87+
parser = argparse.ArgumentParser('Single-turn (conversation) demo', add_help=False)
88+
# Model parameters
89+
parser.add_argument('--llama_path', default='/path/to/llama', type=str,
90+
help='path to LLaMA pretrained checkpoint')
91+
parser.add_argument('--pretrained_path', default='/path/to/pretrained', type=str,
92+
help='directory containing pre-trained checkpoints')
93+
parser.add_argument('--lora', default=16, type=int)
94+
parser.add_argument('--output_path', default='/path/to/output_results', type=str)
95+
return parser
96+
97+
98+
if __name__ == "__main__":
99+
args = get_args_parser().parse_args()
100+
101+
device = "cuda" if torch.cuda.is_available() else "cpu"
102+
103+
llama_dir = args.llama_path
104+
llama_type = '7B'
105+
llama_ckpt_dir = os.path.join(llama_dir, llama_type)
106+
llama_tokenzier_path = os.path.join(llama_dir, 'tokenizer.model')
107+
108+
model_path = args.pretrained_path
109+
# load llama_adapter weights and model_cfg
110+
print(f'Loading LLaMA-Adapter from {model_path}')
111+
ckpt = torch.load(model_path, map_location='cpu')
112+
113+
w_bias = True
114+
w_lora = args.lora > 0
115+
print('Lora:', w_lora)
116+
lora_rank = args.lora
117+
model = LLaMA_adapter(
118+
llama_ckpt_dir, llama_tokenzier_path,
119+
max_seq_len=512, max_batch_size=1,
120+
clip_model='ViT-L/14',
121+
v_embed_dim=768, v_depth=8,
122+
v_num_heads=16, v_mlp_ratio=4.0,
123+
query_len=10, query_layer=31,
124+
w_bias=w_bias,
125+
w_lora=w_lora,
126+
lora_rank=lora_rank,
127+
w_new_gate=w_lora, # for compatibility
128+
phase='finetune')
129+
130+
load_result = model.load_state_dict(ckpt['model'], strict=False)
131+
print(load_result)
132+
133+
model = model.to(device)
134+
model.half()
135+
model.eval()
136+
preprocess = model.clip_transform
137+
138+
prompt_format = (
139+
"Below is an instruction that describes a task. "
140+
"Write a response that appropriately completes the request using a single word or phrase.\n\n"
141+
"### Instruction:\n{instruction}\n\n### Response:"
142+
)
143+
144+
def multi_modal_generate(
145+
img_path: str,
146+
prompt: str,
147+
max_gen_len=30,
148+
temperature: float = 0,
149+
top_p: float = 0.75,
150+
):
151+
img = Image.fromarray(cv2.imread(img_path))
152+
img = preprocess(img).unsqueeze(0).half().to(device)
153+
prompt = prompt_format.format_map({'instruction': prompt})
154+
155+
result = model.generate(img, [prompt],
156+
max_gen_len=max_gen_len,
157+
temperature=temperature,
158+
top_p=top_p)
159+
return result[0]
160+
161+
162+
result = {}
163+
dataset_names = ["artwork", "celebrity", "color", "count", "existence", "OCR", "position", "posters", "scene", "code_reasoning", "commonsense_reasoning", "numerical_calculation", "text_translation", "landmark"] # landmark (03d5e3bfc958be38.jpg)
164+
answer_path = args.output_path
165+
batch_size = 1
166+
167+
print("Starting...")
168+
for dataset_name in dataset_names:
169+
dataset = MMEDataset(dataset_name)
170+
171+
predictions = []
172+
with torch.no_grad():
173+
for data in tqdm(dataset, desc=f"Inferencing {dataset_name}"):
174+
pred = multi_modal_generate(data['image_path'], data['question'])
175+
predictions.append({'image_path': data['image_path'], 'question': data['question'], 'answer': pred, 'gt_answers': data['gt_answers']})
176+
177+
os.makedirs(answer_path, exist_ok=True)
178+
prediction_file = os.path.join(answer_path, f"{dataset_name}.txt")
179+
out_datas = [
180+
f"{data['image_path']}\t{data['question']}\t{data['gt_answers']}\t{data['answer']}"
181+
for data in predictions
182+
]
183+
with open(prediction_file, 'w') as f:
184+
f.write('\n'.join(out_datas))

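As a quick sanity check of the data loading above, `MMEDataset` can be instantiated on its own, assuming `MME_Benchmark_release_version` sits in the working directory (as the `DATA_DIR` constant expects) and that the module is importable as `util.evaluate_mme`:

```
from util.evaluate_mme import MMEDataset  # import path assumed from the run command in eval.md

ds = MMEDataset("existence")
print(len(ds))  # number of (image, question) pairs in this subtask
print(ds[0])    # {'image_path': ..., 'gt_answers': ..., 'question': ...}
```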