
Commit 2aaa3ab

support add_answer for vlm models (#213)
1 parent 9b91523 commit 2aaa3ab

12 files changed: +99 -23 lines

README.md

Lines changed: 4 additions & 2 deletions
@@ -110,7 +110,7 @@ docker pull registry.cn-hangzhou.aliyuncs.com/yongyang/llmcompression:pure-lates

 - 💥**Supported Formats**: Supports both ✨`quantization` (integer and floating-point) and ✨`sparsity`, specifically including ✅weight-activation, ✅weight-only, ✅mixed-precision quantization, as well as ✅structured and ✅unstructured sparsity.

-- 💥**Wide Model Support**: Offers support for a diverse array of ✨`LLM models`, including ✅LLama, ✅Mistral, ✅InternLM2, ✅Qwen2, among others, as well as ✅MOE(DeepSeekv2, Deepseekv2.5) and ✅VLM(Llama3.2-vision, Qwen-vl) models (see [Supported Model List](#supported-model-list)).
+- 💥**Wide Model Support**: Offers support for a diverse array of ✨`LLM models`, including ✅LLama, ✅Mistral, ✅InternLM2, ✅Qwen2, among others, as well as ✅MOE(DeepSeekv2, Deepseekv2.5) and ✅VLM(Llama3.2-vision, Qwen2-vl) models (see [Supported Model List](#supported-model-list)).

 - 💥**Multi-backend Compatibility**: Seamlessly integrates with various backends for enhanced deployment flexibility. Multiple quantization settings and model formats are compatible with a wide range of backends and hardware platforms, such as ✅VLLM, ✅Sglang, ✅LightLLM, ✅MLC-LLM, and ✅AutoAWQ, making it highly versatile(see Section `Backend` [here](https://llmc-en.readthedocs.io/en/latest/)).

@@ -166,7 +166,9 @@ Please refer to the 🚀`Quick Start` section in the [documentation](https://llm

 [Qwen MOE](https://huggingface.co/Qwen/Qwen1.5-MoE-A2.7B)

-[Qwen-VL](https://huggingface.co/Qwen/Qwen-VL)
+[Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct)
+
+[InternVL2](https://huggingface.co/OpenGVLab/InternVL2-2B)

 You can add your own model type referring to files under `llmc/models/*.py`.

README_ja.md

Lines changed: 4 additions & 2 deletions
@@ -108,7 +108,7 @@ docker pull registry.cn-hangzhou.aliyuncs.com/yongyang/llmcompression:pure-lates

 - 💥**Supported Formats**: Supports both ✨`quantization` (integer and floating-point) and ✨`sparsity`, specifically including ✅weight-activation, ✅weight-only, ✅mixed-precision quantization, as well as ✅structured and ✅unstructured sparsity.

-- 💥**Wide Model Support**: Supports a diverse range of ✨`LLM models`, including ✅LLama, ✅Mistral, ✅InternLM2, ✅Qwen2, among others, as well as ✅MOE(DeepSeekv2, Deepseekv2.5) and ✅VLM(Llama3.2-vision, Qwen-vl) models (see the [Supported Model List](#supported-model-list)).
+- 💥**Wide Model Support**: Supports a diverse range of ✨`LLM models`, including ✅LLama, ✅Mistral, ✅InternLM2, ✅Qwen2, among others, as well as ✅MOE(DeepSeekv2, Deepseekv2.5) and ✅VLM(Llama3.2-vision, Qwen2-vl) models (see the [Supported Model List](#supported-model-list)).

 - 💥**Multi-backend Compatibility**: Seamlessly integrates with multiple backends for enhanced deployment flexibility. A variety of quantization settings and model formats are compatible with a wide range of backends and hardware platforms, such as ✅VLLM, ✅Sglang, ✅LightLLM, ✅MLC-LLM, and ✅AutoAWQ, offering great flexibility (see the `Backend` section [here](https://llmc-en.readthedocs.io/en/latest/)).

@@ -164,7 +164,9 @@ docker pull registry.cn-hangzhou.aliyuncs.com/yongyang/llmcompression:pure-lates

 [Qwen MOE](https://huggingface.co/Qwen/Qwen1.5-MoE-A2.7B)

-[Qwen-VL](https://huggingface.co/Qwen/Qwen-VL)
+[Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct)
+
+[InternVL2](https://huggingface.co/OpenGVLab/InternVL2-2B)

 To add your own model type, refer to the files under `llmc/models/*.py`.

README_zh.md

Lines changed: 4 additions & 2 deletions
@@ -108,7 +108,7 @@ docker pull registry.cn-hangzhou.aliyuncs.com/yongyang/llmcompression:pure-lates

 - 💥**Supported Formats**: Supports ✨`quantization` (integer and floating-point) and ✨`sparsification`, specifically including ✅weight-activation quantization, ✅weight-only quantization, ✅mixed-precision quantization, as well as ✅structured and ✅unstructured sparsification.

-- 💥**Wide Model Support**: Supports a wide variety of ✨`LLM models`, including ✅LLama, ✅Mistral, ✅InternLM2, ✅Qwen2, and more, as well as ✅MOE(DeepSeekv2, Deepseekv2.5) and ✅VLM(Llama3.2-vision, Qwen-vl) models (see the [Supported Model List](#supported-model-list)).
+- 💥**Wide Model Support**: Supports a wide variety of ✨`LLM models`, including ✅LLama, ✅Mistral, ✅InternLM2, ✅Qwen2, and more, as well as ✅MOE(DeepSeekv2, Deepseekv2.5) and ✅VLM(Llama3.2-vision, Qwen2-vl) models (see the [Supported Model List](#supported-model-list)).

 - 💥**Multi-backend Compatibility**: Seamlessly integrates with multiple backends for greater deployment flexibility. A variety of quantization settings and model formats are compatible with a wide range of backends and hardware platforms, such as ✅VLLM, ✅Sglang, ✅LightLLM, ✅MLC-LLM, and ✅AutoAWQ, making it highly flexible (see the ✨`Backend` section [here](https://llmc-zhcn.readthedocs.io/en/latest/)).

@@ -164,7 +164,9 @@ docker pull registry.cn-hangzhou.aliyuncs.com/yongyang/llmcompression:pure-lates

 [Qwen MOE](https://huggingface.co/Qwen/Qwen1.5-MoE-A2.7B)

-[Qwen-VL](https://huggingface.co/Qwen/Qwen-VL)
+[Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct)
+
+[InternVL2](https://huggingface.co/OpenGVLab/InternVL2-2B)

 You can refer to the files under `llmc/models/*.py` to add your own model type.

configs/quantization/methods/Awq/awq_w_only_custom_vlm_data_padding.yml

Lines changed: 1 addition & 0 deletions
@@ -10,6 +10,7 @@ calib:
     type: img_txt
     download: False
     path: calib data path
+    add_answer: False # Default is False. If set to True, answers are appended to the calib data.
     n_samples: 3
     bs: -1
     seq_len: 512
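
The new `add_answer` flag only takes effect when the calibration set actually carries answers. Judging from the changes further down (`vlm_general` reads `img_qa.json` from the calib path, and the model classes consume `img`, `question`, and `answer` fields), a custom VLM calibration folder could be prepared roughly as in this sketch; the directory layout and Q/A content are illustrative assumptions, and the exact `img` schema (single path vs. list of paths) may differ per model.

```python
import json
import os

# Hedged sketch of a custom VLM calibration set, inferred from this commit:
# `vlm_general` loads `img_qa.json` from the calib path, and each record
# exposes 'img', 'question', and 'answer'. Paths and Q/A text are made up.
calib_dir = 'calib data path'  # placeholder, as in the config above
entries = [
    {
        'img': os.path.join(calib_dir, 'images/0001.jpg'),
        'question': '<image>\nWhat is shown in this picture?',
        'answer': 'A cat sitting on a windowsill.',
    },
    {
        'img': os.path.join(calib_dir, 'images/0002.jpg'),
        'question': '<image>\nHow many people are in the image?',
        'answer': 'Two.',
    },
]

os.makedirs(calib_dir, exist_ok=True)
with open(os.path.join(calib_dir, 'img_qa.json'), 'w') as fp:
    json.dump(entries, fp, indent=2)
```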
Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
+base:
+    seed: &seed 42
+model:
+    type: model_type
+    path: model path
+    tokenizer_mode: slow
+    torch_dtype: auto
+calib:
+    name: vlm_datastes
+    type: img_txt
+    download: False
+    path: calib data path
+    add_answer: False # Default is False. If set to True, answers are appended to the calib data.
+    n_samples: 3
+    bs: -1
+    seq_len: 512
+    preproc: vlm_general
+    padding: True
+    seed: *seed
+eval:
+    eval_pos: [pretrain, fake_quant]
+    type: img_txt
+    name: MME
+    download: False
+    path: MME dataset path
+    bs: 16
+    inference_per_block: False
+quant:
+    method: Awq
+    weight:
+        bit: 4
+        symmetric: False
+        granularity: per_group
+        group_size: 128
+    special:
+        trans: True
+        # The options for "trans_version" include "v1" and "v2".
+        # But their results don't differ significantly.
+        trans_version: v2
+        weight_clip: True
+        # For 2-bit quantization, setting "clip_sym: False" will yield better results.
+        clip_sym: False
+save:
+    save_trans: False
+    save_fake: False
+    save_path: /path/to/save/
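
The flag is optional in either config: the model code in this commit reads it with `self.config['calib'].get('add_answer', False)`, so omitting it keeps the old behaviour. A minimal sketch of that access pattern, assuming the config is loaded with PyYAML (llmc's own config loader may differ):

```python
import yaml

# Load a config like the ones above; the &seed/*seed anchors resolve automatically.
with open('configs/quantization/methods/Awq/awq_w_only_custom_vlm_data_padding.yml') as f:
    config = yaml.safe_load(f)

# Same fallback the model classes use: an absent flag means False (answers not added).
add_answer = config['calib'].get('add_answer', False)
print('add_answer enabled:', add_answer)
```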

llmc/data/dataset/base_dataset.py

Lines changed: 3 additions & 5 deletions
@@ -111,8 +111,6 @@ def get_calib_samples(self):
             preproc = PREPROC_REGISTRY[self.preproc]
             samples = preproc(
                 self.calib_dataset,
-                self.tokenizer,
-                self.batch_process,
                 self.n_samples
             )
         else:
@@ -222,15 +220,15 @@ def txt_group_samples_wo_mask(self, samples): # without mask
     def img_txt_group_samples_with_mask(self, samples):
         calib_samples = []
         if self.calib_bs < 0:
-            calib_samples.append(self.batch_process(samples))
+            calib_samples.append(self.batch_process(samples, calib_or_eval='calib'))
         elif self.calib_bs == 1:
-            calib_samples = [self.batch_process([sample]) for sample in samples]
+            calib_samples = [self.batch_process([sample], calib_or_eval='calib') for sample in samples]  # noqa
         elif self.calib_bs > 1:
             for i in range(0, len(samples), self.calib_bs):
                 start = i
                 end = min(i + self.calib_bs, len(samples))
                 batch = samples[start:end]
-                calib_samples.append(self.batch_process(batch))
+                calib_samples.append(self.batch_process(batch, calib_or_eval='calib'))
         return calib_samples

     def img_group_samples_wo_mask(self, samples):  # without mask
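
The `calib_or_eval` keyword defaults to `'eval'`, so evaluation callers that still invoke `batch_process(img_qas)` are untouched; only the calibration grouping above opts in. As a self-contained restatement (not the library code itself), the three `calib_bs` regimes map onto `batch_process` calls like this:

```python
def group_calib_samples(samples, calib_bs, batch_process):
    """Illustrative restatement of the dispatch above, not llmc's own code.

    calib_bs < 0  -> one batch containing every sample
    calib_bs == 1 -> one batch per sample
    calib_bs > 1  -> fixed-size chunks of `calib_bs` samples
    Each call passes calib_or_eval='calib' so models can apply
    calibration-only behaviour such as `add_answer`.
    """
    calib_samples = []
    if calib_bs < 0:
        calib_samples.append(batch_process(samples, calib_or_eval='calib'))
    elif calib_bs == 1:
        calib_samples = [batch_process([s], calib_or_eval='calib') for s in samples]
    elif calib_bs > 1:
        for start in range(0, len(samples), calib_bs):
            batch = samples[start:start + calib_bs]
            calib_samples.append(batch_process(batch, calib_or_eval='calib'))
    return calib_samples
```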

llmc/data/dataset/specified_preproc.py

Lines changed: 1 addition & 1 deletion
@@ -102,7 +102,7 @@ def pileval_omni(calib_dataset, tokenizer, n_samples, seq_len):


 @PREPROC_REGISTRY
-def vlm_general(calib_dataset, tokenizer, batch_process, n_samples):
+def vlm_general(calib_dataset, n_samples):
     img_qa_json = os.path.join(calib_dataset, 'img_qa.json')
     fp = open(img_qa_json)
     img_qas = json.load(fp)
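
Only the head of the simplified `vlm_general` is visible in this hunk. A plausible end-to-end reading of the new signature, with the sub-sampling step being an assumption rather than the actual implementation:

```python
import json
import os


def vlm_general_sketch(calib_dataset, n_samples):
    # With `tokenizer` and `batch_process` removed from the signature, the
    # preproc only loads the raw image/question/answer records; batching and
    # prompt construction now happen later in the model's batch_process.
    img_qa_json = os.path.join(calib_dataset, 'img_qa.json')
    with open(img_qa_json) as fp:
        img_qas = json.load(fp)
    return img_qas[:n_samples]  # assumption: plain head-truncation to n_samples
```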

llmc/eval/eval_vlm.py

Lines changed: 0 additions & 5 deletions
@@ -35,11 +35,6 @@ def load_mme(self):
         return img_qas

     def patch_datasets(self, model_type):
-        if self.dataset == 'MME':
-            if model_type == 'InternVL2':
-                for idx in range(len(self.img_qas)):
-                    if '<image>\n' not in self.img_qas[idx]['question']:
-                        self.img_qas[idx]['question'] = '<image>\n' + self.img_qas[idx]['question']
         if model_type == 'InternVL2':
             self.output_include_input = False
         elif model_type == 'Llava':

llmc/models/internvl2.py

Lines changed: 8 additions & 1 deletion
@@ -137,8 +137,10 @@ def build_model(self):
             'Besides, you can also put the <image> into your calib dataset.'
         )

-    def batch_process(self, img_qas):
+    def batch_process(self, img_qas, calib_or_eval='eval'):
+        assert calib_or_eval == 'calib' or calib_or_eval == 'eval'
         questions = []
+        answers = []
         pixel_values_list = []
         num_patches_list = []
         for idx in range(len(img_qas)):
@@ -166,6 +168,7 @@ def batch_process(self, img_qas):
             else:
                 assert img_qas[idx]['question'].count('<image>') == len(img_path), f"{img_qas[idx]['img']} this data prompt is wrong."  # noqa
             questions.append(img_qas[idx]['question'])
+            answers.append(img_qas[idx]['answer'] + '<|im_end|>')

         pixel_values = (
             torch.cat(pixel_values_list, dim=0) if len(pixel_values_list) > 0 else None
@@ -189,6 +192,10 @@ def batch_process(self, img_qas):
             template.append_message(template.roles[0], question)
             template.append_message(template.roles[1], None)
             query = template.get_prompt()
+            if calib_or_eval == 'calib' and self.config['calib'].get('add_answer', False):
+                query += answers[idx]
+            if calib_or_eval == 'calib':
+                logger.info(f'Calib data is:\n{query}')
             for _num_patches_i in num_patches:
                 image_tokens = IMG_START_TOKEN + IMG_CONTEXT_TOKEN * self.vlm_model.num_image_token * _num_patches_i + IMG_END_TOKEN  # noqa
                 query = query.replace('<image>', image_tokens, 1)
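
The effect of `add_answer: True` on an InternVL2 calibration sample, shown schematically: the real prompt comes from `template.get_prompt()` above and its exact markup differs from this made-up example, but the appended suffix (ground-truth answer plus the `<|im_end|>` terminator) is what the diff adds.

```python
# Schematic only: the chat markup here is invented for illustration.
question = '<image>\nWhat is shown in this picture?'
answer = 'A cat sitting on a windowsill.'

# Without add_answer, the calibration prompt stops at the assistant turn:
prompt_without_answer = (
    f'<|im_start|>user\n{question}<|im_end|>\n<|im_start|>assistant\n'
)
# With add_answer: True, the ground-truth answer plus '<|im_end|>' is appended,
# so calibration also covers the tokens the model would normally generate:
prompt_with_answer = prompt_without_answer + answer + '<|im_end|>'
print(prompt_with_answer)
```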

llmc/models/llava.py

Lines changed: 14 additions & 3 deletions
@@ -33,9 +33,11 @@ def build_model(self):

         self.processor = AutoProcessor.from_pretrained(self.model_path)

-    def batch_process(self, img_qas):
+    def batch_process(self, img_qas, calib_or_eval='eval'):
+        assert calib_or_eval == 'calib' or calib_or_eval == 'eval'
         messages = []
         images = []
+        answers = []
         for idx in range(len(img_qas)):
             img_path = img_qas[idx]['img']
             image = Image.open(img_path)
@@ -50,10 +52,19 @@ def batch_process(self, img_qas):
             ]
             messages.append(message)
             images.append(image)
+            answers.append(img_qas[idx]['answer'])
         texts = [
-            self.processor.apply_chat_template(msg, add_generation_prompt=True)
-            for msg in messages
+            self.processor.apply_chat_template(messages[n], add_generation_prompt=True)
+            for n in range(len(messages))
         ]
+        if calib_or_eval == 'calib' and self.config['calib'].get('add_answer', False):
+            texts = [
+                texts[n] + ' ' + answers[n]
+                for n in range(len(texts))
+            ]
+        if calib_or_eval == 'calib':
+            logger.info(f'Calib data is:\n{texts}')
+
         inputs = self.processor(
             text=texts,
             images=images,
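
Taken together, the InternVL2 and Llava changes define the contract a new VLM wrapper under `llmc/models/*.py` would need to honour after this commit. A hedged skeleton follows; the class name and the `build_prompt` helper are illustrative placeholders rather than llmc APIs, and a real wrapper would return processed tensors rather than strings.

```python
from loguru import logger


class MyVLMSketch:
    """Illustrative skeleton only; real llmc model classes also build the
    tokenizer/processor in build_model() and return model-ready inputs."""

    def __init__(self, config):
        self.config = config

    def batch_process(self, img_qas, calib_or_eval='eval'):
        # Defaulting to 'eval' keeps existing evaluation callers unchanged;
        # only the calibration path passes calib_or_eval='calib'.
        assert calib_or_eval in ('calib', 'eval')
        add_answer = (
            calib_or_eval == 'calib'
            and self.config['calib'].get('add_answer', False)
        )
        texts = []
        for item in img_qas:
            prompt = self.build_prompt(item['img'], item['question'])
            if add_answer:
                # Llava appends ' ' + answer; InternVL2 also adds '<|im_end|>'.
                prompt += ' ' + item['answer']
            texts.append(prompt)
        if calib_or_eval == 'calib':
            logger.info(f'Calib data is:\n{texts}')
        return texts

    def build_prompt(self, img, question):
        # Placeholder: apply the model's own chat/image template here.
        return question
```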
