diff --git a/README.md b/README.md
index 3a0849d17..a64a5a34e 100644
--- a/README.md
+++ b/README.md
@@ -110,7 +110,7 @@ docker pull registry.cn-hangzhou.aliyuncs.com/yongyang/llmcompression:pure-lates
 
 - 💥**Supported Formats**: Supports both ✨`quantization` (integer and floating-point) and ✨`sparsity`, specifically including ✅weight-activation, ✅weight-only, ✅mixed-precision quantization, as well as ✅structured and ✅unstructured sparsity.
 
-- 💥**Wide Model Support**: Offers support for a diverse array of ✨`LLM models`, including ✅LLama, ✅Mistral, ✅InternLM2, ✅Qwen2, among others, as well as ✅MOE(DeepSeekv2, Deepseekv2.5) and ✅VLM(Llama3.2-vision, Qwen-vl) models (see [Supported Model List](#supported-model-list)).
+- 💥**Wide Model Support**: Offers support for a diverse array of ✨`LLM models`, including ✅LLama, ✅Mistral, ✅InternLM2, ✅Qwen2, among others, as well as ✅MOE(DeepSeekv2, Deepseekv2.5) and ✅VLM(Llama3.2-vision, Qwen2-vl) models (see [Supported Model List](#supported-model-list)).
 
 - 💥**Multi-backend Compatibility**: Seamlessly integrates with various backends for enhanced deployment flexibility. Multiple quantization settings and model formats are compatible with a wide range of backends and hardware platforms, such as ✅VLLM, ✅Sglang, ✅LightLLM, ✅MLC-LLM, and ✅AutoAWQ, making it highly versatile(see Section `Backend` [here](https://llmc-en.readthedocs.io/en/latest/)).
 
@@ -166,7 +166,9 @@ Please refer to the 🚀`Quick Start` section in the [documentation](https://llm
 
 ✅ [Qwen MOE](https://huggingface.co/Qwen/Qwen1.5-MoE-A2.7B)
 
-✅ [Qwen-VL](https://huggingface.co/Qwen/Qwen-VL)
+✅ [Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct)
+
+✅ [InternVL2](https://huggingface.co/OpenGVLab/InternVL2-2B)
 
 You can add your own model type referring to files under `llmc/models/*.py`.
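For reference, the VLM calibration data consumed by the rest of this patch is an `img_qa.json` file stored under the calib `path` directory (it is loaded by `vlm_general` in `llmc/data/dataset/specified_preproc.py` below), and each entry carries the `img`, `question`, and `answer` keys read by the models' `batch_process` methods. Below is a minimal sketch of writing such a file; the directory, image path, and QA text are placeholders, and the exact `img` format (single path, list of paths, or null for text-only samples) should be checked against the target model's `batch_process`.

```python
# Minimal sketch of a VLM calibration file; the keys mirror what the
# batch_process methods in this patch read ('img', 'question', 'answer').
# Paths and QA text below are placeholders, not real data.
import json
import os

calib_dir = 'calib_data'  # stands in for `path:` in the calib config

sample = {
    'img': 'images/0001.jpg',                   # image for this sample
    'question': 'What is shown in the image?',  # user prompt
    'answer': 'A cat sitting on a chair.',      # used only when add_answer is True
}

os.makedirs(calib_dir, exist_ok=True)
with open(os.path.join(calib_dir, 'img_qa.json'), 'w') as f:
    json.dump([sample], f, indent=2, ensure_ascii=False)
```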
diff --git a/README_ja.md b/README_ja.md
index f48871d94..0ddcd89a2 100644
--- a/README_ja.md
+++ b/README_ja.md
@@ -108,7 +108,7 @@ docker pull registry.cn-hangzhou.aliyuncs.com/yongyang/llmcompression:pure-lates
 
 - 💥**サポートされているフォーマット**: ✨`量子化`(整数および浮動小数点)と ✨`疎性` の両方をサポートし、具体的には ✅重量-活性化、✅重量のみ、✅混合精度量子化、および ✅構造化疎性 と ✅非構造化疎性 を含みます。
 
-- 💥**広範なモデルサポート**: 多様な ✨`LLMモデル` をサポートしており、✅LLama、✅Mistral、✅InternLM2、✅Qwen2 など、さらに ✅✅MOE(DeepSeekv2, Deepseekv2.5) モデルや ✅VLM(Llama3.2-vision, Qwen-vl) モデルもサポートしています([サポートされているモデルリスト](#supported-model-list)を参照してください)。
+- 💥**広範なモデルサポート**: 多様な ✨`LLMモデル` をサポートしており、✅LLama、✅Mistral、✅InternLM2、✅Qwen2 など、さらに ✅✅MOE(DeepSeekv2, Deepseekv2.5) モデルや ✅VLM(Llama3.2-vision, Qwen2-vl) モデルもサポートしています([サポートされているモデルリスト](#supported-model-list)を参照してください)。
 
 - 💥**マルチバックエンドの互換性**: 複数のバックエンドとシームレスに統合し、展開の柔軟性を強化します。さまざまな量子化設定およびモデルフォーマットが、✅VLLM、✅Sglang、✅LightLLM、✅MLC-LLM、✅AutoAWQ など、幅広いバックエンドおよびハードウェアプラットフォームと互換性があり、高い柔軟性を実現しています(`Backend`セクションは[こちら](https://llmc-en.readthedocs.io/en/latest/)をご覧ください)。
 
@@ -164,7 +164,9 @@ docker pull registry.cn-hangzhou.aliyuncs.com/yongyang/llmcompression:pure-lates
 
 ✅ [Qwen MOE](https://huggingface.co/Qwen/Qwen1.5-MoE-A2.7B)
 
-✅ [Qwen-VL](https://huggingface.co/Qwen/Qwen-VL)
+✅ [Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct)
+
+✅ [InternVL2](https://huggingface.co/OpenGVLab/InternVL2-2B)
 
 独自のモデルタイプを追加するには、`llmc/models/*.py` ファイルを参照してください。
diff --git a/README_zh.md b/README_zh.md
index 8ff248a46..83740de94 100644
--- a/README_zh.md
+++ b/README_zh.md
@@ -108,7 +108,7 @@ docker pull registry.cn-hangzhou.aliyuncs.com/yongyang/llmcompression:pure-lates
 
 - 💥**支持的格式**: 支持 ✨`量化`(整型和浮点)和 ✨`稀疏化`,具体包括 ✅权重激活量化、✅权重量化、✅混合精度量化,以及 ✅结构化 和 ✅非结构化稀疏化。
 
-- 💥**广泛模型支持**: 支持多种 ✨`LLM模型`,包括 ✅LLama、✅Mistral、✅InternLM2、✅Qwen2 等,以及 ✅MOE(DeepSeekv2, Deepseekv2.5) 和 ✅VLM(Llama3.2-vision, Qwen-vl) 模型(参见[支持的模型列表](#supported-model-list))。
+- 💥**广泛模型支持**: 支持多种 ✨`LLM模型`,包括 ✅LLama、✅Mistral、✅InternLM2、✅Qwen2 等,以及 ✅MOE(DeepSeekv2, Deepseekv2.5) 和 ✅VLM(Llama3.2-vision, Qwen2-vl) 模型(参见[支持的模型列表](#supported-model-list))。
 
 - 💥**多后端兼容性**: 无缝集成多个后端,增强部署灵活性。多种量化设置和模型格式兼容广泛的后端和硬件平台,例如 ✅VLLM、✅Sglang、✅LightLLM、✅MLC-LLM 和 ✅AutoAWQ,使其高度灵活(参见✨`推理后端` 章节 [此处](https://llmc-zhcn.readthedocs.io/en/latest/))。
 
@@ -164,7 +164,9 @@ docker pull registry.cn-hangzhou.aliyuncs.com/yongyang/llmcompression:pure-lates
 
 ✅ [Qwen MOE](https://huggingface.co/Qwen/Qwen1.5-MoE-A2.7B)
 
-✅ [Qwen-VL](https://huggingface.co/Qwen/Qwen-VL)
+✅ [Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct)
+
+✅ [InternVL2](https://huggingface.co/OpenGVLab/InternVL2-2B)
 
 你可以参考 `llmc/models/*.py` 文件添加自己的模型类型。
diff --git a/configs/quantization/methods/Awq/awq_w_only_custom_vlm_data_padding.yml b/configs/quantization/methods/Awq/awq_w_only_custom_vlm_data_padding.yml
index 59cb20d05..ac6d4f1ba 100644
--- a/configs/quantization/methods/Awq/awq_w_only_custom_vlm_data_padding.yml
+++ b/configs/quantization/methods/Awq/awq_w_only_custom_vlm_data_padding.yml
@@ -10,6 +10,7 @@ calib:
     type: img_txt
     download: False
     path: calib data path
+    add_answer: False # Default is False. If set to True, answers will be added to the calib data.
     n_samples: 3
     bs: -1
     seq_len: 512
diff --git a/configs/quantization/methods/Awq/awq_w_only_custom_vlm_data_padding_eval_mme.yml b/configs/quantization/methods/Awq/awq_w_only_custom_vlm_data_padding_eval_mme.yml
new file mode 100644
index 000000000..3d57f1b96
--- /dev/null
+++ b/configs/quantization/methods/Awq/awq_w_only_custom_vlm_data_padding_eval_mme.yml
@@ -0,0 +1,46 @@
+base:
+    seed: &seed 42
+model:
+    type: model_type
+    path: model path
+    tokenizer_mode: slow
+    torch_dtype: auto
+calib:
+    name: vlm_datastes
+    type: img_txt
+    download: False
+    path: calib data path
+    add_answer: False # Default is False. If set to True, answers will be added to the calib data.
+    n_samples: 3
+    bs: -1
+    seq_len: 512
+    preproc: vlm_general
+    padding: True
+    seed: *seed
+eval:
+    eval_pos: [pretrain, fake_quant]
+    type: img_txt
+    name: MME
+    download: False
+    path: MME dataset path
+    bs: 16
+    inference_per_block: False
+quant:
+    method: Awq
+    weight:
+        bit: 4
+        symmetric: False
+        granularity: per_group
+        group_size: 128
+    special:
+        trans: True
+        # The options for "trans_version" include "v1" and "v2".
+        # But their results don't differ significantly.
+        trans_version: v2
+        weight_clip: True
+        # For 2-bit quantization, setting "clip_sym: False" will yield better results.
+        clip_sym: False
+save:
+    save_trans: False
+    save_fake: False
+    save_path: /path/to/save/
diff --git a/llmc/data/dataset/base_dataset.py b/llmc/data/dataset/base_dataset.py
index f4c79e69f..98388aa8d 100644
--- a/llmc/data/dataset/base_dataset.py
+++ b/llmc/data/dataset/base_dataset.py
@@ -111,8 +111,6 @@ def get_calib_samples(self):
             preproc = PREPROC_REGISTRY[self.preproc]
             samples = preproc(
                 self.calib_dataset,
-                self.tokenizer,
-                self.batch_process,
                 self.n_samples
             )
         else:
@@ -222,15 +220,15 @@ def txt_group_samples_wo_mask(self, samples):  # without mask
     def img_txt_group_samples_with_mask(self, samples):
         calib_samples = []
         if self.calib_bs < 0:
-            calib_samples.append(self.batch_process(samples))
+            calib_samples.append(self.batch_process(samples, calib_or_eval='calib'))
         elif self.calib_bs == 1:
-            calib_samples = [self.batch_process([sample]) for sample in samples]
+            calib_samples = [self.batch_process([sample], calib_or_eval='calib') for sample in samples]  # noqa
         elif self.calib_bs > 1:
             for i in range(0, len(samples), self.calib_bs):
                 start = i
                 end = min(i + self.calib_bs, len(samples))
                 batch = samples[start:end]
-                calib_samples.append(self.batch_process(batch))
+                calib_samples.append(self.batch_process(batch, calib_or_eval='calib'))
         return calib_samples
 
     def img_group_samples_wo_mask(self, samples):  # without mask
diff --git a/llmc/data/dataset/specified_preproc.py b/llmc/data/dataset/specified_preproc.py
index 56649db10..d8ad4d00f 100644
--- a/llmc/data/dataset/specified_preproc.py
+++ b/llmc/data/dataset/specified_preproc.py
@@ -102,7 +102,7 @@ def pileval_omni(calib_dataset, tokenizer, n_samples, seq_len):
 
 
 @PREPROC_REGISTRY
-def vlm_general(calib_dataset, tokenizer, batch_process, n_samples):
+def vlm_general(calib_dataset, n_samples):
     img_qa_json = os.path.join(calib_dataset, 'img_qa.json')
     fp = open(img_qa_json)
     img_qas = json.load(fp)
diff --git a/llmc/eval/eval_vlm.py b/llmc/eval/eval_vlm.py
index f051e8760..cf6c96e2c 100644
--- a/llmc/eval/eval_vlm.py
+++ b/llmc/eval/eval_vlm.py
@@ -35,11 +35,6 @@ def load_mme(self):
         return img_qas
 
     def patch_datasets(self, model_type):
-        if self.dataset == 'MME':
-            if model_type == 'InternVL2':
-                for idx in range(len(self.img_qas)):
-                    if '<image>\n' not in self.img_qas[idx]['question']:
-                        self.img_qas[idx]['question'] = '<image>\n' + self.img_qas[idx]['question']
         if model_type == 'InternVL2':
             self.output_include_input = False
         elif model_type == 'Llava':
diff --git a/llmc/models/internvl2.py b/llmc/models/internvl2.py
index 7269c1056..cad8aa905 100644
--- a/llmc/models/internvl2.py
+++ b/llmc/models/internvl2.py
@@ -137,8 +137,10 @@ def build_model(self):
                 'Besides, you can also put the <image> into your calib dataset.'
             )
 
-    def batch_process(self, img_qas):
+    def batch_process(self, img_qas, calib_or_eval='eval'):
+        assert calib_or_eval == 'calib' or calib_or_eval == 'eval'
         questions = []
+        answers = []
         pixel_values_list = []
         num_patches_list = []
         for idx in range(len(img_qas)):
@@ -166,6 +168,7 @@ def batch_process(self, img_qas):
             else:
                 assert img_qas[idx]['question'].count('<image>') == len(img_path), f"{img_qas[idx]['img']} this data prompt is wrong."  # noqa
             questions.append(img_qas[idx]['question'])
+            answers.append(img_qas[idx]['answer'] + '<|im_end|>')
         pixel_values = (
             torch.cat(pixel_values_list, dim=0) if len(pixel_values_list) > 0 else None
         )
@@ -189,6 +192,10 @@ def batch_process(self, img_qas):
             template.append_message(template.roles[0], question)
             template.append_message(template.roles[1], None)
             query = template.get_prompt()
+            if calib_or_eval == 'calib' and self.config['calib'].get('add_answer', False):
+                query += answers[idx]
+            if calib_or_eval == 'calib':
+                logger.info(f'Calib data is:\n{query}')
             for _num_patches_i in num_patches:
                 image_tokens = IMG_START_TOKEN + IMG_CONTEXT_TOKEN * self.vlm_model.num_image_token * _num_patches_i + IMG_END_TOKEN  # noqa
                 query = query.replace('<image>', image_tokens, 1)
diff --git a/llmc/models/llava.py b/llmc/models/llava.py
index f3729732a..7b78c6198 100644
--- a/llmc/models/llava.py
+++ b/llmc/models/llava.py
@@ -33,9 +33,11 @@ def build_model(self):
 
         self.processor = AutoProcessor.from_pretrained(self.model_path)
 
-    def batch_process(self, img_qas):
+    def batch_process(self, img_qas, calib_or_eval='eval'):
+        assert calib_or_eval == 'calib' or calib_or_eval == 'eval'
         messages = []
         images = []
+        answers = []
         for idx in range(len(img_qas)):
             img_path = img_qas[idx]['img']
             image = Image.open(img_path)
@@ -50,10 +52,19 @@ def batch_process(self, img_qas):
             ]
             messages.append(message)
             images.append(image)
+            answers.append(img_qas[idx]['answer'])
         texts = [
-            self.processor.apply_chat_template(msg, add_generation_prompt=True)
-            for msg in messages
+            self.processor.apply_chat_template(messages[n], add_generation_prompt=True)
+            for n in range(len(messages))
         ]
+        if calib_or_eval == 'calib' and self.config['calib'].get('add_answer', False):
+            texts = [
+                texts[n] + ' ' + answers[n]
+                for n in range(len(texts))
+            ]
+        if calib_or_eval == 'calib':
+            logger.info(f'Calib data is:\n{texts}')
+
         inputs = self.processor(
             text=texts,
             images=images,
diff --git a/llmc/models/mllama.py b/llmc/models/mllama.py
index fbd31b77d..9142661f0 100644
--- a/llmc/models/mllama.py
+++ b/llmc/models/mllama.py
@@ -38,7 +38,8 @@ def build_model(self):
         self.model = self.vlm_model.language_model
         self.model_config = self.vlm_model_config.text_config
 
-    def batch_process(self, img_qas):
+    def batch_process(self, img_qas, calib_or_eval='eval'):
+        assert calib_or_eval == 'calib' or calib_or_eval == 'eval'
         if len(img_qas) == 1:
             return self.single_process(img_qas[0])
         processor = AutoProcessor.from_pretrained(self.model_path)
diff --git a/llmc/models/qwen2vl.py b/llmc/models/qwen2vl.py
index bb6e4e263..8daa2ac9b 100644
--- a/llmc/models/qwen2vl.py
+++ b/llmc/models/qwen2vl.py
@@ -60,8 +60,10 @@ def build_model(self):
             max_pixels=self.max_pixels
         )
 
-    def batch_process(self, img_qas):
+    def batch_process(self, img_qas, calib_or_eval='eval'):
+        assert calib_or_eval == 'calib' or calib_or_eval == 'eval'
         messages = []
+        answers = []
         for idx in range(len(img_qas)):
             img_path = img_qas[idx]['img']
             if img_path is not None:
@@ -87,10 +89,19 @@ def batch_process(self, img_qas):
                     }
                 ]
             messages.append(message)
+            answers.append(img_qas[idx]['answer'] + '<|im_end|>')
         texts = [
             self.processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
             for msg in messages
         ]
+        if calib_or_eval == 'calib' and self.config['calib'].get('add_answer', False):
+            texts = [
+                texts[n] + answers[n]
+                for n in range(len(texts))
+            ]
+        if calib_or_eval == 'calib':
+            logger.info(f'Calib data is:\n{texts}')
+
         image_inputs, video_inputs = process_vision_info(messages)
         inputs = self.processor(
             text=texts,
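Taken together, the model-side changes above implement the new `add_answer` calibration switch: when `batch_process` is called with `calib_or_eval='calib'` and the config's calib section sets `add_answer: True`, the ground-truth answer is appended to the rendered chat prompt (with a closing `<|im_end|>` for the chat-template models InternVL2 and Qwen2-VL) before tokenization, and the resulting calibration text is logged. A minimal, model-agnostic sketch of that behavior follows; `render_prompt` and `config` are hypothetical stand-ins for the per-model chat templating and the parsed YAML config, not llmc APIs.

```python
# Illustrative sketch only: mirrors the add_answer logic added to the
# batch_process methods above. `render_prompt` and `config` are hypothetical
# stand-ins, not part of llmc.
def build_calib_texts(img_qas, config, render_prompt, calib_or_eval='calib'):
    assert calib_or_eval in ('calib', 'eval')
    texts = []
    for sample in img_qas:
        # render_prompt plays the role of apply_chat_template / template.get_prompt
        text = render_prompt(sample['question'])
        if calib_or_eval == 'calib' and config['calib'].get('add_answer', False):
            # Append the ground-truth answer and close the assistant turn.
            text += sample['answer'] + '<|im_end|>'
        texts.append(text)
    return texts
```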