
Commit 8130a3f

Authored by: jilongW, ZePan110, ramesh-katkuri (Ramesh Katkuri), pre-commit-ci[bot]
[Xtune]Update CNclip & Qwen2.5-VL doc (#1958)
* Fix build issues (#1937)
  * Fix build issues Add docling in requirements.in Change pathway version to fix dependency conflict. Signed-off-by: ZePan110 <ze.pan@intel.com>
  ---------
  Signed-off-by: ZePan110 <ze.pan@intel.com>
  Signed-off-by: sunzhonghua2004 <jilong.wang@intel.com>
* Add Arbitration Post-Hearing Component with LLM-Based Entity Extraction (#1938)
  * initial commit for arbitratory micro service
  * test cases added
  * test cases added
  * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
  * renamed test file as per review comment
  * updated path of SCRIPT_DIR in to resolve microservice test
  * resolved comments wrt license header and env configs in compose file
  * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
  * removed unused MODEL env variable Signed-off-by: Author Name <c.noeljaymon@zensar.com>
  * removed space and added sign off Signed-off-by: Noel Jaymon <c.noeljaymon@zensar.com>
  * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
  * resolved ci issues for compose file name and readme reference paths Signed-off-by: Noel Jaymon <c.noeljaymon@zensar.com>
  * added arb_post_hearing_assistant-compose.yaml file in .github folder Signed-off-by: Noel Jaymon <c.noeljaymon@zensar.com>
  * deleted unused file redis-values.yaml Signed-off-by: Noel Jaymon <c.noeljaymon@zensar.com>
  * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
  * made service and image name same for arb_post_hearing_assistant-compose.yaml in .github folder Signed-off-by: Noel Jaymon <c.noeljaymon@zensar.com>
  * DCO remediation: adding missing Signed-off-by lines Signed-off-by: Ramesh <katkuri.ramesh@zensar.com>
  * fixed the micro service build issue Signed-off-by: Ramesh <katkuri.ramesh@zensar.com>
  * microservice container not found fixed Signed-off-by: Ramesh <katkuri.ramesh@zensar.com>
  * removed airgap code Signed-off-by: Ramesh <katkuri.ramesh@zensar.com>
  * fixed pre-commit check issue
  ---------
  Signed-off-by: Noel Jaymon <c.noeljaymon@zensar.com>
  Signed-off-by: Ramesh <katkuri.ramesh@zensar.com>
  Co-authored-by: Ramesh Katkuri <rameshkatkuri@Rameshs-MacBook-Air.local>
  Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
  Co-authored-by: Noel Jaymon <c.noeljaymon@zensar.com>
  Signed-off-by: sunzhonghua2004 <jilong.wang@intel.com>
* add funasr paraformer asr service impl (#1914)
  * add funasr paraformer asr service impl
  * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
  * fix requirements deps; modify ASR READMEs
  * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
  * add funasr-paraformer dockerfile in github workflow
  ---------
  Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
  Signed-off-by: sunzhonghua2004 <jilong.wang@intel.com>
* enable HF_TOKEN to be defined in request (#1940) Signed-off-by: wwanarif <wan.abdul.hakim.b.wan.arif@intel.com> Signed-off-by: sunzhonghua2004 <jilong.wang@intel.com>
* fix the source of LibreOffice (#1942) Signed-off-by: zhihang <zhihangdeng@link.cuhk.edu.cn> Signed-off-by: sunzhonghua2004 <jilong.wang@intel.com>
* Fix CD issue and llms build failure (#1943)
  * Fix permissions issue Signed-off-by: ZePan110 <ze.pan@intel.com>
  * Fix issue Signed-off-by: ZePan110 <ze.pan@intel.com>
  * Test Signed-off-by: ZePan110 <ze.pan@intel.com>
  * Revert "Test" This reverts commit 1df372c.
  ---------
  Signed-off-by: ZePan110 <ze.pan@intel.com>
  Signed-off-by: sunzhonghua2004 <jilong.wang@intel.com>
* update vllm-ipex, boost servi performance (#1941) Signed-off-by: sunzhonghua2004 <jilong.wang@intel.com>
* Add openGauss support to dataprep microservice and update related doc… (#1945)
  * add openGauss support for dataprep Signed-off-by: sunshuang1866 <sunshuang1866@outlook.com>
  * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
  * fix the healthcheck for openGauss Signed-off-by: sunshuang1866 <sunshuang1866@outlook.com>
  * update README.md for openGauss Signed-off-by: sunshuang1866 <sunshuang1866@outlook.com>
  ---------
  Signed-off-by: sunshuang1866 <sunshuang1866@outlook.com>
  Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
  Signed-off-by: sunzhonghua2004 <jilong.wang@intel.com>
* Add openGauss support to retrievers and update related doc… (#1949)
  * add openGauss support for retrievers Signed-off-by: sunshuang1866 <sunshuang1866@outlook.com>
  * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
  ---------
  Signed-off-by: sunshuang1866 <sunshuang1866@outlook.com>
  Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
  Signed-off-by: sunzhonghua2004 <jilong.wang@intel.com>
* update package version to fit B60 Signed-off-by: sunzhonghua2004 <jilong.wang@intel.com>
* enable CnClip B/16&L/14 and enable flickr30kcn dataset Signed-off-by: sunzhonghua2004 <jilong.wang@intel.com>
* update docs about qwen2-vl &qwen2.5-vl & cnclip, add qwen-vl configs Signed-off-by: sunzhonghua2004 <jilong.wang@intel.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: sunzhonghua2004 <jilong.wang@intel.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: sunzhonghua2004 <jilong.wang@intel.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* fix bug Signed-off-by: sunzhonghua2004 <jilong.wang@intel.com>
* fix bug Signed-off-by: sunzhonghua2004 <jilong.wang@intel.com>
* fix for doc Signed-off-by: sunzhonghua2004 <jilong.wang@intel.com>
* fix gradio verion Signed-off-by: sunzhonghua2004 <jilong.wang@intel.com>
* retrigger test Signed-off-by: sunzhonghua2004 <jilong.wang@intel.com>
* added test for cnclip Signed-off-by: sunzhonghua2004 <jilong.wang@intel.com>
---------
Signed-off-by: ZePan110 <ze.pan@intel.com>
Signed-off-by: sunzhonghua2004 <jilong.wang@intel.com>
Signed-off-by: Noel Jaymon <c.noeljaymon@zensar.com>
Signed-off-by: Ramesh <katkuri.ramesh@zensar.com>
Signed-off-by: wwanarif <wan.abdul.hakim.b.wan.arif@intel.com>
Signed-off-by: zhihang <zhihangdeng@link.cuhk.edu.cn>
Signed-off-by: sunshuang1866 <sunshuang1866@outlook.com>
Co-authored-by: ZePan110 <ze.pan@intel.com>
Co-authored-by: ramesh-katkuri <katkuri.ramesh@zensar.com>
Co-authored-by: Ramesh Katkuri <rameshkatkuri@Rameshs-MacBook-Air.local>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Noel Jaymon <c.noeljaymon@zensar.com>
Co-authored-by: LIU Lin <107393642+llin60@users.noreply.github.com>
Co-authored-by: wanhakim <wanhakim92@gmail.com>
Co-authored-by: zhihang <zhihangdeng@link.cuhk.edu.cn>
Co-authored-by: linjiaojiao <jiaojiao.lin@intel.com>
Co-authored-by: sunshuang1866 <sunshuang1866@outlook.com>
Co-authored-by: Xueshu Wang <xueshu.wang@intel.com>
1 parent 3a50559 commit 8130a3f

18 files changed: +860 / -423 lines


comps/finetuning/src/integrations/xtune/README.md

Lines changed: 144 additions & 20 deletions
@@ -4,11 +4,11 @@
 
 > [!NOTE]
 >
-> - _`Xtune`_ incorporates with Llama-Factory to offer various methods for finetuning visual models (CLIP, AdaCLIP), LLM and Multi-modal models​. It makes easier to choose the method and to set fine-tuning parameters.
+> - _`Xtune`_ incorporates with Llama-Factory to offer various methods for finetuning visual models (CLIP, CnCLIP, AdaCLIP), LLM and Multi-modal models​. It makes easier to choose the method and to set fine-tuning parameters.
 
 The core features include:
 
-- Four finetune method for CLIP, details in [CLIP](./doc/key_features_for_clip_finetune_tool.md)
+- Four finetune method for CLIP & CnCLIP, details in [CLIP](./doc/key_features_for_clip_finetune_tool.md)
 - Three finetune method for AdaCLIP, details in [AdaCLIP](./doc/adaclip_readme.md)
 - Automatic hyperparameter searching enabled by Optuna [Optuna](https://github.com/optuna/optuna)
 - Distillation from large models with Intel ARC GPU​
@@ -59,8 +59,8 @@ Blow command is in prepare_xtune.sh. You can ignore it if you don't want to upda
 conda install pytorch torchvision cudatoolkit=10.2 -c pytorch
 # else run on A770
 # You can refer to https://github.com/intel/intel-extension-for-pytorch for latest command to update lib
-python -m pip install torch==2.5.1+cxx11.abi torchvision==0.20.1+cxx11.abi torchaudio==2.5.1+cxx11.abi --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
-python -m pip install intel-extension-for-pytorch==2.5.10+xpu oneccl_bind_pt==2.5.0+xpu --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
+python -m pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/xpu
+python -m pip install intel-extension-for-pytorch==2.8.10+xpu oneccl_bind_pt==2.8.0+xpu --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
 ```
 
 ### 2. Install xtune on docker
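As a quick sanity check that the updated wheels above installed correctly, a minimal sketch (assuming the standard `torch.xpu` and IPEX import interfaces; not part of this commit):

```python
# Hedged post-install smoke test for the XPU stack (not part of this diff).
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401  # confirms the IPEX XPU wheel imports cleanly

print(torch.__version__)         # expected to report the 2.8.0 XPU build installed above
print(torch.xpu.is_available())  # True when the Arc GPU and driver stack are visible
print(torch.xpu.device_count())  # number of visible XPU devices
```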
@@ -106,20 +106,34 @@ then make `dataset_info.json` in your dataset directory
 {
   "caltech101": {
     "file_name": "caltech101.json"
+  },
+  "flickr30k": {
+    "file_name": "flickr30k.json"
   }
 }
 ```
 
-## Fine-Tuning with LLaMA Board GUI (powered by [Gradio](https://github.com/gradio-app/gradio))
-
-> [!NOTE] We don't support multi-card in GUI now, will add it later.
+The directory structure should look like
 
-When run with prepare_xtune.sh, it will automatic run ZE_AFFINITY_MASK=0 llamafactory-cli webui.
+```
+$DATA/
+|-- caltech-101/
+| |-- 101_ObjectCategories/
+| | split_zhou_Caltech101.json
+|-- flickr/
+| |–– flickr30k-images/
+| | |-- *.jpg
+| |-- train_texts.jsonl
+| |-- val_texts.jsonl
+| |-- test_texts.jsonl
+|-- dataset_info.json
+|-- caltech101.json
+|-- flickr30k.json
+```
 
-If you see "server start successfully" in terminal.
-You can access in web through http://localhost:7860/
+## Fine-Tuning with LLaMA Board GUI (powered by [Gradio](https://github.com/gradio-app/gradio))
 
-The UI component information can be seen in doc/ui_component.md after run with prepare_xtune.sh.
+> [!NOTE] We don't support multi-card in GUI now, will add it later.
 
 When run with prepare_xtune.sh, it will automatic run ZE_AFFINITY_MASK=0 llamafactory-cli webui.
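The `flickr30k` entry added above can also be appended to an existing `dataset_info.json` programmatically; a minimal sketch under the layout shown above (an illustration, not a script from this repo):

```python
# Minimal sketch: register the flickr30k entry shown above in $DATA/dataset_info.json.
import json
import os

data_dir = os.environ.get("DATA", ".")  # the directory shown as $DATA/ above
info_path = os.path.join(data_dir, "dataset_info.json")

with open(info_path, encoding="utf-8") as f:
    info = json.load(f)

info["flickr30k"] = {"file_name": "flickr30k.json"}  # mirrors the added JSON entry

with open(info_path, "w", encoding="utf-8") as f:
    json.dump(info, f, indent=2, ensure_ascii=False)
```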

@@ -137,6 +151,58 @@ The UI component information can be seen in doc/ui_component.md after run with p
 Then access in web through http://localhost:7860/
 ```
 
+### GUI using guide
+
+#### CLIP & CnCLIP
+
+![clip ui guide](./pics/clip_ui.png)
+
+- Must be set to the specified parameter values below:
+| Parameter | Choose Value |
+| ------------------- | -------------------------------------------- |
+| `Model name` | `CnVit-B/16` / `CnVit-L/14` /`Vit-B/16` /`Vit-L/14` |
+| `Model path` | Must be the detail configuration name under `src/llamafactory/clip_finetune/configs/trainers/clip_finetune/`|
+| `Finetuning method` | clip |
+| `Stage` | clip|
+| `Data dir` | Where you put `dataset_info.json`.|
+| `Method Group` |Finetune|
+| `clip_finetune method` | `CLIP_Adapter_hf`/ `CLIP_Bias_hf`/ `CLIP_VPT_hf` /`CLIP_Fullfinetune_hf`, must match with `Model name`(configuration name).|
+
+- The matching relationship between `Model name`(configuration name) and `clip_finetune method`:
+
+| clip_finetune method | `Model name`(configuration name) |
+| -------------------- | ------------------------------------- |
+| CLIP_Adapter_hf | xx_xx(e.g.,`cnvit_b16`) |
+| CLIP_Bias_hf | xx_xx_bias(e.g.,`cnvit_b16_bias`) |
+| CLIP_VPT_hf | xx_xx_prompt(e.g.,`cnvit_b16_prompt`) |
+| CLIP_Fullfinetune_hf | xx_xx_ori(e.g.,`cnvit_b16_ori`) |
+
+#### AdaCLIP
+
+![adaclip ui guide](./pics/adaclip_ui.png)
+
+- Must be set to the specified parameter values below:
+| Parameter | Choose Value |
+| ------------------- | -------------------------------------------- |
+| `Model name` | Custom |
+| `Model path` | Adaclip model path|
+| `Finetuning method` | Adaclip|
+| `Stage` | Adaclip|
+| `Data dir` | Where you put `dataset_info.json`|
+
+#### Qwen2-VL & Qwen2.5-VL
+
+![qwen-vl ui guide](./pics/qwen_vl_ui.png)
+
+- Must be set to the specified parameter values below:
+| Parameter | Choose Value |
+| ------------------- | -------------------------------------------- |
+| `Model name` | Select Qwen2-VL or Qwen2.5-VL model |
+| `Model path` | Will be set automatically after setting Model name, you can use your local model path,too.|
+| `Finetuning method` | lora|
+| `Stage` | Supervised Fine-Tuning|
+| `Data dir` | Where you put `dataset_info.json`, can use `data` as default, and update your own data in `data/dataset_info.json`|
+
 ## Fine-Tuning with Shell instead of GUI
 
 After run `prepare_xtune.sh`, it will download all related file. And open webui as default.
@@ -154,6 +220,15 @@ cd src/llamafactory/clip_finetune
 # Please see README.md in src/llamafactory/clip_finetune for detail
 ```
 
+### CnCLIP
+
+Please see [doc](./doc/key_features_for_clip_finetune_tool.md) for how to config feature
+
+```bash
+cd src/llamafactory/clip_finetune
+# Please see README.md in src/llamafactory/clip_finetune for detail
+```
+
 ### AdaCLIP
 
 ```bash
@@ -164,22 +239,24 @@ cd src/llamafactory/adaclip_finetune
 ### Qwen2-VL Training and Hyperparameter Optimization
 
 ```bash
-# Please see Qwen2-VL_README.md in doc for detail, bolow are simple use
+# Please see Qwen-VL_README.md in doc to use more automated fine-tuning methods and hyperparameter tuning, bolow are simple use:
 ```
 
-#### Step 1: Finetune qwen2-vl with logging eval loss
+#### Finetune Qwen2-VL & Qwen2.5-VL with logging eval loss
 
 If you want to finetune with plotting eval loss, please set eval_strategy as steps, eval_stepsand eval_dataset:
 
-```
-# Finetune qwen2-vl with logging eval loss
+##### Qwen2-VL
+
+```bash
 export DATA='where you can find dataset_info.json'
-export dataset=activitynet_qa_2000_limit_20s # to point which dataset llamafactory will use
+#To point which dataset llamafactory will use, have to add the datasets into dataset_info.json before finetune.
+export dataset=activitynet_qa_2000_limit_20s
 export eval_dataset=activitynet_qa_val_500_limit_20s
 llamafactory-cli train \
 --stage sft \
 --do_train True \
---model_name_or_path $models/Qwen2-VL-7B-Instruct-GPTQ-Int8 \
+--model_name_or_path /model/Qwen2-VL-7B-Instruct-GPTQ-Int8 \
 --preprocessing_num_workers 16 \
 --finetuning_type lora \
 --template qwen2_vl \
@@ -196,10 +273,10 @@ llamafactory-cli train \
 --max_grad_norm 1.0 \
 --logging_steps 10 \
 --save_steps 100 \
---warmup_steps 100 \
+--warmup_steps 0 \
 --packing False \
 --report_to none \
---output_dir saves/Qwen2-VL-7B-Instruct-GPTQ-Int8/lora/finetune_test_valmetrics_evalstep8 \
+--output_dir saves/Qwen2-VL-7B-Instruct-GPTQ-Int8/lora/finetune_qwen2vl \
 --bf16 True \
 --plot_loss True \
 --ddp_timeout 180000000 \
@@ -216,7 +293,54 @@ llamafactory-cli train \
 --lora_target all
 ```
 
-#### step 2: Evaluation metrics calculation and plotting
+#### Qwen2.5-VL
+
+```bash
+export DATA='where you can find dataset_info.json'
+#To point which dataset llamafactory will use, have to add the datasets into dataset_info.json before finetune.
+export dataset=activitynet_qa_1000_limit_20s
+export eval_dataset=activitynet_qa_val_250_limit_20s
+llamafactory-cli train \
+--stage sft \
+--do_train True \
+--model_name_or_path /home/edgeai/wxs/workspace/models/Qwen2.5-VL-7B-Instruct \
+--preprocessing_num_workers 16 \
+--finetuning_type lora \
+--template qwen2_vl \
+--flash_attn auto \
+--dataset_dir $DATA \
+--dataset $dataset \
+--cutoff_len 2048 \
+--learning_rate 5e-05 \
+--num_train_epochs 2 \
+--max_samples 100000 \
+--per_device_train_batch_size 2 \
+--gradient_accumulation_steps 4 \
+--lr_scheduler_type cosine \
+--max_grad_norm 1.0 \
+--logging_steps 10 \
+--save_steps 100 \
+--warmup_steps 0 \
+--packing False \
+--report_to none \
+--output_dir saves/Qwen2.5-VL-7B-Instruct/lora/finetune_qwen2.5vl \
+--bf16 True \
+--plot_loss True \
+--ddp_timeout 180000000 \
+--optim adamw_torch \
+--video_fps 0.05 \
+--per_device_eval_batch_size 1 \
+--eval_strategy steps \
+--eval_steps 100 \
+--eval_dataset $eval_dataset \
+--predict_with_generate true \
+--lora_rank 8 \
+--lora_alpha 16 \
+--lora_dropout 0 \
+--lora_target all
+```
+
+#### Calculation and Plotting of Evaluation Metrics During Fine-Tuning
 
 If you want to plot eval metrics:
 Change `MODEL_NAME`,`EXPERIENT_NAME`,`EVAL_DATASET` as you need and run evaluation metrics calculation sctrpt:
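As a rough illustration of the eval-loss plotting mentioned above, a hedged sketch that reads the standard Hugging Face `trainer_state.json` written to `--output_dir` (an assumption about the output layout; this is not the repository's metrics script):

```python
# Hedged sketch: plot eval_loss recorded during the fine-tune runs above.
# Assumes the Hugging Face Trainer's trainer_state.json layout in --output_dir.
import json

import matplotlib.pyplot as plt

state_path = "saves/Qwen2.5-VL-7B-Instruct/lora/finetune_qwen2.5vl/trainer_state.json"
with open(state_path) as f:
    log_history = json.load(f)["log_history"]

steps = [entry["step"] for entry in log_history if "eval_loss" in entry]
losses = [entry["eval_loss"] for entry in log_history if "eval_loss" in entry]

plt.plot(steps, losses, marker="o")
plt.xlabel("step")
plt.ylabel("eval_loss")
plt.title("Eval loss during fine-tuning")
plt.savefig("eval_loss.png")
```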

comps/finetuning/src/integrations/xtune/clip_finetune/trainers/clip_adapter_hf.py

Lines changed: 62 additions & 11 deletions
@@ -12,6 +12,13 @@
 from torch.nn import functional as F
 from transformers import CLIPModel, CLIPProcessor
 
+try:
+    from transformers import ChineseCLIPModel, ChineseCLIPProcessor
+
+    CHINESE_CLIP_AVAILABLE = True
+except ImportError:
+    CHINESE_CLIP_AVAILABLE = False
+
 CUSTOM_TEMPLATES = {
     "OxfordPets": "a photo of a {}, a type of pet.",
     "OxfordFlowers": "a photo of a {}, a type of flower.",
@@ -30,22 +37,31 @@
     "ImageNetA": "a photo of a {}.",
     "ImageNetR": "a photo of a {}.",
     "ITC_Flickr": "{}.",
+    "ITC_FlickrCN": "{}.",
     "ITC_Flickr5k": "{}.",
     "ITC_Mscoco": "{}.",
 }
 _MODELS = {
     "ViT-B/16": "openai/clip-vit-base-patch16",
     "ViT-B/32": "openai/clip-vit-base-patch32",
     "ViT-L/14": "openai/clip-vit-large-patch14",
+    "CnViT-B/16": "OFA-Sys/chinese-clip-vit-base-patch16",
+    "CnViT-L/14": "OFA-Sys/chinese-clip-vit-large-patch14",
 }
 
 
 def load_clip_to_cpu(cfg):
     backbone_name = cfg.MODEL.BACKBONE.NAME
     url = _MODELS[backbone_name]
 
-    model = CLIPModel.from_pretrained(url)
-    processor = CLIPProcessor.from_pretrained(url)
+    # Check if it's a Chinese CLIP model
+    if backbone_name.startswith("CnViT") and CHINESE_CLIP_AVAILABLE:
+        model = ChineseCLIPModel.from_pretrained(url)
+        processor = ChineseCLIPProcessor.from_pretrained(url)
+    else:
+        model = CLIPModel.from_pretrained(url)
+        processor = CLIPProcessor.from_pretrained(url)
+
     # model.initialize_parameters()
 
     return model, processor
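The new `CnViT-*` entries point at the OFA-Sys Chinese-CLIP checkpoints on the Hugging Face Hub. A short standalone sketch of the same loading path (an illustration, not code from this file):

```python
# Standalone sketch of the CnViT-B/16 branch taken by load_clip_to_cpu() above.
from transformers import ChineseCLIPModel, ChineseCLIPProcessor

url = "OFA-Sys/chinese-clip-vit-base-patch16"
model = ChineseCLIPModel.from_pretrained(url)
processor = ChineseCLIPProcessor.from_pretrained(url)

# Both CLIPModel and ChineseCLIPModel expose a visual_projection layer, which is
# what the adapter-sizing change further down reads instead of a hard-coded 512.
print(type(model).__name__, model.visual_projection.out_features)
```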
@@ -67,7 +83,6 @@ def forward(self, x):
         return x
 
 
-# use clip textencode
 class TextEncoder(nn.Module):
 
     def __init__(self, cfg, classnames, clip_model, processor):
@@ -77,6 +92,8 @@ def __init__(self, cfg, classnames, clip_model, processor):
         self.clip_model = clip_model
         self.tokenizer = processor.tokenizer
         self.dtype = clip_model.dtype
+        # Check if it's Chinese CLIP model by checking model type
+        self.is_chinese_clip = type(clip_model).__name__ == "ChineseCLIPModel"
 
     def forward(self, classname=None):
         # for small dataset, we tokenize all prompt ------- if classname is None
@@ -88,12 +105,26 @@ def forward(self, classname=None):
             temp = CUSTOM_TEMPLATES[self.cfg.DATASET.NAME]
             prompts = [temp.format(c.replace("_", " ")) for c in classname]
 
-        prompts = self.tokenizer(prompts, return_tensors="pt", padding=True)["input_ids"]
+        # Use tokenizer for both models (same interface)
+        # Set max_length to prevent sequence length errors
+        tokenized = self.tokenizer(prompts, return_tensors="pt", padding=True)
+
         if self.cfg.TRAINER.COOP.XPU:
-            prompts = prompts.to(self.cfg.TRAINER.COOP.XPU_ID)
+            tokenized = {k: v.to(self.cfg.TRAINER.COOP.XPU_ID) for k, v in tokenized.items()}
+        else:
+            tokenized = {k: v.to(self.cfg.TRAINER.COOP.CUDA_ID) for k, v in tokenized.items()}
+
+        # Handle different model architectures
+        text_outputs = self.clip_model.text_model(**tokenized)
+
+        if text_outputs.pooler_output is not None:
+            # Standard CLIP has pooler_output
+            text_features = text_outputs.pooler_output
         else:
-            prompts = prompts.to(self.cfg.TRAINER.COOP.CUDA_ID)
-        text_features = self.clip_model.text_model(prompts)[1]
+            # Chinese CLIP doesn't have pooler_output, use last hidden state's first token
+            # Use [CLS] token
+            text_features = text_outputs.last_hidden_state[:, 0, :]
+
         text_features = self.clip_model.text_projection(text_features)
         return text_features
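For context on the `[CLS]`-token branch above: Chinese-CLIP's text tower is BERT-style, and its projected text features come from the first token of `last_hidden_state`. A hedged standalone sketch of that same computation (the example prompt is illustrative):

```python
# Hedged sketch of the Chinese-CLIP [CLS] pooling path used in forward() above.
import torch
from transformers import ChineseCLIPModel, ChineseCLIPProcessor

name = "OFA-Sys/chinese-clip-vit-base-patch16"
model = ChineseCLIPModel.from_pretrained(name)
tokenizer = ChineseCLIPProcessor.from_pretrained(name).tokenizer

tokenized = tokenizer(["一张猫的照片。"], return_tensors="pt", padding=True)
with torch.no_grad():
    text_outputs = model.text_model(**tokenized)
    cls_token = text_outputs.last_hidden_state[:, 0, :]  # [CLS] token, as in the trainer
    text_features = model.text_projection(cls_token)
print(text_features.shape)  # (1, projection_dim)
```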

@@ -110,11 +141,22 @@ def __init__(self, cfg, classnames, clip_model, processor):
         self.text_encoder = TextEncoder(cfg, classnames, clip_model, processor)
         self.logit_scale = clip_model.logit_scale
         self.dtype = clip_model.dtype
-        # init adapter
-        self.adapter = Adapter(512, 4).to(clip_model.dtype)
+        # Check if it's Chinese CLIP model
+        self.is_chinese_clip = type(clip_model).__name__ == "ChineseCLIPModel"
+        projection_dim = clip_model.visual_projection.out_features
+        self.adapter = Adapter(projection_dim, 4).to(clip_model.dtype)
 
     def forward(self, image, classname=None):
-        image_features = self.image_encoder(image.type(self.dtype))[1]
+        # Handle different vision model outputs
+        vision_outputs = self.image_encoder(image.type(self.dtype))
+        if hasattr(vision_outputs, "pooler_output") and vision_outputs.pooler_output is not None:
+            image_features = vision_outputs.pooler_output
+        elif isinstance(vision_outputs, tuple) and len(vision_outputs) > 1:
+            image_features = vision_outputs[1]  # pooled output
+        else:
+            # Fallback: use last hidden state
+            image_features = vision_outputs.last_hidden_state.mean(dim=1)
+
         image_features = self.visual_projection(image_features)
         # apply adapter in ViT
         x = self.adapter(image_features)
@@ -206,7 +248,16 @@ def get_text_embeds(self, text):
         return text_features
 
     def get_img_embeds(self, image):
-        image_features = self.model.image_encoder(image.type(self.model.dtype))[1]
+        # Handle different vision model outputs
+        vision_outputs = self.model.image_encoder(image.type(self.model.dtype))
+        if hasattr(vision_outputs, "pooler_output") and vision_outputs.pooler_output is not None:
+            image_features = vision_outputs.pooler_output
+        elif isinstance(vision_outputs, tuple) and len(vision_outputs) > 1:
+            image_features = vision_outputs[1]  # pooled output
+        else:
+            # Fallback: use last hidden state
+            image_features = vision_outputs.last_hidden_state.mean(dim=1)
+
         image_features = self.model.visual_projection(image_features)
         x = self.model.adapter(image_features)