
Can Visual Encoder Learn to See Arrows?

This repository is the official implementation of "Can Visual Encoder Learn to See Arrows?", presented as a poster at the Second Workshop on Visual Concepts at CVPR 2025.

Setup

Requirements

Install dependencies from the requirements file:

pip install -r requirements.txt

Repository Structure

├── data/                    # Dataset generation scripts
├── encoder_finetune/        # CLIP finetuning code
├── decoder_finetune/        # GPT-2 decoder training code  
├── eval/                    # Evaluation scripts
│   ├── probing/            # Linear probing evaluation
│   ├── image_retrieval/    # Image retrieval evaluation
│   └── captioning/         # Diagram captioning evaluation
└── README.md              # This file

Step-by-Step Reproduction

1. Dataset Generation

Generate synthetic diagram-caption pairs without textual/positional biases:

python data/generate_dataset.py \
    --dataset_size 100000 \
    --n_nodes 8 \
    --prob_edge 0.2 \
    --output_dir dataset/synth_n_nodes_8_prob_edge_0_2_no_label/ \
    --seed 42

This creates 100K diagram images with Mermaid-style captions where edges cannot be inferred from node positions or text content.
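For intuition, the sketch below shows the kind of pair this produces: a random directed graph over a fixed number of nodes, whose edge list is serialized as a Mermaid-style caption, while the image is rendered from a layout that carries no information about the edges. The node identifiers and caption grammar here are illustrative assumptions; the actual generator is data/generate_dataset.py.

# Illustrative sketch only (not the repository's generator): sample a random
# directed graph and serialize its edges as a Mermaid-style caption, mirroring
# the --n_nodes and --prob_edge parameters above.
import random
from itertools import permutations

def sample_diagram(n_nodes=8, prob_edge=0.2, seed=42):
    rng = random.Random(seed)
    nodes = [f"N{i}" for i in range(n_nodes)]            # hypothetical node ids
    edges = [(u, v) for u, v in permutations(nodes, 2)   # every ordered pair...
             if rng.random() < prob_edge]                # ...kept with prob_edge
    # Mermaid-style flowchart caption: one "A --> B" line per directed edge.
    caption = "graph TD\n" + "\n".join(f"{u} --> {v}" for u, v in edges)
    return nodes, edges, caption

nodes, edges, caption = sample_diagram()
print(caption)  # the diagram image would be rendered from these same edges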

2. CLIP Encoder Finetuning

Preprocessing

python encoder_finetune/preprocess.py \
    --train_json ./dataset/synth_n_nodes_8_prob_edge_0_2_no_label/train.json \
    --image_dir ./dataset/synth_n_nodes_8_prob_edge_0_2_no_label/images \
    --output_dir ./dataset/synth_n_nodes_8_prob_edge_0_2_no_label/
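The contrastive-learning command in the next step reads a JSON file whose records expose an image column and a caption column (cf. --image_column image --caption_column caption). As a rough illustration of that interface only, the records can be thought of as below; the exact fields and layout written by encoder_finetune/preprocess.py may differ.

# Hypothetical illustration of the image/caption records consumed by the
# finetuning step; the real clip_train.json may be structured differently.
import json

records = [
    {"image": "dataset/synth_n_nodes_8_prob_edge_0_2_no_label/images/000001.png",  # placeholder file name
     "caption": "graph TD\nN0 --> N3\nN2 --> N5"},
]
with open("clip_train_example.json", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")  # one JSON object per line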

Contrastive Learning

Train both CLIP-ViT-Base/32 and CLIP-ViT-Large/14 models:

python encoder_finetune/finetune_clip_hf.py \
    --do_train --do_eval \
    --model_name_or_path [MODEL_NAME_OR_PATH] \
    --train_file dataset/synth_n_nodes_8_prob_edge_0_2_no_label/clip_train.json \
    --validation_file dataset/synth_n_nodes_8_prob_edge_0_2_no_label/clip_val.json \
    --output_dir [OUTPUT_DIR] \
    --caption_column caption --image_column image \
    --remove_unused_columns=False --overwrite_output_dir=True \
    --max_seq_length=77 --num_train_epochs=100 \
    --per_device_train_batch_size=64 --learning_rate="5e-5" \
    --warmup_steps="0" --weight_decay 0.1 \
    --logging_dir [LOGGING_DIR] --logging_steps 100

| Model | MODEL_NAME_OR_PATH | OUTPUT_DIR | LOGGING_DIR |
| --- | --- | --- | --- |
| CLIP-Base | openai/clip-vit-base-patch32 | checkpoints/clip-base/ | checkpoints/clip-base/logs |
| CLIP-Large | openai/clip-vit-large-patch14-336 | checkpoints/clip-large/ | checkpoints/clip-large/logs |
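Once finetuning has produced a checkpoint, it can be loaded back with the standard transformers classes for a quick sanity check. This is a minimal sketch: the checkpoint path matches the tables in this README, the processor is taken from the original OpenAI release (as in the probing step below), and the image file name is a placeholder.

# Sanity-check sketch: load a finetuned encoder and score image-caption similarity.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("checkpoints/clip-base/checkpoint-126600").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("dataset/synth_n_nodes_8_prob_edge_0_2_no_label/images/000001.png")  # placeholder
inputs = processor(text=["graph TD\nN0 --> N3"], images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.logits_per_image)  # image-text similarity logit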

3. Linear Probing

Evaluate edge recognition capabilities with frozen encoders:

python -m eval.probing.run_eval dataset/synth_n_nodes_8_prob_edge_0_2_no_label \
    --n_nodes 8 --rate_train 0.5 --result_dir results/probing/ \
    --imbalance_method undersample \
    [MODEL_SPECIFIC_OPTIONS]

| Model Variant | MODEL_SPECIFIC_OPTIONS |
| --- | --- |
| Original CLIP-Base | (no additional options) |
| Original CLIP-Large | --model openai/clip-vit-large-patch14-336 |
| Finetuned CLIP-Base | --model checkpoints/clip-base/checkpoint-126600 --preprocessor openai/clip-vit-base-patch32 |
| Finetuned CLIP-Large | --model checkpoints/clip-large/checkpoint-63300 --preprocessor openai/clip-vit-large-patch14-336 |
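At its core, the probing protocol trains a linear classifier on embeddings from a frozen encoder. The sketch below conveys that idea only: feature extraction and the binary edge label are simplified, scikit-learn stands in for whatever solver the repository uses, and the real protocol (including the undersampling used to handle class imbalance) lives in eval/probing/.

# Illustrative linear-probing sketch: extract frozen CLIP image embeddings and
# fit a linear classifier on an edge-related binary label.
import numpy as np
import torch
from PIL import Image
from sklearn.linear_model import LogisticRegression
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(paths):
    """Frozen CLIP image embeddings, one row per diagram image."""
    feats = []
    with torch.no_grad():
        for p in paths:
            inputs = processor(images=Image.open(p), return_tensors="pt")
            feats.append(model.get_image_features(**inputs).squeeze(0).numpy())
    return np.stack(feats)

def fit_probe(train_paths, train_labels):
    """Linear probe on frozen features; labels are e.g. 'target edge present or not'."""
    return LogisticRegression(max_iter=1000).fit(
        embed_images(train_paths), np.array(train_labels))

# Usage (paths and labels come from the generated dataset split):
# probe = fit_probe(train_paths, train_labels)
# accuracy = probe.score(embed_images(test_paths), np.array(test_labels))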

4. Image Retrieval

Test layout-invariant diagram image retrieval:

Step 1: Extract embeddings

python eval/image_retrieval/embedding_extraction.py \
    --dataset_json ./dataset/synth_n_nodes_8_prob_edge_0_2_no_label/test.json \
    --model [MODEL_PATH] --preprocessor [PREPROCESSOR_PATH] \
    --output_dir ./results/retrieval/embeddings/[VARIANT_NAME]/

Step 2: Evaluate retrieval performance

python eval/image_retrieval/evaluation.py \
    --dataset_json ./dataset/synth_n_nodes_8_prob_edge_0_2_no_label/test.json \
    --model [MODEL_PATH] --preprocessor [PREPROCESSOR_PATH] \
    --image_embeds ./results/retrieval/embeddings/[VARIANT_NAME]/image_embeddings.pkl \
    --batch_size 64 --n_queries 1000 --seed 0 --layout fdp --top_k 100 \
    --output_score --output_retrieved_json \
    --output_dir ./results/retrieval/scores/[VARIANT_NAME]

| Model Variant | MODEL_PATH | PREPROCESSOR_PATH | VARIANT_NAME |
| --- | --- | --- | --- |
| Original CLIP-Base | openai/clip-vit-base-patch32 | openai/clip-vit-base-patch32 | base_ori |
| Finetuned CLIP-Base | ./checkpoints/clip-base/checkpoint-126600 | openai/clip-vit-base-patch32 | base_finetuned |
| Original CLIP-Large | openai/clip-vit-large-patch14-336 | openai/clip-vit-large-patch14-336 | large_ori |
| Finetuned CLIP-Large | ./checkpoints/clip-large/checkpoint-63300 | openai/clip-vit-large-patch14-336 | large_finetuned |
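Conceptually, the evaluation ranks gallery images by cosine similarity to a query diagram that has been re-rendered with a different layout (--layout fdp), so a layout-invariant encoder should retrieve other renderings of the same graph. The sketch below shows that ranking step under the assumption that the embedding pickle maps image ids to 1-D vectors; the actual file format written by embedding_extraction.py may differ.

# Illustrative top-k retrieval over precomputed image embeddings (assumed to be
# a dict {image_id: numpy vector}; the repository's pickle layout may differ).
import pickle
import numpy as np

def top_k(query_vec, embeddings_pkl, k=100):
    with open(embeddings_pkl, "rb") as f:
        gallery = pickle.load(f)
    ids = list(gallery)
    mat = np.stack([gallery[i] for i in ids])
    mat = mat / np.linalg.norm(mat, axis=1, keepdims=True)  # normalize gallery
    q = query_vec / np.linalg.norm(query_vec)               # normalize query
    scores = mat @ q                                        # cosine similarities
    order = np.argsort(-scores)[:k]
    return [(ids[i], float(scores[i])) for i in order]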

5. Diagram Captioning

Decoder Training

Train a GPT-2 decoder on top of each CLIP encoder variant:

Step 1: Preprocess data

# Preprocess training data
python decoder_finetune/preprocess.py \
    --input_json ./dataset/synth_n_nodes_8_prob_edge_0_2_no_label/train.json \
    --output_json ./dataset/synth_n_nodes_8_prob_edge_0_2_no_label/converted_train.json

# Preprocess test data  
python decoder_finetune/preprocess.py \
    --input_json ./dataset/synth_n_nodes_8_prob_edge_0_2_no_label/test.json \
    --output_json ./dataset/synth_n_nodes_8_prob_edge_0_2_no_label/converted_test.json

Step 2: Train GPT-2 decoder

python decoder_finetune/train.py \
    --annotation_file ./dataset/synth_n_nodes_8_prob_edge_0_2_no_label/converted_train.json \
    --dataset_dir ./dataset/synth_n_nodes_8_prob_edge_0_2_no_label \
    --vision_encoder_name_or_path [ENCODER_PATH] \
    --processor_name [PROCESSOR_NAME] \
    --model_name_or_path openai-community/gpt2 \
    --tokenizer_name openai-community/gpt2 \
    --freeze_vision_encoder True \
    --output_dir ./checkpoints/gpt2_[VARIANT_NAME] \
    --logging_dir ./checkpoints/gpt2_[VARIANT_NAME]/logs \
    --per_device_train_batch_size 8 --per_device_eval_batch_size 8 \
    --num_train_epochs 100 --learning_rate 5e-5 --warmup_steps 1024 \
    --do_train --do_eval --predict_with_generate \
    --logging_steps 10 --save_steps 100 --eval_steps 100 \
    --overwrite_output_dir True --report_to tensorboard --seed 42

| Model Variant | ENCODER_PATH | PROCESSOR_NAME | VARIANT_NAME |
| --- | --- | --- | --- |
| Original CLIP-Base | openai/clip-vit-base-patch32 | openai/clip-vit-base-patch32 | ori_clip_base |
| Finetuned CLIP-Base | ./checkpoints/clip-base/checkpoint-126600 | openai/clip-vit-base-patch32 | tuned_clip_base |
| Original CLIP-Large | openai/clip-vit-large-patch14-336 | openai/clip-vit-large-patch14-336 | ori_clip_large |
| Finetuned CLIP-Large | ./checkpoints/clip-large/checkpoint-63300 | openai/clip-vit-large-patch14-336 | tuned_clip_large |
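The captioner pairs a frozen CLIP vision encoder with a GPT-2 decoder. The sketch below shows one standard way to assemble such a model with transformers' VisionEncoderDecoderModel; whether decoder_finetune/train.py wires things up exactly this way is an assumption.

# Illustrative encoder-decoder assembly: CLIP vision tower + GPT-2 decoder,
# with the vision encoder frozen (cf. --freeze_vision_encoder True above).
from transformers import (AutoTokenizer, CLIPImageProcessor, CLIPVisionModel,
                          GPT2LMHeadModel, VisionEncoderDecoderModel)

encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
decoder = GPT2LMHeadModel.from_pretrained(
    "openai-community/gpt2", is_decoder=True, add_cross_attention=True)
model = VisionEncoderDecoderModel(encoder=encoder, decoder=decoder)

for p in model.encoder.parameters():   # keep CLIP features fixed during training
    p.requires_grad = False

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")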

Captioning Inference & Evaluation

Step 1: Generate captions

python decoder_finetune/inference.py \
    --model_checkpoint_path checkpoints/gpt2_[VARIANT_NAME]/checkpoint-237000 \
    --image_processor_name_or_path [PROCESSOR_NAME] \
    --dataset_dir dataset/synth_n_nodes_8_prob_edge_0_2_no_label \
    --run_label [VARIANT_NAME] \
    --output_dir results/captioning/generated_captions/
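For reference, generating a caption from such a checkpoint follows the usual generate() pattern. The sketch assumes the checkpoint is a VisionEncoderDecoderModel as in the previous sketch; the checkpoint path, image file, and generation length are placeholders, and inference.py handles batching and output formatting.

# Illustrative single-image caption generation from a trained checkpoint.
import torch
from PIL import Image
from transformers import AutoTokenizer, CLIPImageProcessor, VisionEncoderDecoderModel

ckpt = "checkpoints/gpt2_tuned_clip_base/checkpoint-237000"   # example variant
model = VisionEncoderDecoderModel.from_pretrained(ckpt).eval()
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("dataset/synth_n_nodes_8_prob_edge_0_2_no_label/images/000001.png")  # placeholder
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
with torch.no_grad():
    ids = model.generate(pixel_values, max_new_tokens=77)
print(tokenizer.decode(ids[0], skip_special_tokens=True))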

Step 2: Evaluate captions

python eval/captioning/eval.py \
    --model_name [VARIANT_NAME] \
    --generated_result_json results/captioning/generated_captions/[VARIANT_NAME]_result.json \
    --annotation_json dataset/synth_n_nodes_8_prob_edge_0_2_no_label/converted_test.json \
    --strict_edge

| Model Variant | VARIANT_NAME | PROCESSOR_NAME |
| --- | --- | --- |
| Original CLIP-Base | ori_clip_base | openai/clip-vit-base-patch32 |
| Finetuned CLIP-Base | tuned_clip_base | openai/clip-vit-base-patch32 |
| Original CLIP-Large | ori_clip_large | openai/clip-vit-large-patch14-336 |
| Finetuned CLIP-Large | tuned_clip_large | openai/clip-vit-large-patch14-336 |
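The --strict_edge flag indicates an edge-level comparison between generated and reference captions. As a rough illustration only (eval/captioning/eval.py defines the actual metric), one can parse the Mermaid-style edge lines from both captions and compare the resulting edge sets:

# Illustrative edge-set comparison between a generated and a reference
# Mermaid-style caption; the repository's evaluation script defines the real metric.
import re

EDGE = re.compile(r"(\w+)\s*-->\s*(\w+)")

def edge_set(caption):
    return {(u, v) for u, v in EDGE.findall(caption)}

def edge_f1(generated, reference):
    pred, gold = edge_set(generated), edge_set(reference)
    if not pred or not gold:
        return 0.0
    precision = len(pred & gold) / len(pred)
    recall = len(pred & gold) / len(gold)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

print(edge_f1("graph TD\nN0 --> N1\nN2 --> N3", "graph TD\nN0 --> N1"))  # 0.666...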

Citation

If you find this work useful for your research, please consider citing our paper:

@inproceedings{terashita2025can,
  title={Can Visual Encoder Learn to See Arrows?},
  author={Naoyuki Terashita and Yusuke Tozaki and Hideaki Omote and Kha Cong Nguyen and Ryosuke Nakamoto and Yuta Koreeda and Hiroaki Ozaki},
  booktitle={Second Workshop on Visual Concepts},
  year={2025},
  url={https://openreview.net/forum?id=eLZURVinXI}
}
