This repository is the official implementation of *Can Visual Encoder Learn to See Arrows?*, presented as a poster at the Second Workshop on Visual Concepts at CVPR 2025.
Install dependencies from the requirements file:

```shell
pip install -r requirements.txt
```

Repository structure:

```
├── data/                 # Dataset generation scripts
├── encoder_finetune/     # CLIP finetuning code
├── decoder_finetune/     # GPT-2 decoder training code
├── eval/                 # Evaluation scripts
│   ├── probing/          # Linear probing evaluation
│   ├── image_retrieval/  # Image retrieval evaluation
│   └── captioning/       # Diagram captioning evaluation
└── README.md             # This file
```
Generate synthetic diagram-caption pairs without textual/positional biases:

```shell
python data/generate_dataset.py \
    --dataset_size 100000 \
    --n_nodes 8 \
    --prob_edge 0.2 \
    --output_dir dataset/synth_n_nodes_8_prob_edge_0_2_no_label/ \
    --seed 42
```

This creates 100K diagram images with Mermaid-style captions in which edges cannot be inferred from node positions or text content.
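The generation recipe above can be sketched as follows. This is an illustrative stand-in, not the repo's script: `sample_mermaid_caption` and the single-letter node names are hypothetical, and the actual pipeline also renders each graph to an image.

```python
import random

def sample_mermaid_caption(n_nodes: int = 8, prob_edge: float = 0.2, seed: int = 42) -> str:
    """Sample a random directed graph and serialize it as a Mermaid-style caption.

    Each ordered pair (u, v) with u != v becomes an edge independently with
    probability prob_edge, so edges carry no textual or positional cue.
    """
    rng = random.Random(seed)
    nodes = [chr(ord("A") + i) for i in range(n_nodes)]
    edges = [
        (u, v)
        for u in nodes
        for v in nodes
        if u != v and rng.random() < prob_edge
    ]
    lines = ["graph LR"] + [f"    {u} --> {v}" for u, v in edges]
    return "\n".join(lines)

caption = sample_mermaid_caption()
print(caption)
```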
Preprocess the training data for CLIP finetuning:

```shell
python encoder_finetune/preprocess.py \
    --train_json ./dataset/synth_n_nodes_8_prob_edge_0_2_no_label/train.json \
    --image_dir ./dataset/synth_n_nodes_8_prob_edge_0_2_no_label/images \
    --output_dir ./dataset/synth_n_nodes_8_prob_edge_0_2_no_label/
```

Train both CLIP-ViT-Base/32 and CLIP-ViT-Large/14 models:
```shell
python encoder_finetune/finetune_clip_hf.py \
    --do_train --do_eval \
    --model_name_or_path [MODEL_NAME_OR_PATH] \
    --train_file dataset/synth_n_nodes_8_prob_edge_0_2_no_label/clip_train.json \
    --validation_file dataset/synth_n_nodes_8_prob_edge_0_2_no_label/clip_val.json \
    --output_dir [OUTPUT_DIR] \
    --caption_column caption --image_column image \
    --remove_unused_columns=False --overwrite_output_dir=True \
    --max_seq_length=77 --num_train_epochs=100 \
    --per_device_train_batch_size=64 --learning_rate="5e-5" \
    --warmup_steps="0" --weight_decay 0.1 \
    --logging_dir [LOGGING_DIR] --logging_steps 100
```

| Model | MODEL_NAME_OR_PATH | OUTPUT_DIR | LOGGING_DIR |
|---|---|---|---|
| CLIP-Base | openai/clip-vit-base-patch32 | checkpoints/clip-base/ | checkpoints/clip-base/logs |
| CLIP-Large | openai/clip-vit-large-patch14-336 | checkpoints/clip-large/ | checkpoints/clip-large/logs |
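What the finetuning optimizes, in brief: CLIP's symmetric contrastive (InfoNCE) objective pulls each diagram toward its own caption and away from the other captions in the batch. Below is a minimal numpy sketch under that description; the actual training script works on learned encoders with a learnable temperature, and all names here are illustrative.

```python
import numpy as np

def clip_contrastive_loss(image_embeds: np.ndarray, text_embeds: np.ndarray,
                          temperature: float = 0.07) -> float:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    Matching pairs sit on the diagonal of the similarity matrix; the loss is
    the mean of image-to-text and text-to-image cross-entropy.
    """
    # L2-normalize so the dot product is cosine similarity.
    image_embeds = image_embeds / np.linalg.norm(image_embeds, axis=1, keepdims=True)
    text_embeds = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    logits = image_embeds @ text_embeds.T / temperature

    def cross_entropy(l):
        # Row-wise softmax cross-entropy with the diagonal as the target class.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

# Perfectly matched pairs (identical embeddings) give a near-zero loss.
rng = np.random.default_rng(0)
embeds = rng.normal(size=(4, 8))
print(clip_contrastive_loss(embeds, embeds))
```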
Evaluate edge recognition capabilities with frozen encoders:

```shell
python -m eval.probing.run_eval dataset/synth_n_nodes_8_prob_edge_0_2_no_label \
    --n_nodes 8 --rate_train 0.5 --result_dir results/probing/ \
    --imbalance_method undersample \
    [MODEL_SPECIFIC_OPTIONS]
```

| Model Variant | MODEL_SPECIFIC_OPTIONS |
|---|---|
| Original CLIP-Base | (no additional options) |
| Original CLIP-Large | --model openai/clip-vit-large-patch14-336 |
| Finetuned CLIP-Base | --model checkpoints/clip-base/checkpoint-126600 --preprocessor openai/clip-vit-base-patch32 |
| Finetuned CLIP-Large | --model checkpoints/clip-large/checkpoint-63300 --preprocessor openai/clip-vit-large-patch14-336 |
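A note on `--imbalance_method undersample`: with prob_edge=0.2 most node pairs have no edge, so a naive probe can score well by always predicting "no edge". Undersampling keeps equal numbers of positive and negative pairs before fitting the linear probe. A minimal numpy sketch, with hypothetical function name and synthetic data:

```python
import numpy as np

def undersample(features: np.ndarray, labels: np.ndarray, seed: int = 0):
    """Balance a binary edge/no-edge probing set by undersampling the majority class."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    n = min(len(pos), len(neg))
    keep = np.concatenate([
        rng.choice(pos, size=n, replace=False),
        rng.choice(neg, size=n, replace=False),
    ])
    rng.shuffle(keep)
    return features[keep], labels[keep]

# 100 node pairs, ~20% positive, 16-dim frozen-encoder features.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 16))
y = (rng.random(100) < 0.2).astype(int)
Xb, yb = undersample(X, y)
print(yb.mean())  # 0.5: exactly balanced
```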
Test layout-invariant diagram image retrieval:

Step 1: Extract embeddings

```shell
python eval/image_retrieval/embedding_extraction.py \
    --dataset_json ./dataset/synth_n_nodes_8_prob_edge_0_2_no_label/test.json \
    --model [MODEL_PATH] --preprocessor [PREPROCESSOR_PATH] \
    --output_dir ./results/retrieval/embeddings/[VARIANT_NAME]/
```

Step 2: Evaluate retrieval performance

```shell
python eval/image_retrieval/evaluation.py \
    --dataset_json ./dataset/synth_n_nodes_8_prob_edge_0_2_no_label/test.json \
    --model [MODEL_PATH] --preprocessor [PREPROCESSOR_PATH] \
    --image_embeds ./results/retrieval/embeddings/[VARIANT_NAME]/image_embeddings.pkl \
    --batch_size 64 --n_queries 1000 --seed 0 --layout fdp --top_k 100 \
    --output_score --output_retrieved_json \
    --output_dir ./results/retrieval/scores/[VARIANT_NAME]
```

| Model Variant | MODEL_PATH | PREPROCESSOR_PATH | VARIANT_NAME |
|---|---|---|---|
| Original CLIP-Base | openai/clip-vit-base-patch32 | openai/clip-vit-base-patch32 | base_ori |
| Finetuned CLIP-Base | ./checkpoints/clip-base/checkpoint-126600 | openai/clip-vit-base-patch32 | base_finetuned |
| Original CLIP-Large | openai/clip-vit-large-patch14-336 | openai/clip-vit-large-patch14-336 | large_ori |
| Finetuned CLIP-Large | ./checkpoints/clip-large/checkpoint-63300 | openai/clip-vit-large-patch14-336 | large_finetuned |
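The retrieval step reduces to ranking precomputed gallery embeddings by cosine similarity to each query. A toy numpy sketch of a Recall@k check under that assumption; the function name and random data are illustrative, not the repo's implementation:

```python
import numpy as np

def recall_at_k(query_embeds, gallery_embeds, gt_indices, k=5):
    """Rank gallery images by cosine similarity to each query and check whether
    the ground-truth image (e.g. the same graph rendered with a different
    layout) appears in the top-k."""
    q = query_embeds / np.linalg.norm(query_embeds, axis=1, keepdims=True)
    g = gallery_embeds / np.linalg.norm(gallery_embeds, axis=1, keepdims=True)
    sims = q @ g.T
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = [gt in row for gt, row in zip(gt_indices, topk)]
    return float(np.mean(hits))

# Toy check: queries are slightly noisy copies of their ground-truth gallery items.
rng = np.random.default_rng(0)
gallery = rng.normal(size=(50, 32))
queries = gallery[:10] + 0.01 * rng.normal(size=(10, 32))
score = recall_at_k(queries, gallery, gt_indices=list(range(10)), k=5)
print(score)  # 1.0: every ground-truth item is retrieved
```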
Train GPT-2 decoders on different CLIP encoders:

Step 1: Preprocess data

```shell
# Preprocess training data
python decoder_finetune/preprocess.py \
    --input_json ./dataset/synth_n_nodes_8_prob_edge_0_2_no_label/train.json \
    --output_json ./dataset/synth_n_nodes_8_prob_edge_0_2_no_label/converted_train.json

# Preprocess test data
python decoder_finetune/preprocess.py \
    --input_json ./dataset/synth_n_nodes_8_prob_edge_0_2_no_label/test.json \
    --output_json ./dataset/synth_n_nodes_8_prob_edge_0_2_no_label/converted_test.json
```

Step 2: Train GPT-2 decoder

```shell
python decoder_finetune/train.py \
    --annotation_file ./dataset/synth_n_nodes_8_prob_edge_0_2_no_label/converted_train.json \
    --dataset_dir ./dataset/synth_n_nodes_8_prob_edge_0_2_no_label \
    --vision_encoder_name_or_path [ENCODER_PATH] \
    --processor_name [PROCESSOR_NAME] \
    --model_name_or_path openai-community/gpt2 \
    --tokenizer_name openai-community/gpt2 \
    --freeze_vision_encoder True \
    --output_dir ./checkpoints/gpt2_[VARIANT_NAME] \
    --logging_dir ./checkpoints/gpt2_[VARIANT_NAME]/logs \
    --per_device_train_batch_size 8 --per_device_eval_batch_size 8 \
    --num_train_epochs 100 --learning_rate 5e-5 --warmup_steps 1024 \
    --do_train --do_eval --predict_with_generate \
    --logging_steps 10 --save_steps 100 --eval_steps 100 \
    --overwrite_output_dir True --report_to tensorboard --seed 42
```

| Model Variant | ENCODER_PATH | PROCESSOR_NAME | VARIANT_NAME |
|---|---|---|---|
| Original CLIP-Base | openai/clip-vit-base-patch32 | openai/clip-vit-base-patch32 | ori_clip_base |
| Finetuned CLIP-Base | ./checkpoints/clip-base/checkpoint-126600 | openai/clip-vit-base-patch32 | tuned_clip_base |
| Original CLIP-Large | openai/clip-vit-large-patch14-336 | openai/clip-vit-large-patch14-336 | ori_clip_large |
| Finetuned CLIP-Large | ./checkpoints/clip-large/checkpoint-63300 | openai/clip-vit-large-patch14-336 | tuned_clip_large |
Step 1: Generate captions

```shell
python decoder_finetune/inference.py \
    --model_checkpoint_path checkpoints/gpt2_[VARIANT_NAME]/checkpoint-237000 \
    --image_processor_name_or_path [PROCESSOR_NAME] \
    --dataset_dir dataset/synth_n_nodes_8_prob_edge_0_2_no_label \
    --run_label [VARIANT_NAME] \
    --output_dir results/captioning/generated_captions/
```

Step 2: Evaluate captions

```shell
python eval/captioning/eval.py \
    --model_name [VARIANT_NAME] \
    --generated_result_json results/captioning/generated_captions/[VARIANT_NAME]_result.json \
    --annotation_json dataset/synth_n_nodes_8_prob_edge_0_2_no_label/converted_test.json \
    --strict_edge
```

| Model Variant | VARIANT_NAME | PROCESSOR_NAME |
|---|---|---|
| Original CLIP-Base | ori_clip_base | openai/clip-vit-base-patch32 |
| Finetuned CLIP-Base | tuned_clip_base | openai/clip-vit-base-patch32 |
| Original CLIP-Large | ori_clip_large | openai/clip-vit-large-patch14-336 |
| Finetuned CLIP-Large | tuned_clip_large | openai/clip-vit-large-patch14-336 |
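The `--strict_edge` criterion can be illustrated as an exact match between predicted and reference edge sets. This sketch assumes Mermaid-style `A --> B` edges; the function names and regex are illustrative, not the repo's exact parser:

```python
import re

def parse_edges(caption: str) -> set:
    """Extract directed edges like "A --> B" from a Mermaid-style caption."""
    return set(re.findall(r"(\w+)\s*-->\s*(\w+)", caption))

def strict_edge_match(predicted: str, reference: str) -> bool:
    """Strict criterion: a caption counts as correct only if its edge set
    equals the reference edge set exactly (order of lines does not matter,
    but edge direction does)."""
    return parse_edges(predicted) == parse_edges(reference)

print(strict_edge_match("graph LR\n A --> B\n C --> A",
                        "graph LR\n C --> A\n A --> B"))  # True: same edge set
print(strict_edge_match("graph LR\n A --> B",
                        "graph LR\n B --> A"))            # False: direction matters
```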
If you find this work useful for your research, please consider citing our paper:

```bibtex
@inproceedings{terashita2025can,
  title={Can Visual Encoder Learn to See Arrows?},
  author={Naoyuki Terashita and Yusuke Tozaki and Hideaki Omote and Kha Cong Nguyen and Ryosuke Nakamoto and Yuta Koreeda and Hiroaki Ozaki},
  booktitle={Second Workshop on Visual Concepts},
  year={2025},
  url={https://openreview.net/forum?id=eLZURVinXI}
}
```