
LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs

Boyuan Sun1,2* · Jiaxing Zhao2* · Xihan Wei2 · Qibin Hou1†

1VCIP, Nankai University    2Tongyi Lab, Alibaba Group    

*Equal contribution  †Corresponding author

Paper PDF

News

Overview

LLaVA-Scissor provides a solution for compressing video tokens based on identifying semantic connected components.

[Overview figure]

Previous methods mostly attempt to compress tokens based on attention scores, but fail to achieve complete semantic coverage and tend to repeatedly select similar semantic regions. To address this, we propose a novel method termed Semantic Connected Components (SCC), which identifies semantically coherent groups of tokens to represent distinct, non-overlapping semantic regions across the video. SCC does not rely on positional adjacency, enabling it to capture global semantic relationships across the entire token sequence.

Building upon SCC, we propose LLaVA-Scissor, a two-step spatio-temporal token compression strategy. Specifically, we first extract spatially distinct semantic regions within each frame, and then perform temporal compression by identifying and merging redundant semantics across frames. The final token set consists of compact yet diverse tokens that jointly represent the entire video content without semantic duplication.

For technical details, please refer to our full paper on arXiv.
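
To make the idea concrete, here is a minimal sketch of the two-step compression, not the repository implementation: tokens are linked whenever their cosine similarity exceeds a threshold tau, each connected component is collapsed into one representative token, and the procedure runs first within each frame and then across frames. The function names, the mean-pooled representatives, the dummy feature shapes, and the use of an exact DFS (rather than the epsilon-controlled approximate computation described in the paper) are all illustrative assumptions.

import torch


def connected_components(tokens: torch.Tensor, tau: float) -> list[list[int]]:
    # Group token indices into components of the tau-thresholded cosine-similarity graph.
    feats = torch.nn.functional.normalize(tokens, dim=-1)
    adj = (feats @ feats.T) >= tau                     # (N, N) boolean adjacency
    n, visited, components = tokens.shape[0], set(), []
    for start in range(n):
        if start in visited:
            continue
        stack, comp = [start], []
        visited.add(start)
        while stack:                                   # plain DFS; the paper instead uses an
            i = stack.pop()                            # epsilon-approximate computation
            comp.append(i)
            for j in torch.nonzero(adj[i]).flatten().tolist():
                if j not in visited:
                    visited.add(j)
                    stack.append(j)
        components.append(comp)
    return components


def compress(tokens: torch.Tensor, tau: float) -> torch.Tensor:
    # Keep one representative (here, the mean) per semantic connected component.
    return torch.stack([tokens[c].mean(dim=0) for c in connected_components(tokens, tau)])


# Two-step spatio-temporal compression on dummy features of shape (frames, tokens, dim).
video_tokens = torch.randn(16, 196, 1152)
spatial = [compress(frame, tau=0.9) for frame in video_tokens]   # step 1: within each frame
final_tokens = compress(torch.cat(spatial, dim=0), tau=0.9)      # step 2: across frames
print(final_tokens.shape)                                        # compact, non-redundant token set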

Getting Started

Installation

git clone [email protected]:HumanMLLM/LLaVA-Scissor.git
cd LLaVA-Scissor
conda create -n llava_scissor python=3.10
conda activate llava_scissor
pip install -e .[train]
pip install "accelerate==0.28.0"
pip install matplotlib huggingface_hub

Enhanced LLaVA-OneVision Baseline:

mkdir model_zoo
cd model_zoo

# (Optional) Using huggingface mirrors
export HF_ENDPOINT=https://hf-mirror.com

# download the SIGLIP model from Hugging Face
huggingface-cli download --resume-download google/siglip-so400m-patch14-384 --local-dir google/siglip-so400m-patch14-384

# download enhanced baseline model from huggingface
# 7B model
huggingface-cli download --resume-download BBBBCHAN/LLaVA-Scissor-baseline-7B --local-dir LLaVA-Scissor-baseline-7B
# 0.5B model
huggingface-cli download --resume-download BBBBCHAN/LLaVA-Scissor-baseline-0.5B --local-dir LLaVA-Scissor-baseline-0.5B

# We also support the original LLaVA-OneVision models, e.g. (assuming the lmms-lab Hugging Face repositories):
huggingface-cli download --resume-download lmms-lab/llava-onevision-qwen2-7b-ov --local-dir llava-onevision-qwen2-7b-ov
huggingface-cli download --resume-download lmms-lab/llava-onevision-qwen2-0.5b-ov --local-dir llava-onevision-qwen2-0.5b-ov

Setting Hyperparameters

We use a similarity threshold $\tau$ and an error tolerance $\epsilon$ to control the compression process. You can find and modify mm_zip_tau and mm_zip_epsilon in the config.json of the model directory (model_zoo/BBBBCHAN/LLaVA-Scissor-baseline-7B, for instance).

The default setting uses mm_zip_tau = 0.9 and mm_zip_epsilon = 0.05. The similarity threshold mm_zip_tau controls the semantic similarity required to group tokens into the same connected component, while the error tolerance mm_zip_epsilon determines the approximation accuracy when computing these components.

We recommend adjusting the similarity threshold mm_zip_tau to control the token retention ratio, for example as sketched below.
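
The snippet below is a minimal sketch of editing these two keys programmatically; the config path is the example model directory from above, and the chosen values are illustrative only.

import json

# Example path from above; point this at your own model directory.
config_path = "model_zoo/BBBBCHAN/LLaVA-Scissor-baseline-7B/config.json"

with open(config_path) as f:
    config = json.load(f)

config["mm_zip_tau"] = 0.85      # a lower tau groups more tokens together, retaining fewer
config["mm_zip_epsilon"] = 0.05  # approximation tolerance when computing the components

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)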

Usage

Quick Start

from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
from llava.conversation import conv_templates, SeparatorStyle

import numpy as np
from PIL import Image
import requests
import copy
import warnings
from decord import VideoReader, cpu

warnings.filterwarnings("ignore")
# Load model
pretrained = "model_zoo/BBBBCHAN/LLaVA-Scissor-baseline-7B/"
model_name = "llava_qwen_zip"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map, attn_implementation="sdpa")

model.eval()


def load_video(video_path, max_frames_num):
    if type(video_path) == str:
        vr = VideoReader(video_path, ctx=cpu(0))
    else:
        vr = VideoReader(video_path[0], ctx=cpu(0))
    total_frame_num = len(vr)
    uniform_sampled_frames = np.linspace(0, total_frame_num - 1, max_frames_num, dtype=int)
    frame_idx = uniform_sampled_frames.tolist()
    spare_frames = vr.get_batch(frame_idx).asnumpy()
    return spare_frames  # (frames, height, width, channels)


# Load and process video
video_path = "Your/path/to/the/video"
video_frames = load_video(video_path, 16)
print(video_frames.shape)
image_tensors = []
frames = image_processor.preprocess(video_frames, return_tensors="pt")["pixel_values"].half().cuda()
image_tensors.append(frames)

# Prepare conversation input
conv_template = "qwen_2"
question = f"{DEFAULT_IMAGE_TOKEN}\nDescribe this video in detail."

conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()

input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
image_sizes = [frame.size for frame in video_frames]

# Generate response
cont = model.generate(
    input_ids,
    images=image_tensors,
    image_sizes=image_sizes,
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
    modalities=["video"],
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
print(text_outputs[0])

Evaluation

We implement the evaluation of LLaVA-Scissor on top of lmms_eval.

MODEL_NAME="llava_scissor_7B_baseline"
MODEL_PATH='model_zoo/BBBBCHAN/LLaVA-Scissor-baseline-7B/'

accelerate launch --num_processes 8 --main_process_port 23555 -m lmms_eval \
    --model llava_onevision \
    --model_args pretrained=$MODEL_PATH,conv_template=qwen_2,model_name=llava_qwen_zip \
    --tasks mvbench \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix eval \
    --output_path ./logs/$MODEL_NAME

echo $MODEL_NAME 
echo $MODEL_PATH

We provide more evaluation scripts in scripts/eval/.

Citation

If you find our repo useful for your research, please consider citing our paper:

@article{sun2025llava,
  title={LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs},
  author={Sun, Boyuan and Zhao, Jiaxing and Wei, Xihan and Hou, Qibin},
  journal={arXiv preprint arXiv:2506.21862},
  year={2025}
}

License

This code is licensed under the Creative Commons Attribution-NonCommercial 4.0 International for non-commercial use only. Please note that any commercial use of this code requires formal permission prior to use.

Contact

For technical questions, please contact sbysbysby123[AT]gmail.com.

For commercial licensing, please contact andrewhoux[AT]gmail.com.

Join us

If you are looking for a job or internship at Tongyi Lab, Alibaba Group, please contact [email protected] (WeChat: weixihan1).

Acknowledgement

We thank LLaVA-NeXT, Qwen2.5, SigLIP, Oryx-Dataset, and other excellent open-source projects for their amazing work!
