
LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs

Boyuan Sun1,2* · Jiaxing Zhao2* · Xihan Wei2 · Qibin Hou1†

1VCIP, Nankai University    2Tongyi Lab, Alibaba Group    

*Equal contribution  †Corresponding author

Paper PDF

News

Overview

LLaVA-Scissor provides a solution for compressing video tokens based on identifying semantic connected components.

[Overview figure]

Previous methods mostly attempt to compress tokens based on attention scores, but fail to achieve complete semantic coverage and tend to repeatedly select similar semantic regions. To address this, we propose a novel method termed Semantic Connected Components (SCC), which identifies semantically coherent groups of tokens to represent distinct, non-overlapping semantic regions across the video. SCC does not rely on positional adjacency, enabling it to capture global semantic relationships across the entire token sequence.

Building upon SCC, we propose LLaVA-Scissor, a two-step spatio-temporal token compression strategy. Specifically, we first extract spatially distinct semantic regions within each frame, and then perform temporal compression by identifying and merging redundant semantics across frames. The final token set consists of compact yet diverse tokens that jointly represent the entire video content without semantic duplication.

For technical details, please refer to our full paper on arXiv.
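
To make the idea concrete, here is a minimal sketch of the two-step compression, not the repository implementation: tokens are linked whenever their cosine similarity exceeds a threshold tau, each connected component is collapsed into one representative token, and the procedure runs first within each frame and then across frames. The function names, the mean-pooled representatives, the dummy feature shapes, and the use of an exact DFS (rather than the epsilon-controlled approximate computation described in the paper) are all illustrative assumptions.

import torch


def connected_components(tokens: torch.Tensor, tau: float) -> list[list[int]]:
    # Group token indices into components of the tau-thresholded cosine-similarity graph.
    feats = torch.nn.functional.normalize(tokens, dim=-1)
    adj = (feats @ feats.T) >= tau                     # (N, N) boolean adjacency
    n, visited, components = tokens.shape[0], set(), []
    for start in range(n):
        if start in visited:
            continue
        stack, comp = [start], []
        visited.add(start)
        while stack:                                   # plain DFS; the paper instead uses an
            i = stack.pop()                            # epsilon-approximate computation
            comp.append(i)
            for j in torch.nonzero(adj[i]).flatten().tolist():
                if j not in visited:
                    visited.add(j)
                    stack.append(j)
        components.append(comp)
    return components


def compress(tokens: torch.Tensor, tau: float) -> torch.Tensor:
    # Keep one representative (here, the mean) per semantic connected component.
    return torch.stack([tokens[c].mean(dim=0) for c in connected_components(tokens, tau)])


# Two-step spatio-temporal compression on dummy features of shape (frames, tokens, dim).
video_tokens = torch.randn(16, 196, 1152)
spatial = [compress(frame, tau=0.9) for frame in video_tokens]   # step 1: within each frame
final_tokens = compress(torch.cat(spatial, dim=0), tau=0.9)      # step 2: across frames
print(final_tokens.shape)                                        # compact, non-redundant token set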

Getting Started

Installation

git clone [email protected]:HumanMLLM/LLaVA-Scissor.git
cd LLaVA-Scissor
conda create -n llava_scissor python=3.10
conda activate llava_scissor
pip install -e .[train]
pip install "accelerate==0.28.0"
pip install matplotlib huggingface_hub

Enhanced LLaVA-OneVision Baseline:

mkdir model_zoo
cd model_zoo

# (Optional) Using huggingface mirrors
export HF_ENDPOINT=https://hf-mirror.com

# download the SIGLIP model from Hugging Face
huggingface-cli download --resume-download google/siglip-so400m-patch14-384 --local-dir google/siglip-so400m-patch14-384

# download enhanced baseline model from huggingface
# 7B model
huggingface-cli download --resume-download BBBBCHAN/LLaVA-Scissor-baseline-7B --local-dir LLaVA-Scissor-baseline-7B
# 0.5B model
huggingface-cli download --resume-download BBBBCHAN/LLaVA-Scissor-baseline-0.5B --local-dir LLaVA-Scissor-baseline-0.5B

# We also support the original LLaVA-OneVision models, e.g. (assuming the lmms-lab Hugging Face repositories):
huggingface-cli download --resume-download lmms-lab/llava-onevision-qwen2-7b-ov --local-dir llava-onevision-qwen2-7b-ov
huggingface-cli download --resume-download lmms-lab/llava-onevision-qwen2-0.5b-ov --local-dir llava-onevision-qwen2-0.5b-ov

Setting Hyperparameters

We use a similarity threshold $\tau$ and an error tolerance $\epsilon$ to control the compression process. You can find and modify mm_zip_tau and mm_zip_epsilon in the config.json of the model directory (model_zoo/BBBBCHAN/LLaVA-Scissor-baseline-7B, for instance).

The default setting uses mm_zip_tau = 0.9 and mm_zip_epsilon = 0.05. The similarity threshold mm_zip_tau controls the semantic similarity required to group tokens into the same connected component, while the error tolerance mm_zip_epsilon determines the approximation accuracy when computing these components.

We recommend adjusting the similarity threshold mm_zip_tau to control the token retention ratio, for example as sketched below.
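
The snippet below is a minimal sketch of editing these two keys programmatically; the config path is the example model directory from above, and the chosen values are illustrative only.

import json

# Example path from above; point this at your own model directory.
config_path = "model_zoo/BBBBCHAN/LLaVA-Scissor-baseline-7B/config.json"

with open(config_path) as f:
    config = json.load(f)

config["mm_zip_tau"] = 0.85      # a lower tau groups more tokens together, retaining fewer
config["mm_zip_epsilon"] = 0.05  # approximation tolerance when computing the components

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)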

Usage

Quick Start

from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
from llava.conversation import conv_templates, SeparatorStyle

import numpy as np
from PIL import Image
import requests
import copy
import warnings
from decord import VideoReader, cpu

warnings.filterwarnings("ignore")
# Load model
pretrained = "model_zoo/BBBBCHAN/LLaVA-Scissor-baseline-7B/"
model_name = "llava_qwen_zip"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map, attn_implementation="sdpa")

model.eval()


def load_video(video_path, max_frames_num):
    if type(video_path) == str:
        vr = VideoReader(video_path, ctx=cpu(0))
    else:
        vr = VideoReader(video_path[0], ctx=cpu(0))
    total_frame_num = len(vr)
    uniform_sampled_frames = np.linspace(0, total_frame_num - 1, max_frames_num, dtype=int)
    frame_idx = uniform_sampled_frames.tolist()
    spare_frames = vr.get_batch(frame_idx).asnumpy()
    return spare_frames  # (frames, height, width, channels)


# Load and process video
video_path = "Your/path/to/the/video"
video_frames = load_video(video_path, 16)
print(video_frames.shape)
image_tensors = []
frames = image_processor.preprocess(video_frames, return_tensors="pt")["pixel_values"].half().cuda()
image_tensors.append(frames)

# Prepare conversation input
conv_template = "qwen_2"
question = f"{DEFAULT_IMAGE_TOKEN}\nDescribe this video in detail."

conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()

input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
image_sizes = [frame.size for frame in video_frames]

# Generate response
cont = model.generate(
    input_ids,
    images=image_tensors,
    image_sizes=image_sizes,
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
    modalities=["video"],
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
print(text_outputs[0])

Evaluation

We implement the evaluation of LLaVA-Scissor on top of lmms_eval.

MODEL_NAME="llava_scissor_7B_baseline"
MODEL_PATH='model_zoo/BBBBCHAN/LLaVA-Scissor-baseline-7B/'

accelerate launch --num_processes 8 --main_process_port 23555 -m lmms_eval \
    --model llava_onevision \
    --model_args pretrained=$MODEL_PATH,conv_template=qwen_2,model_name=llava_qwen_zip \
    --tasks mvbench \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix eval \
    --output_path ./logs/$MODEL_NAME

echo $MODEL_NAME 
echo $MODEL_PATH

We provide more evaluation scripts in scripts/eval/.

Citation

If you find our repo useful for your research, please consider citing our paper:

@article{sun2025llava,
  title={LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs},
  author={Sun, Boyuan and Zhao, Jiaxing and Wei, Xihan and Hou, Qibin},
  journal={arXiv preprint arXiv:2506.21862},
  year={2025}
}

License

This code is licensed under the Creative Commons Attribution-NonCommercial 4.0 International for non-commercial use only. Please note that any commercial use of this code requires formal permission prior to use.

Contact

For technical questions, please contact sbysbysby123[AT]gmail.com.

For commercial licensing, please contact andrewhoux[AT]gmail.com.

Join us

If you are looking for a job or internship at Tongyi Lab, Alibaba Group, please contact [email protected] (WeChat: weixihan1).

Acknowledgement

We thank LLaVA-NeXT, Qwen2.5, SigLIP, Oryx-Dataset, and other excellent open-source projects for their amazing work!
