HERMES

KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding

Project Page | Paper | HF Paper

🔥 News

  • [2026.03.23] Full code released!
  • [2025.01.23] HERMES reached #3 Paper of the day on Hugging Face Daily Papers!
  • [2025.01.21] HERMES is available on arXiv.

🛠️ Installation

For LLaVA model inference:

conda create -n hermes-llava python=3.12 -y
conda activate hermes-llava
pip install -r requirements_llava.txt
pip install flash-attn --no-build-isolation

For Qwen2.5-VL model inference:

conda create -n hermes-qwen python=3.12 -y
conda activate hermes-qwen
pip install -r requirements_qwen.txt
pip install flash-attn --no-build-isolation
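After installation, a quick way to confirm the environment resolved correctly is to print the versions of the key packages. The helper below is our own sketch (not part of the repo), and the package list is an assumption based on the install steps above:

```python
# Sketch: report installed versions of packages the inference code relies on.
# The package names ("torch", "transformers", "flash-attn") are assumptions
# inferred from the installation steps; adjust to your requirements files.
from importlib import metadata

def version_of(dist_name: str) -> str:
    """Return the installed version of a distribution, or 'not installed'."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return "not installed"

if __name__ == "__main__":
    for name in ("torch", "transformers", "flash-attn"):
        print(f"{name}: {version_of(name)}")
```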

📦 Preparation

Model Preparation

Create a models directory and download the model weights from HuggingFace:

mkdir models

We support the following models (choose one or more):

| Model Family | Model | HuggingFace Link |
| --- | --- | --- |
| LLaVA-OneVision | llava-onevision-qwen2-0.5b-ov-hf | llava-hf/llava-onevision-qwen2-0.5b-ov-hf |
| LLaVA-OneVision | llava-onevision-qwen2-7b-ov-hf | llava-hf/llava-onevision-qwen2-7b-ov-hf |
| LLaVA-OneVision | llava-onevision-qwen2-72b-ov-hf | llava-hf/llava-onevision-qwen2-72b-ov-hf |
| Qwen2.5-VL | Qwen2.5-VL-3B-Instruct | Qwen/Qwen2.5-VL-3B-Instruct |
| Qwen2.5-VL | Qwen2.5-VL-7B-Instruct | Qwen/Qwen2.5-VL-7B-Instruct |
| Qwen2.5-VL | Qwen2.5-VL-32B-Instruct | Qwen/Qwen2.5-VL-32B-Instruct |
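One way to fetch the weights into models/ is a short download loop with huggingface_hub. This is a sketch of our own, not part of the codebase: the repo IDs come from the table above, but the helper and folder-naming logic are assumptions (the project expects each checkpoint under models/&lt;repo name&gt;, matching the directory tree below).

```python
# Sketch (our assumption): download supported checkpoints into models/
# using huggingface_hub. Repo IDs are taken from the table above.
REPO_IDS = [
    "llava-hf/llava-onevision-qwen2-7b-ov-hf",
    "Qwen/Qwen2.5-VL-7B-Instruct",
]

def local_dir(repo_id: str) -> str:
    """Local folder expected by the project layout: models/<repo name>."""
    return f"models/{repo_id.split('/')[-1]}"

if __name__ == "__main__":
    # Requires network access (`pip install huggingface_hub`) and, for
    # gated repos, a logged-in Hugging Face token.
    from huggingface_hub import snapshot_download
    for repo_id in REPO_IDS:
        snapshot_download(repo_id=repo_id, local_dir=local_dir(repo_id))
```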

Data Preparation

Download the benchmark videos from their official sources and place them according to the paths specified in the annotation files:

Streaming Benchmarks:

| Benchmark | Video Path | Official Source |
| --- | --- | --- |
| StreamingBench | /data/streamingbench/videos/ | 🤗 StreamingBench |
| OVO-Bench | /data/ovobench/videos/ | 🤗 OVO-Bench |
| RVS-Ego | /data/rvs/ego/videos/ | 🤗 RVS |
| RVS-Movie | /data/rvs/movie/videos/ | 🤗 RVS |

Offline Benchmarks:

| Benchmark | Video Path | Official Source |
| --- | --- | --- |
| VideoMME | /data/videomme/videos/ | 🤗 VideoMME |
| MVBench | /data/mvbench/videos/ | 🤗 MVBench |
| EgoSchema | /data/egoschema/videos/ | 🤗 EgoSchema |

The annotation JSON files contain the same information as the official releases, with formatting adjusted to fit our codebase.

After preparation, the project structure should look like this:

HERMES/
├── asset/
│   └── logo.png
├── data/
│   ├── egoschema/
│   │   ├── videos/
│   │   └── egoschema.json
│   ├── mvbench/
│   │   ├── videos/
│   │   └── mvbench.json
│   ├── ovobench/
│   │   ├── videos/
│   │   └── ovobench_realtime_backeward.json
│   ├── rvs/
│   │   ├── ego/
│   │   │   ├── videos/
│   │   │   └── ego4d_oe.json
│   │   └── movie/
│   │       ├── videos/
│   │       └── movienet_oe.json
│   ├── streamingbench/
│   │   ├── videos/
│   │   └── streamingbench_realtime.json
│   └── videomme/
│       ├── videos/
│       └── videomme.json
├── eval/
│   ├── eval_multiple_choice.py
│   └── eval_open_ended.py
├── inference/
│   ├── abstract_hermes.py
│   ├── llavaov_hermes.py
│   ├── qwenvl_hermes.py
│   ├── reindex_1d.py
│   └── reindex_3d.py
├── models/
│   ├── llava-onevision-qwen2-0.5b-ov-hf/
│   ├── llava-onevision-qwen2-7b-ov-hf/
│   ├── llava-onevision-qwen2-72b-ov-hf/
│   ├── Qwen2.5-VL-3B-Instruct/
│   ├── Qwen2.5-VL-7B-Instruct/
│   └── Qwen2.5-VL-32B-Instruct/
├── scripts/
│   └── run_infer.sh
├── video_qa/
│   ├── base.py
│   ├── hermes_vqa.py
│   └── run_infer.py
├── LICENSE
├── README.md
├── requirements_llava.txt
└── requirements_qwen.txt
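Before launching inference, a quick path check can confirm the layout above is in place. This is our own sketch (not a repo utility); the directory list mirrors a subset of the tree above and can be extended to the benchmarks you actually downloaded:

```python
# Sketch: verify that the expected data/ and models/ directories exist
# under the project root before running inference. The REQUIRED_DIRS list
# is an illustrative subset of the tree shown above.
from pathlib import Path

REQUIRED_DIRS = [
    "data/streamingbench/videos",
    "data/ovobench/videos",
    "models",
    "scripts",
]

def missing_paths(root: str, required=REQUIRED_DIRS) -> list[str]:
    """Return the required paths that do not exist under `root`."""
    return [p for p in required if not (Path(root) / p).is_dir()]

if __name__ == "__main__":
    gaps = missing_paths(".")
    print("layout ok" if not gaps else f"missing: {gaps}")
```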

🚀 Inference

Simply run the inference script:

bash scripts/run_infer.sh

Here is the content of scripts/run_infer.sh:

export PYTHONPATH=$(cd "$(dirname "$0")/.." && pwd):$PYTHONPATH

num_chunks=8
model=llava_ov_7b
dataset=streamingbench

python video_qa/run_infer.py \
    --num_chunks $num_chunks \
    --model ${model} \
    --dataset ${dataset} \
    --sample_fps 0.5 \
    --kv_size 6000

Arguments:

| Argument | Description |
| --- | --- |
| model | Model to use. Options: llava_ov_0.5b, llava_ov_7b, llava_ov_72b, qwen2.5_vl_3b, qwen2.5_vl_7b, qwen2.5_vl_32b |
| dataset | Benchmark dataset. Options: videomme, mvbench, egoschema, rvs_ego, rvs_movie, ovobench, streamingbench |
| num_chunks | Number of parallel processes for evaluation; typically set to the number of GPUs |
| sample_fps | Frame sampling rate (frames per second) from the video |
| kv_size | Maximum KV cache size for HERMES hierarchical memory management |
| only_eval | If set, skip inference and only run evaluation on existing results |
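Judging from the evaluation commands below, each run writes its outputs under results/&lt;model&gt;/&lt;dataset&gt;/fps&lt;sample_fps&gt;-kv&lt;kv_size&gt;/. The helper below makes that inferred convention explicit; it is our own assumption, not a function exposed by the repo:

```python
# Sketch: the per-run results directory convention inferred from the
# evaluation examples (e.g. results/llava_ov_7b/streamingbench/fps0.5-kv6000/).
def results_dir(model: str, dataset: str, sample_fps: float, kv_size: int) -> str:
    # %g prints the fps without trailing zeros, e.g. 0.5 -> "fps0.5"
    return f"results/{model}/{dataset}/fps{sample_fps:g}-kv{kv_size}"

print(results_dir("llava_ov_7b", "streamingbench", 0.5, 6000))
# -> results/llava_ov_7b/streamingbench/fps0.5-kv6000
```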

📊 Evaluation

The evaluation scripts compute metrics on the inference results:

  • Multiple-choice benchmarks (VideoMME, MVBench, EgoSchema, OVO-Bench, StreamingBench) are evaluated by eval/eval_multiple_choice.py, which takes a subcommand as its first argument:

| Subcommand | Description | Used by |
| --- | --- | --- |
| general | Compute overall accuracy, a task-specific breakdown (auto-detects OVO-Bench / StreamingBench), and a prediction error analysis | MVBench, OVO-Bench, StreamingBench, VideoMME |
| videomme | Report accuracy broken down by video duration (short / medium / long) | VideoMME |
| egoschema | Generate the EgoSchema submission CSV file | EgoSchema |

Example:

python eval/eval_multiple_choice.py general --results_path results/llava_ov_7b/streamingbench/fps0.5-kv6000/results.csv
  • Open-ended benchmarks (RVS-Ego, RVS-Movie) are evaluated by eval/eval_open_ended.py, which uses GPT for answer scoring:
python eval/eval_open_ended.py \
    --pred_path results/llava_ov_7b/rvs_ego/fps0.5-kv6000/results.csv \
    --output_dir results/llava_ov_7b/rvs_ego/fps0.5-kv6000/tmp \
    --output_json results/llava_ov_7b/rvs_ego/fps0.5-kv6000/results.json

📧 Contact

For any questions regarding the paper or the technical implementation, please feel free to contact haowei.zhang123@gmail.com.

🙏 Acknowledgements

Our codebase is built upon ReKV. We gratefully acknowledge their contributions to the community.

📝 Citation

If you find our work useful for your research, please cite our paper and give us a star 😄:

@misc{zhang2026hermeskvcachehierarchical,
      title={HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding}, 
      author={Haowei Zhang and Shudong Yang and Jinlan Fu and See-Kiong Ng and Xipeng Qiu},
      year={2026},
      eprint={2601.14724},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.14724}, 
}

About

Official Repository for paper "HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding"
