EpiCache: Episodic KV Cache Management for Long Conversational Question Answering

This repository contains the code to reproduce the results of the paper EpiCache.
Paper: https://arxiv.org/abs/2509.17396

This code is developed based on the KVzip repository (MIT License, https://github.com/snu-mllab/KVzip) and AdaKV (https://github.com/FFY0/AdaKV).

Installation

The code has been verified to work with CUDA 12.1, Python 3.10, and PyTorch 2.3.

pip install -r requirements.txt
pip install flash-attn==2.7.4.post1 --no-build-isolation
make i
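
To confirm your environment matches the verified versions above, here is a quick sanity check (a minimal snippet, not part of the repository):

# Environment sanity check (illustrative; not part of the repository)
import sys
import torch
print("Python:", sys.version.split()[0])    # expect 3.10.x
print("PyTorch:", torch.__version__)        # expect 2.3.x
print("CUDA:", torch.version.cuda)          # expect 12.1
print("GPU available:", torch.cuda.is_available())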

Evaluation

# For EpiCache
bash scripts/run_epicache_qwen.sh ${GPU_NUM} ${SCALE} ${LEVEL} ${KV_BUDGET} ${CHUNK_SIZE} ${DO_ALLOC} ${DATA} ${TARGET_LENGTH}
bash scripts/run_llama_epicache.sh ${GPU_NUM} ${SCALE} ${LEVEL} ${KV_BUDGET} ${CHUNK_SIZE} ${DO_ALLOC} ${DATA} ${TARGET_LENGTH}

# For Baseline (KVzip, KeyDiff, SnapKV, InfiniPot)
bash scripts/run_qwen_baseline.sh ${GPU_NUM} ${SCALE} ${LEVEL} ${KV_BUDGET} ${CHUNK_SIZE} ${METHOD} ${DATA} ${TARGET_LENGTH}
bash scripts/run_llama_baseline.sh ${GPU_NUM} ${SCALE} ${LEVEL} ${KV_BUDGET} ${CHUNK_SIZE} ${METHOD} ${DATA} ${TARGET_LENGTH}

# GPT Score for Realtalk
python run_gpt_eval.py --directory ${RESULT_DIR}

Arguments

  • GPU_NUM: GPU index

  • SCALE: Model scale (3B, 7B, 8B)

  • LEVEL: KV cache selection strategy

    • pair: non-uniform selection (head-wise)
    • pair-uniform: uniform per head
    • Example (KV budget = 16K across 4 heads): pair → [2K, 6K, 3K, 5K]; pair-uniform → [4K, 4K, 4K, 4K] (see the sketch after this list)
  • KV_BUDGET: KV cache budget ($M$)

  • CHUNK_SIZE: Block size used in block prefill eviction

  • EMBEDDING: Embedding model for clustering & query matching

    • qwen: Qwen-0.6B (higher accuracy)
    • sentence: MiniLM-L6-v2 (faster, recommended)
    • llm: Target LLM’s internal embedding layer (lower accuracy)
  • POWER: Sharpness parameter ($\alpha$) for layer-wise allocation

  • DATA: Dataset (locomo, realtalk, longmemeval)

  • DO_ALLOC: Toggle sensitivity-aware allocation (False = off, True = on)

  • TARGET_LENGTH: Target context length for LongMemEval

  • METHOD: Baseline compression method (kvzip, keydiff, infinipot, snapkv)
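
To make the pair vs. pair-uniform distinction concrete, here is a minimal sketch of the two selection modes. The random scores and the exact top-k formulation are illustrative assumptions, not the repository's actual scoring:

# Illustrative contrast between head-wise (pair) and uniform (pair-uniform) selection.
# The scoring is hypothetical; the real importance scores come from the model.
import torch

def select_kv(scores, budget, uniform):
    """scores: (num_heads, seq_len) importance per KV pair. Returns a keep-mask."""
    num_heads, seq_len = scores.shape
    mask = torch.zeros_like(scores, dtype=torch.bool)
    if uniform:                                   # pair-uniform: same budget per head
        per_head = budget // num_heads
        idx = scores.topk(per_head, dim=1).indices
        mask.scatter_(1, idx, True)
    else:                                         # pair: one global top-k across all heads
        flat_idx = scores.flatten().topk(budget).indices
        mask.view(-1)[flat_idx] = True
    return mask

scores = torch.rand(4, 8192)                      # 4 heads, hypothetical scores
print(select_kv(scores, 16384, uniform=True).sum(1))   # equal per-head counts: 4096 each
print(select_kv(scores, 16384, uniform=False).sum(1))  # per-head counts vary, sum to 16384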

Additional Evaluation Options

For convenience, we provide two alternative entry points:

  • run_epicache.py: Full EpiCache pipeline: clustering → query matching per question → KV cache provision → decoding (sketched below).

  • run_epicache_eval.py: Evaluation-optimized: pre-embeds all benchmark questions and matches them to episodes; evaluation is then performed cluster-wise over the matched questions.
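
The clustering and query-matching stages can be pictured with the following sketch. The MiniLM embedder, k-means clustering, and cosine matching shown here stand in for the repository's actual components and are assumptions for illustration:

# Minimal sketch of episode clustering and query matching (see run_epicache.py for the real pipeline).
import numpy as np
from sentence_transformers import SentenceTransformer   # the "sentence" embedding option
from sklearn.cluster import KMeans

model = SentenceTransformer("all-MiniLM-L6-v2")

def cluster_episodes(conversation_chunks, n_episodes):
    """Group conversation chunks into episodes by embedding similarity."""
    emb = model.encode(conversation_chunks, normalize_embeddings=True)
    labels = KMeans(n_clusters=n_episodes, n_init=10).fit_predict(emb)
    return emb, labels

def match_query(question, emb, labels):
    """Pick the episode whose centroid is closest to the question embedding."""
    q = model.encode([question], normalize_embeddings=True)[0]
    centroids = np.stack([emb[labels == c].mean(0) for c in np.unique(labels)])
    return int(np.argmax(centroids @ q))   # cosine similarity via dot product

The matched episode's compressed KV cache is then provided to the model for decoding.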

Datasets

LongMemEval

Please download the compressed raw data from the LongMemEval repository and uncompress it under data/longmemeval/custom_history/. The released data consists of the following parts:

  • 1_attr_bg/data_1_attr_bg.json: user attributes and backgrounds
  • 2_questions/: questions, answers, evidence statements, and evidence sessions
  • 5_filler_sess/data_5_filler_sess.json: filler sessions from ShareGPT and UltraChat
  • 6_session_cache/data_6_session_cache.json: simulated user sessions based on extracted background facts

After downloading, you can generate custom long conversations using:

# Example: generate 50 user–LLM conversations with ~100K tokens
python data/longmemeval/create_custom_dataset.py \
  --num_conversations 50 \
  --min_tokens 95000 \
  --max_tokens 100000 \
  --output_file custom_lme_100000_50.json

python data/longmemeval/convert_longmemeval.py \
   --input_path custom_lme_100000_50.json \
   --output_path custom_lme_100000_50.pt

This script stacks evidence sessions until the target length is reached, producing coherent user–LLM conversations with corresponding QA pairs for evaluation.
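
A simplified view of the stacking logic (hypothetical helper signature; create_custom_dataset.py implements the real version):

# Illustrative sketch: stack sessions until the token target is reached.
def build_conversation(sessions, tokenizer, min_tokens, max_tokens):
    conversation, n_tokens = [], 0
    for session in sessions:          # evidence sessions first, fillers after (assumed ordering)
        session_len = len(tokenizer.encode("\n".join(session)))
        if n_tokens + session_len > max_tokens:
            break
        conversation.append(session)
        n_tokens += session_len
    assert n_tokens >= min_tokens, "target length not reached"
    return conversation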

LoCoMo & Realtalk

LoCoMo and Realtalk need to be downloaded and converted into preprocessed data for evaluation:

# LoCoMo
python data/locomo/convert_locomo.py

# Realtalk
python data/realtalk/convert_realtalk.py

These scripts will generate:

  • data/locomo_preprocessed.pt
  • data/realtalk_preprocessed.pt

Both files are ready to use for evaluation. If you wish to modify the instructions appended to each subtask, edit the corresponding convert_*.py script.
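
For a quick sanity check, the generated files can be inspected with torch.load (the exact structure is defined by the convert_*.py scripts):

# Inspect a preprocessed dataset file (structure defined by the convert scripts).
import torch
data = torch.load("data/locomo_preprocessed.pt")
print(type(data), len(data))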

BookSum (Layer-wise Sensitivity Profiling)

Please download the BookSum dataset by following the instructions in the DuoAttention repository, and place it under data/booksum/.

You can then preprocess the dataset and run layer-wise profiling as follows:

# BookSum preprocessing (default input length = 16K)
python data/booksum/preproc_booksum.py \
  --model_path ${MODEL} \
  --max_length ${MAX_LENGTH}

# Layer-wise sensitivity profiling (use preprocessed file)
python data/layer_scores/layer_profile.py \
  --model_path ${MODEL} \
  --input_file ${BOOKSUM_PRE}

This will generate a JSON file containing per-layer sensitivity scores. Set the generated file as SCORE_PATH when running scripts/run_epicache_* to enable sensitivity-aware allocation.
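
Conceptually, the sharpness parameter α (POWER) controls how strongly the profiled scores skew the per-layer budget. A plausible allocation rule is sketched below; the power-normalization form and the example scores are assumptions for illustration, not necessarily the paper's exact formula:

# Illustrative layer-wise allocation: budget_l proportional to score_l ** alpha.
def layer_budgets(scores, total_budget, alpha):
    weights = [s ** alpha for s in scores]
    z = sum(weights)
    return [round(total_budget * w / z) for w in weights]

scores = [0.9, 0.5, 0.2, 0.4]                 # hypothetical per-layer sensitivity scores
print(layer_budgets(scores, 16384, alpha=0))  # alpha = 0: uniform split
print(layer_budgets(scores, 16384, alpha=2))  # larger alpha: sharper, favors sensitive layers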

Citation

If you find our work useful, please cite:

@misc{epicache,
  title  = {EpiCache: Episodic KV Cache Management for Long Conversational Question Answering},
  author = {Minsoo Kim and Arnav Kundu and Han-Byul Kim and Richa Dixit and Minsik Cho},
  year   = {2025},
  url    = {https://arxiv.org/abs/2509.17396}
}

For questions or feedback, please feel free to reach out.
