EpiCache: Episodic KV Cache Management for Long Conversational Question Answering

This repository contains the code to reproduce the results of the paper EpiCache.
Paper: https://arxiv.org/abs/2509.17396

This code is developed based on the KVzip repository (MIT License, https://github.com/snu-mllab/KVzip) and AdaKV (https://github.com/FFY0/AdaKV).

Installation

The code has been verified to work with CUDA 12.1, Python 3.10, and PyTorch 2.3.

pip install -r requirements.txt
pip install flash-attn==2.7.4.post1 --no-build-isolation
make i
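
To confirm your environment matches the verified versions above, here is a quick sanity check (a minimal snippet, not part of the repository):

# Environment sanity check (illustrative; not part of the repository)
import sys
import torch
print("Python:", sys.version.split()[0])    # expect 3.10.x
print("PyTorch:", torch.__version__)        # expect 2.3.x
print("CUDA:", torch.version.cuda)          # expect 12.1
print("GPU available:", torch.cuda.is_available())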

Evaluation

# For EpiCache
bash scripts/run_epicache_qwen.sh ${GPU_NUM} ${SCALE} ${LEVEL} ${KV_BUDGET} ${CHUNK_SIZE} ${DO_ALLOC} ${DATA} ${TARGET_LENGTH}
bash scripts/run_llama_epicache.sh ${GPU_NUM} ${SCALE} ${LEVEL} ${KV_BUDGET} ${CHUNK_SIZE} ${DO_ALLOC} ${DATA} ${TARGET_LENGTH}

# For Baseline (KVzip, KeyDiff, SnapKV, InfiniPot)
bash scripts/run_qwen_baseline.sh ${GPU_NUM} ${SCALE} ${LEVEL} ${KV_BUDGET} ${CHUNK_SIZE} ${METHOD} ${DATA} ${TARGET_LENGTH}
bash scripts/run_llama_baseline.sh ${GPU_NUM} ${SCALE} ${LEVEL} ${KV_BUDGET} ${CHUNK_SIZE} ${METHOD} ${DATA} ${TARGET_LENGTH}

# GPT Score for Realtalk
python run_gpt_eval.py --directory ${RESULT_DIR}

Arguments

  • GPU_NUM: GPU index

  • SCALE: Model scale (3B, 7B, 8B)

  • LEVEL: KV cache selection strategy

    • pair: non-uniform selection (head-wise)
    • pair-uniform: uniform per head
    • Example (KV budget = 16K across 4 heads): pair → [2K, 6K, 3K, 5K]; pair-uniform → [4K, 4K, 4K, 4K] (see the sketch after this list)
  • KV_BUDGET: KV cache budget ($M$)

  • CHUNK_SIZE: Block size used in block prefill eviction

  • EMBEDDING: Embedding model for clustering & query matching

    • qwen: Qwen-0.6B (higher accuracy)
    • sentence: MiniLM-L6-v2 (faster, recommended)
    • llm: Target LLM’s internal embedding layer (lower accuracy)
  • POWER: Sharpness parameter ($\alpha$) for layer-wise allocation

  • DATA: Dataset (locomo, realtalk, longmemeval)

  • DO_ALLOC: Toggle sensitivity-aware allocation (False = off, True = on)

  • TARGET_LENGTH: Target context length for LongMemEval

  • METHOD: Baseline compression method (kvzip, keydiff, infinipot, snapkv)
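
To make the pair vs. pair-uniform distinction concrete, here is a minimal sketch of the two selection modes. The random scores and the exact top-k formulation are illustrative assumptions, not the repository's actual scoring:

# Illustrative contrast between head-wise (pair) and uniform (pair-uniform) selection.
# The scoring is hypothetical; the real importance scores come from the model.
import torch

def select_kv(scores, budget, uniform):
    """scores: (num_heads, seq_len) importance per KV pair. Returns a keep-mask."""
    num_heads, seq_len = scores.shape
    mask = torch.zeros_like(scores, dtype=torch.bool)
    if uniform:                                   # pair-uniform: same budget per head
        per_head = budget // num_heads
        idx = scores.topk(per_head, dim=1).indices
        mask.scatter_(1, idx, True)
    else:                                         # pair: one global top-k across all heads
        flat_idx = scores.flatten().topk(budget).indices
        mask.view(-1)[flat_idx] = True
    return mask

scores = torch.rand(4, 8192)                      # 4 heads, hypothetical scores
print(select_kv(scores, 16384, uniform=True).sum(1))   # equal per-head counts: 4096 each
print(select_kv(scores, 16384, uniform=False).sum(1))  # per-head counts vary, sum to 16384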

Additional Evaluation Options

For convenience, we provide two alternative entry points:

  • run_epicache.py: Full EpiCache pipeline: clustering → query matching per question → KV cache provision → decoding (sketched below).

  • run_epicache_eval.py: Evaluation-optimized: pre-embeds all benchmark questions and matches them to episodes; evaluation is then performed cluster-wise over the matched questions.
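
The clustering and query-matching stages can be pictured with the following sketch. The MiniLM embedder, k-means clustering, and cosine matching shown here stand in for the repository's actual components and are assumptions for illustration:

# Minimal sketch of episode clustering and query matching (see run_epicache.py for the real pipeline).
import numpy as np
from sentence_transformers import SentenceTransformer   # the "sentence" embedding option
from sklearn.cluster import KMeans

model = SentenceTransformer("all-MiniLM-L6-v2")

def cluster_episodes(conversation_chunks, n_episodes):
    """Group conversation chunks into episodes by embedding similarity."""
    emb = model.encode(conversation_chunks, normalize_embeddings=True)
    labels = KMeans(n_clusters=n_episodes, n_init=10).fit_predict(emb)
    return emb, labels

def match_query(question, emb, labels):
    """Pick the episode whose centroid is closest to the question embedding."""
    q = model.encode([question], normalize_embeddings=True)[0]
    centroids = np.stack([emb[labels == c].mean(0) for c in np.unique(labels)])
    return int(np.argmax(centroids @ q))   # cosine similarity via dot product

The matched episode's compressed KV cache is then provided to the model for decoding.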

Datasets

LongMemEval

Please download the compressed raw data from the LongMemEval repository and uncompress it under data/longmemeval/custom_history/. The released data consists of the following parts:

  • 1_attr_bg/data_1_attr_bg.json: user attributes and backgrounds
  • 2_questions/: questions, answers, evidence statements, and evidence sessions
  • 5_filler_sess/data_5_filler_sess.json: filler sessions from ShareGPT and UltraChat
  • 6_session_cache/data_6_session_cache.json: simulated user sessions based on extracted background facts

After downloading, you can generate custom long conversations using:

# Example: generate 50 user–LLM conversations with ~100K tokens
python data/longmemeval/create_custom_dataset.py \
  --num_conversations 50 \
  --min_tokens 95000 \
  --max_tokens 100000 \
  --output_file custom_lme_100000_50.json

python data/longmemeval/convert_longmemeval.py \
   --input_path custom_lme_100000_50.json \
   --output_path custom_lme_100000_50.pt

This script stacks evidence sessions until the target length is reached, producing coherent user–LLM conversations with corresponding QA pairs for evaluation.
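
A simplified view of the stacking logic (hypothetical helper signature; create_custom_dataset.py implements the real version):

# Illustrative sketch: stack sessions until the token target is reached.
def build_conversation(sessions, tokenizer, min_tokens, max_tokens):
    conversation, n_tokens = [], 0
    for session in sessions:          # evidence sessions first, fillers after (assumed ordering)
        session_len = len(tokenizer.encode("\n".join(session)))
        if n_tokens + session_len > max_tokens:
            break
        conversation.append(session)
        n_tokens += session_len
    assert n_tokens >= min_tokens, "target length not reached"
    return conversation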

LoCoMo & Realtalk

LoCoMo and Realtalk need to be downloaded and converted into preprocessed data for evaluation:

# LoCoMo
python data/locomo/convert_locomo.py

# Realtalk
python data/realtalk/convert_realtalk.py

These scripts will generate:

  • data/locomo_preprocessed.pt
  • data/realtalk_preprocessed.pt

Both files are ready to use for evaluation. If you wish to modify the instructions appended to each subtask, edit the corresponding convert_*.py script.
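
For a quick sanity check, the generated files can be inspected with torch.load (the exact structure is defined by the convert_*.py scripts):

# Inspect a preprocessed dataset file (structure defined by the convert scripts).
import torch
data = torch.load("data/locomo_preprocessed.pt")
print(type(data), len(data))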

BookSum (Layer-wise Sensitivity Profiling)

Please download the BookSum dataset by following the instructions in the DuoAttention repository, and place it under data/booksum/.

You can then preprocess the dataset and run layer-wise profiling as follows:

# BookSum preprocessing (default input length = 16K)
python data/booksum/preproc_booksum.py \
  --model_path ${MODEL} \
  --max_length ${MAX_LENGTH}

# Layer-wise sensitivity profiling (use preprocessed file)
python data/layer_scores/layer_profile.py \
  --model_path ${MODEL} \
  --input_file ${BOOKSUM_PRE}

This will generate a JSON file containing per-layer sensitivity scores. Set the generated file as SCORE_PATH when running scripts/run_epicache_* to enable sensitivity-aware allocation.
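
Conceptually, the sharpness parameter α (POWER) controls how strongly the profiled scores skew the per-layer budget. A plausible allocation rule is sketched below; the power-normalization form and the example scores are assumptions for illustration, not necessarily the paper's exact formula:

# Illustrative layer-wise allocation: budget_l proportional to score_l ** alpha.
def layer_budgets(scores, total_budget, alpha):
    weights = [s ** alpha for s in scores]
    z = sum(weights)
    return [round(total_budget * w / z) for w in weights]

scores = [0.9, 0.5, 0.2, 0.4]                 # hypothetical per-layer sensitivity scores
print(layer_budgets(scores, 16384, alpha=0))  # alpha = 0: uniform split
print(layer_budgets(scores, 16384, alpha=2))  # larger alpha: sharper, favors sensitive layers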

Citation

If you find our work useful, please cite:

@misc{epicache,
  title  = {EpiCache: Episodic KV Cache Management for Long Conversational Question Answering},
  author = {Minsoo Kim and Arnav Kundu and Han-Byul Kim and Richa Dixit and Minsik Cho},
  year   = {2025},
  url    = {https://arxiv.org/abs/2509.17396}
}

For questions or feedback, please feel free to reach out.
