FasterDecoding
diff --git a/.gitignore b/.gitignore
Lines changed: 2 additions & 1 deletion
diff --git a/README.md b/README.md
Lines changed: 40 additions & 20 deletions
diff --git a/figures/LWM-Text-Chat-1M_SnapKV.jpg b/assets/LWM-Text-Chat-1M_SnapKV.jpg
figures/LWM-Text-Chat-1M_SnapKV.jpg renamed to assets/LWM-Text-Chat-1M_SnapKV.jpg
diff --git a/figures/longbench.jpg b/assets/longbench.jpg
figures/longbench.jpg renamed to assets/longbench.jpg
diff --git a/notebook/example.ipynb b/notebook/example.ipynb
Lines changed: 168 additions & 0 deletions
@@ -157,4 +157,5 @@ cython_debug/
 #  be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
 #  and can be added to the global gitignore or merged into this file.  For a more nuclear
 #  option (not recommended) you can uncomment the following to ignore the entire idea folder.
-#.idea/
+#.idea/
+notebook/test*
@@ -1,31 +1,51 @@
 # SnapKV :camera:
-We introduce an innovative and out-of-box KV cache compression method, SnapKV.
+We introduce an innovative, out-of-the-box KV cache compression method, [SnapKV](https://arxiv.org/abs/2404.14469).
 ## Requirements
-`transformers>=4.36`
+Currently tested with `transformers==4.37.0`; compatibility with later versions has not yet been verified.
+```
+transformers>=4.36
+flash-attn==2.4.0
+```
+## Installation
+```
+git clone git@github.com:FasterDecoding/SnapKV.git
+cd SnapKV
+pip install -e .
+```
 ## Quick Start
 ### Use SnapKV-optimized Models
-SnapKV-optimized models are all under models file, which could be directly imported and used the same like baseline models.
 For example: 
 ```python
-from models.modeling_mistral import MistralForCausalLM as SnapKVMistralForCausalLM
-model = SnapKVMistralForCausalLM.from_pretrained(
-    model_name,
-    torch_dtype=torch.float16,
-    low_cpu_mem_usage=True,
-    device_map="auto",
-    use_flash_attention_2=True
-)
-tokenizer = transformers.AutoTokenizer.from_pretrained(
-    model_name,
-    padding_side="right",
-    use_fast=False,
-)
+from snapkv.monkeypatch.monkeypatch import replace_mistral
+replace_mistral()  # Apply the monkey patch to enable SnapKV
 ```
 
+Check [the example notebook](./notebook/example.ipynb).
+
 ### Customize Your SnapKV-optimized Models
-SnapKV can be easily integrate with other models. You can follow the comment marked with `[SnapKV]` in [existing models](./models) to constrcut your own models. The detailed algorithm of SnapKV is in [snapkv_utils.py](./snapkv_utils.py)
+SnapKV can be easily integrated with other models.
 
+You can follow the comments marked with `[SnapKV]` in the [existing models](./snapkv/monkeypatch/monkeypatch.py) to construct your own models. Currently supported: the [Llama family](./snapkv/monkeypatch/llama_hijack_4_37.py), [Mistral](./snapkv/monkeypatch/mistral_hijack_4_37.py), and [Mixtral](./snapkv/monkeypatch/mixtral_hijack_4_37.py).
 
-## Results
-![Comprehensive Experiment Results on LongBench](./figures/longbench.jpg)
-![Pressure Test Result on Needle-in-a-Haystack](./figures/LWM-Text-Chat-1M_SnapKV.jpg)
+The SnapKV algorithm itself is implemented in [`snapkv_utils.py`](./snapkv/monkeypatch/snapkv_utils.py).
+
+
+## Partial Results
+![Comprehensive Experiment Results on LongBench](./assets/longbench.jpg)
+![Pressure Test Result on Needle-in-a-Haystack](./assets/LWM-Text-Chat-1M_SnapKV.jpg)
+
+## TODO
+- [ ] Add observation experiments for reproduction.
+- [ ] Add LongBench for reproduction.
+- [ ] Explore prompt-phase compression.
+
+## Citation
+If you find this project helpful, please consider citing our report :blush:
+```
+@article{li2024snapkv,
+  title={SnapKV: LLM Knows What You are Looking for Before Generation},
+  author={Li, Yuhong and Huang, Yingbing and Yang, Bowen and Venkitesh, Bharat and Locatelli, Acyr and Ye, Hanchen and Cai, Tianle and Lewis, Patrick and Chen, Deming},
+  journal={arXiv preprint arXiv:2404.14469},
+  year={2024}
+}
+```
@@ -0,0 +1,168 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "# Restrict this process to GPU 0 via CUDA_VISIBLE_DEVICES\n",
+    "os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"0\"\n",
+    "import torch\n",
+    "from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig\n",
+    "import transformers"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from snapkv.monkeypatch.monkeypatch import replace_llama, replace_mistral, replace_mixtral"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "replace_mistral()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from fastchat.model import load_model, get_conversation_template"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "model = AutoModelForCausalLM.from_pretrained(\n",
+    "    \"mistralai/Mistral-7B-Instruct-v0.2\",\n",
+    "    torch_dtype=torch.bfloat16,\n",
+    "    low_cpu_mem_usage=True,\n",
+    "    device_map=\"auto\",\n",
+    "    use_flash_attention_2=True\n",
+    "    )"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "tokenizer = AutoTokenizer.from_pretrained(\"mistralai/Mistral-7B-Instruct-v0.2\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "with open('snapkv.txt', 'r') as f:\n",
+    "    content = f.read().strip()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "question = \"\\n What is the repository of SnapKV?\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "conv = get_conversation_template(\"longchat\")\n",
+    "conv.messages = []\n",
+    "conv.append_message(conv.roles[0],content + question)\n",
+    "# conv.append_message(conv.roles[0],\"Who is Kobe Bryant?\")\n",
+    "conv.append_message(conv.roles[1], None)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "prompt = conv.get_prompt()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "input_ids = tokenizer.encode(prompt, return_tensors='pt')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "input_ids_len = input_ids.size(1)\n",
+    "print(input_ids_len)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "outputs = model.generate(input_ids.cuda(), max_new_tokens=200, do_sample=False)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(tokenizer.decode(outputs[0][input_ids_len:], skip_special_tokens=True))"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "code_attn",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.0"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
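
The `replace_mistral()` call used in the README and the notebook relies on monkey patching: rebinding a method on an existing class at runtime so that callers pick up new behaviour without changing their own code. The following is a minimal sketch of that general technique only. The `Attention` class, `compressed_forward`, `replace_attention`, and the `keep_last` parameter are all hypothetical names for illustration; none of this is the actual SnapKV or `transformers` code, and the truncation rule is a placeholder, not the SnapKV selection rule.

```python
# Hypothetical stand-in for a library attention class (not the transformers API).
class Attention:
    def forward(self, tokens):
        # Baseline behaviour: keep every token.
        return list(tokens)


def compressed_forward(self, tokens, keep_last=2):
    # Illustrative replacement: keep only the most recent `keep_last` tokens.
    # Placeholder logic only; this is NOT how SnapKV chooses what to keep.
    return list(tokens)[-keep_last:]


def replace_attention():
    # Rebind the method on the class itself, so instances created before
    # or after this call all use the replacement.
    Attention.forward = compressed_forward


layer = Attention()
before = layer.forward([1, 2, 3, 4])   # uses the original method
replace_attention()
after = layer.forward([1, 2, 3, 4])    # same instance, patched method
print(before, after)
```

Because the method is rebound on the class, the patch reaches instances that already exist as well as new ones. The notebook nevertheless calls `replace_mistral()` before loading the model, which keeps the order of operations easy to follow.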