- Hybridization across attention heads, e.g., `FastGen`

**Cold Compress** implements existing methods, for which we make sure to give full credit. Yet, to demystify these approaches, we use generic names to represent classes of existing methods (e.g., `Heavy Hitter` to cover {[`H2O`](https://arxiv.org/abs/2306.14048), [`Scissorhands`](https://arxiv.org/abs/2305.17118), [`PyramidKV`](https://arxiv.org/abs/2406.02069)}).

Run the prepare scripts, e.g. `bash scripts/prepare_qwen2.sh`.
This will download model and tokenizer files from HuggingFace for [`Meta-Llama-3-8B-Instruct`](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), [`meta-llama/Llama-2-7b-chat-hf`](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) and [`Qwen/Qwen2-7B-Instruct`](https://huggingface.co/Qwen/Qwen2-7B-Instruct) and save them into a usable format inside `./checkpoints`.

Please raise an [issue](https://github.com/AnswerDotAI/context-compression/issues) or join our [public Discord server](https://discord.gg/NvERJKrEdA) if you would like to see more models supported or collaborate on new releases.

Use `generate.py` to vibe test methods on individual prompts.
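
A rough, hedged sketch of what such a vibe test could look like follows; the flags and values (`--checkpoint_path`, `--prompt`, `--cache_strategy`, `--max_cache_length`) are assumptions inferred from this README rather than the verified CLI, so check `python generate.py --help` for the actual interface:

```bash
# Hypothetical invocation: the flag names, the cache strategy value, and the
# checkpoint sub-path under ./checkpoints are assumptions, not the verified CLI.
python generate.py \
  --checkpoint_path checkpoints/meta-llama/Meta-Llama-3-8B-Instruct/model.pth \
  --prompt "Summarize the plot of Hamlet in three sentences." \
  --cache_strategy heavy_hitter \
  --max_cache_length 0.5
```

The fractional `--max_cache_length` is likewise only illustrative, meant to suggest keeping roughly half of the KV cache.
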
For benchmarking, use `eval.py`, which supports evals on a *growing* list of long-context tasks: domain-specific ([Dolomites](https://dolomites-benchmark.github.io/index.html)), synthetic ([RULER](https://arxiv.org/abs/2404.06654)), QA ([MuSiQue](https://arxiv.org/abs/2108.00573v3), [TriviaQA](https://nlp.cs.washington.edu/triviaqa/), [QuALITY](https://arxiv.org/abs/2112.08608)), coding ([RepoBench](https://arxiv.org/abs/2306.03091)), summarization ([QMSum](https://arxiv.org/abs/2104.05938), [SQuALITY](https://arxiv.org/abs/2205.11465)), and long generation ([PG-19 book corpus](https://github.com/google-deepmind/pg19)).
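
A hedged sketch of a benchmarking run is below; the flag names (`--tasks`, `--cache_strategy`) and task identifiers (`musique`, `qmsum`) are assumptions rather than the confirmed `eval.py` interface:

```bash
# Hypothetical invocation: flag names, task identifiers, and the checkpoint path
# are assumptions about eval.py, not its verified interface.
python eval.py \
  --checkpoint_path checkpoints/meta-llama/Meta-Llama-3-8B-Instruct/model.pth \
  --tasks musique qmsum \
  --cache_strategy heavy_hitter
```
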
We elicit Claude Haiku responses using Answer.AI's Anthropic wrapper [Claudette](https://github.com/AnswerDotAI/claudette).

### Parallelizing Eval

As of now, GPT-Fast and, by extension, **Cold Compress**, only support single-batch, single-GPU inference. *(We are working on adding batched, multi-GPU inference.)*

To take advantage of multiple GPUs, we’ve written a script to parallelize eval jobs.
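
Treat the following as a hypothetical sketch of how such a launcher might be driven; the script name (`parallelize_evals.py`) and its flags are assumptions, not the repo's actual interface:

```bash
# Hypothetical launcher: the script name and all flags are assumptions.
# The idea is to fan one eval job per GPU out over a list of configurations.
python parallelize_evals.py \
  --num_gpus 8 \
  --tasks musique qmsum \
  --cache_strategy heavy_hitter
```
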
The value passed for `--max_cache_length` represents the average cache size across layers.
The default cache length pattern is `tile`, which tiles the cache size pattern you provide to match the model depth.
You can kick off a pre-defined set of experiments for length customization.
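
As a hedged illustration of how an average cache size and a tiled length pattern might be combined (only `--max_cache_length` is named above; the other flag and the specific values are assumptions):

```bash
# Hypothetical flags and values: passes a two-entry cache-size pattern that the
# default "tile" setting would repeat across layers; not the verified CLI.
python eval.py \
  --max_cache_length 0.25 0.75 \
  --cache_length_pattern tile
```
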
Here, we’ve decided to filter out tokens [in the middle](https://arxiv.org/abs/2307.03172).

# Getting Involved
We'd **love** for you to get involved and collectively aim to improve `Cold Compress` for future releases.

1. Participate in discussions of KV Cache Compression on our [public Discord server](https://discord.gg/NvERJKrEdA).
2. Raise an [issue](https://github.com/AnswerDotAI/context-compression/issues) for something you'd like to see fixed, improved, or built.
3. Or, even better, issue a [Pull Request](https://github.com/AnswerDotAI/context-compression/pulls) with code you'd like added.
4. *[Coming Soon]* Submit a new KV Cache compression method via a [PR](https://github.com/AnswerDotAI/context-compression/pulls), which we will run and add to our Leaderboard.

## Getting Involved with Modeling
We are actively exploring the following enhancements to our evaluation benchmark:

2. Synthetic tasks with abrupt topic shifts to pressure test token dropping.
3. A better understanding of the impact of attention loss on downstream loss.
## Getting Involved with Optimizations
We are interested in adding support for:
1. Batched, multi-GPU inference.
2. The recently released [`FlexAttention`](https://github.com/pytorch/pytorch/blob/69e2590490e7bf84aaea2dc5b1b56411dc43e406/torch/_higher_order_ops/flex_attention.py) from PyTorch.

# Citations
**Cold Compress** implements methods introduced in existing work. If you use it in *your* work, please make sure to cite the following: