
Commit 87f8769

Author: Griffin Adams
Commit message: Update README.md to include link to discord server.
Parent: bead5d3

File tree: 1 file changed


README.md

Lines changed: 19 additions & 12 deletions
@@ -13,7 +13,7 @@ Our initial release (**Cold Compress 1.0**) supports a wide set of popular appro
 - Sliding window attention, e.g., `Recent Tokens`
 - Preservation of attention sinks, e.g., `Global Tokens`
 - Layerwise customization, e.g., `Pyramid` shaped compression, alternating `Local-Global` attention
-- Hybridization across attention heads (`FastGen`)
+- Hybridization across attention heads, e.g., `FastGen`

 **Cold Compress** implements existing methods, for which we make sure to give all the credit. Yet, to demystify these approaches, we use generic names to represent classes of existing methods (e.g., `Heavy Hitter` to cover {[`H2O`](https://arxiv.org/abs/2306.14048), [`Scissorhands`](https://arxiv.org/abs/2305.17118), [`PyramidKV`](https://arxiv.org/abs/2406.02069)}).
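
To make the first two classes above concrete, here is a minimal illustrative Python sketch of which KV-cache positions a combined `Recent Tokens` + `Global Tokens` policy would retain. The function below is hypothetical, written for this note only, and is not code from Cold Compress:

```
# Hypothetical sketch (not Cold Compress's implementation): the positions that
# survive a "recent + global" eviction policy, combining a sliding window over
# recent tokens with preserved attention sinks at the start of the sequence.
def recent_global_keep_indices(seq_len: int, window: int, global_tokens: int) -> list[int]:
    sinks = set(range(min(global_tokens, seq_len)))          # "Global Tokens" (attention sinks)
    recent = set(range(max(0, seq_len - window), seq_len))   # "Recent Tokens" (sliding window)
    return sorted(sinks | recent)

# With 12 tokens so far, a 4-token window, and 2 global tokens,
# only these positions keep their keys/values:
print(recent_global_keep_indices(12, window=4, global_tokens=2))
# -> [0, 1, 8, 9, 10, 11]
```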

@@ -37,7 +37,7 @@ bash scripts/prepare_qwen2.sh

 This will download model and tokenizer files from HuggingFace for [`Meta-Llama-3-8B-Instruct`](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), [`meta-llama/Llama-2-7b-chat-hf`](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) and [`Qwen/Qwen2-7B-Instruct`](https://huggingface.co/Qwen/Qwen2-7B-Instruct) and save them into a usable format inside `./checkpoints`.

-Please raise an [issue](https://github.com/AnswerDotAI/context-compression/issues) if you would like to see more models supported.
+Please raise an [issue](https://github.com/AnswerDotAI/context-compression/issues) or join our [public Discord server](https://discord.gg/NvERJKrEdA) if you would like to see more models supported or collaborate on new releases.

 ## Quick Start

@@ -88,10 +88,12 @@ python generate.py --compile --prompt reverse_list.txt --checkpoint_path ./check

 It can be a pain to pass a long list of hyper-parameters via the command line.

-To avoid this, you can create a `Cache Config` *.yaml* file instead.
+To avoid this, you can create a Cache Config *.yaml* file instead under `./cache_configs`.

 We've pre-populated it with some configurations, including the `recent_global` strategy discussed above.

+**[`./cache_configs/recent_global.yaml`](https://github.com/AnswerDotAI/cold-compress/blob/main/cache_configs/recent_global.yaml)**
+
 ```
 cache_strategy: ["recent_global"]
 prompt_compression_strategy: ["recent_global"]
@@ -102,7 +104,7 @@ global_tokens: 4

 Use `generate.py` to vibe test methods on individual prompts.

-For benchmarking, use `eval.py` which supports evals on a growing list of long-context tasks: domain-specific ([Dolomites](https://dolomites-benchmark.github.io/index.html)), synthetic ([RULER](https://arxiv.org/abs/2404.06654)), QA ([MuSiQue](https://arxiv.org/abs/2108.00573v3), [TriviaQA](https://nlp.cs.washington.edu/triviaqa/), [QuALITY](https://arxiv.org/abs/2112.08608)), coding ([RepoBench](https://arxiv.org/abs/2306.03091)), summarization ([QMSum](https://arxiv.org/abs/2104.05938), [SQuALITY](https://arxiv.org/abs/2205.11465)), and long generation (PG-19 books).
+For benchmarking, use `eval.py` which supports evals on a *growing* list of long-context tasks: domain-specific ([Dolomites](https://dolomites-benchmark.github.io/index.html)), synthetic ([RULER](https://arxiv.org/abs/2404.06654)), QA ([MuSiQue](https://arxiv.org/abs/2108.00573v3), [TriviaQA](https://nlp.cs.washington.edu/triviaqa/), [QuALITY](https://arxiv.org/abs/2112.08608)), coding ([RepoBench](https://arxiv.org/abs/2306.03091)), summarization ([QMSum](https://arxiv.org/abs/2104.05938), [SQuALITY](https://arxiv.org/abs/2205.11465)), and long generation ([PG-19 book corpus](https://github.com/google-deepmind/pg19)).

 ```
 python eval.py --cache_config hybrid --tasks dolomites rulerniah
@@ -138,7 +140,7 @@ We elicit Claude Haiku responses using Answer.AI's Anthropic wrapper [Claudette]

 ### Parallelizing Eval

-As of now, GPT-Fast and, by extension, **Cold Compress**, only supports single batch, single GPU inference. (We are working on adding batched multi-GPU inference.)
+As of now, GPT-Fast and, by extension, **Cold Compress**, only supports single batch, single GPU inference. *(We are working on adding batched multi-GPU inference.)*

 To take advantage of multiple GPUs, we’ve written a script to parallelize eval jobs.

@@ -227,12 +229,12 @@ The value passed for `--max_cache_length` represents the average cache size for
 The default for cache length pattern is tile, which tiles the cache size pattern provided to match the model depth.

 ```
---max_cache_length 0.1 0.5 --cache_length_pattern tile
+python eval.py --cache_strategy recent_global --max_cache_length 0.1 0.5 --cache_length_pattern tile
 ```
 Assigns a cache size of `0.1` to the first `L // 2` layers and `0.5` to the second half. To alternate, set:

 ```
---max_cache_length 0.1 0.5 --cache_length_pattern repeat
+python eval.py --cache_strategy recent_global --max_cache_length 0.1 0.5 --cache_length_pattern repeat
 ```
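
For intuition, here is a small illustrative sketch of how a two-value cache-size pattern could be expanded across `L` layers under `tile` versus `repeat`. The helper below is hypothetical, based only on the description above, and is not the repository's implementation:

```
# Illustrative sketch only; based on the README's description of
# --cache_length_pattern, not on Cold Compress's actual code.
def expand_cache_sizes(sizes: list[float], num_layers: int, pattern: str = "tile") -> list[float]:
    if pattern == "tile":
        # Each value covers a contiguous block of layers:
        # [0.1, 0.5] over 8 layers -> first half 0.1, second half 0.5.
        block = num_layers // len(sizes)
        return [s for s in sizes for _ in range(block)]
    if pattern == "repeat":
        # Values alternate layer by layer: 0.1, 0.5, 0.1, 0.5, ...
        return [sizes[i % len(sizes)] for i in range(num_layers)]
    raise ValueError(f"Unknown cache_length_pattern: {pattern}")

print(expand_cache_sizes([0.1, 0.5], 8, "tile"))    # [0.1, 0.1, 0.1, 0.1, 0.5, 0.5, 0.5, 0.5]
print(expand_cache_sizes([0.1, 0.5], 8, "repeat"))  # [0.1, 0.5, 0.1, 0.5, 0.1, 0.5, 0.1, 0.5]
```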

 You can kick off a pre-defined set of experiments for length customization by running
@@ -369,12 +371,12 @@ Here, we’ve decided to filter out tokens [in the middle](https://arxiv.org/abs

 # Getting Involved

-We'd love for you to get involved and collectively aim to improve `Cold Compress` for future releases.
+We'd **love** for you to get involved and collectively aim to improve `Cold Compress` for future releases.

-1. Participate in discussions of KV Cache Compression on our [discord channel](www.todo.com).
+1. Participate in discussions of KV Cache Compression on our [public Discord server](https://discord.gg/NvERJKrEdA).
 2. Raise an [issue](https://github.com/AnswerDotAI/context-compression/issues) for something you'd like to see fixed, improved, or built.
-3. Even better, issue a [Pull Request](https://github.com/AnswerDotAI/context-compression/pulls) for a change.
-4. Submit a new KV Cache compression method via a [PR](https://github.com/AnswerDotAI/context-compression/pulls) which we will run and add to our [Leaderboard](www.todo.com).
+3. Or, issue a [Pull Request](https://github.com/AnswerDotAI/context-compression/pulls) with code you'd like added.
+4. *[Coming Soon]* Submit a new KV Cache compression method via a [PR](https://github.com/AnswerDotAI/context-compression/pulls) which we will run and add to our Leaderboard.

 ## Getting Involved with Modeling

@@ -392,9 +394,14 @@ We are actively exploring the following enhancements to our evaluation benchmark
 2. Synthetic tasks with abrupt topic shifts to pressure test token dropping.
 3. A better understanding of the impact of attention loss on downstream loss.

+## Getting Involved with Optimizations
+We are interested in adding support for:
+1. Batched, multi-GPU inference.
+2. Recently released [`FlexAttention`](https://github.com/pytorch/pytorch/blob/69e2590490e7bf84aaea2dc5b1b56411dc43e406/torch/_higher_order_ops/flex_attention.py) from PyTorch.
+
 # Citations

-**Cold Compress** implements methods from other papers. If you use it in your work, please consider citing the following.
+**Cold Compress** implements methods introduced in existing work. If you use it in *your* work, please make sure to cite the following:

 ## Recent Global
