- Hybridization across attention heads, e.g., `FastGen`

**Cold Compress** implements existing methods, for which we make sure to give full credit. Yet, to demystify these approaches, we use generic names to represent classes of existing methods (e.g., `Heavy Hitter` to cover {[`H2O`](https://arxiv.org/abs/2306.14048), [`Scissorhands`](https://arxiv.org/abs/2305.17118), [`PyramidKV`](https://arxiv.org/abs/2406.02069)}).

Run the prepare scripts, e.g. `bash scripts/prepare_qwen2.sh`.
This will download model and tokenizer files from HuggingFace for [`Meta-Llama-3-8B-Instruct`](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), [`meta-llama/Llama-2-7b-chat-hf`](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) and [`Qwen/Qwen2-7B-Instruct`](https://huggingface.co/Qwen/Qwen2-7B-Instruct) and save them into a usable format inside `./checkpoints`.

Please raise an [issue](https://github.com/AnswerDotAI/context-compression/issues) or join our [public Discord server](https://discord.gg/NvERJKrEdA) if you would like to see more models supported or collaborate on new releases.

Use `generate.py` to vibe test methods on individual prompts.
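
A rough, hedged sketch of what such a vibe test could look like follows; the flags and values (`--checkpoint_path`, `--prompt`, `--cache_strategy`, `--max_cache_length`) are assumptions inferred from this README rather than the verified CLI, so check `python generate.py --help` for the actual interface:

```bash
# Hypothetical invocation: the flag names, the cache strategy value, and the
# checkpoint sub-path under ./checkpoints are assumptions, not the verified CLI.
python generate.py \
  --checkpoint_path checkpoints/meta-llama/Meta-Llama-3-8B-Instruct/model.pth \
  --prompt "Summarize the plot of Hamlet in three sentences." \
  --cache_strategy heavy_hitter \
  --max_cache_length 0.5
```

The fractional `--max_cache_length` is likewise only illustrative, meant to suggest keeping roughly half of the KV cache.
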
For benchmarking, use `eval.py`, which supports evals on a *growing* list of long-context tasks: domain-specific ([Dolomites](https://dolomites-benchmark.github.io/index.html)), synthetic ([RULER](https://arxiv.org/abs/2404.06654)), QA ([MuSiQue](https://arxiv.org/abs/2108.00573v3), [TriviaQA](https://nlp.cs.washington.edu/triviaqa/), [QuALITY](https://arxiv.org/abs/2112.08608)), coding ([RepoBench](https://arxiv.org/abs/2306.03091)), summarization ([QMSum](https://arxiv.org/abs/2104.05938), [SQuALITY](https://arxiv.org/abs/2205.11465)), and long generation ([PG-19 book corpus](https://github.com/google-deepmind/pg19)).
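
A hedged sketch of a benchmarking run is below; the flag names (`--tasks`, `--cache_strategy`) and task identifiers (`musique`, `qmsum`) are assumptions rather than the confirmed `eval.py` interface:

```bash
# Hypothetical invocation: flag names, task identifiers, and the checkpoint path
# are assumptions about eval.py, not its verified interface.
python eval.py \
  --checkpoint_path checkpoints/meta-llama/Meta-Llama-3-8B-Instruct/model.pth \
  --tasks musique qmsum \
  --cache_strategy heavy_hitter
```
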
We elicit Claude Haiku responses using Answer.AI's Anthropic wrapper [Claudette](https://github.com/AnswerDotAI/claudette).

### Parallelizing Eval

As of now, GPT-Fast and, by extension, **Cold Compress**, only support single-batch, single-GPU inference. *(We are working on adding batched, multi-GPU inference.)*

To take advantage of multiple GPUs, we’ve written a script to parallelize eval jobs.
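
Treat the following as a hypothetical sketch of how such a launcher might be driven; the script name (`parallelize_evals.py`) and its flags are assumptions, not the repo's actual interface:

```bash
# Hypothetical launcher: the script name and all flags are assumptions.
# The idea is to fan one eval job per GPU out over a list of configurations.
python parallelize_evals.py \
  --num_gpus 8 \
  --tasks musique qmsum \
  --cache_strategy heavy_hitter
```
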
The value passed for `--max_cache_length` represents the average cache size across layers.
The default cache length pattern is `tile`, which tiles the cache size pattern you provide to match the model depth.
You can kick off a pre-defined set of experiments for length customization.
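
As a hedged illustration of how an average cache size and a tiled length pattern might be combined (only `--max_cache_length` is named above; the other flag and the specific values are assumptions):

```bash
# Hypothetical flags and values: passes a two-entry cache-size pattern that the
# default "tile" setting would repeat across layers; not the verified CLI.
python eval.py \
  --max_cache_length 0.25 0.75 \
  --cache_length_pattern tile
```
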
Here, we’ve decided to filter out tokens [in the middle](https://arxiv.org/abs/2307.03172).

# Getting Involved
We'd **love** for you to get involved and collectively aim to improve `Cold Compress` for future releases.

1. Participate in discussions of KV Cache Compression on our [public Discord server](https://discord.gg/NvERJKrEdA).
2. Raise an [issue](https://github.com/AnswerDotAI/context-compression/issues) for something you'd like to see fixed, improved, or built.
3. Or, even better, issue a [Pull Request](https://github.com/AnswerDotAI/context-compression/pulls) with code you'd like added.
4. *[Coming Soon]* Submit a new KV Cache compression method via a [PR](https://github.com/AnswerDotAI/context-compression/pulls), which we will run and add to our Leaderboard.

## Getting Involved with Modeling
We are actively exploring the following enhancements to our evaluation benchmark:

2. Synthetic tasks with abrupt topic shifts to pressure test token dropping.
3. A better understanding of the impact of attention loss on downstream loss.
## Getting Involved with Optimizations
We are interested in adding support for:
1. Batched, multi-GPU inference.
2. The recently released [`FlexAttention`](https://github.com/pytorch/pytorch/blob/69e2590490e7bf84aaea2dc5b1b56411dc43e406/torch/_higher_order_ops/flex_attention.py) from PyTorch.

# Citations
**Cold Compress** implements methods introduced in existing work. If you use it in *your* work, please make sure to cite the following: