README.md: 25 additions & 9 deletions
@@ -16,22 +16,17 @@ Deploying long-context LLMs is costly due to the linear growth of the key-value
 pip install kvpress
 ```
 
-If possible, install flash attention:
-```bash
-pip install flash-attn --no-build-isolation
-```
-
-For a local installation with all dev dependencies, use poetry:
+For a local installation with all dev dependencies, use uv:
 
 ```bash
 git clone https://github.com/NVIDIA/kvpress.git
 cd kvpress
-poetry install --with dev
+uv sync --all-groups
 ```
 
 ## Usage
 
-kvpress provides a set of "presses" that compress the KV cache during the prefilling phase. Each press is associated with a `compression_ratio` attribute that measures the compression of the cache. The easiest way to use a press is through our custom `KVPressTextGenerationPipeline`. It is automatically registered as a transformers pipeline with the name "kv-press-text-generation" when kvpress is imported and handles chat templates and tokenization for you:
+KVPress provides a set of "presses" that compress the KV cache during the prefilling phase. Each press is associated with a `compression_ratio` attribute that measures the compression of the cache. The easiest way to use a press is through our custom `KVPressTextGenerationPipeline`. It is automatically registered as a transformers pipeline with the name "kv-press-text-generation" when kvpress is imported and handles chat templates and tokenization for you:
 
 ```python
 from transformers import pipeline
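
The hunk above only shows the first line of the usage snippet. As a rough sketch of the pipeline described in the paragraph above, the example below shows how the "kv-press-text-generation" pipeline and a press's `compression_ratio` could be used; the press class (`ExpectedAttentionPress`), model name, device, and the exact return key are illustrative assumptions rather than details taken from this diff.

```python
# Sketch only: illustrates the "kv-press-text-generation" pipeline described above.
# The press class, model name, device, and return key are assumed, not confirmed by this diff.
from transformers import pipeline

from kvpress import ExpectedAttentionPress  # assumed press class

device = "cuda:0"
model = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # placeholder model
pipe = pipeline("kv-press-text-generation", model=model, device=device)

context = "A long document you want to compress once and query afterwards."
question = "What is this document about?"

# compression_ratio controls how much of the KV cache is pruned during prefilling
press = ExpectedAttentionPress(compression_ratio=0.5)
answer = pipe(context, question=question, press=press)["answer"]
print(answer)
```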
@@ -208,4 +203,25 @@ with press(model):
 
 However, the `generate` method does not allow excluding the question from the compression, which would artificially favor methods such as SnapKV. Ideally, we want a compression method that works regardless of what comes after the context (_e.g._ for use cases such as chat or document question answering). Finally, the `generate` method does not allow generating answers for multiple questions at once.
 
-</details>
+</details>
+
+
+## Advanced installation settings
+To install optional packages, you can use [uv](https://docs.astral.sh/uv/).
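
For reference, the `generate`-based usage that the paragraph above criticizes looks roughly like the sketch below; the `with press(model):` context manager appears in the hunk header above, while the model name, tokenizer handling, and press choice are illustrative assumptions. Because the question is part of the prompt passed to `generate`, it is compressed together with the context, which is the limitation being discussed.

```python
# Rough sketch of the generate-based usage discussed above, not the recommended pipeline.
# Model name, tokenizer handling, and the chosen press are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

from kvpress import ExpectedAttentionPress  # assumed press class

model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")

context = "A long document ..."
question = "A question about the document"
inputs = tokenizer(context + "\n" + question, return_tensors="pt").to(model.device)

press = ExpectedAttentionPress(compression_ratio=0.5)
with press(model):
    # The question tokens are part of the prefill here, so they are compressed
    # along with the context, which artificially favors question-aware presses.
    output_ids = model.generate(**inputs, max_new_tokens=64)

print(tokenizer.decode(output_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```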