NVIDIA
diff --git a/README.md b/README.md
Lines changed: 12 additions & 16 deletions
diff --git a/evaluation/benchmarks/aime25/calculate_metrics.py b/evaluation/benchmarks/aime25/calculate_metrics.py
Lines changed: 1 addition & 1 deletion
diff --git a/evaluation/benchmarks/longbench/calculate_metrics.py b/evaluation/benchmarks/longbench/calculate_metrics.py
Lines changed: 2 additions & 2 deletions
diff --git a/evaluation/evaluate.py b/evaluation/evaluate.py
Lines changed: 17 additions & 8 deletions
diff --git a/evaluation/evaluate_config.yaml b/evaluation/evaluate_config.yaml
Lines changed: 5 additions & 3 deletions
diff --git a/evaluation/evaluate_registry.py b/evaluation/evaluate_registry.py
Lines changed: 11 additions & 3 deletions
diff --git a/evaluation/leaderboard.sh b/evaluation/leaderboard.sh
Lines changed: 40 additions & 0 deletions
@@ -4,7 +4,8 @@
 [![Hugging Face Space](https://img.shields.io/badge/🤗%20Hugging%20Face-Space-blue)](https://huggingface.co/spaces/nvidia/kvpress)
 [![Blog post](https://img.shields.io/badge/🤗%20Hugging%20Face-Blog-blue)](https://huggingface.co/blog/nvidia/kvpress)
 [![Hugging Face Leaderboard](https://img.shields.io/badge/🤗%20HuggingFace-Leaderboard-orange)](https://huggingface.co/spaces/nvidia/kvpress-leaderboard)
-[![Paper](https://img.shields.io/badge/📄%20arXiv-Paper-red)](https://arxiv.org/abs/2510.00636v1)
+[![arXiv](https://img.shields.io/badge/arXiv-2510.00636-b31b1b.svg)](https://arxiv.org/abs/2510.00636v1)
+
 
 ![kvpress](kvpress.jpg)
 
@@ -54,10 +55,8 @@ KVPress provides a set of "presses" that compress the KV cache during the prefil
 from transformers import pipeline
 from kvpress import ExpectedAttentionPress
 
-device = "cuda:0"
-model = "meta-llama/Llama-3.1-8B-Instruct"
-model_kwargs = {"attn_implementation": "flash_attention_2"}
-pipe = pipeline("kv-press-text-generation", model=model, device=device, model_kwargs=model_kwargs)
+model = "Qwen/Qwen3-8B"
+pipe = pipeline("kv-press-text-generation", model=model, device_map="auto", dtype="auto")
 
 context = "A very long text you want to compress once and for all"
 question = "\nA question about the compressed context"  # optional
@@ -71,7 +70,7 @@ In the snippet above, the compression is only applied on the context tokens so t
 <details><summary>
 Decoding Compression
 </summary>
-By default, KVPress applies compression during the pre-filling phase. As a new (experimental) feature, we now support decoding compression via the `DecodingPress` wrapper. `DecodingPress` compresses the KV cache periodically during token generation, optionally maintaining a buffer of recent hidden states. `DecodingPress` supports the following parameters:
+By default, KVPress applies compression during the prefilling phase. As a new (experimental) feature, we now support decoding compression via the `DecodingPress` wrapper. `DecodingPress` compresses the KV cache periodically during token generation, optionally maintaining a buffer of recent hidden states. `DecodingPress` supports the following parameters:
 
 - `base_press`: Any ScorerPress (e.g., `KNormPress`, `CriticalKVPress`)
 - `compression_interval`: Steps between compressions (default: 10)
@@ -122,7 +121,7 @@ Several presses inherit from `ScorerPress` ([source](kvpress/presses/scorer_pres
 - `ExpectedAttentionPress` ([source](kvpress/presses/expected_attention_press.py), [notebook](notebooks/expected_attention.ipynb)): expected attention weight during the generation phase 
 - `StreamingLLMPress` ([source](kvpress/presses/streaming_llm_press.py), [paper](https://arxiv.org/abs/2309.17453)): keep only the initial and recent tokens 
 - `TOVAPress` ([source](kvpress/presses/tova_press.py), [paper](https://arxiv.org/abs/2401.06104)): attention weight of the last query averaged across heads 
-- `ObservedAttentionPress` ([source](kvpress/presses/observed_attention_press.py), [paper](https://arxiv.org/abs/2306.14048)): average attention weight observed during in pre-filling phase
+- `ObservedAttentionPress` ([source](kvpress/presses/observed_attention_press.py), [paper](https://arxiv.org/abs/2306.14048)): average attention weight observed during the prefilling phase
 - `QFilterPress` ([source](kvpress/presses/qfilter_press.py), [paper](https://arxiv.org/abs/2503.02812)): project the Key representations on the main SVD component of the Query vectors to approximate the attention scores.
 - `PyramidKVPress` ([source](kvpress/presses/pyramidkv_press.py), [paper](https://arxiv.org/abs/2406.02069)): maintain pyramid-like cache sizes, allocating more cache budget to lower layers and less to higher layers
 - `LagKVPress` ([source](kvpress/presses/lagkv_press.py), [paper](https://arxiv.org/abs/2504.04704)): leverage on the KV lag-relative information to compress. It's query free, attention-weight free, and flash-attention compatible.
@@ -131,9 +130,10 @@ Several presses inherit from `ScorerPress` ([source](kvpress/presses/scorer_pres
 - `LeverageScorePress` ([source](kvpress/presses/leverage_press.py), [paper](https://arxiv.org/abs/2507.08143)): evicts tokens based on approximate statistical leverage (i.e we preserve outliers in the key space).
 - `CompactorPress` ([source](kvpress/presses/compactor_press.py), [paper](https://arxiv.org/abs/2507.08143)): blends `NonCausalAttnPress` and `LeverageScorePress` based on the compression_ratio.
 - `CURPress` ([source](kvpress/presses/cur_press.py), [paper](https://arxiv.org/abs/2509.15038)): prune keys and values based on the CUR decomposition using approximate leverage scores.
+- `KVzapPress` ([source](kvpress/presses/kvzap/kvzap_press.py), [paper](https://arxiv.org/abs/2601.07891), [training](kvzap)): approximate KVzip+ using a fast surrogate model. To be used in conjunction with the `ThresholdPress`.
 
 Some presses rely on a different logic:
-- `ThinKPress` ([source](kvpress/presses/think_press.py), [paper](https://arxiv.org/pdf/2407.21018)): compress the dimensions of the keys based on the channel attention score on the last queries 
+- `ThinKPress` ([source](kvpress/presses/think_press.py), [paper](https://arxiv.org/abs/2407.21018)): compress the dimensions of the keys based on the channel attention score on the last queries 
 - `SimLayerKVPress` ([source](kvpress/presses/simlayerkv_press.py), [paper](https://arxiv.org/abs/2410.13846)): identify "lazy" layers, and apply the StreamingLLM approach to them 
 - `DuoAttentionPress` ([source](kvpress/presses/duo_attention_press.py), [paper](https://arxiv.org/abs/2410.10819)): split heads into retrieval heads (no compression) and streaming heads (StreamingLLM approach)
 - `FinchPress` ([source](kvpress/presses/finch_press.py), [paper](https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00716/125280)): similar to SnapKV with a dynamic window size and key value re-rotation
@@ -148,8 +148,9 @@ Finally we provide wrapper presses that can be combined with other presses:
 - `ChunkPress` ([source](kvpress/presses/chunk_press.py), [paper](https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00716/125280)): compress the KV cache on each sequence chunk separately. This can yield to more uniform compression across long sequences
 - `CriticalKVPress` and `CriticalAdaKVPress` ([source](kvpress/presses/criticalkv_press.py), [paper](https://arxiv.org/abs/2502.03805)): refine the scores using the L1 norm of Wo @ values, coupled with a two-stage selection.
 - `BlockPress` ([source](kvpress/presses/block_press.py), [paper](https://arxiv.org/abs/2504.15364)): segments input sequence into non-overlapping blocks and compresses iteratively.
-- `DecodingPress` ([source](kvpress/presses/decoding_press.py)): Allows for compression during decoding, see decoding section in this README.
-- `PrefillDecodingPress` ([source](kvpress/presses/prefill_decoding_press.py)): Allows to compress both during prefilling and during decoding.
+- `DecodingPress` ([source](kvpress/presses/decoding_press.py)): allows for compression during decoding, see decoding section in this README.
+- `PrefillDecodingPress` ([source](kvpress/presses/prefill_decoding_press.py)): allows compression during both prefilling and decoding.
+- `ThresholdPress` ([source](kvpress/presses/threshold_press.py)): evicts keys and values whose scores, computed by any `ScorerPress`, fall below a given threshold, instead of relying on top-k scores. Supports both prefilling and decoding (if `decoding=True`).
 
 For a detailed list of existing KV cache compression methods, check [Awesome-KV-Cache-Compression](https://github.com/October2001/Awesome-KV-Cache-Compression) or [Awesome-LLM-Compression](https://github.com/HuangOwen/Awesome-LLM-Compression?tab=readme-ov-file#kv-cache-compression)
 
@@ -164,11 +165,6 @@ Please refer to the [evaluation](evaluation/README.md) directory in this repo fo
 
 Below we report the average performance on the RULER dataset with 4k context length for different presses, from our [![Hugging Face Leaderboard](https://img.shields.io/badge/🤗%20HuggingFace-Leaderboard-orange)](https://huggingface.co/spaces/nvidia/kvpress-leaderboard)
 
-<p>
-  <img src="leaderboard_plot_score.png" alt="Leaderboard">
-</p>
-
-
 ## Quantization
 
 We support KV cache quantization through the transformers `QuantizedCache` class (see [HF blog post](https://huggingface.co/blog/kv-cache-quantization#how-to-use-quantized-kv-cache-in-%F0%9F%A4%97-transformers)). To use it, simply pass a cache object to your pipeline:
@@ -242,7 +238,7 @@ Memory usage should be reduced by around `compression_ratio * kv_cache_size`. As
 
 ### How does a press work ? </summary>
 
-A press registers a forward hook (`press.forward_hook` method) to each attention layer during the pre-filling phase. Registration can be applied using the press as a context manager (`press.__call__` method):
+A press registers a forward hook (`press.forward_hook` method) to each attention layer during the prefilling phase. Registration can be applied using the press as a context manager (`press.__call__` method):
 
 ```python
 import torch
 
@@ -6,7 +6,7 @@
 
 def extract_boxed(pred_answer):
     try:
-        return str(pred_answer.split("boxed{")[1].split("}")[0])
+        return str(pred_answer.split("boxed{")[-1].split("}")[0])
     except IndexError:
         return None
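
Switching the index from `[1]` to `[-1]` changes which `boxed{...}` group is returned when a prediction contains more than one. The sketch below contrasts the two behaviors on a made-up prediction string. The helper names `extract_first_boxed` and `extract_last_boxed` are illustrative only and do not exist in the repository.

```python
def extract_first_boxed(pred_answer):
    # Behavior before the change: text following the FIRST "boxed{".
    try:
        return str(pred_answer.split("boxed{")[1].split("}")[0])
    except IndexError:
        return None


def extract_last_boxed(pred_answer):
    # Behavior after the change: text following the LAST "boxed{".
    try:
        return str(pred_answer.split("boxed{")[-1].split("}")[0])
    except IndexError:
        return None


# Hypothetical model output containing an intermediate and a final boxed answer.
prediction = r"First guess \boxed{12}, but after checking the answer is \boxed{42}."

first = extract_first_boxed(prediction)  # picks the intermediate answer
last = extract_last_boxed(prediction)    # picks the final answer

# Inputs without any "boxed{" behave differently in the two variants.
missing_old = extract_first_boxed("no final answer")
missing_new = extract_last_boxed("no final answer")
```

One side effect worth checking: with `[-1]`, a prediction that contains no `boxed{` no longer raises `IndexError`, because `split` returns a one-element list and `[-1]` is always valid. In that case the new variant returns the text up to the first `}` (or the whole string) rather than `None`.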
 
 
@@ -60,9 +60,9 @@ def scorer(dataset, predictions, answers, all_classes):
     for prediction, ground_truths in zip(predictions, answers):
         score = 0.0
         if dataset in ["trec", "triviaqa", "samsum", "lsht"]:
-            prediction = prediction.lstrip("\n").split("\n")[0]
+            prediction = prediction.lstrip().split("\n")[0]
         for ground_truth in ground_truths:
-            score = max(score, dataset2metric[dataset](prediction, ground_truth, all_classes=all_classes))
+            score = max(score, dataset2metric[dataset](prediction.lstrip(), ground_truth, all_classes=all_classes))
         total_score += score
     return round(100 * total_score / len(predictions), 2)
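
The move from `lstrip("\n")` to `lstrip()` matters when a prediction starts with whitespace other than a newline. A small illustration on a made-up prediction string (not taken from any benchmark):

```python
# Hypothetical raw model output: a space, a newline and a tab before the answer.
raw_prediction = " \n\tParis\nSome extra explanation"

# Before the change: only leading "\n" characters are stripped. The string
# starts with a space, so nothing is removed and the first line is a lone space.
old_first_line = raw_prediction.lstrip("\n").split("\n")[0]

# After the change: all leading whitespace is stripped before taking the first line.
new_first_line = raw_prediction.lstrip().split("\n")[0]
```

With the old call, the metric for `trec`, `triviaqa`, `samsum` and `lsht` would have been computed on `" "` in this example, while the new call scores `"Paris"`.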
 
 
@@ -18,7 +18,7 @@
 from evaluate_registry import DATASET_REGISTRY, PRESS_REGISTRY, SCORER_REGISTRY
 from fire import Fire
 from tqdm import tqdm
-from transformers import Pipeline, pipeline
+from transformers import FineGrainedFP8Config, Pipeline, pipeline
 
 from kvpress import (
     ComposedPress,
@@ -28,6 +28,7 @@
     ObservedAttentionPress,
     ScorerPress,
     ThinKPress,
+    ThresholdPress,
 )
 
 logger = logging.getLogger(__name__)
@@ -45,6 +46,7 @@ class EvaluationConfig:
     press_name: str = "knorm"
     compression_ratio: float = 1.0
     key_channel_compression_ratio: Optional[float] = None
+    threshold: Optional[float] = None
 
     # Dataset and generation parameters
     fraction: float = 1.0
@@ -71,6 +73,9 @@ class EvaluationConfig:
     # For reproducibility
     seed: int = 42
 
+    # Quantization
+    fp8: bool = False
+
     def __post_init__(self):
         """Validate configuration after initialization."""
         # Validate dataset
@@ -85,11 +90,6 @@ def __post_init__(self):
             logger.info("Using 'no_press' configuration. Overriding compression_ratio to 0.0")
             self.compression_ratio = 0.0
 
-        # Validate compression ratios
-        assert (
-            0.0 <= self.compression_ratio <= 1.0
-        ), f"compression_ratio must be between 0.0 and 1.0, got {self.compression_ratio}"
-
         # Only validate key_channel_compression_ratio if it's not None
         if self.key_channel_compression_ratio is not None:
             assert (
@@ -115,8 +115,6 @@ def get_results_dir(self, output_dir: Path) -> Path:
         ----------
         output_dir : Path
             The output directory path
-        press
-            The press instance to check for ThinKPress components
 
         Returns
         -------
@@ -132,6 +130,8 @@ def get_results_dir(self, output_dir: Path) -> Path:
             f"{self.compression_ratio:.2f}",
         ]
 
+        if self.threshold is not None:
+            components[-1] = f"{self.threshold:.2f}"
         if self.fraction < 1.0:
             components.append(f"fraction{self.fraction:.3f}")
         if self.max_context_length is not None:
@@ -256,6 +256,10 @@ def _setup_press(self):
         if isinstance(press, DuoAttentionPress):
             press.head_compression_ratio = compression_ratio
             logger.info(f"Set DuoAttentionPress head_compression_ratio to {compression_ratio}")
+        elif isinstance(press, ThresholdPress):
+            assert self.config.threshold is not None, "threshold must be set for ThresholdPress"
+            press.threshold = self.config.threshold
+            logger.info(f"Set ThresholdPress threshold to {press.threshold}")
         elif isinstance(press, ComposedPress):
             for ps in press.presses:
                 if isinstance(ps, ThinKPress):
@@ -349,6 +353,11 @@ def _setup_model_pipeline(self):
             logger.info(f"No device specified, auto-detected device: {device}")
 
         model_kwargs = self.config.model_kwargs or {}
+
+        if self.config.fp8:
+            model_kwargs["quantization_config"] = FineGrainedFP8Config()
+            logger.info("FP8 quantization enabled.")
+
         if isinstance(self.press, ObservedAttentionPress):
             model_kwargs["attn_implementation"] = "eager"
             logger.info("ObservedAttentionPress detected, setting attn_implementation to 'eager'.")
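
The README entry for `ThresholdPress` describes evicting entries whose score falls below a threshold instead of keeping a fixed top-k. The plain-Python sketch below illustrates only that selection rule on a made-up list of scores. It is not the kvpress implementation, which works on per-layer tensors and may differ in details, and the function names are hypothetical.

```python
def keep_top_k(scores, compression_ratio):
    """Keep the highest-scoring positions; the kept count is fixed by the ratio."""
    n_kept = int(len(scores) * (1 - compression_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_kept])


def keep_above_threshold(scores, threshold):
    """Keep every position whose score is at least the threshold."""
    return [i for i, s in enumerate(scores) if s >= threshold]


# Made-up scores for 8 cached positions (higher means more important).
scores = [-1.0, -7.5, -2.0, -9.0, -4.5, -0.5, -6.0, -3.5]

top_k_kept = keep_top_k(scores, compression_ratio=0.5)   # always 4 positions
loose_kept = keep_above_threshold(scores, threshold=-5.0)
strict_kept = keep_above_threshold(scores, threshold=-3.0)
```

With a threshold, the number of kept positions depends on the score distribution rather than on a preset ratio. This is consistent with the diff dropping the `compression_ratio` range assertion and using `threshold` in the results directory name when it is set.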
 
@@ -7,9 +7,10 @@ model: "meta-llama/Meta-Llama-3.1-8B-Instruct"
 dataset: "ruler"                                  # see DATASET_REGISTRY in evaluate_registry.py
 data_dir: "4096"                                  # Subdirectory of the dataset (if applicable) else leave "null"
 
-press_name: "knorm"                                # see PRESS_REGISTRY in evaluate_registry.py
-compression_ratio: 0.5                             # Compression ratio for the press (0.0 to 1.0)
-key_channel_compression_ratio: null                # For ThinKPress and ComposedPress (0.0 to 1.0)
+press_name: "knorm"                               # see PRESS_REGISTRY in evaluate_registry.py
+compression_ratio: 0.5                            # Compression ratio for the press (0.0 to 1.0)
+key_channel_compression_ratio: null               # For ThinKPress and ComposedPress (0.0 to 1.0)
+threshold: null                                   # For ThresholdPress
 
 fraction: 1.0                                     # Fraction of dataset to evaluate (0.0 to 1.0), for quick testing
 max_new_tokens: null                              # Maximum new tokens to generate (null = use dataset default)
@@ -18,6 +19,7 @@ query_aware: false                                # Whether to include question
 needle_depth: null                                # Depth (int or list of ints) percentage of the needle in the haystack (0 to 100), only for needle_in_haystack dataset
 
 device: null  # Device to use (null = auto-detect, "cuda:0", "cpu", etc.)
+fp8: false    # Whether to use FP8 quantization (FineGrainedFP8Config() from transformers)
 
 # You can add any model kwargs here.
 model_kwargs:
 
@@ -26,15 +26,19 @@
     FinchPress,
     KeyDiffPress,
     KnormPress,
+    KVzapPress,
     KVzipPress,
     ObservedAttentionPress,
     PyramidKVPress,
     QFilterPress,
     RandomPress,
     SnapKVPress,
     StreamingLLMPress,
+    ThresholdPress,
     ThinKPress,
     TOVAPress,
+    CURPress,
+    LagKVPress,
 )
 
 # These dictionaries define the available datasets, scorers, and KVPress methods for evaluation.
@@ -67,22 +71,26 @@
 
 
 PRESS_REGISTRY = {
-    "adakv_expected_attention": AdaKVPress(ExpectedAttentionPress()),
-    "adakv_expected_attention_e2": AdaKVPress(ExpectedAttentionPress(epsilon=1e-2)),
     "adakv_snapkv": AdaKVPress(SnapKVPress()),
     "block_keydiff": BlockPress(press=KeyDiffPress(), block_size=128),
     "chunkkv": ChunkKVPress(press=SnapKVPress(), chunk_length=20),
     "critical_adakv_expected_attention": CriticalAdaKVPress(ExpectedAttentionPress(use_vnorm=False)),
     "critical_adakv_snapkv": CriticalAdaKVPress(SnapKVPress()),
     "critical_expected_attention": CriticalKVPress(ExpectedAttentionPress(use_vnorm=False)),
     "critical_snapkv": CriticalKVPress(SnapKVPress()),
+    "cur": CURPress(),
     "duo_attention": DuoAttentionPress(),
     "duo_attention_on_the_fly": DuoAttentionPress(on_the_fly_scoring=True),
-    "expected_attention": ExpectedAttentionPress(),
+    "expected_attention": AdaKVPress(ExpectedAttentionPress(epsilon=1e-2)),
     "finch": FinchPress(),
     "keydiff": KeyDiffPress(),
     "kvzip": KVzipPress(),
     "kvzip_plus": KVzipPress(kvzip_plus_normalization=True),
+    "kvzap_linear": ThresholdPress(press=KVzapPress(model_type="linear")),
+    "kvzap_mlp": ThresholdPress(press=KVzapPress(model_type="mlp")),
+    "kvzap_mlp_head": KVzapPress(model_type="mlp"),
+    "kvzap_mlp_layer": AdaKVPress(KVzapPress(model_type="mlp")),
+    "lagkv": LagKVPress(),
     "knorm": KnormPress(),
     "observed_attention": ObservedAttentionPress(),
     "pyramidkv": PyramidKVPress(),
 
@@ -0,0 +1,40 @@
+# SPDX-FileCopyrightText: Copyright (c) 1993-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+# Script to run the leaderboard evaluation on 4 GPUs
+dataset="ruler"
+data_dir="4096"
+model="Qwen/Qwen3-8B"
+output_dir="./results_lb"
+
+# Loop 1: presses not requiring to include the questions in the compression
+press_names=("random" "knorm" "snapkv" "expected_attention" "streaming_llm" "tova" "observed_attention" "qfilter" "pyramidkv" "lagkv" "keydiff" "adakv_compactor" "cur" "duo_attention" "duo_attention_on_the_fly" "kvzip")
+
+python evaluate.py --dataset $dataset --data_dir $data_dir --model $model --press_name no_press --compression_ratio 0.00 --output_dir $output_dir --device "cuda:0"
+
+for press in "${press_names[@]}"; do  
+    python evaluate.py --dataset $dataset --data_dir $data_dir --model $model --press_name $press --compression_ratio 0.25  --output_dir $output_dir --device "cuda:0" &
+    python evaluate.py --dataset $dataset --data_dir $data_dir --model $model --press_name $press --compression_ratio 0.50  --output_dir $output_dir --device "cuda:1" &
+    python evaluate.py --dataset $dataset --data_dir $data_dir --model $model --press_name $press --compression_ratio 0.75  --output_dir $output_dir --device "cuda:2" &
+    python evaluate.py --dataset $dataset --data_dir $data_dir --model $model --press_name $press --compression_ratio 0.875 --output_dir $output_dir --device "cuda:3" &
+    wait
+done
+
+# Use -3, -4, -5, -6 for Qwen3-8B and -6, -7, -8, -9 for Llama-3.1-8B-Instruct
+for press in "kvzap_linear" "kvzap_mlp"; do
+  python evaluate.py --dataset $dataset --data_dir $data_dir --model $model --press_name $press --threshold -3  --output_dir $output_dir --device "cuda:0" &
+  python evaluate.py --dataset $dataset --data_dir $data_dir --model $model --press_name $press --threshold -4  --output_dir $output_dir --device "cuda:1" &
+  python evaluate.py --dataset $dataset --data_dir $data_dir --model $model --press_name $press --threshold -5  --output_dir $output_dir --device "cuda:2" &
+  python evaluate.py --dataset $dataset --data_dir $data_dir --model $model --press_name $press --threshold -6  --output_dir $output_dir --device "cuda:3" &
+  wait
+done
+
+# Loop 2: presses requiring to compress questions
+press_names=("snapkv" "adakv_snapkv" "finch" "chunkkv")
+for press in "${press_names[@]}"; do  
+    python evaluate.py --dataset $dataset --data_dir $data_dir --model $model --press_name $press --compression_ratio 0.25  --output_dir $output_dir --device "cuda:0" --query_aware &
+    python evaluate.py --dataset $dataset --data_dir $data_dir --model $model --press_name $press --compression_ratio 0.50  --output_dir $output_dir --device "cuda:1" --query_aware &
+    python evaluate.py --dataset $dataset --data_dir $data_dir --model $model --press_name $press --compression_ratio 0.75  --output_dir $output_dir --device "cuda:2" --query_aware &
+    python evaluate.py --dataset $dataset --data_dir $data_dir --model $model --press_name $press --compression_ratio 0.875 --output_dir $output_dir --device "cuda:3" --query_aware &
+    wait
+done
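
Each loop in `leaderboard.sh` launches four runs in parallel, one compression ratio per GPU, then waits before moving to the next press. As a rough sketch, the same command matrix can be built in Python for inspection before launching anything. The function name `build_batch` is hypothetical, and the snippet only constructs argument lists without executing them.

```python
# Values copied from leaderboard.sh.
RATIOS = ["0.25", "0.50", "0.75", "0.875"]
QUERY_AWARE_PRESSES = ["snapkv", "adakv_snapkv", "finch", "chunkkv"]


def build_batch(press, query_aware=False):
    """Return one evaluate.py command per compression ratio, one GPU each."""
    commands = []
    for gpu_index, ratio in enumerate(RATIOS):
        cmd = [
            "python", "evaluate.py",
            "--dataset", "ruler",
            "--data_dir", "4096",
            "--model", "Qwen/Qwen3-8B",
            "--press_name", press,
            "--compression_ratio", ratio,
            "--output_dir", "./results_lb",
            "--device", f"cuda:{gpu_index}",
        ]
        if query_aware:
            cmd.append("--query_aware")
        commands.append(cmd)
    return commands


# One batch of four commands per press in the query-aware loop.
batches = {press: build_batch(press, query_aware=True) for press in QUERY_AWARE_PRESSES}
```

Each batch could then be passed to `subprocess.Popen` and joined before starting the next one, which mirrors the `&` and `wait` pattern in the shell script.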