
Commit 0d51c55

[GenAI] Support Token Eviction for LRMs
1 parent 6d10335 commit 0d51c55

6 files changed: +163 additions, -65 deletions


modules/genai_optimizations/README.md

Lines changed: 15 additions & 0 deletions
@@ -6,6 +6,7 @@ This module provides experimental optimizations for GenAI models in PyTorch. The
 
 - Text Generation Using LLMs
 - Visual language text generation
+- Reasoning and Problem Solving
 
 ## Supported Generative AI Optimization Methods
 
@@ -34,6 +35,14 @@ This module provides experimental optimizations for GenAI models in PyTorch. The
   Paper: https://arxiv.org/pdf/2306.14048
 - **SnapKV Mode** – Modifies the *H2O* approach by computing token importance within a small sliding window of the most recent queries during the prefill stage, then reverting to the H2O strategy during decoding. The authors observed that only a small subset of prompt tokens is sufficient for accurate response generation.
   Paper: https://arxiv.org/pdf/2404.14469
+- **RKV Mode** – Computes token importance scores from attention weights over a sliding window of the most recent queries during both the prefill and decode stages. Importance scores are stabilized with per-token max-pooling and then averaged across attention heads.
+
+Refined modes enhance the standard eviction strategies by selecting the most representative tokens or blocks from the evictable (intermediate) region. These methods aim to balance contextual importance with redundancy reduction to optimize cache efficiency. If `refined_algorithm` is enabled but `refined_tokens` is unspecified or set to 0, the number of refined tokens is determined dynamically as part of the intermediate token budget. The budget for the primary algorithm is allocated by selecting the minimal number of tokens or groups that together capture at least 90% of the total attention mass, ensuring that all high-importance tokens are retained. For the remaining eviction budget, each token's dissimilarity to the already retained set is computed, promoting information diversity and reducing redundancy.
+
+Supported refined modes:
+- **KVCrush Mode** – Selects representative blocks based on diversity rather than raw importance. Each token is mapped to a binary indicator, an anchor point (reference pattern) is constructed using one of several modes (`random`, `zeros`, `ones`, `mean`, `alternate`), and the blocks with the highest Hamming distance to the anchor point are selected.
+  Paper: https://arxiv.org/pdf/2503.00022
+- **DiverseKV Mode** – Implements a dynamic redundancy scoring mechanism that identifies and de-prioritizes repetitive tokens based on the cosine similarity of their key vectors with already retained tokens. Key vectors are normalized, and cosine similarities are computed with diagonal values zeroed to avoid self-similarity. Similarities are thresholded on a per-head basis: only values greater than or equal to each head's mean similarity are kept, then aggregated across heads. For the remaining eviction budget, each token's or group's dissimilarity to the already retained tokens or groups is calculated, and the tokens or groups with the highest dissimilarity scores are retained, maximizing contextual diversity while reducing redundancy.
 
 ## Supported and tested models
 
@@ -53,6 +62,12 @@ Multimodal Large Language Models:
 - [Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)
 - [Qwen/Qwen2-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct)
 
+Large Reasoning Models:
+
+- [deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B)
+- [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B)
+- [microsoft/Phi-4-mini-reasoning](https://huggingface.co/microsoft/Phi-4-mini-reasoning)
+
 ## Prerequisites
 
 Before running algorithms, ensure you have **Python 3.10+** installed and set up your environment.
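The README text in this diff describes the new scoring mechanisms only in prose; the next few sketches illustrate them in PyTorch. First, RKV scoring: a minimal, hypothetical sketch (not the module's actual API) in which importance is accumulated from attention weights over a sliding window of recent queries, smoothed with per-token max-pooling, and averaged across heads. The function name, shapes, and pooling kernel size are assumptions.

```python
import torch
import torch.nn.functional as F


def rkv_token_scores(attn_weights: torch.Tensor, window: int = 32, pool_kernel: int = 7) -> torch.Tensor:
    """Hypothetical RKV-style importance scores.

    attn_weights: [num_heads, q_len, kv_len] softmax-normalized attention weights.
    Returns one importance score per cached token, shape [kv_len].
    """
    # Only the most recent `window` queries vote on token importance.
    recent = attn_weights[:, -window:, :]                      # [H, W, kv_len]
    # Attention mass each key/value token receives from the window.
    scores = recent.sum(dim=1)                                 # [H, kv_len]
    # Stabilize with per-token max-pooling over a small neighborhood.
    scores = F.max_pool1d(scores.unsqueeze(1), kernel_size=pool_kernel,
                          stride=1, padding=pool_kernel // 2).squeeze(1)
    # Average across attention heads to get a single score per token.
    return scores.mean(dim=0)                                  # [kv_len]
```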
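The dynamic budget split described for the refined modes can be sketched as follows: keep the smallest set of top-scoring tokens that covers roughly 90% of the total attention mass, and hand the rest of the intermediate budget to the diversity-based selection. This is an illustrative reading of the README text, with assumed names.

```python
import torch


def split_budget_by_mass(scores: torch.Tensor, budget: int, mass: float = 0.9) -> tuple[int, int]:
    """Hypothetical split of the intermediate-token budget.

    scores: non-negative importance scores, shape [num_tokens].
    Returns (primary_tokens, refined_tokens) summing to `budget`.
    """
    sorted_scores, _ = scores.sort(descending=True)
    cumulative = sorted_scores.cumsum(dim=0) / scores.sum()
    # Smallest prefix of top-scoring tokens that captures `mass` of the total attention mass.
    n_primary = int(torch.searchsorted(cumulative, torch.tensor(mass)).item()) + 1
    n_primary = min(n_primary, budget)
    return n_primary, budget - n_primary
```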
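For the KVCrush refinement, a hedged sketch of the anchor-and-Hamming-distance idea follows; the indicator construction and helper names are assumptions, not the commit's implementation.

```python
import torch


def kvcrush_select(indicators: torch.Tensor, k: int, anchor_mode: str = "mean") -> torch.Tensor:
    """Hypothetical KVCrush-style selection.

    indicators: [num_blocks, dim] binary (0/1) pattern per evictable block.
    Returns indices of the k blocks farthest (in Hamming distance) from the anchor,
    i.e. the most diverse blocks rather than the highest-scoring ones.
    """
    dim = indicators.shape[1]
    if anchor_mode == "zeros":
        anchor = torch.zeros(dim)
    elif anchor_mode == "ones":
        anchor = torch.ones(dim)
    elif anchor_mode == "alternate":
        anchor = (torch.arange(dim) % 2).float()
    elif anchor_mode == "random":
        anchor = torch.randint(0, 2, (dim,)).float()
    else:  # "mean": round the average indicator pattern
        anchor = indicators.float().mean(dim=0).round()
    # Hamming distance = number of positions where a block's pattern differs from the anchor.
    hamming = (indicators.float() != anchor).sum(dim=-1)
    return torch.topk(hamming, k=k).indices
```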
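And for DiverseKV, a rough sketch of the redundancy/dissimilarity scoring, again with assumed shapes and names:

```python
import torch
import torch.nn.functional as F


def diversekv_dissimilarity(keys: torch.Tensor, retained_idx: torch.Tensor) -> torch.Tensor:
    """Hypothetical DiverseKV-style scoring.

    keys: [num_heads, seq_len, head_dim] key vectors; retained_idx: indices of tokens
    already kept. Returns a per-token dissimilarity score (higher = more diverse,
    i.e. more worth keeping next).
    """
    # Normalized keys turn dot products into cosine similarities.
    keys = F.normalize(keys, dim=-1)
    retained = keys[:, retained_idx, :]                        # [H, R, D]
    sim = torch.einsum("hsd,hrd->hsr", keys, retained)         # [H, S, R]
    # Per-head thresholding: keep only similarities at or above the head's mean.
    head_mean = sim.mean(dim=(1, 2), keepdim=True)
    sim = torch.where(sim >= head_mean, sim, torch.zeros_like(sim))
    # Redundancy = similarity mass to the retained set, aggregated across heads;
    # dissimilarity is its negation, so the least redundant tokens rank highest.
    redundancy = sim.sum(dim=-1).mean(dim=0)                   # [S]
    return -redundancy
```

Note that `keys` here still contains the retained tokens themselves, so a fuller implementation would mask out self-similarities, which is the diagonal zeroing the README mentions.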

modules/genai_optimizations/benchmarks/README.md

Lines changed: 4 additions & 2 deletions
@@ -115,12 +115,14 @@ GSM8K (Grade School Math 8K) is a dataset of 8,500 high-quality, linguistically
 
 ```bash
 python math500_gsm_bench.py \
-    --subset gsm \
+    --dataset MATH500 \
     --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
+    --max_tokens 5000 \
+    --max_examples 100 \
     --enable_eviction \
     --algorithm rkv \
     --granularity per_group \
-    --intermediate_tokens 1024
+    --intermediate_tokens 512
 ```
 This will automatically:
 
modules/genai_optimizations/benchmarks/math500_gsm_bench.py

Lines changed: 33 additions & 27 deletions
@@ -141,7 +141,17 @@ def prepare_dataset(dataset, max_samples=None):
             }
         )
     elif dataset == "GSM":
-        data_path = "data/gsm/test.jsonl"
+        data_path = "gsm.jsonl"
+
+        if not os.path.exists(data_path):
+            import requests
+            url = "https://raw.githubusercontent.com/VITA-Group/SEAL/main/data/gsm/test.jsonl"
+            response = requests.get(url)
+            response.raise_for_status()
+            with open(data_path, "w", encoding="utf-8") as f:
+                f.write(response.text)
+            print(f"Downloaded and saved to '{data_path}'.")
+
         with open(data_path) as fin:
             for line in fin:
                 example = json.loads(line)
@@ -187,7 +197,7 @@ def main(args):
     prompts = []
     for example in test_data:
         prompt = prefix + "Question: " + example["question"].strip() + "\nAnswer: "
-        if args.use_chat_format:
+        if not args.omit_chat_template:
             if "deepseek" in args.model:
                 messages = [{"role": "user", "content": prefix + "Question: " + example["question"].strip()}]
             else:
@@ -196,7 +206,7 @@ def main(args):
                     {"role": "user", "content": "Question: " + example["question"].strip()},
                 ]
             prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
-        if args.remove_bos and tokenizer.bos_token is not None and prompt.startswith(tokenizer.bos_token):
+        if not args.keep_bos and tokenizer.bos_token is not None and prompt.startswith(tokenizer.bos_token):
             prompt = prompt[len(tokenizer.bos_token) :]
         prompts.append(prompt)
 
@@ -224,35 +234,31 @@ def main(args):
         contexts.append(token_eviction)
 
     outputs = []
-    prompts_with_eviction = 0
     avg_prompt_len = []
     with ExitStack() as stack:
         for ctx in contexts:
             if ctx is not None:
                 stack.enter_context(ctx(model))
 
-        for prompt in prompts:
-            tokenized_batch = tokenizer(prompt, return_tensors="pt", padding=True)
-            tokenized_batch = {k: v.to(model.device) for k, v in tokenized_batch.items()}
-            avg_prompt_len.append(tokenized_batch["input_ids"].shape[1])
-
-            output = model.generate(
-                **tokenized_batch,
-                do_sample=False,
-                max_new_tokens=args.max_tokens,
-                use_cache=True,
-                pad_token_id=tokenizer.eos_token_id,
-            )
-            OUTPUT_LENGTHS.append(output.shape[1])
-            if output.shape[1] > token_eviction.max_cache_size:
-                prompts_with_eviction += 1
-            output = [tokenizer.decode(o[avg_prompt_len[-1]:], skip_special_tokens=True) for o in output]
-            outputs.extend(output)
+        for prompt in tqdm(prompts):
+            tokenized_batch = tokenizer(prompt, return_tensors="pt", padding=True)
+            tokenized_batch = {k: v.to(model.device) for k, v in tokenized_batch.items()}
+            avg_prompt_len.append(tokenized_batch["input_ids"].shape[1])
+
+            output = model.generate(
+                **tokenized_batch,
+                do_sample=False,
+                max_new_tokens=args.max_tokens,
+                use_cache=True,
+                pad_token_id=tokenizer.eos_token_id,
+            )
+            OUTPUT_LENGTHS.append(output.shape[1])
+            output = [tokenizer.decode(o[avg_prompt_len[-1]:], skip_special_tokens=True) for o in output]
+            outputs.extend(output)
 
     outputs = [[trim_output(o)] for o in outputs]
     print(f"Average prompt length: {sum(avg_prompt_len) / len(avg_prompt_len):.2f}")
     print(f"Average length: {sum(OUTPUT_LENGTHS) / len(OUTPUT_LENGTHS):.2f}")
-    print(f"Prompts with eviction: {prompts_with_eviction}/{len(OUTPUT_LENGTHS)}")
 
     predictions = [
         {
@@ -277,17 +283,17 @@ def main(args):
     parser.add_argument("--max_examples", type=int, default=None)
     parser.add_argument("--start", type=int, default=None)
    parser.add_argument("--save_dir", type=str, default="results")
-    parser.add_argument("--use_chat_format", action="store_true")
-    parser.add_argument("--max_tokens", type=int, default=512)
-    parser.add_argument("--remove_bos", action="store_true", default=True)
+    parser.add_argument("--max_tokens", type=int, default=5000)
+    parser.add_argument("--omit_chat_template", action="store_true")
+    parser.add_argument("--keep_bos", action="store_true")
 
     add_attention_args(parser)
     add_token_eviction_args(parser)
     args = parser.parse_args()
 
     args.save_dir = os.path.join(args.save_dir, args.dataset)
-    if args.remove_bos:
-        args.save_dir = args.save_dir + "_remove_bos"
+    if args.keep_bos:
+        args.save_dir = args.save_dir + "_keep_bos"
 
     if args.max_examples or args.start:
         start = 0 if args.start is None else args.start

modules/genai_optimizations/benchmarks/reasoning_parser.py

Lines changed: 1 addition & 1 deletion
@@ -317,7 +317,7 @@ def strip_string(string, skip_unit=False):
     string = string.replace("infinity", "\\infty")
     if "\\infty" not in string:
         string = string.replace("inf", "\\infty")
-    string = string.replace("+\\inity", "\\infty")
+    string = string.replace("\\inity", "\\infty")
 
     # and
     string = string.replace("and", "")

modules/genai_optimizations/genai_opt/sparse_attention.py

Lines changed: 89 additions & 2 deletions
@@ -16,6 +16,7 @@
 from transformers.cache_utils import Cache
 from transformers.models.llama.modeling_llama import repeat_kv
 from transformers.models.llama.modeling_llama import apply_rotary_pos_emb
+from transformers.models.phi3.modeling_phi3 import apply_rotary_pos_emb as phi3_apply_rotary_pos_emb
 from transformers.models.qwen2_vl.modeling_qwen2_vl import apply_multimodal_rotary_pos_emb
 
 from block_sparse_attn import block_sparse_attn_func
@@ -619,7 +620,7 @@ def qwen2_vl_forward(
         value_states=value_states,
         attention_mask=attention_mask,
         scaling=module.scaling,
-        dropout_p=module.attention_dropout if module.training else 0.0,
+        dropout=module.attention_dropout if module.training else 0.0,
     )
 
     attn_output = attn_output.reshape(bsz, q_len, -1).contiguous()
@@ -657,7 +658,91 @@ def llama_forward(
         key_states=key_states,
         value_states=value_states,
         attention_mask=attention_mask,
-        dropout_p=module.attention_dropout if module.training else 0.0,
+        dropout=module.attention_dropout if module.training else 0.0,
+        scaling=module.scaling,
+    )
+
+    attn_output = attn_output.reshape(*input_shape, -1).contiguous()
+    attn_output = module.o_proj(attn_output)
+    return attn_output, attn_weights
+
+
+def qwen3_forward(
+    module,
+    hidden_states: torch.Tensor,
+    position_embeddings: tuple[torch.Tensor, torch.Tensor],
+    attention_mask: Optional[torch.Tensor],
+    past_key_values: Optional[Cache] = None,
+    cache_position: Optional[torch.LongTensor] = None,
+    **kwargs,
+) -> tuple[torch.Tensor, Optional[torch.Tensor]]:
+    input_shape = hidden_states.shape[:-1]
+    hidden_shape = (*input_shape, -1, module.head_dim)
+
+    query_states = module.q_norm(module.q_proj(hidden_states).view(hidden_shape)).transpose(1, 2)
+    key_states = module.k_norm(module.k_proj(hidden_states).view(hidden_shape)).transpose(1, 2)
+    value_states = module.v_proj(hidden_states).view(hidden_shape).transpose(1, 2)
+
+    cos, sin = position_embeddings
+    query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
+
+    if past_key_values is not None:
+        # sin and cos are specific to RoPE models; cache_position needed for the static cache
+        cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
+        key_states, value_states = past_key_values.update(key_states, value_states, module.layer_idx, cache_kwargs)
+
+    attn_output, attn_weights = module.attn_interface(
+        module,
+        query_states=query_states,
+        key_states=key_states,
+        value_states=value_states,
+        attention_mask=attention_mask,
+        dropout=module.attention_dropout if module.training else 0.0,
+        scaling=module.scaling,
+    )
+
+    attn_output = attn_output.reshape(*input_shape, -1).contiguous()
+    attn_output = module.o_proj(attn_output)
+    return attn_output, attn_weights
+
+
+def phi_forward(
+    module,
+    hidden_states: torch.Tensor,
+    position_embeddings: tuple[torch.Tensor, torch.Tensor],
+    attention_mask: Optional[torch.Tensor],
+    past_key_values: Optional[Cache] = None,
+    cache_position: Optional[torch.LongTensor] = None,
+    **kwargs,
+) -> tuple[torch.Tensor, Optional[torch.Tensor], Optional[tuple[torch.Tensor]]]:
+    input_shape = hidden_states.shape[:-1]
+    hidden_shape = (*input_shape, -1, module.head_dim)
+
+    qkv = module.qkv_proj(hidden_states)
+    query_pos = module.config.num_attention_heads * module.head_dim
+    query_states = qkv[..., :query_pos]
+    key_states = qkv[..., query_pos : query_pos + module.num_key_value_heads * module.head_dim]
+    value_states = qkv[..., query_pos + module.num_key_value_heads * module.head_dim :]
+
+    query_states = query_states.view(hidden_shape).transpose(1, 2)
+    key_states = key_states.view(hidden_shape).transpose(1, 2)
+    value_states = value_states.view(hidden_shape).transpose(1, 2)
+
+    cos, sin = position_embeddings
+    query_states, key_states = phi3_apply_rotary_pos_emb(query_states, key_states, cos, sin)
+
+    if past_key_values is not None:
+        # sin and cos are specific to RoPE models; cache_position needed for the static cache
+        cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
+        key_states, value_states = past_key_values.update(key_states, value_states, module.layer_idx, cache_kwargs)
+
+    attn_output, attn_weights = module.attn_interface(
+        module,
+        query_states=query_states,
+        key_states=key_states,
+        value_states=value_states,
+        attention_mask=attention_mask,
+        dropout=module.attention_dropout if module.training else 0.0,
         scaling=module.scaling,
     )
 
@@ -672,6 +757,8 @@ def llama_forward(
     "LlamaForCausalLM": llama_forward,
     "MistralForCausalLM": llama_forward,
     "Qwen2ForCausalLM": llama_forward,
+    "Qwen3ForCausalLM": qwen3_forward,
+    "Phi3ForCausalLM": phi_forward,
 }
 
 def get_custom_attn_forward(model: PreTrainedModel):
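For context on the registry change in this last hunk: Qwen3 and Phi-3-style architectures now dispatch to their own custom forwards. The snippet below is a hypothetical illustration of how such a registry entry might be bound onto a model's attention modules; the patching helper, the `attn_interface` wiring, and the layer layout are assumptions, not code from this commit.

```python
import types

from transformers import AutoModelForCausalLM


def patch_attention_forwards(model, custom_forward, attn_interface):
    """Bind a custom attention forward onto every decoder layer (illustrative only).

    Assumes the usual HF layout `model.model.layers[i].self_attn` and that the
    custom forwards above expect an `attn_interface` attribute on the module.
    """
    for layer in model.model.layers:
        attn = layer.self_attn
        attn.attn_interface = attn_interface          # eviction/sparse attention kernel
        attn.forward = types.MethodType(custom_forward, attn)
    return model


# Hypothetical usage: "Qwen3ForCausalLM" resolves to qwen3_forward via the registry,
# e.g. through get_custom_attn_forward(model).
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-1.7B")
# model = patch_attention_forwards(model, qwen3_forward, my_attention_fn)
```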
