- **Visual Token Pruning**:
  Designed to accelerate inference in VLMs, where the number of input visual tokens is often significantly larger than the number of textual tokens. Pruning these tokens reduces first-token latency and overall FLOPs while preserving accuracy. This repository implements a visual token pruning method called [CDPruner](https://arxiv.org/pdf/2506.10967), which maximizes the conditional diversity of retained tokens. It can reduce FLOPs by 95% and CUDA latency by 78%, while maintaining 94% of the original accuracy.
- [**Sparse Attention**](./sparse_attention.py):
  Designed to accelerate the prefill stage in LLMs and MLLMs with long prompts, high-resolution images, or videos by attending only to the most relevant query-key blocks. This block-wise attention mechanism reduces memory usage and FLOPs while preserving model accuracy. Supported modes (a toy block-mask sketch follows this list):
  - **Tri-Shape Mode** – A static block-sparse attention pattern that preserves the initial tokens, local windows, and the final segment of the query, forming a triangular structure to capture critical tokens while maintaining instruction-following performance in both turn-0 and multi-request scenarios. Paper: https://arxiv.org/pdf/2412.10319
  - **XAttention Mode** – A dynamic block-sparse attention mechanism that accelerates inference by focusing computation on the most important regions of the attention matrix using antidiagonal block scoring, reducing FLOPs and memory usage without significant loss of accuracy. Paper: https://arxiv.org/pdf/2503.16428
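
For intuition, the sketch below shows how a Tri-Shape-style block mask could be constructed, together with a toy antidiagonal block-scoring helper in the spirit of XAttention. This is an illustrative sketch only, not the code in [sparse_attention.py](./sparse_attention.py): the function names, arguments, and defaults are invented for illustration, and a real implementation would operate on block-sparse kernels rather than a dense attention matrix.

```python
import torch


def tri_shape_block_mask(num_blocks: int, sink_blocks: int = 1,
                         window_blocks: int = 2, last_query_blocks: int = 1) -> torch.Tensor:
    """Toy Tri-Shape-style mask over (query block, key block) pairs.

    True means the block of attention scores is computed; False means it is skipped.
    """
    q = torch.arange(num_blocks)[:, None]  # query block index
    k = torch.arange(num_blocks)[None, :]  # key block index

    causal = k <= q                                   # prefill attention is causal
    sink = k < sink_blocks                            # always keep the initial tokens
    local = (q - k) < window_blocks                   # local sliding window
    last_rows = q >= num_blocks - last_query_blocks   # final query segment sees everything

    return causal & (sink | local | last_rows)


def antidiagonal_block_scores(attn_scores: torch.Tensor, block_size: int, stride: int = 4) -> torch.Tensor:
    """Toy XAttention-style importance proxy: sum strided antidiagonals of each block.

    `attn_scores` is a dense (L, L) matrix of pre-softmax scores, used here only for
    illustration; blocks with higher scores would be kept, the rest skipped.
    """
    num_blocks = attn_scores.shape[-1] // block_size
    length = num_blocks * block_size
    blocks = attn_scores[:length, :length].reshape(num_blocks, block_size, num_blocks, block_size)
    blocks = blocks.permute(0, 2, 1, 3)               # (q_block, k_block, block, block)
    flipped = torch.flip(blocks, dims=[-1])           # antidiagonals become diagonals
    scores = attn_scores.new_zeros(num_blocks, num_blocks)
    for offset in range(-(block_size - 1), block_size, stride):
        scores += torch.diagonal(flipped, offset=offset, dim1=-2, dim2=-1).sum(-1)
    return scores
```

For a 16-block prompt, `tri_shape_block_mask(16)` keeps the first key block, a two-block diagonal band, and the full row of the last query block, which is exactly the triangular pattern described above.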
The examples below are from `modules/genai_optimizations/benchmarks/README.md`:
This folder provides examples for evaluating and optimizing Generative AI models across different scenarios.
<details>
<summary><b>Large Language Models Optimization Example: LongBench</b></summary>
This [example](./longbench.py) demonstrates how to evaluate and optimize LLMs using [LongBench](https://arxiv.org/pdf/2308.14508), a bilingual, multi-task benchmark designed to assess long-context understanding. LongBench includes 21 datasets across six task categories (single-document QA, multi-document QA, summarization, few-shot learning, synthetic tasks, and code completion) in both English and Chinese.
Sparse attention speeds up the prefill stage in LLMs by attending only to the most relevant query-key blocks. Static patterns like Tri-Shape and dynamic mechanisms like XAttention reduce memory and computation without significant accuracy loss, enabling efficient handling of long prompts.
### Run Example
```bash
python longbench.py \
--subset samsum \
--model meta-llama/Llama-3.2-1B-Instruct \
--use_custom_attention \
--prefill_impl tri-shape
```
This will automatically:
- Download the selected model and dataset
- Apply sparse attention computation during the prefill stage
- Evaluate the model and report the score
</details>
<details>
<summary><b>Multimodal Large Language Models Optimization Example: MME Benchmark</b></summary>
This [example](./mmebench.py) demonstrates how to evaluate and optimize MLLMs using the [MME benchmark](https://arxiv.org/pdf/2306.13394), which measures both perception and cognition abilities across 14 subtasks. Its concise instruction design enables fair comparison of MLLMs without the need for extensive prompt engineering.
Visual token pruning enables significant acceleration of inference in VLMs, where the number of input visual tokens is often much larger than the number of textual tokens. By pruning these tokens, we reduce first-token latency and overall FLOPs while preserving accuracy.
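
To give a feel for how the retained tokens might be chosen, here is a minimal, self-contained sketch of conditional-diversity selection in the spirit of the CDPruner approach described at the top of this page: each visual token is scored by its relevance to the instruction, a relevance-conditioned similarity kernel is built, and tokens are picked greedily (DPP-MAP style) so the kept set stays both relevant and diverse. This is not the repository's implementation; the function name, the `theta` weighting, and all defaults are illustrative.

```python
import torch
import torch.nn.functional as F


def conditional_diversity_prune(visual_feats: torch.Tensor,
                                text_feats: torch.Tensor,
                                num_keep: int,
                                theta: float = 0.5) -> list[int]:
    """Illustrative conditional-diversity token selection (not the repository code).

    visual_feats: (N, D) visual token features
    text_feats:   (T, D) instruction token features
    num_keep:     number of visual tokens to retain
    theta:        relevance/diversity trade-off (illustrative weighting)
    """
    v = F.normalize(visual_feats.float(), dim=-1)
    t = F.normalize(text_feats.float(), dim=-1)

    # Relevance of each visual token to the instruction (best cosine match).
    relevance = (v @ t.T).max(dim=-1).values            # (N,)
    relevance = (1.0 - theta) + theta * relevance       # soften the conditioning by theta

    # Relevance-conditioned similarity kernel: L[i, j] = r_i * sim(i, j) * r_j.
    kernel = relevance[:, None] * (v @ v.T) * relevance[None, :]

    # Greedy DPP-MAP-style selection: repeatedly pick the token with the largest
    # conditional gain, then down-weight everything similar to what is already kept.
    n = kernel.shape[0]
    gains = kernel.diagonal().clone()                    # marginal gains
    basis = kernel.new_zeros(num_keep, n)                # incremental Cholesky rows
    keep: list[int] = []
    for step in range(num_keep):
        idx = int(torch.argmax(gains))
        keep.append(idx)
        ci = kernel[idx] - basis[:step].T @ basis[:step, idx]
        ci = ci / torch.sqrt(torch.clamp(gains[idx], min=1e-10))
        basis[step] = ci
        gains = gains - ci ** 2
        gains[idx] = float("-inf")                       # never re-select the same token
    return sorted(keep)                                  # keep the original token order
```

In the command below, `--num_keep_tokens` and `--theta` are the user-facing knobs that roughly correspond to the two parameters sketched here.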
Sparse attention speeds up the prefill stage in LLMs and MLLMs by attending only to the most relevant query-key blocks. Static patterns like Tri-Shape and dynamic mechanisms like XAttention reduce memory and computation without significant accuracy loss, enabling efficient handling of long prompts, high-resolution images, and multi-frame videos.
### Run Example
```bash
python mmebench.py \
--model Qwen/Qwen2.5-VL-3B-Instruct \
--enable_visual_pruning \
--num_keep_tokens 128 \
--theta 0.5 \
--use_custom_attention \
--prefill_impl x-attention
```
This will automatically:
- Download the selected model and dataset
- Apply the visual token pruning algorithm
- Apply sparse attention computation during the prefill stage
- Evaluate the model and report the score
</details>

The MileBench example (`milebench.py`) follows the same pattern, combining visual token pruning with sparse attention during the prefill stage:

```bash
python milebench.py \
--model Qwen/Qwen2-VL-2B-Instruct \
--enable_visual_pruning \
--num_keep_tokens 64 \
--theta 0.5 \
--use_custom_attention \
--prefill_impl tri-shape
```
This will automatically:
- Download the selected model and dataset
- Apply the visual token pruning algorithm
- Apply sparse attention computation during the prefill stage
- Evaluate the model and report the score

</details>