Commit 7462a45

[GenAI] Support Sparse Attention (#1004)
* Add gitignore
* fix error with cuda device
* [GenAI] Support Sparse Attention
* Add LongBench
1 parent c8a0176 commit 7462a45

File tree

9 files changed: +1206 −13 lines


modules/genai_optimizations/README.md

Lines changed: 16 additions & 1 deletion
@@ -4,15 +4,30 @@ This module provides experimental optimizations for GenAI models in PyTorch. The

## Supported Generative AI Scenarios

- Text generation using LLMs
- Visual language text generation

## Supported Generative AI Optimization Methods

- [**Visual Token Pruning**](./visual_token_pruning.py):
  Designed to accelerate inference in VLMs, where the number of input visual tokens is often significantly larger than that of textual tokens. Pruning these tokens reduces first-token latency and overall FLOPs while preserving accuracy. In this repository, we implement a visual token pruning method called [CDPruner](https://arxiv.org/pdf/2506.10967), which maximizes the conditional diversity of retained tokens. It can reduce FLOPs by 95% and CUDA latency by 78%, while maintaining 94% of the original accuracy.

- [**Sparse Attention**](./sparse_attention.py):
  Designed to accelerate the prefill stage in LLMs and MLLMs with long prompts, high-resolution images, or videos by attending only to the most relevant query-key blocks. This block-wise attention mechanism reduces memory usage and FLOPs while preserving model accuracy. Supported modes:

  - **Tri-Shape Mode** – A static block-sparse attention pattern that preserves the initial tokens, local windows, and the final segment of the query, forming a triangular structure that captures critical tokens while maintaining instruction-following performance in both turn-0 and multi-request scenarios. Paper: https://arxiv.org/pdf/2412.10319
  - **XAttention Mode** – A dynamic block-sparse attention mechanism that accelerates inference by focusing computation on the most important regions of the attention matrix using antidiagonal block scoring, reducing FLOPs and memory usage without significant loss of accuracy. Paper: https://arxiv.org/pdf/2503.16428
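As a rough illustration of the Tri-Shape pattern described above, the following NumPy sketch builds a block-level attention mask. The function and parameter names (`sink_blocks`, `local_blocks`, `last_blocks`) are illustrative, not the module's actual API:

```python
import numpy as np

def tri_shape_block_mask(n_blocks, sink_blocks=1, local_blocks=2, last_blocks=1):
    """Illustrative Tri-Shape pattern over an (n_blocks x n_blocks) grid of
    query-key blocks: every query block attends to the initial "sink" blocks
    and a local causal window, and the last query blocks attend to all
    preceding key blocks, yielding the triangular shape."""
    mask = np.zeros((n_blocks, n_blocks), dtype=bool)
    for q in range(n_blocks):
        mask[q, :sink_blocks] = True                          # initial tokens
        mask[q, max(0, q - local_blocks + 1):q + 1] = True    # local window
    mask[n_blocks - last_blocks:, :] = True                   # final query segment
    mask &= np.tril(np.ones((n_blocks, n_blocks), dtype=bool))  # keep causality
    return mask
```

With `tri_shape_block_mask(8)`, only the first key-block column, a diagonal band, and the last query row survive, so most off-band blocks can be skipped during prefill.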
## Supported and tested models

Large Language Models:

- [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct)
- [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
- [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)
- [Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct)
- [Qwen/Qwen2-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct)
- [mistralai/Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3)

Multimodal Large Language Models:

- [llava-hf/llava-1.5-7b-hf](https://huggingface.co/llava-hf/llava-1.5-7b-hf)
@@ -33,7 +48,7 @@ source env/bin/activate # On Windows: env\Scripts\activate.bat

### 2. Installation

You can install the package directly from the repository. To avoid running out of memory during the build, you can limit the number of parallel build jobs with `MAX_JOBS=4`:

```bash
pip install "git+https://github.com/openvinotoolkit/openvino_contrib.git#egg=genai_opt&subdirectory=modules/genai_optimizations"
```

(The URL is quoted so the `&` in the fragment is not interpreted by the shell.)

modules/genai_optimizations/benchmarks/README.md

Lines changed: 34 additions & 2 deletions
@@ -3,13 +3,39 @@

This folder provides examples for evaluating and optimizing Generative AI models across different scenarios.

<details>
<summary><b>Large Language Models Optimization Example: LongBench</b></summary>

This [example](./longbench.py) demonstrates how to evaluate and optimize LLMs using [LongBench](https://arxiv.org/pdf/2308.14508), a bilingual, multi-task benchmark designed to assess long-context understanding. LongBench includes 21 datasets across six task categories (single-document QA, multi-document QA, summarization, few-shot learning, synthetic reasoning, and code completion) in both English and Chinese.

Sparse attention speeds up the prefill stage in LLMs by attending only to the most relevant query-key blocks. Static patterns like Tri-Shape and dynamic mechanisms like XAttention reduce memory and computation without significant accuracy loss, enabling efficient handling of long prompts.

### Run Example

```bash
python longbench.py \
    --subset samsum \
    --model meta-llama/Llama-3.2-1B-Instruct \
    --use_custom_attention \
    --prefill_impl tri-shape
```

This will automatically:

- Download the selected model and dataset
- Apply sparse attention computation during the prefill stage
- Evaluate the model and report the score

</details>
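To make the prefill mechanism concrete, here is a minimal single-head NumPy sketch of how a block-level mask restricts attention computation. This is an illustration only, not the implementation in `sparse_attention.py`:

```python
import numpy as np

def block_sparse_attention(q, k, v, block_mask, block_size):
    """Single-head attention restricted to the query-key blocks allowed by
    `block_mask`, a boolean [n_q_blocks, n_k_blocks] grid. In this sketch the
    sequence length must be a multiple of block_size, and every query block
    must keep at least one key block (e.g. its own diagonal block)."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    # Expand the block-level mask to token level and hide pruned blocks.
    token_mask = np.kron(block_mask,
                         np.ones((block_size, block_size))).astype(bool)
    scores = np.where(token_mask, scores, -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    probs = np.exp(scores)                         # masked entries become 0
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ v
```

With a full (all-True) block mask this reduces to dense softmax attention; a sparser mask skips whole blocks of the score matrix, which is where the FLOP and memory savings come from.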
<details>
<summary><b>Multimodal Large Language Models Optimization Example: MME Benchmark</b></summary>

This [example](./mmebench.py) demonstrates how to evaluate and optimize MLLMs using the [MME benchmark](https://arxiv.org/pdf/2306.13394), which measures both perception and cognition abilities across 14 subtasks. Its concise instruction design enables fair comparison of MLLMs without the need for extensive prompt engineering.

Visual token pruning enables significant acceleration of inference in VLMs, where the number of input visual tokens is often much larger than the number of textual tokens. By pruning these tokens, we reduce first-token latency and overall FLOPs while preserving accuracy.

Sparse attention speeds up the prefill stage in LLMs and MLLMs by attending only to the most relevant query-key blocks. Static patterns like Tri-Shape and dynamic mechanisms like XAttention reduce memory and computation without significant accuracy loss, enabling efficient handling of long prompts, high-resolution images, and multi-frame videos.

### Run Example

@@ -18,12 +44,15 @@

```bash
python mmebench.py \
    --model Qwen/Qwen2.5-VL-3B-Instruct \
    --enable_visual_pruning \
    --num_keep_tokens 128 \
    --theta 0.5 \
    --use_custom_attention \
    --prefill_impl x-attention
```

This will automatically:

- Download the selected model and dataset
- Apply the visual token pruning algorithm
- Apply sparse attention computation during the prefill stage
- Evaluate the model and report the score

</details>
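The antidiagonal block scoring behind the `x-attention` prefill mode can be sketched as follows. This is a simplified NumPy illustration (function and parameter names are made up for the example): it assumes `probs` holds a row-softmaxed attention matrix, estimates each block's importance from a single antidiagonal sum instead of the full block, and keeps the fewest key blocks per query block that reach a target fraction of the estimated mass:

```python
import numpy as np

def xattention_block_select(probs, block_size, keep=0.9):
    """Simplified antidiagonal block scoring: rate each (query-block,
    key-block) pair by the sum over the block's antidiagonal, then keep the
    highest-rated key blocks per query block until their estimated mass
    reaches `keep` of the row total. `probs` is a row-softmaxed [n, n]
    attention matrix with n divisible by block_size."""
    n = probs.shape[0] // block_size
    est = np.zeros((n, n))
    for qb in range(n):
        for kb in range(n):
            blk = probs[qb * block_size:(qb + 1) * block_size,
                        kb * block_size:(kb + 1) * block_size]
            est[qb, kb] = np.fliplr(blk).trace()   # antidiagonal sum
    mask = np.zeros((n, n), dtype=bool)
    for qb in range(n):
        target = keep * est[qb].sum()
        total = 0.0
        for kb in np.argsort(est[qb])[::-1]:       # most important first
            mask[qb, kb] = True
            total += est[qb, kb]
            if total >= target:
                break
    return mask
```

The point of the antidiagonal probe is cost: it inspects `block_size` entries per block instead of `block_size**2`, yet every row and column of the block intersects the probe, so a high-attention region inside the block is unlikely to be missed.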
@@ -42,13 +71,16 @@

```bash
python milebench.py \
    --model Qwen/Qwen2-VL-2B-Instruct \
    --enable_visual_pruning \
    --num_keep_tokens 64 \
    --theta 0.5 \
    --use_custom_attention \
    --prefill_impl tri-shape
```

This will automatically:

- Download the selected model and dataset
- Apply the visual token pruning algorithm
- Apply sparse attention computation during the prefill stage
- Evaluate the model and report the score

</details>
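For intuition about what `--enable_visual_pruning` with `--num_keep_tokens` does, here is a deliberately simplified greedy diversity-based selection over visual token features. This is only a rough analogue; CDPruner itself maximizes the *conditional* diversity of retained tokens as described in its paper, and the names below are invented for the sketch:

```python
import numpy as np

def greedy_diverse_tokens(features, num_keep):
    """Greedy farthest-point selection over L2-normalized token features:
    seed with the token most aligned with the mean feature, then repeatedly
    add the token least similar to anything already kept. A simplified
    stand-in for diversity-maximizing visual token pruning, not CDPruner's
    actual algorithm."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T                                   # cosine similarities
    keep = [int(np.argmax(f @ f.mean(axis=0)))]     # most "central" token
    while len(keep) < num_keep:
        nearest = sim[:, keep].max(axis=1)          # similarity to kept set
        nearest[keep] = np.inf                      # never reselect
        keep.append(int(np.argmin(nearest)))        # farthest from kept set
    return sorted(keep)
```

Keeping a small, mutually dissimilar subset of visual tokens is what lets the benchmarks above run with `--num_keep_tokens 64` or `128` instead of the full visual token count, cutting first-token latency and FLOPs.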
