Commit f8c585f

Add GenAI Optimizations module (#1002)
* Add GenAI Optimizations module
* Add gitignore
* Fix error with CUDA device
* Move args to utils
1 parent 3f4eeda commit f8c585f

File tree

9 files changed: 1135 additions & 0 deletions
`.gitignore`
Lines changed: 50 additions & 0 deletions

```gitignore
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# pyenv
.python-version

# dotenv
.env

# virtualenv
.venv
venv/
env*

# datasets
*.tar*
MileBench/

# VSCode
.vscode/
```
`README.md`
Lines changed: 46 additions & 0 deletions

# GenAI Optimizations

This module provides experimental optimizations for GenAI models in PyTorch. The goal is to improve efficiency and performance for generative AI tasks while minimizing accuracy loss. This is PoC code and is intended to be compatible with OpenVINO GenAI.

## Supported Generative AI Scenarios

- Visual language text generation

## Supported Generative AI Optimization Methods

- [**Visual Token Pruning**](./visual_token_pruning.py):
  Designed to accelerate inference in VLMs, where the number of input visual tokens is often significantly larger than the number of textual tokens. Pruning these tokens reduces first-token latency and overall FLOPs while preserving accuracy. In this repository, we implement a visual token pruning method called [CDPruner](https://arxiv.org/pdf/2506.10967), which maximizes the conditional diversity of the retained tokens. It can reduce FLOPs by 95% and CUDA latency by 78% while maintaining 94% of the original accuracy; a toy sketch of the selection idea follows below.
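For intuition, the sketch below illustrates the conditional-diversity idea behind CDPruner on random features: build a relevance-weighted similarity kernel over the visual tokens, then greedily select the subset that maximizes the DPP determinant. This is a toy illustration, not this module's implementation, and the way `theta` blends relevance into the kernel here is an assumption suggested by the CLI flags rather than the paper's exact formulation.

```python
import torch

def cdpruner_style_select(visual: torch.Tensor, text: torch.Tensor,
                          num_keep: int, theta: float = 0.5) -> torch.Tensor:
    """Toy conditional-DPP token selection (illustrative, not this module's API)."""
    v = torch.nn.functional.normalize(visual, dim=-1)   # (N, D) visual tokens
    t = torch.nn.functional.normalize(text, dim=-1)     # (T, D) instruction tokens
    sim = v @ v.T                                       # token-token cosine similarity
    rel = (v @ t.T).mean(dim=-1)                        # relevance to the instruction
    rel = (rel - rel.min()) / (rel.max() - rel.min() + 1e-6)
    q = theta * rel + (1.0 - theta)                     # assumed theta weighting
    kernel = q[:, None] * sim * q[None, :]              # conditional DPP kernel (PSD)

    # Fast greedy MAP inference for a DPP (Chen et al., 2018).
    n = kernel.shape[0]
    cis = torch.zeros(num_keep, n)
    gains = kernel.diagonal().clone()                   # marginal log-det gains
    keep: list[int] = []
    for i in range(num_keep):
        j = int(gains.argmax())
        keep.append(j)
        ei = (kernel[j] - cis[:i].T @ cis[:i, j]) / gains[j].clamp_min(1e-12).sqrt()
        cis[i] = ei
        gains = gains - ei.square()
        gains[j] = float("-inf")                        # never pick j again
    return torch.tensor(sorted(keep))                   # keep original token order

# Example: prune 576 LLaVA-style visual tokens down to 128.
idx = cdpruner_style_select(torch.randn(576, 64), torch.randn(12, 64), num_keep=128)
```

In the real pipeline, the retained indices would be used to gather the visual embeddings before they enter the language model.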
## Supported and tested models

Multimodal Large Language Models:

- [llava-hf/llava-1.5-7b-hf](https://huggingface.co/llava-hf/llava-1.5-7b-hf)
- [llava-hf/llava-v1.6-mistral-7b-hf](https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf)
- [Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)
- [Qwen/Qwen2-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct)

## Prerequisites

Before running the algorithms, ensure you have **Python 3.10+** installed and set up your environment.

### 1. Create and activate a virtual environment

```bash
python3 -m venv env
source env/bin/activate  # On Windows: env\Scripts\activate.bat
```

### 2. Installation

You can install the package directly from the repository (the URL is quoted so that the `&` is not treated as a shell operator):

```bash
pip install "git+https://github.com/openvinotoolkit/openvino_contrib.git#egg=genai_opt&subdirectory=modules/genai_optimizations"
```

Or install it locally with extra dependencies for benchmark support:

```bash
pip install ".[benchmarks]"
```
`examples/README.md`
Lines changed: 54 additions & 0 deletions

# Generative AI Models Optimization Examples

This folder provides examples for evaluating and optimizing Generative AI models across different scenarios.

<details>
<summary><b>Multimodal Large Language Models Optimization Example: MME Benchmark</b></summary>

This [example](./mmebench.py) demonstrates how to evaluate and optimize MLLMs using the [MME benchmark](https://arxiv.org/pdf/2306.13394), which measures both perception and cognition abilities across 14 subtasks. Its concise instruction design enables fair comparison of MLLMs without extensive prompt engineering.

Visual token pruning significantly accelerates inference in VLMs, where the number of input visual tokens is often much larger than the number of textual tokens. Pruning these tokens reduces first-token latency and overall FLOPs while preserving accuracy, as the quick calculation below illustrates.
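A back-of-the-envelope token budget (illustrative numbers for LLaVA-1.5, which encodes a 336x336 image with 14-pixel patches; Qwen models use dynamic resolution, so their counts vary):

```python
# Rough token budget for LLaVA-1.5 (336 px image, 14 px patches).
visual_tokens = (336 // 14) ** 2        # 24 * 24 = 576 visual tokens
num_keep_tokens = 128                   # the --num_keep_tokens budget below
print(f"retained: {num_keep_tokens / visual_tokens:.1%} of visual tokens")  # ~22.2%
```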
### Run Example

```bash
python mmebench.py \
    --subset artwork \
    --model Qwen/Qwen2.5-VL-3B-Instruct \
    --enable_visual_pruning \
    --num_keep_tokens 128 \
    --theta 0.5
```

This will automatically:

- Download the selected model and dataset
- Apply the visual token pruning algorithm
- Evaluate the model and report the score
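For comparison, plain unpruned generation with one of the supported models via Hugging Face transformers looks roughly like this. This is a minimal sketch following the standard Qwen2-VL model-card recipe; it does not use this module, and it assumes the `qwen-vl-utils` helper package is installed and that `image.jpg` exists locally:

```python
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

model_id = "Qwen/Qwen2-VL-2B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"  # device_map needs accelerate
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{"role": "user", "content": [
    {"type": "image", "image": "image.jpg"},           # placeholder local image
    {"type": "text", "text": "Describe this image."},
]}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
images, videos = process_vision_info(messages)
inputs = processor(text=[prompt], images=images, videos=videos,
                   padding=True, return_tensors="pt").to(model.device)

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(out[:, inputs.input_ids.shape[1]:],
                             skip_special_tokens=True)[0])
```

Timing this baseline against the pruned run above is a simple way to observe the first-token latency effect.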
</details>

<details>
<summary><b>Multimodal Large Language Models Optimization Example: MileBench</b></summary>

This [example](./milebench.py) demonstrates how to optimize MLLMs using an experimental visual token pruning algorithm. It leverages [MileBench](https://arxiv.org/pdf/2404.18532), a pioneering benchmark designed to rigorously evaluate the multimodal long-context capabilities of MLLMs. MileBench encompasses diverse tasks requiring both comprehension and generation, and introduces two distinct evaluation sets, diagnostic and realistic, that systematically assess a model's capacity for long-context adaptation and effective task completion.
### Run Example

```bash
python milebench.py \
    --subset WikiVQA \
    --model Qwen/Qwen2-VL-2B-Instruct \
    --enable_visual_pruning \
    --num_keep_tokens 64 \
    --theta 0.5
```

This will automatically:

- Download the selected model and dataset
- Apply the visual token pruning algorithm
- Evaluate the model and report the score

</details>
