Commit f8c585f

Add GenAI Optimizations module (#1002)
* Add GenAI Optimizations module
* Add gitignore
* Fix error with CUDA device
* Move args to utils
1 parent 3f4eeda commit f8c585f

File tree

9 files changed: 1135 additions & 0 deletions
`.gitignore`
Lines changed: 50 additions & 0 deletions

```gitignore
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# pyenv
.python-version

# dotenv
.env

# virtualenv
.venv
venv/
env*

# datasets
*.tar*
MileBench/

# VSCode
.vscode/
```
`README.md`
Lines changed: 46 additions & 0 deletions

# GenAI Optimizations

This module provides experimental optimizations for GenAI models in PyTorch. The goal is to improve efficiency and performance for generative AI tasks while minimizing accuracy loss. This is PoC code and is intended to be compatible with OpenVINO GenAI.

## Supported Generative AI Scenarios

- Visual language text generation

## Supported Generative AI Optimization Methods

- [**Visual Token Pruning**](./visual_token_pruning.py):
  Designed to accelerate inference in VLMs, where the number of input visual tokens is often significantly larger than the number of textual tokens. Pruning these tokens reduces first-token latency and overall FLOPs while preserving accuracy. In this repository, we implement a visual token pruning method called [CDPruner](https://arxiv.org/pdf/2506.10967), which maximizes the conditional diversity of the retained tokens. It can reduce FLOPs by 95% and CUDA latency by 78% while maintaining 94% of the original accuracy; a toy sketch of the selection idea follows below.
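For intuition, the sketch below illustrates the conditional-diversity idea behind CDPruner on random features: build a relevance-weighted similarity kernel over the visual tokens, then greedily select the subset that maximizes the DPP determinant. This is a toy illustration, not this module's implementation, and the way `theta` blends relevance into the kernel here is an assumption suggested by the CLI flags rather than the paper's exact formulation.

```python
import torch

def cdpruner_style_select(visual: torch.Tensor, text: torch.Tensor,
                          num_keep: int, theta: float = 0.5) -> torch.Tensor:
    """Toy conditional-DPP token selection (illustrative, not this module's API)."""
    v = torch.nn.functional.normalize(visual, dim=-1)   # (N, D) visual tokens
    t = torch.nn.functional.normalize(text, dim=-1)     # (T, D) instruction tokens
    sim = v @ v.T                                       # token-token cosine similarity
    rel = (v @ t.T).mean(dim=-1)                        # relevance to the instruction
    rel = (rel - rel.min()) / (rel.max() - rel.min() + 1e-6)
    q = theta * rel + (1.0 - theta)                     # assumed theta weighting
    kernel = q[:, None] * sim * q[None, :]              # conditional DPP kernel (PSD)

    # Fast greedy MAP inference for a DPP (Chen et al., 2018).
    n = kernel.shape[0]
    cis = torch.zeros(num_keep, n)
    gains = kernel.diagonal().clone()                   # marginal log-det gains
    keep: list[int] = []
    for i in range(num_keep):
        j = int(gains.argmax())
        keep.append(j)
        ei = (kernel[j] - cis[:i].T @ cis[:i, j]) / gains[j].clamp_min(1e-12).sqrt()
        cis[i] = ei
        gains = gains - ei.square()
        gains[j] = float("-inf")                        # never pick j again
    return torch.tensor(sorted(keep))                   # keep original token order

# Example: prune 576 LLaVA-style visual tokens down to 128.
idx = cdpruner_style_select(torch.randn(576, 64), torch.randn(12, 64), num_keep=128)
```

In the real pipeline, the retained indices would be used to gather the visual embeddings before they enter the language model.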
## Supported and tested models

Multimodal Large Language Models:

- [llava-hf/llava-1.5-7b-hf](https://huggingface.co/llava-hf/llava-1.5-7b-hf)
- [llava-hf/llava-v1.6-mistral-7b-hf](https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf)
- [Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)
- [Qwen/Qwen2-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct)

## Prerequisites

Before running the algorithms, ensure you have **Python 3.10+** installed and set up your environment.

### 1. Create and activate a virtual environment

```bash
python3 -m venv env
source env/bin/activate  # On Windows: env\Scripts\activate.bat
```

### 2. Installation

You can install the package directly from the repository (the URL is quoted so that the `&` is not treated as a shell operator):

```bash
pip install "git+https://github.com/openvinotoolkit/openvino_contrib.git#egg=genai_opt&subdirectory=modules/genai_optimizations"
```

Or install it locally with extra dependencies for benchmark support:

```bash
pip install ".[benchmarks]"
```
`examples/README.md`
Lines changed: 54 additions & 0 deletions

# Generative AI Models Optimization Examples

This folder provides examples for evaluating and optimizing Generative AI models across different scenarios.

<details>
<summary><b>Multimodal Large Language Models Optimization Example: MME Benchmark</b></summary>

This [example](./mmebench.py) demonstrates how to evaluate and optimize MLLMs using the [MME benchmark](https://arxiv.org/pdf/2306.13394), which measures both perception and cognition abilities across 14 subtasks. Its concise instruction design enables fair comparison of MLLMs without extensive prompt engineering.

Visual token pruning significantly accelerates inference in VLMs, where the number of input visual tokens is often much larger than the number of textual tokens. Pruning these tokens reduces first-token latency and overall FLOPs while preserving accuracy, as the quick calculation below illustrates.
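A back-of-the-envelope token budget (illustrative numbers for LLaVA-1.5, which encodes a 336x336 image with 14-pixel patches; Qwen models use dynamic resolution, so their counts vary):

```python
# Rough token budget for LLaVA-1.5 (336 px image, 14 px patches).
visual_tokens = (336 // 14) ** 2        # 24 * 24 = 576 visual tokens
num_keep_tokens = 128                   # the --num_keep_tokens budget below
print(f"retained: {num_keep_tokens / visual_tokens:.1%} of visual tokens")  # ~22.2%
```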
### Run Example

```bash
python mmebench.py \
    --subset artwork \
    --model Qwen/Qwen2.5-VL-3B-Instruct \
    --enable_visual_pruning \
    --num_keep_tokens 128 \
    --theta 0.5
```

This will automatically:

- Download the selected model and dataset
- Apply the visual token pruning algorithm
- Evaluate the model and report the score
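For comparison, plain unpruned generation with one of the supported models via Hugging Face transformers looks roughly like this. This is a minimal sketch following the standard Qwen2-VL model-card recipe; it does not use this module, and it assumes the `qwen-vl-utils` helper package is installed and that `image.jpg` exists locally:

```python
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

model_id = "Qwen/Qwen2-VL-2B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"  # device_map needs accelerate
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{"role": "user", "content": [
    {"type": "image", "image": "image.jpg"},           # placeholder local image
    {"type": "text", "text": "Describe this image."},
]}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
images, videos = process_vision_info(messages)
inputs = processor(text=[prompt], images=images, videos=videos,
                   padding=True, return_tensors="pt").to(model.device)

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(out[:, inputs.input_ids.shape[1]:],
                             skip_special_tokens=True)[0])
```

Timing this baseline against the pruned run above is a simple way to observe the first-token latency effect.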
</details>

<details>
<summary><b>Multimodal Large Language Models Optimization Example: MileBench</b></summary>

This [example](./milebench.py) demonstrates how to optimize MLLMs using an experimental visual token pruning algorithm. It leverages [MileBench](https://arxiv.org/pdf/2404.18532), a pioneering benchmark designed to rigorously evaluate the multimodal long-context capabilities of MLLMs. MileBench encompasses diverse tasks requiring both comprehension and generation, and introduces two distinct evaluation sets, diagnostic and realistic, that systematically assess a model's capacity for long-context adaptation and effective task completion.
### Run Example

```bash
python milebench.py \
    --subset WikiVQA \
    --model Qwen/Qwen2-VL-2B-Instruct \
    --enable_visual_pruning \
    --num_keep_tokens 64 \
    --theta 0.5
```

This will automatically:

- Download the selected model and dataset
- Apply the visual token pruning algorithm
- Evaluate the model and report the score

</details>
