31 changes: 31 additions & 0 deletions notebooks/ministral-3/README.md
@@ -0,0 +1,31 @@
# Visual-language assistant with Ministral-3 and OpenVINO

Ministral-3 (Ministral-3-3B-Instruct-2512) is a lightweight, state-of-the-art multimodal model from Mistral AI, combining a 3.4B parameter language model with a 0.4B parameter vision encoder based on the Pixtral architecture. It is designed for efficient visual-language understanding tasks.

**Key Features of Ministral-3:**
* **Multimodal Understanding**: Combines text and vision capabilities in a compact 3B parameter model, enabling image understanding and visual question answering.
* **Long Context Support**: Supports up to 262,144 tokens with YaRN RoPE scaling for extended context processing.
* **Efficient Architecture**: Uses Grouped Query Attention (32 attention heads with 8 KV heads) for memory-efficient inference.
* **Pixtral Vision Encoder**: Employs a PixtralVisionModel with patch-based image processing and multi-modal projection for seamless vision-language integration.
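The memory benefit of Grouped Query Attention comes from caching K/V tensors for only 8 KV heads instead of all 32 attention heads. A minimal back-of-the-envelope sketch (the head dimension, layer count, and fp16 precision below are illustrative assumptions, not values from the model config; the head counts and 262,144-token context are from the feature list above):

```python
# Rough KV-cache sizing sketch for GQA vs. full multi-head attention.
# head_dim, num_layers, and bytes_per_elem are assumed illustrative values.
def kv_cache_bytes(num_kv_heads, head_dim=128, num_layers=32,
                   seq_len=262_144, bytes_per_elem=2):
    # 2x accounts for separate K and V tensors per layer
    return 2 * num_kv_heads * head_dim * num_layers * seq_len * bytes_per_elem

mha = kv_cache_bytes(num_kv_heads=32)  # hypothetical full MHA cache
gqa = kv_cache_bytes(num_kv_heads=8)   # GQA with 8 KV heads
print(f"GQA shrinks the KV cache by {mha / gqa:.0f}x at full context")
```

With only the KV-head count changed, the cache shrinks by exactly the ratio of attention heads to KV heads (32 / 8 = 4x), which is what makes long-context inference tractable on memory-constrained hardware.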

More details about the model can be found in the [model card](https://huggingface.co/mistralai/Ministral-3-3B-Instruct-2512) and the [Mistral AI documentation](https://docs.mistral.ai/).

In this tutorial, we consider how to convert and optimize the Ministral-3 model for creating a multimodal chatbot using [Optimum Intel](https://github.com/huggingface/optimum-intel). Additionally, we demonstrate how to apply model optimization techniques such as weight compression using [NNCF](https://github.com/openvinotoolkit/nncf).
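The conversion-plus-compression step can be sketched with the Optimum Intel CLI. This is an assumed invocation, not the notebook's exact command: the output directory name is hypothetical, and `--weight-format int4` is one typical NNCF weight-compression choice among several (`int8`, `fp16`):

```shell
# Export the model to OpenVINO IR with int4 weight compression (assumed flags;
# the output directory "ministral-3-ov" is a placeholder)
optimum-cli export openvino \
  --model mistralai/Ministral-3-3B-Instruct-2512 \
  --weight-format int4 \
  ministral-3-ov
```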

## Notebook contents
The tutorial consists of the following steps:

- Install requirements
- Convert and Optimize model
- Prepare OpenVINO GenAI Inference Pipeline
- Run OpenVINO GenAI model inference
- Launch Interactive demo

In this demonstration, you'll create an interactive chatbot that can answer questions about provided image content.

## Installation instructions
This is a self-contained example that relies solely on its own code.<br>
We recommend running the notebook in a virtual environment. You only need a Jupyter server to start.
For details, please refer to [Installation Guide](../../README.md).

<img referrerpolicy="no-referrer-when-downgrade" src="https://static.scarf.sh/a.png?x-pxid=5b5a4db0-7875-4bfb-bdbd-01698b5b1a77&file=notebooks/ministral-3/README.md" />
77 changes: 77 additions & 0 deletions notebooks/ministral-3/gradio_helper.py
@@ -0,0 +1,77 @@
from pathlib import Path
import gradio as gr

from PIL import Image
import requests
from threading import Thread
import inspect
from transformers import TextIteratorStreamer

example_image_urls = [
    (
        "https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/1d6a0188-5613-418d-a1fd-4560aae1d907",
        "bee.jpg",
    ),
    (
        "https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/6cc7feeb-0721-4b5d-8791-2576ed9d2863",
        "baklava.png",
    ),
]

# Download the example images once so the demo can reference them locally
for url, file_name in example_image_urls:
    if not Path(file_name).exists():
        Image.open(requests.get(url, stream=True, timeout=30).raw).save(file_name)


def make_demo(model, processor):
    # Newer gradio releases expose extra ChatInterface buttons; detect support at runtime
    has_additional_buttons = "undo_button" in inspect.signature(gr.ChatInterface.__init__).parameters

    def bot_streaming(message, history):
        print(f"message is - {message}")
        print(f"history is - {history}")

        # message is a dict in recent gradio versions and an object in older ones
        files = message["files"] if isinstance(message, dict) else message.files
        message_text = message["text"] if isinstance(message, dict) else message.text

        image = None
        if files:
            last_file = files[-1]
            if isinstance(last_file, dict):
                image = last_file["path"]
            elif isinstance(last_file, (str, Path)):
                image = last_file
            else:
                # gradio FileData-like object
                image = last_file.path
        if image is not None:
            image = Image.open(image).convert("RGB")
            # Resize large images to keep patch count manageable
            if max(image.size) > 512:
                image.thumbnail((512, 512))

        inputs = model.preprocess_inputs(text=message_text, image=image, processor=processor)

        streamer = TextIteratorStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True)
        generation_kwargs = dict(inputs, streamer=streamer, max_new_tokens=128, do_sample=False)

        # Run generation in a background thread so tokens can be streamed as they arrive
        thread = Thread(target=model.generate, kwargs=generation_kwargs)
        thread.start()

        buffer = ""
        for new_text in streamer:
            buffer += new_text
            yield buffer

    additional_buttons = {}
    if has_additional_buttons:
        additional_buttons = {"undo_button": None, "retry_button": None}
    demo = gr.ChatInterface(
        fn=bot_streaming,
        title="Ministral-3 OpenVINO Demo",
        examples=[
            {"text": "What is on the flower?", "files": ["./bee.jpg"]},
            {"text": "How to make this pastry?", "files": ["./baklava.png"]},
        ],
        stop_btn=None,
        multimodal=True,
        **additional_buttons,
    )
    return demo