Qwen-3VL: Multimodal Understanding

This Gradio-based web application leverages the Qwen3-VL-4B-Instruct model from Alibaba's Qwen series for multimodal tasks involving images and text. It enables users to upload an image and perform various vision-language tasks, such as querying details, generating captions, detecting points of interest, or identifying bounding boxes for objects. The app includes visual annotations for point and detection tasks using the supervision library. Powered by Hugging Face Transformers, PyTorch, and Gradio, this demo showcases the model's capabilities in real-time image understanding.

Note

HF Demo: https://huggingface.co/spaces/prithivMLmods/Qwen3-VL-HF-Demo

Screenshot 2025-11-18 at 17-20-08 Qwen3 VL HF Demo - a Hugging Face Space by prithivMLmods

Screenshot 2025-11-18 at 17-04-03 Qwen3 VL HF Demo - a Hugging Face Space by prithivMLmods

Important

note: remove kernels and flash_attn3 implementation if you are using it on non-hopper architecture gpus.

Features

Query: Ask open-ended questions about the image (e.g., "Count the total number of boats and describe the environment.").
Caption: Generate image captions of varying lengths (e.g., short, detailed).
Point: Detect and annotate 2D keypoints for specific elements (e.g., "The gun held by the person.").
Detect: Identify and annotate bounding boxes for objects (e.g., "The headlight of the car.").
Visual Annotations: Automatically overlays keypoints (red dots) or bounding boxes on the output image.
Custom Theme: Steel-blue themed interface for a modern look.
Examples: Pre-loaded sample images and prompts to get started quickly.
GPU Acceleration: Optimized for CUDA devices if available.

The app processes images at a thumbnail resolution (512x512) for efficiency and supports JSON-formatted outputs for structured tasks.

Installation

Prerequisites

Python 3.10+ (recommended)
pip >= 23.0.0

Requirements

Install the dependencies using the provided requirements.txt or run the following:

For a full list, see requirements.txt:

git+https://github.com/huggingface/transformers.git@v4.57.6
git+https://github.com/huggingface/accelerate.git
git+https://github.com/huggingface/peft.git
transformers-stream-generator
huggingface_hub
qwen-vl-utils
sentencepiece
opencv-python
torch==2.8.0
torchvision
supervision
matplotlib
pdf2image
requests
pymupdf
kernels
hf_xet
spaces
pillow
gradio # - gradio@6.3.0
fpdf
timm
av

Setup

Clone the repository:

git clone https://github.com/PRITHIVSAKTHIUR/Qwen-3VL-Multimodal-Understanding.git
cd Qwen-3VL-Multimodal-Understanding

Install dependencies (as above).
Download model weights automatically on first run (requires internet).

Usage

Run the app locally:
```
python app.py
```
This launches a Gradio interface at http://127.0.0.1:7860.
In the interface:
- Upload an image.
- Select a task category (Query, Caption, Point, Detect).
- Enter a prompt tailored to the category.
- Click "Process Image" to generate results.
Outputs:
- Text: Generated response or JSON (with copy button).
- Image: Annotated version if applicable (points or boxes).

Example Prompts

Category	Example Prompt	Expected Output
Query	"Count the total number of boats and describe the environment."	Descriptive text with counts.
Caption	"detailed"	A long, descriptive caption.
Point	"The gun held by the person."	JSON with normalized (0-1) coordinates; red dots on image.
Detect	"Headlight of the car."	JSON with bounding boxes; colored rectangles on image.

Sample images are included in the examples/ folder for testing.

Model Details

Model: Qwen/Qwen3-VL-4B-Instruct (4B parameters, vision-language model).
Processor: Handles chat templating and tokenization.
Device: Auto-detects CUDA; falls back to CPU.
Limitations:
- Max new tokens: 512.
- Coordinates normalized to [0, 1000] in model output, scaled to [0, 1] in app.
- No fine-tuning; relies on zero-shot prompting.

Contributing

Feel free to fork the repo, submit issues, or pull requests. Contributions for new tasks, themes, or optimizations are welcome!

Repository

GitHub: https://github.com/PRITHIVSAKTHIUR/Qwen-3VL-Multimodal-Understanding
Hugging Face Space: https://huggingface.co/spaces/prithivMLmods/Qwen3-VL-HF-Demo

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
examples		examples
LICENSE		LICENSE
README.md		README.md
app.py		app.py
pre-requirements.txt		pre-requirements.txt
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Qwen-3VL: Multimodal Understanding

Features

Installation

Prerequisites

Requirements

Setup

Usage

Example Prompts

Model Details

Contributing

Repository

About

Uh oh!

Contributors 1

Languages

Folders and files

Latest commit

History

Repository files navigation

Qwen-3VL: Multimodal Understanding

Features

Installation

Prerequisites

Requirements

Setup

Usage

Example Prompts

Model Details

Contributing

Repository

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Contributors 1

Languages