This Gradio-based web application leverages the Qwen3-VL-4B-Instruct model from Alibaba's Qwen series for multimodal tasks involving images and text. It enables users to upload an image and perform various vision-language tasks, such as querying details, generating captions, detecting points of interest, or identifying bounding boxes for objects. The app includes visual annotations for point and detection tasks using the
supervisionlibrary. Powered by Hugging Face Transformers, PyTorch, and Gradio, this demo showcases the model's capabilities in real-time image understanding.
Important
note: remove kernels and flash_attn3 implementation if you are using it on non-hopper architecture gpus.
- Query: Ask open-ended questions about the image (e.g., "Count the total number of boats and describe the environment.").
- Caption: Generate image captions of varying lengths (e.g., short, detailed).
- Point: Detect and annotate 2D keypoints for specific elements (e.g., "The gun held by the person.").
- Detect: Identify and annotate bounding boxes for objects (e.g., "The headlight of the car.").
- Visual Annotations: Automatically overlays keypoints (red dots) or bounding boxes on the output image.
- Custom Theme: Steel-blue themed interface for a modern look.
- Examples: Pre-loaded sample images and prompts to get started quickly.
- GPU Acceleration: Optimized for CUDA devices if available.
The app processes images at a thumbnail resolution (512x512) for efficiency and supports JSON-formatted outputs for structured tasks.
- Python 3.10+ (recommended)
- pip >= 23.0.0
Install the dependencies using the provided requirements.txt or run the following:
For a full list, see requirements.txt:
git+https://github.com/huggingface/transformers.git@v4.57.6
git+https://github.com/huggingface/accelerate.git
git+https://github.com/huggingface/peft.git
transformers-stream-generator
huggingface_hub
qwen-vl-utils
sentencepiece
opencv-python
torch==2.8.0
torchvision
supervision
matplotlib
pdf2image
requests
pymupdf
kernels
hf_xet
spaces
pillow
gradio # - gradio@6.3.0
fpdf
timm
av
- Clone the repository:
git clone https://github.com/PRITHIVSAKTHIUR/Qwen-3VL-Multimodal-Understanding.git cd Qwen-3VL-Multimodal-Understanding - Install dependencies (as above).
- Download model weights automatically on first run (requires internet).
-
Run the app locally:
python app.py
This launches a Gradio interface at
http://127.0.0.1:7860. -
In the interface:
- Upload an image.
- Select a task category (Query, Caption, Point, Detect).
- Enter a prompt tailored to the category.
- Click "Process Image" to generate results.
-
Outputs:
- Text: Generated response or JSON (with copy button).
- Image: Annotated version if applicable (points or boxes).
| Category | Example Prompt | Expected Output |
|---|---|---|
| Query | "Count the total number of boats and describe the environment." | Descriptive text with counts. |
| Caption | "detailed" | A long, descriptive caption. |
| Point | "The gun held by the person." | JSON with normalized (0-1) coordinates; red dots on image. |
| Detect | "Headlight of the car." | JSON with bounding boxes; colored rectangles on image. |
Sample images are included in the examples/ folder for testing.
- Model: Qwen/Qwen3-VL-4B-Instruct (4B parameters, vision-language model).
- Processor: Handles chat templating and tokenization.
- Device: Auto-detects CUDA; falls back to CPU.
- Limitations:
- Max new tokens: 512.
- Coordinates normalized to [0, 1000] in model output, scaled to [0, 1] in app.
- No fine-tuning; relies on zero-shot prompting.
Feel free to fork the repo, submit issues, or pull requests. Contributions for new tasks, themes, or optimizations are welcome!