AI Playground for Vision & Language

Python PyTorch Hugging Face Streamlit

An interactive, multi-page web application that serves as a powerful toolkit for multi-modal AI analysis. This project integrates several state-of-the-art models from the Hugging Face ecosystem to analyze images from multiple perspectives, demonstrating a comprehensive understanding of both Computer Vision and Natural Language Processing.


Core Features & Application Pages

The application is architected as a modular, multi-page Streamlit app, where each page is dedicated to a specific AI task. This allows for a clean user experience and a scalable codebase.

1. Image Captioning & NLP Analysis

This page forms the core of the language analysis (a code sketch follows the list). Upon uploading an image, the application:

  • Generates a descriptive caption using Salesforce's BLIP model.
  • Performs Sentiment Analysis on the caption to determine if the tone is positive or negative.
  • Conducts Named Entity Recognition (NER) to identify and extract entities like people, places, and organizations.
  • Offers interactive Zero-Shot Classification, allowing the user to classify the caption against custom, on-the-fly labels.
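
Below is a minimal sketch of how this flow can be wired together with Hugging Face pipelines and spaCy. The checkpoint names (Salesforce/blip-image-captioning-base and the pipeline defaults) and the file name example.jpg are illustrative assumptions, not necessarily the exact ones the app pins.

    # Illustrative captioning + NLP flow; checkpoint names are assumptions.
    import spacy
    from PIL import Image
    from transformers import pipeline

    captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
    sentiment = pipeline("sentiment-analysis")
    zero_shot = pipeline("zero-shot-classification")
    nlp = spacy.load("en_core_web_sm")

    image = Image.open("example.jpg").convert("RGB")

    # 1) Caption the image with BLIP
    caption = captioner(image)[0]["generated_text"]

    # 2) Sentiment of the caption, e.g. {'label': 'POSITIVE', 'score': 0.98}
    tone = sentiment(caption)[0]

    # 3) Named entities mentioned in the caption
    entities = [(ent.text, ent.label_) for ent in nlp(caption).ents]

    # 4) Zero-shot classification against user-supplied labels
    labels = zero_shot(caption, candidate_labels=["nature", "city", "people"])

    print(caption, tone, entities, labels["labels"][0])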

2. Object Detection

This page showcases a fundamental computer vision task (a code sketch follows the list). It uses a DETR (Detection Transformer) model to:

  • Identify multiple objects within the uploaded image.
  • Draw precise bounding boxes around each detected object.
  • Label each object with its class and a confidence score, providing a clear and immediate visual breakdown of the image's contents.
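
A minimal sketch of this detection step, assuming the widely used facebook/detr-resnet-50 checkpoint, the Transformers object-detection pipeline, and a hypothetical example.jpg input; in the app itself, box drawing lives in ui_utils.py.

    # Illustrative DETR detection with box drawing; checkpoint and threshold are assumptions.
    from PIL import Image, ImageDraw
    from transformers import pipeline

    detector = pipeline("object-detection", model="facebook/detr-resnet-50")

    image = Image.open("example.jpg").convert("RGB")
    draw = ImageDraw.Draw(image)

    for det in detector(image, threshold=0.9):
        box = det["box"]  # {'xmin': ..., 'ymin': ..., 'xmax': ..., 'ymax': ...}
        draw.rectangle((box["xmin"], box["ymin"], box["xmax"], box["ymax"]),
                       outline="red", width=3)
        draw.text((box["xmin"], box["ymin"]), f'{det["label"]} {det["score"]:.2f}', fill="red")

    image.save("detections.jpg")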

3. Visual Question Answering (VQA)

This page features a state-of-the-art, interactive AI capability (a code sketch follows the list). Users can:

  • Upload an image and view it.
  • Ask a natural language question about the image's content (e.g., "What color is the car?", "How many people are in the photo?").
  • Receive a direct, text-based answer generated by a Vision-and-Language Transformer (ViLT) model that comprehends both the image and the question.
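
A minimal VQA sketch, assuming the standard dandelin/vilt-b32-finetuned-vqa checkpoint from the Hub and a hypothetical example.jpg:

    # Illustrative ViLT visual question answering; the checkpoint name is an assumption.
    from PIL import Image
    from transformers import ViltProcessor, ViltForQuestionAnswering

    processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
    model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

    image = Image.open("example.jpg").convert("RGB")
    question = "What color is the car?"

    inputs = processor(image, question, return_tensors="pt")
    logits = model(**inputs).logits
    answer = model.config.id2label[logits.argmax(-1).item()]
    print(answer)  # e.g. "red"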

Tech Stack & Architecture

  • Core Framework: Streamlit (for the multi-page web interface)
  • AI & Deep Learning: PyTorch
  • Model Hub: Hugging Face Transformers (for BLIP, DETR, ViLT, and BERT models)
  • NLP Toolkit: spaCy (for robust Named Entity Recognition)
  • Image Processing: Pillow (PIL)
  • Architecture: The application is structured as a Python package with a clear separation of concerns:
    • model_loader.py: A dedicated, cached module for loading all heavy AI models once (see the sketch after this list).
    • analysis_functions.py: Contains the core logic for all AI tasks.
    • ui_utils.py: Helper functions for UI elements like drawing bounding boxes.
    • pages/: Each page of the Streamlit app is a separate file for maximum organization.
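
The sketch below shows the caching pattern model_loader.py can use so the heavy models load only once per session; the function name and checkpoints are illustrative, not taken from the repo.

    # Illustrative cached loader: Streamlit re-runs the script on every interaction,
    # so @st.cache_resource keeps the loaded models in memory across re-runs.
    import streamlit as st
    from transformers import pipeline

    @st.cache_resource
    def load_models():
        return {
            "captioner": pipeline("image-to-text", model="Salesforce/blip-image-captioning-base"),
            "detector": pipeline("object-detection", model="facebook/detr-resnet-50"),
            "vqa": pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa"),
        }

    models = load_models()  # later calls return the same cached objects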

Project Structure

Multi_Modal_Image_Analysis_Dashboard/
├── src/
│   ├── __init__.py
│   ├── model_loader.py
│   ├── analysis_functions.py
│   └── ui_utils.py
├── pages/
│   ├── captioning_and_NLP.py
│   ├── object_Detection.py
│   └── visual_Q&A.py
├── assets/
│   └── arial.ttf
├── app.py
├── requirements.txt
└── README.md

Local Setup & Installation

To run this project on your local machine, follow these steps:

  1. Clone the Repository

    git clone https://github.com/Henildiyora/Multi_Modal_Image_Analysis_Dashboard.git
    cd Multi_Modal_Image_Analysis_Dashboard
  2. Create and Activate a Virtual Environment

    # For macOS/Linux
    python3 -m venv venv
    source venv/bin/activate
    
    # For Windows
    python -m venv venv
    .\venv\Scripts\activate
  3. Install Dependencies

    pip install -r requirements.txt

    Note: The first time you run the app, the Hugging Face models (several GBs) will be downloaded and cached on your machine.

  4. Download the spaCy Model

    Run the following command to download the English language model for NER:

    python -m spacy download en_core_web_sm
  5. Run the Streamlit App

    streamlit run app.py

    The application will launch in your web browser.

