An interactive, multi-page web application that serves as a powerful toolkit for multi-modal AI analysis. This project integrates several state-of-the-art models from the Hugging Face ecosystem to analyze images from multiple perspectives, demonstrating a comprehensive understanding of both Computer Vision and Natural Language Processing.
The application is architected as a modular, multi-page Streamlit app, where each page is dedicated to a specific AI task. This allows for a clean user experience and a scalable codebase.
The Captioning & NLP page forms the core of the language analysis. Upon uploading an image, the application does the following (a short code sketch follows the list):
- Generates a descriptive caption using Salesforce's BLIP model.
- Performs Sentiment Analysis on the caption to determine if the tone is positive or negative.
- Conducts Named Entity Recognition (NER) to identify and extract entities like people, places, and organizations.
- Offers interactive Zero-Shot Classification, allowing the user to classify the caption against custom, on-the-fly labels.
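A minimal sketch of that chain, assuming the stock Hugging Face pipelines and the spaCy model named below (the file path and exact model IDs are illustrative; the repo's actual wiring in `analysis_functions.py` may differ):

```python
from transformers import pipeline
import spacy

# 1. Caption the image with BLIP (checkpoint name assumed).
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
caption = captioner("example.jpg")[0]["generated_text"]  # placeholder image path

# 2. Sentiment of the caption (default sentiment-analysis pipeline).
sentiment = pipeline("sentiment-analysis")
print(sentiment(caption))  # e.g. [{'label': 'POSITIVE', 'score': 0.99}]

# 3. Named entities via spaCy.
nlp = spacy.load("en_core_web_sm")
print([(ent.text, ent.label_) for ent in nlp(caption).ents])

# 4. Zero-shot classification against user-supplied labels.
zero_shot = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
print(zero_shot(caption, candidate_labels=["nature", "city", "people"]))
```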
The Object Detection page showcases a fundamental computer vision task. It uses a DETR (Detection Transformer) model to do the following (sketched in code after the list):
- Identify multiple objects within the uploaded image.
- Draw precise bounding boxes around each detected object.
- Label each object with its class and a confidence score, providing a clear and immediate visual breakdown of the image's contents.
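A rough sketch of the detection step, assuming the standard `facebook/detr-resnet-50` checkpoint via the Transformers object-detection pipeline (the repo's drawing helpers live in `ui_utils.py` and may differ):

```python
from PIL import Image, ImageDraw
from transformers import pipeline

detector = pipeline("object-detection", model="facebook/detr-resnet-50")

image = Image.open("example.jpg")  # placeholder path
draw = ImageDraw.Draw(image)

for det in detector(image):
    if det["score"] < 0.9:  # keep only confident detections
        continue
    box = det["box"]  # dict with xmin / ymin / xmax / ymax
    draw.rectangle(
        (box["xmin"], box["ymin"], box["xmax"], box["ymax"]),
        outline="red", width=3,
    )
    draw.text((box["xmin"], box["ymin"] - 10),
              f'{det["label"]} {det["score"]:.2f}', fill="red")

image.save("annotated.jpg")
```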
The Visual Q&A page features a state-of-the-art, interactive AI capability (a code sketch follows the list). Users can:
- Upload an image and view it.
- Ask a natural language question about the image's content (e.g., "What color is the car?", "How many people are in the photo?").
- Receive a direct, text-based answer generated by a Vision Transformer (ViLT) model that comprehends both the image and the question.
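For illustration, this step can be reproduced with the `dandelin/vilt-b32-finetuned-vqa` checkpoint and the Transformers visual-question-answering pipeline (assumed here; the app's exact model ID isn't stated):

```python
from transformers import pipeline

vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")

# Placeholder inputs; the app passes the uploaded image and the user's question.
result = vqa(image="example.jpg", question="What color is the car?")
print(result[0]["answer"], result[0]["score"])
```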
The project is built with the following stack:

- Core Framework: Streamlit (for the multi-page web interface)
- AI & Deep Learning: PyTorch
- Model Hub: Hugging Face Transformers (for BLIP, DETR, ViLT, and BERT models)
- NLP Toolkit: spaCy (for robust Named Entity Recognition)
- Image Processing: Pillow (PIL)
- Architecture: The application is structured as a Python package with a clear separation of concerns:
  - `model_loader.py`: a dedicated, cached module that loads all heavy AI models once (a minimal sketch of this pattern follows the list).
  - `analysis_functions.py`: the core logic for all AI tasks.
  - `ui_utils.py`: helper functions for UI elements such as drawing bounding boxes.
  - `pages/`: each page of the Streamlit app lives in its own file for maximum organization.
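The caching pattern described above typically looks like this minimal sketch (function names and model IDs here are illustrative, not the repo's actual `model_loader.py`):

```python
# src/model_loader.py (illustrative sketch)
import streamlit as st
from transformers import pipeline

@st.cache_resource  # Streamlit keeps one instance per process, so each model loads once
def load_captioner():
    return pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

@st.cache_resource
def load_detector():
    return pipeline("object-detection", model="facebook/detr-resnet-50")
```

Pages then call `load_captioner()` and friends; repeated calls return the cached instance instead of re-initializing the model on every rerun.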
```text
Multi_Modal_Image_Analysis_Dashboard/
├── src/
│   ├── __init__.py
│   ├── model_loader.py
│   ├── analysis_functions.py
│   └── ui_utils.py
├── pages/
│   ├── captioning_and_NLP.py
│   ├── object_Detection.py
│   └── visual_Q&A.py
├── assets/
│   └── arial.ttf
├── app.py
├── requirements.txt
└── README.md
```
To run this project on your local machine, follow these steps:

1. **Clone the Repository**

   ```bash
   git clone https://github.com/Henildiyora/Multi_Modal_Image_Analysis_Dashboard.git
   cd Multi_Modal_Image_Analysis_Dashboard
   ```

2. **Create and Activate a Virtual Environment**

   ```bash
   # For macOS/Linux
   python3 -m venv venv
   source venv/bin/activate

   # For Windows
   python -m venv venv
   .\venv\Scripts\activate
   ```

3. **Install Dependencies**

   ```bash
   pip install -r requirements.txt
   ```

   Note: The first time you run the app, the Hugging Face models (several GB) will be downloaded and cached on your machine.

4. **Download the spaCy Model**

   Run the following command to download the English language model for NER:

   ```bash
   python -m spacy download en_core_web_sm
   ```

5. **Run the Streamlit App**

   ```bash
   streamlit run app.py
   ```

   The application will launch in your web browser.