AI Playground for Vision & Language

Python PyTorch Hugging Face Streamlit

An interactive, multi-page web application that serves as a powerful toolkit for multi-modal AI analysis. This project integrates several state-of-the-art models from the Hugging Face ecosystem to analyze images from multiple perspectives, demonstrating a comprehensive understanding of both Computer Vision and Natural Language Processing.


Core Features & Application Pages

The application is architected as a modular, multi-page Streamlit app, where each page is dedicated to a specific AI task. This allows for a clean user experience and a scalable codebase.

1. Image Captioning & NLP Analysis

This page forms the core of the language analysis (a code sketch follows the list). Upon uploading an image, the application:

  • Generates a descriptive caption using Salesforce's BLIP model.
  • Performs Sentiment Analysis on the caption to determine if the tone is positive or negative.
  • Conducts Named Entity Recognition (NER) to identify and extract entities like people, places, and organizations.
  • Offers interactive Zero-Shot Classification, allowing the user to classify the caption against custom, on-the-fly labels.
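
Below is a minimal sketch of how this flow can be wired together with Hugging Face pipelines and spaCy. The checkpoint names (Salesforce/blip-image-captioning-base and the pipeline defaults) and the file name example.jpg are illustrative assumptions, not necessarily the exact ones the app pins.

    # Illustrative captioning + NLP flow; checkpoint names are assumptions.
    import spacy
    from PIL import Image
    from transformers import pipeline

    captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
    sentiment = pipeline("sentiment-analysis")
    zero_shot = pipeline("zero-shot-classification")
    nlp = spacy.load("en_core_web_sm")

    image = Image.open("example.jpg").convert("RGB")

    # 1) Caption the image with BLIP
    caption = captioner(image)[0]["generated_text"]

    # 2) Sentiment of the caption, e.g. {'label': 'POSITIVE', 'score': 0.98}
    tone = sentiment(caption)[0]

    # 3) Named entities mentioned in the caption
    entities = [(ent.text, ent.label_) for ent in nlp(caption).ents]

    # 4) Zero-shot classification against user-supplied labels
    labels = zero_shot(caption, candidate_labels=["nature", "city", "people"])

    print(caption, tone, entities, labels["labels"][0])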

2. Object Detection

This page showcases a fundamental computer vision task (a code sketch follows the list). It uses a DETR (Detection Transformer) model to:

  • Identify multiple objects within the uploaded image.
  • Draw precise bounding boxes around each detected object.
  • Label each object with its class and a confidence score, providing a clear and immediate visual breakdown of the image's contents.
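
A minimal sketch of this detection step, assuming the widely used facebook/detr-resnet-50 checkpoint, the Transformers object-detection pipeline, and a hypothetical example.jpg input; in the app itself, box drawing lives in ui_utils.py.

    # Illustrative DETR detection with box drawing; checkpoint and threshold are assumptions.
    from PIL import Image, ImageDraw
    from transformers import pipeline

    detector = pipeline("object-detection", model="facebook/detr-resnet-50")

    image = Image.open("example.jpg").convert("RGB")
    draw = ImageDraw.Draw(image)

    for det in detector(image, threshold=0.9):
        box = det["box"]  # {'xmin': ..., 'ymin': ..., 'xmax': ..., 'ymax': ...}
        draw.rectangle((box["xmin"], box["ymin"], box["xmax"], box["ymax"]),
                       outline="red", width=3)
        draw.text((box["xmin"], box["ymin"]), f'{det["label"]} {det["score"]:.2f}', fill="red")

    image.save("detections.jpg")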

3. Visual Question Answering (VQA)

This page features a state-of-the-art, interactive AI capability (a code sketch follows the list). Users can:

  • Upload an image and view it.
  • Ask a natural language question about the image's content (e.g., "What color is the car?", "How many people are in the photo?").
  • Receive a direct, text-based answer generated by a Vision-and-Language Transformer (ViLT) model that comprehends both the image and the question.
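
A minimal VQA sketch, assuming the standard dandelin/vilt-b32-finetuned-vqa checkpoint from the Hub and a hypothetical example.jpg:

    # Illustrative ViLT visual question answering; the checkpoint name is an assumption.
    from PIL import Image
    from transformers import ViltProcessor, ViltForQuestionAnswering

    processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
    model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

    image = Image.open("example.jpg").convert("RGB")
    question = "What color is the car?"

    inputs = processor(image, question, return_tensors="pt")
    logits = model(**inputs).logits
    answer = model.config.id2label[logits.argmax(-1).item()]
    print(answer)  # e.g. "red"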

Tech Stack & Architecture

  • Core Framework: Streamlit (for the multi-page web interface)
  • AI & Deep Learning: PyTorch
  • Model Hub: Hugging Face Transformers (for BLIP, DETR, ViLT, and BERT models)
  • NLP Toolkit: spaCy (for robust Named Entity Recognition)
  • Image Processing: Pillow (PIL)
  • Architecture: The application is structured as a Python package with a clear separation of concerns:
    • model_loader.py: A dedicated, cached module for loading all heavy AI models once (see the sketch after this list).
    • analysis_functions.py: Contains the core logic for all AI tasks.
    • ui_utils.py: Helper functions for UI elements like drawing bounding boxes.
    • pages/: Each page of the Streamlit app is a separate file for maximum organization.
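
The sketch below shows the caching pattern model_loader.py can use so the heavy models load only once per session; the function name and checkpoints are illustrative, not taken from the repo.

    # Illustrative cached loader: Streamlit re-runs the script on every interaction,
    # so @st.cache_resource keeps the loaded models in memory across re-runs.
    import streamlit as st
    from transformers import pipeline

    @st.cache_resource
    def load_models():
        return {
            "captioner": pipeline("image-to-text", model="Salesforce/blip-image-captioning-base"),
            "detector": pipeline("object-detection", model="facebook/detr-resnet-50"),
            "vqa": pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa"),
        }

    models = load_models()  # later calls return the same cached objects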

Project Structure

Multi_Modal_Image_Analysis_Dashboard/
├── src/
│   ├── __init__.py
│   ├── model_loader.py
│   ├── analysis_functions.py
│   └── ui_utils.py
├── pages/
│   ├── captioning_and_NLP.py
│   ├── object_Detection.py
│   └── visual_Q&A.py
├── assets/
│   └── arial.ttf
├── app.py
├── requirements.txt
└── README.md

Local Setup & Installation

To run this project on your local machine, follow these steps:

  1. Clone the Repository

    git clone https://github.com/Henildiyora/Multi_Modal_Image_Analysis_Dashboard.git
    cd Multi_Modal_Image_Analysis_Dashboard
  2. Create and Activate a Virtual Environment

    # For macOS/Linux
    python3 -m venv venv
    source venv/bin/activate
    
    # For Windows
    python -m venv venv
    .\venv\Scripts\activate
  3. Install Dependencies

    pip install -r requirements.txt

    Note: The first time you run the app, the Hugging Face models (several GBs) will be downloaded and cached on your machine.

  4. Download the spaCy Model

    Run the following command to download the English language model for NER:

    python -m spacy download en_core_web_sm
  5. Run the Streamlit App

    streamlit run app.py

    The application will launch in your web browser.

