The Visual QnA System lets users upload an image and ask questions about its content. It uses ViLT (Vision-and-Language Transformer) for Visual Question Answering and BLIP for image captioning, so it can both describe an uploaded image and answer questions about it. It suits applications such as AI-powered chatbots, image understanding, and automated image analysis.
- Features
- Models
- Installation
- Usage
- Technologies Used
- Results
- Conclusion
- Future Enhancements
- License
- Contact
- Upload an image and receive a generated caption.
- Choose from suggested questions or ask your own.
- Get answers to questions based on the image content.
- Built with Streamlit for an interactive and easy-to-use interface.
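The suggested-questions feature can be pictured with a small sketch. This README does not show how the app derives questions from a caption, so the keyword table, the generic fallbacks, and the name `suggest_questions` below are all hypothetical and serve only as an illustration.

```python
import re
from typing import List

# Hypothetical keyword-to-question table; the project's real rules are not
# described in this README.
KEYWORD_QUESTIONS = {
    "man": "What is the man doing?",
    "woman": "What is the woman doing?",
    "dog": "What color is the dog?",
    "ball": "What sport is being played?",
    "car": "What color is the car?",
}

# Generic fallbacks used when the caption matches few or no keywords.
GENERIC_QUESTIONS = [
    "What is in the image?",
    "What color is the main object?",
]


def suggest_questions(caption: str, limit: int = 4) -> List[str]:
    """Return up to `limit` questions based on words found in the caption."""
    words = set(re.findall(r"[a-z]+", caption.lower()))
    questions = [q for keyword, q in KEYWORD_QUESTIONS.items() if keyword in words]
    for question in GENERIC_QUESTIONS:
        if question not in questions:
            questions.append(question)
    return questions[:limit]
```

Keyword questions come first because they are tied to the caption, and the generic ones fill any remaining slots.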
- ViLT (dandelin/vilt-b32-finetuned-vqa)
  - The model used for Visual Question Answering.
  - Combines image features with the text of the question to produce an answer.
- BLIP (Salesforce/blip-image-captioning-base)
  - The model used to generate captions from images.
  - Its captions are used to generate suggested questions for the user.
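As a rough sketch, the two models can be loaded and called through the Hugging Face Transformers classes for ViLT and BLIP. The function names `generate_caption`, `answer_question`, and `best_label` are illustrative and not taken from the project's `app.py`. The sketch also assumes PyTorch is installed and reloads the weights on every call for brevity, where a real app would cache them.

```python
from typing import Mapping, Sequence

VQA_MODEL_ID = "dandelin/vilt-b32-finetuned-vqa"
CAPTION_MODEL_ID = "Salesforce/blip-image-captioning-base"


def best_label(scores: Sequence[float], id2label: Mapping[int, str]) -> str:
    """Return the label with the highest score."""
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return id2label[best_index]


def generate_caption(image) -> str:
    """Caption an RGB PIL image with BLIP (downloads weights on first use)."""
    from transformers import BlipForConditionalGeneration, BlipProcessor

    processor = BlipProcessor.from_pretrained(CAPTION_MODEL_ID)
    model = BlipForConditionalGeneration.from_pretrained(CAPTION_MODEL_ID)
    inputs = processor(images=image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(output_ids[0], skip_special_tokens=True)


def answer_question(image, question: str) -> str:
    """Answer a question about an RGB PIL image with ViLT."""
    import torch
    from transformers import ViltForQuestionAnswering, ViltProcessor

    processor = ViltProcessor.from_pretrained(VQA_MODEL_ID)
    model = ViltForQuestionAnswering.from_pretrained(VQA_MODEL_ID)
    encoding = processor(image, question, return_tensors="pt")
    with torch.no_grad():
        logits = model(**encoding).logits  # one score per candidate answer
    return best_label(logits[0].tolist(), model.config.id2label)
```

With an image opened via `PIL.Image.open(path).convert("RGB")`, calling `generate_caption(image)` returns a short description and `answer_question(image, "What sport is being played?")` returns a single answer label.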
- Clone the repository:
  git clone https://github.com/hk-kumawat/Visual-QnA-System.git
- Install dependencies (from inside the cloned Visual-QnA-System directory):
  pip install -r requirements.txt
- Run the Streamlit app:
  streamlit run app.py
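After installing, a quick way to confirm that the environment is ready is to check that the main packages can be imported. The module list below is an assumption based on the libraries named in this README plus PyTorch. The authoritative list is `requirements.txt`.

```python
import importlib.util
from typing import Iterable, List

# Assumed import names; requirements.txt in the repository is authoritative.
REQUIRED_MODULES = ("streamlit", "PIL", "transformers", "torch")


def missing_modules(names: Iterable[str] = REQUIRED_MODULES) -> List[str]:
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All required packages are importable.")
```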
- Upload Image: Choose an image from your local drive.
- Select Question: Pick a suggested question or write your own.
- Get Answer: Click the "Predict Answer" button to receive an answer to your question about the image.
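The three steps above could be wired together in Streamlit roughly as follows. This is a minimal sketch and not the project's actual `app.py`: the widget labels, the placeholder suggestion, and the helper `resolve_question` are assumptions, and the model call is left as a comment.

```python
from typing import Optional


def resolve_question(suggested: Optional[str], custom: str) -> Optional[str]:
    """Prefer the user's own question; fall back to the suggested one."""
    custom = custom.strip()
    if custom:
        return custom
    return suggested or None


def main() -> None:
    import streamlit as st
    from PIL import Image

    st.title("Visual QnA System")

    # Step 1: upload an image.
    uploaded = st.file_uploader("Upload an image", type=["jpg", "jpeg", "png"])
    if uploaded is None:
        return
    image = Image.open(uploaded).convert("RGB")
    st.image(image)

    # Step 2: pick a suggested question or type a custom one.
    suggested = st.selectbox("Suggested questions", ["What is in the image?"])
    custom = st.text_input("Or write your own question")
    question = resolve_question(suggested, custom)

    # Step 3: answer on button click.
    if st.button("Predict Answer") and question:
        # A real app would pass `image` and `question` to the VQA model here.
        st.write(f"Question: {question}")


# To try this with `streamlit run`, add a call to main() at the end of the script.
```

Keeping the question-selection logic in a plain function makes it easy to check without starting a Streamlit session.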
- Programming Language: Python
- Libraries:
  - Streamlit for the web interface
  - PIL for image handling
  - Transformers from Hugging Face for pre-trained models
- Models:
  - ViLT: dandelin/vilt-b32-finetuned-vqa
  - BLIP: Salesforce/blip-image-captioning-base
The Visual QnA System offers an interactive experience in which users ask questions about images. It generates a caption, suggests questions based on the image content, and answers questions using the ViLT model.
In one example, the system was asked, "What sport is being played?" and answered "Soccer," showing that it can pick up the context of an image.
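To look beyond the single top answer in an example like this, the model's raw scores can be ranked. The sketch below assumes the scores are available as a plain list with a matching index-to-label mapping, which is how the Transformers ViLT VQA model exposes them. The name `top_answers` and the sigmoid mapping to a 0 to 1 range are illustrative choices, not something the app is known to do.

```python
import math
from typing import List, Mapping, Sequence, Tuple


def _sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def top_answers(
    logits: Sequence[float], id2label: Mapping[int, str], k: int = 3
) -> List[Tuple[str, float]]:
    """Return the k highest-scoring answers with a 0..1 score each."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    return [(id2label[i], _sigmoid(logits[i])) for i in ranked]
```

Showing two or three ranked candidates can help users judge how far the top answer stands out from the alternatives.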
The Visual QnA System applies computer vision and natural language processing together. By integrating image captioning and question answering models, it gives users an intuitive way to interact with images. The project demonstrates AI-driven image understanding and its possible applications in areas such as AI chatbots, image search engines, and education and e-learning.
Because it can analyze images and answer questions about them, it could support customer service, image-based search, and recommendations based on visual content. With further work, similar techniques may also be relevant to diagnostic imaging in healthcare, security and surveillance, and autonomous vehicles, where understanding the visual environment is critical.
While the Visual QnA System currently delivers concise, single-line responses, future improvements could enable more detailed, context-aware answers. Here are a few potential upgrades:
- Extended Answer Generation: Integrate advanced language models to generate detailed answers that provide in-depth information based on image content.
- Context Awareness: Enable the system to consider multiple objects and interactions in an image, enhancing its capability to answer complex questions.
- Multilingual Support: Add the ability to understand and answer questions in various languages, broadening accessibility.
- Enhanced Accuracy with Fine-Tuning: Train on diverse datasets for specialized fields, such as medical imaging or geographical scenes, to improve precision and expand application areas.
This project is licensed under the MIT License - see the LICENSE file for details.
I’d love to connect and discuss further:
💻 — Explore my projects and contributions.
🌐 — Let’s connect professionally.
📧 — Send me an email for discussions and queries.
"Empowering machines to see, think, and answer – the future of visual intelligence!" - Anonymous


