🚀 Try it live on Hugging Face Spaces
This is a simple Gradio-powered app that uses a pre-trained vision-language model to describe the content of images. Upload any image and see how AI interprets the scene.
Image captioning is a task where a deep learning model generates a textual description of an image. It combines computer vision and natural language processing in one pipeline.
```python
import gradio as gr
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image
```

I used `BlipProcessor`, a class from the 🤗 Transformers library designed for Bootstrapping Language-Image Pre-training (BLIP) models. It combines image and text preprocessing in a single object.
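As a quick sketch of what that means in practice (the blank 32×32 image below is just a stand-in; any PIL image works):

```python
from PIL import Image
from transformers import BlipProcessor

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")

# A blank image keeps the sketch self-contained.
image = Image.new("RGB", (32, 32), color="white")

# Image-only preprocessing: the processor resizes and normalizes the image
# into a pixel_values tensor for the vision encoder.
image_inputs = processor(images=image, return_tensors="pt")
print(image_inputs["pixel_values"].shape)  # batch x channels x height x width

# Image + text: the same processor also tokenizes an optional text prompt,
# which BLIP can use as a prefix for conditional captioning.
both_inputs = processor(images=image, text="a photo of", return_tensors="pt")
print(sorted(both_inputs.keys()))
```

In the app we only pass an image, so the model generates an unconditional caption from scratch.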
```python
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def generate_caption(image):
    if image is None:
        return "Please upload an image to generate a caption."
    inputs = processor(images=image, return_tensors="pt")
    out = model.generate(**inputs)
    caption = processor.decode(out[0], skip_special_tokens=True)
    return caption

iface = gr.Interface(
    fn=generate_caption,
    inputs=gr.Image(type="pil", label="Upload Image"),
    outputs="text",
    live=True,
    title="Image Captioning App",
    description="Upload an image and get a description of what the image contains.",
    allow_flagging="never",
)

iface.launch()
```

To run the app locally:

```shell
git clone https://github.com/96ibman/image_captioning_gradio.git
cd image_captioning_gradio
python -m venv venv
venv/Scripts/activate   # Windows; on macOS/Linux use: source venv/bin/activate
pip install -r requirements.txt
python app.py
```