This repository contains a prototype video analytics solution designed to analyze and search visual content within video files using Azure Computer Vision 4 (Florence) and related Azure services.
The solution demonstrates how to:
- Extract frames from video files.
- Generate vector embeddings for each frame using Azure Computer Vision 4.
- Persist and index these embeddings for efficient similarity search.
- Perform visual search using either a reference image or a natural language prompt.
- Explore the results through a simple web application.
The end-to-end process can be summarized as follows:
### 1. Ingestion
- A video file is provided locally or from a storage location such as Azure Blob Storage.
- Basic metadata (file name, duration, frame rate, resolution) can be captured for reference and logging.
### 2. Frame Extraction (OpenCV)
- The video is processed using OpenCV.
- Frames are extracted at a configurable interval (for example, every N frames or every T seconds).
- Each extracted frame is assigned:
- A unique identifier.
- A timestamp or frame index, to allow seeking back into the original video.
- Optionally, frames can be stored:
- Locally on disk (for quick prototyping).
- In Azure Blob Storage, with their URLs stored alongside metadata.
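The extraction step above can be sketched as follows. This is a minimal prototype, assuming a local video file and a simple every-N-frames policy; the `frame_id` naming scheme and `frames/` output directory are illustrative assumptions, not part of the repository's actual scripts:

```python
import os

def frame_timestamp_s(frame_index: int, fps: float) -> float:
    """Map a frame index back to a timestamp (seconds) in the source video."""
    return frame_index / fps

def extract_frames(video_path: str, every_n_frames: int = 30, out_dir: str = "frames"):
    """Extract every Nth frame with a unique id, timestamp, and saved image path."""
    import cv2  # opencv-python; imported lazily so the helper above stays dependency-free

    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if the container reports no FPS
    records, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:  # end of stream
            break
        if index % every_n_frames == 0:
            stem = os.path.splitext(os.path.basename(video_path))[0]
            frame_id = f"{stem}_{index:06d}"
            path = os.path.join(out_dir, f"{frame_id}.jpg")
            cv2.imwrite(path, frame)
            records.append({"frame_id": frame_id, "frame_index": index,
                            "timestamp_s": round(frame_timestamp_s(index, fps), 3),
                            "path": path})
        index += 1
    cap.release()
    return records
```

The timestamp is derived from the frame index and FPS, which is what later allows a search hit to seek back into the original video.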
### 3. Feature Extraction (Azure Computer Vision 4 – Florence)
- Each frame image is sent to Azure Computer Vision 4 (Florence) via the Azure AI Vision API.
- The service returns:
- Vector embeddings representing the visual content of the frame.
- Optionally, additional information such as tags, captions, or detected objects (depending on the API call).
- These embeddings are the foundation for visual and semantic similarity search across the video.
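The frame-to-embedding call can be sketched with the Image Retrieval `retrieval:vectorizeImage` operation of the Azure AI Vision REST API. The `api-version` value below is an assumption and may differ for your resource; check the version available in your region before using it:

```python
def vectorize_image_url(endpoint: str, api_version: str = "2023-02-01-preview") -> str:
    """Build the vectorizeImage request URL for a given Vision endpoint."""
    return (f"{endpoint.rstrip('/')}/computervision/retrieval:vectorizeImage"
            f"?api-version={api_version}")

def get_image_embedding(endpoint: str, key: str, image_bytes: bytes) -> list:
    """POST one frame's bytes and return the embedding vector from the response."""
    import requests  # imported lazily; only needed for the live call

    resp = requests.post(
        vectorize_image_url(endpoint),
        headers={"Ocp-Apim-Subscription-Key": key,
                 "Content-Type": "application/octet-stream"},
        data=image_bytes,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["vector"]
```

The same call is reused later for reference images in image-to-video search, so both sides of the similarity comparison live in the same embedding space.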
### 4. Storage and Indexing of Embeddings
- For each frame, the following information can be stored:
- Video identifier.
- Frame identifier and timestamp.
- Embedding vector.
- Optional tags, captions, or detected objects.
- Typical storage and indexing architecture:
- Metadata store (e.g., Azure Cosmos DB, Azure SQL Database, or another database) for:
- Video and frame metadata.
- References to frame image locations (local or Blob Storage).
- Vector index (e.g., a vector-capable search engine or custom index) to:
- Store and index high-dimensional embeddings.
- Support k-nearest-neighbor (k-NN) or cosine similarity search over embeddings.
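As an illustration of the indexing step, a brute-force in-memory index is enough for a prototype; it is a stand-in for a real vector-capable search service, not a scalable design:

```python
import math

class FrameVectorIndex:
    """Minimal in-memory vector index: brute-force cosine similarity.
    A production setup would delegate this to a vector search service."""

    def __init__(self):
        self._entries = []  # list of (frame_metadata, embedding) pairs

    def add(self, metadata: dict, embedding: list):
        self._entries.append((metadata, embedding))

    @staticmethod
    def _cosine(a: list, b: list) -> float:
        """Cosine similarity between two equal-length vectors."""
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    def search(self, query_embedding: list, k: int = 5):
        """Return the top-k (score, metadata) pairs ranked by similarity."""
        scored = [(self._cosine(query_embedding, emb), meta)
                  for meta, emb in self._entries]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]
```

Because the metadata travels with each embedding, a search hit immediately yields the frame id and timestamp needed to seek into the source video.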
### 5. Visual and Semantic Search Experience 🔎
- Image-to-video search:
- A user provides a reference image.
- The image is sent to Azure Computer Vision 4 to obtain an embedding.
- A similarity query is executed against the stored frame embeddings.
- The system returns frames (with timestamps) ranked by similarity to the reference image.
- Text-to-video search:
- A user provides a natural language prompt (e.g., “a person walking on a beach at sunset”).
- The prompt is converted to an embedding using Florence’s multi-modal capabilities.
- A similarity query is executed against the frame embeddings to find the most semantically relevant scenes.
- Results include:
- Matching frames with preview thumbnails.
- Corresponding timestamps to allow seeking to the exact moment in the original video.
- Optional tags, captions, or confidence scores.
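The text-to-video path can be sketched with the companion `retrieval:vectorizeText` operation; the returned vector is then run through the same similarity query as frame embeddings. As before, the `api-version` value is an assumption that should be checked against your resource:

```python
import json

def build_vectorize_text_request(endpoint: str, key: str, prompt: str,
                                 api_version: str = "2023-02-01-preview"):
    """Assemble URL, headers, and JSON body for a vectorizeText call."""
    url = (f"{endpoint.rstrip('/')}/computervision/retrieval:vectorizeText"
           f"?api-version={api_version}")
    headers = {"Ocp-Apim-Subscription-Key": key,
               "Content-Type": "application/json"}
    body = json.dumps({"text": prompt}).encode("utf-8")
    return url, headers, body

def get_text_embedding(endpoint: str, key: str, prompt: str) -> list:
    """Call the Vision API and return the prompt's embedding vector."""
    import requests  # lazy import; only needed for the live call

    url, headers, body = build_vectorize_text_request(endpoint, key, prompt)
    resp = requests.post(url, headers=headers, data=body, timeout=30)
    resp.raise_for_status()
    return resp.json()["vector"]
```

Because Florence places text and image embeddings in a shared space, this vector can be compared directly against the stored frame embeddings.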
### 6. Web Application and User Interface 🌐
- A sample web application illustrates the end-to-end user experience:
- Upload or select videos for analysis.
- Trigger frame extraction and embedding generation.
- Perform searches using:
- A reference image upload.
- A natural language query.
- Inspect search results:
- View matching frames and their timestamps.
- Navigate back to the relevant segment of the original video.
- Display associated metadata when available.
This prototype is designed to showcase integration with the following Azure services (some are optional and can be adapted to your environment):
### Azure Computer Vision 4 (Florence)
- Core service for visual understanding and multi-modal (image + text) embeddings.
- Used to generate:
- Embeddings for frames.
- Embeddings for reference images.
- Embeddings for text prompts.
### Azure AI Vision Endpoint
- Provides the API endpoint and authentication (API key) for calling Florence.
- Typically configured via environment variables (e.g., `VISION_ENDPOINT`, `VISION_API_KEY`) or a configuration file.
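A minimal configuration loader for the two environment variables named above might look like this; the dictionary shape is an illustrative choice, not the repository's actual config API:

```python
import os

def load_vision_config() -> dict:
    """Read the Vision endpoint and key from the environment
    (variable names as documented in this README)."""
    endpoint = os.environ.get("VISION_ENDPOINT")
    key = os.environ.get("VISION_API_KEY")
    if not endpoint or not key:
        raise RuntimeError(
            "Set VISION_ENDPOINT and VISION_API_KEY before running the pipeline.")
    return {"endpoint": endpoint, "key": key}
```

Failing fast with a clear message avoids confusing authentication errors deeper in the embedding pipeline.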
### (Optional) Azure Blob Storage
- Can be used to:
- Store original video files.
- Store extracted frame images.
- Frame and video URLs can be persisted in the metadata store to avoid duplicating data.
### (Optional) Azure Database and/or Search Service
- Azure Cosmos DB or Azure SQL Database for:
- Storing video metadata.
- Storing frame metadata (timestamps, paths, tags, captions).
- A vector-aware index or search service (for example, a k-NN index layer) for:
- Efficient similarity search across embedding vectors.
- Scaling to larger volumes of videos and frames.
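For local prototyping, the metadata store can be sketched with `sqlite3` standing in for Cosmos DB or Azure SQL. The schema below is an illustrative assumption; embeddings are stored as JSON text here, whereas a real deployment would keep them in a vector-aware index:

```python
import json
import sqlite3

def init_metadata_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create a minimal frame-metadata table for the prototype."""
    con = sqlite3.connect(path)
    con.execute("""
        CREATE TABLE IF NOT EXISTS frames (
            frame_id    TEXT PRIMARY KEY,
            video_id    TEXT NOT NULL,
            timestamp_s REAL NOT NULL,
            image_path  TEXT,
            tags        TEXT,      -- JSON array of optional tags/captions
            embedding   TEXT NOT NULL  -- JSON array; a vector index in production
        )""")
    return con

def insert_frame(con, frame_id, video_id, timestamp_s, image_path, tags, embedding):
    """Persist one frame's metadata and embedding."""
    con.execute("INSERT INTO frames VALUES (?, ?, ?, ?, ?, ?)",
                (frame_id, video_id, timestamp_s, image_path,
                 json.dumps(tags), json.dumps(embedding)))
    con.commit()
```

Keeping `video_id` and `timestamp_s` on every row is what lets a similarity hit map straight back to a playable moment in the source video.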
Note: This repository focuses on the conceptual and practical integration with Azure Computer Vision 4 (Florence). The choice of database and vector indexing technology can be tailored to specific performance, cost, and operational requirements.
To run this prototype, you will typically need:
- An Azure subscription.
- An Azure AI Vision resource with access to Computer Vision 4 (Florence).
- API endpoint and key for the Azure AI Vision resource.
- A Python environment with:
  - `opencv-python` (for frame extraction).
  - `requests` or Azure SDK libraries (for calling the Vision API).
  - Web framework dependencies (e.g., `Flask`, `FastAPI`, or `Streamlit`) if using the provided web application.
You may also need:
- Access to Azure Blob Storage and a chosen database/search service if you want to persist and scale beyond local storage.
1. Place a video file in the expected input location (or configure the source path or Blob container).
2. Execute the frame extraction script to:
   - Extract frames using OpenCV.
   - Save frame images locally or to Azure Blob Storage.
3. Execute the embedding pipeline to:
   - Send each frame to Azure Computer Vision 4 (Florence).
   - Store embeddings and metadata in your chosen storage and index.
4. Start the web application.
5. Use the web UI to:
   - Upload a reference image or enter a text prompt.
   - Run similarity search against the indexed frame embeddings.
   - Inspect search results and navigate to the corresponding segments in the video.
For full details on Azure Computer Vision 4 (Florence) and Azure AI Vision, refer to the official Azure documentation.
Last updated: 2025-11-21 08:37:33 – Serge Retkowsky (serge.retkowsky@microsoft.com, LinkedIn profile).

