This repository contains a prototype video analytics solution designed to analyze and search visual content within video files using Azure Computer Vision 4 (Florence) and related Azure services.
The solution demonstrates how to:
- Extract frames from video files.
- Generate vector embeddings for each frame using Azure Computer Vision 4.
- Persist and index these embeddings for efficient similarity search.
- Perform visual search using either a reference image or a natural language prompt.
- Explore the results through a simple web application.
The end-to-end process can be summarized as follows:
### 1. Ingestion
- A video file is provided locally or from a storage location such as Azure Blob Storage.
- Basic metadata (file name, duration, frame rate, resolution) can be captured for reference and logging.
### 2. Frame Extraction (OpenCV)
- The video is processed using OpenCV.
- Frames are extracted at a configurable interval (for example, every N frames or every T seconds).
- Each extracted frame is assigned:
- A unique identifier.
- A timestamp or frame index, to allow seeking back into the original video.
- Optionally, frames can be stored:
- Locally on disk (for quick prototyping).
- In Azure Blob Storage, with their URLs stored alongside metadata.
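The extraction step above can be sketched as follows. This is a minimal prototype, assuming a local video file and a simple every-N-frames policy; the `frame_id` naming scheme and `frames/` output directory are illustrative assumptions, not part of the repository's actual scripts:

```python
import os

def frame_timestamp_s(frame_index: int, fps: float) -> float:
    """Map a frame index back to a timestamp (seconds) in the source video."""
    return frame_index / fps

def extract_frames(video_path: str, every_n_frames: int = 30, out_dir: str = "frames"):
    """Extract every Nth frame with a unique id, timestamp, and saved image path."""
    import cv2  # opencv-python; imported lazily so the helper above stays dependency-free

    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if the container reports no FPS
    records, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:  # end of stream
            break
        if index % every_n_frames == 0:
            stem = os.path.splitext(os.path.basename(video_path))[0]
            frame_id = f"{stem}_{index:06d}"
            path = os.path.join(out_dir, f"{frame_id}.jpg")
            cv2.imwrite(path, frame)
            records.append({"frame_id": frame_id, "frame_index": index,
                            "timestamp_s": round(frame_timestamp_s(index, fps), 3),
                            "path": path})
        index += 1
    cap.release()
    return records
```

The timestamp is derived from the frame index and FPS, which is what later allows a search hit to seek back into the original video.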
### 3. Feature Extraction (Azure Computer Vision 4 – Florence)
- Each frame image is sent to Azure Computer Vision 4 (Florence) via the Azure AI Vision API.
- The service returns:
- Vector embeddings representing the visual content of the frame.
- Optionally, additional information such as tags, captions, or detected objects (depending on the API call).
- These embeddings are the foundation for visual and semantic similarity search across the video.
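The frame-to-embedding call can be sketched with the Image Retrieval `retrieval:vectorizeImage` operation of the Azure AI Vision REST API. The `api-version` value below is an assumption and may differ for your resource; check the version available in your region before using it:

```python
def vectorize_image_url(endpoint: str, api_version: str = "2023-02-01-preview") -> str:
    """Build the vectorizeImage request URL for a given Vision endpoint."""
    return (f"{endpoint.rstrip('/')}/computervision/retrieval:vectorizeImage"
            f"?api-version={api_version}")

def get_image_embedding(endpoint: str, key: str, image_bytes: bytes) -> list:
    """POST one frame's bytes and return the embedding vector from the response."""
    import requests  # imported lazily; only needed for the live call

    resp = requests.post(
        vectorize_image_url(endpoint),
        headers={"Ocp-Apim-Subscription-Key": key,
                 "Content-Type": "application/octet-stream"},
        data=image_bytes,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["vector"]
```

The same call is reused later for reference images in image-to-video search, so both sides of the similarity comparison live in the same embedding space.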
### 4. Storage and Indexing of Embeddings
- For each frame, the following information can be stored:
- Video identifier.
- Frame identifier and timestamp.
- Embedding vector.
- Optional tags, captions, or detected objects.
- Typical storage and indexing architecture:
- Metadata store (e.g., Azure Cosmos DB, Azure SQL Database, or another database) for:
- Video and frame metadata.
- References to frame image locations (local or Blob Storage).
- Vector index (e.g., a vector-capable search engine or custom index) to:
- Store and index high-dimensional embeddings.
- Support k-nearest-neighbor (k-NN) or cosine similarity search over embeddings.
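As an illustration of the indexing step, a brute-force in-memory index is enough for a prototype; it is a stand-in for a real vector-capable search service, not a scalable design:

```python
import math

class FrameVectorIndex:
    """Minimal in-memory vector index: brute-force cosine similarity.
    A production setup would delegate this to a vector search service."""

    def __init__(self):
        self._entries = []  # list of (frame_metadata, embedding) pairs

    def add(self, metadata: dict, embedding: list):
        self._entries.append((metadata, embedding))

    @staticmethod
    def _cosine(a: list, b: list) -> float:
        """Cosine similarity between two equal-length vectors."""
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    def search(self, query_embedding: list, k: int = 5):
        """Return the top-k (score, metadata) pairs ranked by similarity."""
        scored = [(self._cosine(query_embedding, emb), meta)
                  for meta, emb in self._entries]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]
```

Because the metadata travels with each embedding, a search hit immediately yields the frame id and timestamp needed to seek into the source video.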
### 5. Visual and Semantic Search Experience 🔎
- Image-to-video search:
- A user provides a reference image.
- The image is sent to Azure Computer Vision 4 to obtain an embedding.
- A similarity query is executed against the stored frame embeddings.
- The system returns frames (with timestamps) ranked by similarity to the reference image.
- Text-to-video search:
- A user provides a natural language prompt (e.g., “a person walking on a beach at sunset”).
- The prompt is converted to an embedding using Florence’s multi-modal capabilities.
- A similarity query is executed against the frame embeddings to find the most semantically relevant scenes.
- Results include:
- Matching frames with preview thumbnails.
- Corresponding timestamps to allow seeking to the exact moment in the original video.
- Optional tags, captions, or confidence scores.
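The text-to-video path can be sketched with the companion `retrieval:vectorizeText` operation; the returned vector is then run through the same similarity query as frame embeddings. As before, the `api-version` value is an assumption that should be checked against your resource:

```python
import json

def build_vectorize_text_request(endpoint: str, key: str, prompt: str,
                                 api_version: str = "2023-02-01-preview"):
    """Assemble URL, headers, and JSON body for a vectorizeText call."""
    url = (f"{endpoint.rstrip('/')}/computervision/retrieval:vectorizeText"
           f"?api-version={api_version}")
    headers = {"Ocp-Apim-Subscription-Key": key,
               "Content-Type": "application/json"}
    body = json.dumps({"text": prompt}).encode("utf-8")
    return url, headers, body

def get_text_embedding(endpoint: str, key: str, prompt: str) -> list:
    """Call the Vision API and return the prompt's embedding vector."""
    import requests  # lazy import; only needed for the live call

    url, headers, body = build_vectorize_text_request(endpoint, key, prompt)
    resp = requests.post(url, headers=headers, data=body, timeout=30)
    resp.raise_for_status()
    return resp.json()["vector"]
```

Because Florence places text and image embeddings in a shared space, this vector can be compared directly against the stored frame embeddings.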
### 6. Web Application and User Interface 🌐
- A sample web application illustrates the end-to-end user experience:
- Upload or select videos for analysis.
- Trigger frame extraction and embedding generation.
- Perform searches using:
- A reference image upload.
- A natural language query.
- Inspect search results:
- View matching frames and their timestamps.
- Navigate back to the relevant segment of the original video.
- Display associated metadata when available.
This prototype is designed to showcase integration with the following Azure services (some are optional and can be adapted to your environment):
### Azure Computer Vision 4 (Florence)
- Core service for visual understanding and multi-modal (image + text) embeddings.
- Used to generate:
- Embeddings for frames.
- Embeddings for reference images.
- Embeddings for text prompts.
### Azure AI Vision Endpoint
- Provides the API endpoint and authentication (API key) for calling Florence.
- Typically configured via environment variables (e.g., `VISION_ENDPOINT`, `VISION_API_KEY`) or a configuration file.
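A minimal configuration loader for the two environment variables named above might look like this; the dictionary shape is an illustrative choice, not the repository's actual config API:

```python
import os

def load_vision_config() -> dict:
    """Read the Vision endpoint and key from the environment
    (variable names as documented in this README)."""
    endpoint = os.environ.get("VISION_ENDPOINT")
    key = os.environ.get("VISION_API_KEY")
    if not endpoint or not key:
        raise RuntimeError(
            "Set VISION_ENDPOINT and VISION_API_KEY before running the pipeline.")
    return {"endpoint": endpoint, "key": key}
```

Failing fast with a clear message avoids confusing authentication errors deeper in the embedding pipeline.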
### (Optional) Azure Blob Storage
- Can be used to:
- Store original video files.
- Store extracted frame images.
- Frame and video URLs can be persisted in the metadata store to avoid duplicating data.
### (Optional) Azure Database and/or Search Service
- Azure Cosmos DB or Azure SQL Database for:
- Storing video metadata.
- Storing frame metadata (timestamps, paths, tags, captions).
- A vector-aware index or search service (for example, a k-NN index layer) for:
- Efficient similarity search across embedding vectors.
- Scaling to larger volumes of videos and frames.
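For local prototyping, the metadata store can be sketched with `sqlite3` standing in for Cosmos DB or Azure SQL. The schema below is an illustrative assumption; embeddings are stored as JSON text here, whereas a real deployment would keep them in a vector-aware index:

```python
import json
import sqlite3

def init_metadata_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create a minimal frame-metadata table for the prototype."""
    con = sqlite3.connect(path)
    con.execute("""
        CREATE TABLE IF NOT EXISTS frames (
            frame_id    TEXT PRIMARY KEY,
            video_id    TEXT NOT NULL,
            timestamp_s REAL NOT NULL,
            image_path  TEXT,
            tags        TEXT,      -- JSON array of optional tags/captions
            embedding   TEXT NOT NULL  -- JSON array; a vector index in production
        )""")
    return con

def insert_frame(con, frame_id, video_id, timestamp_s, image_path, tags, embedding):
    """Persist one frame's metadata and embedding."""
    con.execute("INSERT INTO frames VALUES (?, ?, ?, ?, ?, ?)",
                (frame_id, video_id, timestamp_s, image_path,
                 json.dumps(tags), json.dumps(embedding)))
    con.commit()
```

Keeping `video_id` and `timestamp_s` on every row is what lets a similarity hit map straight back to a playable moment in the source video.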
Note: This repository focuses on the conceptual and practical integration with Azure Computer Vision 4 (Florence). The choice of database and vector indexing technology can be tailored to specific performance, cost, and operational requirements.
To run this prototype, you will typically need:
- An Azure subscription.
- An Azure AI Vision resource with access to Computer Vision 4 (Florence).
- API endpoint and key for the Azure AI Vision resource.
- A Python environment with:
  - `opencv-python` (for frame extraction).
  - `requests` or Azure SDK libraries (for calling the Vision API).
  - Web framework dependencies (e.g., `Flask`, `FastAPI`, or `Streamlit`) if using the provided web application.
You may also need:
- Access to Azure Blob Storage and a chosen database/search service if you want to persist and scale beyond local storage.
1. Place a video file in the expected input location (or configure the source path or Blob container).
2. Execute the frame extraction script to:
   - Extract frames using OpenCV.
   - Save frame images locally or to Azure Blob Storage.
3. Execute the embedding pipeline to:
   - Send each frame to Azure Computer Vision 4 (Florence).
   - Store embeddings and metadata in your chosen storage and index.
4. Start the web application.
5. Use the web UI to:
   - Upload a reference image or enter a text prompt.
   - Run similarity search against the indexed frame embeddings.
   - Inspect search results and navigate to the corresponding segments in the video.
For full details on Azure Computer Vision 4 (Florence) and Azure AI Vision, refer to the official Azure documentation.
Last updated: 2025-11-21 08:37:33 – Serge Retkowsky (serge.retkowsky@microsoft.com, LinkedIn profile).

