retkowsky/video-search-azure

Video Search with Azure Computer Vision 4 (Florence) 🎥🔍

This repository contains a prototype solution for analyzing and searching visual content in video files using Azure Computer Vision 4 (Florence) and related Azure services.

The solution demonstrates how to:

  • Extract frames from video files.
  • Generate vector embeddings for each frame using Azure Computer Vision 4.
  • Persist and index these embeddings for efficient similarity search.
  • Perform visual search using either a reference image or a natural language prompt.
  • Explore the results through a simple web application.

Architecture and Process ⚙️

The end-to-end process can be summarized as follows:

  1. Ingestion

    • A video file is provided locally or from a storage location such as Azure Blob Storage.
    • Basic metadata (file name, duration, frame rate, resolution) can be captured for reference and logging.
  2. Frame Extraction (OpenCV)

    • The video is processed using OpenCV.
    • Frames are extracted at a configurable interval (for example, every N frames or every T seconds).
    • Each extracted frame is assigned:
      • A unique identifier.
      • A timestamp or frame index, to allow seeking back into the original video.
    • Optionally, frames can be stored:
      • Locally on disk (for quick prototyping).
      • In Azure Blob Storage, with their URLs stored alongside metadata.
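The extraction step above can be sketched with OpenCV roughly as follows. Function and parameter names here are illustrative, not taken from the repository; `cv2` is imported lazily inside the generator so the sketch can be loaded even without OpenCV installed.

```python
def frame_timestamp(frame_index: int, fps: float) -> float:
    """Map a frame index back to a timestamp (seconds) in the source video."""
    return frame_index / fps

def extract_frames(video_path: str, every_n: int = 30):
    """Yield (frame_index, timestamp_seconds, frame) for every Nth frame."""
    import cv2  # imported lazily so this sketch loads without OpenCV present

    capture = cv2.VideoCapture(video_path)
    fps = capture.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if metadata is missing
    index = 0
    try:
        while True:
            ok, frame = capture.read()
            if not ok:  # end of stream (or a read error)
                break
            if index % every_n == 0:
                yield index, frame_timestamp(index, fps), frame
            index += 1
    finally:
        capture.release()
```

Extracting every Nth frame (rather than every T seconds) keeps the loop independent of variable frame rates; the timestamp computed from the frame index is what allows seeking back into the original video later.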
  3. Feature Extraction (Azure Computer Vision 4 – Florence)

    • Each frame image is sent to Azure Computer Vision 4 (Florence) via the Azure AI Vision API.
    • The service returns:
      • Vector embeddings representing the visual content of the frame.
      • Optionally, additional information such as tags, captions, or detected objects (depending on the API call).
    • These embeddings are the foundation for visual and semantic similarity search across the video.
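A call to the Image Retrieval (vectorize) operation can be sketched with only the standard library as below. The API path and version string reflect the preview API available when this was written and should be checked against the current Azure documentation; the function name is illustrative.

```python
import json
import urllib.request

# "vectorizeImage" operation of the Image Retrieval API; the api-version is a
# preview value and may need updating to match the current documentation.
VECTORIZE_IMAGE_PATH = (
    "/computervision/retrieval:vectorizeImage"
    "?api-version=2023-02-01-preview&modelVersion=latest"
)

def vectorize_image(endpoint: str, api_key: str, image_bytes: bytes) -> list:
    """POST one frame's bytes to the Vision endpoint and return its embedding."""
    request = urllib.request.Request(
        endpoint.rstrip("/") + VECTORIZE_IMAGE_PATH,
        data=image_bytes,
        headers={
            "Ocp-Apim-Subscription-Key": api_key,
            "Content-Type": "application/octet-stream",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["vector"]
```

The response JSON carries the embedding in a `vector` field; that list of floats is what gets persisted and indexed in the next step.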
  4. Storage and Indexing of Embeddings

    • For each frame, the following information can be stored:
      • Video identifier.
      • Frame identifier and timestamp.
      • Embedding vector.
      • Optional tags, captions, or detected objects.
    • Typical storage and indexing architecture:
      • Metadata store (e.g., Azure Cosmos DB, Azure SQL Database, or another database) for:
        • Video and frame metadata.
        • References to frame image locations (local or Blob Storage).
      • Vector index (e.g., a vector-capable search engine or custom index) to:
        • Store and index high-dimensional embeddings.
        • Support k-nearest-neighbor (k-NN) or cosine similarity search over embeddings.
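For small prototypes, the k-NN step can be done in memory before introducing a dedicated vector index. A minimal sketch, assuming each stored frame is a `(frame_id, timestamp, embedding)` tuple (names are illustrative):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_embedding, frames, k=5):
    """Rank stored frames by similarity to the query embedding.

    `frames` is an iterable of (frame_id, timestamp_seconds, embedding).
    Returns the k best (score, frame_id, timestamp_seconds) tuples.
    """
    scored = [
        (cosine_similarity(query_embedding, embedding), frame_id, timestamp)
        for frame_id, timestamp, embedding in frames
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]
```

A brute-force scan like this is fine for thousands of frames; beyond that, a vector-capable search service with an approximate k-NN index avoids scoring every embedding per query.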
  5. Visual and Semantic Search Experience 🔎

    • Image-to-video search:
      • A user provides a reference image.
      • The image is sent to Azure Computer Vision 4 to obtain an embedding.
      • A similarity query is executed against the stored frame embeddings.
      • The system returns frames (with timestamps) ranked by similarity to the reference image.
    • Text-to-video search:
      • A user provides a natural language prompt (e.g., “a person walking on a beach at sunset”).
      • The prompt is converted to an embedding using Florence’s multi-modal capabilities.
      • A similarity query is executed against the frame embeddings to find the most semantically relevant scenes.
    • Results include:
      • Matching frames with preview thumbnails.
      • Corresponding timestamps to allow seeking to the exact moment in the original video.
      • Optional tags, captions, or confidence scores.
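Text-to-video search uses the companion "vectorizeText" operation; because Florence places text and image embeddings in the same vector space, the same frame index serves both search modes. As with the image call, the API version shown is a preview value to verify against current documentation, and the function name is illustrative.

```python
import json
import urllib.request

# "vectorizeText" operation; its embeddings share a vector space with the
# image embeddings, so one index answers both image and text queries.
VECTORIZE_TEXT_PATH = (
    "/computervision/retrieval:vectorizeText"
    "?api-version=2023-02-01-preview&modelVersion=latest"
)

def vectorize_text(endpoint: str, api_key: str, prompt: str) -> list:
    """POST a natural-language prompt and return its embedding vector."""
    request = urllib.request.Request(
        endpoint.rstrip("/") + VECTORIZE_TEXT_PATH,
        data=json.dumps({"text": prompt}).encode("utf-8"),
        headers={
            "Ocp-Apim-Subscription-Key": api_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["vector"]
```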
  6. Web Application and User Interface 🌐

    • A sample web application illustrates the end-to-end user experience:
      • Upload or select videos for analysis.
      • Trigger frame extraction and embedding generation.
      • Perform searches using:
        • A reference image upload.
        • A natural language query.
      • Inspect search results:
        • View matching frames and their timestamps.
        • Navigate back to the relevant segment of the original video.
        • Display associated metadata when available.
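One way to expose the search flow to a UI is a small JSON endpoint. The sketch below assumes Flask and a caller-supplied `search_fn` that returns `(frame_id, timestamp, score)` tuples; the route, helper names, and response shape are illustrative, not the repository's actual API.

```python
def format_result(frame_id, timestamp_s, score):
    """Shape one match for the UI: frame id, seek position, similarity score."""
    minutes, seconds = divmod(int(timestamp_s), 60)
    return {
        "frame": frame_id,
        "seek": f"{minutes:02d}:{seconds:02d}",  # lets the player jump to the moment
        "score": round(score, 3),
    }

def create_app(search_fn):
    """Build a minimal search API around a similarity-search callable."""
    from flask import Flask, jsonify, request  # lazy import: sketch loads without Flask

    app = Flask(__name__)

    @app.post("/search")
    def search():
        prompt = request.json["query"]
        return jsonify([format_result(*hit) for hit in search_fn(prompt)])

    return app
```

Returning timestamps as `MM:SS` strings is purely a presentation choice; the raw seconds value is what a video player's seek API would consume.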

Azure Services Used ☁️

This prototype is designed to showcase integration with the following Azure services (some are optional and can be adapted to your environment):

  • Azure Computer Vision 4 (Florence)

    • Core service for visual understanding and multi-modal (image + text) embeddings.
    • Used to generate:
      • Embeddings for frames.
      • Embeddings for reference images.
      • Embeddings for text prompts.
  • Azure AI Vision Endpoint

    • Provides the API endpoint and authentication (API key) for calling Florence.
    • Typically configured via environment variables (e.g., VISION_ENDPOINT, VISION_API_KEY) or a configuration file.
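Reading that configuration with a clear failure message keeps missing-credential errors from surfacing deep inside an API call. A small sketch (the helper name is illustrative; the variable names match those mentioned above):

```python
import os

def load_vision_config(env=os.environ):
    """Return (endpoint, api_key), failing fast if either setting is absent."""
    try:
        return env["VISION_ENDPOINT"], env["VISION_API_KEY"]
    except KeyError as missing:
        raise RuntimeError(f"Missing required setting: {missing}") from None
```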
  • (Optional) Azure Blob Storage

    • Can be used to:
      • Store original video files.
      • Store extracted frame images.
    • Frame and video URLs can be persisted in the metadata store to avoid duplicating data.
  • (Optional) Azure Database and/or Search Service

    • Azure Cosmos DB or Azure SQL Database for:
      • Storing video metadata.
      • Storing frame metadata (timestamps, paths, tags, captions).
    • A vector-aware index or search service (for example, a k-NN index layer) for:
      • Efficient similarity search across embedding vectors.
      • Scaling to larger volumes of videos and frames.

Note: This repository focuses on the conceptual and practical integration with Azure Computer Vision 4 (Florence). The choice of database and vector indexing technology can be tailored to specific performance, cost, and operational requirements.


Prerequisites ✅

To run this prototype, you will typically need:

  • An Azure subscription.
  • An Azure AI Vision resource with access to Computer Vision 4 (Florence).
  • API endpoint and key for the Azure AI Vision resource.
  • A Python environment with:
    • opencv-python (for frame extraction).
    • requests or Azure SDK libraries (for calling the Vision API).
    • Web framework dependencies (e.g., Flask, FastAPI, or Streamlit) if using the provided web application.

You may also need:

  • Access to Azure Blob Storage and a chosen database/search service if you want to persist and scale beyond local storage.

Example Workflow 🚀

  1. Place a video file in the expected input location (or configure the source path or Blob container).
  2. Execute the frame extraction script to:
    • Extract frames using OpenCV.
    • Save frame images locally or to Azure Blob Storage.
  3. Execute the embedding pipeline to:
    • Send each frame to Azure Computer Vision 4 (Florence).
    • Store embeddings and metadata in your chosen storage and index.
  4. Start the web application.
  5. Use the web UI to:
    • Upload a reference image or enter a text prompt.
    • Run similarity search against the indexed frame embeddings.
    • Inspect search results and navigate to the corresponding segments in the video.

Screenshots of the Web Application 🖼️

Search interface example

Search results example


Documentation 📚

For full details on Azure Computer Vision 4 (Florence) and Azure AI Vision, refer to the official Microsoft documentation.

Last updated: 2025-11-21 08:37:33 – Serge Retkowsky (serge.retkowsky@microsoft.com, LinkedIn profile).
