
FlickSync

Reverse Video Search using Timesformer & FAISS

This project implements a video similarity search system using the Timesformer transformer model (pretrained on Kinetics-400) to generate video embeddings, and FAISS for efficient nearest neighbor search. Users can upload a video, and the app will return visually similar videos from the UCF101 dataset using precomputed embedding indexes.
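At its core, the retrieval step is nearest-neighbor search over embedding vectors. A minimal NumPy sketch of what a FAISS inner-product search over normalized embeddings computes (the array names and toy vectors here are illustrative, not taken from the repo):

```python
import numpy as np

def top_k_similar(query: np.ndarray, db: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k rows of `db` most similar to `query`.

    Both sides are L2-normalized first, so the inner product equals
    cosine similarity -- the same scheme as a FAISS IndexFlatIP built
    over normalized embeddings.
    """
    q = query / np.linalg.norm(query)
    d = db / np.linalg.norm(db, axis=1, keepdims=True)
    scores = d @ q                      # cosine similarity per video
    return np.argsort(-scores)[:k]     # highest-scoring indices first

# Toy example: 4 "video embeddings" in 3-D; the query matches row 1 best.
db = np.array([[1.0, 0.0, 0.0],
               [0.0, 1.0, 0.0],
               [0.0, 0.9, 0.1],
               [0.0, 0.0, 1.0]])
query = np.array([0.0, 1.0, 0.05])
print(top_k_similar(query, db, k=2))  # -> [1 2]
```

In the app, `db` corresponds to the precomputed Timesformer embeddings and the returned indices map back to video files in the dataset.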


✨ Features

  • Video Embedding: Uses Timesformer to extract powerful video representations.
  • Similarity Search: Efficiently retrieves similar videos using FAISS vector search.
  • Interactive Frontend: Built with Streamlit for easy video upload and result visualization.
  • GIF Previews: Generates GIF previews for both uploaded and retrieved videos.
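GIF previews like the ones the app displays can be produced from a list of frames with Pillow; a minimal sketch (the repo may use a different library, and the solid-color frames and file name here are stand-ins for real video frames):

```python
import os
import tempfile
from PIL import Image

def frames_to_gif(frames, path, duration_ms=100):
    """Write a list of PIL images as a looping animated GIF preview."""
    frames[0].save(path, save_all=True, append_images=frames[1:],
                   duration=duration_ms, loop=0)

# Toy frames: five solid-color images standing in for sampled video frames.
frames = [Image.new("RGB", (32, 32), (i * 40, 0, 0)) for i in range(5)]
path = os.path.join(tempfile.gettempdir(), "preview.gif")
frames_to_gif(frames, path)
print(Image.open(path).n_frames)  # -> 5
```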

📂 Dataset

  • The system uses the UCF101 action recognition dataset, which contains 13,320 videos across 101 action categories.

  • The embedder.ipynb notebook (inside src/) supports generating Timesformer embeddings for all 101 classes, enabling full-scale similarity search.

  • For a quick test and faster demo experience, a precomputed FAISS index is included in the demo_folder/embeddings/ directory.

  • This allows the app to run immediately without requiring full dataset processing.
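Conceptually, the precomputed artifact is just a matrix of per-video embeddings plus the matching video identifiers, loaded once at app startup so no re-embedding is needed. A NumPy sketch of that save/load round-trip (file names, dimensions, and paths below are illustrative, not the repo's actual layout):

```python
import os
import tempfile
import numpy as np

rng = np.random.default_rng(0)
# 10 videos, 768-d features (768 is the Timesformer base hidden size).
embeddings = rng.normal(size=(10, 768)).astype("float32")
paths = np.array([f"video_{i:03d}.avi" for i in range(10)])

out = os.path.join(tempfile.gettempdir(), "ucf101_demo_embeddings.npz")
np.savez(out, embeddings=embeddings, paths=paths)

# At app startup: load once, then search entirely in memory.
data = np.load(out)
print(data["embeddings"].shape, data["paths"][0])  # -> (10, 768) video_000.avi
```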


🛠️ Getting Started

Requirements:

  • Python 3.8+
  • Jupyter Notebook
  • PyTorch
  • transformers, faiss, streamlit, pandas, scikit-learn, and other standard ML libraries

Setup:

  1. Clone the repository.
  2. Install dependencies:
pip install -r requirements.txt
  3. Run embedder.ipynb to generate embeddings for the videos.
  4. Use frontend.py to search for similar videos and compare the different pooling strategies:
streamlit run frontend.py

🧠 Models

  • Timesformer (default, Hugging Face)

  • Easily extensible to other video transformer models

  • Leverages CLS pooling to generate contextually informed embeddings
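In Hugging Face's Timesformer, `last_hidden_state` has shape `(batch, tokens, hidden)` with the CLS token at position 0, so CLS pooling is a single slice. A NumPy sketch of just the pooling step (the random tensor stands in for real model output; 1 CLS + 1568 patch tokens and 768 dims match the base configuration):

```python
import numpy as np

# Stand-in for model(**inputs).last_hidden_state:
# batch of 2 videos, 1 CLS token + 1568 patch tokens, 768-d hidden size.
rng = np.random.default_rng(0)
last_hidden_state = rng.normal(size=(2, 1569, 768)).astype("float32")

cls_embeddings = last_hidden_state[:, 0, :]  # CLS pooling: take token 0
print(cls_embeddings.shape)  # -> (2, 768)
```

The resulting per-video vectors are what get L2-normalized and indexed with FAISS.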


📊 Results

  • Retrieves and displays the top-k most similar videos to a given query using transformer-based embeddings and FAISS.

  • Visual previews (GIFs) make it easy to assess retrieval quality.

  • Achieves high retrieval accuracy, with combined recall@1: 0.9797, recall@3: 0.9737, and recall@5: 0.9564, indicating that the correct class is almost always among the top results.

  • Recall@k measures how often the correct item appears within the top-k retrieved results. A higher recall@k indicates better retrieval performance, meaning the system is more likely to present relevant results to the user quickly.
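Concretely, recall@k here checks whether the query's correct class appears among the top-k retrieved videos. A small sketch of that metric (the labels below are toy values, not UCF101 evaluation data):

```python
def recall_at_k(query_labels, retrieved_labels, k):
    """Fraction of queries whose correct class appears in the top-k results.

    `retrieved_labels[i]` is the ranked list of class labels returned
    for query i.
    """
    hits = sum(1 for q, r in zip(query_labels, retrieved_labels) if q in r[:k])
    return hits / len(query_labels)

# Toy run: 4 queries, each with 3 ranked results.
queries = ["jump", "swim", "run", "golf"]
results = [["jump", "run", "swim"],   # hit at rank 1
           ["run", "swim", "golf"],   # hit at rank 2
           ["golf", "jump", "run"],   # hit at rank 3
           ["jump", "swim", "run"]]   # miss
print(recall_at_k(queries, results, 1))  # -> 0.25
print(recall_at_k(queries, results, 3))  # -> 0.75
```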

  • The system is efficient and scalable, capable of handling large video datasets and real-time search scenarios by indexing normalized embeddings with FAISS.