Awesome Multimodal Search
A curated collection of 🔍 libraries, ☁️ platforms, 📖 research, 📊 benchmarks, and 📚 tutorials focused on Multimodal Search — enabling semantic retrieval across images, video, audio, and documents.
📢 Stay updated on multimodal search trends! Subscribe to the Mixpeek newsletter for the latest developments in multimodal AI.
| Name | Description | Links |
| --- | --- | --- |
| Jina AI | Flow-based neural search framework for text, image, video, and audio. | GitHub · Website |
| Weaviate | Vector DB with modules for image, text, and audio embeddings (e.g. CLIP, ImageBind). | GitHub · Website |
| Towhee | Multimodal data pipelines with 100+ pretrained models. | GitHub · Website |
| CLIP Retrieval | Lightweight toolkit to search CLIP-embedded LAION datasets. | GitHub · Demo |
| Qdrant | Vector database with multimodal search capabilities and filtering. | GitHub · Website |
| Milvus | Open-source vector database for embedding similarity search. | GitHub · Website |
| Vespa | Real-time search and recommendation engine with multimodal capabilities. | GitHub · Website |
| ChromaDB | Embedding database for building AI applications with multimodal data. | GitHub · Website |
| LlamaIndex | Data framework for connecting custom data to LLMs with multimodal retrieval. | GitHub · Docs |
| LangChain | Framework for developing applications with LLMs and multimodal retrieval. | GitHub · Website |
| DocArray | Data structure for multimodal and nested data, pairs with Jina. | GitHub · Docs |
| Haystack | End-to-end framework for building search pipelines with multimodal support. | GitHub · Website |
| FAISS | Library for efficient similarity search from Meta Research, supports image vectors. | GitHub · Docs |
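
Most of the libraries above ultimately answer one question: given a query embedding, which stored embeddings are closest? The sketch below shows that core operation as a brute-force cosine-similarity search in plain Python. It is an illustration only and not the API of any listed library; the function names (`cosine_similarity`, `search`), the file names, and the 3-dimensional vectors are all made up for the example. Real systems use learned embeddings (for example from CLIP) and approximate indexes rather than a full scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def search(query_vec, index, top_k=3):
    """Score every stored vector against the query; return the best top_k."""
    scored = [
        (item_id, cosine_similarity(query_vec, vec))
        for item_id, vec in index.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy "embeddings": items from different modalities share one
# vector space, which is what makes cross-modal search possible.
index = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "dog.mp4": [0.7, 0.6, 0.1],
    "jazz.wav": [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.0]  # stand-in for an embedded text query

results = search(query, index, top_k=2)
for item_id, score in results:
    print(f"{item_id}: {score:.3f}")
```

A full scan like this costs one similarity computation per stored item, which is why the vector databases in this list rely on approximate nearest-neighbor indexes once collections grow large.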
| Name | Modalities | Links | Notes |
| --- | --- | --- | --- |
| OpenAI API | Text, image (GPT-4V), audio (Whisper) | Docs | Supports RAG + embeddings |
| Vertex AI (Google) | Image + Text | Docs | CoCa model embeddings |
| AWS Rekognition + Kendra + Transcribe | Image, text, audio | Rekognition · Kendra | Modular pipeline for multimodal search |
| Pinecone | Vector database supporting text, image, audio embeddings | Website | Hybrid search with metadata filtering |
| Mixpeek | Text, image, video, audio, PDF, time series, tabular | Website · Docs | Multimodal data warehouse with 25+ specialized feature extractors (face grouping, object tracking, scene detection, etc.), automatic model upgrades, and cross-modal correlation capabilities |
| Microsoft Azure AI Search | Text, images, PDFs, audio transcription | Docs | Cognitive search capabilities |
| Anthropic Claude API | Text + image understanding | Docs | Claude 3 Opus/Sonnet/Haiku models |
| Cohere | Text embeddings with multilingual support | Website | Embed, Rerank, and Generate APIs |
| Supabase Vector | Vector embeddings in Postgres | Docs | pgvector integration |
| Vectara | Managed neural search platform | Website | Zero-shot cross-modal search |
| Zilliz Cloud | Managed Milvus service for vector search | Website | Enterprise-grade vector DB service |
| Algolia | Search API with AI-powered vector search | Website | Hybrid keyword + semantic search |
| Elastic AI Search | Enterprise search with vector capabilities | Website | ELSER and vector search capabilities |
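
Several platforms above advertise hybrid keyword + semantic search. One common way to merge a keyword ranking with a vector ranking is reciprocal rank fusion (RRF), sketched below. This is a generic illustration and not a claim about how any listed platform implements hybrid search; the function name, the document IDs, and the choice of `k=60` (a frequently used default) are assumptions for the example.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs into one list, best first.

    Each item earns 1 / (k + rank) from every list it appears in,
    where rank starts at 1. Items absent from a list earn nothing from it.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical result lists from two retrievers for the same query.
keyword_hits = ["doc_a", "doc_b", "doc_c"]  # e.g. from BM25
vector_hits = ["doc_c", "doc_a", "doc_d"]   # e.g. from embedding search

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

RRF uses only rank positions, so it needs no calibration between keyword scores and similarity scores, which live on different scales.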
📊 Benchmarks & Leaderboards
| Benchmark | Modality | Metric | Example |
| --- | --- | --- | --- |
| MS COCO | Image–Text | R@1, R@5, R@10 | BLIP-2 > 80% R@1 |
| MSR-VTT | Video–Text | R@1, R@5 | Marengo > 60% R@1 |
| Clotho, AudioCaps | Audio–Text | mAP@10, R@10 | CLAP ~0.21 mAP |
| Wiki-SS | Document Screenshots | Top-1 Accuracy | DSE 49% top-1 |
| Flickr30k | Image-Text | R@1, R@5, R@10 | CLIP ~65% R@1 |
| MSMARCO | Text-Image | MRR@10, nDCG@10 | RankFusion ~0.4 MRR |
| VQAv2 | Image-Question-Answer | Accuracy | LLaVA ~80% |
| MTEB | Multimodal tasks | Avg. performance | BGE ~65% avg |
| MSCOCO Captioning | Image-Text | BLEU, METEOR, CIDEr | CoCa 143.6 CIDEr |
| DiDeMo | Video-Text | R@1, R@5 | CLIP4Clip ~45% R@1 |
| AudioSet | Audio classification | mAP | ImageBind ~0.44 mAP |
| SentEval | Text embeddings | Accuracy | OpenAI text-embedding-3 ~87% |
| HowTo100M | Video-Text | R@1, R@5 | VideoCLIP ~32% R@1 |
| ImageNet | Image classification | Top-1, Top-5 | CLIP ~76% Top-1 |
| BEIR | Text retrieval | nDCG@10 | GTR ~66% nDCG |
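
The retrieval metrics named in this table (R@K, MRR@K, nDCG@K) can be computed from a ranked result list and a set of relevant items per query. The sketch below is a minimal version assuming binary relevance (an item is either relevant or not); the helper names and the toy queries are invented for illustration, and individual benchmarks may define details such as tie handling or graded relevance differently.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def reciprocal_rank_at_k(ranked, relevant, k):
    """1 / rank of the first relevant item within the top k, else 0."""
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, relevant, k):
    """nDCG with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(ranked[:k], start=1)
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def mean_over_queries(metric, runs, k):
    """Average a per-query metric over (ranked_list, relevant_set) pairs."""
    return sum(metric(ranked, relevant, k) for ranked, relevant in runs) / len(runs)


# Two hypothetical queries, each with exactly one relevant item.
runs = [
    (["a", "b", "c", "d"], {"a"}),  # relevant item ranked 1st
    (["x", "y", "z", "w"], {"z"}),  # relevant item ranked 3rd
]

r_at_1 = mean_over_queries(recall_at_k, runs, 1)
r_at_3 = mean_over_queries(recall_at_k, runs, 3)
mrr_at_10 = mean_over_queries(reciprocal_rank_at_k, runs, 10)
ndcg_at_10 = mean_over_queries(ndcg_at_k, runs, 10)
print(r_at_1, r_at_3, mrr_at_10, ndcg_at_10)
```

With one relevant item per query, as in this toy data, R@K is simply the share of queries whose correct item lands in the top K, which is how cross-modal retrieval results are commonly reported.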
| Title | Modality | Links |
| --- | --- | --- |
| ImageBind + Deep Lake | Unified search | Tutorial |
| Pinecone + CLIP | Text–Image | Blog |
| Mixpeek Reverse Video Search | Video-Video | Tutorial |
| Jina Hello Multimodal | Text + Image | Code |
| RAG + CLIP + OpenAI | Multimodal RAG | Colab |
| LangChain Multimodal RAG | Text, Image, Video | Tutorial |
| Hugging Face CLIP Demo | Text-Image | Demo |
| Building Multimodal Search Engines | Text, Image | Course |
| FAISS Tutorial with Images | Image similarity | Tutorial |
| Video Search with PyTorch | Video retrieval | Tutorial |
| Milvus Bootcamp | Vector search | Bootcamp |
| ChromaDB Multimodal Examples | Text, Image | Cookbook |
| LlamaIndex Multimodal Guide | Text, Image, PDF | Guide |
| Vespa Image Search Tutorial | Image similarity | Tutorial |
| ImageBind Zero-Shot Classification | All modalities | Colab |
| Haystack Multimodal Pipelines | Text, Image, Audio | Tutorial |
📰 Multimodal Monday Blog Posts
| Title | Date | Author | Summary | Link |
| --- | --- | --- | --- | --- |
| Multimodal Monday #3 — Scaling Multimodal AI: Laws, Lightweights & Large Releases | Apr 14, 2025 | Philip Bankier | Apple's new scaling law research redefines how multimodal models are built, while Moonshot and OpenGVLab drop powerful open-source VLMs with reasoning and tool-use. | Read More |
| Multimodal Monday #2 — From Tiny VLMs to 10M‑Token Titans | Apr 6, 2025 | Ethan Steininger | Major multimodal model releases including Meta's Llama 4 Scout & Maverick and Microsoft's Phi-4-Multimodal, marking the start of a new era of natively multimodal AI. | Read More |
| Multimodal Monday #1 - State of the Stack | - | - | Researchers introducing new methods to replace embeddings with discrete IDs for faster cross-modal search. | Read More |
📬 Contributions welcome! PRs and issues encouraged.