Text embeddings convert sequences (words, sentences, documents) into dense vector representations that capture semantic meaning. These vectors enable similarity search, clustering, classification, and retrieval tasks.
Modern embedding models go beyond simple word vectors to capture contextual meaning, multi-lingual semantics, and even multiple retrieval modalities in a single representation.
Use embeddings when you need:
- Semantic Search: Find documents similar to a query
- Clustering: Group similar texts together
- Classification: Use embeddings as features for classifiers
- Recommendation: Find similar items based on descriptions
- Retrieval-Augmented Generation (RAG): Retrieve relevant context for LLMs
Matryoshka Representation Learning (MRL) creates nested embeddings in which shorter truncations (prefixes of the full vector) retain most of the full model's performance, enabling flexible accuracy-efficiency tradeoffs.
Strengths:
- Single model supports multiple dimensions
- No retraining for different sizes
- Graceful performance degradation
- Storage and compute efficiency
Weaknesses:
- Requires special training procedure
- Slightly lower performance at very low dimensions
- Need to choose dimensions at deployment
Use when: You need flexible embedding sizes for different deployment scenarios (edge devices, servers) or want to optimize storage/compute costs.
See: matryoshka_representation_learning.md
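The truncation idea above can be sketched in plain NumPy. The vectors here are random stand-ins, not real model outputs (a trained MRL model orders information so the leading dimensions matter most); the dimensions chosen are illustrative:

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of a Matryoshka embedding
    and re-normalize so cosine similarity stays well-defined."""
    head = vec[:dim]
    return head / np.linalg.norm(head)

rng = np.random.default_rng(0)
a = rng.normal(size=768)
b = a + 0.1 * rng.normal(size=768)  # a near-duplicate of `a`

for dim in (64, 256, 768):
    sim = float(truncate_embedding(a, dim) @ truncate_embedding(b, dim))
    print(f"dim={dim:4d}  cosine={sim:.3f}")
```

Because both vectors are re-normalized after truncation, the truncated similarity stays a valid cosine score; only its fidelity to the full-dimension score degrades as `dim` shrinks.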
BGE-M3 is a unified embedding model supporting three retrieval modes: dense single-vector, sparse (learned lexical weights), and multi-vector (ColBERT-style late interaction).
Strengths:
- Multi-functionality (3 retrieval modes in one)
- Multi-lingual (100+ languages)
- State-of-the-art retrieval performance
- Hybrid search capabilities
Weaknesses:
- Larger model size
- Higher computational cost
- More complex deployment
- Requires index support for all modes
Use when: You need maximum retrieval quality, multi-lingual support, or want to experiment with hybrid retrieval strategies.
See: bge_m3.md
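How the three modes combine into one hybrid score can be sketched with NumPy. This is not the FlagEmbedding API; the scoring functions and the weights are illustrative stand-ins under the usual definitions (cosine for dense, weighted term overlap for sparse, MaxSim for ColBERT):

```python
import numpy as np

def dense_score(q: np.ndarray, d: np.ndarray) -> float:
    """Cosine similarity between single-vector embeddings."""
    return float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d)))

def sparse_score(q: dict, d: dict) -> float:
    """Lexical match: sum of products of learned term weights."""
    return sum(w * d[t] for t, w in q.items() if t in d)

def colbert_score(q: np.ndarray, d: np.ndarray) -> float:
    """Late interaction (MaxSim): each query token vector takes its
    best-matching document token vector; the maxima are summed."""
    return float((q @ d.T).max(axis=1).sum())

def hybrid_score(scores, weights=(0.4, 0.3, 0.3)):
    # Illustrative weights; tune them on your own data.
    return sum(w * s for w, s in zip(weights, scores))

# Toy example inputs.
q_dense, d_dense = np.array([1.0, 0.0]), np.array([1.0, 0.0])
q_sparse = {"matryoshka": 0.8}
d_sparse = {"matryoshka": 0.5, "embedding": 0.2}
q_tok, d_tok = np.eye(2), np.eye(2)  # two token vectors each

scores = (dense_score(q_dense, d_dense),
          sparse_score(q_sparse, d_sparse),
          colbert_score(q_tok, d_tok))
print(hybrid_score(scores))  # 0.4*1.0 + 0.3*0.4 + 0.3*2.0 = 1.12
```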
| Feature | Matryoshka | BGE-M3 |
|---|---|---|
| Flexibility | Variable dimensions | Multiple retrieval modes |
| Performance | Good (single mode) | Excellent (hybrid) |
| Efficiency | High (truncatable) | Medium (multi-head) |
| Training Complexity | Medium | High |
| Deployment Complexity | Low | High |
| Multi-lingual | Depends on base model | Native (100+ langs) |
| Best Use Case | Efficiency-focused | Quality-focused |
Matryoshka best practices:
- Dimension Selection: Test multiple truncation dimensions on your own dataset to find the best accuracy-efficiency tradeoff
- Training Strategy: Use uniform loss weighting initially, then adjust based on target dimensions
- Normalization: Always normalize embeddings before similarity computation
- Benchmarking: Evaluate at all nesting dimensions during development
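Benchmarking at every nesting dimension can be done with a small loop. The data here is synthetic (queries are noisy copies of the first 20 documents, so document `i` is the correct answer for query `i`); with a real MRL model you would substitute its embeddings:

```python
import numpy as np

rng = np.random.default_rng(1)
docs = rng.normal(size=(100, 512))
queries = docs[:20] + 0.3 * rng.normal(size=(20, 512))

def recall_at_1(q: np.ndarray, d: np.ndarray) -> float:
    """Fraction of queries whose top-1 neighbor is the right doc."""
    qn = q / np.linalg.norm(q, axis=1, keepdims=True)
    dn = d / np.linalg.norm(d, axis=1, keepdims=True)
    top1 = (qn @ dn.T).argmax(axis=1)
    return float((top1 == np.arange(len(q))).mean())

for dim in (32, 64, 128, 512):
    r = recall_at_1(queries[:, :dim], docs[:, :dim])
    print(f"dim={dim:4d}  recall@1={r:.2f}")
```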
BGE-M3 best practices:
- Mode Selection: Start with dense-only retrieval, then add sparse/ColBERT modes for difficult queries
- Weight Tuning: Adjust hybrid weights (dense/sparse/colbert) based on your data
- Index Design: Use specialized indices (HNSW for dense, inverted index for sparse)
- Query Analysis: Route queries to appropriate modes based on characteristics
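Query routing can start as a simple heuristic over surface features. The rules and thresholds below are illustrative placeholders, not tuned values; a production router would learn them from query logs:

```python
def route_query(query: str) -> set[str]:
    """Pick retrieval modes based on surface features of the query.
    All heuristics and thresholds here are illustrative."""
    modes = {"dense"}  # dense retrieval as the default for all queries
    tokens = query.split()
    # Short, keyword-like queries benefit from exact lexical matching.
    if len(tokens) <= 3:
        modes.add("sparse")
    # Quoted phrases and code-like identifiers also suggest sparse mode.
    if '"' in query or any("_" in t or t.isupper() for t in tokens):
        modes.add("sparse")
    # Long natural-language questions benefit from late interaction.
    if len(tokens) >= 10:
        modes.add("colbert")
    return modes

print(route_query("ERROR_CODE 42"))
print(route_query("how do matryoshka embeddings trade off accuracy for storage and compute cost"))
```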
General best practices:
- Batch Processing: Process embeddings in batches for efficiency
- Caching: Cache embeddings for frequently accessed documents
- Normalization: Normalize embeddings so that cosine similarity reduces to a plain dot product
- Dimensionality: Higher dimensions aren't always better; validate on your task
- Fine-tuning: Fine-tune on domain-specific data when possible
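Batching, caching, and normalization fit naturally into one embedding helper. The `fake_encode` function below is a hash-based stand-in so the sketch runs without a model; in practice you would call your embedding model there:

```python
import hashlib

import numpy as np

_cache: dict[str, np.ndarray] = {}

def fake_encode(batch: list[str]) -> np.ndarray:
    """Stand-in for a real model call: deterministic 32-dim vectors
    derived from a SHA-256 hash, just to keep the sketch runnable."""
    return np.stack([
        np.frombuffer(hashlib.sha256(t.encode()).digest(),
                      dtype=np.uint8).astype(np.float32)
        for t in batch
    ])

def embed(texts: list[str], batch_size: int = 64) -> np.ndarray:
    """Encode texts in batches, cache results, and L2-normalize so
    cosine similarity is just a dot product."""
    missing = [t for t in dict.fromkeys(texts) if t not in _cache]
    for i in range(0, len(missing), batch_size):
        batch = missing[i:i + batch_size]
        vecs = fake_encode(batch)
        vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
        _cache.update(zip(batch, vecs))
    return np.stack([_cache[t] for t in texts])
```

Deduplicating before encoding (`dict.fromkeys`) means repeated documents hit the cache instead of the model, which is where most of the savings come from in document-heavy workloads.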
Evaluation metrics:
- Retrieval Accuracy: Recall@k, MRR, NDCG
- Clustering Quality: Silhouette score, Davies-Bouldin index
- Semantic Similarity: Correlation with human judgments
- Cross-lingual Transfer: Performance on multi-lingual tasks
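The retrieval metrics above are short enough to implement directly. This sketch uses binary relevance (a document is either relevant or not), which is the common setup for retrieval benchmarks:

```python
import math

def recall_at_k(ranked: list, relevant: set, k: int) -> float:
    """Fraction of relevant items found in the top-k results."""
    return len(set(ranked[:k]) & relevant) / len(relevant)

def mrr(ranked: list, relevant: set) -> float:
    """Reciprocal rank of the first relevant result (0 if none)."""
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / i
    return 0.0

def ndcg_at_k(ranked: list, relevant: set, k: int) -> float:
    """Binary-relevance NDCG: DCG of the ranking divided by the
    DCG of an ideal ranking of the same relevant items."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, doc in enumerate(ranked[:k], start=1)
              if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(recall_at_k(ranked, relevant, 3))  # 0.5 (only d1 in top 3)
print(mrr(ranked, relevant))             # 0.5 (first hit at rank 2)
```

In a full evaluation these per-query scores are averaged over the whole query set.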
Common pitfalls:
- Over-dimensioning: Using unnecessarily large embeddings when a smaller size would do
- No Fine-tuning: Generic embeddings may underperform on domain-specific tasks
- Ignoring Normalization: Forgetting to normalize before similarity computation
- Batch Size Mismatch: Training/inference batch size differences affecting quality
- Query-Document Asymmetry: Using same encoder for queries and documents when asymmetric may be better
Resources:
- Sentence Transformers: https://www.sbert.net/
- MTEB Leaderboard: https://huggingface.co/spaces/mteb/leaderboard
- BGE Models: https://github.com/FlagOpen/FlagEmbedding
- Matryoshka Paper: https://arxiv.org/abs/2205.13147
Example applications:
- Semantic Search Engine: Build Google-like search over your own documents
- Question Answering: Retrieve relevant passages for RAG systems
- Duplicate Detection: Find near-duplicate content
- Content Recommendation: Recommend similar articles/products
- Multi-lingual Search: Cross-lingual information retrieval