A complete AI-powered semantic search engine for YouTube videos that understands natural language queries and finds relevant content based on meaning, not just keywords.
- Semantic Understanding: Finds videos based on meaning, not just exact word matches
- Multi-Modal Data: Uses video titles, descriptions, transcripts, and comments
- Custom AI Training: Fine-tune models on your specific YouTube data
- Fast Vector Search: FAISS-powered similarity search
- M1 Mac Optimized: Leverages Metal Performance Shaders for optimal performance
- Real-time Search: Instant results with semantic relevance scoring
```
User Query → Embedding Model → Vector Search → Ranked Results
                     ↓
          [Trained on YouTube Data]
                     ↓
         [Video Database + FAISS Index]
```
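As a toy illustration of this flow, the sketch below ranks a few videos against a query by cosine similarity. The hand-made 4-dimensional vectors stand in for real model embeddings; in the actual system they come from the trained sentence-transformer and are indexed with FAISS.

```python
import numpy as np

# Stand-in "embeddings" for three videos (real ones are 384-dimensional
# vectors produced by the trained model).
video_titles = ["pasta recipe", "python tutorial", "travel vlog"]
video_vecs = np.array([
    [0.9, 0.1, 0.0, 0.0],   # cooking
    [0.0, 0.9, 0.1, 0.0],   # programming
    [0.0, 0.0, 0.9, 0.1],   # travel
])
query_vec = np.array([0.8, 0.2, 0.0, 0.0])  # e.g. "how to cook noodles"

def normalise(x):
    """L2-normalise along the last axis so dot product = cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

scores = normalise(video_vecs) @ normalise(query_vec)
ranked = np.argsort(-scores)  # highest similarity first
print([video_titles[i] for i in ranked])  # "pasta recipe" ranks first
```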
- Python 3.9+
- M1 Mac (optimized for Metal Performance Shaders)
- YouTube Data API Key (for data collection)
- 8GB+ RAM (recommended)
```bash
cd youtube_semantic_search
python3 -m venv venv
source venv/bin/activate

# Core ML packages
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

# Other dependencies
pip install sentence-transformers transformers datasets accelerate
pip install faiss-cpu numpy scipy pandas scikit-learn
pip install google-api-python-client google-auth-oauthlib
```

- Go to Google Cloud Console
- Create a new project or select existing
- Enable YouTube Data API v3
- Create credentials (API Key)
- Set environment variable:
```bash
export YOUTUBE_API_KEY="your_api_key_here"
```

Run the demo:

```bash
python demo.py
```

This will:
- Create sample training data
- Train a semantic model
- Demonstrate search functionality
```bash
python src/data_collector.py
python src/train_model.py
python src/search_engine.py
```

The system collects comprehensive video data:
- Titles & Descriptions: From YouTube API
- Transcripts: Auto-generated captions
- Comments: Top relevant comments
- Metadata: Views, likes, duration, etc.
```python
search_queries = [
    "machine learning tutorial",
    "python programming",
    "cooking recipes",
    "gaming walkthrough",
    "music covers",
    "travel vlog",
    "fitness workout",
    "comedy sketches",
]
```

- Model: all-MiniLM-L6-v2
- Parameters: ~80M
- Input Length: 512 tokens
- Output: 384-dimensional embeddings
- Data Preparation: Convert YouTube data to training pairs
- Fine-tuning: Use contrastive learning with cosine similarity loss
- Evaluation: Measure semantic similarity accuracy
- Model Saving: Save best performing model
```yaml
training:
  epochs: 10
  batch_size: 16
  learning_rate: 2e-5
  weight_decay: 0.01
  use_mixed_precision: true
```

- Semantic Search: Find videos by meaning
- Filtering: By channel, views, likes, category
- Similar Videos: Find related content
- Batch Search: Multiple queries at once
- Relevance Scoring: Cosine similarity scores
```python
# Basic search
results = search_engine.search("how to make pasta", top_k=10)

# Filtered search
filters = {"min_views": 10000, "channel_title": "Gordon Ramsay"}
results = search_engine.search_with_filters("pasta recipe", filters)

# Find similar videos
similar = search_engine.get_similar_videos("video_id_123")
```

- Metal Performance Shaders (MPS): GPU acceleration
- Unified Memory: Efficient memory management
- Optimized Batch Sizes: Tailored for M1 architecture
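Device selection is a one-liner in PyTorch: prefer MPS when it is available and fall back to CPU otherwise (a sketch; the actual training scripts may wire this up differently):

```python
import torch

# Use Metal Performance Shaders on Apple Silicon, otherwise fall back to CPU.
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
print(f"Training on: {device}")
```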
- Training: ~2-5 minutes per epoch (sample data)
- Search: <100ms per query
- Memory Usage: 2-4GB during training
```
youtube_semantic_search/
├── src/
│   ├── data_collector.py     # YouTube data collection
│   ├── train_model.py        # Model training pipeline
│   └── search_engine.py      # Semantic search engine
├── configs/
│   └── training_config.yaml  # Training configuration
├── data/                     # Training and video data
├── models/                   # Trained models
├── training/                 # Training logs and checkpoints
├── notebooks/                # Jupyter notebooks
├── demo.py                   # Complete system demo
└── requirements.txt          # Dependencies
```
```yaml
model:
  base_model: "all-MiniLM-L6-v2"
  save_dir: "models"
  max_seq_length: 512

training:
  epochs: 10
  batch_size: 16
  learning_rate: 2e-5
  weight_decay: 0.01

m1_optimizations:
  use_mps: true
  use_metal_optimizations: true
  batch_size_multiplier: 1.0
```

- More Data: Collect 100K+ video examples
- Longer Training: 50+ epochs with early stopping
- Larger Models: Use all-mpnet-base-v2 or all-MiniLM-L12-v2
- Cloud Deployment: Use cloud GPUs for faster training
- Real-time Updates: Continuous data collection and model retraining
- Multi-language Support: Train on multiple languages
- Video Thumbnail Analysis: Use CLIP for visual search
- User Feedback Loop: Learn from search result clicks
- A/B Testing: Compare different model versions
```bash
# Check MPS availability
python -c "import torch; print(torch.backends.mps.is_available())"
```

```yaml
# Reduce batch size in config
batch_size: 8  # instead of 16
```

```python
# Increase delays in data collection
time.sleep(1)  # instead of 0.1
```

- Use MPS: Ensure Metal Performance Shaders are enabled
- Optimize Batch Size: Start small and increase gradually
- Monitor Memory: Watch Activity Monitor during training
- Use SSD: Store data on fast storage
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
MIT License - see LICENSE file for details
- Sentence Transformers: For the base embedding models
- FAISS: For efficient vector similarity search
- PyTorch: For the deep learning framework
- YouTube Data API: For video data access
- Issues: Create GitHub issues for bugs
- Discussions: Use GitHub discussions for questions
- Wiki: Check the project wiki for detailed guides
Happy Searching! 🎯
Built with ❤️ for M1 Mac users and YouTube content discovery