
saksham-45/bettersearchytube

Repository files navigation

YouTube Semantic Search System

A complete AI-powered semantic search engine for YouTube videos that understands natural language queries and finds relevant content based on meaning, not just keywords.

Features

  • Semantic Understanding: Finds videos based on meaning, not just exact word matches
  • Multi-Modal Data: Uses video titles, descriptions, transcripts, and comments
  • Custom AI Training: Fine-tune models on your specific YouTube data
  • Fast Vector Search: FAISS-powered similarity search
  • M1 Mac Optimized: Uses Metal Performance Shaders (MPS) for GPU acceleration on Apple Silicon
  • Real-time Search: Instant results with semantic relevance scoring

Architecture

User Query → Embedding Model → Vector Search → Ranked Results
                ↓
        [Trained on YouTube Data]
                ↓
    [Video Database + FAISS Index]
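At serving time this pipeline reduces to: embed the query, then rank stored video vectors by cosine similarity. A minimal pure-Python sketch of that ranking step (the 3-d "embeddings" and titles below are made-up stand-ins; the real index holds 384-dimensional model outputs in FAISS):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def rank(query_vec, index):
    """Return (title, score) pairs sorted by similarity, best first."""
    scored = [(title, cosine(query_vec, vec)) for title, vec in index.items()]
    return sorted(scored, key=lambda t: t[1], reverse=True)

# Toy vectors standing in for real embeddings.
index = {
    "Pasta carbonara tutorial": [0.9, 0.1, 0.0],
    "Intro to neural networks": [0.1, 0.9, 0.2],
    "Travel vlog: Rome":        [0.5, 0.0, 0.8],
}
query = [0.8, 0.2, 0.1]  # pretend embedding of "how to cook pasta"
results = rank(query, index)
```

FAISS replaces the linear scan in `rank` with an approximate nearest-neighbor index, which is what keeps queries fast at scale.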

Prerequisites

  • Python 3.9+
  • M1 Mac (optimized for Metal Performance Shaders)
  • YouTube Data API Key (for data collection)
  • 8GB+ RAM (recommended)

Installation

1. Clone and Setup

cd youtube_semantic_search
python3 -m venv venv
source venv/bin/activate

2. Install Dependencies

# Core ML packages (the default macOS arm64 wheels include MPS support)
pip install torch torchvision torchaudio

# Other dependencies
pip install sentence-transformers transformers datasets accelerate
pip install faiss-cpu numpy scipy pandas scikit-learn
pip install google-api-python-client google-auth-oauthlib

3. Get YouTube API Key

  1. Go to Google Cloud Console
  2. Create a new project or select existing
  3. Enable YouTube Data API v3
  4. Create credentials (API Key)
  5. Set environment variable:
export YOUTUBE_API_KEY="your_api_key_here"
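The collector can then fail fast if the key is missing. A small illustrative helper (not part of the repo; the error message is an assumption):

```python
import os

def require_api_key(env=os.environ):
    """Read YOUTUBE_API_KEY from the environment, failing loudly if absent."""
    key = env.get("YOUTUBE_API_KEY")
    if not key:
        raise RuntimeError("Set YOUTUBE_API_KEY before collecting data.")
    return key
```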

Quick Start

Run the Complete Demo

python demo.py

This will:

  1. Create sample training data
  2. Train a semantic model
  3. Demonstrate search functionality

Manual Step-by-Step

1. Collect YouTube Data

python src/data_collector.py

2. Train the Model

python src/train_model.py

3. Use the Search Engine

python src/search_engine.py

Data Collection

The system collects comprehensive video data:

  • Titles & Descriptions: From YouTube API
  • Transcripts: Auto-generated captions
  • Comments: Top relevant comments
  • Metadata: Views, likes, duration, etc.
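Before embedding, the collected fields are merged into one searchable document per video. A sketch of that step, assuming a simplified record shape (the real YouTube API payload is nested and richer than this):

```python
def build_document(video):
    """Concatenate the text fields of one video record into a single
    string for embedding; missing fields are simply skipped."""
    parts = [
        video.get("title", ""),
        video.get("description", ""),
        video.get("transcript", ""),
        " ".join(video.get("comments", [])),
    ]
    return " ".join(p for p in parts if p).strip()

sample = {
    "title": "Perfect Pasta Carbonara",
    "description": "A classic Roman recipe in 10 minutes.",
    "comments": ["Tried it, delicious!", "Great tutorial."],
}
doc = build_document(sample)
```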

Sample Search Queries

search_queries = [
    "machine learning tutorial",
    "python programming",
    "cooking recipes",
    "gaming walkthrough",
    "music covers",
    "travel vlog",
    "fitness workout",
    "comedy sketches"
]

Model Training

Base Model

  • Model: all-MiniLM-L6-v2
  • Parameters: ~22M
  • Input Length: 512 tokens
  • Output: 384-dimensional embeddings

Training Process

  1. Data Preparation: Convert YouTube data to training pairs
  2. Fine-tuning: Use contrastive learning with cosine similarity loss
  3. Evaluation: Measure semantic similarity accuracy
  4. Model Saving: Save best performing model
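The data-preparation step above can be sketched as building (text_a, text_b, label) triples, which a sentence-transformers pipeline would feed into `CosineSimilarityLoss`. The pairing strategy and field names here are illustrative, not the repo's exact scheme:

```python
import random

def make_pairs(videos, seed=0):
    """Positive pairs: a video's title vs. its own description (label 1.0).
    Negative pairs: a title vs. a different video's description (label 0.0)."""
    rng = random.Random(seed)
    pairs = []
    for i, v in enumerate(videos):
        pairs.append((v["title"], v["description"], 1.0))
        j = rng.choice([k for k in range(len(videos)) if k != i])
        pairs.append((v["title"], videos[j]["description"], 0.0))
    return pairs

videos = [
    {"title": "Python decorators explained", "description": "Deep dive into @wraps"},
    {"title": "Easy sourdough bread", "description": "Baking at home step by step"},
]
pairs = make_pairs(videos)
```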

Training Configuration

training:
  epochs: 10
  batch_size: 16
  learning_rate: 2e-5
  weight_decay: 0.01
  use_mixed_precision: true

Search Engine

Features

  • Semantic Search: Find videos by meaning
  • Filtering: By channel, views, likes, category
  • Similar Videos: Find related content
  • Batch Search: Multiple queries at once
  • Relevance Scoring: Cosine similarity scores

Usage Examples

# Basic search
results = search_engine.search("how to make pasta", top_k=10)

# Filtered search
filters = {"min_views": 10000, "channel_title": "Gordon Ramsay"}
results = search_engine.search_with_filters("pasta recipe", filters)

# Find similar videos
similar = search_engine.get_similar_videos("video_id_123")
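The batch-search feature can be layered on the single-query method. A hypothetical wrapper (the stub engine exists only so the helper can be exercised without a real index):

```python
def batch_search(search_engine, queries, top_k=5):
    """Run several queries and return a dict of query -> results."""
    return {q: search_engine.search(q, top_k=top_k) for q in queries}

class _StubEngine:
    """Minimal stand-in for the real search engine."""
    def search(self, query, top_k=5):
        return [f"result for {query!r} #{i}" for i in range(top_k)]

out = batch_search(_StubEngine(), ["pasta recipe", "travel vlog"], top_k=2)
```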

Performance

M1 Mac Optimizations

  • Metal Performance Shaders (MPS): GPU acceleration
  • Unified Memory: Efficient memory management
  • Optimized Batch Sizes: Tailored for M1 architecture

Expected Performance

  • Training: ~2-5 minutes per epoch (sample data)
  • Search: <100ms per query
  • Memory Usage: 2-4GB during training

Project Structure

youtube_semantic_search/
├── src/
│   ├── data_collector.py      # YouTube data collection
│   ├── train_model.py         # Model training pipeline
│   └── search_engine.py       # Semantic search engine
├── configs/
│   └── training_config.yaml   # Training configuration
├── data/                      # Training and video data
├── models/                    # Trained models
├── training/                  # Training logs and checkpoints
├── notebooks/                 # Jupyter notebooks
├── demo.py                    # Complete system demo
└── requirements.txt           # Dependencies

Configuration

Training Parameters

model:
  base_model: "all-MiniLM-L6-v2"
  save_dir: "models"
  max_seq_length: 512

training:
  epochs: 10
  batch_size: 16
  learning_rate: 2e-5
  weight_decay: 0.01

M1 Optimizations

m1_optimizations:
  use_mps: true
  use_metal_optimizations: true
  batch_size_multiplier: 1.0
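One plausible reading of `batch_size_multiplier` is a scale factor applied to the configured batch size (assumed semantics; check the training code for the actual behavior):

```python
def effective_batch_size(base, multiplier):
    """Scale the configured batch size; clamp to at least 1."""
    return max(1, int(base * multiplier))
```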

Scaling Up

For Production Use

  1. More Data: Collect 100K+ video examples
  2. Longer Training: 50+ epochs with early stopping
  3. Larger Models: Use all-mpnet-base-v2 or all-MiniLM-L12-v2
  4. Cloud Deployment: Use cloud GPUs for faster training
  5. Real-time Updates: Continuous data collection and model retraining

Advanced Features

  • Multi-language Support: Train on multiple languages
  • Video Thumbnail Analysis: Use CLIP for visual search
  • User Feedback Loop: Learn from search result clicks
  • A/B Testing: Compare different model versions

Troubleshooting

Common Issues

MPS Not Available

# Check MPS availability
python -c "import torch; print(torch.backends.mps.is_available())"
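If MPS is unavailable, a defensive device pick keeps the pipeline running on CPU instead of crashing. This helper is an illustration, not part of the repo:

```python
def pick_device():
    """Prefer MPS on Apple Silicon, then CUDA, then CPU.
    Degrades gracefully if torch is missing or too old for MPS."""
    try:
        import torch
        mps = getattr(torch.backends, "mps", None)
        if mps is not None and mps.is_available():
            return "mps"
        if torch.cuda.is_available():
            return "cuda"
    except ImportError:
        pass
    return "cpu"

device = pick_device()
```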

Memory Issues

# Reduce batch size in config
batch_size: 8  # Instead of 16

API Quota Exceeded

# Increase delays in data collection
time.sleep(1)  # Instead of 0.1

Performance Tips

  1. Use MPS: Ensure Metal Performance Shaders are enabled
  2. Optimize Batch Size: Start small and increase gradually
  3. Monitor Memory: Watch Activity Monitor during training
  4. Use SSD: Store data on fast storage

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

License

MIT License - see LICENSE file for details

Acknowledgments

  • Sentence Transformers: For the base embedding models
  • FAISS: For efficient vector similarity search
  • PyTorch: For the deep learning framework
  • YouTube Data API: For video data access

Support

  • Issues: Create GitHub issues for bugs
  • Discussions: Use GitHub discussions for questions
  • Wiki: Check the project wiki for detailed guides

Happy Searching! 🎯

Built with ❤️ for M1 Mac users and YouTube content discovery
