A Python-based pipeline for downloading, processing, and analyzing YouTube video transcripts and comments using AI. Designed for market research in regenerative agriculture and market gardening.
- Data Collection: Downloads transcripts and comments from YouTube channels
- Multi-language Support: Automatic translation to English
- 3-Stage AI Pipeline:
- Stage 1: Data compression (cost-effective)
- Stage 2: Topic extraction and atomic insights
- Stage 3: Vector embeddings for semantic search
- Hybrid Search: Full-text (FTS5) and semantic vector search
- Resumable Operations: Pipeline can be paused and resumed
- Rate Limiting: Built-in monitoring and cost tracking
-
Install dependencies:
pip install -r requirements.txt
-
Configure environment variables:
cp .env.example .env # Edit .env with your API keys -
Configure channels in
config.py:CHANNEL_IDS = [ "UC295-Dw_tDNtZXFeAPAW6Aw", # Your channel IDs ]
python main.pypython reset_processing.pyfrom query.query_utils import QueryUtils
query = QueryUtils()
# Text search
results = query.search_text("lettuce washing")
# Semantic search
results = await query.search_semantic("sustainable farming practices")
# Browse insights
insights = query.browse_insights(limit=50)
# Get insight details
details = query.get_insight_details(insight_id=123)All settings are in config.py:
CHANNEL_IDS: YouTube channels to processSTOP_AFTER_STAGE: Halt pipeline at specific stageSTAGE1_MODEL,STAGE2_MODEL,EMBEDDING_MODEL: AI models to useMAX_TRANSCRIPT_CONCURRENCY: Concurrent transcript downloadsRATE_LIMIT_WARNING_THRESHOLDS: API usage warning levels
flow/
├── config.py # Configuration
├── main.py # Main pipeline
├── reset_processing.py # Reset utility
├── database/ # Database layer
├── models/ # Pydantic validation models
├── fetchers/ # Data downloaders
├── processors/ # AI processing stages
├── utils/ # Logging, rate limiting, etc.
├── prompts/ # AI prompt templates
└── query/ # Search and retrieval
SCOPE.md: Complete technical specificationGEMINI_RATES.md: API rate limits and pricingTRANSCRIPTS.md: youtube-transcript-api documentation
Personal project - not for distribution