Album recommendation system built from Sputnikmusic data.
This project is educational only. Built to learn web scraping, data processing, and recommender systems. It implements rate limiting, delays between requests, and ethical scraping practices to avoid overloading the platform.
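As an illustration of the delay mechanism, here is a minimal sketch of a throttled fetcher. It is illustrative only: the project's real HTTP client lives in `scraper/`, and the class name, delay value, and User-Agent string below are assumptions.

```python
import time
import urllib.request

class PoliteFetcher:
    """Sketch of ethical scraping: enforce a minimum delay between requests.

    Illustrative only; the project's actual client (in scraper/) may differ.
    """

    def __init__(self, min_delay=2.0, user_agent="sputnik-SR (educational project)"):
        self.min_delay = min_delay
        self.user_agent = user_agent
        self._last = 0.0  # monotonic timestamp of the previous request

    def _throttle(self):
        # Sleep just long enough to keep min_delay between consecutive requests.
        wait = self.min_delay - (time.monotonic() - self._last)
        if wait > 0:
            time.sleep(wait)
        self._last = time.monotonic()

    def get(self, url):
        self._throttle()
        req = urllib.request.Request(url, headers={"User-Agent": self.user_agent})
        with urllib.request.urlopen(req, timeout=30) as resp:
            return resp.read()
```

A custom User-Agent and a fixed inter-request delay are the two cheapest ways to keep a crawler identifiable and gentle on the target site.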
The project is split into two main stages:
A scraping and crawling pipeline that collects the following from Sputnikmusic:
- Artists and complete discographies
- Releases and metadata
- User interactions
- User profiles (roles and statistics)
RRF-Ensemble: rank fusion via Reciprocal Rank Fusion
Hybrid engine that combines multiple strategies to produce personalized recommendations:
- NMF: Non-negative Matrix Factorization
- Two Towers: Deep-learning architecture
- Co-occurrence: consumption co-occurrence signals
- Content-based: genre + artist profiles
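Of these, NMF is the simplest to illustrate: it factorizes the user x release rating matrix into non-negative user and release embeddings. A minimal sketch using multiplicative updates (illustrative only; `offline_recommender/build_nmf_embeddings.py` may rely on a library implementation with different hyperparameters):

```python
import numpy as np

def nmf(R, k=2, iters=200, eps=1e-9):
    """Tiny multiplicative-update NMF (Lee & Seung): R ≈ W @ H, all non-negative."""
    rng = np.random.default_rng(0)
    n, m = R.shape
    W = rng.random((n, k))
    H = rng.random((k, m))
    for _ in range(iters):
        # Multiplicative updates keep W and H non-negative by construction.
        H *= (W.T @ R) / (W.T @ W @ H + eps)
        W *= (R @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy user x release rating matrix (0 = unrated)
R = np.array([[5., 4., 0.],
              [4., 5., 1.],
              [0., 1., 5.]])
W, H = nmf(R, k=2)
```

Rows of `W` act as user embeddings and columns of `H` as release embeddings; candidate recommendations come from large entries of `W @ H` for releases the user has not rated yet.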
Note: the results notebooks are written in Spanish, but they should be easy to interpret via the plots, tables, and code.
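The fusion step itself can be sketched in a few lines: each strategy contributes `1 / (k + rank)` for every item it ranks, and items are sorted by the summed score. `k = 60` is the constant from the original RRF paper; the engine in `app/` may weight strategies differently.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists with Reciprocal Rank Fusion.

    rankings: list of ranked item-id lists (best first).
    k: smoothing constant (60 in the original RRF paper).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse three (hypothetical) strategy rankings for one user
nmf = ["a1", "a2", "a3"]
two_towers = ["a2", "a4", "a1"]
cooc = ["a3", "a2", "a5"]
print(rrf_fuse([nmf, two_towers, cooc]))  # "a2" first: ranked highly by all three
```

Because RRF only looks at ranks, it needs no score normalization across strategies whose raw scores live on different scales.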
# Create environment
mamba env create -f environment.yml
conda activate sputnik-sr
# Initialize database
sqlite3 data/sputnik.db < data/schema.sql

# Fetch yearly charts with soundoffs
python -m crawler --start-year 1960 --end-year 2025 --db data/sputnik.db
# Expand artist discographies
python -m crawler.discography --db data/sputnik.db --batch-size 25
# Expand user ratings
python -m crawler.user_expander --db data/sputnik.db --batch-size 25

# Build co-occurrences
python offline_recommender/build_release_pairs.py --database data/sputnik.db
# Build NMF embeddings
python offline_recommender/build_nmf_embeddings.py --database data/sputnik.db
# Build Two Towers embeddings
python offline_recommender/build_two_towers.py --database data/sputnik.db
# Start web app
python -m app.app
# Open http://localhost:5050

sputnik-SR/
├── scraper/ # HTML parsing and HTTP client
├── crawler/ # Crawling orchestrators
├── app/ # Flask app and recommender engine
├── offline_recommender/ # Build + evaluation scripts
├── maintenance/ # DB health + optimization scripts
├── data/ # SQL schema and databases
├── models/ # Trained models and vocabularies
├── scripts/ # Bash utilities
├── tests/ # Test suite
├── notebooks/ # EDA + evaluation notebooks
└── docs/ # Detailed documentation
| Document | Description |
|---|---|
| Data Extraction | Scraping, crawling, ingestion flow, monitoring |
| Recommendation Strategies | Algorithms, metrics, configuration, evaluation |
| Maintenance | DB health and optimization scripts |
# Run tests
pytest -q
# Linter
ruff check .
# Pre-commits
pre-commit install
pre-commit run --all-files

MIT License - see LICENSE for details.