Find your new favorite game through game similarity. This algorithm attempts to reward video games that can't afford advertising.
Live Demo: https://nextsteamgame.com/
```bash
# Copy the example environment file
cp .env.example .env
# Edit .env and add your OpenAI API key
# Get your key from: https://platform.openai.com/api-keys
nano .env  # or use your preferred editor

# Create virtual environment
python3 -m venv venv

# Activate virtual environment
# On Linux/Mac:
source venv/bin/activate
# On Windows:
venv\Scripts\activate

# Install required packages
pip install -r requirements.txt

# Start the app
python app.py
```

The app will be available at http://localhost:5000.
Ideally this is a one-shot app that gives you exactly what you were looking for first try! If it doesn't, then we have done something wrong.
Steam Recommender creates tags from 3 sources: Steam reviews, professional reviews, and video analysis. It applies intelligent weights to each tag and adds "unique" tags that separate games from others in their genre. All data is stored in an optimized SQLite database for lightning-fast searches.
Hierarchical Genre Tree + Vector Similarity:
- 80% descriptive tags - Core gameplay elements (combat, exploration, story)
- 20% unique-in-genre tags - What makes this game special within its category
Three-tier niche carving:
main_genre → sub_genre → sub_sub_genre
Broad Category → Specific Style → Unique Defining Element
Example: Action → Methodical Combat → Interconnected World (Dark Souls)
Example: Strategy → Turn-Based → Deck Building (Slay the Spire)
Example: Action → Platformer → Stamina-Based Combat (Hollow Knight)
Similarity rewards by niche specificity:
- Same sub_sub_genre: 0.4 bonus (shares unique defining trait)
- Same sub_genre: 0.25 bonus (similar gameplay style)
- Same main_genre: 0.15 bonus (broad category match)
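The tier bonuses above can be sketched as a small scoring helper. This is a minimal illustration, not the project's actual code: the function name `niche_bonus` and the choice to count only the deepest shared tier (rather than stacking bonuses) are assumptions.

```python
def niche_bonus(a, b):
    """Similarity bonus for the deepest genre tier two games share.

    `a` and `b` are dicts with 'main_genre', 'sub_genre', 'sub_sub_genre'.
    Only the deepest shared tier counts (assumed; bonuses are not stacked).
    """
    if a['sub_sub_genre'] == b['sub_sub_genre']:
        return 0.4   # shares the unique defining trait
    if a['sub_genre'] == b['sub_genre']:
        return 0.25  # similar gameplay style
    if a['main_genre'] == b['main_genre']:
        return 0.15  # broad category match
    return 0.0

# Illustrative classifications (second game's tree is made up)
dark_souls = {'main_genre': 'Action', 'sub_genre': 'Methodical Combat',
              'sub_sub_genre': 'Interconnected World'}
elden_ring = {'main_genre': 'Action', 'sub_genre': 'Methodical Combat',
              'sub_sub_genre': 'Open World'}
print(niche_bonus(dark_souls, elden_ring))  # 0.25
```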
```
├── frontend/              # Web interface
│   ├── static/            # CSS, images, JS
│   └── templates/         # HTML templates
├── backend/               # Core engine
│   ├── api/               # Flask routes & endpoints
│   ├── core/              # Game search & similarity engine
│   ├── config/            # Dynamic configuration
│   └── database_builder/  # Data pipeline
├── data/                  # Databases & models
└── logs/                  # Application logs
```
- Python - Unified language for entire pipeline
- Flask - Web framework for recommendation API
- SQLite - Hierarchical game database with vector storage
- OpenAI GPT-3.5 - AI-powered tag generation from reviews
- scikit-learn - TF-IDF vectorization for similarity matching
- Beautiful Soup & Selenium - Web scraping for professional reviews
- Python 3.8+ with pip
- OpenAI API Account - Required for review analysis
- Chrome/Chromium Browser - Needed only for optional IGN scraping
- 3+ days of runtime - Due to API rate limiting
- $100-300 budget - Estimated OpenAI API costs (see cost table below)
```bash
# Clone the repository
git clone https://github.com/yourusername/Steam_Reccomender.git
cd Steam_Reccomender

# Install Python dependencies
pip install -r requirements.txt

# Set up environment variables (choose one method):

# Method 1: Use a .env file (recommended)
cp .env.example .env
# Edit .env and add your OpenAI API key

# Method 2: Export directly (temporary)
export OPENAI_API_KEY="your-openai-api-key-here"
export FLASK_SECRET_KEY="your-secure-random-key"
```

The database building process has been refactored into a modular, stage-based pipeline with checkpointing, error recovery, and monitoring.
```bash
# Run complete pipeline (NEW - RECOMMENDED)
python database_builder.py

# Run a specific stage only (NEW)
python database_builder.py --stage data_collection
python database_builder.py --stage review_analysis
python database_builder.py --stage database_creation

# Check pipeline status (NEW)
python database_builder.py --status

# Reset pipeline if needed (NEW)
python database_builder.py --reset
```

```bash
python database_builder.py --stage data_collection
```

Enhanced Features:
- Smart checkpointing: Resume from interruptions automatically
- Progress tracking: Real-time progress indicators
- Batch processing: Configurable batch sizes for optimal performance
- Error recovery: Intelligent retry mechanisms with exponential backoff
Outputs: steamspy_all_games.db, steam_api.db
Cost: FREE (only API rate limits)
```bash
python database_builder.py --stage review_analysis
```

Enhanced Features:
- Cost estimation: Real-time OpenAI API cost projections
- Granular checkpointing: Resume from exact interruption point
- Quality filtering: Advanced spam and toxicity detection
- Professional reviews: Optional IGN review integration
Outputs: Analysis JSON files, hierarchical classification data
Cost: $100-300 (OpenAI API usage)
```bash
python database_builder.py --stage database_creation
```

Enhanced Features:
- Integrity validation: Comprehensive database validation
- Performance optimization: Automatic index creation
- Statistics reporting: Detailed completion analytics
- Output verification: Automatic file validation
Outputs: steam_recommendations.db, hierarchical_vectorizer.pkl
Cost: FREE (local processing)
```bash
# Get comprehensive status report
python database_builder.py --status

# Validate configuration and dependencies
python database_builder.py --validate
```

The original orchestrator is still available:
```bash
# Legacy interface (still functional)
python -m backend.database_builder.pipeline_orchestrator --stage 1
python -m backend.database_builder.pipeline_orchestrator --stage 2
python -m backend.database_builder.pipeline_orchestrator --stage 3
```

| Component | Estimated Cost | Notes |
|---|---|---|
| SteamSpy API | FREE | Public API, 1 second rate limit |
| Steam Store API | FREE | Public API, respects rate limits |
| OpenAI GPT-3.5 | $100-300 | 500-1000 games × ~500 tokens per analysis |
| IGN Scraping | FREE | Web scraping with delays |
| Total Estimated Cost | $100-300 | Mainly OpenAI API usage |
- Start small: Modify `DATA_COLLECTION['max_games']` in `backend/config/settings.py`
- Use existing data: Skip Stage 2 if you have analysis JSON files
- OpenAI alternatives: Modify the review analyzer to use local models
- Caching: The pipeline saves checkpoints to resume from interruptions
Once you have the database built (or download pre-built databases):
```bash
# Start the Flask web application
python app.py
```

Visit http://localhost:5000 to use the recommender.
If you don't want to spend the time/money building the database:
- Download pre-built databases (if available)
- Place them in the `data/` directory:
  - `steam_recommendations.db` (required)
  - `hierarchical_vectorizer.pkl` (required)
  - `steam_api.db` (optional, for images/pricing)
All settings are centralized in `backend/config/settings.py`:
```python
# Customize data collection
DATA_COLLECTION = {
    'max_games': 20000,        # Reduce for testing
    'reviews_per_game': 100,   # Reduce to lower OpenAI costs
    'batch_size': 1000,
    'checkpoint_interval': 100
}

# Adjust rate limits
RATE_LIMITS = {
    'openai_max_retries': 3,
    'steam_api_delay': 0.5,    # Increase if rate limited
}
```

- 20,000 games in catalog (SteamSpy + Steam Store data)
- 500-1000 games with full AI analysis (Steam reviews, IGN, YouTube)
- 1000-dimensional TF-IDF vectors for similarity
- Sub-second recommendation responses across entire 20k database
- Hierarchical niche carving makes sub_sub_genre matches very valuable at this scale
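The TF-IDF similarity step can be sketched with scikit-learn. This is a toy illustration: the tag "documents" below are made up, and joining each game's tags into one string is an assumption about the project's preprocessing.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative tag documents, one per game (not the real tag data)
games = {
    'Dark Souls':     'methodical combat exploration interconnected world',
    'Elden Ring':     'methodical combat exploration open world',
    'Slay the Spire': 'turn based strategy deck building roguelike',
}
vec = TfidfVectorizer(max_features=1000)  # README cites 1000-dim vectors
matrix = vec.fit_transform(list(games.values()))

# Cosine similarity of Dark Souls against every game in the catalog
sims = cosine_similarity(matrix[0], matrix)[0]
print(dict(zip(games, sims.round(2))))
```

Games sharing tag vocabulary score higher; games with disjoint tags score zero, which is where the hierarchical genre bonuses help.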
Stage 1: Data Collection (1-2 hours)
- SteamSpy API → 20k game catalog
- Steam Store API → metadata, pricing, images
Stage 2: Review Analysis (1-2 days)
- Steam Reviews + OpenAI → intelligent tag generation
- IGN Scraping → professional review scores
- Hierarchical classification → genre taxonomy
Stage 3: Database Creation (30 mins)
- JSON → optimized SQLite schema
- TF-IDF vectorization → binary BLOB storage
- Performance indexing → sub-second queries
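Stage 3's "TF-IDF vectorization → binary BLOB storage" step can be sketched as follows. The table and column names here are illustrative, not the project's actual schema.

```python
import sqlite3
import numpy as np

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE games (appid INTEGER PRIMARY KEY, vector BLOB)')

# A 1000-dimensional TF-IDF vector, stored as raw float32 bytes
vec = np.random.rand(1000).astype(np.float32)
conn.execute('INSERT INTO games VALUES (?, ?)', (570, vec.tobytes()))

# Reading it back: frombuffer restores the array without copying
row = conn.execute('SELECT vector FROM games WHERE appid = 570').fetchone()
restored = np.frombuffer(row[0], dtype=np.float32)
assert np.array_equal(vec, restored)
```

Storing vectors as fixed-dtype BLOBs keeps the database compact and lets similarity queries load arrays directly, without JSON parsing.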
The data pipeline takes 3+ days due to API rate limiting, so the database is typically 3 months old. This trade-off ensures we can analyze games thoroughly without overwhelming external APIs.
API Rate Limits:
- OpenAI: 3 requests/minute (free tier), 60 requests/minute (paid)
- Steam Store: ~1 request/second (unofficial limit)
- SteamSpy: 1 request/second (official limit)
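The exponential-backoff retry pattern used against these rate limits can be sketched as a small helper. The name `with_backoff` and the delay values are illustrative, not the pipeline's actual code.

```python
import time

def with_backoff(fn, max_retries=3, base_delay=1.0):
    """Call fn(); on failure wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries, surface the error
            time.sleep(base_delay * (2 ** attempt))  # waits 1s, 2s, 4s, ...
```

Doubling the wait after each failure lets a rate-limited request recover quickly while avoiding hammering the API during longer outages.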
- Context-aware review analysis (mention previous games)
- Convert Flask app to FastAPI (hitting performance limits)
- Implement ChromaDB for enhanced semantic similarity
- Humble Bundle affiliate integration
If any reviewing companies want their data removed from this program, please let me know. This is a data science project for educational purposes.
I run minimal ads because I'm a broke college student trying to break even on server costs.
- Modular Architecture - Clean separation between frontend, API, core logic, and data pipeline
- Dynamic Configuration - Centralized settings with environment variable support
- Type Hints - Full type annotations for better code quality
- Error Handling - Comprehensive exception management with graceful fallbacks
- Hierarchical Genre Matching - Multi-tier similarity scoring
- Vector Similarity Engine - TF-IDF cosine similarity with tag-based fallback
- Intelligent Rate Limiting - Respects API limits with exponential backoff
- Checkpoint System - Resume long-running processes from interruptions
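The checkpoint system boils down to persisting the last completed item so an interrupted run resumes where it left off. This is a minimal sketch; the file location and JSON shape are assumptions.

```python
import json
import os
import tempfile

# Illustrative checkpoint path; the real pipeline's location may differ
CHECKPOINT = os.path.join(tempfile.gettempdir(), 'sr_checkpoint.json')

def load_checkpoint():
    """Return the index of the last completed item, or -1 for a fresh run."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f).get('last_done', -1)
    return -1

def process_all(items):
    """Process items in order, recording progress after each one."""
    start = load_checkpoint() + 1  # skip work already done
    for i in range(start, len(items)):
        # ... expensive per-game work (API calls, analysis) goes here ...
        with open(CHECKPOINT, 'w') as f:
            json.dump({'last_done': i}, f)
```

Writing the checkpoint after each item means a crash loses at most one item's work, which matters for a multi-day, pay-per-call pipeline.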
- `GET /` - Main interface
- `POST /search` - Game search and preference selection
- `POST /recommend` - Generate recommendations
- `GET /api/search` - Search suggestions (JSON)
- `GET /debug/stats` - Database statistics
- `GET /health` - System health check
- Fork the repository
- Create a feature branch: `git checkout -b feature-name`
- Make your changes with proper type hints and documentation
- Test with a small dataset first (`max_games = 50`)
- Submit a pull request
```python
# Test with a small dataset (reduces costs)
# Edit backend/config/settings.py:
DATA_COLLECTION['max_games'] = 50
```

```bash
# Run quick pipeline test
python -m backend.database_builder.pipeline_orchestrator --stage 1
```