This directory contains Jupyter notebooks demonstrating analytical workflows, exploratory data analysis, and research applications using Steam Dataset 2025. Notebooks combine narrative exposition with executable code to showcase dataset capabilities while maintaining reproducibility through pre-exported CSV/Parquet data files.
The notebooks directory serves as both documentation and executable analysis examples for Steam Dataset 2025. Each notebook addresses specific research questions using different analytical approaches: temporal market analysis, semantic game discovery, and machine learning applications. Notebooks are designed for immediate execution on Kaggle or local environments without database dependencies, using pre-exported data from the analytics tier.
This section documents available notebooks and their analytical focus.
| Notebook | Topic | Key Techniques | Data Files |
|---|---|---|---|
| 01-platform-evolution | Market trends 1997-2025 | Temporal analysis, genre evolution | 6 CSV files (~3MB) |
| 02-semantic-discovery | Vector-based game search | Embeddings, cosine similarity | 1 CSV + 1 NPY (~25MB) |
| 03-semantic-fingerprint | Genre classification ML | Random Forest, embeddings | 1 Parquet (~180MB) |
Each notebook follows this organizational pattern:
```
notebook-XX-title/
├── 📓 notebook-XX-title.ipynb   # Main Jupyter notebook
├── 📊 data/                     # Pre-exported datasets
│   ├── 01_dataset.csv
│   ├── 02_dataset.csv
│   └── embeddings.npy
├── 📋 notebook-XX-title.pdf     # Static PDF export
└── 📄 README.md                 # Notebook-specific documentation
```

Visual representation of the notebooks organization:
```
notebooks/
├── 📈 01-steam-platform-evolution-and-marketplace/
│   ├── notebook-01-steam-platform-evolution.ipynb
│   ├── notebook-01-steam-platform-evolution.pdf
│   ├── data/
│   │   ├── 01_temporal_growth.csv
│   │   ├── 02_genre_evolution.csv
│   │   ├── 03_platform_support.csv
│   │   ├── 04_pricing_strategy.csv
│   │   ├── 05_publisher_portfolios.csv
│   │   └── 06_achievement_evolution.csv
│   └── README.md
├── 🔍 02-semantic-game-discovery/
│   ├── notebook-02-semantic-game-discovery.ipynb
│   ├── notebook-02-semantic-game-discovery.pdf
│   ├── data/
│   │   ├── 01_game_embeddings_sample.csv
│   │   ├── 02_embeddings_appids.csv
│   │   ├── 02_embeddings_vectors.npy
│   │   ├── 02_genre_representatives.csv
│   │   └── 02_semantic_search_examples.json
│   └── README.md
├── 🤖 03-the-semantic-fingerprint/
│   ├── notebook-03-the-semantic-fingerprint.ipynb
│   ├── notebook-03-the-semantic-fingerprint.pdf
│   ├── data/
│   │   ├── 03-the-semantic-fingerprint.parquet
│   │   └── 03-the-semantic-fingerprint-preview.csv
│   └── README.md
└── 📄 README.md                 # This file
```

- Platform Evolution: 28-year market analysis with temporal trends
- Semantic Discovery: Vector-based game similarity and search
- Semantic Fingerprint: Genre classification using ML and embeddings
- PDF Exports: Static versions for reading without Jupyter
This section connects notebooks to data sources and documentation.
| Category | Relationship | Documentation |
|---|---|---|
| Analytics Data | Source datasets for notebooks | data/04_analytics/README.md |
| Models | ML models trained in notebooks | models/README.md |
| Dataset Documentation | Schema and methodology | steam-dataset-2025-v1/README.md |
| Notebook Generation Scripts | Data export scripts | scripts/12-notebook-generation/README.md |
This section provides guidance for using and running notebooks.
Notebooks are optimized for Kaggle execution:
- Upload Dataset: Add `steam-dataset-2025-notebook-data` as an input dataset
- Open Notebook: Upload the .ipynb file to Kaggle
- Configure Paths: Notebooks auto-detect Kaggle vs local environments
- Run All: Execute cells sequentially or use "Run All"
```python
# Auto-detection pattern used in notebooks
import os
from pathlib import Path

if os.path.exists('/kaggle/input'):
    # Kaggle environment
    DATA_DIR = Path('/kaggle/input/steam-dataset-2025-notebook-data/01-platform-evolution')
    ENV = 'Kaggle'
else:
    # Local environment
    DATA_DIR = Path('../notebook-data/01-platform-evolution')
    ENV = 'Local'

print(f"✅ Environment: {ENV}")
print(f"📁 Data directory: {DATA_DIR}")
```

For local execution without database:
```bash
# Clone repository
git clone https://github.com/vintagedon/steam-dataset-2025.git
cd steam-dataset-2025

# Install dependencies
pip install pandas numpy matplotlib seaborn scikit-learn

# Launch Jupyter
jupyter notebook notebooks/01-steam-platform-evolution-and-marketplace/
```

All notebooks include proper dataset attribution:
"""
Dataset: Steam Dataset 2025
Author: Donald Fountain (VintageDon)
Citation: Fountain, D. (2025). Steam Dataset 2025. Zenodo.
https://doi.org/10.5281/zenodo.17266923
License: CC BY 4.0
"""This section provides detailed overviews of each notebook's content and findings.
Research Questions:
- How has Steam's content catalog evolved across 28 years?
- Which genres drive platform growth and how have pricing strategies changed?
- What patterns emerge in Windows/Mac/Linux support over time?
- How do publishers balance portfolio diversification and pricing?
- How has Steam's achievement system diffused across the platform?
Key Findings:
- Exponential growth from 1997-2025 with inflection points marking major platform shifts
- Indie genre explosion post-2010 correlates with Steam Greenlight/Direct programs
- Mac support peaked ~2012, Linux remains niche at 15-20% of releases
- Free-to-play economics transformed monetization strategies post-2015
- Top publishers pursue mixed strategies: some specialize, others diversify
Methodology: Temporal aggregation, genre co-occurrence analysis, platform support matrices, pricing tier analysis, publisher portfolio diversity metrics
Data Files: 6 CSV files (29-202K rows each, ~3MB total)
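The temporal aggregation step behind these findings can be sketched with pandas. The column names below are illustrative stand-ins, not the actual export schema of `01_temporal_growth.csv`:

```python
import pandas as pd

# Illustrative release records; the real notebook loads 01_temporal_growth.csv
df = pd.DataFrame({
    "appid": [10, 20, 30, 40],
    "release_year": [2004, 2004, 2015, 2015],
    "genre": ["Action", "Action", "Indie", "Indie"],
})

# Aggregate annual release counts -- the core of the temporal growth analysis
yearly = df.groupby("release_year")["appid"].count().rename("releases")
print(yearly.to_dict())  # {2004: 2, 2015: 2}
```

The same groupby pattern extends to genre evolution (group by year and genre) and platform support matrices (group by year and platform flags).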
Research Questions:
- How do vector embeddings capture game concepts beyond keyword matching?
- Can semantic search discover games with similar gameplay despite different genres?
- What representative games best exemplify each major genre?
- How does semantic similarity correlate with genre classifications?
Key Findings:
- BGE-M3 embeddings capture gameplay mechanics, narrative themes, and artistic styles
- Semantic search finds conceptual matches across genre boundaries (puzzle-platformers vs pure puzzles)
- Genre representatives identified algorithmically match community consensus choices
- ~73% semantic similarity for same-genre games, ~42% for different genres (clear signal)
Methodology: Cosine similarity search, t-SNE visualization, genre centroid analysis, semantic clustering validation
Data Files: 1 CSV (1000 game sample), 1 NPY (1024-dim vectors for 134K games, ~520MB), semantic search examples JSON
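The cosine similarity search at the heart of this notebook can be sketched in a few lines of NumPy. The vectors here are random toy stand-ins for the real 1024-dimensional BGE-M3 embeddings loaded from the `.npy` file:

```python
import numpy as np

# Toy stand-ins for the real 1024-dim BGE-M3 vectors (5 games, 8 dims)
rng = np.random.default_rng(42)
embeddings = rng.normal(size=(5, 8)).astype(np.float32)

# Normalize rows so a dot product equals cosine similarity
normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

query = normed[0]            # "find games similar to game 0"
scores = normed @ query      # cosine similarity against every game
ranked = np.argsort(-scores) # best matches first
print(ranked[0])  # 0 -- a game is always most similar to itself
```

At full scale (134K games) the same operation is a single matrix-vector product, which is why the search remains fast without a database.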
Research Questions:
- Can game descriptions alone predict genres better than metadata?
- How does ML performance scale from balanced sampling to class imbalance?
- What RAM characteristics correlate with genre and description semantics?
- Which genres are most predictable from text descriptions?
Key Findings:
- Embeddings-only model achieves 0.71 F1 (macro) vs 0.58 for metadata-only
- Class imbalance significantly impacts performance (Action 41%, ~1:61 ratio to least common)
- RAM requirements strongly correlate with genre (Indie: 4.2GB avg, Simulation: 8.1GB)
- Strategy and Simulation most predictable (F1 ~0.79), Casual least predictable (F1 ~0.64)
Methodology: Random Forest classification with class rebalancing, embedding-based feature engineering, RAM distribution analysis, genre-specific error analysis
Data Files: 1 Parquet file (50K games with embeddings and metadata, ~180MB)
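The classification setup can be sketched with scikit-learn. The features below are synthetic stand-ins for the real embedding columns, and `class_weight="balanced"` is one common rebalancing approach, shown here as an assumption rather than the notebook's exact configuration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-ins: 300 "games" with 16-dim embedding features, 3 genres
X = rng.normal(size=(300, 16))
y = rng.integers(0, 3, size=300)
X[y == 1] += 1.5  # give one class a separable signal

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

clf = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=42
)
clf.fit(X_train, y_train)

# Macro F1 weights every genre equally, which is why it is the headline metric
macro_f1 = f1_score(y_test, clf.predict(X_test), average="macro")
print(f"macro F1: {macro_f1:.2f}")
```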
This section identifies applications for notebook examples.
Notebooks serve as teaching materials:
- Data Science Curricula: Real-world dataset with diverse analytical challenges
- ML Course Examples: Classification, regression, semantic search demonstrations
- Visualization Techniques: Publication-quality plots using matplotlib/seaborn
- Research Methods: Reproducible analysis with documented methodology
Notebooks provide starting points for academic research:
- Reproducible Workflows: Clear data lineage from source to analysis
- Methodology Examples: Temporal analysis, semantic search, ML evaluation patterns
- Citation Standards: Proper attribution of dataset and dependencies
- Extension Points: Notebooks designed for modification and expansion
Notebooks demonstrate dataset capabilities:
- Coverage Assessment: Show breadth of temporal, genre, platform data
- Quality Validation: Demonstrate data completeness and consistency
- Feature Showcase: Highlight unique aspects (embeddings, materialized columns)
- Performance Benchmarks: Establish baseline model performance
Notebooks facilitate dataset adoption:
- Quick Start Examples: Immediate hands-on experience without setup
- Competition Seeds: Kaggle-ready notebooks for ML competitions
- Discussion Starters: Analysis findings spark community conversations
- Contribution Templates: Patterns for community-contributed notebooks
This section documents notebook development standards.
All notebooks adhere to these principles:
✓ Environment Detection: Auto-detect Kaggle vs local paths
✓ Dependency Management: Explicit imports with version notes
✓ Reproducibility: Fixed random seeds (42) for stochastic operations
✓ Error Handling: Try/except blocks for external dependencies
✓ Documentation: Markdown cells explain analysis choices
✓ Validation: Data quality checks before analysis
✓ Visualization: Consistent color schemes and readable plots

Notebook data files are optimized for execution:
| Optimization | Implementation | Benefit |
|---|---|---|
| Size Limits | CSV files <100MB each | Kaggle notebook data size limits |
| Row Filtering | `WHERE success = TRUE` | Complete records only |
| Column Selection | Only columns used in analysis | Reduced memory footprint |
| Data Types | Optimal dtype per column | Faster loading and processing |
| Compression | Parquet with Snappy | 2-3x smaller than CSV |
Standard organization pattern:
```markdown
# [Title]: [Subtitle]
Author: Donald Fountain (VintageDon)
Citation: [Dataset citation]

## Executive Summary
[2-3 sentences: scope, methods, key findings]

## 1. Introduction & Research Questions
[Research questions as a numbered list]

## 2. Data Loading & Environment Setup
[Environment detection, library imports, data validation]

## 3. Exploratory Data Analysis
[5+ visualizations with detailed commentary]

## 4. Core Analysis
[Main analytical work addressing research questions]

## 5. Key Findings & Conclusions
[3-5 major findings with implications]

## 6. Limitations & Future Work
[Known limitations and research extensions]

## 7. References
[Dataset citation, documentation links, external resources]
```

This section describes the notebook creation process.
- Define Research Questions: Identify analytical objectives and required data
- SQL Query Development: Write queries to extract necessary data from database
- Data Export: Generate CSV/Parquet files <100MB for notebook usage
- Validation: Verify exported data matches query expectations
- Notebook Development: Build analysis incrementally with validation checks
- Review: Test execution in both local and Kaggle environments
- Documentation: Add markdown cells explaining methodology and findings
- PDF Export: Generate static version for distribution
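The validation step (checking row counts and expected columns) can be sketched like this; the function name, columns, and thresholds are illustrative, not the project's actual validation code:

```python
import pandas as pd

def validate_export(df: pd.DataFrame, expected_columns: list[str], min_rows: int) -> None:
    """Fail fast if an exported frame is missing columns or suspiciously small."""
    missing = set(expected_columns) - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    if len(df) < min_rows:
        raise ValueError(f"only {len(df)} rows, expected at least {min_rows}")

# Illustrative check against a small in-memory frame
df = pd.DataFrame({"appid": [10, 20], "release_year": [2004, 2015]})
validate_export(df, ["appid", "release_year"], min_rows=2)
print("validation passed")
```

Running a check like this right after export catches truncated files before they reach a published notebook.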
Each notebook undergoes these checks:
✓ Execution Test: Run "Restart & Run All" successfully
✓ Environment Test: Execute in both local and Kaggle environments
✓ Data Validation: Verify row counts and column presence
✓ Reproducibility: Fixed seeds produce identical results
✓ Documentation: All major sections have explanatory markdown
✓ Citation: Dataset attribution present and correct
✓ Visualization: All plots render correctly and are readable
✓ File Size: Total notebook data <500MB for Kaggle limits

- Jupyter Export: Save notebook as .ipynb
- PDF Generation: Export via nbconvert with --to pdf
- Data Packaging: Organize notebook data files in subdirectory
- README Creation: Document notebook purpose and requirements
- Repository Integration: Add to notebooks directory with proper linking
- External Distribution: Upload to Kaggle and link from Zenodo dataset
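The PDF export step above can be run with nbconvert. The path is taken from this repository's layout; note that `--to pdf` requires a local TeX installation:

```shell
# Convert the notebook to a static PDF (requires nbconvert plus a TeX install)
jupyter nbconvert --to pdf \
  notebooks/01-steam-platform-evolution-and-marketplace/notebook-01-steam-platform-evolution.ipynb

# Alternative without TeX: export to HTML instead
jupyter nbconvert --to html \
  notebooks/01-steam-platform-evolution-and-marketplace/notebook-01-steam-platform-evolution.ipynb
```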
This section links to related documentation and resources.
| Document | Relevance | Link |
|---|---|---|
| Data Dictionary | Schema reference for notebook data | /steam-dataset-2025-v1/DATA_DICTIONARY.md |
| Dataset Card | Methodology and collection details | /steam-dataset-2025-v1/DATASET_CARD.md |
| Analytics Data | Full analytical datasets | /data/04_analytics/README.md |
| Notebook Development Guide | Creation standards | /docs/notebook-development.md |
| Script | Purpose | Documentation |
|---|---|---|
| generate-notebook-01.py | Export temporal analysis data | scripts/12-notebook-generation/README.md |
| generate-notebook-02.py | Export semantic search data | scripts/12-notebook-generation/README.md |
| export-parquet-notebook-03.py | Export ML training data | scripts/12-notebook-generation/README.md |
| Resource | Description | Link |
|---|---|---|
| Jupyter Documentation | Notebook format specification | https://jupyter.org/documentation |
| Kaggle Notebooks | Platform hosting and execution | https://www.kaggle.com/code |
| NBViewer | Static notebook rendering | https://nbviewer.org/ |
| Version | Date | Changes | Author |
|---|---|---|---|
| 1.0 | 2025-01-06 | Initial notebooks directory documentation | VintageDon |
Primary Author: VintageDon (Donald Fountain)
GitHub: https://github.com/vintagedon
AI Collaboration: Claude 3.7 Sonnet (Anthropic) - Documentation structure and technical writing assistance
Human Responsibility: All notebook designs, analytical methodologies, and research questions are human-defined. AI assistance was used for documentation organization and clarity enhancement.
Document Information
| Field | Value |
|---|---|
| Author | VintageDon - https://github.com/vintagedon |
| Created | 2025-01-06 |
| Last Updated | 2025-01-06 |
| Version | 1.0 |
Tags: jupyter-notebooks, data-analysis, visualization, machine-learning, reproducible-research, kaggle