Skip to content

Latest commit

 

History

History
480 lines (349 loc) · 16.1 KB

File metadata and controls

480 lines (349 loc) · 16.1 KB

🤖 Models Directory

This directory contains machine learning model artifacts, trained models, evaluation results, and experimental notebooks for Steam Dataset 2025 analysis. Models leverage vector embeddings, structured metadata, and user review data to enable game recommendation systems, genre classification, and requirement prediction tasks.

Overview

The models directory serves as the repository for all machine learning experiments and production models built on Steam Dataset 2025. This includes baseline models for common tasks (genre classification, pricing prediction), semantic models using BGE-M3 embeddings, and experimental approaches combining structured and unstructured data. Each model includes training scripts, saved artifacts, evaluation metrics, and documentation enabling reproducibility and further research.


📁 Directory Contents

This section documents model artifacts and their training configurations.

Model Categories

Model Type Task Status Documentation
Genre Classifier Multi-label genre prediction Experimental genre-classification/README.md
Requirement Predictor Hardware spec estimation Experimental requirement-prediction/README.md
Semantic Recommender Vector-based game similarity Baseline semantic-recommender/README.md
Price Predictor USD pricing estimation Experimental price-prediction/README.md

Expected Artifacts

File Type Purpose Example
model.pkl Serialized trained model genre_classifier_v1.pkl
scaler.pkl Feature scaling parameters minmax_scaler.pkl
vectorizer.pkl Text vectorization config tfidf_vectorizer.pkl
metrics.json Evaluation results evaluation_metrics.json
training_log.txt Training process logs training_20250901.log
config.yaml Hyperparameters and settings model_config.yaml

🗂️ Repository Structure

Visual representation of models directory organization:

models/
├── 🎯 genre-classification/
│   ├── baseline/
│   │   ├── model.pkl
│   │   ├── metrics.json
│   │   └── README.md
│   ├── embeddings/
│   │   ├── model.pkl
│   │   ├── metrics.json
│   │   └── README.md
│   └── README.md
├── 💻 requirement-prediction/
│   ├── ram-predictor/
│   ├── storage-predictor/
│   └── README.md
├── 🔍 semantic-recommender/
│   ├── cosine-similarity/
│   ├── annoy-index/
│   └── README.md
├── 💰 price-prediction/
│   ├── baseline/
│   ├── advanced/
│   └── README.md
├── 📊 evaluation/
│   ├── cross-validation-results/
│   └── benchmark-comparisons/
└── 📄 README.md                              # This file

Navigation Guide:

  • Genre Classification: Multi-label models using descriptions and metadata
  • Requirement Prediction: Regression models for hardware specifications
  • Semantic Recommender: Vector-based similarity and retrieval systems
  • Price Prediction: Pricing models incorporating market features
  • Evaluation: Cross-validation results and benchmark comparisons

🔗 Related Categories

This section connects models to training data and analysis notebooks.

Category Relationship Documentation
Analytics Data Training and evaluation datasets data/04_analytics/README.md
Notebooks Experimental notebooks and analysis notebooks/README.md
Enriched Data Vector embeddings for semantic models data/03_enriched/README.md
Model Training Scripts Automated training pipelines scripts/README.md

🚀 Getting Started

This section provides guidance for training and using models.

Loading a Trained Model

import pickle
import pandas as pd

# Load model artifacts
with open('genre-classification/baseline/model.pkl', 'rb') as f:
    model = pickle.load(f)

with open('genre-classification/baseline/scaler.pkl', 'rb') as f:
    scaler = pickle.load(f)

# Load test data
games = pd.read_csv('../data/04_analytics/steam_games.csv')

# Prepare features
features = ['release_year', 'price_usd', 'developer_count', 
            'supports_windows', 'has_achievements', 'multiplayer']
X = games[features].fillna(0)
X_scaled = scaler.transform(X)

# Make predictions
predictions = model.predict(X_scaled)

Semantic Recommendation Example

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Load embeddings
embeddings = np.load('../data/03_enriched/embeddings_20250901.npy')

# Load game metadata
games = pd.read_csv('../data/04_analytics/steam_games.csv')

# Find similar games
query_app_id = 440  # Team Fortress 2
query_idx = games[games['app_id'] == query_app_id].index[0]
query_embedding = embeddings[query_idx].reshape(1, -1)

# Calculate similarities
similarities = cosine_similarity(query_embedding, embeddings)[0]

# Get top 10 recommendations
top_indices = np.argsort(similarities)[-11:-1][::-1]  # Exclude self
recommendations = games.iloc[top_indices][['app_id', 'name', 'primary_genre']]
recommendations['similarity'] = similarities[top_indices]

print(recommendations)

Training a New Model

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import pickle

# Load training data
games = pd.read_csv('../data/04_analytics/steam_games.csv')

# Prepare features and target
features = ['release_year', 'price_usd', 'developer_count', 
            'supports_windows', 'has_achievements', 'multiplayer']
X = games[features].fillna(0)
y = games['primary_genre']

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

# Save model
with open('my_genre_classifier.pkl', 'wb') as f:
    pickle.dump(model, f)

📚 Technical Documentation

This section provides specifications for model implementations.

Genre Classification Models

Task: Predict game genres from metadata and descriptions
Type: Multi-label classification (games can have multiple genres)
Features:

  • Text: game descriptions (TF-IDF or BGE-M3 embeddings)
  • Numeric: release_year, price_usd, developer_count
  • Categorical: platform support, multiplayer, achievements

Baseline Model:

  • Algorithm: Random Forest Classifier
  • Features: Structured metadata only (no text)
  • Macro F1 Score: 0.58
  • Training Time: ~5 minutes on full dataset

Embedding Model:

  • Algorithm: Logistic Regression with L2 regularization
  • Features: BGE-M3 embeddings (1024-dim) + metadata
  • Macro F1 Score: 0.73 (+26% over baseline)
  • Training Time: ~20 minutes with embeddings

Requirement Prediction Models

Task: Predict hardware requirements from game attributes
Type: Regression (min_ram_mb, rec_ram_mb, min_storage_gb)
Features:

  • Numeric: release_year, price_usd, developer_count
  • Categorical: primary_genre, platform support
  • Text: game descriptions (optional)

RAM Predictor:

  • Algorithm: Gradient Boosting Regressor
  • Target: min_ram_mb (minimum RAM requirement)
  • RMSE: 512 MB
  • R² Score: 0.67
  • Feature Importance: release_year (0.42), primary_genre (0.31), price (0.18)

Semantic Recommender Systems

Task: Find similar games using content-based filtering
Type: Similarity search and ranking
Approach: Vector similarity using BGE-M3 embeddings

Cosine Similarity Baseline:

  • Method: Exact cosine similarity over full embedding matrix
  • Latency: ~200ms for top-10 on 134K games
  • Storage: 520MB (NumPy array)
  • Pros: Perfect recall, simple implementation
  • Cons: Doesn't scale beyond ~1M vectors

ANN Index (Annoy):

  • Method: Approximate nearest neighbors with tree-based index
  • Latency: ~5ms for top-10 on 134K games
  • Storage: 750MB (index file)
  • Pros: Sub-10ms latency, scalable to millions of vectors
  • Cons: ~97% recall vs exact search (acceptable tradeoff)

Price Prediction Models

Task: Estimate game pricing in USD
Type: Regression with domain-specific handling
Challenges:

  • 45% of games are free (price = 0)
  • High price variance across genres
  • Price tiers and psychological pricing patterns

Two-Stage Model:

  1. Classification: Free vs Paid (Random Forest, F1=0.91)
  2. Regression: Price prediction for paid games only (XGBoost, RMSE=$3.45)

🎯 Use Cases

This section identifies applications for trained models.

Game Discovery Platforms

Semantic recommender enables content-based suggestions:

  • "Players who liked X might also enjoy..."
  • Genre exploration beyond exact label matches
  • Niche game discovery using semantic similarity
  • Cross-genre recommendations based on gameplay descriptions

Market Analysis

Pricing and requirement models support business intelligence:

  • Competitive pricing analysis for new releases
  • Hardware requirement forecasting for development planning
  • Genre saturation and opportunity identification
  • Price elasticity studies by genre and feature combinations

Research Applications

Models support academic research in game studies:

  • Genre evolution analysis using predicted vs actual labels
  • Hardware requirement trends over platform generations
  • Pricing strategy analysis across publisher portfolios
  • Content analysis using semantic embeddings

Data Quality Improvement

Models can flag data quality issues:

  • Genre mislabeling detection (high prediction confidence disagreement)
  • Requirement anomaly detection (predictions vs stated specs)
  • Price outliers identification (significant deviation from predicted)

🔍 Model Evaluation

This section documents evaluation methodologies and benchmark results.

Evaluation Protocols

All models follow these evaluation standards:

Train/Val/Test Split:       60% / 20% / 20%
Stratification:             By primary_genre for classification tasks
Random Seed:                42 (for reproducibility)
Cross-Validation:           5-fold stratified CV on train+val
Metric Reporting:           Test set only (never trained or tuned on)

Genre Classification Results

Baseline Model (Metadata Only):
  Accuracy:                 0.62
  Macro F1:                 0.58
  Weighted F1:              0.61
  Training Time:            5m 23s

Embedding Model (BGE-M3 + Metadata):
  Accuracy:                 0.76 (+22%)
  Macro F1:                 0.73 (+26%)
  Weighted F1:              0.75 (+23%)
  Training Time:            19m 47s

Per-Genre Performance (Embedding Model):
  Genre             Precision  Recall  F1
  Action            0.82       0.79    0.80
  Indie             0.71       0.75    0.73
  Adventure         0.68       0.71    0.69
  Casual            0.77       0.73    0.75
  Strategy          0.79       0.76    0.77
  RPG               0.74       0.72    0.73
  Simulation        0.76       0.74    0.75

RAM Prediction Results

Gradient Boosting Model:
  RMSE:                     512 MB
  MAE:                      384 MB
  R²:                       0.67
  MAPE:                     28.3%

Performance by RAM Tier:
  <1GB:    RMSE=256MB, R²=0.71  (72% of dataset)
  1-4GB:   RMSE=512MB, R²=0.65  (22% of dataset)
  >4GB:    RMSE=1024MB, R²=0.53 (6% of dataset)

Feature Importance:
  1. release_year:          0.42
  2. primary_genre:         0.31
  3. price_usd:             0.18
  4. developer_count:       0.09

Semantic Recommender Evaluation

Exact Cosine Similarity:
  Top-1 Accuracy:           0.87  (same genre)
  Top-10 Genre Overlap:     6.4/10 games
  User Study Relevance:     7.2/10 rating
  Latency (134K games):     ~200ms

Annoy ANN Index:
  Recall@10 vs Exact:       0.97
  Top-1 Accuracy:           0.84  (-3% vs exact)
  Top-10 Genre Overlap:     6.2/10 games (-3%)
  User Study Relevance:     7.1/10 rating (-1%)
  Latency (134K games):     ~5ms (-97% latency)

🛠️ Model Training Pipeline

This section describes the end-to-end training workflow.

Data Preparation

  1. Load Datasets: Read CSV/Parquet from analytics tier
  2. Handle Missing Values: Imputation or removal based on model requirements
  3. Feature Engineering: Create derived features (e.g., release_year from release_date)
  4. Train/Test Split: Stratified split maintaining class distributions
  5. Feature Scaling: StandardScaler or MinMaxScaler for numeric features
  6. Encoding: One-hot or label encoding for categorical features

Training Process

  1. Baseline Model: Train simple model to establish performance floor
  2. Hyperparameter Tuning: Grid search or random search with cross-validation
  3. Advanced Models: Train more complex models (embeddings, ensembles)
  4. Evaluation: Test set evaluation with multiple metrics
  5. Error Analysis: Examine failure cases and model limitations

Model Persistence

  1. Serialization: Save model using pickle or joblib
  2. Artifact Packaging: Include scaler, vectorizer, and config files
  3. Metadata: Record training date, dataset version, performance metrics
  4. Versioning: Use semantic versioning (v1.0, v1.1, v2.0)
  5. Documentation: README with usage examples and performance notes

📖 References

This section links to related documentation and resources.

Internal Documentation

Document Relevance Link
Data Dictionary Feature definitions /steam-dataset-2025-v1/DATA_DICTIONARY.md
Analytics Data Training datasets /data/04_analytics/README.md
Vector Embeddings BGE-M3 implementation /docs/methodologies/vector-embeddings.md
Notebooks Experimental model development /notebooks/README.md

External Resources

Resource Description Link
scikit-learn ML library for baseline models https://scikit-learn.org/
XGBoost Gradient boosting framework https://xgboost.readthedocs.io/
Annoy Approximate nearest neighbors https://github.com/spotify/annoy
BGE-M3 Paper Embedding model methodology https://arxiv.org/abs/2402.03216

📜 Documentation Metadata

Change Log

Version Date Changes Author
1.0 2025-01-06 Initial models directory documentation VintageDon

Authorship & Collaboration

Primary Author: VintageDon (Donald Fountain)
GitHub: https://github.com/vintagedon
AI Collaboration: Claude 3.7 Sonnet (Anthropic) - Documentation structure and technical writing assistance

Human Responsibility: All model architectures, evaluation protocols, and performance analysis are human-defined. AI assistance was used for documentation organization and clarity enhancement.


Document Information

Field Value
Author VintageDon - https://github.com/vintagedon
Created 2025-01-06
Last Updated 2025-01-06
Version 1.0

Tags: machine-learning, models, genre-classification, recommendation-systems, regression, semantic-search, embeddings