DataForgeOpenAIHub
diff --git a/.flake8 b/.flake8
Lines changed: 3 additions & 0 deletions
diff --git a/.github/workflows/autoformat.yml b/.github/workflows/autoformat.yml
Lines changed: 83 additions & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 14 additions & 1 deletion
diff --git a/dag/flows/healthcheck.py b/dag/flows/healthcheck.py
Lines changed: 3 additions & 2 deletions
diff --git a/get_version.py b/get_version.py
Lines changed: 1 addition & 0 deletions
diff --git a/notebooks/data_exploration.ipynb b/notebooks/data_exploration.ipynb
Lines changed: 45 additions & 40 deletions
@@ -0,0 +1,3 @@
+[flake8]
+max-line-length = 120
+ignore = E402,E302,E305,E266,E203,W503,W504,E722,E712,E721,E713,E714,E731
@@ -0,0 +1,83 @@
+name: Autoformat Code on Push
+
+on:
+  push:
+    branches:
+      - main  # Adjust the branch accordingly
+  pull_request:
+    branches:
+      - main  # Adjust the branch accordingly
+
+permissions:
+    checks: write
+    actions: read
+    contents: write
+
+jobs:
+  format:
+    runs-on: ubuntu-latest
+
+    env:
+      commit_message: "No formatting changes applied"
+
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v4
+        with:
+          token: ${{ secrets.GITHUB_TOKEN }}  # Use GitHub token to push changes
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: '3.12'  # Adjust to your Python version
+  
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install black black[jupyter] flake8 isort nbstripout pytest pytest-timeout versioneer
+
+      - name: Check import sorting with isort
+        id: isort-check
+        run: |
+          isort --check-only .
+        continue-on-error: true
+
+      - name: Format imports with isort
+        if: steps.isort-check.outcome == 'failure'
+        run: |
+          isort .
+
+      - name: Check code formatting with Black
+        id: black-check
+        run: |
+          black --line-length=120 --preview --enable-unstable-feature=string_processing --check . 
+        continue-on-error: true
+
+      - name: Format code with Black
+        if: steps.black-check.outcome == 'failure'
+        run: |
+          black --line-length=120 --preview --enable-unstable-feature=string_processing .
+      
+      - name: Set commit message
+        id: set-message
+        run: |
+          if [[ "${{ steps.isort-check.outcome }}" == "failure" && "${{ steps.black-check.outcome }}" == "failure" ]]; then
+            echo "commit_message=Sorted imports with isort & Autoformat code with Black" >> $GITHUB_ENV
+          elif [[ "${{ steps.isort-check.outcome }}" == "failure" ]]; then
+            echo "commit_message=Sorted imports with isort" >> $GITHUB_ENV
+          elif [[ "${{ steps.black-check.outcome }}" == "failure" ]]; then
+            echo "commit_message=Autoformat code with Black" >> $GITHUB_ENV
+          fi
+
+      - name: Commit and push changes if formatting is applied
+        if: steps.isort-check.outcome == 'failure' || steps.black-check.outcome == 'failure'
+        run: |
+            git config --local user.name "github-actions[bot]"
+            git config --local user.email "github-actions[bot]@users.noreply.github.com"
+            if [ -n "$(git status --porcelain)" ]; then
+              git add .
+              git commit -m "${{ env.commit_message }}"
+              git push origin ${{ github.ref }}
+            else
+              echo "No changes to commit"
+            fi
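The `Set commit message` step in the workflow above picks one of four messages from the outcomes of the two check steps. A minimal Python sketch of that decision table, for illustration only: the function name `pick_commit_message` is hypothetical and not part of the workflow, and the strings are copied from the YAML.

```python
DEFAULT_MESSAGE = "No formatting changes applied"


def pick_commit_message(isort_outcome: str, black_outcome: str) -> str:
    """Mirror the if/elif chain of the 'Set commit message' step (illustrative only)."""
    isort_failed = isort_outcome == "failure"
    black_failed = black_outcome == "failure"

    if isort_failed and black_failed:
        return "Sorted imports with isort & Autoformat code with Black"
    if isort_failed:
        return "Sorted imports with isort"
    if black_failed:
        return "Autoformat code with Black"
    # Neither check failed: the job-level env default stays in place.
    return DEFAULT_MESSAGE


if __name__ == "__main__":
    outcomes = ["failure", "success"]
    for isort_outcome in outcomes:
        for black_outcome in outcomes:
            print(isort_outcome, black_outcome, "->", pick_commit_message(isort_outcome, black_outcome))
```

The default message is never committed in practice, because the commit step only runs when at least one check has failed.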
@@ -1,11 +1,16 @@
 # Steam Sales Analysis
+[![Package Publish Status](https://img.shields.io/github/actions/workflow/status/DataForgeOpenAIHub/Steam-Sales-Analysis/python-publish.yml?branch=main)](https://github.com/DataForgeOpenAIHub/Steam-Sales-Analysis/actions)
+[![PyPI Downloads](https://img.shields.io/pypi/dm/steamstore_etl)](https://pypi.org/project/steamstore_etl/)
+[![PyPI Python Version](https://img.shields.io/pypi/pyversions/steamstore_etl)](https://pypi.org/project/steamstore_etl/)
+[![PyPI version](https://img.shields.io/pypi/v/steamstore_etl.svg)](https://pypi.org/project/steamstore_etl/)
+![GitHub release (latest by date)](https://img.shields.io/github/v/release/DataForgeOpenAIHub/Steam-Sales-Analysis)
 
 ![banner](assets/imgs/steam_logo_banner.jpg)
 
 ## Overview
 Welcome to **Steam Sales Analysis** – an innovative project designed to harness the power of data for insights into the gaming world. We have meticulously crafted an ETL (Extract, Transform, Load) pipeline that covers every essential step: data retrieval, processing, validation, and ingestion. By leveraging the robust Steamspy and Steam APIs, we collect comprehensive game-related metadata, details, and sales figures.
 
-But we don’t stop there. The culmination of this data journey sees the information elegantly loaded into a MySQL database hosted on Aiven Cloud. From this solid foundation, we take it a step further: the data is analyzed and visualized through dynamic and interactive Tableau dashboards. This transforms raw numbers into actionable insights, offering a clear window into gaming trends and sales performance. Join us as we dive deep into the data and bring the world of gaming to life!
+But we don’t stop there. The culmination of this data journey is the elegant loading of information into a MySQL database hosted on Aiven Cloud. From this solid foundation, we take it a step further: the data is analyzed and visualized through dynamic and interactive Tableau dashboards. This transforms raw numbers into actionable insights, offering a clear window into gaming trends and sales performance. Join us as we dive deep into the data and bring the world of gaming to life!
 
 # `steamstore` CLI
 ![Steamstore ETL Pipeline](assets/imgs/steamstore-etl.drawio.png)
@@ -291,6 +296,14 @@ To execute the ETL pipeline, use the following commands:
 
 This will start the process of retrieving data from the Steamspy and Steam APIs, processing and validating it, and then loading it into the MySQL database.
 
+# Dashboard
+- Explore the interactive [**Tableau dashboard**](https://sudarshanasrao.github.io/portfolio/portfolio-0/).
+
+## Authors
+1. [Kayvan Shah](https://github.com/KayvanShah1) | `MS in Applied Data Science` | `USC`
+2. [Sudarshana S Rao](https://github.com/SudarshanaSRao) | `MS in Electrical Engineering (Machine Learning & Data Science)` | `USC`
+3. [Rohit Veeradhi](https://github.com/Rohit04121998) | `MS in Electrical Engineering (Machine Learning & Data Science)` | `USC`
+
 ## References:
 
 ### API Used:
 
@@ -1,8 +1,9 @@
 import platform
-import prefect
-from prefect import task, flow, get_run_logger
 import sys
 
+import prefect
+from prefect import flow, get_run_logger, task
+
 
 @task
 def log_platform_info():
 
@@ -1,4 +1,5 @@
 import warnings
+
 import versioneer
 
 if __name__ == "__main__":
 
@@ -339,9 +339,9 @@
     }
    ],
    "source": [
-    "with open(os.path.join(Path.sql_queries, 'get_all_game_data.sql'), \"r\") as f:\n",
+    "with open(os.path.join(Path.sql_queries, \"get_all_game_data.sql\"), \"r\") as f:\n",
     "    query = text(f.read())\n",
-    "    \n",
+    "\n",
     "\n",
     "with get_db() as db:\n",
     "    result = db.execute(query)\n",
@@ -371,7 +371,7 @@
     }
    ],
    "source": [
-    "game_data['description'].iloc[10000-4-1]"
+    "game_data[\"description\"].iloc[10000 - 4 - 1]"
    ]
   },
   {
@@ -396,6 +396,7 @@
    "source": [
     "from fuzzywuzzy import process\n",
     "\n",
+    "\n",
     "def get_unique(series):\n",
     "    \"\"\"\n",
     "    Returns a set of unique values from a series of strings.\n",
@@ -407,7 +408,7 @@
     "    set: A set of unique values extracted from the series.\n",
     "\n",
     "    \"\"\"\n",
-    "    return set(list(itertools.chain(*series.apply(lambda x: [c for c in x.split(';')]))))"
+    "    return set(list(itertools.chain(*series.apply(lambda x: [c for c in x.split(\";\")]))))"
    ]
   },
   {
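`get_unique` in the hunk above splits each semicolon-separated string and collects the pieces into a set. A leaner sketch of the same idea, assuming plain Python strings as input (the notebook passes a pandas Series, which also iterates over its values); `get_unique_values` is a hypothetical name, not the notebook's function:

```python
import itertools
from typing import Iterable, Set


def get_unique_values(values: Iterable[str], sep: str = ";") -> Set[str]:
    """Return the set of distinct tokens found in sep-delimited strings."""
    # chain.from_iterable flattens lazily, so no intermediate list is needed
    return set(itertools.chain.from_iterable(value.split(sep) for value in values))


if __name__ == "__main__":
    sample = ["Action;Indie", "Indie;RPG", "Action"]
    print(sorted(get_unique_values(sample)))  # ['Action', 'Indie', 'RPG']
```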
@@ -461,7 +462,7 @@
     }
    ],
    "source": [
-    "geners = get_unique(game_data['genres'])\n",
+    "geners = get_unique(game_data[\"genres\"])\n",
     "geners"
    ]
   },
@@ -494,28 +495,30 @@
     "def standardize_genre(value, genre_list):\n",
     "    # Convert to lowercase for consistent comparison\n",
     "    value_lower = value.lower()\n",
-    "    \n",
+    "\n",
     "    # Define common patterns\n",
-    "    if 'rpg' in value_lower or 'role playing' in value_lower or 'role' in value_lower:\n",
-    "        return 'RPG'\n",
-    "    if 'simulation' in value_lower or 'simulators' in value_lower:\n",
-    "        return 'Simulation'\n",
-    "    if 'adventure' in value_lower:\n",
-    "        return 'Adventure'\n",
+    "    if \"rpg\" in value_lower or \"role playing\" in value_lower or \"role\" in value_lower:\n",
+    "        return \"RPG\"\n",
+    "    if \"simulation\" in value_lower or \"simulators\" in value_lower:\n",
+    "        return \"Simulation\"\n",
+    "    if \"adventure\" in value_lower:\n",
+    "        return \"Adventure\"\n",
+    "\n",
     "\n",
     "# Function to standardize multiple genres\n",
     "def standardize_multiple_genres(genres_str, genre_list):\n",
-    "    genres = genres_str.split(';')\n",
+    "    genres = genres_str.split(\";\")\n",
     "    standardized_genres = [standardize_genre(genre.strip(), genre_list) for genre in genres]\n",
-    "    return ';'.join(sorted(set(standardized_genres)))  # Use sorted(set()) to remove duplicates and sort\n",
-    "    \n",
+    "    return \";\".join(sorted(set(standardized_genres)))  # Use sorted(set()) to remove duplicates and sort\n",
+    "\n",
     "    # Find the best match from the list of unique genres\n",
     "    match, score = process.extractOne(value, genre_list)\n",
     "    return match\n",
     "\n",
+    "\n",
     "# Apply the standardization function to the Genres column\n",
-    "game_data['genres'] = game_data['genres'].apply(lambda x: standardize_multiple_genres(x, geners))\n",
-    "geners = get_unique(game_data['genres'])\n",
+    "game_data[\"genres\"] = game_data[\"genres\"].apply(lambda x: standardize_multiple_genres(x, geners))\n",
+    "geners = get_unique(game_data[\"genres\"])\n",
     "geners"
    ]
   },
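As shown in the hunk above, the `process.extractOne` lines sit after the `return` in `standardize_multiple_genres`, so they appear to be unreachable, and `standardize_genre` falls through to `None` when no pattern matches. A hedged sketch of how the fallback could be wired in: it uses the standard library's `difflib.get_close_matches` as a stand-in for `fuzzywuzzy`, which is an assumption made to keep the example self-contained, and it folds the `"role playing"` test into the `"role"` test that already covers it.

```python
import difflib
from typing import List


def standardize_genre(value: str, genre_list: List[str]) -> str:
    """Map a raw genre label to a canonical one (illustrative sketch)."""
    value_lower = value.lower()

    # Common patterns, as in the notebook
    if "rpg" in value_lower or "role" in value_lower:
        return "RPG"
    if "simulation" in value_lower or "simulators" in value_lower:
        return "Simulation"
    if "adventure" in value_lower:
        return "Adventure"

    # Fallback: closest known genre. difflib stands in for fuzzywuzzy here.
    matches = difflib.get_close_matches(value, genre_list, n=1, cutoff=0.0)
    return matches[0] if matches else value


def standardize_multiple_genres(genres_str: str, genre_list: List[str]) -> str:
    """Standardize each ';'-separated genre, then de-duplicate and sort."""
    genres = [g.strip() for g in genres_str.split(";") if g.strip()]
    return ";".join(sorted({standardize_genre(g, genre_list) for g in genres}))


if __name__ == "__main__":
    known = ["Action", "Indie", "Strategy"]
    print(standardize_multiple_genres("Role Playing;Actoin;Adventure Games", known))
```

Because every branch now returns a string, `sorted` and `";".join` no longer see `None` values.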
@@ -615,7 +618,7 @@
     }
    ],
    "source": [
-    "categories = get_unique(game_data['categories'])\n",
+    "categories = get_unique(game_data[\"categories\"])\n",
     "categories"
    ]
   },
@@ -643,21 +646,22 @@
     "    - score: The calculated rating score as a percentage.\n",
     "\n",
     "    \"\"\"\n",
-    "    pos = row['positive_ratings']\n",
-    "    neg = row['negative_ratings']\n",
+    "    pos = row[\"positive_ratings\"]\n",
+    "    neg = row[\"negative_ratings\"]\n",
     "\n",
     "    total_reviews = pos + neg\n",
-    "    \n",
+    "\n",
     "    if total_reviews > 0:\n",
     "        average = pos / total_reviews\n",
-    "        score = average - (average * 0.5) * 2**(-math.log10(total_reviews + 1))\n",
+    "        score = average - (average * 0.5) * 2 ** (-math.log10(total_reviews + 1))\n",
     "        return score * 100\n",
     "    else:\n",
     "        return 0.0\n",
     "\n",
-    "game_data['total_ratings'] = game_data['positive_ratings'] + game_data['negative_ratings']\n",
-    "game_data['review_score'] = game_data['positive_ratings'] / game_data['total_ratings']\n",
-    "game_data['rating'] = game_data.apply(calc_rating, axis=1)"
+    "\n",
+    "game_data[\"total_ratings\"] = game_data[\"positive_ratings\"] + game_data[\"negative_ratings\"]\n",
+    "game_data[\"review_score\"] = game_data[\"positive_ratings\"] / game_data[\"total_ratings\"]\n",
+    "game_data[\"rating\"] = game_data.apply(calc_rating, axis=1)"
    ]
   },
   {
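The rating formula in the hunk above damps the raw positive ratio when a game has few reviews: with no confidence the score is half the ratio, and it approaches the full ratio as the review count grows. A small standalone sketch of the same formula; taking two integers instead of a DataFrame row is an assumption made for the example:

```python
import math


def calc_rating(positive: int, negative: int) -> float:
    """Review score in percent, damped toward half the raw ratio when reviews are few."""
    total_reviews = positive + negative
    if total_reviews <= 0:
        return 0.0
    average = positive / total_reviews
    # The penalty factor 2 ** (-log10(n + 1)) shrinks as total_reviews grows
    score = average - (average * 0.5) * 2 ** (-math.log10(total_reviews + 1))
    return score * 100


if __name__ == "__main__":
    # Same 90% positive ratio at three different review counts
    for pos, neg in [(9, 1), (90, 10), (9000, 1000)]:
        print(pos + neg, "reviews:", round(calc_rating(pos, neg), 2))
```

At a fixed ratio, the score rises with the number of reviews but never exceeds the raw percentage.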
@@ -996,24 +1000,25 @@
    "source": [
     "def categorize_year(year):\n",
     "    if year < 2020:\n",
-    "        return 'Before 2020'\n",
+    "        return \"Before 2020\"\n",
     "    elif 2020 <= year <= 2022:\n",
-    "        return '2020-2022'\n",
+    "        return \"2020-2022\"\n",
     "    else:\n",
-    "        return 'After 2022'\n",
+    "        return \"After 2022\"\n",
+    "\n",
     "\n",
-    "game_data['year'] = game_data['year'].fillna(0).astype(int) \n",
-    "game_data['Region'] = game_data['year'].apply(categorize_year)\n",
+    "game_data[\"year\"] = game_data[\"year\"].fillna(0).astype(int)\n",
+    "game_data[\"Region\"] = game_data[\"year\"].apply(categorize_year)\n",
     "\n",
     "# Calculate the frequency of each year\n",
-    "yearly_counts = game_data.groupby(['Region', 'year']).size().reset_index(name='Frequency')\n",
+    "yearly_counts = game_data.groupby([\"Region\", \"year\"]).size().reset_index(name=\"Frequency\")\n",
     "\n",
     "# Plotting using Seaborn\n",
     "plt.figure(figsize=(12, 6))\n",
-    "sns.barplot(data=yearly_counts, x='year', y='Frequency', hue='Region')\n",
-    "plt.title('Game Release by Year')\n",
-    "plt.xlabel('Year')\n",
-    "plt.ylabel('Frequency')\n",
+    "sns.barplot(data=yearly_counts, x=\"year\", y=\"Frequency\", hue=\"Region\")\n",
+    "plt.title(\"Game Release by Year\")\n",
+    "plt.xlabel(\"Year\")\n",
+    "plt.ylabel(\"Frequency\")\n",
     "plt.xticks(rotation=45)\n",
     "plt.show()"
    ]
@@ -1031,12 +1036,12 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "tags = col_row_df['tags']\n",
+    "tags = col_row_df[\"tags\"]\n",
     "parsed_tags = tags.apply(lambda x: literal_eval(x) if x else {})\n",
     "\n",
     "unique_tags = set(itertools.chain(*parsed_tags))\n",
     "\n",
-    "print('Number of unique tags:', len(unique_tags))\n",
+    "print(\"Number of unique tags:\", len(unique_tags))\n",
     "\n",
     "# Create a DataFrame with 15 columns and 30 rows\n",
     "num_columns = 15\n",
@@ -1045,7 +1050,7 @@
     "unique_tags = sorted(list(unique_tags))\n",
     "\n",
     "# Reshape the list into the desired DataFrame shape\n",
-    "ut = [unique_tags[i * num_columns:(i + 1) * num_columns] for i in range(num_rows)]\n",
+    "ut = [unique_tags[i * num_columns : (i + 1) * num_columns] for i in range(num_rows)]\n",
     "\n",
     "# Create the DataFrame\n",
     "utdf = pd.DataFrame(ut)\n",
@@ -1079,8 +1084,8 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "langs = col_row_df['languages']\n",
-    "langs = langs.apply(lambda x: x.split(', ') if x else [])\n",
+    "langs = col_row_df[\"languages\"]\n",
+    "langs = langs.apply(lambda x: x.split(\", \") if x else [])\n",
     "\n",
     "langc = Counter()\n",
     "\n",