
Commit 98ef5e0

Merge branch 'main' into main
2 parents 1ac02fa + 8707ce0 commit 98ef5e0

29 files changed: +1157 -261 lines changed

README.md

Lines changed: 8 additions & 0 deletions
@@ -83,13 +83,18 @@ Top must-join communities for ML:
 - [Hex](https://hex.ai/)
 - [Apache Superset](https://superset.apache.org/)
 - [Evidence](https://evidence.dev)
+- [Redash](https://redash.io/)
+- [Lightdash](https://lightdash.com/)
 - Data Integration
 - [Cube](https://cube.dev)
 - [Fivetran](https://www.fivetran.com)
 - [Airbyte](https://airbyte.io)
 - [dlt](https://dlthub.com/)
 - [Sling](https://slingdata.io/)
 - [Meltano](https://meltano.com/)
+- Semantic Layers
+- [Cube](https://cube.dev)
+- [dbt Semantic Layer](https://www.getdbt.com/product/semantic-layer)
 - Modern OLAP
 - [Apache Druid](https://druid.apache.org/)
 - [ClickHouse](https://clickhouse.com/)

@@ -190,6 +195,9 @@ Here's the mostly comprehensive list of data engineering creators:
 | Arnaud Milleker | | [Arnaud Milleker](https://www.linkedin.com/in/arnaudmilleker/) (7k+) | | | |
 | Soumil Shah | [Soumil Shah] (https://www.youtube.com/@SoumilShah) (50k) | [Soumil Shah](https://www.linkedin.com/in/shah-soumil/) (8k+) | | | |
 | Ananth Packkildurai | | [Ananth Packkildurai](https://www.linkedin.com/in/ananthdurai/) (18k+) | | | |
+| Dan Kornas | | | [dankornas](https://www.twitter.com/dankornas) (66k+) | |
+| Nitin | https://www.linkedin.com/in/tomernitin29/ |
+| Manojkumar Vadivel | | [Manojkumar Vadivel](https://www.linkedin.com/in/manojvsj/) (12k+) |

 ### Great Podcasts

books.md

Lines changed: 2 additions & 1 deletion
@@ -29,4 +29,5 @@
 - [Pandas Cookbook, Third Edition](https://www.amazon.com/Pandas-Cookbook-Practical-scientific-exploratory/dp/1836205872)
 - [Data Pipelines Pocket Reference](https://www.oreilly.com/library/view/data-pipelines-pocket/9781492087823/)
 - [Stream Processing with Apache Flink](https://www.oreilly.com/library/view/stream-processing-with/9781491974285/)
-- [Apache Iceberg The Definitive Guide](https://www.oreilly.com/library/view/apache-iceberg-the/9781098148614/)
+- [Apache Iceberg The Definitive Guide](https://www.oreilly.com/library/view/apache-iceberg-the/9781098148614/)
+- [Python for Data Analysis, 3E](https://wesmckinney.com/book/)

bootcamp/materials/1-dimensional-data-modeling/README.md

Lines changed: 4 additions & 2 deletions
@@ -45,10 +45,12 @@ There are two methods to get Postgres running locally.
 - For Mac: Follow this **[tutorial](https://daily-dev-tips.com/posts/installing-postgresql-on-a-mac-with-homebrew/)** (Homebrew is really nice for installing on Mac)
 - For Windows: Follow this **[tutorial](https://www.sqlshack.com/how-to-install-postgresql-on-windows/)**
 2. Run this command after replacing **`<computer-username>`** with your computer's username:
-
+
 ```bash
-pg_restore -U <computer-username> postgres data.dump
+pg_restore -U <computer-username> -d postgres data.dump
 ```
+
+If you have any issue, the syntax is `pg_restore -U [username] -d [database_name] -h [host] -p [port] [backup_file]`
 
 3. Set up DataGrip, DBeaver, or your VS Code extension to point at your locally running Postgres instance.
 4. Have fun querying!
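
A fuller invocation of the syntax from the added note might look like the sketch below. The host, port, and database name here are just the stock Postgres defaults, not values confirmed by this repo's setup; adjust them to wherever your instance is actually listening.

```bash
# Illustrative only: restore data.dump into the default "postgres" database
# on a local server listening on the default port.
pg_restore -U <computer-username> -d postgres -h localhost -p 5432 data.dump
```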

bootcamp/materials/1-dimensional-data-modeling/example.env

Lines changed: 1 addition & 1 deletion
@@ -11,4 +11,4 @@ DOCKER_IMAGE=my-postgres-image
 
 PGADMIN_EMAIL=[email protected]
 PGADMIN_PASSWORD=postgres
-PGADMIN_PORT=5050
+PGADMIN_PORT=5050

bootcamp/materials/3-spark-fundamentals/notebooks/Caching.ipynb

Lines changed: 49 additions & 46 deletions
@@ -2,7 +2,7 @@
 "cells": [
 {
 "cell_type": "code",
-"execution_count": 5,
+"execution_count": 3,
 "id": "e9ae4c8b-4599-4fbb-a545-76b6e3bcb84d",
 "metadata": {},
 "outputs": [
@@ -12,52 +12,54 @@
 "text": [
 "== Physical Plan ==\n",
 "AdaptiveSparkPlan isFinalPlan=false\n",
-"+- ObjectHashAggregate(keys=[device_id#937, device_type#940], functions=[collect_list(user_id#907, 0, 0)])\n",
-" +- ObjectHashAggregate(keys=[device_id#937, device_type#940], functions=[partial_collect_list(user_id#907, 0, 0)])\n",
-" +- Project [device_id#937, device_type#940, user_id#907]\n",
-" +- SortMergeJoin [device_id#937], [device_id#908], Inner\n",
-" :- Sort [device_id#937 ASC NULLS FIRST], false, 0\n",
-" : +- Exchange hashpartitioning(device_id#937, 4), ENSURE_REQUIREMENTS, [plan_id=1320]\n",
-" : +- Filter isnotnull(device_id#937)\n",
-" : +- FileScan csv [device_id#937,device_type#940] Batched: false, DataFilters: [isnotnull(device_id#937)], Format: CSV, Location: InMemoryFileIndex(1 paths)[file:/home/iceberg/data/devices.csv], PartitionFilters: [], PushedFilters: [IsNotNull(device_id)], ReadSchema: struct<device_id:int,device_type:string>\n",
-" +- Sort [device_id#908 ASC NULLS FIRST], false, 0\n",
-" +- Exchange hashpartitioning(device_id#908, 4), ENSURE_REQUIREMENTS, [plan_id=1321]\n",
-" +- Filter isnotnull(device_id#908)\n",
-" +- InMemoryTableScan [user_id#907, device_id#908], [isnotnull(device_id#908)]\n",
-" +- InMemoryRelation [user_id#907, device_id#908, event_counts#945L, host_array#946], StorageLevel(disk, memory, deserialized, 1 replicas)\n",
-" +- ObjectHashAggregate(keys=[user_id#198, device_id#199], functions=[count(1), collect_list(distinct host#201, 0, 0)])\n",
-" +- Exchange hashpartitioning(user_id#198, device_id#199, 4), ENSURE_REQUIREMENTS, [plan_id=1338]\n",
-" +- ObjectHashAggregate(keys=[user_id#198, device_id#199], functions=[merge_count(1), partial_collect_list(distinct host#201, 0, 0)])\n",
-" +- *(2) HashAggregate(keys=[user_id#198, device_id#199, host#201], functions=[merge_count(1)])\n",
-" +- Exchange hashpartitioning(user_id#198, device_id#199, host#201, 4), ENSURE_REQUIREMENTS, [plan_id=1333]\n",
-" +- *(1) HashAggregate(keys=[user_id#198, device_id#199, host#201], functions=[partial_count(1)])\n",
-" +- *(1) Filter isnotnull(user_id#198)\n",
-" +- FileScan csv [user_id#198,device_id#199,host#201] Batched: false, DataFilters: [isnotnull(user_id#198)], Format: CSV, Location: InMemoryFileIndex(1 paths)[file:/home/iceberg/data/events.csv], PartitionFilters: [], PushedFilters: [IsNotNull(user_id)], ReadSchema: struct<user_id:int,device_id:int,host:string>\n",
+"+- ObjectHashAggregate(keys=[device_id#598, device_type#601], functions=[collect_list(user_id#568, 0, 0)])\n",
+" +- ObjectHashAggregate(keys=[device_id#598, device_type#601], functions=[partial_collect_list(user_id#568, 0, 0)])\n",
+" +- Project [device_id#598, device_type#601, user_id#568]\n",
+" +- SortMergeJoin [device_id#598], [device_id#569], Inner\n",
+" :- Sort [device_id#598 ASC NULLS FIRST], false, 0\n",
+" : +- Exchange hashpartitioning(device_id#598, 4), ENSURE_REQUIREMENTS, [plan_id=735]\n",
+" : +- Filter isnotnull(device_id#598)\n",
+" : +- FileScan csv [device_id#598,device_type#601] Batched: false, DataFilters: [isnotnull(device_id#598)], Format: CSV, Location: InMemoryFileIndex(1 paths)[file:/home/iceberg/data/devices.csv], PartitionFilters: [], PushedFilters: [IsNotNull(device_id)], ReadSchema: struct<device_id:int,device_type:string>\n",
+" +- Sort [device_id#569 ASC NULLS FIRST], false, 0\n",
+" +- Exchange hashpartitioning(device_id#569, 4), ENSURE_REQUIREMENTS, [plan_id=736]\n",
+" +- Filter isnotnull(device_id#569)\n",
+" +- InMemoryTableScan [user_id#568, device_id#569], [isnotnull(device_id#569)]\n",
+" +- InMemoryRelation [user_id#568, device_id#569, event_counts#606L, host_array#607], StorageLevel(disk, memory, deserialized, 1 replicas)\n",
+" +- AdaptiveSparkPlan isFinalPlan=false\n",
+" +- ObjectHashAggregate(keys=[user_id#17, device_id#18], functions=[count(1), collect_list(distinct host#20, 0, 0)])\n",
+" +- Exchange hashpartitioning(user_id#17, device_id#18, 4), ENSURE_REQUIREMENTS, [plan_id=752]\n",
+" +- ObjectHashAggregate(keys=[user_id#17, device_id#18], functions=[merge_count(1), partial_collect_list(distinct host#20, 0, 0)])\n",
+" +- HashAggregate(keys=[user_id#17, device_id#18, host#20], functions=[merge_count(1)])\n",
+" +- Exchange hashpartitioning(user_id#17, device_id#18, host#20, 4), ENSURE_REQUIREMENTS, [plan_id=748]\n",
+" +- HashAggregate(keys=[user_id#17, device_id#18, host#20], functions=[partial_count(1)])\n",
+" +- Filter isnotnull(user_id#17)\n",
+" +- FileScan csv [user_id#17,device_id#18,host#20] Batched: false, DataFilters: [isnotnull(user_id#17)], Format: CSV, Location: InMemoryFileIndex(1 paths)[file:/home/iceberg/data/events.csv], PartitionFilters: [], PushedFilters: [IsNotNull(user_id)], ReadSchema: struct<user_id:int,device_id:int,host:string>\n",
 "\n",
 "\n",
 "== Physical Plan ==\n",
 "AdaptiveSparkPlan isFinalPlan=false\n",
-"+- ObjectHashAggregate(keys=[user_id#907], functions=[max(event_counts#945L), collect_list(device_id#908, 0, 0)])\n",
-" +- ObjectHashAggregate(keys=[user_id#907], functions=[partial_max(event_counts#945L), partial_collect_list(device_id#908, 0, 0)])\n",
-" +- Project [user_id#907, device_id#908, event_counts#945L]\n",
-" +- SortMergeJoin [user_id#907], [user_id#953], Inner\n",
-" :- Sort [user_id#907 ASC NULLS FIRST], false, 0\n",
-" : +- Exchange hashpartitioning(user_id#907, 4), ENSURE_REQUIREMENTS, [plan_id=1374]\n",
-" : +- Filter isnotnull(user_id#907)\n",
-" : +- FileScan csv [user_id#907,device_id#908] Batched: false, DataFilters: [isnotnull(user_id#907)], Format: CSV, Location: InMemoryFileIndex(1 paths)[file:/home/iceberg/data/events.csv], PartitionFilters: [], PushedFilters: [IsNotNull(user_id)], ReadSchema: struct<user_id:int,device_id:int>\n",
-" +- Sort [user_id#953 ASC NULLS FIRST], false, 0\n",
-" +- Exchange hashpartitioning(user_id#953, 4), ENSURE_REQUIREMENTS, [plan_id=1375]\n",
-" +- Filter isnotnull(user_id#953)\n",
-" +- InMemoryTableScan [user_id#953, event_counts#945L], [isnotnull(user_id#953)]\n",
-" +- InMemoryRelation [user_id#953, device_id#954, event_counts#945L, host_array#946], StorageLevel(disk, memory, deserialized, 1 replicas)\n",
-" +- ObjectHashAggregate(keys=[user_id#198, device_id#199], functions=[count(1), collect_list(distinct host#201, 0, 0)])\n",
-" +- Exchange hashpartitioning(user_id#198, device_id#199, 4), ENSURE_REQUIREMENTS, [plan_id=1392]\n",
-" +- ObjectHashAggregate(keys=[user_id#198, device_id#199], functions=[merge_count(1), partial_collect_list(distinct host#201, 0, 0)])\n",
-" +- *(2) HashAggregate(keys=[user_id#198, device_id#199, host#201], functions=[merge_count(1)])\n",
-" +- Exchange hashpartitioning(user_id#198, device_id#199, host#201, 4), ENSURE_REQUIREMENTS, [plan_id=1387]\n",
-" +- *(1) HashAggregate(keys=[user_id#198, device_id#199, host#201], functions=[partial_count(1)])\n",
-" +- *(1) Filter isnotnull(user_id#198)\n",
-" +- FileScan csv [user_id#198,device_id#199,host#201] Batched: false, DataFilters: [isnotnull(user_id#198)], Format: CSV, Location: InMemoryFileIndex(1 paths)[file:/home/iceberg/data/events.csv], PartitionFilters: [], PushedFilters: [IsNotNull(user_id)], ReadSchema: struct<user_id:int,device_id:int,host:string>\n",
+"+- ObjectHashAggregate(keys=[user_id#568], functions=[max(event_counts#606L), collect_list(device_id#569, 0, 0)])\n",
+" +- ObjectHashAggregate(keys=[user_id#568], functions=[partial_max(event_counts#606L), partial_collect_list(device_id#569, 0, 0)])\n",
+" +- Project [user_id#568, device_id#569, event_counts#606L]\n",
+" +- SortMergeJoin [user_id#568], [user_id#614], Inner\n",
+" :- Sort [user_id#568 ASC NULLS FIRST], false, 0\n",
+" : +- Exchange hashpartitioning(user_id#568, 4), ENSURE_REQUIREMENTS, [plan_id=788]\n",
+" : +- Filter isnotnull(user_id#568)\n",
+" : +- FileScan csv [user_id#568,device_id#569] Batched: false, DataFilters: [isnotnull(user_id#568)], Format: CSV, Location: InMemoryFileIndex(1 paths)[file:/home/iceberg/data/events.csv], PartitionFilters: [], PushedFilters: [IsNotNull(user_id)], ReadSchema: struct<user_id:int,device_id:int>\n",
+" +- Sort [user_id#614 ASC NULLS FIRST], false, 0\n",
+" +- Exchange hashpartitioning(user_id#614, 4), ENSURE_REQUIREMENTS, [plan_id=789]\n",
+" +- Filter isnotnull(user_id#614)\n",
+" +- InMemoryTableScan [user_id#614, event_counts#606L], [isnotnull(user_id#614)]\n",
+" +- InMemoryRelation [user_id#614, device_id#615, event_counts#606L, host_array#607], StorageLevel(disk, memory, deserialized, 1 replicas)\n",
+" +- AdaptiveSparkPlan isFinalPlan=false\n",
+" +- ObjectHashAggregate(keys=[user_id#17, device_id#18], functions=[count(1), collect_list(distinct host#20, 0, 0)])\n",
+" +- Exchange hashpartitioning(user_id#17, device_id#18, 4), ENSURE_REQUIREMENTS, [plan_id=805]\n",
+" +- ObjectHashAggregate(keys=[user_id#17, device_id#18], functions=[merge_count(1), partial_collect_list(distinct host#20, 0, 0)])\n",
+" +- HashAggregate(keys=[user_id#17, device_id#18, host#20], functions=[merge_count(1)])\n",
+" +- Exchange hashpartitioning(user_id#17, device_id#18, host#20, 4), ENSURE_REQUIREMENTS, [plan_id=801]\n",
+" +- HashAggregate(keys=[user_id#17, device_id#18, host#20], functions=[partial_count(1)])\n",
+" +- Filter isnotnull(user_id#17)\n",
+" +- FileScan csv [user_id#17,device_id#18,host#20] Batched: false, DataFilters: [isnotnull(user_id#17)], Format: CSV, Location: InMemoryFileIndex(1 paths)[file:/home/iceberg/data/events.csv], PartitionFilters: [], PushedFilters: [IsNotNull(user_id)], ReadSchema: struct<user_id:int,device_id:int,host:string>\n",
 "\n",
 "\n"
 ]
@@ -73,10 +75,10 @@
 "eventsAggregated: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [user_id: int, device_id: int ... 2 more fields]\n",
 "usersAndDevices: org.apache.spark.sql.DataFrame = [user_id: int, user_id: int ... 2 more fields]\n",
 "devicesOnEvents: org.apache.spark.sql.DataFrame = [device_id: int, device_type: string ... 3 more fields]\n",
-"res4: Array[org.apache.spark.sql.Row] = Array([-2147470439,-2147470439,3,WrappedArray(378988111, 378988111, 378988111)])\n"
+"res1: Array[org.apache.spark.sql.Row] = Array([-2147470439,-2147470439,3,WrappedArray(378988111, 378988111, 378988111)])\n"
 ]
 },
-"execution_count": 5,
+"execution_count": 3,
 "metadata": {},
 "output_type": "execute_result"
 }
@@ -107,6 +109,7 @@
 "//Caching here should be < 5 GBs or used for broadcast join\n",
 "//You need to tune executor memory otherwise it'll spill to disk and be slow\n",
 "//Don't really try using any of the other StorageLevel besides MEMORY_ONLY\n",
+"\n",
 "val eventsAggregated = spark.sql(f\"\"\"\n",
 " SELECT user_id, \n",
 " device_id, \n",
@@ -207,4 +210,4 @@
 },
 "nbformat": 4,
 "nbformat_minor": 5
-}
+}
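
The churn in this notebook is re-run output: the cached aggregate still shows up as an InMemoryRelation / InMemoryTableScan pair, and only the expression IDs, plan_ids, and execution counts shifted between runs. For readers skimming the diff, here is a minimal Scala sketch of the caching pattern the notebook exercises; the names and the simplified query are illustrative, not the notebook's exact code, and it assumes the events and devices CSVs are already registered as temp views, as the notebook's SQL implies.

```scala
import org.apache.spark.storage.StorageLevel

// Cache the aggregated events so later joins reuse the result instead of
// recomputing the aggregation from the CSV scan each time.
// MEMORY_ONLY is the level the notebook's comments recommend, on the
// assumption that the cached result stays well under a few GB.
val eventsAggregated = spark.sql(
  """
    SELECT user_id, device_id, COUNT(1) AS event_counts
    FROM events
    GROUP BY user_id, device_id
  """
).persist(StorageLevel.MEMORY_ONLY)

// Any query built on the cached Dataset shows InMemoryRelation /
// InMemoryTableScan in its physical plan, as in the output above.
eventsAggregated
  .join(spark.table("devices"), Seq("device_id"))
  .explain()
```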
