# docs: Add smartbrevity overview materials for Data Engineering Boot … #255

**`.vscode/settings.json`** (7 additions, 0 deletions)

```json
{
"python.testing.pytestArgs": [
"bootcamp"
],
"python.testing.unittestEnabled": false,
"python.testing.pytestEnabled": true
}
```

# Free Data Engineering Boot Camp Kickoff Summary

```mermaid
mindmap
  root((Free Data Engineering Boot Camp))
    Program Structure [00:01:01]
      Six weeks intensive
      1-2 hours daily commitment
      Pre-recorded lessons
      Two components per module
        Lecture
        Lab
      AI-graded homework
      Discord community support
    Curriculum [00:06:02]
      Data Modeling
        Two weeks coverage
        Foundation concepts
        Data product focus
      Analytical Patterns
        Growth accounting
        Advanced SQL
          Window functions
      KPIs & Experimentation
        Metrics definition
        Product thinking
      Data Visualization
        Communication
        Tableau basics
        Dashboard types
      Infrastructure Track
        Unit testing
        Pipeline maintenance
        Apache Spark fundamentals
        Data quality patterns
        Real-time pipelines
    Certification Paths [00:19:40]
      Watch-only certificate
        Attendance tracking
        Basic recognition
      Full certification
        Complete all homework
        Watch all content
        Expected 3-4% completion rate
    Paid vs Free Differences [00:27:28]
      Cloud infrastructure access
        AWS deployment
        One year access
        Paid APIs available
      Enhanced support
        Weekly Q&A sessions
        Industry expert speakers
        Dedicated TA support
      Additional content
        Snowflake
        Trino
        DBT
        Apache Iceberg
      Capstone project
        Dedicated feedback
        Portfolio building
      Job interview training
```


*A comprehensive 6-week program launching online with daily content releases at 5 PM Pacific.*

**Big picture:** Tech expert Zach is offering free data engineering training to help 1,000 engineers land jobs by February 15, with content available on YouTube until December 2025.

**Key details:**
- 10,000+ enrolled students
- 1-2 hours daily commitment recommended
- All content pre-recorded and uploaded daily
- Includes AI-graded homework assignments
- Discord community support available

**Core curriculum:**
- Data modeling (2 weeks)
- Analytical patterns and advanced SQL
- KPIs and experimentation
- Data visualization
- Infrastructure and pipeline maintenance
- Apache Spark fundamentals
- Real-time pipelines with Flink and Kafka

**Success metrics:** Only about 300-400 of the 10,000 enrolled students are expected to complete certification, which requires:
- Watching all videos
- Completing all homework assignments
- Active participation in community

**What's different in paid version:**
- Cloud infrastructure access
- Weekly Q&As with Zach
- Industry expert speakers
- Additional tools: Snowflake, DBT, AWS Glue
- Direct TA support
- Capstone project mentorship
- Job interview training

**Bottom line:** While the free version offers substantial technical training, the paid version provides more hands-on support and cloud-based tooling for job preparation.
# Data Modeling: Complex Types and Cumulative Tables Deep Dive

*A comprehensive look at dimensional data modeling principles, focusing on the balance between data efficiency and usability.*


```mermaid
mindmap
  root((Dimensional
    Data
    Modeling - Intro))
    (Understanding Dimensions)
      (Identifier Dimensions)
        (Uniquely identify entities)
        (User ID)
        (Social Security)
        (Device ID)
      (Attributes)
        (Slowly Changing)
          (Time dependent values)
          (Can change over time)
        (Fixed)
          (Birthday)
          (Phone manufacturer)
    (Data Modeling Types)
      (OLTP)
        (Online Transaction Processing)
        (Normalized)
        (Minimal duplication)
        (Fast single row operations)
      (Master Data)
        (Middle ground)
        (Complete entity definitions)
        (Reference data)
      (OLAP)
        (Online Analytical Processing)
        (Denormalized)
        (Optimized for analysis)
        (Population level queries)
    (Cumulative Table Design)
      (Historical Analysis)
      (State Transitions)
      (Uses Today + Yesterday data)
      (Full outer join approach)
      (Drawbacks)
        (Sequential backfilling)
        (PII management challenges)
    (Complex Data Types)
      (Struct)
        (Table within table)
        (Different value types)
      (Array)
        (Ordered lists)
        (Same data type)
      (Map)
        (Key-value pairs)
        (Same value type)
    (Data Consumer Types)
      (Data Analysts)
        (Need simple flat data)
        (Easy to query)
      (Data Engineers)
        (Can handle complex types)
        (Build downstream pipelines)
      (ML Models)
        (Need identifier + features)
        (Flat structure preferred)
      (Customers)
        (Need visualizations)
        (Charts over raw data)
    (Compactness vs Usability)
      (Most Compact)
        (Compressed data)
        (Online systems)
      (Middle Ground)
        (Arrays and structs)
        (Master data)
      (Most Usable)
        (Flat structure)
        (Analytics focused)
```


**Big picture:** Data modeling strategies vary significantly based on end users' needs, from analysts requiring simple flat tables to engineers working with compressed complex data types.

**Key dimension types:**
- Identifier dimensions (unique entity IDs)
- Slowly changing dimensions (values change over time)
- Fixed dimensions (unchangeable values)

**Data modeling layers:**
- OLTP (transactional): Optimized for single-record operations
- Master data: Middle ground, combines completeness with efficiency
- OLAP (analytical): Optimized for aggregation and analysis
- Metrics: Highest level of aggregation

**Cumulative table design benefits** (see the sketch after this list):
- Maintains complete historical records
- Enables efficient historical analysis
- Supports state transition tracking
- Reduces query complexity
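
A minimal sketch of the today-plus-yesterday pattern in PostgreSQL, using hypothetical `users_cumulated` and `user_activity` tables (all names, columns, and dates here are illustrative, not from the source):

```sql
-- Cumulative pattern: yesterday's snapshot FULL OUTER JOINed
-- with today's new data, appending today onto the history array.
INSERT INTO users_cumulated
SELECT
    COALESCE(y.user_id, t.user_id) AS user_id,
    -- Keep accumulated history; append today's value
    COALESCE(y.activity_history, ARRAY[]::INT[])
        || ARRAY[COALESCE(t.events, 0)] AS activity_history,
    DATE '2024-01-02' AS snapshot_date
FROM (SELECT * FROM users_cumulated
      WHERE snapshot_date = DATE '2024-01-01') y
FULL OUTER JOIN (SELECT * FROM user_activity
                 WHERE event_date = DATE '2024-01-02') t
    ON y.user_id = t.user_id;
```

The full outer join is what keeps both sides: entities seen only yesterday survive with their history intact, and entities appearing today for the first time start a new row.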

**Watch out for:**
- Sequential processing requirement limits parallel backfilling
- Privacy concerns with historical data retention
- Storage size growth over time
- Shuffle operations breaking data sorting in distributed systems

**Complex data types tradeoffs** (illustrated after this list):
- Arrays: Best for ordered data, same-type elements
- Structs: Flexible "table within table" approach
- Maps: Dynamic key-value pairs with type restrictions
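
A rough PostgreSQL illustration of the three types. PostgreSQL calls structs "composite types" and has no built-in map, so `JSONB` stands in for the map case; all names here are hypothetical:

```sql
-- Struct: a composite type bundles different value types together.
CREATE TYPE address AS (
    street   TEXT,
    zip_code INT
);

CREATE TABLE user_profile (
    user_id     BIGINT PRIMARY KEY,
    home        address,   -- struct: a "table within a table"
    login_dates DATE[],    -- array: ordered list of one data type
    preferences JSONB      -- map-like key-value pairs
);
```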

**Bottom line:** Success in dimensional modeling requires understanding your data consumers and balancing compression efficiency with query usability. Master data serves as a critical middle ground between transactional and analytical needs.
# Building Cumulative Tables with Complex Data Types: Lab Tutorial

*A hands-on demonstration of creating efficient dimensional tables using PostgreSQL arrays and structs to track NBA player statistics over time.*


```mermaid
mindmap
  root((Dimensional Data Modeling
    Lab))
    (Data Structure)
      [Player Seasons Table]
        (Temporal Components)
        (Player Attributes)
          (Name)
          (Height)
          (College)
          (Country)
          (Draft Info)
        (Season Stats)
          (Games Played)
          (Points)
          (Rebounds)
          (Assists)
    (Data Types)
      [Custom Types]
        (season_stats struct)
        (scoring_class enum)
    (Table Design)
      [Players Table]
        (Primary Key)
          (player_name)
          (current_season)
        (Non-temporal Columns)
        (season_stats Array)
        (Additional Metrics)
          (years_since_last_season)
          (scoring_class)
    (Cumulative Pattern)
      [Benefits]
        (Maintains Data History)
        (Efficient Joins)
        (No Shuffling Required)
        (Fast Analytics)
      [Implementation]
        (Yesterday Query)
        (Today Query)
        (Full Outer Join)
        (Array Concatenation)
    (Analytics Capabilities)
      [Historical Analysis]
        (Player Progress)
        (Career Gaps)
        (Performance Metrics)
      [Data Transformations]
        (Unnest Operations)
        (Array Manipulations)
```


**Big picture:** Converting season-by-season player statistics into a cumulative table using complex data types reduces data duplication and maintains data sorting efficiency while enabling quick historical analysis.

**Key components** (DDL sketched after this list):
- Custom struct type for season statistics
- Array column to store multiple seasons
- Tracking columns for scoring class and years since last season
- Full outer join logic for cumulative updates
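
A sketch of what that DDL might look like in PostgreSQL, reconstructed from the description above; the stat column names (`gp`, `pts`, `reb`, `ast`) and the `scoring_class` labels are assumptions, not confirmed by the source:

```sql
-- Struct holding one season's stats for a player (names assumed).
CREATE TYPE season_stats AS (
    season INT,
    gp     INT,   -- games played
    pts    REAL,
    reb    REAL,
    ast    REAL
);

-- Enum for the player classification column (labels assumed).
CREATE TYPE scoring_class AS ENUM ('star', 'good', 'average', 'bad');

-- One row per player per snapshot season; history lives in the array.
CREATE TABLE players (
    player_name             TEXT,
    height                  TEXT,
    college                 TEXT,
    country                 TEXT,
    draft_year              TEXT,
    season_stats            season_stats[],  -- one struct per season played
    scoring_class           scoring_class,
    years_since_last_season INT,
    current_season          INT,
    PRIMARY KEY (player_name, current_season)
);
```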

**Implementation steps** (see the sketch after this list):
- Create custom struct type for season stats (points, games, rebounds, assists)
- Build base table with player attributes and season stats array
- Implement incremental loading logic using full outer joins
- Add derived columns for player classification and activity tracking
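
A condensed sketch of the loading step under the schema assumed above. The real lab query may differ; the `player_seasons` column names and scoring thresholds are assumptions:

```sql
-- Cumulate 2001 on top of the 2000 snapshot: the full outer join
-- keeps retired players (yesterday only) and rookies (today only).
INSERT INTO players
SELECT
    COALESCE(y.player_name, t.player_name),
    COALESCE(y.height, t.height),
    COALESCE(y.college, t.college),
    COALESCE(y.country, t.country),
    COALESCE(y.draft_year, t.draft_year),
    -- Keep history; append this season's struct if the player was active
    CASE WHEN t.season IS NULL THEN y.season_stats
         ELSE COALESCE(y.season_stats, ARRAY[]::season_stats[])
              || ARRAY[ROW(t.season, t.gp, t.pts, t.reb, t.ast)::season_stats]
    END,
    -- Reclassify active players; carry the old class for inactive ones
    CASE WHEN t.season IS NOT NULL THEN
            (CASE WHEN t.pts > 20 THEN 'star'
                  WHEN t.pts > 15 THEN 'good'
                  WHEN t.pts > 10 THEN 'average'
                  ELSE 'bad' END)::scoring_class
         ELSE y.scoring_class
    END,
    CASE WHEN t.season IS NULL
         THEN y.years_since_last_season + 1 ELSE 0 END,
    COALESCE(t.season, y.current_season + 1)
FROM (SELECT * FROM players WHERE current_season = 2000) y
FULL OUTER JOIN (SELECT * FROM player_seasons WHERE season = 2001) t
    ON y.player_name = t.player_name;
```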

**Performance benefits** (example query after this list):
- No GROUP BY operations needed for historical analysis
- Maintains data sorting after joins
- Reduces storage through elimination of duplicated data
- Enables fast parallel processing
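
For instance, flattening the array back to one row per season needs no GROUP BY or re-aggregation; a sketch against the assumed schema:

```sql
-- Explode the accumulated array back into flat season rows
-- for one player, with no GROUP BY and no extra joins.
SELECT player_name, (s).season, (s).pts
FROM players
CROSS JOIN UNNEST(season_stats) AS s
WHERE current_season = 2001
  AND player_name = 'Michael Jordan';
```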

**Real-world example** (query sketch after this list):
- Tracked Michael Jordan's career gap (1996-1997, returned 2001)
- Demonstrated scoring progression from first to last season
- Identified most improved players without expensive aggregations
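
The "most improved" check can read the first and last array elements directly, with something like this hypothetical query against the assumed schema:

```sql
-- Ratio of latest-season scoring to first-season scoring,
-- taken straight from the array ends: no joins, no GROUP BY.
SELECT
    player_name,
    (season_stats[CARDINALITY(season_stats)]).pts
        / NULLIF((season_stats[1]).pts, 0) AS improvement
FROM players
WHERE current_season = 2001
ORDER BY improvement DESC NULLS LAST;
```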

**Bottom line:** Complex data types with cumulative loading provide significant performance advantages for dimensional data that changes over time, while maintaining data usability through unnesting capabilities.