# docs: Add smartbrevity overview materials for Data Engineering Boot … #255

**`.vscode/settings.json`** (7 additions, 0 deletions)

```json
{
"python.testing.pytestArgs": [
"bootcamp"
],
"python.testing.unittestEnabled": false,
"python.testing.pytestEnabled": true
}
```

# Free Data Engineering Boot Camp Kickoff Summary

```mermaid
mindmap
  root((Free Data Engineering Boot Camp))
    Program Structure [00:01:01]
      Six weeks intensive
      1-2 hours daily commitment
      Pre-recorded lessons
      Two components per module
        Lecture
        Lab
      AI-graded homework
      Discord community support
    Curriculum [00:06:02]
      Data Modeling
        Two weeks coverage
        Foundation concepts
        Data product focus
      Analytical Patterns
        Growth accounting
        Advanced SQL
          Window functions
      KPIs & Experimentation
        Metrics definition
        Product thinking
      Data Visualization
        Communication
        Tableau basics
        Dashboard types
      Infrastructure Track
        Unit testing
        Pipeline maintenance
        Apache Spark fundamentals
        Data quality patterns
        Real-time pipelines
    Certification Paths [00:19:40]
      Watch-only certificate
        Attendance tracking
        Basic recognition
      Full certification
        Complete all homework
        Watch all content
        Expected 3-4% completion rate
    Paid vs Free Differences [00:27:28]
      Cloud infrastructure access
        AWS deployment
        One year access
        Paid APIs available
      Enhanced support
        Weekly Q&A sessions
        Industry expert speakers
        Dedicated TA support
      Additional content
        Snowflake
        Trino
        DBT
        Apache Iceberg
      Capstone project
        Dedicated feedback
        Portfolio building
      Job interview training
```


*A comprehensive 6-week program launching online with daily content releases at 5 PM Pacific.*

**Big picture:** Tech expert Zach is offering free data engineering training to help 1,000 engineers land jobs by February 15, with content available on YouTube until December 2025.

**Key details:**
- 10,000+ enrolled students
- 1-2 hours daily commitment recommended
- All content pre-recorded and uploaded daily
- Includes AI-graded homework assignments
- Discord community support available

**Core curriculum:**
- Data modeling (2 weeks)
- Analytical patterns and advanced SQL
- KPIs and experimentation
- Data visualization
- Infrastructure and pipeline maintenance
- Apache Spark fundamentals
- Real-time pipelines with Flink and Kafka

**Success metrics:** Only about 300-400 of the 10,000 enrolled students are expected to complete certification, which requires:
- Watching all videos
- Completing all homework assignments
- Active participation in community

**What's different in paid version:**
- Cloud infrastructure access
- Weekly Q&As with Zach
- Industry expert speakers
- Additional tools: Snowflake, DBT, AWS Glue
- Direct TA support
- Capstone project mentorship
- Job interview training

**Bottom line:** While the free version offers substantial technical training, the paid version provides more hands-on support and cloud-based tooling for job preparation.
# Data Modeling: Complex Types and Cumulative Tables Deep Dive

*A comprehensive look at dimensional data modeling principles, focusing on the balance between data efficiency and usability.*


```mermaid
mindmap
  root((Dimensional
    Data
    Modeling - Intro))
    (Understanding Dimensions)
      (Identifier Dimensions)
        (Uniquely identify entities)
        (User ID)
        (Social Security)
        (Device ID)
      (Attributes)
        (Slowly Changing)
          (Time dependent values)
          (Can change over time)
        (Fixed)
          (Birthday)
          (Phone manufacturer)
    (Data Modeling Types)
      (OLTP)
        (Online Transaction Processing)
        (Normalized)
        (Minimal duplication)
        (Fast single row operations)
      (Master Data)
        (Middle ground)
        (Complete entity definitions)
        (Reference data)
      (OLAP)
        (Online Analytical Processing)
        (Denormalized)
        (Optimized for analysis)
        (Population level queries)
    (Cumulative Table Design)
      (Historical Analysis)
      (State Transitions)
      (Uses Today + Yesterday data)
      (Full outer join approach)
      (Drawbacks)
        (Sequential backfilling)
        (PII management challenges)
    (Complex Data Types)
      (Struct)
        (Table within table)
        (Different value types)
      (Array)
        (Ordered lists)
        (Same data type)
      (Map)
        (Key-value pairs)
        (Same value type)
    (Data Consumer Types)
      (Data Analysts)
        (Need simple flat data)
        (Easy to query)
      (Data Engineers)
        (Can handle complex types)
        (Build downstream pipelines)
      (ML Models)
        (Need identifier + features)
        (Flat structure preferred)
      (Customers)
        (Need visualizations)
        (Charts over raw data)
    (Compactness vs Usability)
      (Most Compact)
        (Compressed data)
        (Online systems)
      (Middle Ground)
        (Arrays and structs)
        (Master data)
      (Most Usable)
        (Flat structure)
        (Analytics focused)
```


**Big picture:** Data modeling strategies vary significantly based on end users' needs, from analysts requiring simple flat tables to engineers working with compressed complex data types.

**Key dimension types:**
- Identifier dimensions (unique entity IDs)
- Slowly changing dimensions (values change over time)
- Fixed dimensions (unchangeable values)

**Data modeling layers:**
- OLTP (transactional): Optimized for single-record operations
- Master data: Middle ground, combines completeness with efficiency
- OLAP (analytical): Optimized for aggregation and analysis
- Metrics: Highest level of aggregation

**Cumulative table design benefits** (see the sketch after this list):
- Maintains complete historical records
- Enables efficient historical analysis
- Supports state transition tracking
- Reduces query complexity
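
A minimal sketch of the today-plus-yesterday pattern in PostgreSQL, using hypothetical `users_cumulated` and `user_activity` tables (all names, columns, and dates here are illustrative, not from the source):

```sql
-- Cumulative pattern: yesterday's snapshot FULL OUTER JOINed
-- with today's new data, appending today onto the history array.
INSERT INTO users_cumulated
SELECT
    COALESCE(y.user_id, t.user_id) AS user_id,
    -- Keep accumulated history; append today's value
    COALESCE(y.activity_history, ARRAY[]::INT[])
        || ARRAY[COALESCE(t.events, 0)] AS activity_history,
    DATE '2024-01-02' AS snapshot_date
FROM (SELECT * FROM users_cumulated
      WHERE snapshot_date = DATE '2024-01-01') y
FULL OUTER JOIN (SELECT * FROM user_activity
                 WHERE event_date = DATE '2024-01-02') t
    ON y.user_id = t.user_id;
```

The full outer join is what keeps both sides: entities seen only yesterday survive with their history intact, and entities appearing today for the first time start a new row.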

**Watch out for:**
- Sequential processing requirement limits parallel backfilling
- Privacy concerns with historical data retention
- Storage size growth over time
- Shuffle operations breaking data sorting in distributed systems

**Complex data types tradeoffs** (illustrated after this list):
- Arrays: Best for ordered data, same-type elements
- Structs: Flexible "table within table" approach
- Maps: Dynamic key-value pairs with type restrictions
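
A rough PostgreSQL illustration of the three types. PostgreSQL calls structs "composite types" and has no built-in map, so `JSONB` stands in for the map case; all names here are hypothetical:

```sql
-- Struct: a composite type bundles different value types together.
CREATE TYPE address AS (
    street   TEXT,
    zip_code INT
);

CREATE TABLE user_profile (
    user_id     BIGINT PRIMARY KEY,
    home        address,   -- struct: a "table within a table"
    login_dates DATE[],    -- array: ordered list of one data type
    preferences JSONB      -- map-like key-value pairs
);
```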

**Bottom line:** Success in dimensional modeling requires understanding your data consumers and balancing compression efficiency with query usability. Master data serves as a critical middle ground between transactional and analytical needs.
# Building Cumulative Tables with Complex Data Types: Lab Tutorial

*A hands-on demonstration of creating efficient dimensional tables using PostgreSQL arrays and structs to track NBA player statistics over time.*


```mermaid
mindmap
  root((Dimensional Data Modeling
    Lab))
    (Data Structure)
      [Player Seasons Table]
        (Temporal Components)
        (Player Attributes)
          (Name)
          (Height)
          (College)
          (Country)
          (Draft Info)
        (Season Stats)
          (Games Played)
          (Points)
          (Rebounds)
          (Assists)
    (Data Types)
      [Custom Types]
        (season_stats struct)
        (scoring_class enum)
    (Table Design)
      [Players Table]
        (Primary Key)
          (player_name)
          (current_season)
        (Non-temporal Columns)
        (season_stats Array)
        (Additional Metrics)
          (years_since_last_season)
          (scoring_class)
    (Cumulative Pattern)
      [Benefits]
        (Maintains Data History)
        (Efficient Joins)
        (No Shuffling Required)
        (Fast Analytics)
      [Implementation]
        (Yesterday Query)
        (Today Query)
        (Full Outer Join)
        (Array Concatenation)
    (Analytics Capabilities)
      [Historical Analysis]
        (Player Progress)
        (Career Gaps)
        (Performance Metrics)
      [Data Transformations]
        (Unnest Operations)
        (Array Manipulations)
```


**Big picture:** Converting season-by-season player statistics into a cumulative table using complex data types reduces data duplication and maintains data sorting efficiency while enabling quick historical analysis.

**Key components** (DDL sketched after this list):
- Custom struct type for season statistics
- Array column to store multiple seasons
- Tracking columns for scoring class and years since last season
- Full outer join logic for cumulative updates
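
A sketch of what that DDL might look like in PostgreSQL, reconstructed from the description above; the stat column names (`gp`, `pts`, `reb`, `ast`) and the `scoring_class` labels are assumptions, not confirmed by the source:

```sql
-- Struct holding one season's stats for a player (names assumed).
CREATE TYPE season_stats AS (
    season INT,
    gp     INT,   -- games played
    pts    REAL,
    reb    REAL,
    ast    REAL
);

-- Enum for the player classification column (labels assumed).
CREATE TYPE scoring_class AS ENUM ('star', 'good', 'average', 'bad');

-- One row per player per snapshot season; history lives in the array.
CREATE TABLE players (
    player_name             TEXT,
    height                  TEXT,
    college                 TEXT,
    country                 TEXT,
    draft_year              TEXT,
    season_stats            season_stats[],  -- one struct per season played
    scoring_class           scoring_class,
    years_since_last_season INT,
    current_season          INT,
    PRIMARY KEY (player_name, current_season)
);
```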

**Implementation steps** (see the sketch after this list):
- Create custom struct type for season stats (points, games, rebounds, assists)
- Build base table with player attributes and season stats array
- Implement incremental loading logic using full outer joins
- Add derived columns for player classification and activity tracking
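
A condensed sketch of the loading step under the schema assumed above. The real lab query may differ; the `player_seasons` column names and scoring thresholds are assumptions:

```sql
-- Cumulate 2001 on top of the 2000 snapshot: the full outer join
-- keeps retired players (yesterday only) and rookies (today only).
INSERT INTO players
SELECT
    COALESCE(y.player_name, t.player_name),
    COALESCE(y.height, t.height),
    COALESCE(y.college, t.college),
    COALESCE(y.country, t.country),
    COALESCE(y.draft_year, t.draft_year),
    -- Keep history; append this season's struct if the player was active
    CASE WHEN t.season IS NULL THEN y.season_stats
         ELSE COALESCE(y.season_stats, ARRAY[]::season_stats[])
              || ARRAY[ROW(t.season, t.gp, t.pts, t.reb, t.ast)::season_stats]
    END,
    -- Reclassify active players; carry the old class for inactive ones
    CASE WHEN t.season IS NOT NULL THEN
            (CASE WHEN t.pts > 20 THEN 'star'
                  WHEN t.pts > 15 THEN 'good'
                  WHEN t.pts > 10 THEN 'average'
                  ELSE 'bad' END)::scoring_class
         ELSE y.scoring_class
    END,
    CASE WHEN t.season IS NULL
         THEN y.years_since_last_season + 1 ELSE 0 END,
    COALESCE(t.season, y.current_season + 1)
FROM (SELECT * FROM players WHERE current_season = 2000) y
FULL OUTER JOIN (SELECT * FROM player_seasons WHERE season = 2001) t
    ON y.player_name = t.player_name;
```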

**Performance benefits** (example query after this list):
- No GROUP BY operations needed for historical analysis
- Maintains data sorting after joins
- Reduces storage through elimination of duplicated data
- Enables fast parallel processing
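
For instance, flattening the array back to one row per season needs no GROUP BY or re-aggregation; a sketch against the assumed schema:

```sql
-- Explode the accumulated array back into flat season rows
-- for one player, with no GROUP BY and no extra joins.
SELECT player_name, (s).season, (s).pts
FROM players
CROSS JOIN UNNEST(season_stats) AS s
WHERE current_season = 2001
  AND player_name = 'Michael Jordan';
```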

**Real-world example** (query sketch after this list):
- Tracked Michael Jordan's career gap (1996-1997, returned 2001)
- Demonstrated scoring progression from first to last season
- Identified most improved players without expensive aggregations
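
The "most improved" check can read the first and last array elements directly, with something like this hypothetical query against the assumed schema:

```sql
-- Ratio of latest-season scoring to first-season scoring,
-- taken straight from the array ends: no joins, no GROUP BY.
SELECT
    player_name,
    (season_stats[CARDINALITY(season_stats)]).pts
        / NULLIF((season_stats[1]).pts, 0) AS improvement
FROM players
WHERE current_season = 2001
ORDER BY improvement DESC NULLS LAST;
```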

**Bottom line:** Complex data types with cumulative loading provide significant performance advantages for dimensional data that changes over time, while maintaining data usability through unnesting capabilities.