Commit bdfe3b4

Merge branch 'DataExpert-io:main' into main
2 parents 3111cd7 + 73b9e3a commit bdfe3b4

4 files changed: +178 -139 lines changed

README.md

Lines changed: 112 additions & 139 deletions
Original file line number | Diff line number | Diff line change
@@ -2,55 +2,39 @@
22

33
This repo has all the resources you need to become an amazing data engineer!
44

5-
Make sure to check out the [projects](projects.md) section for more hands-on examples!
5+
## Getting started
6+
7+
If you are new to data engineering, start by following this [2024 breaking into data engineering roadmap](https://blog.dataengineer.io/p/the-2024-breaking-into-data-engineering)
8+
9+
For more applied learning:
10+
- Check out the [projects](projects.md) section for more hands-on examples!
11+
- Check out the [interviews](interviews.md) section for more advice on how to pass data engineering interviews!
12+
- Check out the [books](books.md) section for a list of high quality data engineering books
13+
- Check out the [communities](communities.md) section for a list of high quality data engineering communities to join
14+
- Check out the [newsletter](newsletters.md) section to learn via email
615

7-
Make sure to check out the [interviews](interviews.md) section for more advice on how to pass data engineering interviews!
816

917
## Resources
1018

11-
Great books:
19+
### Great [list of over 25 books](books.md)
1220

21+
Top 3 must read books are:
1322
- [Fundamentals of Data Engineering](https://www.amazon.com/Fundamentals-Data-Engineering-Robust-Systems/dp/1098108302/)
1423
- [Designing Data-Intensive Applications](https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321/)
1524
- [Designing Machine Learning Systems](https://www.amazon.com/Designing-Machine-Learning-Systems-Production-Ready/dp/1098107969)
16-
- [The Hundred Page Machine Learning Book](https://www.amazon.com/Hundred-Page-Machine-Learning-Book/dp/199957950X)
17-
- [Kimball - The Data Warehouse Toolkit](https://ia801609.us.archive.org/14/items/the-data-warehouse-toolkit-kimball/The%20Data%20Warehouse%20Toolkit%20-%20Kimball.pdf)
18-
- [Data Mesh](https://www.oreilly.com/library/view/data-mesh/9781492092384/)
19-
- [Machine Learning System Design Interview](https://www.amazon.com/Machine-Learning-System-Design-Interview/dp/1736049127)
20-
- [Streaming Systems](https://www.amazon.com/Streaming-Systems-Where-Large-Scale-Processing/dp/1491983876)
21-
- [High Performance Spark](https://www.amazon.com/High-Performance-Spark-Practices-Optimizing/dp/1491943203)
22-
- [Building Evolutionary Architectures, 2nd Edition](https://www.oreilly.com/library/view/building-evolutionary-architectures/9781492097532/)
23-
- [Data Management at Scale, 2nd Edition](https://www.oreilly.com/library/view/data-management-at/9781098138851/)
24-
- [Deciphering Data Architectures](https://www.oreilly.com/library/view/deciphering-data-architectures/9781098150754/)
25-
- [97 Things Every Data Engineer Should Know: Collective Wisdom from the Experts](https://www.amazon.com/Things-Every-Data-Engineer-Should/dp/1492062413)
26-
- [Data Governance: The Definitive Guide](https://www.oreilly.com/library/view/data-governance-the/9781492063483/)
27-
- [Trino: The Definitive Guide](https://trino.io/trino-the-definitive-guide.html)
28-
- [Delta Lake: The Definitive Guide](https://www.oreilly.com/library/view/delta-lake-the/9781098151935/)
29-
- [Hadoop: The Definitive Guide](https://www.oreilly.com/library/view/hadoop-the-definitive/9781491901687/)
30-
- [Modern Data Engineering with Apache Spark: A Hands-On Guide for Building Mission-Critical Streaming Applications](https://www.amazon.com/Modern-Engineering-Apache-Spark-Hands/dp/1484274512)
31-
- [Data Engineering with dbt: A practical guide to building a dependable data platform with SQL](https://www.amazon.com/Data-Engineering-dbt-cloud-based-dependable-ebook/dp/B0C4LL19G7)
32-
- [Data Engineering with AWS](https://www.oreilly.com/library/view/data-engineering-with/9781804614426/)
33-
- [Practical DataOps: Delivering Agile Date Science at Scale](https://www.amazon.com/Practical-DataOps-Delivering-Agile-Science/dp/1484251032)
34-
- [Data Engineering Design Patterns](https://www.dedp.online/)
35-
- [Snowflake Data Engineering](https://www.manning.com/books/snowflake-data-engineering)
36-
- [Unlocking dbt](https://www.amazon.com/Unlocking-dbt-Design-Transformations-Warehouse/dp/1484296990/)
37-
- [Learning Spark, Second Edition](https://pages.databricks.com/rs/094-YMS-629/images/LearningSpark2.0.pdf)
38-
39-
Communities:
40-
41-
- [Seattle Data Guy Discord](https://discord.gg/ah95MZKkFF)
25+
26+
### Great [list of over 10 communities to join](communities.md):
27+
28+
Top must-join communities for DE:
4229
- [EcZachly Data Engineering Discord](https://discord.gg/JGumAXncAK)
43-
- [AdalFlow Discrod (LLM Library)](https://discord.com/invite/ezzszrRZvT)
44-
- [Chip Huyen MLOps Discord](https://discord.gg/dzh728c5t3)
45-
- [Data Engineer Things Community](https://www.dataengineerthings.org/aboutus/)
46-
- [DBT Community](https://www.getdbt.com/community/join-the-community/)
47-
- [r/dataengineering](https://www.reddit.com/r/dataengineering)
48-
- [Microsoft Fabric Community](https://community.fabric.microsoft.com/)
49-
- [r/MicrosoftFabric](https://www.reddit.com/r/MicrosoftFabric/)
5030
- [Data Talks Club Slack](https://datatalks.club/slack)
51-
- [Data Engineering Wiki](https://dataengineering.wiki/)
31+
- [Data Engineer Things Community](https://www.dataengineerthings.org/aboutus/)
32+
33+
Top must-join communities for ML:
34+
- [AdalFlow Discord](https://discord.com/invite/ezzszrRZvT)
35+
- [Chip Huyen MLOps Discord](https://discord.gg/dzh728c5t3)
5236

53-
Companies:
37+
### Companies:
5438

5539
- Orchestration
5640
- [Mage](https://www.mage.ai)
@@ -106,8 +90,15 @@ Companies:
10690
- [DuckDB](https://duckdb.org/)
10791
- LLM application library
10892
- [AdalFlow](https://github.com/SylphAI-Inc/AdalFlow)
93+
- [LangChain](https://github.com/langchain-ai/langchain)
94+
- [LlamaIndex](https://github.com/run-llama/llama_index)
95+
- Real-Time Data
96+
- [Aggregations.io](https://aggregations.io)
97+
- [Responsive](https://www.responsive.dev/)
98+
- [RisingWave](https://risingwave.com/)
10999

110-
Data Engineering blogs of companies:
100+
101+
### Data Engineering blogs of companies:
111102

112103
- [Netflix](https://netflixtechblog.com/tagged/big-data)
113104
- [Uber](https://www.uber.com/blog/houston/data/?uclick_id=b2f43229-f3f4-4bae-bd5d-10a05db2f70c)
@@ -120,7 +111,7 @@ Data Engineering blogs of companies:
120111
- [Meta](https://engineering.fb.com/category/data-infrastructure/)
121112
- [Onehouse](https://www.onehouse.ai/blog)
122113

123-
Data Engineering Whitepapers:
114+
### Data Engineering Whitepapers:
124115

125116
- [A Five-Layered Business Intelligence Architecture](https://ibimapublishing.com/articles/CIBIMA/2011/695619/695619.pdf)
126117
- [Lakehouse:A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics](https://www.cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf)
@@ -129,11 +120,15 @@ Data Engineering Whitepapers:
129120
- [Spark: Cluster Computing with Working Sets](https://dl.acm.org/doi/10.5555/1863103.1863113)
130121
- [The Google File System](https://research.google/pubs/the-google-file-system/)
131122
- [Building a Universal Data Lakehouse](https://www.onehouse.ai/whitepaper/onehouse-universal-data-lakehouse-whitepaper)
123+
- [XTable in Action: Seamless Interoperability in Data Lakes](https://arxiv.org/abs/2401.09621)
124+
- [MapReduce: Simplified Data Processing on Large Clusters](https://research.google/pubs/mapreduce-simplified-data-processing-on-large-clusters/)
132125

133-
Great YouTube Channels:
126+
### Great YouTube Channels:
127+
- *you have to have >10k subscribes to be added*
134128

135129
- 100k+ subscribers
136130
- [E-learning Bridge](https://www.youtube.com/@shashank_mishra)
131+
- [Seattle Data Guy](https://www.youtube.com/c/SeattleDataGuy)
137132
- [TrendyTech](https://www.youtube.com/c/TrendytechInsights)
138133
- [Darshil Parmar](https://www.youtube.com/@DarshilParmar)
139134
- [Andreas Kretz](https://www.youtube.com/c/andreaskayy)
@@ -143,26 +138,86 @@ Great YouTube Channels:
143138
- [Adam Marczak](https://www.youtube.com/@AdamMarczakYT)
144139
- [nullQueries](https://www.youtube.com/@nullQueries)
145140
- [TECHTFQ by Thoufiq](https://www.youtube.com/@techTFQ)
141+
- [SQLBI](https://www.youtube.com/@SQLBI)
146142
- 10k+ subscribers
147143
- [Data with Zach](https://www.youtube.com/c/datawithzach)
148-
- [Seattle Data Guy](https://www.youtube.com/c/SeattleDataGuy)
149144
- [Azure Lib](https://www.youtube.com/@azurelib-academy)
150145
- [Advancing Analytics](https://www.youtube.com/@AdvancingAnalytics)
151146
- [Kahan Data Solutions](https://www.youtube.com/@KahanDataSolutions)
152147
- [Ankit Bansal](https://youtube.com/@ankitbansal6)
153148
- [Mr. K Talks Tech](https://www.youtube.com/channel/UCzdOan4AmF65PmLLks8Lmww)
154-
- 1k+ subscribers
155-
- [Eric Roby](https://www.youtube.com/@codingwithroby)
156149

157-
Great Podcasts
150+
### LinkedIn Voices
151+
- *you have to have >5k followers to be added*
152+
153+
- 100k+ Followers
154+
- [Zach Wilson](https://www.linkedin.com/in/eczachly)
155+
- [Ben Rogojan](https://www.linkedin.com/in/benjaminrogojan)
156+
- [Sumit Mittal](https://www.linkedin.com/in/bigdatabysumit/)
157+
- [Shashank Mishra](https://www.linkedin.com/in/shashank219/)
158+
- [Chip Huyen](https://www.linkedin.com/in/chiphuyen/)
159+
- [Alex Xu](https://www.linkedin.com/in/alexxubyte)
160+
- [Deepak Goyal](https://www.linkedin.com/in/deepak-goyal-93805a17/)
161+
- [Andreas Kretz](https://www.linkedin.com/in/andreas-kretz)
162+
- 50k+ Followers
163+
- [Joe Reis](https://www.linkedin.com/in/josephreis)
164+
- [Darshil Parmar](https://www.linkedin.com/in/darshil-parmar/)
165+
- [Ankit Bansal](https://www.linkedin.com/in/ankitbansal6/)
166+
- [Marc Lamberti](https://www.linkedin.com/in/marclamberti)
167+
- [Marco Russo](https://www.linkedin.com/in/sqlbi)
168+
- 10k+ Followers
169+
- [Li Yin](https://www.linkedin.com/in/li-yin-ai/)
170+
- [Joseph Machado](https://www.linkedin.com/in/josephmachado1991/)
171+
- [Eric Roby](https://www.linkedin.com/in/codingwithroby/)
172+
- [Simon Whiteley](https://www.linkedin.com/in/simon-whiteley-uk/)
173+
- [Simon Späti](https://www.linkedin.com/in/sspaeti/)
174+
- 5k+ Followers
175+
- [Dipankar Mazumdar](https://www.linkedin.com/in/dipankar-mazumdar/)
176+
- [Daniel Ciocirlan](https://www.linkedin.com/in/danielciocirlan)
177+
- [Hugo Lu](https://www.linkedin.com/in/hugo-lu-confirmed/)
178+
- [Tobias Macey](https://www.linkedin.com/in/tmacey)
179+
- [Marcos Ortiz](https://www.linkedin.com/in/mlortiz)
180+
- [Julien Hurault](https://www.linkedin.com/in/julienhuraultanalytics/)
181+
182+
### Twitter / X voices
183+
- *you have to have >5k followers to be added*
184+
185+
- 100k+ followers
186+
- [Alex Xu](https://twitter.com/alexxubyte/)
187+
- 10k+ followers
188+
- [Zach Wilson](https://www.twitter.com/EcZachly)
189+
- [Seattle Data Guy](https://www.twitter.com/SeattleDataGuy)
190+
- [Marco Russo](https://x.com/marcorus)
191+
- [Daniel Blanco](https://www.twitter.com/DanielBlancoSWE)
192+
- 5k+ followers
193+
- [Sumit Mittal](https://www.twitter.com/bigdatasumit)
194+
- [Joseph Machado](https://twitter.com/startdataeng)
195+
196+
### Instagram creators
197+
- *you have to have >5k followers to be added*
198+
199+
- 100k+ followers
200+
- [Zach Wilson](https://www.instagram.com/eczachly)
201+
- 5k+ followers
202+
- [Andreas Kretz](https://www.instagram.com/learndataengineering)
203+
204+
TikTok
205+
- *you have to have >10k followers to be added*
206+
207+
- 50k+ followers
208+
- [Zach Wilson](https://www.tiktok.com/@eczachly)
209+
- 10k+ followers
210+
- [Alex The Analyst](https://www.tiktok.com/@alex_the_analyst)
211+
212+
### Great Podcasts
158213

159214
- [The Data Engineering Show](https://www.dataengineeringshow.com/)
160215
- [Data Engineering Podcast](https://www.dataengineeringpodcast.com/)
161216
- [DataTopics](https://www.datatopics.io/)
162217
- [The Data Engineering Side Of Data](https://podcasts.apple.com/us/podcast/the-engineering-side-of-data/id1566999533)
163218
- [DataWare](https://www.ascend.io/dataaware-podcast/)
164219
- [The Data Coffee Break Podcast](https://www.deezer.com/us/show/5293247)
165-
- [Thd datastack show](https://datastackshow.com/)
220+
- [The Datastack show](https://datastackshow.com/)
166221
- [Intricity101 Data Sharks Podcast](https://www.intricity.com/learningcenter/podcast)
167222
- [Drill to Detail with Mark Rittman](https://www.rittmananalytics.com/drilltodetail/)
168223
- [Analytics Power Hour](https://analyticshour.io/)
@@ -178,32 +233,15 @@ Great Podcasts
178233
- [Monday Morning Data Chat](https://open.spotify.com/show/3Km3lBNzJpc1nOTJUtbtMh)
179234
- [The Data Chief](https://www.thoughtspot.com/data-chief/podcast)
180235

181-
Newsletters:
236+
### Great [list of 20+ newsletters](newsletters.md)
182237

238+
Top must follow newsletters for data engineering:
183239
- [DataEngineer.io Newsletter](https://blog.dataengineer.io)
184-
- [Seattle Data Guy](https://seattledataguy.substack.com)
185240
- [Joe Reis](https://joereis.substack.com)
186-
- [Data Engineering Weekly](https://www.dataengineeringweekly.com)
187-
- [Data Engineering Central](https://dataengineeringcentral.substack.com)
188-
- [Dutch Engineer](https://dutchengineer.substack.com)
189-
- [ByteByteGo](https://blog.bytebytego.com)
190241
- [Start Data Engineering](https://www.startdataengineering.com)
191-
- [Developing Dev](https://www.developing.dev)
192-
- [High Growth Engineer](https://careercutler.substack.com/)
193-
- [Learn Analytics Engineering](https://learnanalyticsengineering.substack.com/)
194-
- [Marvelous MLOps](https://marvelousmlops.substack.com/)
195-
- [medium Data Engineering Newsletter](https://medium.com/data-engineering-weekly)
196-
- [Benn Stancil](https://benn.substack.com/)
197-
- [Metadata Weekly](https://metadataweekly.substack.com/)
198-
- [Technically](https://technically.substack.com/)
199-
- [Blef.fr Data News](https://www.blef.fr/blog/)
200-
- [All Hands on Data](https://allhandsondata.substack.com/)
201-
- [Modern Data 101](https://moderndata101.substack.com/)
202-
- [SELECT Insights](https://newsletter.ssp.sh/)
203-
- [Interesting Data Gigs](https://newsletter.interestinggigs.com)
204-
- [Ju Data Engineering Weekly](https://juhache.substack.com/)
205-
206-
Glossaries:
242+
- [Data Engineering Weekly](https://www.dataengineeringweekly.com)
243+
244+
### Glossaries:
207245
- [Data Engineering Vault](https://www.ssp.sh/brain/data-engineering/)
208246
- [Airbyte Data Glossary](https://glossary.airbyte.com/)
209247
- [Data Engineering Wiki by Reddit](https://dataengineering.wiki/Index)
@@ -212,75 +250,15 @@ Glossaries:
212250
- [Airtable Glossary](https://airtable.com/shrGh8BqZbkfkbrfk/tbluZ3ayLHC3CKsDb)
213251
- [Data Engineering Glossary by Dagster](https://dagster.io/glossary)
214252

215-
LinkedIn
216-
217-
- 100k+ Followers
218-
- [Zach Wilson](https://www.linkedin.com/in/eczachly)
219-
- [Ben Rogojan](https://www.linkedin.com/in/benjaminrogojan)
220-
- [Sumit Mittal](https://www.linkedin.com/in/bigdatabysumit/)
221-
- [Shashank Mishra](https://www.linkedin.com/in/shashank219/)
222-
- [Chip Huyen](https://www.linkedin.com/in/chiphuyen/)
223-
- [Alex Xu](https://www.linkedin.com/in/alexxubyte)
224-
- [Deepak Goyal](https://www.linkedin.com/in/deepak-goyal-93805a17/)
225-
- [Andreas Kretz](https://www.linkedin.com/in/andreas-kretz)
226-
- 50k+ Followers
227-
- [Joe Reis](https://www.linkedin.com/in/josephreis)
228-
- [Darshil Parmar](https://www.linkedin.com/in/darshil-parmar/)
229-
- [Ankit Bansal](https://www.linkedin.com/in/ankitbansal6/)
230-
- [Marc Lamberti](https://www.linkedin.com/in/marclamberti)
231-
- 10k+ Followers
232-
- [Li Yin](https://www.linkedin.com/in/li-yin-ai/)
233-
- [Joseph Machado](https://www.linkedin.com/in/josephmachado1991/)
234-
- [Eric Roby](https://www.linkedin.com/in/codingwithroby/)
235-
- [Simon Whiteley](https://www.linkedin.com/in/simon-whiteley-uk/)
236-
- [Simon Späti](https://www.linkedin.com/in/sspaeti/)
237-
- 5k+ Followers
238-
- [Dipankar Mazumdar](https://www.linkedin.com/in/dipankar-mazumdar/)
239-
- [Daniel Ciocirlan](https://www.linkedin.com/in/danielciocirlan)
240-
- [Hugo Lu](https://www.linkedin.com/in/hugo-lu-confirmed/)
241-
- [Tobias Macey](https://www.linkedin.com/in/tmacey)
242-
- [Marcos Ortiz](https://www.linkedin.com/in/mlortiz)
243-
- [Julien Hurault](https://www.linkedin.com/in/julienhuraultanalytics/)
244-
- 1k+ Followers
245-
- [Shruti Mantri](https://www.linkedin.com/in/shruti-mantri-88527a67/)
246-
- [Volker Janz](https://www.linkedin.com/in/vjanz/)
247-
248-
Twitter / X
249-
250-
- [Zach Wilson](https://www.twitter.com/EcZachly)
251-
- [Seattle Data Guy](https://www.twitter.com/SeattleDataGuy)
252-
- [Sumit Mittal](https://www.twitter.com/bigdatasumit)
253-
- [Joseph Machado](https://twitter.com/startdataeng)
254-
- [Alex Xu](https://twitter.com/alexxubyte/)
255-
- [Eric Roby](https://twitter.com/codingwithroby)
256-
- [Andreas Kretz](https://twitter.com/andreaskayy)
257-
- [Marc Lamberti](https://twitter.com/marclambertiml)
258-
- [Dipankar Mazumdar](https://twitter.com/Dipankartnt)
259-
- [Start Data Engineering](https://twitter.com/startdataeng)
260-
- [Data Cyborg](https://twitter.com/data_cyborg)
261-
- [Simon Späti](https://twitter.com/sspaeti)
262-
- [Marcos Ortiz](https://twitter.com/marcosluis2186)
263-
264-
Instagram
265-
266-
- [Zach Wilson](https://www.instagram.com/eczachly)
267-
- [Andreas Kretz](https://www.instagram.com/learndataengineering)
268-
- [Seattle Data Guy](https://www.instagram.com/seattledataguy)
269-
270-
TikTok
271-
272-
- [Zach Wilson](https://www.tiktok.com/@eczachly)
273-
- [Alex The Analyst](https://www.tiktok.com/@alex_the_analyst)
274-
- [Marcos Ortiz](https://www.tiktok.com/@marcosluis2186)
275253

276-
Design Patterns
254+
### Design Patterns
277255

278-
- [Cumulative Table Design](https://www.github.com/EcZachly/cumulative-table-design)
256+
- [Cumulative Table Design](https://www.github.com/DataExpert-io/cumulative-table-design)
279257
- [Microbatch Deduplication](https://www.github.com/EcZachly/microbatch-hourly-deduped-tutorial)
280258
- [The Little Book of Pipelines](https://www.github.com/EcZachly/little-book-of-pipelines)
281259
- [Data Developer Platform](https://datadeveloperplatform.org/architecture/)
282260

283-
Courses / Academies
261+
### Courses / Academies
284262

285263
- [DataExpert.io course](https://www.dataexpert.io) use code **HANDBOOK10** for a discount!
286264
- [LearnDataEngineering.com](https://www.learndataengineering.com)
@@ -293,19 +271,14 @@ Courses / Academies
293271
- [Data Engineering Zoomcamp by DataTalksClub](https://datatalks.club/)
294272
- [Efficient Data Processing in Spark](https://josephmachado.podia.com/efficient-data-processing-in-spark)
295273
- [Scaler](https://www.scaler.com/)
274+
- [DataTeams - Data Engingeer hiring platform](https://www.datateams.ai/)
275+
- [Udemy Courses from Daniel Blanco](https://danielblanco.dev/links)
296276

297-
Certifications Courses
277+
### Certifications Courses
298278

299279
- [Google Cloud Certified - Professional Data Engineer](https://cloud.google.com/certification/data-engineer)
300280
- [Databricks - Data Engineer Professional](https://www.databricks.com/learn/certification/data-engineer-professional)
301281
- [Azure Data Engineer Associate](https://learn.microsoft.com/credentials/certifications/azure-data-engineer/)
302282
- [Microsoft Fabric Analytics Engineer Associate](https://learn.microsoft.com/credentials/certifications/fabric-analytics-engineer-associate/)
303283
- [Exam DP-203: Data Engineering on Microsoft Azure](https://learn.microsoft.com/en-us/credentials/certifications/exams/dp-203/?tab=tab-learning-paths)
304284
- [AWS Certified Data Engineer - Associate](https://aws.amazon.com/certification/certified-data-engineer-associate/)
305-
306-
Conferences
307-
308-
- [Trino Summit - December 13-14, 2023 - Virtual](https://www.starburst.io/info/trinosummit2023/)
309-
- [Data Universe - April 10-11, 2024 - New York City](https://www.datauniverseevent.com/)
310-
- [Data Nova @ Data Universe - April 10-11, 2024 - New York City](https://www.starburst.io/datanova/)
311-
- [DataTune Conference - March 8-9, 2024 - Nashville, TN](https://www.datatuneconf.com/)
