Commit 6f1d07e

Merge branch 'DataExpert-io:main' into main

2 parents: 092a261 + 198e9dd

Note: large commits have some content hidden by default, so several file names below are not shown.

50 files changed (+1344223, -10 lines)

README.md

Lines changed: 5 additions & 1 deletion

```diff
@@ -31,7 +31,7 @@ Top 3 must read books are:
 ### Great [list of over 10 communities to join](communities.md):
 
 Top must-join communities for DE:
-- [EcZachly Data Engineering Discord](https://discord.gg/JGumAXncAK)
+- [DataExpert.io Community Discord](https://discord.gg/JGumAXncAK)
 - [Data Talks Club Slack](https://datatalks.club/slack)
 - [Data Engineer Things Community](https://www.dataengineerthings.org/aboutus/)
 
@@ -80,6 +80,7 @@ Top must-join communities for ML:
 - [Looker Studio](https://lookerstudio.google.com/overview)
 - [Tableau](https://www.tableau.com/)
 - [Power BI](https://powerbi.microsoft.com/)
+- [Hex](https://hex.ai/)
 - [Apache Superset](https://superset.apache.org/)
 - [Evidence](https://evidence.dev)
 - Data Integration
@@ -183,6 +184,9 @@ Here's the mostly comprehensive list of data engineering creators:
 | Lenny | | [Lenny A](https://www.linkedin.com/in/lennyardiles/) (6k+) | | | |
 | Mehdi Ouazza | [Mehdio DataTV](https://www.youtube.com/@mehdio) (3k+) | [Mehdi Ouazza](https://www.linkedin.com/in/mehd-io/) (20k+) | [mehd_io](https://x.com/mehd_io) | | [@mehdio_datatv](https://www.tiktok.com/@mehdio_datatv) |
 | ITVersity | [ITVersity](https://www.youtube.com/@itversity) (67k+) | [Durga Gadiraju](https://www.linkedin.com/in/durga0gadiraju/) (48k+) | | |
+| Arnaud Milleker | | [Arnaud Milleker](https://www.linkedin.com/in/arnaudmilleker/) (7k+) | | | |
+| Soumil Shah | [Soumil Shah](https://www.youtube.com/@SoumilShah) (50k) | [Soumil Shah](https://www.linkedin.com/in/shah-soumil/) (8k+) | | | |
+| Ananth Packkildurai | | [Ananth Packkildurai](https://www.linkedin.com/in/ananthdurai/) (18k+) | | | |
 
 ### Great Podcasts
```

bootcamp/materials/1-dimensional-data-modeling/Makefile

Lines changed: 4 additions & 4 deletions

```diff
@@ -7,21 +7,21 @@ up:
 		cp example.env .env; \
 		exit 1; \
 	fi
-	docker-compose up -d;
+	docker compose up -d;
 
 .PHONY: down
 down:
-	docker-compose down -v
+	docker compose down -v
 	@if [[ "$(docker ps -q -f name=${DOCKER_CONTAINER})" ]]; then \
 		echo "Terminating running container..."; \
 		docker rm ${DOCKER_CONTAINER}; \
 	fi
 
 .PHONY: restart
 restart:
-	docker-compose down -v; \
+	docker compose down -v; \
 	sleep 5; \
-	docker-compose up -d;
+	docker compose up -d;
 
 .PHONY: logs
 logs:
```

bootcamp/materials/1-dimensional-data-modeling/README.md

Lines changed: 6 additions & 2 deletions

````diff
@@ -47,7 +47,7 @@ There are two methods to get Postgres running locally.
 2. Run this command after replacing **`<computer-username>`** with your computer's username:
 
    ```bash
-   psql -U <computer-username> postgres < data.dump
+   pg_restore -U <computer-username> -d postgres data.dump
    ```
 
 3. Set up DataGrip, DBeaver, or your VS Code extension to point at your locally running Postgres instance.
@@ -124,6 +124,10 @@ Where:
 - If the test connection is successful, click "Finish" or "Save" to save the connection. You should now be able to use the database client to manage your PostgreSQL database locally.
 
 ## **🚨 Tables not loading!? 🚨**
+- If you're seeing errors like `error: invalid command \N`, use `pg_restore` to load `data.dump`:
+  ```bash
+  pg_restore -U $POSTGRES_USER -d $POSTGRES_DB data.dump
+  ```
 - If you are on Windows and used **`docker compose up`**, table creation and data load do not happen at container creation. Once the Docker container is up and you have verified that you can connect to the empty Postgres database with a client of your choice, follow these steps:
 1. On Docker Desktop, connect to the my-postgres-container terminal.
 2. Run:
@@ -132,7 +136,7 @@ Where:
    -v ON_ERROR_STOP=1 \
    --username $POSTGRES_USER \
    --dbname $POSTGRES_DB \
-   < /docker-entrypoint-initdb.d/data.dump>
+   < /docker-entrypoint-initdb.d/data.dump
 ```
 - → This will run the file `data.dump` from inside your docker container.
````

bootcamp/materials/1-dimensional-data-modeling/docker-compose.yml

Lines changed: 0 additions & 1 deletion

```diff
@@ -5,7 +5,6 @@ services:
     container_name: ${DOCKER_CONTAINER}
     env_file:
       - .env
-      - example.env
     environment:
       - POSTGRES_DB=${POSTGRES_SCHEMA}
       - POSTGRES_USER=${POSTGRES_USER}
```

bootcamp/materials/1-dimensional-data-modeling/lecture-lab/players.sql

Lines changed: 1 addition & 1 deletion

```diff
@@ -18,7 +18,7 @@
     draft_round TEXT,
     draft_number TEXT,
     seasons season_stats[],
-    scorer_class scoring_class,
+    scoring_class scoring_class,
     years_since_last_active INTEGER,
     is_active BOOLEAN,
     current_season INTEGER,
```

bootcamp/materials/1-dimensional-data-modeling/sql/load_players_table_day2.sql

Lines changed: 1 addition & 1 deletion

```diff
@@ -63,7 +63,7 @@ SELECT
         WHEN (seasons[CARDINALITY(seasons)]::season_stats).pts > 15 THEN 'good'
         WHEN (seasons[CARDINALITY(seasons)]::season_stats).pts > 10 THEN 'average'
         ELSE 'bad'
-    END::scorer_class AS scorer_class,
+    END::scoring_class AS scoring_class,
     w.season - (seasons[CARDINALITY(seasons)]::season_stats).season as years_since_last_active,
     w.season,
     (seasons[CARDINALITY(seasons)]::season_stats).season = season AS is_active
```
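For context on this rename: `scoring_class` is a Postgres enum type. Only the 'good', 'average', and 'bad' labels are visible in the hunk above, so the definition below is a hedged sketch rather than something taken from this commit; in particular, the 'star' tier for the highest scorers is an assumption.

```sql
-- Hedged sketch of the enum behind the scorer_class -> scoring_class rename.
-- 'good', 'average', and 'bad' appear in the hunk above; 'star' is assumed.
CREATE TYPE scoring_class AS ENUM ('star', 'good', 'average', 'bad');
```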
New file (name hidden; a Python `.gitignore`)

Lines changed: 138 additions & 0 deletions

```gitignore
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

dump.sql

# Personal workspace files
.idea/*
.vscode/*

postgres-data/*
homework/your_username
```
New file (name hidden; the week 2 README)

Lines changed: 3 additions & 0 deletions

# Week 2 Fact Data Modeling

This repo follows the same setup as week 1. Please go to the dimensional data modeling [README](../1-dimensional-data-modeling/README.md) for instructions.

bootcamp/materials/2-fact-data-modeling/homework/.gitkeep

Whitespace-only changes.
New file (name hidden; the week 2 homework instructions)

Lines changed: 31 additions & 0 deletions

# Week 2 Fact Data Modeling

The homework this week uses the `devices` and `events` datasets.

Construct the following eight queries:

- A query to deduplicate `game_details` from Day 1 so there are no duplicates (see the sketch after this list)

- A DDL for a `user_devices_cumulated` table that has:
  - a `device_activity_datelist` which tracks a user's active days by `browser_type`
    - the data type here should look similar to `MAP<STRING, ARRAY[DATE]>`
    - or you could have `browser_type` as a column with multiple rows for each user (either way works, just be consistent! — see the DDL sketch after this list)

- A cumulative query to generate `device_activity_datelist` from `events`

- A `datelist_int` generation query that converts the `device_activity_datelist` column into a `datelist_int` column

- A DDL for a `hosts_cumulated` table with:
  - a `host_activity_datelist` which logs which dates each host experienced any activity

- The incremental query to generate `host_activity_datelist`

- A monthly, reduced fact table DDL `host_activity_reduced` with:
  - month
  - host
  - hit_array - think COUNT(1)
  - unique_visitors array - think COUNT(DISTINCT user_id)

- An incremental query that loads `host_activity_reduced` day by day

Please add these queries to a folder, zip them up, and submit [here](https://bootcamp.techcreator.io).
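For the deduplication item, a common Postgres pattern is `ROW_NUMBER()` over the table's natural key. Here is a minimal sketch, assuming `game_id`, `team_id`, and `player_id` form the grain of `game_details`; the assignment does not specify the key, so treat the `PARTITION BY` list as a placeholder:

```sql
-- Hedged sketch: keep one row per (game_id, team_id, player_id).
-- The key columns are an assumption; adjust PARTITION BY to the real grain.
WITH ranked AS (
    SELECT
        gd.*,
        ROW_NUMBER() OVER (
            PARTITION BY game_id, team_id, player_id
        ) AS row_num
    FROM game_details AS gd
)
SELECT *
FROM ranked
WHERE row_num = 1;  -- drop row_num by listing columns explicitly if desired
```

And one possible shape for the `user_devices_cumulated` DDL, using the row-per-`browser_type` variant the assignment allows; the `user_id` type and the snapshot `date` column are assumptions, not part of the assignment:

```sql
-- Hedged sketch of the row-per-browser_type variant.
CREATE TABLE user_devices_cumulated (
    user_id NUMERIC,                  -- type assumed; match events.user_id
    browser_type TEXT,
    device_activity_datelist DATE[],  -- all active dates seen so far
    date DATE,                        -- snapshot date the cumulation ran up to
    PRIMARY KEY (user_id, browser_type, date)
);
```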
