Commit 50595a6

Merge remote-tracking branch 'upstream/main'

2 parents 3bf88a0 + fc137c7

File tree

16 files changed: +810 -9 lines changed

README.md

Lines changed: 8 additions & 0 deletions

```diff
@@ -83,13 +83,18 @@ Top must-join communities for ML:
 - [Hex](https://hex.ai/)
 - [Apache Superset](https://superset.apache.org/)
 - [Evidence](https://evidence.dev)
+- [Redash](https://redash.io/)
+- [Lightdash](https://lightdash.com/)
 - Data Integration
 - [Cube](https://cube.dev)
 - [Fivetran](https://www.fivetran.com)
 - [Airbyte](https://airbyte.io)
 - [dlt](https://dlthub.com/)
 - [Sling](https://slingdata.io/)
 - [Meltano](https://meltano.com/)
+- Semantic Layers
+- [Cube](https://cube.dev)
+- [dbt Semantic Layer](https://www.getdbt.com/product/semantic-layer)
 - Modern OLAP
 - [Apache Druid](https://druid.apache.org/)
 - [ClickHouse](https://clickhouse.com/)
@@ -190,6 +195,9 @@ Here's the mostly comprehensive list of data engineering creators:
 | Arnaud Milleker | | [Arnaud Milleker](https://www.linkedin.com/in/arnaudmilleker/) (7k+) | | | |
 | Soumil Shah | [Soumil Shah] (https://www.youtube.com/@SoumilShah) (50k) | [Soumil Shah](https://www.linkedin.com/in/shah-soumil/) (8k+) | | | |
 | Ananth Packkildurai | | [Ananth Packkildurai](https://www.linkedin.com/in/ananthdurai/) (18k+) | | | |
+| Dan Kornas | | | [dankornas](https://www.twitter.com/dankornas) (66k+) | |
+| Nitin | https://www.linkedin.com/in/tomernitin29/ |
+| Manojkumar Vadivel | | [Manojkumar Vadivel](https://www.linkedin.com/in/manojvsj/) (12k+) |
 
 ### Great Podcasts
 
```

books.md

Lines changed: 2 additions & 1 deletion

```diff
@@ -29,4 +29,5 @@
 - [Pandas Cookbook, Third Edition](https://www.amazon.com/Pandas-Cookbook-Practical-scientific-exploratory/dp/1836205872)
 - [Data Pipelines Pocket Reference](https://www.oreilly.com/library/view/data-pipelines-pocket/9781492087823/)
 - [Stream Processing with Apache Flink](https://www.oreilly.com/library/view/stream-processing-with/9781491974285/)
-- [Apache Iceberg The Definitive Guide](https://www.oreilly.com/library/view/apache-iceberg-the/9781098148614/)
+- [Apache Iceberg The Definitive Guide](https://www.oreilly.com/library/view/apache-iceberg-the/9781098148614/)
+- [Python for Data Analysis, 3E](https://wesmckinney.com/book/)
```

bootcamp/materials/1-dimensional-data-modeling/README.md

Lines changed: 4 additions & 2 deletions

````diff
@@ -45,10 +45,12 @@ There are two methods to get Postgres running locally.
 - For Mac: Follow this **[tutorial](https://daily-dev-tips.com/posts/installing-postgresql-on-a-mac-with-homebrew/)** (Homebrew is really nice for installing on Mac)
 - For Windows: Follow this **[tutorial](https://www.sqlshack.com/how-to-install-postgresql-on-windows/)**
 2. Run this command after replacing **`<computer-username>`** with your computer's username:
-
+
 ```bash
-pg_restore -U <computer-username> postgres data.dump
+pg_restore -U <computer-username> -d postgres data.dump
 ```
+
+If you have any issue, the syntax is `pg_restore -U [username] -d [database_name] -h [host] -p [port] [backup_file]`
 
 3. Set up DataGrip, DBeaver, or your VS Code extension to point at your locally running Postgres instance.
 4. Have fun querying!
````
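
For reference, a filled-in version of the corrected command; the role, database, host, and port below are illustrative assumptions (matching the bootcamp's example.env defaults), not values fixed by this commit:

```bash
# Hypothetical example of the full pg_restore syntax from the tip above.
# Substitute your own username, database, host, and port.
pg_restore -U postgres -d postgres -h localhost -p 5432 data.dump
```
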

bootcamp/materials/1-dimensional-data-modeling/example.env

Lines changed: 3 additions & 3 deletions

```diff
@@ -3,12 +3,12 @@ POSTGRES_USER=postgres
 POSTGRES_DB=postgres
 POSTGRES_PASSWORD=postgres
 
-HOST_PORT=5434
-CONTAINER_PORT=5431
+HOST_PORT=5432
+CONTAINER_PORT=5432
 
 DOCKER_CONTAINER=my-postgres-container
 DOCKER_IMAGE=my-postgres-image
 
 PGADMIN_EMAIL=[email protected]
 PGADMIN_PASSWORD=postgres
-PGADMIN_PORT=5050
+PGADMIN_PORT=5050
```
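
Postgres listens on 5432 inside its container, which is presumably why CONTAINER_PORT (and the matching HOST_PORT) were moved to 5432. A minimal sketch of how these variables typically feed a host-to-container port mapping; the compose file itself is not part of this diff, and the image tag below is a placeholder:

```bash
# Sketch only: wire example.env values into a port mapping by hand.
# "postgres:16" is an illustrative image tag, not one defined in this commit.
set -a; source example.env; set +a
docker run -d \
  --name "$DOCKER_CONTAINER" \
  -e POSTGRES_USER="$POSTGRES_USER" \
  -e POSTGRES_DB="$POSTGRES_DB" \
  -e POSTGRES_PASSWORD="$POSTGRES_PASSWORD" \
  -p "${HOST_PORT}:${CONTAINER_PORT}" \
  postgres:16
```
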

bootcamp/materials/3-spark-fundamentals/notebooks/event_data_pyspark.ipynb

Lines changed: 6 additions & 3 deletions

```diff
@@ -2,7 +2,7 @@
 "cells": [
 {
 "cell_type": "code",
-"execution_count": 1,
+"execution_count": null,
 "id": "81cca085-dba2-42eb-a13b-fa64b6e86583",
 "metadata": {},
 "outputs": [
@@ -53,7 +53,11 @@
 "\n",
 "spark\n",
 "\n",
-"df = spark.read.option(\"header\", \"true\").csv(\"/home/iceberg/data/events.csv\").withColumn(\"event_date\", expr(\"DATE_TRUNC('day', event_time)\"))\n",
+"events = spark.read.option(\"header\", \"true\").csv(\"/home/iceberg/data/events.csv\").withColumn(\"event_date\", expr(\"DATE_TRUNC('day', event_time)\"))\n",
+"devices = spark.read.option(\"header\",\"true\").csv(\"/home/iceberg/data/devices.csv\")\n",
+"\n",
+"df = events.join(devices,on=\"device_id\",how=\"left\")\n",
+"df = df.withColumnsRenamed({'browser_type': 'browser_family', 'os_type': 'os_family'})\n",
 "\n",
 "df.show()"
 ]
@@ -570,7 +574,6 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "faaed2df",
 "metadata": {
 "collapsed": false,
 "jupyter": {
```

Lines changed: 137 additions & 0 deletions

```diff
@@ -0,0 +1,137 @@
+flink-env.env
+postgres-data
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# C extensions
+*.so
+
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+pip-wheel-metadata/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+
+# PyInstaller
+# Usually these files are written by a python script from a template
+# before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.py,cover
+.hypothesis/
+.pytest_cache/
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+local_settings.py
+db.sqlite3
+db.sqlite3-journal
+
+# Flask stuff:
+instance/
+.webassets-cache
+
+# Scrapy stuff:
+.scrapy
+
+# Sphinx documentation
+docs/_build/
+
+# PyBuilder
+target/
+
+# Jupyter Notebook
+.ipynb_checkpoints
+
+# IPython
+profile_default/
+ipython_config.py
+
+# pyenv
+.python-version
+
+# pipenv
+# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+# However, in case of collaboration, if having platform-specific dependencies or dependencies
+# having no cross-platform support, pipenv may install dependencies that don't work, or not
+# install all needed dependencies.
+#Pipfile.lock
+
+# PEP 582; used by e.g. github.com/David-OConnor/pyflow
+__pypackages__/
+
+# Celery stuff
+celerybeat-schedule
+celerybeat.pid
+
+# SageMath parsed files
+*.sage.py
+
+# Environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+
+# Spyder project settings
+.spyderproject
+.spyproject
+
+# Rope project settings
+.ropeproject
+
+# mkdocs documentation
+/site
+
+# mypy
+.mypy_cache/
+.dmypy.json
+dmypy.json
+
+# Pyre type checker
+.pyre/
+
+dump.sql
+
+# Personal workspace files
+.idea/*
+.vscode/*
```

Lines changed: 38 additions & 0 deletions

```diff
@@ -0,0 +1,38 @@
+FROM --platform=linux/amd64 flink:1.16.2
+
+# install python3: it has updated Python to 3.9 in Debian 11 and so install Python 3.7 from source
+# it currently only supports Python 3.6, 3.7 and 3.8 in PyFlink officially.
+RUN apt-get update -y && \
+    apt-get install -y build-essential libssl-dev zlib1g-dev libbz2-dev libffi-dev liblzma-dev && \
+    wget https://www.python.org/ftp/python/3.7.9/Python-3.7.9.tgz && \
+    tar -xvf Python-3.7.9.tgz && \
+    cd Python-3.7.9 && \
+    ./configure --without-tests --enable-shared && \
+    make -j6 && \
+    make install && \
+    ldconfig /usr/local/lib && \
+    cd .. && rm -f Python-3.7.9.tgz && rm -rf Python-3.7.9 && \
+    ln -s /usr/local/bin/python3 /usr/local/bin/python && \
+    apt-get clean && \
+    rm -rf /var/lib/apt/lists/*
+
+# install PyFlink
+COPY requirements.txt .
+RUN python -m pip install --upgrade pip; \
+    pip3 install -r requirements.txt --no-cache-dir;
+
+# Install Java 11
+RUN apt-get update && \
+    apt-get install -y openjdk-11-jdk && \
+    apt-get clean;
+
+# Set environment variables
+ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
+
+
+# Download connector libraries
+RUN wget -P /opt/flink/lib/ https://repo.maven.apache.org/maven2/org/apache/flink/flink-python/1.16.2/flink-python-1.16.2.jar; \
+    wget -P /opt/flink/lib/ https://repo.maven.apache.org/maven2/org/apache/flink/flink-sql-connector-kafka/1.16.2/flink-sql-connector-kafka-1.16.2.jar; \
+    wget -P /opt/flink/lib/ https://repo.maven.apache.org/maven2/org/apache/flink/flink-connector-jdbc/1.16.2/flink-connector-jdbc-1.16.2.jar; \
+    wget -P /opt/flink/lib/ https://repo1.maven.org/maven2/org/postgresql/postgresql/42.2.26/postgresql-42.2.26.jar;
+WORKDIR /opt/flink
```
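
A quick way to sanity-check an image built from this Dockerfile is to confirm the Python 3.7 and Java 11 toolchain it installs; the tag below is purely illustrative and not a name defined in this commit:

```bash
# Build the PyFlink base image and verify the interpreter/JDK it ships with.
docker build --platform linux/amd64 -t pyflink-base .   # "pyflink-base" is a placeholder tag
docker run --rm pyflink-base python --version            # expect Python 3.7.9
docker run --rm pyflink-base java -version               # expect OpenJDK 11
```
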

Lines changed: 71 additions & 0 deletions

```diff
@@ -0,0 +1,71 @@
+include flink-env.env
+
+PLATFORM ?= linux/amd64
+
+# COLORS
+GREEN := $(shell tput -Txterm setaf 2)
+YELLOW := $(shell tput -Txterm setaf 3)
+WHITE := $(shell tput -Txterm setaf 7)
+RESET := $(shell tput -Txterm sgr0)
+
+
+TARGET_MAX_CHAR_NUM=20
+
+## Show help with `make help`
+help:
+	@echo ''
+	@echo 'Usage:'
+	@echo '  ${YELLOW}make${RESET} ${GREEN}<target>${RESET}'
+	@echo ''
+	@echo 'Targets:'
+	@awk '/^[a-zA-Z\-\_0-9]+:/ { \
+		helpMessage = match(lastLine, /^## (.*)/); \
+		if (helpMessage) { \
+			helpCommand = substr($$1, 0, index($$1, ":")-1); \
+			helpMessage = substr(lastLine, RSTART + 3, RLENGTH); \
+			printf "  ${YELLOW}%-$(TARGET_MAX_CHAR_NUM)s${RESET} ${GREEN}%s${RESET}\n", helpCommand, helpMessage; \
+		} \
+	} \
+	{ lastLine = $$0 }' $(MAKEFILE_LIST)
+
+.PHONY: build
+## Builds the Flink base image with pyFlink and connectors installed
+build:
+	docker build --platform ${PLATFORM} -t ${IMAGE_NAME} .
+
+.PHONY: up
+## Builds the base Docker image and starts Flink cluster
+up:
+	docker compose --env-file flink-env.env up --build --remove-orphans -d
+
+.PHONY: down
+## Shuts down the Flink cluster
+down:
+	docker compose down --remove-orphans
+
+.PHONY: job
+## Submit the Flink job
+job:
+	docker compose exec jobmanager ./bin/flink run -py /opt/src/job/start_job.py --pyFiles /opt/src -d
+
+aggregation_job:
+	docker compose exec jobmanager ./bin/flink run -py /opt/src/job/aggregation_job.py --pyFiles /opt/src -d
+
+.PHONY: stop
+## Stops all services in Docker compose
+stop:
+	docker compose stop
+
+.PHONY: start
+## Starts all services in Docker compose
+start:
+	docker compose start
+
+.PHONY: clean
+## Stops and removes the Docker container as well as images with tag `<none>`
+clean:
+	docker compose stop
+	docker ps -a --format '{{.Names}}' | grep "^${CONTAINER_PREFIX}" | xargs -I {} docker rm {}
+	docker images | grep "<none>" | awk '{print $3}' | xargs -r docker rmi
+	# Uncomment line `docker rmi` if you want to remove the Docker image from this set up too
+	# docker rmi ${IMAGE_NAME}
```
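
Taken together, these targets give the intended local workflow. A typical session might look like the sketch below; it assumes flink-env.env (included by the Makefile but not shown in this commit view) defines IMAGE_NAME and CONTAINER_PREFIX:

```bash
# Workflow sketch using the Makefile targets added in this commit.
make build    # build the Flink 1.16.2 image with PyFlink and the Kafka/JDBC/Postgres connector jars
make up       # start the Flink cluster via docker compose
make job      # submit /opt/src/job/start_job.py to the jobmanager
make down     # tear the cluster down when finished
```
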
