
Commit db955f9

Merge branch 'main' into new-profile
2 parents 4ab6a69 + b3319c4

28 files changed: +951 −4 lines

README.md

Lines changed: 12 additions & 2 deletions

books.md

Lines changed: 3 additions & 2 deletions

@@ -8,6 +8,8 @@
  - [Machine Learning System Design Interview](https://www.amazon.com/Machine-Learning-System-Design-Interview/dp/1736049127)
  - [Streaming Systems](https://www.amazon.com/Streaming-Systems-Where-Large-Scale-Processing/dp/1491983876)
  - [High Performance Spark](https://www.amazon.com/High-Performance-Spark-Practices-Optimizing/dp/1491943203)
+ - [Spark: The Definitive Guide](https://www.oreilly.com/library/view/spark-the-definitive/9781491912201/)
+ - [Learning Spark](https://www.oreilly.com/library/view/learning-spark/9781449359034/)
  - [Building Evolutionary Architectures, 2nd Edition](https://www.oreilly.com/library/view/building-evolutionary-architectures/9781492097532/)
  - [Data Management at Scale, 2nd Edition](https://www.oreilly.com/library/view/data-management-at/9781098138851/)
  - [Deciphering Data Architectures](https://www.oreilly.com/library/view/deciphering-data-architectures/9781098150754/)

@@ -27,5 +29,4 @@
  - [Pandas Cookbook, Third Edition](https://www.amazon.com/Pandas-Cookbook-Practical-scientific-exploratory/dp/1836205872)
  - [Data Pipelines Pocket Reference](https://www.oreilly.com/library/view/data-pipelines-pocket/9781492087823/)
  - [Stream Processing with Apache Flink](https://www.oreilly.com/library/view/stream-processing-with/9781491974285/)
- - [Apache Iceberg The Definitive Guide](https://www.oreilly.com/library/view/apache-iceberg-the/9781098148614/)
-
+ - [Apache Iceberg The Definitive Guide](https://www.oreilly.com/library/view/apache-iceberg-the/9781098148614/)
Lines changed: 138 additions & 0 deletions (new file)

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

dump.sql

# Personal workspace files
.idea/*
.vscode/*

postgres-data/*
homework/your_username
Lines changed: 42 additions & 0 deletions (new file)

include example.env

.PHONY: up
up:
	@if [ ! -f .env ]; then \
		echo "WARNING: .env file does not exist! 'example.env' copied to '.env'. Please update the configurations in the .env file before running this target."; \
		cp example.env .env; \
		exit 1; \
	fi
	docker-compose up -d;

.PHONY: down
down:
	docker-compose down -v
	@if [[ "$$(docker ps -q -f name=${DOCKER_CONTAINER})" ]]; then \
		echo "Terminating running container..."; \
		docker rm ${DOCKER_CONTAINER}; \
	fi

.PHONY: restart
restart:
	docker-compose down -v; \
	sleep 5; \
	docker-compose up -d;

.PHONY: logs
logs:
	docker logs ${DOCKER_CONTAINER}

.PHONY: inspect
inspect:
	docker inspect ${DOCKER_CONTAINER} | grep "Source"

.PHONY: ip
ip:
	@if [[ "$$(docker ps -q -f name=${DOCKER_CONTAINER})" ]]; then \
		echo "Container ${DOCKER_CONTAINER} running! Forwarding connections from $$(docker port ${DOCKER_CONTAINER})"; \
	else \
		echo "Container not running. Please start the container and try again."; \
	fi
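One detail worth calling out in the recipes above: Make expands `$(...)` itself before the shell ever runs, so shell command substitution inside a recipe must be written `$$(...)` (as the `ip` target does). A quick throwaway demonstration; the generated Makefile below is invented for illustration:

```bash
# Make expands $(...) first, so an unescaped command substitution
# silently becomes an empty string inside a recipe.
tmp=$(mktemp -d)
printf 'show:\n\t@echo "escaped: [$$(echo hi)]"\n\t@echo "unescaped: [$(echo hi)]"\n' > "$tmp/Makefile"
make -s -C "$tmp" show
# escaped: [hi]
# unescaped: []
```

This is why the `down` and `ip` targets must double the `$` when asking `docker ps` for a container id; with a single `$`, the check would always see an empty string.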
Lines changed: 151 additions & 0 deletions (new file)

# 📅 Data Modeling

This repository contains the setup for the data modeling modules in Weeks 1 and 2.

:wrench: **Tech Stack**

- Git
- Postgres
- PSQL CLI
- Database management environment (DataGrip, DBeaver, VS Code with extensions, etc.)
- Docker, Docker Compose, and Docker Desktop

:pencil: **TL;DR**

1. [Clone the repository](https://github.com/DataExpert-io/data-engineer-handbook/edit/main/bootcamp/materials/1-dimensional-data-modeling/README.md).
2. [Start a Postgres instance](https://github.com/DataExpert-io/data-engineer-handbook/edit/main/bootcamp/materials/1-dimensional-data-modeling/README.md#2%EF%B8%8F%E2%83%A3run-postgres).
3. [Connect to Postgres](https://github.com/DataExpert-io/data-engineer-handbook/edit/main/bootcamp/materials/1-dimensional-data-modeling/README.md#threeconnect-to-postgres-in-database-client) using your preferred database management tool.

For detailed instructions and more information, refer to the step-by-step guide below.

## 1️⃣ **Clone the repository**

- Clone the repo using the SSH link. This will create a new folder in the current directory on your local machine.

    ```bash
    git clone git@github.com:DataExpert-io/data-engineer-handbook.git
    ```

    > ℹ️ To interact with GitHub repositories securely, it is recommended to use SSH keys. Follow the instructions provided **[here](https://docs.github.com/en/authentication/connecting-to-github-with-ssh/adding-a-new-ssh-key-to-your-github-account)** to set up SSH keys on GitHub.

- Navigate into the cloned repo using the command line:

    ```bash
    cd data-engineer-handbook/bootcamp/materials/1-dimensional-data-modeling
    ```

## 2️⃣ **Run Postgres**

There are two methods to get Postgres running locally.

### 💻 **Option 1: Run on local machine**

1. Install Postgres:
    - For Mac: follow this **[tutorial](https://daily-dev-tips.com/posts/installing-postgresql-on-a-mac-with-homebrew/)** (Homebrew makes installation on a Mac straightforward).
    - For Windows: follow this **[tutorial](https://www.sqlshack.com/how-to-install-postgresql-on-windows/)**.
2. Run this command after replacing **`<computer-username>`** with your computer's username:

    ```bash
    psql -U <computer-username> postgres < data.dump
    ```

3. Set up DataGrip, DBeaver, or your VS Code extension to point at your locally running Postgres instance.
4. Have fun querying!

### 🐳 **Option 2: Run Postgres in Docker**

- Install Docker Desktop from **[here](https://www.docker.com/products/docker-desktop/)**.
- Copy **`example.env`** to **`.env`**:

    ```bash
    cp example.env .env
    ```

- Start the Docker Compose container:
    - If you're on Mac:

        ```bash
        make up
        ```

    - If you're on Windows:

        ```bash
        docker compose up -d
        ```

- A folder named **`postgres-data`** will be created in the root of the repo. The data backing your Postgres instance will be saved here.
- You can check that your Docker Compose stack is running by either:
    - Going into Docker Desktop: you should see an entry there with a drop-down for each of the containers running in your Docker Compose stack.
    - Running **`docker ps -a`** and looking for the containers with the name **`postgres`**.
- When you're finished with your Postgres instance, you can stop the Docker Compose containers with:

    ```bash
    make down
    ```

    Or, if you're on Windows:

    ```bash
    docker compose down -v
    ```

### :rotating_light: **Need help loading tables?** :rotating_light:

> If the data dump fails to load tables and you see the message `PostgreSQL Database directory appears to contain a database; Skipping initialization.`, refer to the troubleshooting instructions below.

## :three: **Connect to Postgres in Database Client**

- Some options for interacting with your Postgres instance:
    - DataGrip - JetBrains; 30-day free trial or paid version.
    - VSCode built-in extension (there are a few of these).
    - PGAdmin.
    - Postbird.
- Using your client of choice, follow the instructions to establish a new PostgreSQL connection.
    - The default username is **`postgres`** and corresponds to **`$POSTGRES_USER`** in your **`.env`**.
    - The default password is **`postgres`** and corresponds to **`$POSTGRES_PASSWORD`** in your **`.env`**.
    - The default database is **`postgres`** and corresponds to **`$POSTGRES_DB`** in your **`.env`**.
    - The default host is **`localhost`** or **`0.0.0.0`**. This is the IP address of the Docker container running the PostgreSQL instance.
    - The default port for Postgres is **`5432`**. This corresponds to the **`$CONTAINER_PORT`** variable in the **`.env`** file.

    &rarr; :bulb: You can edit these values by modifying the corresponding values in **`.env`**.

- If the test connection is successful, click "Finish" or "Save" to save the connection. You should now be able to use the database client to manage your PostgreSQL database locally.
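As a sanity check, these defaults can be assembled into a libpq-style connection URL that clients such as `psql` accept. A minimal sketch; nothing here contacts the server, it only prints the URL (export the variables, or source your `.env`, to override the defaults):

```bash
# Build a libpq-style connection URL from the .env defaults.
: "${POSTGRES_USER:=postgres}"
: "${POSTGRES_PASSWORD:=postgres}"
: "${POSTGRES_DB:=postgres}"
: "${CONTAINER_PORT:=5432}"
echo "postgresql://${POSTGRES_USER}:${POSTGRES_PASSWORD}@localhost:${CONTAINER_PORT}/${POSTGRES_DB}"
# postgresql://postgres:postgres@localhost:5432/postgres
```

You can then pass the printed URL straight to your client, e.g. `psql "postgresql://postgres:postgres@localhost:5432/postgres"`.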
## **🚨 Tables not loading!? 🚨**

- If you are on Windows and used **`docker compose up`**, table creation and the data load do not happen when the container is created. Once the container is up and you have verified that you can connect to the (still empty) postgres database with your client of choice, follow these steps:
    1. In Docker Desktop, open a terminal in the my-postgres-container container.
    2. Run:

        ```bash
        psql \
            -v ON_ERROR_STOP=1 \
            --username $POSTGRES_USER \
            --dbname $POSTGRES_DB \
            < /docker-entrypoint-initdb.d/data.dump
        ```

    - → This will run the file `data.dump` from inside your docker container.

- If the tables still don't load, follow these steps with a manual installation of Postgres:

    1. Find where your `psql` client is installed (something like `C:\Program Files\PostgreSQL\13\runpsql.bat`).
    2. Make sure you're in the root of the repo, and launch `psql` by running that `.bat` script.
    3. Enter your credentials for Postgres (described in the connect-to-Postgres section above).
    - → If the above worked, you should now be inside a psql REPL (the prompt looks like `postgres=#`).
    4. Run:

        ```bash
        postgres=# \i data.dump
        ```

    - → This will run the file `data.dump` from inside your psql REPL.

---

#### 💡 Additional Docker Make commands

- To restart the Postgres instance, you can run **`make restart`**.
- To see logs from the Postgres container, run **`make logs`**.
- To inspect the Postgres container, run **`make inspect`**.
- To find the port Postgres is running on, run **`make ip`**.
Binary file not shown.
Lines changed: 23 additions & 0 deletions (new file)

version: "3.9"
services:
  postgres:
    image: postgres:14
    restart: on-failure
    container_name: ${DOCKER_CONTAINER}
    env_file:
      - .env
      - example.env
    environment:
      - POSTGRES_DB=${POSTGRES_SCHEMA}
      - POSTGRES_USER=${POSTGRES_USER}
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
    ports:
      - "${CONTAINER_PORT}:5432"
    volumes:
      - ./:/bootcamp/
      - ./data.dump:/docker-entrypoint-initdb.d/data.dump
      - ./scripts/init-db.sh:/docker-entrypoint-initdb.d/init-db.sh
      - postgres-data:/var/lib/postgresql/data

volumes:
  postgres-data:
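Note that in the `ports` entry above, the left-hand side of `"${CONTAINER_PORT}:5432"` is the port exposed on the host, while the right-hand side is the fixed port inside the container. Compose resolves the `${...}` placeholders from the shell environment and the `env_file` entries before creating the service; a rough shell analogue of that substitution step, with values mirroring `example.env`:

```bash
# Rough analogue of Compose's ${VAR} interpolation (no Docker needed).
export DOCKER_CONTAINER=my-postgres-container
export CONTAINER_PORT=5432
echo "container_name: ${DOCKER_CONTAINER}"
echo "ports: \"${CONTAINER_PORT}:5432\""
# container_name: my-postgres-container
# ports: "5432:5432"
```

If a variable is unset, Compose substitutes an empty string, which is why copying `example.env` to `.env` before `make up` matters.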
Lines changed: 10 additions & 0 deletions (new file)

POSTGRES_SCHEMA=postgres
POSTGRES_USER=postgres
POSTGRES_DB=postgres
POSTGRES_PASSWORD=postgres

HOST_PORT=5432
CONTAINER_PORT=5432

DOCKER_CONTAINER=my-postgres-container
DOCKER_IMAGE=my-postgres-image
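Both the Makefile (`include example.env`) and Compose (`env_file`) read this file as plain `KEY=value` pairs, so a POSIX shell can source it the same way. A small sketch; the temp file below stands in for the real env file:

```bash
# Source a KEY=value env file and export everything it assigns.
envfile=$(mktemp)
printf 'POSTGRES_USER=postgres\nCONTAINER_PORT=5432\n' > "$envfile"
set -a              # auto-export every assignment made while sourcing
. "$envfile"
set +a
echo "user=$POSTGRES_USER port=$CONTAINER_PORT"
# user=postgres port=5432
```

The `set -a` / `set +a` pair is what makes the variables visible to child processes such as `docker` and `psql`, matching how the Makefile targets consume `${DOCKER_CONTAINER}`.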

bootcamp/materials/1-dimensional-data-modeling/homework/.gitkeep

Whitespace-only changes.
