Commit 50595a6

Merge remote-tracking branch 'upstream/main'

2 parents 3bf88a0 + fc137c7

File tree

16 files changed: +810 -9 lines changed

README.md

Lines changed: 8 additions & 0 deletions

```diff
@@ -83,13 +83,18 @@ Top must-join communities for ML:
 - [Hex](https://hex.ai/)
 - [Apache Superset](https://superset.apache.org/)
 - [Evidence](https://evidence.dev)
+- [Redash](https://redash.io/)
+- [Lightdash](https://lightdash.com/)
 - Data Integration
 - [Cube](https://cube.dev)
 - [Fivetran](https://www.fivetran.com)
 - [Airbyte](https://airbyte.io)
 - [dlt](https://dlthub.com/)
 - [Sling](https://slingdata.io/)
 - [Meltano](https://meltano.com/)
+- Semantic Layers
+- [Cube](https://cube.dev)
+- [dbt Semantic Layer](https://www.getdbt.com/product/semantic-layer)
 - Modern OLAP
 - [Apache Druid](https://druid.apache.org/)
 - [ClickHouse](https://clickhouse.com/)
@@ -190,6 +195,9 @@ Here's the mostly comprehensive list of data engineering creators:
 | Arnaud Milleker | | [Arnaud Milleker](https://www.linkedin.com/in/arnaudmilleker/) (7k+) | | | |
 | Soumil Shah | [Soumil Shah] (https://www.youtube.com/@SoumilShah) (50k) | [Soumil Shah](https://www.linkedin.com/in/shah-soumil/) (8k+) | | | |
 | Ananth Packkildurai | | [Ananth Packkildurai](https://www.linkedin.com/in/ananthdurai/) (18k+) | | | |
+| Dan Kornas | | | [dankornas](https://www.twitter.com/dankornas) (66k+) | |
+| Nitin | https://www.linkedin.com/in/tomernitin29/ |
+| Manojkumar Vadivel | | [Manojkumar Vadivel](https://www.linkedin.com/in/manojvsj/) (12k+) |
 
 ### Great Podcasts
 
```

books.md

Lines changed: 2 additions & 1 deletion

```diff
@@ -29,4 +29,5 @@
 - [Pandas Cookbook, Third Edition](https://www.amazon.com/Pandas-Cookbook-Practical-scientific-exploratory/dp/1836205872)
 - [Data Pipelines Pocket Reference](https://www.oreilly.com/library/view/data-pipelines-pocket/9781492087823/)
 - [Stream Processing with Apache Flink](https://www.oreilly.com/library/view/stream-processing-with/9781491974285/)
-- [Apache Iceberg The Definitive Guide](https://www.oreilly.com/library/view/apache-iceberg-the/9781098148614/)
+- [Apache Iceberg The Definitive Guide](https://www.oreilly.com/library/view/apache-iceberg-the/9781098148614/)
+- [Python for Data Analysis, 3E](https://wesmckinney.com/book/)
```

bootcamp/materials/1-dimensional-data-modeling/README.md

Lines changed: 4 additions & 2 deletions

````diff
@@ -45,10 +45,12 @@ There are two methods to get Postgres running locally.
 - For Mac: Follow this **[tutorial](https://daily-dev-tips.com/posts/installing-postgresql-on-a-mac-with-homebrew/)** (Homebrew is really nice for installing on Mac)
 - For Windows: Follow this **[tutorial](https://www.sqlshack.com/how-to-install-postgresql-on-windows/)**
 2. Run this command after replacing **`<computer-username>`** with your computer's username:
-
+
 ```bash
-pg_restore -U <computer-username> postgres data.dump
+pg_restore -U <computer-username> -d postgres data.dump
 ```
+
+If you have any issue, the syntax is `pg_restore -U [username] -d [database_name] -h [host] -p [port] [backup_file]`
 
 3. Set up DataGrip, DBeaver, or your VS Code extension to point at your locally running Postgres instance.
 4. Have fun querying!
````
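
For reference, a filled-in version of the corrected command; the role, database, host, and port below are illustrative assumptions (matching the bootcamp's example.env defaults), not values fixed by this commit:

```bash
# Hypothetical example of the full pg_restore syntax from the tip above.
# Substitute your own username, database, host, and port.
pg_restore -U postgres -d postgres -h localhost -p 5432 data.dump
```
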

bootcamp/materials/1-dimensional-data-modeling/example.env

Lines changed: 3 additions & 3 deletions

```diff
@@ -3,12 +3,12 @@ POSTGRES_USER=postgres
 POSTGRES_DB=postgres
 POSTGRES_PASSWORD=postgres
 
-HOST_PORT=5434
-CONTAINER_PORT=5431
+HOST_PORT=5432
+CONTAINER_PORT=5432
 
 DOCKER_CONTAINER=my-postgres-container
 DOCKER_IMAGE=my-postgres-image
 
 PGADMIN_EMAIL=[email protected]
 PGADMIN_PASSWORD=postgres
-PGADMIN_PORT=5050
+PGADMIN_PORT=5050
```
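
Postgres listens on 5432 inside its container, which is presumably why CONTAINER_PORT (and the matching HOST_PORT) were moved to 5432. A minimal sketch of how these variables typically feed a host-to-container port mapping; the compose file itself is not part of this diff, and the image tag below is a placeholder:

```bash
# Sketch only: wire example.env values into a port mapping by hand.
# "postgres:16" is an illustrative image tag, not one defined in this commit.
set -a; source example.env; set +a
docker run -d \
  --name "$DOCKER_CONTAINER" \
  -e POSTGRES_USER="$POSTGRES_USER" \
  -e POSTGRES_DB="$POSTGRES_DB" \
  -e POSTGRES_PASSWORD="$POSTGRES_PASSWORD" \
  -p "${HOST_PORT}:${CONTAINER_PORT}" \
  postgres:16
```
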

bootcamp/materials/3-spark-fundamentals/notebooks/event_data_pyspark.ipynb

Lines changed: 6 additions & 3 deletions

```diff
@@ -2,7 +2,7 @@
 "cells": [
 {
 "cell_type": "code",
-"execution_count": 1,
+"execution_count": null,
 "id": "81cca085-dba2-42eb-a13b-fa64b6e86583",
 "metadata": {},
 "outputs": [
@@ -53,7 +53,11 @@
 "\n",
 "spark\n",
 "\n",
-"df = spark.read.option(\"header\", \"true\").csv(\"/home/iceberg/data/events.csv\").withColumn(\"event_date\", expr(\"DATE_TRUNC('day', event_time)\"))\n",
+"events = spark.read.option(\"header\", \"true\").csv(\"/home/iceberg/data/events.csv\").withColumn(\"event_date\", expr(\"DATE_TRUNC('day', event_time)\"))\n",
+"devices = spark.read.option(\"header\",\"true\").csv(\"/home/iceberg/data/devices.csv\")\n",
+"\n",
+"df = events.join(devices,on=\"device_id\",how=\"left\")\n",
+"df = df.withColumnsRenamed({'browser_type': 'browser_family', 'os_type': 'os_family'})\n",
 "\n",
 "df.show()"
 ]
@@ -570,7 +574,6 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "faaed2df",
 "metadata": {
 "collapsed": false,
 "jupyter": {
```

Lines changed: 137 additions & 0 deletions

```diff
@@ -0,0 +1,137 @@
+flink-env.env
+postgres-data
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# C extensions
+*.so
+
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+pip-wheel-metadata/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+
+# PyInstaller
+# Usually these files are written by a python script from a template
+# before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.py,cover
+.hypothesis/
+.pytest_cache/
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+local_settings.py
+db.sqlite3
+db.sqlite3-journal
+
+# Flask stuff:
+instance/
+.webassets-cache
+
+# Scrapy stuff:
+.scrapy
+
+# Sphinx documentation
+docs/_build/
+
+# PyBuilder
+target/
+
+# Jupyter Notebook
+.ipynb_checkpoints
+
+# IPython
+profile_default/
+ipython_config.py
+
+# pyenv
+.python-version
+
+# pipenv
+# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+# However, in case of collaboration, if having platform-specific dependencies or dependencies
+# having no cross-platform support, pipenv may install dependencies that don't work, or not
+# install all needed dependencies.
+#Pipfile.lock
+
+# PEP 582; used by e.g. github.com/David-OConnor/pyflow
+__pypackages__/
+
+# Celery stuff
+celerybeat-schedule
+celerybeat.pid
+
+# SageMath parsed files
+*.sage.py
+
+# Environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+
+# Spyder project settings
+.spyderproject
+.spyproject
+
+# Rope project settings
+.ropeproject
+
+# mkdocs documentation
+/site
+
+# mypy
+.mypy_cache/
+.dmypy.json
+dmypy.json
+
+# Pyre type checker
+.pyre/
+
+dump.sql
+
+# Personal workspace files
+.idea/*
+.vscode/*
```

Lines changed: 38 additions & 0 deletions

```diff
@@ -0,0 +1,38 @@
+FROM --platform=linux/amd64 flink:1.16.2
+
+# install python3: it has updated Python to 3.9 in Debian 11 and so install Python 3.7 from source
+# it currently only supports Python 3.6, 3.7 and 3.8 in PyFlink officially.
+RUN apt-get update -y && \
+    apt-get install -y build-essential libssl-dev zlib1g-dev libbz2-dev libffi-dev liblzma-dev && \
+    wget https://www.python.org/ftp/python/3.7.9/Python-3.7.9.tgz && \
+    tar -xvf Python-3.7.9.tgz && \
+    cd Python-3.7.9 && \
+    ./configure --without-tests --enable-shared && \
+    make -j6 && \
+    make install && \
+    ldconfig /usr/local/lib && \
+    cd .. && rm -f Python-3.7.9.tgz && rm -rf Python-3.7.9 && \
+    ln -s /usr/local/bin/python3 /usr/local/bin/python && \
+    apt-get clean && \
+    rm -rf /var/lib/apt/lists/*
+
+# install PyFlink
+COPY requirements.txt .
+RUN python -m pip install --upgrade pip; \
+    pip3 install -r requirements.txt --no-cache-dir;
+
+# Install Java 11
+RUN apt-get update && \
+    apt-get install -y openjdk-11-jdk && \
+    apt-get clean;
+
+# Set environment variables
+ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
+
+
+# Download connector libraries
+RUN wget -P /opt/flink/lib/ https://repo.maven.apache.org/maven2/org/apache/flink/flink-python/1.16.2/flink-python-1.16.2.jar; \
+    wget -P /opt/flink/lib/ https://repo.maven.apache.org/maven2/org/apache/flink/flink-sql-connector-kafka/1.16.2/flink-sql-connector-kafka-1.16.2.jar; \
+    wget -P /opt/flink/lib/ https://repo.maven.apache.org/maven2/org/apache/flink/flink-connector-jdbc/1.16.2/flink-connector-jdbc-1.16.2.jar; \
+    wget -P /opt/flink/lib/ https://repo1.maven.org/maven2/org/postgresql/postgresql/42.2.26/postgresql-42.2.26.jar;
+WORKDIR /opt/flink
```
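
A quick way to sanity-check an image built from this Dockerfile is to confirm the Python 3.7 and Java 11 toolchain it installs; the tag below is purely illustrative and not a name defined in this commit:

```bash
# Build the PyFlink base image and verify the interpreter/JDK it ships with.
docker build --platform linux/amd64 -t pyflink-base .   # "pyflink-base" is a placeholder tag
docker run --rm pyflink-base python --version            # expect Python 3.7.9
docker run --rm pyflink-base java -version               # expect OpenJDK 11
```
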

Lines changed: 71 additions & 0 deletions

```diff
@@ -0,0 +1,71 @@
+include flink-env.env
+
+PLATFORM ?= linux/amd64
+
+# COLORS
+GREEN := $(shell tput -Txterm setaf 2)
+YELLOW := $(shell tput -Txterm setaf 3)
+WHITE := $(shell tput -Txterm setaf 7)
+RESET := $(shell tput -Txterm sgr0)
+
+
+TARGET_MAX_CHAR_NUM=20
+
+## Show help with `make help`
+help:
+	@echo ''
+	@echo 'Usage:'
+	@echo '  ${YELLOW}make${RESET} ${GREEN}<target>${RESET}'
+	@echo ''
+	@echo 'Targets:'
+	@awk '/^[a-zA-Z\-\_0-9]+:/ { \
+		helpMessage = match(lastLine, /^## (.*)/); \
+		if (helpMessage) { \
+			helpCommand = substr($$1, 0, index($$1, ":")-1); \
+			helpMessage = substr(lastLine, RSTART + 3, RLENGTH); \
+			printf "  ${YELLOW}%-$(TARGET_MAX_CHAR_NUM)s${RESET} ${GREEN}%s${RESET}\n", helpCommand, helpMessage; \
+		} \
+	} \
+	{ lastLine = $$0 }' $(MAKEFILE_LIST)
+
+.PHONY: build
+## Builds the Flink base image with pyFlink and connectors installed
+build:
+	docker build --platform ${PLATFORM} -t ${IMAGE_NAME} .
+
+.PHONY: up
+## Builds the base Docker image and starts Flink cluster
+up:
+	docker compose --env-file flink-env.env up --build --remove-orphans -d
+
+.PHONY: down
+## Shuts down the Flink cluster
+down:
+	docker compose down --remove-orphans
+
+.PHONY: job
+## Submit the Flink job
+job:
+	docker compose exec jobmanager ./bin/flink run -py /opt/src/job/start_job.py --pyFiles /opt/src -d
+
+aggregation_job:
+	docker compose exec jobmanager ./bin/flink run -py /opt/src/job/aggregation_job.py --pyFiles /opt/src -d
+
+.PHONY: stop
+## Stops all services in Docker compose
+stop:
+	docker compose stop
+
+.PHONY: start
+## Starts all services in Docker compose
+start:
+	docker compose start
+
+.PHONY: clean
+## Stops and removes the Docker container as well as images with tag `<none>`
+clean:
+	docker compose stop
+	docker ps -a --format '{{.Names}}' | grep "^${CONTAINER_PREFIX}" | xargs -I {} docker rm {}
+	docker images | grep "<none>" | awk '{print $3}' | xargs -r docker rmi
+	# Uncomment line `docker rmi` if you want to remove the Docker image from this set up too
+	# docker rmi ${IMAGE_NAME}
```
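
Taken together, these targets give the intended local workflow. A typical session might look like the sketch below; it assumes flink-env.env (included by the Makefile but not shown in this commit view) defines IMAGE_NAME and CONTAINER_PREFIX:

```bash
# Workflow sketch using the Makefile targets added in this commit.
make build    # build the Flink 1.16.2 image with PyFlink and the Kafka/JDBC/Postgres connector jars
make up       # start the Flink cluster via docker compose
make job      # submit /opt/src/job/start_job.py to the jobmanager
make down     # tear the cluster down when finished
```
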
