
Data Platform Deployment

Install Dependencies

brew install terraform
brew install helm

Prepare helm charts:

helm repo add nessie-helm https://charts.projectnessie.org
helm repo add trino https://trinodb.github.io/charts/
helm repo add superset https://apache.github.io/superset
helm repo add dagster https://dagster-io.github.io/helm
helm repo update

Prepare Dagster User Code Docker Image

In the config/dagster/app folder you will find a simple Dagster DAG that copies the data from one table and stores it in a second table. For Dagster we need to build a user code Docker image and make it available to the Dagster server.

We have already done this for you; the Docker image is available at ghcr.io/bigdatarepublic/bdr-open-data-platform-code:1

You can also change the pipeline and create your own docker image:

cd ../../config/dagster
docker build --platform linux/amd64 -t dagster_code_amd64:1 .
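If you build your own image, the cluster also needs to be able to pull it. A minimal sketch; `REGISTRY` here is a placeholder for your own container registry, not a name taken from this repo:

```shell
# Hypothetical: tag the freshly built image and push it to a registry
# the cluster can reach. REGISTRY is a placeholder for your own registry.
docker tag dagster_code_amd64:1 REGISTRY/dagster_code_amd64:1
docker push REGISTRY/dagster_code_amd64:1
```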

Deploy Platform

cd platform

terraform init
# create workspace for local, cyso or scaleway
terraform workspace new (local/cyso/scaleway)
terraform workspace show

terraform apply
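After the apply finishes you can verify that everything came up. A quick check, assuming the platform was deployed into your current namespace:

```shell
# Wait until every pod in the current namespace reports Ready,
# then list them (adjust the timeout to taste).
kubectl wait --for=condition=Ready pods --all --timeout=300s
kubectl get pods
```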

You just deployed a data platform to your Kubernetes cluster. Do a little dance ♪┏(・o・)┛♪┗ ( ・o・) ┓♪.

Run An Example

Let's get it working now. First we will add some data via Trino:

# run trino in trino pod
TRINO_POD=$(kubectl get pods | grep trino | awk '{print $1}')
kubectl exec -it $TRINO_POD -- trino

In the Trino CLI you can run SQL statements to add data:

CREATE SCHEMA iceberg.test_schema;

CREATE TABLE iceberg.test_schema.employees_test
(
  name varchar,
  salary decimal(10,2)
)
WITH (
  format = 'PARQUET'
);

INSERT INTO iceberg.test_schema.employees_test (name, salary)  VALUES ('Steven Rogers', 55000);
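You can also run the same query non-interactively from your shell instead of the CLI prompt; a sketch, reusing the pod lookup from above (`--execute` is a standard Trino CLI option):

```shell
# Look up the Trino pod and run a one-off query against the new table.
TRINO_POD=$(kubectl get pods | grep trino | awk '{print $1}')
kubectl exec -it $TRINO_POD -- trino --execute "SELECT * FROM iceberg.test_schema.employees_test"
```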

Let's query the data with Superset:

# for local example use http://localhost:8088

# for cloud example get public ip
SUPERSET_IP=$(kubectl get svc | grep "superset.*LoadBalancer" | awk '{print $4}')
echo http://${SUPERSET_IP}:8088 # admin:admin

# add trino as database with url: trino://default@trino:8080/iceberg/test_schema
# run query in sql lab: SELECT * FROM test_schema.employees_test;

Let's run a pipeline with Dagster:

# for local example use port forwarding and http://localhost:30089
kubectl port-forward svc/dagster-dagster-webserver 30089:30089

# for cloud example get public ip
DAGSTER_IP=$(kubectl get svc | grep "dagster.*LoadBalancer" | awk '{print $4}')
echo http://${DAGSTER_IP}:30089
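Before materializing anything, it can help to confirm that the user code deployment is actually running; a quick check, assuming the relevant pod names contain `dagster`:

```shell
# The user code server and the webserver should both show up as Running.
kubectl get pods | grep dagster
```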

# try to materialize the asset