
Data Platform Deployment

Install Dependencies

brew install terraform
brew install helm

Prepare helm charts:

helm repo add nessie-helm https://charts.projectnessie.org
helm repo add trino https://trinodb.github.io/charts/
helm repo add superset https://apache.github.io/superset
helm repo add dagster https://dagster-io.github.io/helm
helm repo update

Prepare Dagster User Code Docker Image

In the config/dagster/app folder you will find a simple Dagster DAG that copies the data from one table and stores it in a second table. For Dagster we need to build a user code Docker image and make it available to the Dagster server.

We have already done this for you; the Docker image is available at ghcr.io/bigdatarepublic/bdr-open-data-platform-code:1

You can also change the pipeline and create your own docker image:

cd ../../config/dagster
docker build --platform linux/amd64 -t dagster_code_amd64:1 .
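If you build your own image, the cluster also needs to be able to pull it. A minimal sketch; `REGISTRY` here is a placeholder for your own container registry, not a name taken from this repo:

```shell
# Hypothetical: tag the freshly built image and push it to a registry
# the cluster can reach. REGISTRY is a placeholder for your own registry.
docker tag dagster_code_amd64:1 REGISTRY/dagster_code_amd64:1
docker push REGISTRY/dagster_code_amd64:1
```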

Deploy Platform

cd platform

terraform init
# create workspace for local, cyso or scaleway
terraform workspace new (local/cyso/scaleway)
terraform workspace show

terraform apply
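After the apply finishes you can verify that everything came up. A quick check, assuming the platform was deployed into your current namespace:

```shell
# Wait until every pod in the current namespace reports Ready,
# then list them (adjust the timeout to taste).
kubectl wait --for=condition=Ready pods --all --timeout=300s
kubectl get pods
```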

You just deployed a data platform to your Kubernetes cluster. Do a little dance ♪┏(・o・)┛♪┗ ( ・o・) ┓♪.

Run An Example

Let's get it working now. First we will add some data via Trino:

# run trino in trino pod
TRINO_POD=$(kubectl get pods | grep trino | awk '{print $1}')
kubectl exec -it $TRINO_POD -- trino

In the Trino CLI you can run SQL statements to add data:

CREATE SCHEMA iceberg.test_schema;

CREATE TABLE iceberg.test_schema.employees_test
(
  name varchar,
  salary decimal(10,2)
)
WITH (
  format = 'PARQUET'
);

INSERT INTO iceberg.test_schema.employees_test (name, salary)  VALUES ('Steven Rogers', 55000);
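You can also run the same query non-interactively from your shell instead of the CLI prompt; a sketch, reusing the pod lookup from above (`--execute` is a standard Trino CLI option):

```shell
# Look up the Trino pod and run a one-off query against the new table.
TRINO_POD=$(kubectl get pods | grep trino | awk '{print $1}')
kubectl exec -it $TRINO_POD -- trino --execute "SELECT * FROM iceberg.test_schema.employees_test"
```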

Let's query the data with Superset:

# for local example use http://localhost:8088

# for cloud example get public ip
SUPERSET_IP=$(kubectl get svc | grep "superset.*LoadBalancer" | awk '{print $4}')
echo http://${SUPERSET_IP}:8088 # admin:admin

# add trino as database with url: trino://default@trino:8080/iceberg/test_schema
# run query in sql lab: SELECT * FROM test_schema.employees_test;

Let's run a pipeline with Dagster:

# for local example use port forwarding and http://localhost:30089
kubectl port-forward svc/dagster-dagster-webserver 30089:30089

# for cloud example get public ip
DAGSTER_IP=$(kubectl get svc | grep "dagster.*LoadBalancer" | awk '{print $4}')
echo http://${DAGSTER_IP}:30089
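Before materializing anything, it can help to confirm that the user code deployment is actually running; a quick check, assuming the relevant pod names contain `dagster`:

```shell
# The user code server and the webserver should both show up as Running.
kubectl get pods | grep dagster
```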

# try to materialize the asset