DataStack-in-a-Box is a self-managed, on-premise open-source data platform built entirely using Docker. It brings together key components of a modern data stack in a modular and lightweight architecture. The platform uses MinIO as the storage layer, Trino and DuckDB as execution engines, Airflow for orchestration, dbt for transformation, and Metabase for visualization.
To support streamlined development and deployment, GitLab is integrated for CI/CD. This project was designed to be scalable, cost-effective, and easily reproducible—ideal for teams or individuals looking to run a complete data platform locally or in a controlled environment without relying on fully managed cloud services.
MinIO commands:
- Open a shell inside the MinIO container: docker exec -it <container_name> /bin/bash
- Create an alias pointing at the local server: mc alias set <alias_name> http://localhost:9000 <access_key> <secret_key>
- List the aliases you have already created: mc alias list
- Create a bucket in MinIO storage: mc mb <alias_name>/<bucket_name>
Access MinIO via web browser: http://localhost:9000
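If you prefer to script these steps, the same bucket setup can be done with the MinIO Python SDK. This is only a minimal sketch; the endpoint, credentials, and bucket name below are placeholders for whatever you configured in your Docker Compose file:

```python
# Minimal sketch using the MinIO Python SDK (pip install minio).
# Endpoint, credentials, and bucket name are placeholders, not this project's actual values.
from minio import Minio

client = Minio(
    "localhost:9000",          # MinIO S3 API endpoint
    access_key="minioadmin",   # replace with the root user from your compose file
    secret_key="minioadmin",   # replace with the root password from your compose file
    secure=False,              # plain HTTP on localhost
)

bucket = "datalake"
if not client.bucket_exists(bucket):
    client.make_bucket(bucket)

print([b.name for b in client.list_buckets()])
```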
Trino commands:
- Open a shell inside the Trino container: docker exec -it <container_name> /bin/bash
- By default, Trino runs on a specific port (8080 by default) and does not let you log in with a password.
To enable password login, follow the steps in this documentation: https://trino.io/docs/current/security/password-file.html. After adding the few lines of configuration it describes and enabling the new HTTPS port (8443 by default), you can log in to the Trino UI with a password.
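Once password authentication is enabled on the HTTPS port, you can verify it with the Trino Python client. This is a hedged sketch: the host, user, and password are placeholders, and certificate verification is disabled only because a local setup typically uses a self-signed certificate:

```python
# Sketch: verify that password login works against Trino over HTTPS (pip install trino).
# Host, port, user, and password are placeholders for your own configuration.
import trino
from trino.auth import BasicAuthentication

conn = trino.dbapi.connect(
    host="localhost",
    port=8443,
    user="admin",
    http_scheme="https",
    auth=BasicAuthentication("admin", "your_password"),
    verify=False,  # only acceptable locally with a self-signed certificate
)

cur = conn.cursor()
cur.execute("SHOW CATALOGS")
print(cur.fetchall())
```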
The dbt container likely stops right after starting because it has no persistent process keeping it alive. Since dbt is a CLI tool and not a long-running service, the container runs its command and then exits unless it is explicitly kept alive. So it is normal that docker ps shows no dbt container even though you have already started your Docker Compose stack.
Check that dbt works by trying to extract SQL Server tables and sink them into an Iceberg table using dbt:
Besides using dbt with Trino, we can also extract SQL Server tables (this requires installing the ODBC Driver for SQL Server) and save them in a file-based format in MinIO storage using DuckDB + Python.
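A rough sketch of that DuckDB + Python path is below. The connection string, table, bucket, and credentials are all placeholders, and it assumes pyodbc, pandas, and duckdb are installed alongside the ODBC Driver for SQL Server:

```python
# Sketch: read a SQL Server table with pyodbc/pandas, then have DuckDB write it
# to MinIO as Parquet over the S3 API. All names and credentials are placeholders.
import duckdb
import pandas as pd
import pyodbc

# 1. Pull the source table from SQL Server (driver name depends on the version you installed).
mssql = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=sqlserver,1433;DATABASE=source_db;UID=sa;PWD=your_password"
)
df = pd.read_sql("SELECT * FROM dbo.customers", mssql)

# 2. Point DuckDB's httpfs extension at MinIO instead of AWS S3.
con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("SET s3_endpoint='localhost:9000';")
con.execute("SET s3_access_key_id='minioadmin';")
con.execute("SET s3_secret_access_key='minioadmin';")
con.execute("SET s3_use_ssl=false;")
con.execute("SET s3_url_style='path';")

# 3. Register the DataFrame and copy it out to the bucket as Parquet.
con.register("customers", df)
con.execute("COPY customers TO 's3://datalake/customers.parquet' (FORMAT PARQUET);")
```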
In this project, I'm implementing a production-style setup using Airflow's CeleryExecutor instead of the SequentialExecutor (a single-node executor):
- The Airflow Scheduler assigns tasks to the CeleryExecutor.
- The CeleryExecutor pushes tasks into a queue (e.g., Redis, RabbitMQ).
- Multiple Celery workers pick up and execute the tasks in parallel.
- The workers then report the task status back to Airflow.
This is a simple DAG that runs my Python script, which extracts a SQL Server table and converts it to Parquet using DuckDB.
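A hedged sketch of what such a DAG can look like is below; the DAG id, schedule, and script path are placeholders, and the extraction logic itself lives in the Python script described earlier:

```python
# Sketch of a minimal DAG that triggers the extraction script.
# dag_id, schedule, and the script path are placeholders for this project.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="sqlserver_to_minio_parquet",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",   # use schedule_interval on Airflow < 2.4
    catchup=False,
) as dag:
    extract = BashOperator(
        task_id="extract_sqlserver_to_parquet",
        bash_command="python /opt/airflow/scripts/extract_sqlserver_to_parquet.py",
        # With CeleryExecutor, this task is queued and picked up by a Celery worker.
    )
```

Depending on how the script is written, the MinIO and SQL Server credentials can also be read from the Airflow connections configured in the next step.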
Before you start your scheduler, you need to add connections to MinIO and SQL Server in the Admin section:
Check the Parquet file in MinIO:
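Besides the MinIO console, you can verify the file programmatically by querying it straight from MinIO with DuckDB (same placeholder endpoint, credentials, and object path as in the extraction sketch):

```python
# Sketch: query the Parquet file directly from MinIO to confirm it landed.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("SET s3_endpoint='localhost:9000';")
con.execute("SET s3_access_key_id='minioadmin';")
con.execute("SET s3_secret_access_key='minioadmin';")
con.execute("SET s3_use_ssl=false;")
con.execute("SET s3_url_style='path';")

# Placeholder object path; adjust to wherever the DAG wrote the file.
row_count = con.execute(
    "SELECT COUNT(*) FROM read_parquet('s3://datalake/customers.parquet')"
).fetchone()
print(row_count)
```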
You can also access Flower through the web browser for:
- Task Monitoring: You can see the real-time status of tasks, including whether they are queued, started, or finished.
- Worker Monitoring: View the status of workers, their load, and task completion times.
- Real-Time Events: See detailed logs of events such as task failures or retries.
- Performance Metrics: Monitor various performance metrics related to your Celery workers.
Trino is not a database but an execution engine; Metabase can nonetheless connect to tables through Trino because Trino exposes table metadata over the JDBC connection. To connect Metabase to Trino, you need to install the Starburst driver first and mount it into the /plugins directory inside the Metabase container: https://github.com/starburstdata/metabase-driver/releases/download/6.1.0/starburst-6.1.0.metabase-driver.jar
After running the container, access http://localhost:3000. On first access, you will need to enter some information to set up your Metabase credentials, and then you can add a database connection to Metabase:
Because you already downloaded the Starburst driver and mounted it into the /plugins directory of the Metabase container, Starburst will appear as an option under Database type.
Enter the following information:
- Database type: Select Starburst in the dropdown. If this option does not appear, review the requirements and make sure you have installed the Starburst driver.
- Display name: A name for this database in Metabase, such as the catalog name and cluster name (free to decide).
- Host: Hostname or IP address of the cluster. You can check the IP address of your database container (in this case, Trino) with this command: docker inspect -f '{{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}' <container_name/container_id>
- Port: Port for the cluster. If the cluster is secured with SSL/TLS, make sure to specify the secure port for that connection. For now, however, we can't use SSL/TLS (port 8443) because of an open issue: Connect to Trino via certificates · Issue #126 · starburstdata/metabase-driver. So just use port 8080 (without a password).
- Catalog: The name of the catalog to be used for this database. Use the iceberg catalog.
- Schema (optional): A schema within the catalog, limiting data to the subset within that schema.
- Username: Username to connect to the cluster.
- Password: Password to connect to the cluster. If the cluster is unsecured, leave this field blank. Since we are using port 8080 here, no password is needed.
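Before opening Metabase, you can create a small test table in the Iceberg catalog through Trino so there is something to sync. This is a rough sketch using the Trino Python client: the schema and table names are made up, it uses the unsecured port 8080 discussed above, and depending on your catalog configuration the schema may need an explicit location:

```python
# Sketch: create a tiny Iceberg table through Trino so Metabase has something to show.
# Catalog, schema, and table names are placeholders; port 8080 is the unauthenticated one.
import trino

conn = trino.dbapi.connect(host="localhost", port=8080, user="admin", catalog="iceberg")
cur = conn.cursor()

# Consume each result so the statements run to completion.
cur.execute("CREATE SCHEMA IF NOT EXISTS iceberg.demo")
cur.fetchall()
cur.execute(
    "CREATE TABLE IF NOT EXISTS iceberg.demo.health_check AS "
    "SELECT 1 AS id, 'hello metabase' AS note"
)
cur.fetchall()

cur.execute("SELECT * FROM iceberg.demo.health_check")
print(cur.fetchall())
```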
After creating a table in Trino, check your Trino database in the Metabase web UI. If no tables show up, you need to sync the database manually first:
- Go to Admin Settings → Databases.
- Find your Trino database and click Sync database schema now.
- Wait a few minutes and refresh the table list.
In the web browser, I'm using external_url = 'gitlab.local' to access GitLab instead of "localhost://.....". But before that, a few things need to be configured.
You need to add an entry to your hosts file so that gitlab.local resolves to your GitLab server, e.g. 127.0.0.1 gitlab.local in /etc/hosts (Linux) or C:\Windows\System32\drivers\etc\hosts (Windows):
To log in to GitLab, you need to enter root as the username. The initial password can be found with this method:
After logging in with the root username and the initial password, you can change it to a password of your choice from the web UI:
Configure SSH in GitLab CI/CD
Accessing Git Repositories Remotely
- If you're pushing/pulling code from a different machine (e.g., your laptop to your GitLab server), SSH provides a secure and password-free way to authenticate.
- Even if GitLab is on your own server, you might access it from different devices (workstation, CI/CD runners, etc.).
Using Git Remotely Over SSH
When you set up an SSH key, GitLab allows you to use URLs like:
git clone [email protected]:your-repo.git
Instead of HTTPS, which requires a username and password (or a PAT).
Secure Automation (CI/CD, Deployments, Scripts)
- If your GitLab server is used for deployments or automated tasks (e.g., fetching repositories inside a container or running CI/CD jobs), SSH allows those processes to authenticate securely without storing plain text credentials.
Generate an SSH key with this command (works on both Windows and Linux):
ssh-keygen
After that, go to your GitLab Preferences → SSH Keys and paste the contents of your .pub file.
There are also several GitLab features, like the Container Registry and Runner, that can be explored. Just for info:
- GitLab Container Registry is a Docker container registry built into GitLab. It allows you to store, manage, and distribute container images within your GitLab projects. This is particularly useful for CI/CD pipelines, as you can build, push, and pull container images directly within GitLab. It's like Docker Hub, but embedded in GitLab.
- GitLab Runner is an application that picks up and executes CI/CD jobs for GitLab. I also provide a Docker Compose file for both of them that you can look at.
Last but not least, there is something called Git-Sync. The git-sync service does exactly what its name suggests and acts as a one-way synchronization tool:
- It continuously watches your GitLab repository for changes
- When it detects changes, it pulls them from the repository
- It updates your local Airflow DAGs folder with these changes
- Airflow then sees and executes these updated DAG files
The workflow becomes:
- You develop and commit DAG changes to your GitLab repository
- Git-sync automatically pulls those changes to your Airflow DAGs folder
- Airflow scheduler picks up these changes and executes the DAGs
This creates a clean separation between:
- Development and version control (handled in GitLab)
- Execution (handled by Airflow)
The git-sync container handles the synchronization between these two environments automatically, eliminating the need for manual file copying or deployments. I also provide the Docker Compose file for Git-Sync that you can adjust.