🚀 ALICE Data Lakehouse

A distributed data lakehouse for processing and analyzing infrastructure monitoring data from the ALICE experiment at CERN. The platform gives researchers tools for collecting, storing, and analyzing high-volume, heterogeneous data streams from grid computing systems.

🌟 Key Features

Distributed Data Collection: Real-time processing of large-scale infrastructure data.
Scalable Architecture: Supports both structured and unstructured data storage and analysis.
Version Control for Datasets: Git-like management enabling collaborative research.
Web-Based SQL Interface: Optimized for scientific queries and research workflows.
Automated ETL Pipelines: Asynchronous data processing with scheduled workflows.
Multi-Level Access: Designed for both technical and non-technical users.

🛠️ Architecture Overview

The system runs as a set of containers, each responsible for one component of the data lakehouse:

Services Overview

Service	Description
Postgres	Stores metadata for the Hive Metastore.
Metastore	Handles table definitions and metadata for data stored in the data lake.
Trino	SQL query engine for querying data across various storage backends.
Spark	Distributed data processing engine for ETL and machine learning workflows.
Spark Worker	Executes distributed tasks managed by the Spark master node.

🐳 Dockerized Services

The sections below describe each service as configured in the Docker Compose setup:

Postgres

Purpose: Metadata storage for Hive Metastore.
Image: postgres:13
Environment Variables:
- POSTGRES_DB=metastore
- POSTGRES_USER=postgres
- POSTGRES_PASSWORD=postgres

Hive Metastore

Purpose: Central metadata repository for the data lake.
Image: my-hive-metastore:latest (custom image)
Environment Variables:
- DB_DRIVER=postgres
- METASTORE_DB_HOSTNAME=postgres
- METASTORE_DB_PORT=5432
- METASTORE_DB_NAME=metastore
- METASTORE_DB_USER=postgres
- METASTORE_DB_PASSWORD=postgres

Trino

Purpose: SQL query engine for large-scale data analysis.
Image: trinodb/trino:426
Environment Variables:
- AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY}
- AWS_SECRET_ACCESS_KEY=${AWS_SECRET_KEY}
- AWS_ENDPOINT=https://s3p.cloud.cyfronet.pl
- AWS_DEFAULT_REGION=us-west-2
- TRINO_ENDPOINT=http://trinodb:8080

Spark

Purpose: Distributed data processing engine.
Image: bitnami/spark:3.4.1
Mode: Master and Worker nodes
Environment Variables:
- SPARK_MODE=master (for the master container)
- SPARK_MODE=worker (for worker containers)
- SPARK_MASTER_HOST=spark
- SPARK_MASTER_PORT=7077
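
As an illustration of how these two variables combine, the following minimal sketch builds the master URL that a PySpark client would connect to. The helper names and the app name are hypothetical, and it assumes pyspark (ideally matching Spark 3.4.1) is installed wherever the session is created. The spark hostname resolves only from containers attached to the same Docker network.

```python
import os


def spark_master_url(env=os.environ) -> str:
    """Build spark://host:port from SPARK_MASTER_HOST and SPARK_MASTER_PORT.

    The fallbacks mirror the values listed above.
    """
    host = env.get("SPARK_MASTER_HOST", "spark")
    port = env.get("SPARK_MASTER_PORT", "7077")
    return f"spark://{host}:{port}"


def make_session(app_name: str = "alice-etl-sketch"):
    """Create a SparkSession bound to the master (hypothetical helper)."""
    # Imported lazily so the URL helper works without pyspark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master(spark_master_url())
        .appName(app_name)
        .getOrCreate()
    )
```

With the values above, spark_master_url() yields spark://spark:7077.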

Networks

All services communicate over a dedicated Docker network named dldg.

📦 Environment Variables

The system requires the following environment variables to be configured in a .env file:

AWS_ACCESS_KEY=your_access_key
AWS_SECRET_KEY=your_secret_key
AWS_ENDPOINT=https://s3p.cloud.cyfronet.pl
AWS_REGION=us-west-2
METASTORE_DB_USER=postgres
METASTORE_DB_PASSWORD=postgres
TRINO_ENDPOINT=http://trinodb:8080
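
Before starting the stack, it can help to confirm that the .env file defines every variable listed above. The sketch below is one possible check and not part of the repository: the function names are hypothetical, and treating all seven variables as required is an assumption.

```python
from pathlib import Path

# Names taken from the list above; treating all of them as required is an assumption.
REQUIRED_VARS = (
    "AWS_ACCESS_KEY",
    "AWS_SECRET_KEY",
    "AWS_ENDPOINT",
    "AWS_REGION",
    "METASTORE_DB_USER",
    "METASTORE_DB_PASSWORD",
    "TRINO_ENDPOINT",
)


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip optional surrounding quotes from the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_vars(values: dict[str, str]) -> list[str]:
    """Return required names that are absent or empty, in declaration order."""
    return [name for name in REQUIRED_VARS if not values.get(name)]


if __name__ == "__main__":
    env_path = Path(".env")
    values = parse_env(env_path.read_text()) if env_path.exists() else {}
    missing = missing_vars(values)
    print("missing:", ", ".join(missing) if missing else "none")
```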

🚀 Quick Start Guide

Follow these steps to set up and run the ALICE Data Lakehouse platform on your local machine.

⚙️ Step 1: Install Prerequisites

Before starting, ensure the following tools are installed on your system:

Docker: see the official "Get Docker" installation page.
Docker Compose: see the official "Install Docker Compose" page.

📝 Step 2: Clone the Repository

Clone the repository to your local machine:

git clone https://github.com/your-repo/alice-data-lakehouse.git
cd alice-data-lakehouse

🔧 Step 3: Configure Environment Variables

Create a .env file in the root directory of the project and define the required environment variables. For example:

AWS_ACCESS_KEY=your_access_key
AWS_SECRET_KEY=your_secret_key
AWS_ENDPOINT=https://s3p.cloud.cyfronet.pl
AWS_REGION=us-west-2
METASTORE_DB_USER=postgres
METASTORE_DB_PASSWORD=postgres
TRINO_ENDPOINT=http://trinodb:8080

▶️ Step 4: Start the Dockerized Environment

To launch the ALICE Data Lakehouse platform, navigate to the root directory of the project and start all services using Docker Compose:

docker-compose build
docker-compose up -d

🔍 Step 5: Verify the Setup

After starting the containers, verify that all services are running and accessible.

✅ Check Running Containers

Run the following command to list all running containers:

docker ps
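
Beyond listing containers, a quick TCP probe can show whether the services accept connections from the host. This is a minimal sketch under assumptions: the port numbers are the usual defaults for Trino, the Spark master, and Postgres, and they are reachable on localhost only if docker-compose.yaml publishes them. The SERVICES mapping and function name are hypothetical.

```python
import socket

# Assumed host ports; they work only if docker-compose.yaml publishes them.
SERVICES = {"trino": 8080, "spark-master": 7077, "postgres": 5432}


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    for name, port in SERVICES.items():
        state = "reachable" if is_port_open("localhost", port) else "unreachable"
        print(f"{name:13s} localhost:{port} {state}")
```

An open port only shows that something is listening; it does not confirm that the service has finished starting.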

✅ Connect to Trino

Open the Trino CLI inside the trinodb container:

docker exec -it trinodb trino

Usage Examples 📊

Trino Queries 🛠️

-- Register a Delta Lake table
CALL delta.system.register_table(
    schema_name => 'default',
    table_name => 'mytable',
    table_location => 's3a://my-bucket/path/'
);

-- Query data
SELECT * FROM delta.default.mytable LIMIT 5;
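
The same query can also be submitted from a script. The sketch below uses only the Python standard library and Trino's HTTP client protocol: it posts the SQL text to /v1/statement and follows nextUri until the results are exhausted. The endpoint http://localhost:8080 (which requires the port to be published to the host), the user name alice, and the function names are assumptions for illustration; no authentication is configured.

```python
import json
import urllib.request


def build_statement_request(sql, endpoint="http://localhost:8080", user="alice"):
    """Build the initial POST to /v1/statement carrying the SQL text."""
    return urllib.request.Request(
        url=endpoint.rstrip("/") + "/v1/statement",
        data=sql.encode("utf-8"),
        headers={"X-Trino-User": user},
        method="POST",
    )


def run_query(sql, endpoint="http://localhost:8080", user="alice"):
    """Submit a query and collect all result rows by following nextUri."""
    rows = []
    request = build_statement_request(sql, endpoint, user)
    while request is not None:
        with urllib.request.urlopen(request, timeout=30) as response:
            payload = json.load(response)
        if "error" in payload:
            raise RuntimeError(payload["error"].get("message", "query failed"))
        rows.extend(payload.get("data", []))
        next_uri = payload.get("nextUri")
        # The query is finished once the response carries no nextUri.
        request = (
            urllib.request.Request(next_uri, headers={"X-Trino-User": user})
            if next_uri
            else None
        )
    return rows
```

Calling run_query("SELECT * FROM delta.default.mytable LIMIT 5") returns the rows as lists of column values. For anything beyond a quick check, a dedicated Trino client library is likely a better fit.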
