
🧬🚀 Genomic Data Storage Architecture

This project is a Proof of Concept (POC) exploring how to integrate distributed storage, blockchain, and bioinformatics to manage genomic data in a secure, scalable, and auditable way.

Genomic datasets are massive, often reaching 📂 terabytes of data, and require storage in systems like HDFS, protection with 🔐 cryptographic proofs, and orchestration through a ⚡ flexible, event-driven microservice architecture.

A key goal of this POC is to demonstrate how blockchain technology can be applied in contexts where data integrity is critical, ensuring that genomic files are tamper-proof, traceable, and verifiable throughout the pipeline.

This POC implements a loosely coupled ecosystem, where each event drives processes of data ingestion 📥, metadata validation ✅, and blockchain notarization ⛓️, guaranteeing integrity, traceability, and resilience while preserving privacy.

It demonstrates the technical feasibility of combining 🧬 bioinformatics, 📡 big data, and ⛓️ blockchain to create a trustworthy genomic data management workflow.

โš ๏ธ Disclaimer

Genomic Data Storage Architecture is a Proof of Concept (POC) project developed for learning, experimentation, and demonstration purposes only. This project was created to explore and integrate a microservices-based architecture for genomic data processing using technologies such as Python, Kafka, Hive, Trino, HDFS, MinIO, Blockchain (Solidity/Web3/Polygon), Flask, and Docker.

It is not intended for production deployment. Any genomic data, identifiers, or blockchain interactions used during development are simulated or anonymized.

While inspired by real-world architectures, all development was carried out independently as a technical exercise.

The project name, logos, and any example data are purely for educational and illustrative purposes and do not represent actual brands.

Key Features 🚀

  • Event-Driven Microservices ⚡
    Loosely coupled services handling ingestion, validation, and notarization of genomic data, ensuring flexibility and scalability.

  • Secure & Auditable Storage 🔐
    Genomic files are stored in HDFS with metadata registered in Hive, providing traceability and auditability.
    Trino is used as a high-performance query layer on top of Hive for fast API responses.

  • Data Integrity with Blockchain ⛓️
    File hashes and critical metadata are notarized on a blockchain, ensuring tamper-proof records and verifiable integrity.

  • Scalable Message Bus 📨
    Uses Kafka for asynchronous communication between microservices, enabling horizontal scaling and resilient pipelines.

  • MinIO Object Storage 🗄️
    Temporary storage for metadata and files before HDFS persistence, providing efficient staging and retrieval.

  • Web API 🌐
    A Flask-based REST API provides secure endpoints to query genomic files.
    Queries to Hive are executed via Trino for higher performance and reduced latency.

  • Deadletter Handling ❌
    Failed events are captured in deadletter topics, enabling monitoring, debugging, and operational transparency.

  • Blockchain-Ready Architecture 🧬
    Designed to illustrate how blockchain can be applied to genomic workflows, highlighting traceability, audit, and integrity.

  • Extensible & Modular 🛠️
    Each microservice can be extended or replaced independently, facilitating experimentation with new technologies or algorithms.

๐ŸŒ Architecture Overview

This Proof of Concept demonstrates how distributed storage, blockchain, and event-driven microservices can be combined to securely manage large-scale genomic data.
The architecture is designed to ensure data integrity, traceability, and resilience, which are critical when handling sensitive genomic information.

Genomic projects generate massive volumes of data (terabytes per genome). This POC illustrates how such data can be ingested, validated, stored, and notarized in a modular pipeline that can scale horizontally, maintain privacy, and provide an auditable trail of every operation.

High-Level Flow

  1. Ingestion 📥

    • Raw genomic files and associated metadata are uploaded to MinIO (object storage).
    • Each upload triggers a Kafka event that starts the processing pipeline (an illustrative event payload is sketched after this list).
  2. Validation ✅

    • Microservices consume Kafka events to extract and validate metadata.
    • Invalid events are sent to deadletter topics, allowing monitoring and error handling.
  3. Persistence 💾

    • Validated files are persisted in HDFS for long-term storage.
    • Metadata is registered in Hive, enabling structured queries.
    • Trino provides a high-performance SQL query layer on top of Hive for API consumption.
  4. Blockchain Notarization ⛓️

    • Critical information such as file hashes and patient identifiers is registered on a blockchain (Web3 / Polygon).
    • This provides a tamper-proof, verifiable record to guarantee integrity and traceability.
  5. Notification & Webhooks 🔔

    • Once files are fully processed, events are published to a completed topic.
    • A microservice listens to this topic and triggers webhooks to notify external systems.
    • Failed events are captured in deadletter topics for operational review.
  6. API Layer 🌐

    • A Flask-based API provides secure endpoints to query genomic files, fetch patient-specific data, and integrate with external applications.
    • Queries are executed via Trino for faster response times compared to direct Hive queries.
    • JWT authentication ensures that only authorized users can access sensitive information.
  7. Monitoring & Deadletter Handling ❌

    • Each stage in the pipeline is monitored via Kafka topics.
    • Failed or invalid events are routed to deadletter queues to ensure traceability and debugging capability.
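
To make the flow concrete, the snippet below sketches a hypothetical event envelope as it might look when the pipeline starts. The field names are illustrative assumptions, not the POC's exact schema; downstream services simply enrich this payload as it moves from topic to topic.

```python
# Illustrative (hypothetical) event envelope for the first stage of the pipeline.
ingest_requested_event = {
    "event_id": "example-event-id",                        # unique event identifier
    "topic": "genomics.ingest.requested",                  # entry point of the pipeline
    "bucket": "genomics-staging",                          # MinIO bucket holding the upload
    "object_key": "uploads/patient_0001/sample.fastq.gz",  # raw genomic file
    "metadata_key": "uploads/patient_0001/metadata.json",  # associated metadata document
}

# Downstream topics enrich the payload step by step:
#   genomics.ingest.validated -> adds the validated metadata fields
#   genomics.ingest.persisted -> adds the HDFS path and SHA256 file hash
#   genomics.hash.notarized   -> adds the blockchain transaction reference
#   genomics.ingest.completed -> final event consumed by the Notifier
```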

Technologies Used 🛠️

This Proof of Concept leverages a mix of big data, blockchain, cloud storage, and microservices technologies to demonstrate secure, auditable genomic data processing.

  • ๐Ÿ Python โ€“ Main programming language for microservices and helpers
  • โ˜• Apache Kafka โ€“ Event streaming and message bus
  • ๐Ÿ—„๏ธ HDFS / Hadoop / YARN โ€“ Distributed storage and resource management
  • ๐Ÿ Apache Hive โ€“ Metadata storage and queries
  • ๐Ÿ”น Trino โ€“ High-performance SQL query engine for Hive data
  • ๐Ÿ—๏ธ MinIO โ€“ Object storage for genomic files and metadata
  • ๐Ÿ” Blockchain & Solidity / Web3 / Polygon โ€“ File hash notarization and integrity proofs
  • ๐Ÿ˜ PostgreSQL โ€“ Persistent relational storage for service metadata
  • ๐Ÿ”‘ Keycloak โ€“ Authentication and JWT-based authorization
  • ๐Ÿณ Docker โ€“ Containerized microservices
  • โšก Event-driven Microservice Architecture โ€“ Loosely coupled services for ingestion, validation, persistence, notarization, and webhook notifications
  • ๐ŸŒ Flask โ€“ API microservices for querying genomic data
  • ๐Ÿ› ๏ธ Supporting tools โ€“ SHA256 hashing, logging, and custom helpers for Kafka, HDFS, Hive, Trino, MinIO, and blockchain

Why Event-Driven Architecture with Kafka? ⚡

For a genomic data platform, flexibility, scalability, and resilience are critical. An event-driven architecture (EDA) orchestrated with Apache Kafka is particularly well-suited for this kind of workload. Here's why:

1. Loose Coupling 🧩

Each microservice focuses on a single responsibility: ingestion, validation, persistence, notarization, or notifications.

  • Services communicate asynchronously via events.
  • Adding or modifying a microservice does not break the rest of the system.

2. Scalability 📈

Genomic datasets can reach terabytes per patient.

  • Kafka allows high-throughput streaming, handling spikes of data efficiently.
  • Individual microservices can scale independently depending on workload.

3. Flexibility & Extensibility 🔧

  • New processes (e.g., analytics, AI pipelines, or additional blockchain checks) can subscribe to existing topics without changing existing services.
  • The system adapts naturally to new data types or processing steps, making it ideal for POC exploration and experimentation.

4. Reliability & Fault Tolerance 💪

  • Kafka persists messages, ensuring no data is lost if a service fails temporarily.
  • Dead-letter topics capture errors for later inspection, enabling robust error handling and auditing.

5. Event Traceability 🕵️‍♂️

  • Every action on a genomic file (ingestion, validation, notarization) is logged as an event.
  • This provides a complete audit trail, which is essential for compliance and scientific reproducibility.

Why Not REST or Synchronous Pipelines? ❌

  • REST: Synchronous request/response chains couple services tightly; failures propagate immediately, reducing resilience.
  • Batch pipelines: Harder to extend and less responsive; not suitable for real-time notifications or scaling with bursts of genomic data.

Why Apache Hive & Trino? 🐝⚡

This POC leverages Apache Hive as the core metadata storage and batch query engine for genomic files, while Trino acts as a high-performance query layer to provide fast, interactive analytical queries.

1. Apache Hive – Scalable Metadata & Batch Analytics 📊

  • Genomic projects generate massive volumes of metadata for patients, samples, and sequencing runs.
  • Hive enables SQL-like queries over large datasets stored in HDFS, making it ideal for batch analytics, reporting, and exploratory analysis.
  • Supports dynamic partitions and schema evolution, allowing the platform to adapt to new data types or fields without major rewrites.
  • Integrates seamlessly with Hadoop/YARN, HDFS, and other big data tools, enabling horizontal scaling on commodity hardware at a low cost.

2. Trino – High-Performance, OLTP-Like Query Layer ⚡

  • Trino sits on top of Hive and provides interactive, low-latency SQL queries on the same metadata.
  • Enables the API layer and analytics dashboards to perform efficient, near-real-time queries, even over terabytes of genomic metadata.
  • Supports federated queries, allowing seamless integration with other data sources if needed.
  • Acts as a bridge between batch-oriented Hive storage and fast analytical access, effectively providing an OLTP-like experience for end-users without changing the underlying big data architecture.

3. Combined Benefits 🧬

  • Researchers and developers can use standard SQL for both exploratory batch analytics (Hive) and interactive queries (Trino).
  • The separation of storage (Hive/HDFS) and query layer (Trino) ensures the system is scalable, flexible, and performant, accommodating both large-scale batch processing and responsive API-driven analytics.
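
For illustration, here is a minimal sketch of how a client (or one of the API services) might query the Hive-backed metadata through Trino from Python, assuming the `trino` client library. The host, user, and the genomic_files column names are assumptions; port 8189 matches the Trino endpoint listed later in this README.

```python
# Minimal sketch: interactive SQL over Hive-backed metadata via Trino.
# Host, user, and column names are assumptions for illustration.
import trino

conn = trino.dbapi.connect(
    host="localhost",
    port=8189,          # Trino port exposed by this POC
    user="analyst",
    catalog="hive",
    schema="default",
)

cur = conn.cursor()
cur.execute(
    "SELECT file_name, patient_id, hdfs_path, sha256 "
    "FROM genomic_files LIMIT 10"
)
for row in cur.fetchall():
    print(row)
```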

📚 Microservices Overview

This project implements a modular microservices architecture for processing genomic data.
Each service is focused on a single responsibility, communicating asynchronously via Kafka topics.
This design allows for scalability, fault tolerance, and easy maintenance.

The workflow follows a clear pipeline of data processing:

  1. Ingestion & Validation – Metadata is validated before further processing.
  2. File Processing & Persistence – Genomic files are downloaded, hashed, and stored in HDFS.
  3. Hash Notarization – File integrity is notarized in a distributed ledger.
  4. Metadata Registration – File metadata is recorded in Hive for analytics.
  5. Notifier – Sends ingestion results to external systems.
  6. Deadletter – Captures and persists failed events.

Each service handles deadletter events for errors, ensuring that failed messages do not block the pipeline.

📥 1. Genomics Ingest Validator

Validates incoming genomic metadata from MinIO before it enters the processing pipeline. Ensures that all required fields are correct and complete.

Technologies & Tools:

  • ๐Ÿ Python
  • โ˜• Kafka (consume genomics.ingest.requested, produce genomics.ingest.validated)
  • ๐Ÿ—๏ธ MinIO (metadata storage)
  • โœ… Validation logic (custom helpers)

Flow:

  1. Consume event from Kafka input topic.
  2. Fetch metadata.json from MinIO.
  3. Validate fields and format.
  4. Publish validated event to genomics.ingest.validated or send to Deadletter if invalid.
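
A minimal sketch of this loop, assuming kafka-python and the MinIO Python SDK; the bucket/object keys, required metadata fields, credentials, and the deadletter topic name are illustrative placeholders rather than the POC's exact schema:

```python
# Minimal validator sketch (assumed field names and credentials).
import json
from kafka import KafkaConsumer, KafkaProducer
from minio import Minio

REQUIRED_FIELDS = {"patient_id", "sample_id", "file_name"}  # assumed metadata fields

consumer = KafkaConsumer(
    "genomics.ingest.requested",
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
minio_client = Minio("minio:9000", access_key="minio-access-key",
                     secret_key="minio-secret-key", secure=False)

for message in consumer:
    event = message.value
    try:
        # Fetch metadata.json from the staging bucket referenced by the event.
        response = minio_client.get_object(event["bucket"], event["metadata_key"])
        metadata = json.loads(response.read())
        missing = REQUIRED_FIELDS - metadata.keys()
        if missing:
            raise ValueError(f"missing metadata fields: {missing}")
        producer.send("genomics.ingest.validated", {**event, "metadata": metadata})
    except Exception as exc:
        # Invalid or unreadable events go to a deadletter topic (name assumed).
        producer.send("genomics.ingest.requested.deadletter", {**event, "error": str(exc)})
```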

💾 2. Genomics Ingest Processor

Downloads genomic files, computes hashes, uploads to HDFS, and prepares them for further processing.

Technologies & Tools:

  • ๐Ÿ Python
  • โ˜• Kafka (consume genomics.ingest.validated, produce genomics.ingest.persisted)
  • ๐Ÿ—๏ธ MinIO (file storage)
  • ๐Ÿ—„๏ธ HDFS (persistent storage)
  • ๐Ÿ”’ SHA256 hashing

Flow:

  1. Consume validated event.
  2. Download file from MinIO.
  3. Compute SHA256 hash.
  4. Upload file to HDFS.
  5. Delete file from MinIO staging.
  6. Publish persisted event or send to Deadletter on failure.
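
A minimal sketch of the persistence step, assuming the MinIO SDK and the `hdfs` WebHDFS client; the file names, HDFS paths, and WebHDFS endpoint are illustrative assumptions:

```python
# Minimal processor sketch: stage download, SHA256 hash, HDFS upload, staging cleanup.
import hashlib

from hdfs import InsecureClient
from minio import Minio

minio_client = Minio("minio:9000", access_key="minio-access-key",
                     secret_key="minio-secret-key", secure=False)
hdfs_client = InsecureClient("http://namenode:9870", user="hdfs")  # WebHDFS endpoint (assumed)


def persist_file(event: dict) -> dict:
    """Download from MinIO staging, hash, upload to HDFS, then clean up staging."""
    local_path = f"/tmp/{event['file_name']}"
    minio_client.fget_object(event["bucket"], event["object_key"], local_path)

    # Compute the SHA256 hash that will later be notarized on-chain.
    sha256 = hashlib.sha256()
    with open(local_path, "rb") as fh:
        for chunk in iter(lambda: fh.read(8192), b""):
            sha256.update(chunk)

    hdfs_path = f"/genomics/{event['metadata']['patient_id']}/{event['file_name']}"
    hdfs_client.upload(hdfs_path, local_path, overwrite=True)

    # Remove the staged object once the file is safely persisted in HDFS.
    minio_client.remove_object(event["bucket"], event["object_key"])

    return {**event, "hdfs_path": hdfs_path, "sha256": sha256.hexdigest()}
```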

๐Ÿ“ 3. Genomics Hash Notarizer

Registers file hashes in a distributed ledger to notarize genomic files, ensuring integrity and traceability.

Technologies & Tools:

  • ๐Ÿ Python
  • โ˜• Kafka (consume genomics.ingest.persisted, produce genomics.hash.notarized)
  • ๐Ÿ”— Web3 / Smart Contract interface
  • ๐Ÿ”’ SHA256 hashing for user/file IDs

Flow:

  1. Consume persisted event from Kafka.
  2. Compute SHA256 hash of patient ID.
  3. Register file hash in the notary contract.
  4. Publish notarized event or send to Deadletter on failure.
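
A minimal sketch of the notarization step using web3.py; the RPC endpoint, contract address, ABI file, private key, and the notarize() function signature are illustrative assumptions, not the POC's actual contract interface:

```python
# Minimal notarizer sketch (contract interface assumed for illustration).
import hashlib
import json

from web3 import Web3

w3 = Web3(Web3.HTTPProvider("https://rpc-amoy.polygon.technology"))  # example Polygon Amoy RPC
contract = w3.eth.contract(
    address="0x0000000000000000000000000000000000000000",   # placeholder contract address
    abi=json.load(open("GenomicNotary.abi.json")),           # ABI document (assumed file name)
)
account = w3.eth.account.from_key("0x" + "0" * 64)            # placeholder private key


def notarize(event: dict) -> str:
    # Pseudonymize the patient identifier before it touches the chain.
    patient_hash = hashlib.sha256(event["metadata"]["patient_id"].encode()).hexdigest()

    tx = contract.functions.notarize(patient_hash, event["sha256"]).build_transaction({
        "from": account.address,
        "nonce": w3.eth.get_transaction_count(account.address),
    })
    signed = account.sign_transaction(tx)
    # Note: older web3.py versions expose this as signed.rawTransaction.
    tx_hash = w3.eth.send_raw_transaction(signed.raw_transaction)
    return tx_hash.hex()
```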

📊 4. Genomics Metadata Registry

Registers genomic file metadata into a Hive table for querying and downstream analytics.

Technologies & Tools:

  • ๐Ÿ Python
  • ☕ Kafka (consume genomics.hash.notarized, produce genomics.ingest.completed)
  • ๐Ÿ Hive (metadata storage)
  • ๐Ÿงฎ UUID & timestamps for record tracking

Flow:

  1. Consume notarized event from Kafka.
  2. Build Hive INSERT query with metadata.
  3. Execute query.
  4. Publish registered event or send to Deadletter on failure.
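
A minimal sketch of the Hive insert, assuming PyHive; the genomic_files column layout shown here is an illustrative assumption:

```python
# Minimal metadata-registry sketch (table columns assumed for illustration).
import uuid
from datetime import datetime, timezone

from pyhive import hive

conn = hive.Connection(host="hive-server", port=10000, username="hive")
cursor = conn.cursor()


def register_metadata(event: dict) -> None:
    record_id = str(uuid.uuid4())
    registered_at = datetime.now(timezone.utc).isoformat()
    cursor.execute(
        "INSERT INTO genomic_files "
        "(id, patient_id, file_name, hdfs_path, sha256, tx_hash, registered_at) "
        "VALUES (%s, %s, %s, %s, %s, %s, %s)",
        (
            record_id,
            event["metadata"]["patient_id"],
            event["file_name"],
            event["hdfs_path"],
            event["sha256"],
            event["tx_hash"],
            registered_at,
        ),
    )
```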

📡 5. Genomics Notifier

Receives completed ingestion events and sends them to external systems via webhook.

Technologies & Tools:

  • ๐Ÿ Python
  • โ˜• Kafka (consume genomics.ingest.completed)
  • ๐ŸŒ Webhooks / HTTP requests

Flow:

  1. Consume completed ingestion event from Kafka.
  2. Send event payload to configured webhook URL.
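
A minimal sketch of the notifier, assuming kafka-python and the requests library; the webhook URL is a placeholder and retry/backoff handling is omitted for brevity:

```python
# Minimal notifier sketch: forward completed-ingestion events to a webhook.
import json
import os

import requests
from kafka import KafkaConsumer

WEBHOOK_URL = os.environ.get("WEBHOOK_URL", "https://example.org/genomics/webhook")

consumer = KafkaConsumer(
    "genomics.ingest.completed",
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    # Forward the completed-ingestion payload to the external system.
    response = requests.post(WEBHOOK_URL, json=message.value, timeout=10)
    response.raise_for_status()
```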

๐Ÿ—‘๏ธ 6. Genomics Deadletter

Central service that manages failed messages from any stage of the pipeline.
Its main purpose is to isolate errors, store them reliably, and allow platform engineers to inspect and reprocess them later.

Technologies & Tools:

  • ๐Ÿ Python
  • โ˜• Kafka (consume .deadletter topic, persist events)
  • ๐Ÿ“‘ Structured logging for observability

Flow:

  1. Consume events from .deadletter Kafka topic.
  2. Persist event to a Deadletter store (MongoDB) for traceability.
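
A minimal sketch of the deadletter sink, assuming kafka-python and pymongo; the topic pattern, database, and collection names are illustrative:

```python
# Minimal deadletter sketch: persist failed events from any *.deadletter topic to MongoDB.
import json
from datetime import datetime, timezone

from kafka import KafkaConsumer
from pymongo import MongoClient

consumer = KafkaConsumer(
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
consumer.subscribe(pattern=".*\\.deadletter")  # subscribe to every *.deadletter topic

collection = MongoClient("mongodb://mongo:27017")["genomics"]["deadletter_events"]

for message in consumer:
    # Persist the failed event together with its origin topic for later inspection.
    collection.insert_one({
        "topic": message.topic,
        "payload": message.value,
        "received_at": datetime.now(timezone.utc),
    })
```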

๐ŸŒ Genomics API Microservices (Facade with Trino)

The Genomics API services provide a unified, high-performance interface to query genomic metadata.
While the underlying storage is managed by Hive on HDFS, Trino enables interactive, low-latency SQL queries, offering an OLTP-like experience for API clients.

Key Features & Tech:

  • ๐Ÿ Python + Flask
  • ๐Ÿ”‘ Keycloak for authentication & JWT validation
  • โšก Trino for fast, interactive query access
  • โš–๏ธ HAProxy for load balancing across multiple replicas

Some Endpoints:

  • POST /api/genomics/login → Authenticate user and return JWT
  • GET /api/genomics/patients/<patient_id>/files → Retrieve files for a specific patient using Trino
  • GET /api/genomics/genomic-files?limit=10&offset=0 → Retrieve all genomic files with pagination via Trino
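
For illustration, a minimal sketch of one such endpoint, assuming Flask and the `trino` client. The JWT check is reduced to a placeholder (the POC validates tokens against Keycloak), and the internal Trino host/port and table/column names are assumptions:

```python
# Minimal facade-endpoint sketch (JWT check and table layout are placeholders).
import trino
from flask import Flask, abort, jsonify, request

app = Flask(__name__)


def require_jwt():
    # In the POC the bearer token is validated against Keycloak; here we only check presence.
    if not request.headers.get("Authorization", "").startswith("Bearer "):
        abort(401)


@app.route("/api/genomics/patients/<patient_id>/files", methods=["GET"])
def patient_files(patient_id):
    require_jwt()
    conn = trino.dbapi.connect(host="trino", port=8080, user="genomics-api",
                               catalog="hive", schema="default")  # internal port assumed
    cur = conn.cursor()
    # Parameterized query against the Hive-backed metadata table (name assumed).
    cur.execute("SELECT file_name, hdfs_path, sha256 FROM genomic_files WHERE patient_id = ?",
                (patient_id,))
    files = [dict(zip(("file_name", "hdfs_path", "sha256"), row)) for row in cur.fetchall()]
    return jsonify({"patient_id": patient_id, "files": files})
```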

Deployment:

  • 3 replicas for high availability
  • Load-balanced via HAProxy (9008 for API, 1937 for HAProxy stats)
  • Shared helper code mounted from ./common
  • Connected to genomic-data-storage-net

Notes:

  • The facade does not process raw genomic data; it queries persisted metadata from the pipeline.
  • Using Trino on top of Hive ensures efficient, interactive queries, enabling fast response times for analytics dashboards, reporting, and API requests.

Screenshots 📷

Here are some screenshots that demonstrate the functionality of the Genomic Data Storage Architecture:

  • Hive Server admin page
  • Hadoop YARN admin page
  • Hadoop HDFS admin page
  • Kafka admin page
  • HAProxy admin page
  • Docker Compose
  • Keycloak admin page
  • Trino admin page

Rake Tasks 🧬

This project leverages Rake to manage infrastructure, data ingestion, Hive schemas, MinIO uploads, and blockchain DApp tasks. Below is a table summarizing the main tasks and their purposes.

Task – Description
genomic_data_storage:deploy – Deploys the full architecture and launches all services and daemons.
genomic_data_storage:undeploy – Undeploys the architecture and stops all containers.
genomic_data_storage:start – Starts all containers after checking Docker and deployment files.
genomic_data_storage:stop – Stops and removes running containers.
genomic_data_storage:status – Shows the status of all containers.
genomic_data_storage:clean_environment – Cleans Docker environment by pruning images and volumes.
genomic_data_storage:check_docker – Checks if Docker and Docker Compose are installed and accessible.
genomic_data_storage:login – Authenticates with Docker credentials.
genomic_data_storage:check_deployment_file – Verifies that docker-compose.yml exists.
genomic_data_storage:build_and_push_genomic_api_image – Builds and pushes the Genomic API Docker image to DockerHub.

Hive Tasks

Task – Description
genomic_data_storage:hive:setup_metadata_registry_schema – Copies and applies Hive SQL schema for metadata registry.
genomic_data_storage:hive:check_metadata_registry_schema – Checks if genomic_files table exists in Hive.
genomic_data_storage:hive:count_genomic_files_data – Counts the records in the genomic_files table.
genomic_data_storage:hive:display_genomic_files_data – Displays all data in the genomic_files table.

MinIO Tasks

Task – Description
genomic_data_storage:minio:upload_genomic_zip['./path/to/zip'] – Uploads a genomic dataset ZIP file to MinIO.
genomic_data_storage:minio:upload_contract_abi – Uploads the blockchain contract ABI JSON to MinIO.
genomic_data_storage:minio:delete_contract_abi – Deletes the contract ABI JSON from MinIO.
genomic_data_storage:minio:check_contract_abi – Checks if the contract ABI JSON exists in MinIO.

DApp Tasks

Task – Description
genomic_data_storage:dapp:install_dependencies – Installs dependencies for the GenomicNotaryDApp.
genomic_data_storage:dapp:run_tests – Runs Hardhat tests for the GenomicNotaryDApp.
genomic_data_storage:dapp:deploy_contracts – Deploys smart contracts to the Amoy network using Hardhat.

Exposed Services in the Genomic Data POC

This table summarizes the main services and endpoints exposed in the Genomic Data Storage Proof of Concept (POC). It includes the services that can be accessed directly (via browser, API, or client), their ports, and a brief description of their role. Internal microservices and background workers are listed without exposed ports as they communicate via the internal network or Kafka.

Service / Container – Exposed Port(s) – Role / Description
HDFS NameNode (namenode) – 8089 – HDFS web interface for monitoring the NameNode
YARN ResourceManager (resourcemanager) – 8081 – Web UI for resource and job management in Hadoop
Hive Metastore (hive-metastore) – 9083 – Thrift service for Hive metadata queries
HiveServer2 (hive-server) – 10000, 10002 – JDBC/ODBC/Beeline query service and Hive web UI
MinIO HAProxy (genomic_staging_minio_haproxy) – 9000, 36305, 1936 – Unified access to the MinIO object storage cluster for staging, plus HAProxy dashboard
Kafka (kafka) – 9092 – Kafka broker for event streaming
ZooKeeper (zookeeper) – 2181 – Coordination service for distributed systems (used by Kafka)
AKHQ (akhq) – 8088 – Web UI to explore and manage Kafka topics
Trino (trino) – 8189 – Distributed SQL query engine for fast analytics over Hive/HDFS
PostgreSQL (postgres) – 5432 – Relational database for microservices and Keycloak
pgAdmin (pgadmin) – 8085 – Web interface to manage PostgreSQL
Keycloak (keycloak) – 8080 – Identity management and JWT authentication
Genomics API Service replicas – 5000 (internal) – REST API to query genomic metadata with Keycloak authentication
Genomics API HAProxy – 9008, 1937 – Load balancer distributing requests across the API replicas
Genomics Ingest Validator – internal only – Validates genomic data before ingestion (consumes Kafka events)
Genomics Ingest Processor – internal only – Processes validated genomic data and publishes to Kafka
Genomics Hash Notarizer – internal only – Generates file hashes and ensures data integrity
Genomics Metadata Registry – internal only – Manages and updates genomic metadata
Genomics Notifier – internal only – Sends notifications via webhooks when genomic data processing is complete
Genomics Deadletter Service – internal only – Handles failed messages and ingestion errors
MongoDB (mongo) – 27017 – Stores deadletter messages and temporary data

โš ๏ธ Disclaimer

Genomics Data Storage Architecture is a Proof of Concept (POC) project developed for learning, experimentation, and demonstration purposes only. This project was created to explore and integrate a microservices-based architecture for genomic data processing using technologies such as Python, Kafka, Hive, Trino, HDFS, MinIO, Blockchain (Solidity/Web3/Polygon), Flask, and Docker.

It is not intended for production deployment. Any genomic data, identifiers, or blockchain interactions used during development are simulated or anonymized.

While inspired by real-world architectures, all development was carried out independently as a technical exercise.

The project name, logos, and any example data are purely for educational and illustrative purposes and do not represent actual brands.

๐Ÿ“ License

This project is licensed under the MIT License.
