
🧬🚀 Genomic Data Storage Architecture

This project is a Proof of Concept (POC) exploring how to integrate distributed storage, blockchain, and bioinformatics to manage genomic data in a secure, scalable, and auditable way.

Genomic datasets are massive, often reaching 📂 terabytes of data, and require storage in systems like HDFS, protection with 🔐 cryptographic proofs, and orchestration through a ⚡ flexible, event-driven microservice architecture.

A key goal of this POC is to demonstrate how blockchain technology can be applied in contexts where data integrity is critical, ensuring that genomic files are tamper-proof, traceable, and verifiable throughout the pipeline.

This POC implements a loosely coupled ecosystem, where each event drives processes of data ingestion 📥, metadata validation ✅, and blockchain notarization ⛓️, guaranteeing integrity, traceability, and resilience while preserving privacy.

It demonstrates the technical feasibility of combining 🧬 bioinformatics, 📡 big data, and ⛓️ blockchain to create a trustworthy genomic data management workflow.

โš ๏ธ Disclaimer

Genomic Data Storage Architecture is a Proof of Concept (POC) project developed for learning, experimentation, and demonstration purposes only. This project was created to explore and integrate a microservices-based architecture for genomic data processing using technologies such as Python, Kafka, Hive, Trino, HDFS, MinIO, Blockchain (Solidity/Web3/Polygon), Flask, and Docker.

It is not intended for production deployment. Any genomic data, identifiers, or blockchain interactions used during development are simulated or anonymized.

While inspired by real-world architectures, all development was carried out independently as a technical exercise.

The project name, logos, and any example data are purely for educational and illustrative purposes and do not represent actual brands.

Key Features 🚀

  • Event-Driven Microservices ⚡
    Loosely coupled services handling ingestion, validation, and notarization of genomic data, ensuring flexibility and scalability.

  • Secure & Auditable Storage 🔐
    Genomic files are stored in HDFS with metadata registered in Hive, providing traceability and auditability.
    Trino is used as a high-performance query layer on top of Hive for fast API responses.

  • Data Integrity with Blockchain ⛓️
    File hashes and critical metadata are notarized on a blockchain, ensuring tamper-proof records and verifiable integrity.

  • Scalable Message Bus 📨
    Uses Kafka for asynchronous communication between microservices, enabling horizontal scaling and resilient pipelines.

  • MinIO Object Storage 🗄️
    Temporary storage for metadata and files before HDFS persistence, providing efficient staging and retrieval.

  • Web API 🌐
    A Flask-based REST API provides secure endpoints to query genomic files.
    Queries to Hive are executed via Trino for higher performance and reduced latency.

  • Deadletter Handling ❌
    Failed events are captured in deadletter topics, enabling monitoring, debugging, and operational transparency.

  • Blockchain-Ready Architecture 🧬
    Designed to illustrate how blockchain can be applied to genomic workflows, highlighting traceability, audit, and integrity.

  • Extensible & Modular 🛠️
    Each microservice can be extended or replaced independently, facilitating experimentation with new technologies or algorithms.

๐ŸŒ Architecture Overview

This Proof of Concept demonstrates how distributed storage, blockchain, and event-driven microservices can be combined to securely manage large-scale genomic data.
The architecture is designed to ensure data integrity, traceability, and resilience, which are critical when handling sensitive genomic information.

Genomic projects generate massive volumes of data (terabytes per genome). This POC illustrates how such data can be ingested, validated, stored, and notarized in a modular pipeline that can scale horizontally, maintain privacy, and provide an auditable trail of every operation.

High-Level Flow

  1. Ingestion 📥

    • Raw genomic files and associated metadata are uploaded to MinIO (object storage).
    • Each upload triggers a Kafka event that starts the processing pipeline (an illustrative event payload is sketched after this list).
  2. Validation ✅

    • Microservices consume Kafka events to extract and validate metadata.
    • Invalid events are sent to deadletter topics, allowing monitoring and error handling.
  3. Persistence 💾

    • Validated files are persisted in HDFS for long-term storage.
    • Metadata is registered in Hive, enabling structured queries.
    • Trino provides a high-performance SQL query layer on top of Hive for API consumption.
  4. Blockchain Notarization ⛓️

    • Critical information such as file hashes and patient identifiers is registered on a blockchain (Web3 / Polygon).
    • This provides a tamper-proof, verifiable record to guarantee integrity and traceability.
  5. Notification & Webhooks 🔔

    • Once files are fully processed, events are published to a completed topic.
    • A microservice listens to this topic and triggers webhooks to notify external systems.
    • Failed events are captured in deadletter topics for operational review.
  6. API Layer 🌐

    • A Flask-based API provides secure endpoints to query genomic files, fetch patient-specific data, and integrate with external applications.
    • Queries are executed via Trino for faster response times compared to direct Hive queries.
    • JWT authentication ensures that only authorized users can access sensitive information.
  7. Monitoring & Deadletter Handling ❌

    • Each stage in the pipeline is monitored via Kafka topics.
    • Failed or invalid events are routed to deadletter queues to ensure traceability and debugging capability.
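
To make the flow concrete, the snippet below sketches a hypothetical event envelope as it might look when the pipeline starts. The field names are illustrative assumptions, not the POC's exact schema; downstream services simply enrich this payload as it moves from topic to topic.

```python
# Illustrative (hypothetical) event envelope for the first stage of the pipeline.
ingest_requested_event = {
    "event_id": "example-event-id",                        # unique event identifier
    "topic": "genomics.ingest.requested",                  # entry point of the pipeline
    "bucket": "genomics-staging",                          # MinIO bucket holding the upload
    "object_key": "uploads/patient_0001/sample.fastq.gz",  # raw genomic file
    "metadata_key": "uploads/patient_0001/metadata.json",  # associated metadata document
}

# Downstream topics enrich the payload step by step:
#   genomics.ingest.validated -> adds the validated metadata fields
#   genomics.ingest.persisted -> adds the HDFS path and SHA256 file hash
#   genomics.hash.notarized   -> adds the blockchain transaction reference
#   genomics.ingest.completed -> final event consumed by the Notifier
```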

Technologies Used 🛠️

This Proof of Concept leverages a mix of big data, blockchain, cloud storage, and microservices technologies to demonstrate secure, auditable genomic data processing.

  • ๐Ÿ Python โ€“ Main programming language for microservices and helpers
  • โ˜• Apache Kafka โ€“ Event streaming and message bus
  • ๐Ÿ—„๏ธ HDFS / Hadoop / YARN โ€“ Distributed storage and resource management
  • ๐Ÿ Apache Hive โ€“ Metadata storage and queries
  • ๐Ÿ”น Trino โ€“ High-performance SQL query engine for Hive data
  • ๐Ÿ—๏ธ MinIO โ€“ Object storage for genomic files and metadata
  • ๐Ÿ” Blockchain & Solidity / Web3 / Polygon โ€“ File hash notarization and integrity proofs
  • ๐Ÿ˜ PostgreSQL โ€“ Persistent relational storage for service metadata
  • ๐Ÿ”‘ Keycloak โ€“ Authentication and JWT-based authorization
  • ๐Ÿณ Docker โ€“ Containerized microservices
  • โšก Event-driven Microservice Architecture โ€“ Loosely coupled services for ingestion, validation, persistence, notarization, and webhook notifications
  • ๐ŸŒ Flask โ€“ API microservices for querying genomic data
  • ๐Ÿ› ๏ธ Supporting tools โ€“ SHA256 hashing, logging, and custom helpers for Kafka, HDFS, Hive, Trino, MinIO, and blockchain

Why Event-Driven Architecture with Kafka? ⚡

For a genomic data platform, flexibility, scalability, and resilience are critical. An event-driven architecture (EDA) orchestrated with Apache Kafka is particularly well-suited for this kind of workload. Here's why:

1. Loose Coupling 🧩

Each microservice focuses on a single responsibility: ingestion, validation, persistence, notarization, or notifications.

  • Services communicate asynchronously via events.
  • Adding or modifying a microservice does not break the rest of the system.

2. Scalability 📈

Genomic datasets can reach terabytes per patient.

  • Kafka allows high-throughput streaming, handling spikes of data efficiently.
  • Individual microservices can scale independently depending on workload.

3. Flexibility & Extensibility 🔧

  • New processes (e.g., analytics, AI pipelines, or additional blockchain checks) can subscribe to existing topics without changing existing services.
  • The system adapts naturally to new data types or processing steps, making it ideal for POC exploration and experimentation.

4. Reliability & Fault Tolerance 💪

  • Kafka persists messages, ensuring no data is lost if a service fails temporarily.
  • Dead-letter topics capture errors for later inspection, enabling robust error handling and auditing.

5. Event Traceability 🕵️‍♂️

  • Every action on a genomic file (ingestion, validation, notarization) is logged as an event.
  • This provides a complete audit trail, which is essential for compliance and scientific reproducibility.

Why Not REST or Synchronous Pipelines? ❌

  • REST: Synchronous request/response chains couple services tightly; failures propagate immediately, reducing resilience.
  • Batch pipelines: Harder to extend and less responsive; not suitable for real-time notifications or scaling with bursts of genomic data.

Why Apache Hive & Trino? 🐝⚡

This POC leverages Apache Hive as the core metadata storage and batch query engine for genomic files, while Trino acts as a high-performance query layer to provide fast, interactive analytical queries.

1. Apache Hive – Scalable Metadata & Batch Analytics 📊

  • Genomic projects generate massive volumes of metadata for patients, samples, and sequencing runs.
  • Hive enables SQL-like queries over large datasets stored in HDFS, making it ideal for batch analytics, reporting, and exploratory analysis.
  • Supports dynamic partitions and schema evolution, allowing the platform to adapt to new data types or fields without major rewrites.
  • Integrates seamlessly with Hadoop/YARN, HDFS, and other big data tools, enabling horizontal scaling on commodity hardware at a low cost.

2. Trino – High-Performance, OLTP-Like Query Layer ⚡

  • Trino sits on top of Hive and provides interactive, low-latency SQL queries on the same metadata.
  • Enables the API layer and analytics dashboards to perform efficient, near-real-time queries, even over terabytes of genomic metadata.
  • Supports federated queries, allowing seamless integration with other data sources if needed.
  • Acts as a bridge between batch-oriented Hive storage and fast analytical access, effectively providing an OLTP-like experience for end-users without changing the underlying big data architecture.

3. Combined Benefits 🧬

  • Researchers and developers can use standard SQL for both exploratory batch analytics (Hive) and interactive queries (Trino).
  • The separation of storage (Hive/HDFS) and query layer (Trino) ensures the system is scalable, flexible, and performant, accommodating both large-scale batch processing and responsive API-driven analytics.
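
For illustration, here is a minimal sketch of how a client (or one of the API services) might query the Hive-backed metadata through Trino from Python, assuming the `trino` client library. The host, user, and the genomic_files column names are assumptions; port 8189 matches the Trino endpoint listed later in this README.

```python
# Minimal sketch: interactive SQL over Hive-backed metadata via Trino.
# Host, user, and column names are assumptions for illustration.
import trino

conn = trino.dbapi.connect(
    host="localhost",
    port=8189,          # Trino port exposed by this POC
    user="analyst",
    catalog="hive",
    schema="default",
)

cur = conn.cursor()
cur.execute(
    "SELECT file_name, patient_id, hdfs_path, sha256 "
    "FROM genomic_files LIMIT 10"
)
for row in cur.fetchall():
    print(row)
```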

📚 Microservices Overview

This project implements a modular microservices architecture for processing genomic data.
Each service is focused on a single responsibility, communicating asynchronously via Kafka topics.
This design allows for scalability, fault tolerance, and easy maintenance.

The workflow follows a clear pipeline of data processing:

  1. Ingestion & Validation – Metadata is validated before further processing.
  2. File Processing & Persistence – Genomic files are downloaded, hashed, and stored in HDFS.
  3. Hash Notarization – File integrity is notarized in a distributed ledger.
  4. Metadata Registration – File metadata is recorded in Hive for analytics.
  5. Notifier – Sends ingestion results to external systems.
  6. Deadletter – Captures and persists failed events.

Each service handles deadletter events for errors, ensuring that failed messages do not block the pipeline.

📥 1. Genomics Ingest Validator

Validates incoming genomic metadata from MinIO before it enters the processing pipeline. Ensures that all required fields are correct and complete.

Technologies & Tools:

  • ๐Ÿ Python
  • โ˜• Kafka (consume genomics.ingest.requested, produce genomics.ingest.validated)
  • ๐Ÿ—๏ธ MinIO (metadata storage)
  • โœ… Validation logic (custom helpers)

Flow:

  1. Consume event from Kafka input topic.
  2. Fetch metadata.json from MinIO.
  3. Validate fields and format.
  4. Publish validated event to genomics.ingest.validated or send to Deadletter if invalid.
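
A minimal sketch of this loop, assuming kafka-python and the MinIO Python SDK; the bucket/object keys, required metadata fields, credentials, and the deadletter topic name are illustrative placeholders rather than the POC's exact schema:

```python
# Minimal validator sketch (assumed field names and credentials).
import json
from kafka import KafkaConsumer, KafkaProducer
from minio import Minio

REQUIRED_FIELDS = {"patient_id", "sample_id", "file_name"}  # assumed metadata fields

consumer = KafkaConsumer(
    "genomics.ingest.requested",
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
minio_client = Minio("minio:9000", access_key="minio-access-key",
                     secret_key="minio-secret-key", secure=False)

for message in consumer:
    event = message.value
    try:
        # Fetch metadata.json from the staging bucket referenced by the event.
        response = minio_client.get_object(event["bucket"], event["metadata_key"])
        metadata = json.loads(response.read())
        missing = REQUIRED_FIELDS - metadata.keys()
        if missing:
            raise ValueError(f"missing metadata fields: {missing}")
        producer.send("genomics.ingest.validated", {**event, "metadata": metadata})
    except Exception as exc:
        # Invalid or unreadable events go to a deadletter topic (name assumed).
        producer.send("genomics.ingest.requested.deadletter", {**event, "error": str(exc)})
```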

💾 2. Genomics Ingest Processor

Downloads genomic files, computes hashes, uploads to HDFS, and prepares them for further processing.

Technologies & Tools:

  • ๐Ÿ Python
  • โ˜• Kafka (consume genomics.ingest.validated, produce genomics.ingest.persisted)
  • ๐Ÿ—๏ธ MinIO (file storage)
  • ๐Ÿ—„๏ธ HDFS (persistent storage)
  • ๐Ÿ”’ SHA256 hashing

Flow:

  1. Consume validated event.
  2. Download file from MinIO.
  3. Compute SHA256 hash.
  4. Upload file to HDFS.
  5. Delete file from MinIO staging.
  6. Publish persisted event or send to Deadletter on failure.
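
A minimal sketch of the persistence step, assuming the MinIO SDK and the `hdfs` WebHDFS client; the file names, HDFS paths, and WebHDFS endpoint are illustrative assumptions:

```python
# Minimal processor sketch: stage download, SHA256 hash, HDFS upload, staging cleanup.
import hashlib

from hdfs import InsecureClient
from minio import Minio

minio_client = Minio("minio:9000", access_key="minio-access-key",
                     secret_key="minio-secret-key", secure=False)
hdfs_client = InsecureClient("http://namenode:9870", user="hdfs")  # WebHDFS endpoint (assumed)


def persist_file(event: dict) -> dict:
    """Download from MinIO staging, hash, upload to HDFS, then clean up staging."""
    local_path = f"/tmp/{event['file_name']}"
    minio_client.fget_object(event["bucket"], event["object_key"], local_path)

    # Compute the SHA256 hash that will later be notarized on-chain.
    sha256 = hashlib.sha256()
    with open(local_path, "rb") as fh:
        for chunk in iter(lambda: fh.read(8192), b""):
            sha256.update(chunk)

    hdfs_path = f"/genomics/{event['metadata']['patient_id']}/{event['file_name']}"
    hdfs_client.upload(hdfs_path, local_path, overwrite=True)

    # Remove the staged object once the file is safely persisted in HDFS.
    minio_client.remove_object(event["bucket"], event["object_key"])

    return {**event, "hdfs_path": hdfs_path, "sha256": sha256.hexdigest()}
```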

๐Ÿ“ 3. Genomics Hash Notarizer

Registers file hashes in a distributed ledger to notarize genomic files, ensuring integrity and traceability.

Technologies & Tools:

  • ๐Ÿ Python
  • โ˜• Kafka (consume genomics.ingest.persisted, produce genomics.hash.notarized)
  • ๐Ÿ”— Web3 / Smart Contract interface
  • ๐Ÿ”’ SHA256 hashing for user/file IDs

Flow:

  1. Consume persisted event from Kafka.
  2. Compute SHA256 hash of patient ID.
  3. Register file hash in the notary contract.
  4. Publish notarized event or send to Deadletter on failure.
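
A minimal sketch of the notarization step using web3.py; the RPC endpoint, contract address, ABI file, private key, and the notarize() function signature are illustrative assumptions, not the POC's actual contract interface:

```python
# Minimal notarizer sketch (contract interface assumed for illustration).
import hashlib
import json

from web3 import Web3

w3 = Web3(Web3.HTTPProvider("https://rpc-amoy.polygon.technology"))  # example Polygon Amoy RPC
contract = w3.eth.contract(
    address="0x0000000000000000000000000000000000000000",   # placeholder contract address
    abi=json.load(open("GenomicNotary.abi.json")),           # ABI document (assumed file name)
)
account = w3.eth.account.from_key("0x" + "0" * 64)            # placeholder private key


def notarize(event: dict) -> str:
    # Pseudonymize the patient identifier before it touches the chain.
    patient_hash = hashlib.sha256(event["metadata"]["patient_id"].encode()).hexdigest()

    tx = contract.functions.notarize(patient_hash, event["sha256"]).build_transaction({
        "from": account.address,
        "nonce": w3.eth.get_transaction_count(account.address),
    })
    signed = account.sign_transaction(tx)
    # Note: older web3.py versions expose this as signed.rawTransaction.
    tx_hash = w3.eth.send_raw_transaction(signed.raw_transaction)
    return tx_hash.hex()
```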

📊 4. Genomics Metadata Registry

Registers genomic file metadata into a Hive table for querying and downstream analytics.

Technologies & Tools:

  • ๐Ÿ Python
  • ☕ Kafka (consume genomics.hash.notarized, produce genomics.ingest.completed)
  • ๐Ÿ Hive (metadata storage)
  • ๐Ÿงฎ UUID & timestamps for record tracking

Flow:

  1. Consume notarized event from Kafka.
  2. Build Hive INSERT query with metadata.
  3. Execute query.
  4. Publish registered event or send to Deadletter on failure.
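
A minimal sketch of the Hive insert, assuming PyHive; the genomic_files column layout shown here is an illustrative assumption:

```python
# Minimal metadata-registry sketch (table columns assumed for illustration).
import uuid
from datetime import datetime, timezone

from pyhive import hive

conn = hive.Connection(host="hive-server", port=10000, username="hive")
cursor = conn.cursor()


def register_metadata(event: dict) -> None:
    record_id = str(uuid.uuid4())
    registered_at = datetime.now(timezone.utc).isoformat()
    cursor.execute(
        "INSERT INTO genomic_files "
        "(id, patient_id, file_name, hdfs_path, sha256, tx_hash, registered_at) "
        "VALUES (%s, %s, %s, %s, %s, %s, %s)",
        (
            record_id,
            event["metadata"]["patient_id"],
            event["file_name"],
            event["hdfs_path"],
            event["sha256"],
            event["tx_hash"],
            registered_at,
        ),
    )
```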

📡 5. Genomics Notifier

Receives completed ingestion events and sends them to external systems via webhook.

Technologies & Tools:

  • ๐Ÿ Python
  • โ˜• Kafka (consume genomics.ingest.completed)
  • ๐ŸŒ Webhooks / HTTP requests

Flow:

  1. Consume completed ingestion event from Kafka.
  2. Send event payload to configured webhook URL.
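
A minimal sketch of the notifier, assuming kafka-python and the requests library; the webhook URL is a placeholder and retry/backoff handling is omitted for brevity:

```python
# Minimal notifier sketch: forward completed-ingestion events to a webhook.
import json
import os

import requests
from kafka import KafkaConsumer

WEBHOOK_URL = os.environ.get("WEBHOOK_URL", "https://example.org/genomics/webhook")

consumer = KafkaConsumer(
    "genomics.ingest.completed",
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    # Forward the completed-ingestion payload to the external system.
    response = requests.post(WEBHOOK_URL, json=message.value, timeout=10)
    response.raise_for_status()
```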

๐Ÿ—‘๏ธ 6. Genomics Deadletter

Central service that manages failed messages from any stage of the pipeline.
Its main purpose is to isolate errors, store them reliably, and allow platform engineers to inspect and reprocess them later.

Technologies & Tools:

  • ๐Ÿ Python
  • โ˜• Kafka (consume .deadletter topic, persist events)
  • ๐Ÿ“‘ Structured logging for observability

Flow:

  1. Consume events from .deadletter Kafka topic.
  2. Persist event to a Deadletter store (MongoDB) for traceability.
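
A minimal sketch of the deadletter sink, assuming kafka-python and pymongo; the topic pattern, database, and collection names are illustrative:

```python
# Minimal deadletter sketch: persist failed events from any *.deadletter topic to MongoDB.
import json
from datetime import datetime, timezone

from kafka import KafkaConsumer
from pymongo import MongoClient

consumer = KafkaConsumer(
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
consumer.subscribe(pattern=".*\\.deadletter")  # subscribe to every *.deadletter topic

collection = MongoClient("mongodb://mongo:27017")["genomics"]["deadletter_events"]

for message in consumer:
    # Persist the failed event together with its origin topic for later inspection.
    collection.insert_one({
        "topic": message.topic,
        "payload": message.value,
        "received_at": datetime.now(timezone.utc),
    })
```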

๐ŸŒ Genomics API Microservices (Facade with Trino)

The Genomics API services provide a unified, high-performance interface to query genomic metadata.
While the underlying storage is managed by Hive on HDFS, Trino enables interactive, low-latency SQL queries, offering an OLTP-like experience for API clients.

Key Features & Tech:

  • ๐Ÿ Python + Flask
  • ๐Ÿ”‘ Keycloak for authentication & JWT validation
  • โšก Trino for fast, interactive query access
  • โš–๏ธ HAProxy for load balancing across multiple replicas

Some Endpoints:

  • POST /api/genomics/login → Authenticate user and return JWT
  • GET /api/genomics/patients/<patient_id>/files → Retrieve files for a specific patient using Trino
  • GET /api/genomics/genomic-files?limit=10&offset=0 → Retrieve all genomic files with pagination via Trino
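
For illustration, a minimal sketch of one such endpoint, assuming Flask and the `trino` client. The JWT check is reduced to a placeholder (the POC validates tokens against Keycloak), and the internal Trino host/port and table/column names are assumptions:

```python
# Minimal facade-endpoint sketch (JWT check and table layout are placeholders).
import trino
from flask import Flask, abort, jsonify, request

app = Flask(__name__)


def require_jwt():
    # In the POC the bearer token is validated against Keycloak; here we only check presence.
    if not request.headers.get("Authorization", "").startswith("Bearer "):
        abort(401)


@app.route("/api/genomics/patients/<patient_id>/files", methods=["GET"])
def patient_files(patient_id):
    require_jwt()
    conn = trino.dbapi.connect(host="trino", port=8080, user="genomics-api",
                               catalog="hive", schema="default")  # internal port assumed
    cur = conn.cursor()
    # Parameterized query against the Hive-backed metadata table (name assumed).
    cur.execute("SELECT file_name, hdfs_path, sha256 FROM genomic_files WHERE patient_id = ?",
                (patient_id,))
    files = [dict(zip(("file_name", "hdfs_path", "sha256"), row)) for row in cur.fetchall()]
    return jsonify({"patient_id": patient_id, "files": files})
```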

Deployment:

  • 3 replicas for high availability
  • Load-balanced via HAProxy (9008 for API, 1937 for HAProxy stats)
  • Shared helper code mounted from ./common
  • Connected to genomic-data-storage-net

Notes:

  • The facade does not process raw genomic data; it queries persisted metadata from the pipeline.
  • Using Trino on top of Hive ensures efficient, interactive queries, enabling fast response times for analytics dashboards, reporting, and API requests.

Screenshots 📷

Here are some screenshots that demonstrate the functionality of the Genomic Data Storage Architecture:

  • Hive Server admin page
  • Hadoop YARN admin page
  • Hadoop HDFS admin page
  • Kafka admin page
  • HAProxy admin page
  • Docker Compose
  • Keycloak admin page
  • Trino admin page

Rake Tasks 🧬

This project leverages Rake to manage infrastructure, data ingestion, Hive schemas, MinIO uploads, and blockchain DApp tasks. Below is a table summarizing the main tasks and their purposes.

Task – Description
genomic_data_storage:deploy – Deploys the full architecture and launches all services and daemons.
genomic_data_storage:undeploy – Undeploys the architecture and stops all containers.
genomic_data_storage:start – Starts all containers after checking Docker and deployment files.
genomic_data_storage:stop – Stops and removes running containers.
genomic_data_storage:status – Shows the status of all containers.
genomic_data_storage:clean_environment – Cleans Docker environment by pruning images and volumes.
genomic_data_storage:check_docker – Checks if Docker and Docker Compose are installed and accessible.
genomic_data_storage:login – Authenticates with Docker credentials.
genomic_data_storage:check_deployment_file – Verifies that docker-compose.yml exists.
genomic_data_storage:build_and_push_genomic_api_image – Builds and pushes the Genomic API Docker image to DockerHub.

Hive Tasks

Task – Description
genomic_data_storage:hive:setup_metadata_registry_schema – Copies and applies Hive SQL schema for metadata registry.
genomic_data_storage:hive:check_metadata_registry_schema – Checks if genomic_files table exists in Hive.
genomic_data_storage:hive:count_genomic_files_data – Counts the records in the genomic_files table.
genomic_data_storage:hive:display_genomic_files_data – Displays all data in the genomic_files table.

MinIO Tasks

Task – Description
genomic_data_storage:minio:upload_genomic_zip['./path/to/zip'] – Uploads a genomic dataset ZIP file to MinIO.
genomic_data_storage:minio:upload_contract_abi – Uploads the blockchain contract ABI JSON to MinIO.
genomic_data_storage:minio:delete_contract_abi – Deletes the contract ABI JSON from MinIO.
genomic_data_storage:minio:check_contract_abi – Checks if the contract ABI JSON exists in MinIO.

DApp Tasks

Task – Description
genomic_data_storage:dapp:install_dependencies – Installs dependencies for the GenomicNotaryDApp.
genomic_data_storage:dapp:run_tests – Runs Hardhat tests for the GenomicNotaryDApp.
genomic_data_storage:dapp:deploy_contracts – Deploys smart contracts to the Amoy network using Hardhat.

Exposed Services in the Genomic Data POC

This table summarizes the main services and endpoints exposed in the Genomic Data Storage Proof of Concept (POC). It includes the services that can be accessed directly (via browser, API, or client), their ports, and a brief description of their role. Internal microservices and background workers are listed without exposed ports as they communicate via the internal network or Kafka.

Service / Container – Exposed Port(s) – Role / Description
HDFS NameNode (namenode) – 8089 – HDFS web interface for monitoring the NameNode
YARN ResourceManager (resourcemanager) – 8081 – Web UI for resource and job management in Hadoop
Hive Metastore (hive-metastore) – 9083 – Thrift service for Hive metadata queries
HiveServer2 (hive-server) – 10000, 10002 – JDBC/ODBC/Beeline query service and Hive web UI
MinIO HAProxy (genomic_staging_minio_haproxy) – 9000, 36305, 1936 – Unified access to the MinIO object storage cluster for staging, plus HAProxy dashboard
Kafka (kafka) – 9092 – Kafka broker for event streaming
ZooKeeper (zookeeper) – 2181 – Coordination service for distributed systems (used by Kafka)
AKHQ (akhq) – 8088 – Web UI to explore and manage Kafka topics
Trino (trino) – 8189 – Distributed SQL query engine for fast analytics over Hive/HDFS
PostgreSQL (postgres) – 5432 – Relational database for microservices and Keycloak
pgAdmin (pgadmin) – 8085 – Web interface to manage PostgreSQL
Keycloak (keycloak) – 8080 – Identity management and JWT authentication
Genomics API Service replicas – 5000 (internal) – REST API to query genomic metadata with Keycloak authentication
Genomics API HAProxy – 9008, 1937 – Load balancer distributing requests across the API replicas
Genomics Ingest Validator – internal only – Validates genomic data before ingestion (consumes Kafka events)
Genomics Ingest Processor – internal only – Processes validated genomic data and publishes to Kafka
Genomics Hash Notarizer – internal only – Generates file hashes and ensures data integrity
Genomics Metadata Registry – internal only – Manages and updates genomic metadata
Genomics Notifier – internal only – Sends notifications via webhooks when genomic data processing is complete
Genomics Deadletter Service – internal only – Handles failed messages and ingestion errors
MongoDB (mongo) – 27017 – Stores deadletter messages and temporary data

โš ๏ธ Disclaimer

Genomics Data Storage Architecture is a Proof of Concept (POC) project developed for learning, experimentation, and demonstration purposes only. This project was created to explore and integrate a microservices-based architecture for genomic data processing using technologies such as Python, Kafka, Hive, Trino, HDFS, MinIO, Blockchain (Solidity/Web3/Polygon), Flask, and Docker.

It is not intended for production deployment. Any genomic data, identifiers, or blockchain interactions used during development are simulated or anonymized.

While inspired by real-world architectures, all development was carried out independently as a technical exercise.

The project name, logos, and any example data are purely for educational and illustrative purposes and do not represent actual brands.

๐Ÿ“ License

This project is licensed under the MIT License.
