This project is a Proof of Concept (POC) exploring how to integrate distributed storage, blockchain, and bioinformatics to manage genomic data in a secure, scalable, and auditable way.
Genomic datasets are massive, often reaching terabytes of data, and they require storage in systems like HDFS, protection with cryptographic proofs, and orchestration through a flexible, event-driven microservice architecture.
A key goal of this POC is to demonstrate how blockchain technology can be applied in contexts where data integrity is critical, ensuring that genomic files are tamper-proof, traceable, and verifiable throughout the pipeline.
This POC implements a loosely coupled ecosystem, where each event drives processes of data ingestion, metadata validation, and blockchain notarization, guaranteeing integrity, traceability, and resilience while preserving privacy.
It demonstrates the technical feasibility of combining bioinformatics, big data, and blockchain to create a trustworthy genomic data management workflow.
Genomics Data Storage Architecture is a Proof of Concept (POC) project developed for learning, experimentation, and demonstration purposes only. This project was created to explore and integrate a microservices-based architecture for genomic data processing using technologies such as Python, Kafka, Hive, Trino, HDFS, MinIO, Blockchain (Solidity/Web3/Polygon), Flask, and Docker.
It is not intended for production deployment. Any genomic data, identifiers, or blockchain interactions used during development are simulated or anonymized.
While inspired by real-world architectures, all development was carried out independently as a technical exercise.
The project name, logos, and any example data are purely for educational and illustrative purposes and do not represent actual brands.
- **Event-Driven Microservices**: Loosely coupled services handling ingestion, validation, and notarization of genomic data, ensuring flexibility and scalability.
- **Secure & Auditable Storage**: Genomic files are stored in HDFS with metadata registered in Hive, providing traceability and auditability. Trino is used as a high-performance query layer on top of Hive for fast API responses.
- **Data Integrity with Blockchain**: File hashes and critical metadata are notarized on a blockchain, ensuring tamper-proof records and verifiable integrity.
- **Scalable Message Bus**: Uses Kafka for asynchronous communication between microservices, enabling horizontal scaling and resilient pipelines.
- **MinIO Object Storage**: Temporary storage for metadata and files before HDFS persistence, providing efficient staging and retrieval.
- **Web API**: A Flask-based REST API provides secure endpoints to query genomic files. Queries to Hive are executed via Trino for higher performance and reduced latency.
- **Deadletter Handling**: Failed events are captured in deadletter topics, enabling monitoring, debugging, and operational transparency.
- **Blockchain-Ready Architecture**: Designed to illustrate how blockchain can be applied to genomic workflows, highlighting traceability, audit, and integrity.
- **Extensible & Modular**: Each microservice can be extended or replaced independently, facilitating experimentation with new technologies or algorithms.
This Proof of Concept demonstrates how distributed storage, blockchain, and event-driven microservices can be combined to securely manage large-scale genomic data.
The architecture is designed to ensure data integrity, traceability, and resilience, which are critical when handling sensitive genomic information.
Genomic projects generate massive volumes of data (terabytes per genome). This POC illustrates how such data can be ingested, validated, stored, and notarized in a modular pipeline that can scale horizontally, maintain privacy, and provide an auditable trail of every operation.
1. **Ingestion**
   - Raw genomic files and associated metadata are uploaded to MinIO (object storage).
   - Each upload triggers a Kafka event that starts the processing pipeline.
2. **Validation**
   - Microservices consume Kafka events to extract and validate metadata.
   - Invalid events are sent to deadletter topics, allowing monitoring and error handling.
3. **Persistence**
   - Validated files are persisted in HDFS for long-term storage.
   - Metadata is registered in Hive, enabling structured queries.
   - Trino provides a high-performance SQL query layer on top of Hive for API consumption.
4. **Blockchain Notarization**
   - Critical information such as file hashes and patient identifiers is registered on a blockchain (Web3 / Polygon).
   - This provides a tamper-proof, verifiable record to guarantee integrity and traceability.
5. **Notification & Webhooks**
   - Once files are fully processed, events are published to a completed topic.
   - A microservice listens to this topic and triggers webhooks to notify external systems.
   - Failed events are captured in deadletter topics for operational review.
6. **API Layer**
   - A Flask-based API provides secure endpoints to query genomic files, fetch patient-specific data, and integrate with external applications.
   - Queries are executed via Trino for faster response times compared to direct Hive queries.
   - JWT authentication ensures that only authorized users can access sensitive information.
7. **Monitoring & Deadletter Handling**
   - Each stage in the pipeline is monitored via Kafka topics.
   - Failed or invalid events are routed to deadletter queues to ensure traceability and debugging capability.
This Proof of Concept leverages a mix of big data, blockchain, cloud storage, and microservices technologies to demonstrate secure, auditable genomic data processing.
- Python – Main programming language for microservices and helpers
- Apache Kafka – Event streaming and message bus
- HDFS / Hadoop / YARN – Distributed storage and resource management
- Apache Hive – Metadata storage and queries
- Trino – High-performance SQL query engine for Hive data
- MinIO – Object storage for genomic files and metadata
- Blockchain & Solidity / Web3 / Polygon – File hash notarization and integrity proofs
- PostgreSQL – Persistent relational storage for service metadata
- Keycloak – Authentication and JWT-based authorization
- Docker – Containerized microservices
- Event-driven microservice architecture – Loosely coupled services for ingestion, validation, persistence, notarization, and webhook notifications
- Flask – API microservices for querying genomic data
- Supporting tools – SHA256 hashing, logging, and custom helpers for Kafka, HDFS, Hive, Trino, MinIO, and blockchain
For a genomic data platform, flexibility, scalability, and resilience are critical. An event-driven architecture (EDA) orchestrated with Apache Kafka is particularly well-suited for this kind of workload. Here's why:
- **Loose coupling**: Each microservice focuses on a single responsibility (ingestion, validation, persistence, notarization, or notifications). Services communicate asynchronously via events, so adding or modifying a microservice does not break the rest of the system.
- **Scalability**: Genomic datasets can reach terabytes per patient. Kafka allows high-throughput streaming, handling spikes of data efficiently, and individual microservices can scale independently depending on workload.
- **Extensibility**: New processes (e.g., analytics, AI pipelines, or additional blockchain checks) can subscribe to existing topics without changing existing services, so the system adapts naturally to new data types or processing steps, making it ideal for POC exploration and experimentation.
- **Resilience**: Kafka persists messages, ensuring no data is lost if a service fails temporarily. Dead-letter topics capture errors for later inspection, enabling robust error handling and auditing.
- **Auditability**: Every action on a genomic file (ingestion, validation, notarization) is logged as an event, providing a complete audit trail that is essential for compliance and scientific reproducibility.

Compared with the alternatives:

- REST: synchronous calls couple services tightly, and failures propagate immediately, reducing resilience.
- Batch pipelines: harder to extend and less responsive; not suitable for real-time notifications or for scaling with bursts of genomic data.
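
The extensibility point above is easy to see in code: a new consumer can attach to an existing topic without touching the services that produce it. The sketch below is illustrative only and assumes the kafka-python client, a `kafka:9092` broker, and a hypothetical analytics consumer group.

```python
# Hypothetical new subscriber (e.g., an analytics job) joining an existing topic.
# No producer or existing microservice needs to change for this to work.
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "genomics.ingest.completed",              # topic already produced by the pipeline
    bootstrap_servers="kafka:9092",           # assumed broker address
    group_id="analytics-poc",                 # new, independent consumer group
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    event = message.value
    # ... feed downstream analytics or AI pipelines here ...
    print(event.get("file_name"), event.get("file_hash"))
```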
This POC leverages Apache Hive as the core metadata storage and batch query engine for genomic files, while Trino acts as a high-performance query layer to provide fast, interactive analytical queries.
- Genomic projects generate massive volumes of metadata for patients, samples, and sequencing runs.
- Hive enables SQL-like queries over large datasets stored in HDFS, making it ideal for batch analytics, reporting, and exploratory analysis.
- Supports dynamic partitions and schema evolution, allowing the platform to adapt to new data types or fields without major rewrites.
- Integrates seamlessly with Hadoop/YARN, HDFS, and other big data tools, enabling horizontal scaling on commodity hardware at a low cost.
- Trino sits on top of Hive and provides interactive, low-latency SQL queries on the same metadata.
- Enables the API layer and analytics dashboards to perform efficient, near-real-time queries, even over terabytes of genomic metadata.
- Supports federated queries, allowing seamless integration with other data sources if needed.
- Acts as a bridge between batch-oriented Hive storage and fast analytical access, effectively providing an OLTP-like experience for end-users without changing the underlying big data architecture.
- Researchers and developers can use standard SQL for both exploratory batch analytics (Hive) and interactive queries (Trino).
- The separation of storage (Hive/HDFS) and query layer (Trino) ensures the system is scalable, flexible, and performant, accommodating both large-scale batch processing and responsive API-driven analytics.
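
As a rough illustration of this split between Hive storage and the Trino query layer, the snippet below uses the `trino` Python client to read the same Hive-managed `genomic_files` table interactively. Host, port, catalog, schema, and column names are assumptions; the deployed Trino instance may be configured differently.

```python
# Query Hive-managed metadata through Trino for interactive, low-latency access.
import trino

conn = trino.dbapi.connect(
    host="trino",        # assumed container name
    port=8080,           # Trino's default port (exposed as 8189 in this POC's compose file)
    user="genomics-poc",
    catalog="hive",
    schema="default",
)
cur = conn.cursor()
cur.execute("SELECT id, patient_id, file_hash FROM genomic_files LIMIT 10")
for row in cur.fetchall():
    print(row)
```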
This project implements a modular microservices architecture for processing genomic data.
Each service is focused on a single responsibility, communicating asynchronously via Kafka topics.
This design allows for scalability, fault tolerance, and easy maintenance.
The workflow follows a clear pipeline of data processing:
- Ingestion & Validation – Metadata is validated before further processing.
- File Processing & Persistence – Genomic files are downloaded, hashed, and stored in HDFS.
- Hash Notarization – File integrity is notarized in a distributed ledger.
- Metadata Registration – File metadata is recorded in Hive for analytics.
- Notifier – Sends ingestion results to external systems.
- Deadletter – Captures and persists failed events.
Each service handles deadletter events for errors, ensuring that failed messages do not block the pipeline.
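
The per-service deadletter behaviour can be pictured as a small wrapper around each handler: if processing fails, the event and the error are forwarded to the stage's `.deadletter` topic instead of blocking the pipeline. This is an illustrative sketch, not the repository's actual helper; the broker address and topic-naming convention are assumptions.

```python
import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def process_with_deadletter(event: dict, handler, source_topic: str) -> None:
    """Run handler(event); on failure, publish the event and error to the deadletter topic."""
    try:
        handler(event)
    except Exception as exc:
        producer.send(f"{source_topic}.deadletter", {"event": event, "error": str(exc)})
        producer.flush()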
Validates incoming genomic metadata from MinIO before it enters the processing pipeline. Ensures that all required fields are correct and complete.
Technologies & Tools:

- Python
- Kafka (consume `genomics.ingest.requested`, produce `genomics.ingest.validated`)
- MinIO (metadata storage)
- Validation logic (custom helpers)

Flow:

- Consume event from Kafka input topic.
- Fetch `metadata.json` from MinIO.
- Validate fields and format.
- Publish validated event to `genomics.ingest.validated` or send to Deadletter if invalid.
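
A minimal sketch of the flow above, assuming the kafka-python and minio clients; the bucket name, object layout, and required metadata fields are illustrative, not the service's real schema.

```python
import json

from kafka import KafkaConsumer, KafkaProducer
from minio import Minio

REQUIRED_FIELDS = {"patient_id", "file_name", "file_size"}   # assumed required fields

minio_client = Minio("minio:9000", access_key="minio", secret_key="minio123", secure=False)
consumer = KafkaConsumer(
    "genomics.ingest.requested",
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for message in consumer:
    event = message.value
    try:
        # Fetch metadata.json from the staging bucket (bucket/object names assumed).
        obj = minio_client.get_object("genomics-staging", f"{event['upload_id']}/metadata.json")
        metadata = json.loads(obj.read())
        missing = REQUIRED_FIELDS - metadata.keys()
        if missing:
            raise ValueError(f"missing metadata fields: {sorted(missing)}")
        producer.send("genomics.ingest.validated", {**event, "metadata": metadata})
    except Exception as exc:
        producer.send("genomics.ingest.requested.deadletter", {"event": event, "error": str(exc)})
```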
Downloads genomic files, computes hashes, uploads to HDFS, and prepares them for further processing.
Technologies & Tools:

- Python
- Kafka (consume `genomics.ingest.validated`, produce `genomics.ingest.persisted`)
- MinIO (file storage)
- HDFS (persistent storage)
- SHA256 hashing

Flow:

- Consume validated event.
- Download file from MinIO.
- Compute SHA256 hash.
- Upload file to HDFS.
- Delete file from MinIO staging.
- Publish persisted event or send to Deadletter on failure.
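
A hedged sketch of the hash-and-persist step, using hashlib together with the `minio` and `hdfs` (WebHDFS) clients; bucket names, the HDFS target directory, and the NameNode URL are assumptions.

```python
import hashlib

from hdfs import InsecureClient
from minio import Minio

minio_client = Minio("minio:9000", access_key="minio", secret_key="minio123", secure=False)
hdfs_client = InsecureClient("http://namenode:9870", user="hdfs")   # assumed WebHDFS endpoint

def persist_file(bucket: str, object_name: str, hdfs_dir: str = "/genomics/raw") -> str:
    """Download from MinIO, compute the SHA256 hash, write to HDFS, and return the hash."""
    obj = minio_client.get_object(bucket, object_name)
    data = obj.read()
    file_hash = hashlib.sha256(data).hexdigest()

    hdfs_client.write(f"{hdfs_dir}/{object_name}", data=data, overwrite=True)
    minio_client.remove_object(bucket, object_name)   # clean up the staging copy
    return file_hash
```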
Registers file hashes in a distributed ledger to notarize genomic files, ensuring integrity and traceability.
Technologies & Tools:

- Python
- Kafka (consume `genomics.ingest.persisted`, produce `genomics.hash.notarized`)
- Web3 / Smart Contract interface
- SHA256 hashing for user/file IDs

Flow:

- Consume persisted event from Kafka.
- Compute SHA256 hash of patient ID.
- Register file hash in the notary contract.
- Publish notarized event or send to Deadletter on failure.
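
A sketch of the notarization call with web3.py. The RPC endpoint, contract address, ABI file, and the `notarize` function name are placeholders; the real GenomicNotaryDApp contract may expose a different interface.

```python
import hashlib
import json

from web3 import Web3

w3 = Web3(Web3.HTTPProvider("https://rpc-amoy.polygon.technology"))   # assumed Polygon Amoy RPC
account = w3.eth.account.from_key("0x" + "11" * 32)                   # dummy key; load from secrets in practice
contract = w3.eth.contract(
    address="0x0000000000000000000000000000000000000000",             # placeholder contract address
    abi=json.load(open("GenomicNotary.abi.json")),                    # ABI (stored in MinIO in this POC)
)

def notarize(patient_id: str, file_hash: str) -> str:
    patient_hash = hashlib.sha256(patient_id.encode()).hexdigest()    # only the hash goes on-chain
    tx = contract.functions.notarize(patient_hash, file_hash).build_transaction({
        "from": account.address,
        "nonce": w3.eth.get_transaction_count(account.address),
    })
    signed = account.sign_transaction(tx)
    # Note: the attribute is named `rawTransaction` on older web3.py versions.
    return w3.eth.send_raw_transaction(signed.raw_transaction).hex()
```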
Registers genomic file metadata into a Hive table for querying and downstream analytics.
Technologies & Tools:

- Python
- Kafka (consume `genomics.ingest.notarized`, produce `genomics.ingest.completed`)
- Hive (metadata storage)
- UUID & timestamps for record tracking

Flow:

- Consume notarized event from Kafka.
- Build Hive `INSERT` query with metadata.
- Execute query.
- Publish registered event or send to Deadletter on failure.
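
A sketch of the metadata insert using PyHive; the table name follows the `genomic_files` table referenced by the Rake tasks below, but the column names and HiveServer2 host are assumptions.

```python
import uuid
from datetime import datetime, timezone

from pyhive import hive

conn = hive.Connection(host="hive-server", port=10000, username="hive")   # assumed HiveServer2 host
cursor = conn.cursor()

def register_metadata(event: dict) -> None:
    """Insert one metadata record for a notarized genomic file (columns are assumed)."""
    cursor.execute(
        "INSERT INTO genomic_files (id, patient_id, file_name, file_hash, created_at) "
        "VALUES (%(id)s, %(patient_id)s, %(file_name)s, %(file_hash)s, %(created_at)s)",
        {
            "id": str(uuid.uuid4()),
            "patient_id": event["patient_id"],
            "file_name": event["file_name"],
            "file_hash": event["file_hash"],
            "created_at": datetime.now(timezone.utc).isoformat(),
        },
    )
```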
Receives completed ingestion events and sends them to external systems via webhook.
Technologies & Tools:

- Python
- Kafka (consume `genomics.ingest.completed`)
- Webhooks / HTTP requests

Flow:

- Consume completed ingestion event from Kafka.
- Send event payload to configured webhook URL.
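
A minimal sketch of the notifier loop using the `requests` library; the webhook URL and timeout would come from configuration in the actual service.

```python
import json

import requests
from kafka import KafkaConsumer

WEBHOOK_URL = "https://example.org/genomics/webhook"   # placeholder endpoint

consumer = KafkaConsumer(
    "genomics.ingest.completed",
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    response = requests.post(WEBHOOK_URL, json=message.value, timeout=10)
    response.raise_for_status()   # non-2xx responses surface as errors (candidates for the deadletter flow)
```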
Central service that manages failed messages from any stage of the pipeline.
Its main purpose is to isolate errors, store them reliably, and allow platform engineers to inspect and reprocess them later.
Technologies & Tools:

- Python
- Kafka (consume `.deadletter` topic, persist events)
- Structured logging for observability

Flow:

- Consume events from `.deadletter` Kafka topic.
- Persist event to a Deadletter store (MongoDB) for traceability.
The Genomics API services provide a unified, high-performance interface to query genomic metadata.
While the underlying storage is managed by Hive on HDFS, Trino enables interactive, low-latency SQL queries, offering an OLTP-like experience for API clients.
Key Features & Tech:

- Python + Flask
- Keycloak for authentication & JWT validation
- Trino for fast, interactive query access
- HAProxy for load balancing across multiple replicas

Some Endpoints:

- `POST /api/genomics/login` – Authenticate user and return JWT
- `GET /api/genomics/patients/<patient_id>/files` – Retrieve files for a specific patient using Trino
- `GET /api/genomics/genomic-files?limit=10&offset=0` – Retrieve all genomic files with pagination via Trino

Deployment:

- 3 replicas for high availability
- Load-balanced via HAProxy (`9008` for API, `1937` for HAProxy stats)
- Shared helper code mounted from `./common`
- Connected to `genomic-data-storage-net`
Notes:
- The facade does not process raw genomic data; it queries persisted metadata from the pipeline.
- Using Trino on top of Hive ensures efficient, interactive queries, enabling fast response times for analytics dashboards, reporting, and API requests.
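
As a rough sketch of how one endpoint could be wired together (Flask route, JWT check, Trino query), see below. The token check is a stub standing in for the Keycloak/JWT validation, and the table and column names are assumptions rather than the repository's actual code.

```python
from flask import Flask, jsonify, request
import trino

app = Flask(__name__)

def is_valid_token(auth_header: str) -> bool:
    """Stub: the real service verifies the JWT against Keycloak."""
    return bool(auth_header and auth_header.startswith("Bearer "))

@app.route("/api/genomics/patients/<patient_id>/files", methods=["GET"])
def patient_files(patient_id: str):
    if not is_valid_token(request.headers.get("Authorization", "")):
        return jsonify({"error": "unauthorized"}), 401

    conn = trino.dbapi.connect(host="trino", port=8080, user="genomics-api",
                               catalog="hive", schema="default")
    cur = conn.cursor()
    cur.execute(
        "SELECT file_name, file_hash, created_at FROM genomic_files WHERE patient_id = ?",
        (patient_id,),
    )
    rows = [{"file_name": r[0], "file_hash": r[1], "created_at": str(r[2])} for r in cur.fetchall()]
    return jsonify(rows)

if __name__ == "__main__":
    app.run(port=5000)
```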
Here are some screenshots that demonstrate the functionality of the Genomics Data Storage Architecture:
This project leverages Rake to manage infrastructure, data ingestion, Hive schemas, MinIO uploads, and blockchain DApp tasks. Below is a table summarizing the main tasks and their purposes.
| Task | Description |
|---|---|
| `genomic_data_storage:deploy` | Deploys the full architecture and launches all services and daemons. |
| `genomic_data_storage:undeploy` | Undeploys the architecture and stops all containers. |
| `genomic_data_storage:start` | Starts all containers after checking Docker and deployment files. |
| `genomic_data_storage:stop` | Stops and removes running containers. |
| `genomic_data_storage:status` | Shows the status of all containers. |
| `genomic_data_storage:clean_environment` | Cleans Docker environment by pruning images and volumes. |
| `genomic_data_storage:check_docker` | Checks if Docker and Docker Compose are installed and accessible. |
| `genomic_data_storage:login` | Authenticates with Docker credentials. |
| `genomic_data_storage:check_deployment_file` | Verifies that `docker-compose.yml` exists. |
| `genomic_data_storage:build_and_push_genomic_api_image` | Builds and pushes the Genomic API Docker image to DockerHub. |
| Task | Description |
|---|---|
| `genomic_data_storage:hive:setup_metadata_registry_schema` | Copies and applies Hive SQL schema for metadata registry. |
| `genomic_data_storage:hive:check_metadata_registry_schema` | Checks if `genomic_files` table exists in Hive. |
| `genomic_data_storage:hive:count_genomic_files_data` | Counts the records in the `genomic_files` table. |
| `genomic_data_storage:hive:display_genomic_files_data` | Displays all data in the `genomic_files` table. |
| Task | Description |
|---|---|
| `genomic_data_storage:minio:upload_genomic_zip['./path/to/zip']` | Uploads a genomic dataset ZIP file to MinIO. |
| `genomic_data_storage:minio:upload_contract_abi` | Uploads the blockchain contract ABI JSON to MinIO. |
| `genomic_data_storage:minio:delete_contract_abi` | Deletes the contract ABI JSON from MinIO. |
| `genomic_data_storage:minio:check_contract_abi` | Checks if the contract ABI JSON exists in MinIO. |
| Task | Description |
|---|---|
| `genomic_data_storage:dapp:install_dependencies` | Installs dependencies for the GenomicNotaryDApp. |
| `genomic_data_storage:dapp:run_tests` | Runs Hardhat tests for the GenomicNotaryDApp. |
| `genomic_data_storage:dapp:deploy_contracts` | Deploys smart contracts to the Amoy network using Hardhat. |
This table summarizes the main services and endpoints exposed in the Genomic Data Storage Proof of Concept (POC). It includes the services that can be accessed directly (via browser, API, or client), their ports, and a brief description of their role. Internal microservices and background workers are listed without exposed ports as they communicate via the internal network or Kafka.
| Service / Container | Exposed Port(s) | Role / Description |
|---|---|---|
| HDFS NameNode (namenode) | 8089 | HDFS web interface for monitoring the NameNode |
| YARN ResourceManager (resourcemanager) | 8081 | Web UI for resource and job management in Hadoop |
| Hive Metastore (hive-metastore) | 9083 | Thrift service for Hive metadata queries |
| HiveServer2 (hive-server) | 10000, 10002 | JDBC/ODBC/Beeline query service and Hive web UI |
| MinIO HAProxy (genomic_staging_minio_haproxy) | 9000, 36305, 1936 | Unified access to the MinIO object storage cluster for staging, plus HAProxy dashboard |
| Kafka (kafka) | 9092 | Kafka broker for event streaming |
| ZooKeeper (zookeeper) | 2181 | Coordination service for distributed systems (used by Kafka) |
| AKHQ (akhq) | 8088 | Web UI to explore and manage Kafka topics |
| Trino (trino) | 8189 | Distributed SQL query engine for fast analytics over Hive/HDFS |
| PostgreSQL (postgres) | 5432 | Relational database for microservices and Keycloak |
| pgAdmin (pgadmin) | 8085 | Web interface to manage PostgreSQL |
| Keycloak (keycloak) | 8080 | Identity management and JWT authentication |
| Genomics API Service replicas | 5000 (internal) | REST API to query genomic metadata with Keycloak authentication |
| Genomics API HAProxy | 9008, 1937 | Load balancer distributing requests across the API replicas |
| Genomics Ingest Validator | - | Validates genomic data before ingestion (consumes Kafka events) |
| Genomics Ingest Processor | - | Processes validated genomic data and publishes to Kafka |
| Genomics Hash Notarizer | - | Generates file hashes and ensures data integrity |
| Genomics Metadata Registry | - | Manages and updates genomic metadata |
| Genomics Notifier | - | Sends notifications via webhooks when genomic data processing is complete |
| Genomics Deadletter Service | - | Handles failed messages and ingestion errors |
| MongoDB (mongo) | 27017 | Stores deadletter messages and temporary data |
This project is licensed under the MIT License.