This is an end-to-end system built around SciDB 19.11 that supports distributed ingestion, multi-layer query processing with a form of multi-tenant query isolation, load simulation, variant-based benchmarking, and a prototype machine learning framework for adaptive query optimization.
The system is fully containerized and provides a reproducible workflow for handling large scientific datasets, generating parallel load, and experimenting with learned cost models and scheduling policies.
The project consists of four major layers:
- A reproducible SciDB and Spark environment with all dependencies rebuilt and configured
- Support for ingesting large datasets into SciDB using Accelerated I/O, with interoperability through Arrow and Parquet
- A custom canonicalization and variant generation pipeline, implemented in C++, that standardizes incoming queries and produces alternate execution formulations
- A Python benchmarking engine that collects execution metrics, plus an experimental ML module that uses those metrics to evaluate and improve query performance
This architecture allows experiments on workload behavior, plan selection, tail latency, and multi-tenant scheduling.
```
+------------------------------+
|        Client Queries        |
+--------------+---------------+
               |
               v
+------------------------------------+
|        Query Control Plane         |
| Canonicalizer + Variant Generator  |
+-----------------+------------------+
                  |
        AFL Variants + Metadata
                  |
                  v
+----------------------------------------+
|       Benchmark Execution Layer        |
|    SciDB Runner + Metrics Collector    |
+-------------------+--------------------+
                    |
            Metrics + Features
                    |
                    v
+-------------------------------+
|       ML Analysis Layer       |
|     SVM + DQN (Prototype)     |
+-------------------------------+
                |
                v
+-------------------------------+
|     Performance Insights      |
+-------------------------------+
```
The system runs on two coordinated containers:
- SciDB container:
  - SciDB 19.11
  - Shim service
  - Accelerated I/O support
  - Arrow and Parquet runtime libraries
- Spark container:
  - Spark 3.3.2
  - Java 17
  - Python 3.10
  - Used to generate synthetic data and simulate concurrent workloads
Both containers share a mounted directory that acts as the I/O workspace. This avoids distributed filesystem dependencies and keeps ingest paths fast and simple.
Data preparation includes:
- Conversion utilities for NetCDF to long-form CSV
- Ingestion scripts for structured CSV and dimensional redimensioning in SciDB
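As a rough sketch, the NetCDF-to-CSV step can be done with xarray (an assumption here; the project's own utilities may differ), flattening every dimension into explicit columns:

```python
# Sketch of a NetCDF -> long-form CSV conversion, assuming xarray is
# available; file names and variable names are illustrative.
import xarray as xr

ds = xr.open_dataset("input.nc")          # lazily open the NetCDF file
df = ds.to_dataframe().reset_index()      # flatten dims into columns (long form)
df.to_csv("input_long.csv", index=False)  # one row per (dims..., values) tuple
```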
SciDB Accelerated I/O (AIO) is used to:
- Import CSV at scale
- Export arrays as Parquet shards
- Maintain parallelism by allowing each SciDB instance to read or write its region independently
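A minimal sketch of driving an AIO import through Shim's HTTP API is shown below; the port, array name, and `aio_input` settings string are illustrative and assume a standard Shim deployment with the accelerated_io_tools plugin loaded:

```python
# Hedged sketch: ingest a CSV from the shared workspace via Shim + AIO.
import requests

SHIM = "http://localhost:8080"  # default Shim port; adjust to your deployment

sid = requests.get(f"{SHIM}/new_session").text.strip()
afl = "store(aio_input('/workspace/data.csv', 'num_attributes=3'), raw_load)"
requests.get(f"{SHIM}/execute_query", params={"id": sid, "query": afl}).raise_for_status()
requests.get(f"{SHIM}/release_session", params={"id": sid})
```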
Arrow C++ and Parquet C++ libraries are built into the environment so that:
- Data exported by SciDB can be consumed by Spark and Python directly
- Future ingestion formats can be evaluated for performance benefits
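For example, a Python consumer can read the exported shards directly with pyarrow (assumed here; the shard path is illustrative):

```python
# Sketch of consuming SciDB-exported Parquet shards from Python.
import pyarrow.parquet as pq

table = pq.read_table("/workspace/export/array_shards/")  # reads all shards
df = table.to_pandas()                                    # hand off to pandas or Spark
```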
Query processing before execution on SciDB is handled by a standalone C++ module that:
- Parses AQL/AFL queries
- Produces a deterministic AST format
- Normalizes ordering of joins, filters, projections, and identifiers
- Generates alternate logical formulations of the same query
- Supports transformations like join reordering, predicate pushdown, and projection pruning
- Produces executable AFL strings through a grammar-based emitter
- Assigns lightweight tags for tracking
- Stores canonical and variant forms for benchmarking
This layer ensures all queries entering the system follow a consistent, analyzable format suitable for ML-driven evaluation.
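The real implementation is C++, but the core normalization idea can be sketched in a few lines of Python: order the operands of commutative operators deterministically so that equivalent queries collapse to one canonical AST. Node shapes and operator names below are simplified for illustration:

```python
# Illustrative sketch of canonicalization (the real module is C++).
COMMUTATIVE = {"cross_join", "and", "or"}  # simplified operator set

def canonicalize(node):
    """node = (op, [children...]) for operators, or a plain string leaf."""
    if isinstance(node, str):
        return node
    op, children = node
    children = [canonicalize(c) for c in children]
    if op in COMMUTATIVE:
        children = sorted(children, key=repr)  # deterministic operand order
    return (op, children)

# filter(cross_join(B, A), expr) and filter(cross_join(A, B), expr)
# now canonicalize to the same tree:
assert canonicalize(("filter", [("cross_join", ["B", "A"]), "expr"])) == \
       canonicalize(("filter", [("cross_join", ["A", "B"]), "expr"]))
```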
Benchmark execution is handled by a Python module that:
- Executes each AFL variant against a SciDB sandbox instance
- Tracks per-run information with unique identifiers
Measured values include:
- Latency
- CPU usage
- Memory footprint
- Disk I/O
- CPU skew
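A hedged sketch of how such per-run measurements can be captured on the client side, assuming psutil; `run_variant()` stands in for the actual execution helper and the metric set is simplified:

```python
# Sketch of per-run metric capture (assumes psutil; simplified metrics).
import time, uuid, psutil

def benchmark(run_variant, afl):
    run_id = str(uuid.uuid4())            # unique per-run identifier
    io_before = psutil.disk_io_counters()
    psutil.cpu_percent(interval=None)     # reset the CPU sampling window
    start = time.perf_counter()
    run_variant(afl)                      # execute the AFL variant via Shim
    latency = time.perf_counter() - start
    io_after = psutil.disk_io_counters()
    return {
        "run_id": run_id,
        "latency_s": latency,
        "cpu_percent": psutil.cpu_percent(interval=None),  # avg over the run
        "rss_bytes": psutil.Process().memory_info().rss,
        "disk_read_bytes": io_after.read_bytes - io_before.read_bytes,
    }
```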
All records are stored in a PostgreSQL schema designed for:
- Benchmark aggregation
- ML model training
- Drift analysis and reproducibility
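As an illustration only, a feature-store table along these lines might look as follows (hypothetical column names, via psycopg2; not the project's actual schema):

```python
# Hypothetical sketch of a benchmark feature-store table.
import psycopg2

conn = psycopg2.connect("dbname=benchmarks")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS benchmark_runs (
            run_id          UUID PRIMARY KEY,
            variant_tag     TEXT NOT NULL,     -- canonical/variant identifier
            latency_s       DOUBLE PRECISION,
            cpu_percent     DOUBLE PRECISION,
            rss_bytes       BIGINT,
            disk_read_bytes BIGINT,
            recorded_at     TIMESTAMPTZ DEFAULT now()
        )
    """)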
A prototype ML layer evaluates and predicts plan performance.
- SVM classifier:
  - Uses benchmark data to classify plan variants
  - Identifies variants likely to perform well under given constraints
- DQN scheduler:
  - Accepts live system indicators
  - Suggests a variant to execute based on expected latency behavior
  - Focuses on reducing tail latency and maintaining plan quality under changing load
This module is an early-stage research component intended for experimentation rather than production deployment.
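A minimal sketch of the SVM classification step, assuming scikit-learn; the features and the fast/slow labeling are illustrative:

```python
# Sketch: classify plan variants from benchmark features (assumes sklearn).
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# X: per-run features (latency, CPU %, memory GB); y: 1 if the variant
# met an illustrative latency target, else 0.
X = np.array([[0.8, 35.0, 2.1], [4.2, 90.0, 6.3], [1.1, 40.0, 2.5]])
y = np.array([1, 0, 1])

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X, y)
print(clf.predict([[1.0, 38.0, 2.2]]))  # -> predicted likely-fast variant
```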
The system provides observability across all components:
- Each service exposes metrics on an HTTP endpoint
- A collector gathers periodic system and query performance metrics
- Dashboards visualize latency, resource utilization, and workload trends
- Additional integrations used when deploying on EKS or EC2 support autoscaling experiments and deeper system insights
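A per-service metrics endpoint can be sketched with the prometheus_client library (an assumption; metric names are illustrative):

```python
# Sketch of an HTTP metrics endpoint (assumes prometheus_client).
from prometheus_client import start_http_server, Histogram

QUERY_LATENCY = Histogram("query_latency_seconds", "AFL variant latency")

def run_query():            # placeholder for the actual Shim call
    pass

start_http_server(9100)     # scrape target at http://localhost:9100/metrics

with QUERY_LATENCY.time():  # wrap a query execution to record its latency
    run_query()
```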
A common workflow looks like:
- Use Spark to generate a dataset or load an existing one
- Write CSV or Parquet into the shared workspace
- Use AIO to ingest data into SciDB
- Convert long CSV into an n-dimensional array through redimension
- Generate canonical AST and variants
- Execute variants using the benchmark runner
- Collect metrics and store them in the feature store
- Analyze with ML or visualization dashboards
This end-to-end flow is reproducible and designed for iterative experimentation.
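The Spark side of the first two steps might look like the following sketch, assuming PySpark; the schema and row count are illustrative:

```python
# Sketch: generate synthetic long-form data and write it to the shared mount.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("loadgen").getOrCreate()

df = (spark.range(0, 10_000_000)                        # synthetic row ids
      .withColumn("x", F.col("id") % 1000)              # dimension column
      .withColumn("y", (F.col("id") / 1000).cast("long"))
      .withColumn("value", F.rand()))                   # measured attribute

df.write.mode("overwrite").csv("/workspace/synthetic")  # shared I/O workspace
```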
Currently implemented:
- Containerized environment
- Distributed I/O integration
- Spark workload simulation
- Canonicalizer and variant engine
- Benchmark runner with full metrics
- ML prototype
- Observability stack
Planned next steps:
- Integrating ML inference directly into runtime scheduling
- Expanding workload generation tools
- Optional Kubernetes deployment scripts
To get started:
- Build or pull the SciDB and Spark containers
- Start both containers with a shared I/O mount
- Generate data or place datasets in the shared directory
- Use ingestion scripts to import into SciDB
- Run canonicalizer to prepare variants
- Use the benchmark runner to execute and capture metrics
- Explore metrics in the dashboards or feed them into ML modules
Scripts and examples are provided within the respective folders.