Flux ETL is a Data Quality Management platform designed to ingest exploratory data and transform it into rigorously structured schemas for analysis. It enforces schema contracts via SQL Service Broker, ensuring deterministic, reliable data flows and maintaining integrity across environments. Flux ETL also provides built-in analytics, including statistical models and graphical reports, to validate and monitor data quality continuously.
By combining an API-first architecture with schema contracts, Flux ETL produces idempotent, production-ready pipelines. These pipelines enforce consistency, enable automated orchestration, and adhere strictly to contract definitions from development through full-scale production.
- API-first ETL with enforceable schema contracts and integrated data quality analytics.
- Deterministic, production-ready pipelines with end-to-end contract enforcement.
- Schema contracts, automated flows, and embedded statistical validation in a single engine.
- API-first ETL design with strict schema contracts
- Interactive UI with charting and statistical analysis
- Containerized local development (Docker)
- Production-ready builds with versioned releases
- Deterministic, reproducible pipelines
- Docker (v20+)
- Docker Compose (optional for multi-container setups)
- Node.js or Python environment (if contributing to the UI or scripts)
- Clone the repository:
```bash
git clone https://github.com/your-org/data-plasma.git
cd data-plasma
```
Currently, Flux ETL consists of three logically separate layers.
- `./demo/` (Py, Go)
- `./zOS/` (C, Java)
- `./zenbase/` (PG, JS)
This project uses pnpm exclusively. The specific version is pinned in `package.json`:
- Version: `pnpm@10.24.0`
- Never use `npm` or `yarn` for JavaScript dependencies
- Docker builder stage enables corepack for pnpm
The zOS project is a separate framework for federal mainframe data modernization:
- Purpose: Extract data from z/OS mainframes (DB2, VSAM, IMS) via secure TN3270/JCL
- Architecture: TN3270 TLS connector → JCL submission → DB2 unloads → DuckDB → PostgreSQL → APIs
- Mission: Cross-agency data layer for federal government without changing legacy mainframes
- Deployment: Designed for LTOD (Limited Tour of Duty) engineering teams
- Scope: 20-30 agencies/year with 7-8 engineers via automation
This is conceptual/planning stage work. The demo/ project serves as a reference implementation of the data pipeline architecture.
Flux ETL is a dual-purpose repository containing:
- `demo/` - Multi-tier polyglot ETL service for eCFR analytics (pnpm monorepo)
- `zOS/` - Mainframe integration framework for federal data modernization
```bash
brew install --cask docker
```
```bash
./demo.sh
```

```bash
cd demo

# Install dependencies (uses pnpm workspaces)
pnpm install

# Build all apps
pnpm build

# Build specific apps
pnpm build:web   # Next.js frontend
pnpm build:api   # Express API
```

```bash
# Run web app (Next.js) in dev mode
pnpm dev:web

# Run API in dev mode
pnpm dev:api

# Run individual apps from their directories
cd apps/web && pnpm dev
cd apps/api && pnpm dev
```

```bash
# Web app tests (Vitest)
cd apps/web
pnpm test
pnpm test:watch
# Cypress E2E tests
pnpm cypress:open
pnpm cypress
# Python lake tests
cd apps/lake
python test_pipeline.py
python test_postgres.py
```

```bash
# Web app
cd apps/web
pnpm lint
pnpm typecheck
pnpm format
# API (TypeScript)
cd apps/api
pnpm build   # Runs tsc which checks types
```

```bash
cd demo
# Quick start all services (recommended)
./demo.sh
# Or use docker compose
sudo docker compose up --build
# Access services:
# - Web UI: http://localhost:3000
# - API: http://localhost:4000
# - Lake: http://localhost:8000
# - Postgres: localhost:5432
# Stop services
sudo docker compose down
# Full reset
sudo docker compose down -v
sudo docker compose up --build
```

```bash
cd demo/apps/lake
# Create virtualenv
python -m venv .venv
source .venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Run ingestion and ETL
python ingestion.py
python etl_to_postgres.py
# Run lake service
gunicorn app:app --bind 0.0.0.0:8000
```

The zOS directory contains mainframe integration tools but has minimal executable code. Refer to zOS/README.md for architecture details.
This is a pnpm monorepo with three main applications:
- Framework: Next.js 15 with App Router
- Language: TypeScript
- Styling: Tailwind CSS
- Charts: amCharts4
- Testing: Vitest, Testing Library, Cypress
- Structure:
  - `app/` - Next.js app router pages (dashboard, agencies, corrections, trends, reports)
  - `components/` - Reusable React components (BarChart, LineChart, etc.)
- Framework: Express.js
- Language: TypeScript
- Database: PostgreSQL (via `pg` client)
- Previously: Used DuckDB, migrated to PostgreSQL for production
- Endpoints: 11 REST endpoints for agencies, corrections, and trends data
- Port: 4000 (configurable via `API_PORT`)
- Framework: Flask + Gunicorn
- Language: Python
- Analytics Engine: DuckDB for in-memory analytics
- Data Pipeline:
  - `ingestion.py` - Downloads eCFR data, validates checksums (SHA-256; see the sketch after this list), stores in DuckDB
  - `analytics.py` - Calculates RVI (Regulatory Volatility Index) and other metrics
  - `etl_to_postgres.py` - Migrates transformed data from DuckDB to PostgreSQL
- Schemas: `duckdb_schema.sql` and `postgres_schema.sql`
- Port: 8000
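The checksum step in `ingestion.py` amounts to comparing a computed SHA-256 digest against a recorded value. A minimal sketch, with illustrative function names (not the actual implementation):

```python
import hashlib

def sha256_of(payload: bytes) -> str:
    """Hex SHA-256 digest of a downloaded payload."""
    return hashlib.sha256(payload).hexdigest()

def verify_checksum(payload: bytes, expected: str) -> None:
    """Reject a download whose digest does not match the recorded value."""
    actual = sha256_of(payload)
    if actual != expected:
        raise ValueError(f"checksum mismatch: expected {expected}, got {actual}")
```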
- `packages/tailwind-config` - Shared Tailwind configuration
eCFR API → ingestion.py → DuckDB (analytics) → etl_to_postgres.py → PostgreSQL → Express API → Next.js Web
- Ingestion: Python fetches eCFR corrections and agency data, validates checksums
- Analytics: DuckDB performs analytical transformations (RVI calculation, aggregations)
- ETL: Data loaded into PostgreSQL for API consumption
- API Layer: Express serves REST endpoints with PostgreSQL queries
- Frontend: Next.js fetches from API and renders charts/dashboards
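A compact sketch of steps 2-4, assuming the `duckdb` and `psycopg2` packages; the table and column names here are invented for illustration:

```python
import duckdb
import psycopg2

# Step 2: analytical aggregation in DuckDB (columnar, fast)
duck = duckdb.connect("ecfr_analytics.duckdb")
rows = duck.execute(
    "SELECT agency_slug, COUNT(*) AS corrections "
    "FROM corrections GROUP BY agency_slug"
).fetchall()

# Step 3: load the transformed rows into PostgreSQL for API consumption
pg = psycopg2.connect(
    "postgresql://stafferfi:stafferfi_dev@localhost:5432/ecfr_analytics"
)
with pg, pg.cursor() as cur:  # commits on success
    cur.executemany(
        "INSERT INTO agency_stats (agency_slug, corrections) VALUES (%s, %s)",
        rows,
    )
pg.close()
```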
The Dockerfile uses a four-stage build:
- `deps` (node:20-alpine) - Installs pnpm dependencies
- `builder` - Builds Next.js (standalone output) and API (TypeScript), creates self-contained API bundle
- `lake-deps` (python:3.10-slim) - Creates Python virtualenv with lake dependencies
- `runner` (python:3.10-slim) - Final image with all services
supervisord manages all services in a single container:
- lake_pipeline (priority 5) - Runs ingestion + ETL once at startup
- web (priority 20) - Next.js standalone server on port 3000
- api (priority 20) - Express API on port 4000
- lake (priority 20) - Gunicorn Flask app on port 8000
The priority system ensures ETL completes before web services start.
Preferred for development:
- postgres service with health checks
- etl one-shot service (depends on postgres health)
- api service (depends on postgres + etl completion)
- web service (depends on api)
- `DATABASE_URL` - PostgreSQL connection string (default: `postgresql://stafferfi:stafferfi_dev@localhost:5432/ecfr_analytics`)
- `API_PORT` - API listen port (default: 4000)
- `NODE_ENV` - Node environment (production/development)
- `PORT` - Web server port (default: 3000)
- `HOSTNAME` - Bind address (default: 0.0.0.0)
- `API_URL` - Internal API URL for SSR (default: http://api:4000)
- `NEXT_PUBLIC_API_URL` - Client-side API URL (default: http://localhost:4000)
- `NEXT_TELEMETRY_DISABLED` - Set to 1 in production
- `DATABASE_URL` - PostgreSQL connection string for ETL
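For instance, `etl_to_postgres.py` presumably resolves its connection string along these lines (a sketch, not the actual code):

```python
import os

# Fall back to the documented development default when DATABASE_URL is unset
DATABASE_URL = os.environ.get(
    "DATABASE_URL",
    "postgresql://stafferfi:stafferfi_dev@localhost:5432/ecfr_analytics",
)
```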
- Tool: pnpm with workspaces
- Rationale: Efficient disk usage, strict dependency resolution, fast
- Config: `pnpm-workspace.yaml` defines workspace structure
- Web app built with `output: 'standalone'` in next.config.ts
- Produces self-contained server bundle (no external dependencies)
- Reduces runtime image size significantly
The API build process creates an isolated bundle:
- TypeScript compiled to `dist/`
- Copied to `/tmp/api` with package.json
- Production dependencies installed via npm (not pnpm)
- Decouples runtime from monorepo structure
- DuckDB: Used for analytical transformations (columnar, fast aggregations)
- PostgreSQL: Used for API queries (ACID, connection pooling)
- Why Both: DuckDB excels at ETL analytics, PostgreSQL serves web requests
Custom metric calculated in analytics.py:
- Measures frequency and impact of regulatory corrections
- Combines correction count, recency, and magnitude
- Core business logic for the eCFR analytics platform
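The exact formula lives in `analytics.py`; purely as an illustration of combining those three inputs, here is a sketch with invented weights and scaling (not the real metric):

```python
def rvi(correction_count: int, days_since_last: int, avg_magnitude: float) -> float:
    """Illustrative volatility score: more corrections, more recent,
    and larger ones all raise the index."""
    recency = 1.0 / (1.0 + days_since_last / 365.0)  # decays as corrections age
    return correction_count * recency * avg_magnitude
```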
- Unit Tests: Vitest + Testing Library
- E2E Tests: Cypress
- Storybook: Component development and visual testing
- TypeScript compilation serves as type checking
- No explicit test suite (integration tests via E2E)
- `test_pipeline.py` - Data integrity, checksum verification, analytics validation
- `test_postgres.py` - PostgreSQL schema and ETL verification
- Run with: `python test_pipeline.py`
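The PostgreSQL verification boils down to checks of this shape (a hedged sketch using `psycopg2` and an assumed table name, not the actual test code):

```python
import psycopg2

conn = psycopg2.connect(
    "postgresql://stafferfi:stafferfi_dev@localhost:5432/ecfr_analytics"
)
with conn.cursor() as cur:
    # to_regclass returns NULL when the relation does not exist
    cur.execute("SELECT to_regclass('public.corrections')")
    assert cur.fetchone()[0] is not None, "corrections table missing"
conn.close()
```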
- Always run `pnpm install` from the repo root
- Use workspace filters: `pnpm --filter @stafferfi/web <command>`
- Workspace names: `@stafferfi/web`, `@stafferfi/api`
- Postgres runs in Docker (non-persistent tmpfs for MVP)
- Schema changes: Edit `apps/lake/postgres_schema.sql` and rebuild ETL
- DuckDB file: `apps/lake/ecfr_analytics.duckdb` (gitignored)
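To poke at that file without disturbing the pipeline, the `duckdb` Python package can open it read-only (a sketch; run from the repo root):

```python
import duckdb

# Read-only connection: an inspection session cannot mutate pipeline state
con = duckdb.connect("apps/lake/ecfr_analytics.duckdb", read_only=True)
print(con.execute("SHOW TABLES").fetchall())
con.close()
```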
- Use `docker compose` for development (orchestrates dependencies)
- Use `./demo.sh` for quick demos
- The single-container `stafferfi-all` image requires external Postgres
- Always check `docker compose logs -f` when debugging
Inside the container:
```bash
supervisorctl -c /etc/supervisord.conf status
supervisorctl -c /etc/supervisord.conf tail <service> stdout
supervisorctl -c /etc/supervisord.conf restart <service>
```

The lake_pipeline program runs ingestion AND ETL sequentially in one command to prevent multiple processes from accessing DuckDB simultaneously. Never run these as separate supervisor programs.
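Conceptually the one-shot program is just sequential execution in a single process; a hedged sketch of the equivalent:

```python
import subprocess
import sys

# Ingestion, then ETL, in one process: exactly one writer ever
# touches the DuckDB file at a time
for script in ("ingestion.py", "etl_to_postgres.py"):
    result = subprocess.run([sys.executable, script])
    if result.returncode != 0:
        sys.exit(result.returncode)  # don't run ETL after a failed ingestion
```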
The API (`apps/api/src/index.ts`) exposes:
- `/` - API metadata and endpoint list
- `/health` - Health check
- `/api/stats` - Aggregate statistics
- `/api/agencies` - List agencies (supports pagination)
- `/api/agencies/:slug` - Agency details
- `/api/agencies/top/corrections` - Top agencies by correction count
- `/api/agencies/top/rvi` - Top agencies by RVI
- `/api/corrections` - List corrections (filterable by year, title)
- `/api/corrections/recent` - Recent corrections
- `/api/trends/yearly` - Yearly trend data
- `/api/trends/monthly` - Monthly trend data
- `/api/trends/titles` - Top CFR titles
- `/api/reports/word-count` - Word count report (if implemented)
- `/api/reports/scorecard` - Scorecard report (if implemented)
- `app/page.tsx` - Dashboard with stats cards and charts
- `app/agencies/page.tsx` - Sortable, searchable agency list
- `app/agencies/[slug]/page.tsx` - Agency detail page
- `app/corrections/page.tsx` - Corrections list
- `app/trends/page.tsx` - Trend visualizations
- `app/reports/*` - Report pages