Development Guide

This guide covers setting up your development environment, running migrations, and common development workflows for the ETL project.

Table of Contents

Prerequisites
Quick Start
Database Setup
- Using the Setup Script
- Manual Setup
Database Migrations
- ETL API Migrations
- ETL Replicator Migrations
Running the Services
- ETL API
- ETL Replicator
Running Tests
Troubleshooting

Prerequisites

Before starting, ensure you have the following installed:

Required Tools

Rust (latest stable): Install via rustup
PostgreSQL client (psql): Required for database operations
Docker Compose: For running PostgreSQL and other services
kubectl: For Kubernetes operations
SQLx CLI: For database migrations

Install SQLx CLI:

cargo install --version='~0.8.6' sqlx-cli --no-default-features --features rustls,postgres

Optional Tools

OrbStack: Recommended for local Kubernetes development (alternative to Docker Desktop)
- Install OrbStack
- Enable Kubernetes in OrbStack settings

Quick Start

The fastest way to get started is using the setup script:

# From the project root
./scripts/init.sh

This script will:

Start PostgreSQL via Docker Compose
Run etl-api migrations
Seed the default replicator image
Configure the Kubernetes environment (OrbStack)

Database Setup

Using the Setup Script

The scripts/init.sh script provides a complete development environment setup:

# Use default settings (Postgres on port 5430)
./scripts/init.sh

# Customize database settings
POSTGRES_PORT=5432 POSTGRES_DB=mydb ./scripts/init.sh

# Skip Docker if you already have Postgres running
SKIP_DOCKER=1 ./scripts/init.sh

# Use persistent storage
POSTGRES_DATA_VOLUME=/path/to/data ./scripts/init.sh

Environment Variables:

| Variable | Default | Description |
|---|---|---|
| `POSTGRES_USER` | `postgres` | Database user |
| `POSTGRES_PASSWORD` | `postgres` | Database password |
| `POSTGRES_DB` | `postgres` | Database name |
| `POSTGRES_PORT` | `5430` | Database port |
| `POSTGRES_HOST` | `localhost` | Database host |
| `SKIP_DOCKER` | (empty) | Skip Docker Compose if set |
| `POSTGRES_DATA_VOLUME` | (empty) | Path for persistent storage |
| `REPLICATOR_IMAGE` | `ramsup/etl-replicator:latest` | Default replicator image |
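
The defaults in the table can be combined into a connection URL for the manual commands later in this guide. The helper below is a minimal sketch: `build_database_url` is a hypothetical name, and how `scripts/init.sh` actually assembles its URL is not shown here.

```shell
# Hypothetical helper (not part of the repository): compose a connection URL
# from the POSTGRES_* variables, falling back to the documented defaults.
# Passwords containing URL-special characters would need percent-encoding,
# which this sketch does not do.
build_database_url() {
  printf 'postgres://%s:%s@%s:%s/%s\n' \
    "${POSTGRES_USER:-postgres}" \
    "${POSTGRES_PASSWORD:-postgres}" \
    "${POSTGRES_HOST:-localhost}" \
    "${POSTGRES_PORT:-5430}" \
    "${POSTGRES_DB:-postgres}"
}
```

For example, `export DATABASE_URL="$(build_database_url)"` yields a URL that matches whatever overrides were passed to the setup script.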

PostgreSQL 18+ containers store data under /var/lib/postgresql/<major>/data, so the Docker Compose setup mounts the parent /var/lib/postgresql directory to keep upgrades compatible.

Manual Setup

If you prefer manual setup or have an existing PostgreSQL instance:

Important: The etl-api and etl-replicator migrations can run on separate databases. You might have:

The etl-api using its own dedicated Postgres instance for the control plane
The etl-replicator state store on the same database you're replicating from (source database)
Or both on the same database (for simpler local development setups)

Single Database Setup

If using one database for both the API and replicator state:

export DATABASE_URL=postgres://USER:PASSWORD@HOST:PORT/DB

# Run both migrations on the same database
./etl-api/scripts/run_migrations.sh
./etl-replicator/scripts/run_migrations.sh

Separate Database Setup

If using separate databases (recommended for production):

# API migrations on the control plane database
export DATABASE_URL=postgres://USER:PASSWORD@API_HOST:PORT/API_DB
./etl-api/scripts/run_migrations.sh

# Replicator migrations on the source database
export DATABASE_URL=postgres://USER:PASSWORD@SOURCE_HOST:PORT/SOURCE_DB
./etl-replicator/scripts/run_migrations.sh

This separation allows you to:

Scale the control plane independently from replication workloads
Keep the replicator state close to the source data
Isolate concerns between infrastructure management and data replication
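
With two databases in play, an exported `DATABASE_URL` can easily point at the wrong one when the second script runs. One way to reduce that risk is to scope the variable to a single command. This is a sketch only: `with_database_url` is a hypothetical helper, and it assumes the migration scripts read `DATABASE_URL` from the environment, as the examples above suggest.

```shell
# Hypothetical helper: run one command with DATABASE_URL set only for that
# command. The function body runs in a subshell, so nothing leaks into the
# calling shell.
with_database_url() (
  url=$1
  shift
  env DATABASE_URL="$url" "$@"
)

# Usage (placeholders as in the examples above):
#   with_database_url "postgres://USER:PASSWORD@API_HOST:PORT/API_DB" ./etl-api/scripts/run_migrations.sh
#   with_database_url "postgres://USER:PASSWORD@SOURCE_HOST:PORT/SOURCE_DB" ./etl-replicator/scripts/run_migrations.sh
```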

Database Migrations

The project uses SQLx for database migrations. There are two sets of migrations:

ETL API Migrations

Located in etl-api/migrations/, these create the control plane schema (app schema) for managing tenants, sources, destinations, and pipelines.

Running API migrations:

# From project root
./etl-api/scripts/run_migrations.sh

# Or manually with SQLx CLI
sqlx migrate run --source etl-api/migrations

Creating a new API migration:

cd etl-api
sqlx migrate add <migration_name>

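Reverting the most recent API migration (sqlx migrate revert undoes one migration per run and only works for reversible migrations; sqlx database reset drops, recreates, and re-migrates the whole database):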

cd etl-api
sqlx migrate revert

Updating SQLx metadata after schema changes:

cd etl-api
cargo sqlx prepare

ETL Replicator Migrations

Located in etl-replicator/migrations/, these create the replicator's state store schema (etl schema) for tracking replication state, table schemas, and mappings.

Running replicator migrations:

# From project root
./etl-replicator/scripts/run_migrations.sh

# Or manually with SQLx CLI (requires setting search_path)
psql "$DATABASE_URL" -c "create schema if not exists etl;"
sqlx migrate run --source etl-replicator/migrations --database-url "${DATABASE_URL}?options=-csearch_path%3Detl"
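
The manual command above appends `?options=...` directly, which produces an invalid URL if `DATABASE_URL` already carries a query string (for example `?sslmode=require`). The sketch below picks the right separator. `with_search_path` is a hypothetical helper, not something the repository provides.

```shell
# Hypothetical helper: append the search_path option to a connection URL,
# using '&' when the URL already has a query string and '?' otherwise.
# It does not merge with an existing options= parameter.
with_search_path() (
  url=$1
  schema=$2
  case $url in
    *\?*) sep='&' ;;
    *)    sep='?' ;;
  esac
  # %%3D prints the literal percent-encoded '=' (%3D).
  printf '%s%soptions=-csearch_path%%3D%s\n' "$url" "$sep" "$schema"
)
```

It could then be used as `sqlx migrate run --source etl-replicator/migrations --database-url "$(with_search_path "$DATABASE_URL" etl)"`.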

Important: Migrations are run automatically when using the etl-replicator binary (see etl-replicator/src/migrations.rs:16). However, if you integrate the etl crate directly into your own application as a library, you should run these migrations manually before starting your pipeline. This design decision ensures:

The standalone replicator binary works out-of-the-box
Library users have explicit control over when migrations run
CI/CD pipelines can pre-apply migrations independently

When to run migrations manually:

Integrating etl as a library in your own application
Pre-creating the state store schema before deployment
Testing migrations independently
CI/CD pipelines that separate migration and deployment steps

Creating a new replicator migration:

cd etl-replicator
sqlx migrate add <migration_name>

Running the Services

Both etl-api and etl-replicator binaries use hierarchical configuration loading from the configuration/ directory within each crate. Configuration is loaded in this order:

Base configuration: configuration/base.yaml (always loaded)
Environment-specific: configuration/{environment}.yaml (e.g., dev.yaml, prod.yaml)
Environment variable overrides: Prefixed with APP_ (e.g., APP_DATABASE__URL)
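
The example name `APP_DATABASE__URL` suggests that the `APP_` prefix is stripped and a double underscore separates nesting levels. The sketch below illustrates that mapping under exactly that assumption. The real parsing lives in the crates' configuration loading code, and `env_to_config_key` is a hypothetical name.

```shell
# Hypothetical illustration: map an override variable name to the dotted
# config key it would address, ASSUMING "APP_" is the prefix and "__" is the
# nesting separator.
env_to_config_key() {
  printf '%s\n' "${1#APP_}" | sed 's/__/./g' | tr '[:upper:]' '[:lower:]'
}
```

Under this assumption, `env_to_config_key APP_DATABASE__URL` prints `database.url`.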

Environment Selection:

The environment is determined by the APP_ENVIRONMENT variable:

Default: prod (if APP_ENVIRONMENT is not set)
Available: dev, staging, prod

# Run with dev environment
APP_ENVIRONMENT=dev cargo run

# Run with production environment (default)
cargo run

# Override specific config values
APP_ENVIRONMENT=dev APP_DATABASE__URL=postgres://localhost/mydb cargo run
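
To see which files a given invocation would read, the load order can be written out as a small function. This is a sketch of the rules described above, not the loader itself, and `config_layers` is a hypothetical name.

```shell
# Hypothetical illustration: print the YAML layers in load order for the
# current APP_ENVIRONMENT, which falls back to prod when unset.
config_layers() (
  env_name=${APP_ENVIRONMENT:-prod}
  printf '%s\n' "configuration/base.yaml" "configuration/${env_name}.yaml"
)
```

Running `APP_ENVIRONMENT=dev config_layers` lists `configuration/base.yaml` followed by `configuration/dev.yaml`. Any `APP_` variables are applied on top of those files.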

ETL API

Running from Source

cd etl-api
APP_ENVIRONMENT=dev cargo run

The API loads configuration from etl-api/configuration/{environment}.yaml. See etl-api/README.md for available configuration options.

Running with Docker

Docker images are available for the etl-api. You must mount the configuration files and can override settings via environment variables:

docker run \
  -v "$(pwd)/etl-api/configuration/base.yaml:/app/configuration/base.yaml" \
  -v "$(pwd)/etl-api/configuration/dev.yaml:/app/configuration/dev.yaml" \
  -e APP_ENVIRONMENT=dev \
  -p 8080:8080 \
  ramsup/etl-api:latest

Configuration requirements:

Mount both base.yaml and your environment-specific config file (e.g., dev.yaml)
Set APP_ENVIRONMENT to match your mounted environment file
Override specific values using APP_ prefixed environment variables
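
When a `-v` source path does not exist, Docker creates an empty directory in its place, so a missing config file tends to surface later as a confusing load error. A quick pre-flight check avoids that. This is a minimal sketch, and `check_config_mounts` is a hypothetical helper.

```shell
# Hypothetical helper: verify that base.yaml and the environment-specific
# file both exist in a configuration directory before mounting them.
# Returns non-zero and names each missing file on stderr.
check_config_mounts() (
  config_dir=$1
  env_name=${2:-prod}
  status=0
  for file in "$config_dir/base.yaml" "$config_dir/$env_name.yaml"; do
    if [ ! -f "$file" ]; then
      echo "missing config file: $file" >&2
      status=1
    fi
  done
  exit "$status"
)
```

For example, `check_config_mounts etl-api/configuration dev && docker run ...` only starts the container when both files are present.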

Kubernetes Setup (ETL API Only)

The etl-api manages replicator deployments on Kubernetes by dynamically creating StatefulSets, Secrets, and ConfigMaps. The etl-api requires Kubernetes, but the etl-replicator binary can run independently without any Kubernetes setup.

Prerequisites:

OrbStack with Kubernetes enabled (or another local Kubernetes cluster)
kubectl configured with the orbstack context
Pre-defined Kubernetes resources (see below)

Required Pre-Defined Resources:

The etl-api expects these resources to exist before it can deploy replicators:

Namespace: etl-data-plane - Where all replicator pods and related resources are created
ConfigMap: trusted-root-certs-config - Provides trusted root certificates for TLS connections

These are defined in scripts/ and should be applied before running the API:

kubectl --context orbstack apply -f scripts/etl-data-plane.yaml
kubectl --context orbstack apply -f scripts/trusted-root-certs-config.yaml

Note: For the complete list of expected Kubernetes resources and their specifications, refer to the constants and resource creation logic in etl-api/src/k8s/http.rs.
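
Before starting the API, it can help to confirm that both pre-defined resources exist. The sketch below assumes the ConfigMap lives in the `etl-data-plane` namespace, which this guide does not state explicitly. `check_k8s_prereqs` is a hypothetical helper.

```shell
# Hypothetical helper: check that the namespace and ConfigMap expected by
# etl-api exist in the given kubectl context (default: orbstack).
# ASSUMPTION: trusted-root-certs-config is created in etl-data-plane.
check_k8s_prereqs() (
  ctx=${1:-orbstack}
  kubectl --context "$ctx" get namespace etl-data-plane >/dev/null || exit 1
  kubectl --context "$ctx" --namespace etl-data-plane \
    get configmap trusted-root-certs-config >/dev/null || exit 1
)
```

A non-zero exit status means at least one resource is missing, in which case the two `kubectl apply` commands above should be run again.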

ETL Replicator

The replicator can run as a standalone binary without Kubernetes.

Running from Source

cd etl-replicator
APP_ENVIRONMENT=dev cargo run

The replicator loads configuration from etl-replicator/configuration/{environment}.yaml.

Running with Docker

Docker images are available for the etl-replicator. You must mount the configuration files and can override settings via environment variables:

docker run \
  -v "$(pwd)/etl-replicator/configuration/base.yaml:/app/configuration/base.yaml" \
  -v "$(pwd)/etl-replicator/configuration/dev.yaml:/app/configuration/dev.yaml" \
  -e APP_ENVIRONMENT=dev \
  ramsup/etl-replicator:latest

Configuration requirements:

Mount both base.yaml and your environment-specific config file (e.g., dev.yaml)
Set APP_ENVIRONMENT to match your mounted environment file
Override specific values using APP_ prefixed environment variables

Note: While the replicator is typically deployed as a Kubernetes pod managed by the etl-api, it does not require Kubernetes to function. You can run it as a standalone process on any machine with the appropriate configuration.

Running Tests

The project includes comprehensive test suites that require a PostgreSQL database. Tests use environment variables for database configuration to ensure isolation and reproducibility.

Test Environment Variables

PostgreSQL Test Variables

Tests that interact with PostgreSQL read the following environment variables. Host, port, and username are required; the password is optional:

| Variable | Required | Description |
|---|---|---|
| `TESTS_DATABASE_HOST` | Yes | PostgreSQL server hostname (e.g., `localhost`) |
| `TESTS_DATABASE_PORT` | Yes | PostgreSQL server port (e.g., `5430`) |
| `TESTS_DATABASE_USERNAME` | Yes | Database user (e.g., `postgres`) |
| `TESTS_DATABASE_PASSWORD` | No | Database password (optional) |

Note: Each test creates a unique database with a UUID-based name to ensure test isolation. The test databases are automatically cleaned up after tests complete.
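
A missing variable usually shows up only once a test fails, so it can save time to check the required ones up front. This is a minimal sketch, and `check_test_env` is a hypothetical helper rather than part of the test suite.

```shell
# Hypothetical helper: fail fast if any required PostgreSQL test variable
# is unset or empty, listing the missing names on stderr.
check_test_env() (
  missing=""
  for name in TESTS_DATABASE_HOST TESTS_DATABASE_PORT TESTS_DATABASE_USERNAME; do
    # Indirect lookup of the variable named by $name (POSIX-compatible).
    eval "value=\${$name:-}"
    [ -n "$value" ] || missing="$missing $name"
  done
  if [ -n "$missing" ]; then
    echo "missing required variables:$missing" >&2
    exit 1
  fi
)
```

It can be chained in front of the test run, for example `check_test_env && cargo test -p etl-api`.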

BigQuery Test Variables

BigQuery destination tests require Google Cloud credentials:

| Variable | Required | Description |
|---|---|---|
| `TESTS_BIGQUERY_PROJECT_ID` | Yes | GCP project ID for BigQuery |
| `TESTS_BIGQUERY_SA_KEY_PATH` | Yes | Path to service account JSON key file |

Note: BigQuery tests are only run when the bigquery and test-utils features are enabled. Each test creates a unique dataset with a UUID-based name for isolation.
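
The BigQuery requirements can be checked the same way before enabling the features. This sketch uses a hypothetical helper name, and the usage line assumes the BigQuery tests live in the `etl-destinations` package, which this guide implies but does not state.

```shell
# Hypothetical helper: succeed only when the project ID is set and the
# service account key path points at a readable file.
bigquery_env_ready() {
  [ -n "${TESTS_BIGQUERY_PROJECT_ID:-}" ] && [ -r "${TESTS_BIGQUERY_SA_KEY_PATH:-}" ]
}

# Usage (package name is an assumption):
#   bigquery_env_ready && cargo test -p etl-destinations --features bigquery,test-utils
```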

Iceberg Test Variables

Iceberg destination tests use local MinIO and Lakekeeper instances. The following services must be running:

Lakekeeper: http://localhost:8182 (REST catalog)
MinIO: http://localhost:9010 (S3-compatible storage)
- Username: minio-admin
- Password: minio-admin-password

Note: Iceberg tests are only run when the iceberg and test-utils features are enabled. These use hardcoded local URLs and do not require environment variables.
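
Because the Iceberg tests depend on two local services, a reachability probe can distinguish "service not started" from a genuine test failure. The sketch below only checks that each URL answers at all. It does not validate credentials or catalog state, and `iceberg_services_up` is a hypothetical helper.

```shell
# Hypothetical helper: check that Lakekeeper and MinIO accept HTTP
# connections on the URLs listed above. Any HTTP response counts as
# reachable; only connection failures and timeouts are reported.
iceberg_services_up() (
  for url in http://localhost:8182 http://localhost:9010; do
    if ! curl --silent --output /dev/null --max-time 2 "$url"; then
      echo "not reachable: $url" >&2
      exit 1
    fi
  done
)
```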

Test Output and Logging

| Variable | Description |
|---|---|
| `ENABLE_TRACING=1` | Enable tracing output during test execution (useful for debugging) |
| `RUST_LOG` | Control log level (e.g., `debug`, `info`, `warn`, `error`) |

Example:

# Run tests with debug output
ENABLE_TRACING=1 RUST_LOG=debug cargo test test_name -- --nocapture

Setting Up Test Environment

Option 1: Inline Environment Variables (Recommended)

The most reliable way is to set environment variables directly in the test command:

TESTS_DATABASE_HOST=localhost TESTS_DATABASE_PORT=5430 TESTS_DATABASE_USERNAME=postgres TESTS_DATABASE_PASSWORD=postgres cargo test -p etl-api

Option 2: Export in Current Shell Session

Export variables in your current shell session, then run tests:

# PostgreSQL test configuration
export TESTS_DATABASE_HOST=localhost
export TESTS_DATABASE_PORT=5430
export TESTS_DATABASE_USERNAME=postgres
export TESTS_DATABASE_PASSWORD=postgres

# BigQuery test configuration (optional - only needed for BigQuery tests)
export TESTS_BIGQUERY_PROJECT_ID=your-gcp-project-id
export TESTS_BIGQUERY_SA_KEY_PATH=/path/to/service-account-key.json

# Enable test output (optional)
export ENABLE_TRACING=1
export RUST_LOG=info

# Now run tests
cargo test -p etl-api

Option 3: Use a `.env` File

Create a .env.test file and source it:

# .env.test

# PostgreSQL (required for most tests)
TESTS_DATABASE_HOST=localhost
TESTS_DATABASE_PORT=5430
TESTS_DATABASE_USERNAME=postgres
TESTS_DATABASE_PASSWORD=postgres

# BigQuery (optional - only for BigQuery tests)
TESTS_BIGQUERY_PROJECT_ID=your-gcp-project-id
TESTS_BIGQUERY_SA_KEY_PATH=/path/to/service-account-key.json

# Test output (optional)
ENABLE_TRACING=1
RUST_LOG=info

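# Source the file with auto-export enabled so that cargo test inherits the variables
# (a plain `source` sets them in the shell but does not export them to child processes)
set -a
source .env.test
set +a
cargo test -p etl-api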

Running Tests

Important: Environment variables must be set in the same command as cargo test, or exported in your current shell session before running tests.

# Run all tests (requires env variables)
TESTS_DATABASE_HOST=localhost TESTS_DATABASE_PORT=5430 TESTS_DATABASE_USERNAME=postgres TESTS_DATABASE_PASSWORD=postgres cargo test

# Run tests for a specific package
TESTS_DATABASE_HOST=localhost TESTS_DATABASE_PORT=5430 TESTS_DATABASE_USERNAME=postgres TESTS_DATABASE_PASSWORD=postgres cargo test -p etl-api

# Run tests for packages with test-utils feature (etl, etl-postgres, etl-destinations)
TESTS_DATABASE_HOST=localhost TESTS_DATABASE_PORT=5430 TESTS_DATABASE_USERNAME=postgres TESTS_DATABASE_PASSWORD=postgres cargo test -p etl --features test-utils

# Run a specific test
TESTS_DATABASE_HOST=localhost TESTS_DATABASE_PORT=5430 TESTS_DATABASE_USERNAME=postgres TESTS_DATABASE_PASSWORD=postgres cargo test -p etl-api --test tenants tenant_can_be_created

# Run tests with tracing output for debugging
TESTS_DATABASE_HOST=localhost TESTS_DATABASE_PORT=5430 TESTS_DATABASE_USERNAME=postgres TESTS_DATABASE_PASSWORD=postgres ENABLE_TRACING=1 RUST_LOG=info cargo test -p etl-api --test tenants tenant_can_be_created -- --nocapture

Packages requiring --features test-utils:

etl
etl-postgres
etl-destinations

Packages that don't require feature flags:

etl-api
etl-config
etl-telemetry
etl-replicator

Note: Ensure PostgreSQL is running and accessible at the configured host and port before running tests. The test suite will fail if it cannot connect to the database or if the required environment variables are not set.
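
When scripting test runs across several packages, the two lists above can be encoded once instead of being remembered per command. This is a minimal sketch, and `test_feature_flags` is a hypothetical helper.

```shell
# Hypothetical helper: print the extra cargo flags a package needs for its
# tests, based on the package lists above. Prints nothing for packages that
# need no feature flags.
test_feature_flags() {
  case $1 in
    etl|etl-postgres|etl-destinations) printf '%s\n' '--features test-utils' ;;
    *) : ;;
  esac
}

# Usage (the expansion is left unquoted on purpose so it splits into two words):
#   cargo test -p etl-postgres $(test_feature_flags etl-postgres)
```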

Troubleshooting

Database Connection Issues

If you encounter connection issues:

Verify PostgreSQL is running:

docker-compose -f scripts/docker-compose.yaml ps

Check the connection:
```
psql "$DATABASE_URL" -c "SELECT 1;"
```
Ensure the correct port is used (default: 5430)
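
A port mismatch is easy to miss inside a long connection URL. The sketch below extracts the port so it can be compared with the expected value. `url_port` is a hypothetical helper, and it does not handle bracketed IPv6 hosts.

```shell
# Hypothetical helper: print the port from a postgres:// URL, or 5432 (the
# PostgreSQL client default) when the URL does not specify one.
url_port() (
  rest=${1#*://}            # drop the scheme
  hostport=${rest%%/*}      # drop the path
  hostport=${hostport%%\?*} # drop any query string
  hostport=${hostport##*@}  # drop credentials
  case $hostport in
    *:*) printf '%s\n' "${hostport##*:}" ;;
    *)   printf '%s\n' 5432 ;;
  esac
)
```

For the default development setup, `url_port "$DATABASE_URL"` should print `5430`.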

Migration Issues

If migrations fail:

Check if the database exists:
```
psql "$DATABASE_URL" -c "\l"
```
Verify SQLx CLI is installed:
```
sqlx --version
```

Check migration history:

psql "$DATABASE_URL" -c "SELECT * FROM _sqlx_migrations;"

Kubernetes Issues

If Kubernetes resources aren't deploying:

Verify context:
```
kubectl config current-context
```
Check cluster status:
```
kubectl cluster-info
```

View events:

kubectl get events -n etl-data-plane --sort-by='.lastTimestamp'
