This guide covers setting up your development environment, running migrations, and common development workflows for the ETL project.
- Prerequisites
- Quick Start
- Database Setup
- Database Migrations
- Running the Services
- Kubernetes Setup
- Common Development Tasks
Before starting, ensure you have the following installed:
- Rust (latest stable)
- PostgreSQL client (`psql`): Required for database operations
- Docker Compose: For running PostgreSQL and other services
- kubectl: For Kubernetes operations
- SQLx CLI: For database migrations
Install SQLx CLI:

```shell
cargo install --version='~0.8.6' sqlx-cli --no-default-features --features rustls,postgres
```

- OrbStack: Recommended for local Kubernetes development (alternative to Docker Desktop)
  - Install OrbStack
  - Enable Kubernetes in OrbStack settings
The fastest way to get started is using the setup script:
```shell
# From the project root
./scripts/init.sh
```

This script will:
- Start PostgreSQL via Docker Compose
- Run etl-api migrations
- Seed the default replicator image
- Configure the Kubernetes environment (OrbStack)
The scripts/init.sh script provides a complete development environment setup:
```shell
# Use default settings (Postgres on port 5430)
./scripts/init.sh

# Customize database settings
POSTGRES_PORT=5432 POSTGRES_DB=mydb ./scripts/init.sh

# Skip Docker if you already have Postgres running
SKIP_DOCKER=1 ./scripts/init.sh

# Use persistent storage
POSTGRES_DATA_VOLUME=/path/to/data ./scripts/init.sh
```

Environment Variables:
| Variable | Default | Description |
|---|---|---|
| `POSTGRES_USER` | `postgres` | Database user |
| `POSTGRES_PASSWORD` | `postgres` | Database password |
| `POSTGRES_DB` | `postgres` | Database name |
| `POSTGRES_PORT` | `5430` | Database port |
| `POSTGRES_HOST` | `localhost` | Database host |
| `SKIP_DOCKER` | (empty) | Skip Docker Compose if set |
| `POSTGRES_DATA_VOLUME` | (empty) | Path for persistent storage |
| `REPLICATOR_IMAGE` | `ramsup/etl-replicator:latest` | Default replicator image |
PostgreSQL 18+ containers store data under `/var/lib/postgresql/<major>/data`, so the Docker Compose setup mounts the parent `/var/lib/postgresql` directory to keep upgrades compatible.
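To make the layout concrete, here is a small sketch of the version-specific path (assuming PostgreSQL 18; `PG_MAJOR` and `POSTGRES_DATA_VOLUME` are illustrative):

```shell
# PostgreSQL 18+ keeps data in a version-specific subdirectory, so mounting
# the parent /var/lib/postgresql survives major-version upgrades.
PG_MAJOR=18
DATA_DIR="/var/lib/postgresql/${PG_MAJOR}/data"
echo "$DATA_DIR"

# A docker run mirroring the Compose mount would look roughly like
# (not executed here; image tag is illustrative):
#   docker run -v "$POSTGRES_DATA_VOLUME":/var/lib/postgresql postgres:18
```

Mounting `/var/lib/postgresql/18/data` directly would break the volume on the next major upgrade, because PostgreSQL 19 would look for `/var/lib/postgresql/19/data`.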
If you prefer manual setup or have an existing PostgreSQL instance:
Important: The etl-api and etl-replicator migrations can run on separate databases. You might have:
- The etl-api using its own dedicated Postgres instance for the control plane
- The etl-replicator state store on the same database you're replicating from (source database)
- Or both on the same database (for simpler local development setups)
If using one database for both the API and replicator state:
```shell
export DATABASE_URL=postgres://USER:PASSWORD@HOST:PORT/DB

# Run both migrations on the same database
./etl-api/scripts/run_migrations.sh
./etl-replicator/scripts/run_migrations.sh
```

If using separate databases (recommended for production):
```shell
# API migrations on the control plane database
export DATABASE_URL=postgres://USER:PASSWORD@API_HOST:PORT/API_DB
./etl-api/scripts/run_migrations.sh

# Replicator migrations on the source database
export DATABASE_URL=postgres://USER:PASSWORD@SOURCE_HOST:PORT/SOURCE_DB
./etl-replicator/scripts/run_migrations.sh
```

This separation allows you to:
- Scale the control plane independently from replication workloads
- Keep the replicator state close to the source data
- Isolate concerns between infrastructure management and data replication
The project uses SQLx for database migrations. There are two sets of migrations:
Located in `etl-api/migrations/`, these create the control plane schema (`app` schema) for managing tenants, sources, destinations, and pipelines.
Running API migrations:
```shell
# From project root
./etl-api/scripts/run_migrations.sh

# Or manually with SQLx CLI
sqlx migrate run --source etl-api/migrations
```

Creating a new API migration:

```shell
cd etl-api
sqlx migrate add <migration_name>
```

Reverting the latest API migration:

```shell
cd etl-api
sqlx migrate revert
```

Updating SQLx metadata after schema changes:

```shell
cd etl-api
cargo sqlx prepare
```

Located in `etl-replicator/migrations/`, these create the replicator's state store schema (`etl` schema) for tracking replication state, table schemas, and mappings.
Running replicator migrations:
```shell
# From project root
./etl-replicator/scripts/run_migrations.sh

# Or manually with SQLx CLI (requires setting search_path)
psql $DATABASE_URL -c "create schema if not exists etl;"
sqlx migrate run --source etl-replicator/migrations --database-url "${DATABASE_URL}?options=-csearch_path%3Detl"
```

Important: Migrations are run automatically when using the etl-replicator binary (see `etl-replicator/src/migrations.rs:16`). However, if you integrate the `etl` crate directly into your own application as a library, you should run these migrations manually before starting your pipeline. This design decision ensures:
- The standalone replicator binary works out-of-the-box
- Library users have explicit control over when migrations run
- CI/CD pipelines can pre-apply migrations independently
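The `options` query parameter in the manual `sqlx migrate run` command above is URL-encoded; a quick sketch of what Postgres actually receives after decoding:

```shell
# ?options=-csearch_path%3Detl is URL-encoded: %3D decodes to '='.
# Postgres therefore receives -c search_path=etl, which scopes the
# migration tables to the etl schema rather than public.
encoded='-csearch_path%3Detl'
decoded="${encoded//%3D/=}"
echo "$decoded"
```

The `=` must stay encoded in the URL itself, since a literal `=` inside a query-string value can be misparsed.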
When to run migrations manually:
- Integrating `etl` as a library in your own application
- Pre-creating the state store schema before deployment
- Testing migrations independently
- CI/CD pipelines that separate migration and deployment steps
Creating a new replicator migration:
```shell
cd etl-replicator
sqlx migrate add <migration_name>
```

Both etl-api and etl-replicator binaries use hierarchical configuration loading from the `configuration/` directory within each crate. Configuration is loaded in this order:
- Base configuration: `configuration/base.yaml` (always loaded)
- Environment-specific: `configuration/{environment}.yaml` (e.g., `dev.yaml`, `prod.yaml`)
- Environment variable overrides: Prefixed with `APP_` (e.g., `APP_DATABASE__URL`)
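Judging from the `APP_DATABASE__URL` example, the override naming appears to follow the common convention of stripping the `APP_` prefix and treating the double underscore as a nesting separator; this mapping is an assumption, sketched below:

```shell
# Assumed mapping (not confirmed in this project's docs):
#   APP_DATABASE__URL -> database.url in the yaml hierarchy
var="APP_DATABASE__URL"
key="${var#APP_}"     # strip the APP_ prefix -> DATABASE__URL
key="${key//__/.}"    # __ separates nesting levels -> DATABASE.URL
key=$(echo "$key" | tr '[:upper:]' '[:lower:]')  # -> database.url
echo "$key"
```

The double underscore is needed because a single underscore can legitimately appear inside a key name.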
Environment Selection:
The environment is determined by the `APP_ENVIRONMENT` variable:

- Default: `prod` (if `APP_ENVIRONMENT` is not set)
- Available: `dev`, `staging`, `prod`
```shell
# Run with dev environment
APP_ENVIRONMENT=dev cargo run

# Run with production environment (default)
cargo run

# Override specific config values
APP_ENVIRONMENT=dev APP_DATABASE__URL=postgres://localhost/mydb cargo run
```

```shell
cd etl-api
APP_ENVIRONMENT=dev cargo run
```

The API loads configuration from `etl-api/configuration/{environment}.yaml`. See `etl-api/README.md` for available configuration options.
Docker images are available for the etl-api. You must mount the configuration files and can override settings via environment variables:
```shell
docker run \
  -v $(pwd)/etl-api/configuration/base.yaml:/app/configuration/base.yaml \
  -v $(pwd)/etl-api/configuration/dev.yaml:/app/configuration/dev.yaml \
  -e APP_ENVIRONMENT=dev \
  -p 8080:8080 \
  ramsup/etl-api:latest
```

Configuration requirements:

- Mount both `base.yaml` and your environment-specific config file (e.g., `dev.yaml`)
- Set `APP_ENVIRONMENT` to match your mounted environment file
- Override specific values using `APP_`-prefixed environment variables
The etl-api manages replicator deployments on Kubernetes by dynamically creating StatefulSets, Secrets, and ConfigMaps. The etl-api requires Kubernetes, but the etl-replicator binary can run independently without any Kubernetes setup.
Prerequisites:
- OrbStack with Kubernetes enabled (or another local Kubernetes cluster)
- `kubectl` configured with the `orbstack` context
- Pre-defined Kubernetes resources (see below)
Required Pre-Defined Resources:
The etl-api expects these resources to exist before it can deploy replicators:
- Namespace `etl-data-plane`: Where all replicator pods and related resources are created
- ConfigMap `trusted-root-certs-config`: Provides trusted root certificates for TLS connections
These are defined in scripts/ and should be applied before running the API:
```shell
kubectl --context orbstack apply -f scripts/etl-data-plane.yaml
kubectl --context orbstack apply -f scripts/trusted-root-certs-config.yaml
```

Note: For the complete list of expected Kubernetes resources and their specifications, refer to the constants and resource creation logic in `etl-api/src/k8s/http.rs`.
The replicator can run as a standalone binary without Kubernetes.
```shell
cd etl-replicator
APP_ENVIRONMENT=dev cargo run
```

The replicator loads configuration from `etl-replicator/configuration/{environment}.yaml`.
Docker images are available for the etl-replicator. You must mount the configuration files and can override settings via environment variables:
```shell
docker run \
  -v $(pwd)/etl-replicator/configuration/base.yaml:/app/configuration/base.yaml \
  -v $(pwd)/etl-replicator/configuration/dev.yaml:/app/configuration/dev.yaml \
  -e APP_ENVIRONMENT=dev \
  etl-replicator:latest
```

Configuration requirements:

- Mount both `base.yaml` and your environment-specific config file (e.g., `dev.yaml`)
- Set `APP_ENVIRONMENT` to match your mounted environment file
- Override specific values using `APP_`-prefixed environment variables
Note: While the replicator is typically deployed as a Kubernetes pod managed by the etl-api, it does not require Kubernetes to function. You can run it as a standalone process on any machine with the appropriate configuration.
The project includes comprehensive test suites that require a PostgreSQL database. Tests use environment variables for database configuration to ensure isolation and reproducibility.
All tests that interact with PostgreSQL require the following environment variables to be set:
| Variable | Required | Description |
|---|---|---|
| `TESTS_DATABASE_HOST` | Yes | PostgreSQL server hostname (e.g., `localhost`) |
| `TESTS_DATABASE_PORT` | Yes | PostgreSQL server port (e.g., `5430`) |
| `TESTS_DATABASE_USERNAME` | Yes | Database user (e.g., `postgres`) |
| `TESTS_DATABASE_PASSWORD` | No | Database password (optional) |
Note: Each test creates a unique database with a UUID-based name to ensure test isolation. The test databases are automatically cleaned up after tests complete.
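For a quick manual connection with the same settings the tests use, the variables compose into a standard Postgres URL (the trailing `postgres` database name here is just an example, not something the test suite mandates):

```shell
# Build a connection URL from the TESTS_* variables (example values).
TESTS_DATABASE_HOST=localhost
TESTS_DATABASE_PORT=5430
TESTS_DATABASE_USERNAME=postgres
TESTS_DATABASE_PASSWORD=postgres
DB_URL="postgres://${TESTS_DATABASE_USERNAME}:${TESTS_DATABASE_PASSWORD}@${TESTS_DATABASE_HOST}:${TESTS_DATABASE_PORT}/postgres"
echo "$DB_URL"
# e.g. psql "$DB_URL" -c "SELECT 1;"  (requires a running server)
```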
BigQuery destination tests require Google Cloud credentials:
| Variable | Required | Description |
|---|---|---|
| `TESTS_BIGQUERY_PROJECT_ID` | Yes | GCP project ID for BigQuery |
| `TESTS_BIGQUERY_SA_KEY_PATH` | Yes | Path to service account JSON key file |
Note: BigQuery tests are only run when the `bigquery` and `test-utils` features are enabled. Each test creates a unique dataset with a UUID-based name for isolation.
Iceberg destination tests use local MinIO and Lakekeeper instances. The following services must be running:
- Lakekeeper: `http://localhost:8182` (REST catalog)
- MinIO: `http://localhost:9010` (S3-compatible storage)
  - Username: `minio-admin`
  - Password: `minio-admin-password`
Note: Iceberg tests are only run when the `iceberg` and `test-utils` features are enabled. These use hardcoded local URLs and do not require environment variables.
| Variable | Description |
|---|---|
| `ENABLE_TRACING=1` | Enable tracing output during test execution (useful for debugging) |
| `RUST_LOG` | Control log level (e.g., `debug`, `info`, `warn`, `error`) |
Example:
```shell
# Run tests with debug output
ENABLE_TRACING=1 RUST_LOG=debug cargo test test_name -- --nocapture
```

The most reliable way is to set environment variables directly in the test command:
```shell
TESTS_DATABASE_HOST=localhost TESTS_DATABASE_PORT=5430 TESTS_DATABASE_USERNAME=postgres TESTS_DATABASE_PASSWORD=postgres cargo test -p etl-api
```

Export variables in your current shell session, then run tests:
```shell
# PostgreSQL test configuration
export TESTS_DATABASE_HOST=localhost
export TESTS_DATABASE_PORT=5430
export TESTS_DATABASE_USERNAME=postgres
export TESTS_DATABASE_PASSWORD=postgres

# BigQuery test configuration (optional - only needed for BigQuery tests)
export TESTS_BIGQUERY_PROJECT_ID=your-gcp-project-id
export TESTS_BIGQUERY_SA_KEY_PATH=/path/to/service-account-key.json

# Enable test output (optional)
export ENABLE_TRACING=1
export RUST_LOG=info

# Now run tests
cargo test -p etl-api
```

Create a `.env.test` file and source it:
```shell
# .env.test

# PostgreSQL (required for most tests)
TESTS_DATABASE_HOST=localhost
TESTS_DATABASE_PORT=5430
TESTS_DATABASE_USERNAME=postgres
TESTS_DATABASE_PASSWORD=postgres

# BigQuery (optional - only for BigQuery tests)
TESTS_BIGQUERY_PROJECT_ID=your-gcp-project-id
TESTS_BIGQUERY_SA_KEY_PATH=/path/to/service-account-key.json

# Test output (optional)
ENABLE_TRACING=1
RUST_LOG=info
```

```shell
# Source the file and run tests
source .env.test
cargo test -p etl-api
```

Important: Environment variables must be set in the same command as `cargo test`, or exported in your current shell session before running tests.
```shell
# Run all tests (requires env variables)
TESTS_DATABASE_HOST=localhost TESTS_DATABASE_PORT=5430 TESTS_DATABASE_USERNAME=postgres TESTS_DATABASE_PASSWORD=postgres cargo test

# Run tests for a specific package
TESTS_DATABASE_HOST=localhost TESTS_DATABASE_PORT=5430 TESTS_DATABASE_USERNAME=postgres TESTS_DATABASE_PASSWORD=postgres cargo test -p etl-api

# Run tests for packages with test-utils feature (etl, etl-postgres, etl-destinations)
TESTS_DATABASE_HOST=localhost TESTS_DATABASE_PORT=5430 TESTS_DATABASE_USERNAME=postgres TESTS_DATABASE_PASSWORD=postgres cargo test -p etl --features test-utils

# Run a specific test
TESTS_DATABASE_HOST=localhost TESTS_DATABASE_PORT=5430 TESTS_DATABASE_USERNAME=postgres TESTS_DATABASE_PASSWORD=postgres cargo test -p etl-api --test tenants tenant_can_be_created

# Run tests with tracing output for debugging
TESTS_DATABASE_HOST=localhost TESTS_DATABASE_PORT=5430 TESTS_DATABASE_USERNAME=postgres TESTS_DATABASE_PASSWORD=postgres ENABLE_TRACING=1 RUST_LOG=info cargo test -p etl-api --test tenants tenant_can_be_created -- --nocapture
```

Packages requiring `--features test-utils`:
- `etl`
- `etl-postgres`
- `etl-destinations`
Packages that don't require feature flags:
- `etl-api`
- `etl-config`
- `etl-telemetry`
- `etl-replicator`
Note: Ensure PostgreSQL is running and accessible at the configured host and port before running tests. The test suite will fail if it cannot connect to the database or if the required environment variables are not set.
If you encounter connection issues:
1. Verify PostgreSQL is running:

   ```shell
   docker-compose -f scripts/docker-compose.yaml ps
   ```

2. Check the connection:

   ```shell
   psql $DATABASE_URL -c "SELECT 1;"
   ```

3. Ensure the correct port is used (default: 5430)
If migrations fail:
1. Check if the database exists:

   ```shell
   psql $DATABASE_URL -c "\l"
   ```

2. Verify SQLx CLI is installed:

   ```shell
   sqlx --version
   ```

3. Check migration history:

   ```shell
   psql $DATABASE_URL -c "SELECT * FROM _sqlx_migrations;"
   ```
If Kubernetes resources aren't deploying:
1. Verify context:

   ```shell
   kubectl config current-context
   ```

2. Check cluster status:

   ```shell
   kubectl cluster-info
   ```

3. View events in the replicator namespace:

   ```shell
   kubectl get events -n etl-data-plane --sort-by='.lastTimestamp'
   ```