
Config-driven Terraform deployment of Lakeflow Connect connectors

This project deploys and scales out database connectors to the Databricks Lakehouse Platform using infrastructure-as-code principles. It provides a database-agnostic, YAML-driven Terraform deployment for Databricks Lakeflow Connect, with built-in validation and orchestration for deployments.

Key Features:

  • Database-agnostic: Supports any Lakeflow Connect-supported database by simply changing the source_type in the config
  • YAML-driven configuration: Define connections, databases, schemas, tables, and schedules in a single configuration file
  • Flexible Unity Catalog mapping: Configure where data lands with per-database catalog and schema customization
  • Flexible ingestion modes: Choose between full schema ingestion or specific table selection per database
  • Automated orchestration: Configurable job scheduling of the managed ingestion pipelines with support for multiple sync frequencies
  • Infrastructure-as-Code: Fully automated deployment of gateway pipelines, ingestion pipelines, and orchestration jobs

What gets deployed:

  • Lakeflow Connect gateway pipeline, using an existing Unity Catalog connection
  • Ingestion pipelines for each database/schema combination, all routed through a single shared gateway
  • Databricks jobs to orchestrate the ingestion workflows
  • Optional: Event log tables for pipeline monitoring

Overview

This project allows you to deploy components in the architecture described below.

[Diagram: Lakeflow Connect Architecture Overview]

Project Layout

lakeflow-connect-terraform/
├── config/                      # YAML configuration files
│   ├── lakeflow_dev.yml        # Active configuration
│   └── examples/                # Example configurations
│       ├── postgresql.yml       # PostgreSQL example
│       ├── sqlserver.yml        # SQL Server example
│       └── README.md            # Examples documentation
├── infra/                       # Terraform root module
│   ├── backend.tf
│   ├── main.tf
│   ├── locals.tf
│   ├── outputs.tf
│   ├── provider.tf
│   ├── variables.tf
│   └── modules/
│       ├── gateway_pipeline/
│       ├── ingestion_pipeline/
│       ├── job/
│       ├── gateway_validation/
│       └── validate_catalog_and_schemas/
├── tools/                       # Validation scripts (optional)
│   ├── yaml_content_validator.py
│   └── validate_running_gateway.py
├── pyproject.toml              # Python dependencies for validation tools
└── README.md

Prerequisites

Software Requirements

  • Terraform >= 1.10
  • Databricks Terraform Provider >= 1.104.0 (for PostgreSQL source_configurations support; see the version pin sketch after this list)
  • Databricks workspace with Unity Catalog enabled
  • Python 3.10+ and Poetry (optional, for YAML validation tools)
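
For reference, these version constraints can be pinned in the Terraform root module. The block below is a hedged sketch of such a pin, not necessarily the exact contents of infra/provider.tf:

terraform {
  required_version = ">= 1.10"
  required_providers {
    databricks = {
      source  = "databricks/databricks"   # Databricks Terraform provider
      version = ">= 1.104.0"              # needed for PostgreSQL source_configurations
    }
  }
}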

Databricks Setup

1. Create Unity Catalog Connection

Create a Databricks Unity Catalog connection to your source database and specify it in your YAML config under connection.name.

Refer to the Databricks documentation for your database type.

2. Create Unity Catalog Structure

Based on your YAML configuration in the unity_catalog section, pre-create:

  • Catalogs: Create catalog(s) specified in global_uc_catalog and staging_uc_catalog (if different)
  • Staging Schema: Create the schema specified in staging_schema within the staging catalog
  • Ingestion Schemas: Create schemas for ingestion pipelines in their target catalogs (see the SQL sketch after this list)
    • Schema names follow the pattern: {schema_prefix}{source_schema_name}{schema_suffix}
    • Example: If the source schema is dbo and the prefix is dev_ → create schema dev_dbo
    • If no prefix/suffix is specified, the source schema name is used as-is
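
As a reference point, this pre-creation step can be done in Databricks SQL. The statements below are a hedged sketch using the illustrative names from this README (my_catalog, lf_staging, dev_dbo); substitute the names from your YAML config:

-- Illustrative only: catalog and schema names come from your YAML config
CREATE CATALOG IF NOT EXISTS my_catalog;
CREATE SCHEMA IF NOT EXISTS my_catalog.lf_staging;   -- staging schema for the gateway
CREATE SCHEMA IF NOT EXISTS my_catalog.dev_dbo;      -- {schema_prefix}{source_schema_name}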

3. Configure Source Database (Database-Specific)

For PostgreSQL Sources: Pre-create replication slots and publications in the source database as specified in the YAML config:

-- Example for PostgreSQL
CREATE PUBLICATION dbx_pub_mydb FOR ALL TABLES;
SELECT pg_create_logical_replication_slot('dbx_slot_mydb', 'pgoutput');

Refer to PostgreSQL Replication Setup for detailed instructions.

For SQL Server Sources: Ensure CDC is enabled on the source database and tables. Refer to SQL Server CDC Setup.
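
As an illustration of the SQL Server side, the T-SQL below is a hedged sketch of enabling CDC on a database and on a single table (the schema and table names are placeholders); follow the linked SQL Server CDC Setup guide for the authoritative steps:

-- Run inside the source database (illustrative only)
EXEC sys.sp_cdc_enable_db;

-- Enable CDC for each table to be ingested
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'users',
    @role_name     = NULL;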

4. Grant Permissions

Grant the user or service principal (SPN) used for the Terraform deployment the following privileges (see the sketch after this list):

  • USE CONNECTION on the Unity Catalog connection to the source database
  • USE CATALOG on all catalogs specified in your YAML config
  • USE SCHEMA and CREATE TABLE on all ingestion pipeline target schemas
  • USE SCHEMA, CREATE TABLE and CREATE VOLUME on the staging schema
    • Note: CREATE TABLE allows creation of event log tables if enabled
  • Access to a cluster policy for spinning up gateway pipeline compute (Policy Documentation)
    • Specify the policy name in your YAML config under gateway_pipeline_cluster_policy_name
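
As an illustration, the Unity Catalog privileges above can be granted in Databricks SQL. This is a hedged sketch assuming a principal named terraform-spn and the illustrative object names used elsewhere in this README; cluster policy access is granted separately through workspace permissions, not SQL:

-- Illustrative only: substitute your principal and object names
GRANT USE CONNECTION ON CONNECTION my_uc_connection TO `terraform-spn`;
GRANT USE CATALOG ON CATALOG my_catalog TO `terraform-spn`;
GRANT USE SCHEMA, CREATE TABLE ON SCHEMA my_catalog.dev_dbo TO `terraform-spn`;
GRANT USE SCHEMA, CREATE TABLE, CREATE VOLUME ON SCHEMA my_catalog.lf_staging TO `terraform-spn`;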

Assumptions

  1. Unity Catalog Resources: All Unity Catalog catalogs and schemas specified in the YAML configuration must already exist. This project validates their existence but does not create them.
  2. Database Permissions: The database replication user used with the Unity Catalog connection must have all necessary permissions on the source database.
  3. Source Database Configuration: For PostgreSQL, replication slots and publications must be pre-created. For SQL Server, CDC must be enabled.

Installation

Step 1: Clone and Configure

# Clone the repository
git clone <your-repo-url>
cd lakeflow-connect-terraform

# Copy an example configuration and customize for your environment
# For PostgreSQL:
cp config/examples/postgresql.yml config/lakeflow_<your_env>.yml

# For SQL Server:
cp config/examples/sqlserver.yml config/lakeflow_<your_env>.yml

# Edit the YAML file with your environment details

Step 2: (Optional) Set Up Python Validation Tools

The Python tools provide YAML validation before Terraform deployment. Note: Terraform itself does not require Poetry; it reads YAML directly using the built-in yamldecode() function.

# Install Poetry (if not already installed)
pip install poetry

# Install project dependencies
poetry install

# Validate your configuration before deploying
poetry run python tools/yaml_content_validator.py config/lakeflow_<your_env>.yml

After validation passes, proceed with Terraform deployment (no Poetry needed).

Step 3: Configure Authentication

Set up authentication for the Databricks Terraform provider:

Option A: Personal Login (Development)

# For Azure
az login
export DATABRICKS_HOST="https://adb-xxxxx.azuredatabricks.net"
export DATABRICKS_AUTH_TYPE="azure-cli"

Option B: Service Principal (Production)

# Create profile in ~/.databrickscfg
cat >> ~/.databrickscfg << EOF
[production]
host                = https://adb-<workspace_id>.<num>.azuredatabricks.net/
azure_tenant_id     = <tenant_id>
azure_client_id     = <client_id>
azure_client_secret = <client_secret>
auth_type           = azure-client-secret
EOF

# Set environment variable
export DATABRICKS_CONFIG_PROFILE=production

Step 4: Configure Terraform Backend

Option A: Local Backend (Default - For Single User)

The project is configured with a local backend by default. State is stored at infra/terraform.tfstate.

cd infra
terraform init
terraform validate

Option B: Remote Backend (Recommended for Teams)

For team collaboration, use a remote backend like Azure Storage. Update infra/backend.tf:

terraform {
  backend "azurerm" {
    resource_group_name  = "terraform-state-rg"
    storage_account_name = "tfstatelakeflow"
    container_name       = "tfstate"
    key                  = "lakeflow-connect.tfstate"
  }
}

Then initialize with the backend configuration:

cd infra

# Set Azure credentials
export ARM_SUBSCRIPTION_ID="<subscription_id>"
export ARM_TENANT_ID="<tenant_id>"
export ARM_CLIENT_ID="<client_id>"
export ARM_CLIENT_SECRET="<client_secret>"

# Initialize with remote backend
terraform init -reconfigure
terraform validate

Other Remote Backend Options:

  • AWS S3: For multi-cloud or AWS environments
  • Terraform Cloud: Managed service with built-in state management
  • Google Cloud Storage: For GCP environments

See Terraform Backend Documentation for more options.

Step 5: Deploy Infrastructure

# Plan deployment (review changes)
terraform plan --var yaml_config_path=../config/lakeflow_<env>.yml

# Apply deployment
terraform apply --var yaml_config_path=../config/lakeflow_<env>.yml
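
After a successful apply, the root module's outputs can be inspected; which values are exposed depends on infra/outputs.tf:

# Show all outputs recorded in the Terraform state
terraform output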

YAML Configuration

See config/examples/ for complete example configurations (PostgreSQL, SQL Server). The configuration is fully YAML-driven with the following main sections:

Unity Catalog Configuration

unity_catalog:
  global_uc_catalog: "my_catalog"      # Required: Default catalog for ingestion pipelines
  staging_uc_catalog: "staging_cat"    # Optional: Catalog for gateway staging (defaults to global)
  staging_schema: "lf_staging"         # Optional: Schema for gateway staging (defaults to {app_name}_lf_staging)

  • global_uc_catalog: Default catalog where ingestion pipeline tables land (required)
  • staging_uc_catalog: Catalog for gateway pipeline staging assets (optional, defaults to global_uc_catalog)
  • staging_schema: Schema name for gateway staging assets (optional, defaults to {app_name}_lf_staging)

Connection

connection:
  name: "my_uc_connection"
  source_type: "POSTGRESQL"  # POSTGRESQL, SQLSERVER, MYSQL, ORACLE

Databases

databases:
  - name: my_database
    # Optional: Override global catalog for this database
    uc_catalog: "custom_catalog"
    
    # Optional: Customize schema names
    schema_prefix: "dev_"      # Results in: dev_<schema_name>
    schema_suffix: "_raw"      # Results in: <schema_name>_raw
    
    # PostgreSQL only: Replication configuration
    replication_slot: "dbx_slot_mydb"
    publication: "dbx_pub_mydb"
    
    schemas:
      - name: public
        use_schema_ingestion: true  # Ingest all tables
      - name: app_schema
        use_schema_ingestion: false  # Ingest specific tables only
        tables:
          - source_table: users
          - source_table: orders

  • Per-database uc_catalog: Override global catalog for specific database (optional)
  • Per-database schema_prefix and schema_suffix: Customize destination schema names (optional)
  • Per-database replication_slot and publication: PostgreSQL replication configuration (PostgreSQL only)
  • Schemas and tables to ingest with granular control

Event Logs

event_log:
  to_table: true  # Enable/disable materialization of event logs to Delta tables

Job Orchestration

job:
  common_job_for_all_pipelines: false
  common_schedule:
    quartz_cron_expression: "0 */30 * * * ?"
    timezone_id: "UTC"
  
  # Per-database schedules (when common_job_for_all_pipelines is false)
  per_database_schedules:
    - name: "frequent_sync"
      applies_to: ["db1", "db2"]
      schedule:
        quartz_cron_expression: "0 */15 * * * ?"
        timezone_id: "UTC"

The job block supports two scheduling modes (a sketch of the first mode follows this list):

  1. Common job for all pipelines (common_job_for_all_pipelines: true): A single job runs all database ingestion pipelines on the same schedule
  2. Per-database scheduling (common_job_for_all_pipelines: false): Separate jobs per schedule group, allowing different databases to sync at different frequencies
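
For comparison with the per-database example above, a minimal sketch of the common-job mode (keys as documented above; the cron value is illustrative and runs hourly):

job:
  common_job_for_all_pipelines: true
  common_schedule:
    quartz_cron_expression: "0 0 * * * ?"  # every hour, on the hour
    timezone_id: "UTC"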

How does it compare to other deployment options?

| Feature                | This Project (Terraform) | Databricks CLI | DABS          | REST API     | UI             |
|------------------------|--------------------------|----------------|---------------|--------------|----------------|
| Infrastructure-as-Code | ✅ Native                | ⚠️ Limited     | ✅ Yes        | ❌ No        | ❌ No          |
| Version Control        | ✅ Git-native            | ⚠️ Manual      | ✅ Git-native | ❌ Manual    | ❌ Manual      |
| Multi-environment      | ✅ Built-in              | ⚠️ Manual      | ✅ Built-in   | ⚠️ Manual    | ❌ Manual      |
| State Management       | ✅ Native                | ❌ None        | ⚠️ Limited    | ❌ None      | ❌ None        |
| YAML-driven            | ✅ Yes                   | ❌ No          | ✅ Yes        | ❌ No        | ❌ No          |
| Validation             | ✅ Pre-deploy            | ❌ No          | ⚠️ Limited    | ❌ No        | ⚠️ Post-deploy |
| Team Collaboration     | ✅ Excellent             | ⚠️ Manual      | ✅ Good       | ❌ Poor      | ❌ Poor        |
| Learning Curve         | ⚠️ Moderate              | ✅ Easy        | ⚠️ Moderate   | ⚠️ Complex   | ✅ Easy        |
| Automation-friendly    | ✅ Excellent             | ✅ Good        | ✅ Good       | ✅ Excellent | ❌ Poor        |

When to use Terraform (this project):

  • Managing multiple environments (dev, staging, prod)
  • Team collaboration with state management
  • Complex configurations with many pipelines
  • CI/CD automation requirements
  • Need for validation before deployment

When to use alternatives:

  • UI: Quick one-off deployments, learning/exploration
  • CLI: Simple scripts, single-pipeline deployments
  • DABS: Databricks-native workflows with asset bundles
  • API: Custom integrations, programmatic control

How to get help

This content is not covered by Databricks support. For questions or bugs, please open a GitHub issue and the team will help on a best-effort basis.

Future Enhancements

  • Pre-commit hooks for YAML validation
  • Pydantic-based schema validation
  • Automated CI/CD pipeline examples
  • Metadata export for downstream applications
  • Cost estimation based on configuration
  • Multi-region deployment support

License

© 2025 Databricks, Inc. All rights reserved. The source in this repository is provided subject to the Databricks License. All included or referenced third party libraries are subject to the licenses set forth below.

| library                       | description         | license    | source                                                       |
|-------------------------------|---------------------|------------|--------------------------------------------------------------|
| terraform-provider-databricks | Databricks provider | Apache 2.0 | https://github.com/databricks/terraform-provider-databricks |
