
# 3. Use Airbyte for database replication

Date: 2025-09-11

## Status

Accepted

## Context

The Schools Digital analytics platform has been built organically over the past 3 years to support the “Becoming A Teacher” service line. It has recently been extended to support a few other services (Teacher Vacancies, Claim, CPD etc.). However, the teams have seen a rise in the number of issues relating to loading/streaming data from the services into the analytics platform. Added to this, there is growing demand to add more service data, most notably the “Teaching Record” dataset. This is the largest dataset within the SD portfolio; it is required to answer a large proportion of the priority cross-portfolio data questions and to enable a “data by design” approach to service development.

A Ruby library called DfE Analytics has been developed over the last 3 years to support streaming database data to the analytics platform. It is deemed no longer fit for purpose, so we are considering alternative options.

A .NET version of DfE Analytics also exists; however, it does not stream database data to the analytics platform. Any option considered should also cater for .NET applications.

## Options

The table below shows all options considered at a high level.

| Option | Description |
| --- | --- |
| GCP BigQuery Transfer Service | Available natively in GCP. Allows sync directly to datasets at a chosen frequency. Supports Postgres only (in preview). |
| GCP Datastream | Available natively in GCP. Real-time streamed updates into BigQuery datasets. Supports Postgres and SQL Server. |
| GCP Dataflow | Primarily used for real-time and batch data processing, ETL pipelines and data transformations. Can be used in conjunction with Datastream for a native GCP solution. |
| GCP Data Fusion | A no-code/low-code ETL service designed for data integration across cloud and hybrid environments. |
| Postgres replication | Native PostgreSQL logical replication to stream changes from a Postgres database to another Postgres instance, primarily for fault tolerance. |
| Python-based data import (Spark etc.) | Custom ETL pipelines built with Python-based frameworks such as Apache Spark. Primarily used for large-scale, high-performance data processing. |
| Azure-based replication | Replication of the Postgres instance within Azure only, primarily for fault tolerance. |
| Airbyte | Data integration platform for ETL/ELT pipelines from databases to data warehouses. Both self-hosted and cloud-hosted. Push-based service. Full Terraform support via a Terraform provider. |
| Apache Airflow | Open-source workflow orchestration tool for scheduling, monitoring, and managing data pipelines. Widely used for ETL workflows. |
| Azure Data Factory (ADF) | Microsoft cloud-based ETL offering. Provides data integration enabling data movement, transformation, and orchestration across different sources and destinations. Microsoft equivalent to GCP Dataflow. |
| Apache Kafka | Distributed event streaming platform designed for real-time data processing, messaging, and event-driven architectures. Suited to high-throughput, scalable streaming. Not an ETL tool, so it would need other products to fill the feature gaps, e.g. Debezium with Kafka Connect and a BigQuery sink. |

The GCP BigQuery Data Transfer Service, GCP Datastream, Airbyte, Apache Airflow, and Azure Data Factory will be considered in more detail, as they are the closest matches to our requirements and existing solution. The other products address different use cases.

## Comparison of Options

The table below shows a detailed comparison of the shortlisted options.

| Criteria | Current DfE Analytics | GCP BigQuery Transfer Service | GCP Datastream | Airbyte | Apache Airflow | Azure Data Factory |
| --- | --- | --- | --- | --- | --- | --- |
| Can transfer latest version of tables in Postgres to BigQuery | Yes | Yes. Batch only. | Yes. Incremental with CDC and batch. | Yes. Incremental with CDC and batch. | Yes | Yes |
| Can specify exactly which tables and fields to synchronise | Yes. Tables and columns through config. | Tables only. | Yes. Tables and columns through config. | Yes. Tables and columns through config. | Yes. Tables and columns through DAGs. | SQL queries can select specific tables and fields. |
| Can label specified fields with policy tags (to retain hidden PII) | Yes, via config using policy tags. | Yes. Manually on BQ destination tables. | Yes. Manually on BQ destination tables. | Yes. Manually on BQ destination tables. | Yes; policy tags can be applied in the DAG. | Yes. Manually on BQ destination tables. |
| Can maintain version history of database entities | Yes | No | Yes, with Change Data Capture (CDC); use BQ MERGE with CDC. | Yes. Incremental sync with CDC tracks row-level changes and can store historical versions. | Yes, with custom versioning logic in DAGs to store historical snapshots. | No |
| Can attach web request UUID to changes | Yes | No | No | No | No | No |
| Deployment type | Client gem on Azure | Native GCP service | Native GCP service | Client-side deployment. Docker image available for Azure VM or Kubernetes. | Client-side deployment. Docker image available for Azure VM or Kubernetes. | Native Azure service |
| Push or pull from Postgres | Push | Pull | Pull | Push | Pull, but push can be customised. | Pull from Postgres (ADF is batch orientated), but pushes to BigQuery. |
| Best available authentication method | WIF | Username/password on same VPN only | Username/password or OAuth on same VPN only | WIF | WIF | WIF |
| Cost per database (assume 4m rows; 5 transactions/s) | Development, maintenance and operational costs. | Data transfer free, but ongoing storage and network egress costs apply. | CDC tier pricing starting at 0–2.5 TB/month @ $2.577/GB. Backfill: below 500 GB free; above 500 GB $0.515/GB. | Self-hosted is free. Ongoing development, maintenance and operational costs. | Self-hosted is free. Ongoing development, maintenance and operational costs. | Charges per Data Flow execution, compute, and volume of data moved. Pricing based on integration runtime and data volume. |
| Support for other DBs, e.g. SQL Server | Yes | Limited (SQL Server not supported) | Yes (MySQL, Postgres, SQL Server, etc.) | Yes (150+ connectors) | Yes. Any DB. | Yes (80+ connectors) |
| Data load reliability monitoring approach | Custom built | Robust load and reliability monitoring via Cloud Logging, Cloud Monitoring, and alerts. | Full-featured monitoring with dashboard and logging. Can monitor latency, data volumes, errors etc., and set up alerts. | Customisable retries, notifications (e.g. to Slack), and error handling, with a strong UI and logs. | Customisable, with detailed logging and notification options. | Good monitoring and retries, but less flexible than Airbyte or Airflow. |
| Process to repair data load errors | Custom-built monitoring and manual repair | Automated retry mechanism; notifications and manual retries. | Automated retry mechanism; notifications and manual retries. | Automatic retry and recovery; manual re-runs via the UI or API. | Customisable error handling via DAGs; custom workflows can repair or reload failed data. | Built-in retry logic for transient failures; failed activity runs can be re-run manually; retry policies configurable per activity. |
| Load impact on service | Heavy: each database update results in a queued event containing the deltas of the update. | High for full table transfers (batch processing); use incremental transfers to reduce impact. | Designed to minimise load on the source, e.g. Postgres replication slots. Mitigations include reducing change frequency from real time to fixed periods. Use CDC for low impact. | High in full refresh mode; low in incremental mode with CDC. | High for batch jobs: DAG-based batch jobs can put heavy load on the source and require careful query tuning. | Moderate to high: batch extraction can strain the source; supports incremental queries to reduce load. |
| IaC approach (Terraform, .sh etc.) | Config with installed library | Yes, Terraform can configure and manage BQ DTS, with limitations: no real-time streaming, external databases require Datastream as an intermediary, and source support is limited. | Yes. Terraform can set up and manage Datastream resources. | Yes. Moderate to complex setup depending on deployment; self-hosted (Docker/Kubernetes) needs more setup. Terraform provider available. | Yes, can be set up with Terraform, but complex: DAGs must be written separately. | Yes; easy to moderate setup with Terraform, with prebuilt modules available. |
| Supports web events | Yes | No | No | No | No | No |
| Supports custom events | Yes | No | No | No | No | No |
| Supports database sync | Yes | Yes | Yes | Yes | Yes | Yes |
| Language agnostic | No. Rails and .NET only | Yes | Yes | Yes | Yes | Yes |
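Several rows above hinge on Change Data Capture: the source emits row-level insert/update/delete events and the warehouse applies them to maintain a latest-version table (BigQuery's `MERGE`, or Airbyte's incremental sync with deduplication, do this at scale). A toy Python sketch of that apply step — all names here are illustrative, not any product's API:

```python
# Toy illustration of applying CDC events to maintain a "latest version"
# table. Real tools (BQ MERGE keyed on the primary key, Airbyte's
# incremental + dedup sync mode) do the same at warehouse scale.

def apply_cdc(table, events):
    """table maps primary key -> row; events are ordered oldest first."""
    for event in events:
        key = event["id"]
        if event["op"] == "delete":
            table.pop(key, None)
        else:
            # insert or update: last write wins
            table[key] = event["row"]
    return table

current = {1: {"id": 1, "name": "old"}}
changes = [
    {"op": "update", "id": 1, "row": {"id": 1, "name": "new"}},
    {"op": "insert", "id": 2, "row": {"id": 2, "name": "added"}},
    {"op": "delete", "id": 1},
]
latest = apply_cdc(current, changes)
# latest now holds only row 2; row 1 was updated and then deleted
```

Retaining the change events themselves, rather than only the final state, is what gives the CDC-based options their "version history" capability noted in the table.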

In considering the options, some of the most fundamental criteria are as follows:

  - Due to the Postgres DB setup in Azure, any connection to the Postgres DB must be on the same private network in Azure. A pull connection from the Azure Postgres DB therefore requires a VPN between Azure and GCP, which is a complex setup.
  - For efficiency, only incremental changes should be streamed, preferably using a standard Change Data Capture (CDC) approach.
  - For authentication, an OIDC connection to BigQuery is the preferred method.
  - Parity with the current DfE Analytics solution is also an important factor.
  - Product costs and ease of maintenance are also important considerations.
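To illustrate the incremental-changes requirement: a full refresh re-reads every row on every run, while an incremental sync reads only rows changed since a stored cursor (an `updated_at` column, or a CDC log position). A toy Python contrast, illustrative only:

```python
# Toy contrast of full refresh vs cursor-based incremental extraction.
# A real connector persists the cursor between runs (e.g. the highest
# updated_at value seen, or a CDC log position).

rows = [
    {"id": 1, "updated_at": 10},
    {"id": 2, "updated_at": 20},
    {"id": 3, "updated_at": 30},
]

def full_refresh(source):
    # reads every row on every run: load grows with table size
    return list(source)

def incremental(source, cursor):
    # reads only rows changed since the last run's cursor
    return [r for r in source if r["updated_at"] > cursor]

changed = incremental(rows, 15)  # rows 2 and 3 only
```

This is why the comparison table marks full-refresh modes as high impact on the source and CDC/incremental modes as low impact.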

The options below summarise the most fundamental pros and cons; details are in the comparison table above.

### 1. GCP Products

**GCP BigQuery Transfer Service:** available natively in GCP. Allows sync directly to datasets at a chosen frequency. Supports Postgres only (in preview).

**GCP Datastream:** available natively in GCP. Real-time streamed updates into BigQuery datasets. Supports Postgres and SQL Server.

Pros

  1. Available natively in GCP with minimal setup.

Cons

  1. Pull only from Postgres, which would require a complex VPN setup between GCP and the Azure private network

### 2. Airbyte

Data integration platform for ETL/ELT data pipelines from databases to data warehouses. Both self-hosted and cloud-hosted. Push-based service. Full Terraform support via a Terraform provider.

Pros

  1. Data Push, both incremental data (CDC) and batch data
  2. Version history of tables
  3. Can specify tables and columns to sync
  4. Automatic Schema synchronisation
  5. WIF for BigQuery Auth supported
  6. Setup with Terraform providers available
  7. Open-source self-hosting available on Azure Kubernetes
  8. Low impact on Database

Cons

  1. Self-hosted solution will require ongoing developer/DevOps support
  2. Moderate to complex Terraform setup
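On the Terraform point: the Airbyte Terraform provider (airbytehq/airbyte) lets the source, destination, and connection be declared as code. A hedged sketch of the general shape follows; resource and field names vary between provider versions, and the variables are assumed to be defined elsewhere, so treat this as illustrative rather than copy-paste:

```hcl
# Illustrative only: field names differ between Airbyte Terraform provider
# versions. Assumes a self-hosted Airbyte instance reachable by the
# provider, and var.workspace_id / var.db_* / var.gcp_project defined.

resource "airbyte_source_postgres" "service_db" {
  name          = "service-db"
  workspace_id  = var.workspace_id
  configuration = {
    host     = var.db_host
    port     = 5432
    database = "service"
    username = var.db_user
    password = var.db_password
    # CDC replication rather than full refresh, per the comparison above
    replication_method = { cdc = {} }
  }
}

resource "airbyte_destination_bigquery" "analytics" {
  name          = "analytics-bq"
  workspace_id  = var.workspace_id
  configuration = {
    project_id = var.gcp_project
    dataset_id = "service_analytics"
  }
}

resource "airbyte_connection" "service_to_bq" {
  source_id      = airbyte_source_postgres.service_db.source_id
  destination_id = airbyte_destination_bigquery.analytics.destination_id
}
```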

### 3. Apache Airflow

Open-source workflow orchestration tool for scheduling, monitoring, and managing data pipelines. It is widely used for ETL workflows.

Pros

  1. Both incremental data (CDC) and batch data; push can be customised
  2. Version history of tables (via custom versioning logic in DAGs)
  3. Can specify tables and columns to sync
  4. WIF for BigQuery auth supported
  5. Open-source self-hosting available on Azure Kubernetes

Cons

  1. Self-hosted solution will require ongoing developer/DevOps support
  2. Complex Terraform setup
  3. Much of the setup is done in DAGs, written separately from the Terraform
  4. No automatic schema synchronisation
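The "version history" pro above relies on custom logic in a DAG task rather than anything built in: each run snapshots rows whose payload changed, tagged with the run date. A toy Python sketch of that idea (plain Python, not Airflow operator code):

```python
# Toy sketch of the custom versioning an Airflow DAG task would implement:
# each sync appends a dated snapshot of any row whose payload changed,
# so history is retained (the CDC-based options get this built in).

def record_versions(history, rows, run_date):
    """Append a new version for each row that changed since the last run."""
    latest = {}  # primary key -> most recently recorded payload
    for version in history:
        latest[version["id"]] = version["row"]
    for row in rows:
        if latest.get(row["id"]) != row:
            history.append({"id": row["id"], "row": row, "as_of": run_date})
    return history

history = []
history = record_versions(history, [{"id": 1, "name": "a"}], "2025-09-01")
history = record_versions(history, [{"id": 1, "name": "b"}], "2025-09-02")
history = record_versions(history, [{"id": 1, "name": "b"}], "2025-09-03")
# third run records nothing: the row is unchanged
```

Writing and maintaining logic like this per table is exactly the DAG overhead counted against Airflow in cons 2 and 3.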

### 4. Azure Data Factory

Microsoft cloud-based ETL offering. Provides data integration enabling data movement, transformation, and orchestration across different sources and destinations. Microsoft equivalent to GCP Dataflow.

Pros

  1. Available natively in Azure with minimal setup.
  2. Setup with Terraform available

Cons

  1. Must maintain SQL Queries to select specific tables and columns.
  2. No version history of tables
  3. Moderate to high load impact on source database

## Decision

Airbyte (Option 2). In assessing these products, the primary considerations above were taken into account. Two products in particular, Airbyte and Apache Airflow, fit our requirements well. Airbyte has the edge, as it is less complex to set up and supports automatic synchronisation of schema changes. The only area where these products may lag behind the native GCP products is reliability and monitoring tooling; however, this appears sufficient and customisable through configuration, although that may require extra effort.

## Consequences

Data health will be improved and data synchronisation will require no code maintenance, leading to reduced complexity in DfE Analytics.

The load on services using DfE Analytics should be reduced and scaling to large datasets should be possible.

However, the self-hosted open-source installation and its configuration will require ongoing maintenance.