
# 3. Use Airbyte for database replication

Date: 2025-09-11

## Status

Accepted

## Context

The Schools Digital analytics platform has been built organically over the past 3 years to support the “Becoming A Teacher” service line. It has recently been extended to support a few other services (Teacher Vacancies, Claim, CPD etc.). However, the teams have seen a rise in the number of issues relating to loading/streaming data from the services into the analytics platform. Added to this, there is growing demand to add more service data, most notably the “Teaching Record” dataset. This is the largest dataset within the SD portfolio; it is required to answer a large proportion of the priority cross-portfolio data questions and to enable a “data by design” approach to service development.

A Ruby library called DfE Analytics has been developed over the last 3 years to support streaming database data to the analytics platform. It is deemed no longer fit for purpose, so we are considering alternative options.

A .NET version of DfE Analytics also exists; however, it does not stream database data to the analytics platform. Any option considered should also cater for .NET applications.

## Options

The table below shows all options considered at a high level.

| Option | Description |
| --- | --- |
| GCP BigQuery Transfer Service | Available natively in GCP. Allows sync directly to datasets at a chosen frequency. Supports Postgres only (in preview). |
| GCP Datastream | Available natively in GCP. Real-time streamed updates into BigQuery datasets. Supports Postgres and SQL Server. |
| GCP Dataflow | Primarily used for real-time and batch data processing, ETL pipelines and data transformations. Can be used in conjunction with Datastream for a native GCP solution. |
| GCP Data Fusion | A no-code/low-code ETL service designed for data integration across cloud and hybrid environments. |
| Postgres replication | Native PostgreSQL logical replication to stream changes from a Postgres database to another Postgres instance, primarily for fault tolerance. |
| Python-based data import (Spark etc.) | Custom ETL pipelines built with Python-based frameworks such as Apache Spark. Primarily used for large-scale, high-performance data processing. |
| Azure-based replication | Replication of the Postgres instance within Azure only, primarily for fault tolerance. |
| Airbyte | Data integration platform for ETL/ELT pipelines from databases to data warehouses. Both self-hosted and cloud-hosted. Push-based service. Full Terraform support via a Terraform provider. |
| Apache Airflow | Open-source workflow orchestration tool for scheduling, monitoring, and managing data pipelines. Widely used for ETL workflows. |
| Azure Data Factory (ADF) | Microsoft cloud-based ETL offering. Provides data integration enabling data movement, transformation, and orchestration across different sources and destinations. Microsoft equivalent to GCP Dataflow. |
| Apache Kafka | Distributed event streaming platform designed for real-time data processing, messaging, and event-driven architectures. Suited to high-throughput, scalable streaming. Not an ETL tool, so it would need other products to fill the feature gaps, e.g. Debezium with Kafka Connect and a BigQuery sink. |

The GCP BigQuery Data Transfer Service, GCP Datastream, Airbyte, Apache Airflow, and Azure Data Factory will be considered in more detail, as they are the closest matches to our requirements and existing solution. The other products address different use cases.

## Comparison of Options

The table below shows a detailed comparison of the shortlisted options.

| Criteria | Current DfE Analytics | GCP BigQuery Transfer Service | GCP Datastream | Airbyte | Apache Airflow | Azure Data Factory |
| --- | --- | --- | --- | --- | --- | --- |
| Can transfer latest version of tables in Postgres to BigQuery | Yes | Yes. Batch only. | Yes. Incremental with CDC and batch. | Yes. Incremental with CDC and batch. | Yes | Yes |
| Can specify exactly which tables and fields to synchronise | Yes. Tables and columns through config. | Tables only. | Yes. Tables and columns through config. | Yes. Tables and columns through config. | Yes. Tables and columns through DAGs. | SQL queries can select specific tables and fields. |
| Can label specified fields with policy tags (to retain hidden PII) | Yes, via config using policy tags. | Yes. Manually on BQ destination tables. | Yes. Manually on BQ destination tables. | Yes. Manually on BQ destination tables. | Yes; policy tags can be applied in the DAG. | Yes. Manually on BQ destination tables. |
| Can maintain version history of database entities | Yes | No | Yes, with Change Data Capture (CDC); use BQ MERGE with CDC. | Yes. Incremental sync with CDC tracks row-level changes and can store historical versions. | Yes, with custom versioning logic in DAGs to store historical snapshots. | No |
| Can attach web request UUID to changes | Yes | No | No | No | No | No |
| Deployment type | Client gem on Azure | Native GCP service | Native GCP service | Client-side deployment. Docker image available for Azure VM or Kubernetes. | Client-side deployment. Docker image available for Azure VM or Kubernetes. | Native Azure service |
| Push or pull from Postgres | Push | Pull | Pull | Push | Pull, but push can be customised. | Pull from Postgres (ADF is batch orientated), but pushes to BigQuery. |
| Best available authentication method | WIF | Username/password on same VPN only | Username/password or OAuth on same VPN only | WIF | WIF | WIF |
| Cost per database (assume 4m rows; 5 transactions/s) | Development, maintenance and operational costs. | Data transfer free, but ongoing storage and network egress costs apply. | CDC tier pricing starting at 0–2.5 TB/month @ $2.577/GB. Backfill: below 500 GB free; above 500 GB $0.515/GB. | Self-hosted is free. Ongoing development, maintenance and operational costs. | Self-hosted is free. Ongoing development, maintenance and operational costs. | Charges per Data Flow execution, compute, and volume of data moved. Pricing based on integration runtime and data volume. |
| Support for other DBs, e.g. SQL Server | Yes | Limited (SQL Server not supported) | Yes (MySQL, Postgres, SQL Server, etc.) | Yes (150+ connectors) | Yes. Any DB. | Yes (80+ connectors) |
| Data load reliability monitoring approach | Custom built | Robust load and reliability monitoring via Cloud Logging, Cloud Monitoring, and alerts. | Full-featured monitoring with dashboard and logging. Can monitor latency, data volumes, errors etc., and set up alerts. | Customisable retries, notifications (e.g. to Slack), and error handling, with a strong UI and logs. | Customisable, with detailed logging and notification options. | Good monitoring and retries, but less flexible than Airbyte or Airflow. |
| Process to repair data load errors | Custom-built monitoring and manual repair | Automated retry mechanism; notifications and manual retries. | Automated retry mechanism; notifications and manual retries. | Automatic retry and recovery; manual re-runs via the UI or API. | Customisable error handling via DAGs; custom workflows can repair or reload failed data. | Built-in retry logic for transient failures; failed activity runs can be re-run manually; retry policies configurable per activity. |
| Load impact on service | Heavy: each database update results in a queued event containing the deltas of the update. | High for full table transfers (batch processing); use incremental transfers to reduce impact. | Designed to minimise load on the source, e.g. Postgres replication slots. Mitigations include reducing change frequency from real time to fixed periods. Use CDC for low impact. | High in full refresh mode; low in incremental mode with CDC. | High for batch jobs: DAG-based batch jobs can put heavy load on the source and require careful query tuning. | Moderate to high: batch extraction can strain the source; supports incremental queries to reduce load. |
| IaC approach (Terraform, .sh etc.) | Config with installed library | Yes, Terraform can configure and manage BQ DTS, with limitations: no real-time streaming, external databases require Datastream as an intermediary, and source support is limited. | Yes. Terraform can set up and manage Datastream resources. | Yes. Moderate to complex setup depending on deployment; self-hosted (Docker/Kubernetes) needs more setup. Terraform provider available. | Yes, can be set up with Terraform, but complex: DAGs must be written separately. | Yes; easy to moderate setup with Terraform, with prebuilt modules available. |
| Supports web events | Yes | No | No | No | No | No |
| Supports custom events | Yes | No | No | No | No | No |
| Supports database sync | Yes | Yes | Yes | Yes | Yes | Yes |
| Language agnostic | No. Rails and .NET only | Yes | Yes | Yes | Yes | Yes |
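Several rows above hinge on Change Data Capture: the source emits row-level insert/update/delete events and the warehouse applies them to maintain a latest-version table (BigQuery's `MERGE`, or Airbyte's incremental sync with deduplication, do this at scale). A toy Python sketch of that apply step — all names here are illustrative, not any product's API:

```python
# Toy illustration of applying CDC events to maintain a "latest version"
# table. Real tools (BQ MERGE keyed on the primary key, Airbyte's
# incremental + dedup sync mode) do the same at warehouse scale.

def apply_cdc(table, events):
    """table maps primary key -> row; events are ordered oldest first."""
    for event in events:
        key = event["id"]
        if event["op"] == "delete":
            table.pop(key, None)
        else:
            # insert or update: last write wins
            table[key] = event["row"]
    return table

current = {1: {"id": 1, "name": "old"}}
changes = [
    {"op": "update", "id": 1, "row": {"id": 1, "name": "new"}},
    {"op": "insert", "id": 2, "row": {"id": 2, "name": "added"}},
    {"op": "delete", "id": 1},
]
latest = apply_cdc(current, changes)
# latest now holds only row 2; row 1 was updated and then deleted
```

Retaining the change events themselves, rather than only the final state, is what gives the CDC-based options their "version history" capability noted in the table.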

In considering the options, some of the most fundamental criteria are as follows:

  - Due to the Postgres DB setup in Azure, any connection to the Postgres DB must be on the same private network in Azure. A pull connection from the Azure Postgres DB therefore requires a VPN between Azure and GCP, which is a complex setup.
  - For efficiency, only incremental changes should be streamed, preferably using a standard Change Data Capture (CDC) approach.
  - For authentication, an OIDC connection to BigQuery is the preferred method.
  - Parity with the current DfE Analytics solution is also an important factor.
  - Product costs and ease of maintenance are also important considerations.
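To illustrate the incremental-changes requirement: a full refresh re-reads every row on every run, while an incremental sync reads only rows changed since a stored cursor (an `updated_at` column, or a CDC log position). A toy Python contrast, illustrative only:

```python
# Toy contrast of full refresh vs cursor-based incremental extraction.
# A real connector persists the cursor between runs (e.g. the highest
# updated_at value seen, or a CDC log position).

rows = [
    {"id": 1, "updated_at": 10},
    {"id": 2, "updated_at": 20},
    {"id": 3, "updated_at": 30},
]

def full_refresh(source):
    # reads every row on every run: load grows with table size
    return list(source)

def incremental(source, cursor):
    # reads only rows changed since the last run's cursor
    return [r for r in source if r["updated_at"] > cursor]

changed = incremental(rows, 15)  # rows 2 and 3 only
```

This is why the comparison table marks full-refresh modes as high impact on the source and CDC/incremental modes as low impact.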

The options below summarise the most fundamental pros and cons; details are in the comparison table above.

### 1. GCP Products

**GCP BigQuery Transfer Service:** available natively in GCP. Allows sync directly to datasets at a chosen frequency. Supports Postgres only (in preview).

**GCP Datastream:** available natively in GCP. Real-time streamed updates into BigQuery datasets. Supports Postgres and SQL Server.

Pros

  1. Available natively in GCP with minimal setup.

Cons

  1. Pull only from Postgres, which would require a complex VPN setup between GCP and the Azure private network

### 2. Airbyte

Data integration platform for ETL/ELT data pipelines from databases to data warehouses. Both self-hosted and cloud-hosted. Push-based service. Full Terraform support via a Terraform provider.

Pros

  1. Data Push, both incremental data (CDC) and batch data
  2. Version history of tables
  3. Can specify tables and columns to sync
  4. Automatic Schema synchronisation
  5. WIF for BigQuery Auth supported
  6. Setup with Terraform providers available
  7. Open-source self-hosting available on Azure Kubernetes
  8. Low impact on Database

Cons

  1. Self-hosted solution will require ongoing developer/DevOps support
  2. Moderate to complex Terraform setup
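On the Terraform point: the Airbyte Terraform provider (airbytehq/airbyte) lets the source, destination, and connection be declared as code. A hedged sketch of the general shape follows; resource and field names vary between provider versions, and the variables are assumed to be defined elsewhere, so treat this as illustrative rather than copy-paste:

```hcl
# Illustrative only: field names differ between Airbyte Terraform provider
# versions. Assumes a self-hosted Airbyte instance reachable by the
# provider, and var.workspace_id / var.db_* / var.gcp_project defined.

resource "airbyte_source_postgres" "service_db" {
  name          = "service-db"
  workspace_id  = var.workspace_id
  configuration = {
    host     = var.db_host
    port     = 5432
    database = "service"
    username = var.db_user
    password = var.db_password
    # CDC replication rather than full refresh, per the comparison above
    replication_method = { cdc = {} }
  }
}

resource "airbyte_destination_bigquery" "analytics" {
  name          = "analytics-bq"
  workspace_id  = var.workspace_id
  configuration = {
    project_id = var.gcp_project
    dataset_id = "service_analytics"
  }
}

resource "airbyte_connection" "service_to_bq" {
  source_id      = airbyte_source_postgres.service_db.source_id
  destination_id = airbyte_destination_bigquery.analytics.destination_id
}
```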

### 3. Apache Airflow

Open-source workflow orchestration tool for scheduling, monitoring, and managing data pipelines. It is widely used for ETL workflows.

Pros

  1. Both incremental data (CDC) and batch data; push can be customised
  2. Version history of tables (via custom versioning logic in DAGs)
  3. Can specify tables and columns to sync
  4. WIF for BigQuery auth supported
  5. Open-source self-hosting available on Azure Kubernetes

Cons

  1. Self-hosted solution will require ongoing developer/DevOps support
  2. Complex Terraform setup
  3. Much of the setup is done in DAGs, written separately from the Terraform
  4. No automatic schema synchronisation
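The "version history" pro above relies on custom logic in a DAG task rather than anything built in: each run snapshots rows whose payload changed, tagged with the run date. A toy Python sketch of that idea (plain Python, not Airflow operator code):

```python
# Toy sketch of the custom versioning an Airflow DAG task would implement:
# each sync appends a dated snapshot of any row whose payload changed,
# so history is retained (the CDC-based options get this built in).

def record_versions(history, rows, run_date):
    """Append a new version for each row that changed since the last run."""
    latest = {}  # primary key -> most recently recorded payload
    for version in history:
        latest[version["id"]] = version["row"]
    for row in rows:
        if latest.get(row["id"]) != row:
            history.append({"id": row["id"], "row": row, "as_of": run_date})
    return history

history = []
history = record_versions(history, [{"id": 1, "name": "a"}], "2025-09-01")
history = record_versions(history, [{"id": 1, "name": "b"}], "2025-09-02")
history = record_versions(history, [{"id": 1, "name": "b"}], "2025-09-03")
# third run records nothing: the row is unchanged
```

Writing and maintaining logic like this per table is exactly the DAG overhead counted against Airflow in cons 2 and 3.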

### 4. Azure Data Factory

Microsoft cloud-based ETL offering. Provides data integration enabling data movement, transformation, and orchestration across different sources and destinations. Microsoft equivalent to GCP Dataflow.

Pros

  1. Available natively in Azure with minimal setup.
  2. Setup with Terraform available

Cons

  1. Must maintain SQL Queries to select specific tables and columns.
  2. No version history of tables
  3. Moderate to high load impact on source database

## Decision

Airbyte (Option 2). In assessing these products, the primary considerations above were taken into account. Two products in particular, Airbyte and Apache Airflow, fit our requirements well. Airbyte has the edge, as it is less complex to set up and supports automatic synchronisation of schema changes. The only area where these products may lag behind the native GCP products is reliability and monitoring tooling; however, this appears sufficient and customisable through configuration, although that may require extra effort.

## Consequences

Data health will be improved and data synchronisation will require no code maintenance, leading to reduced complexity in DfE Analytics.

The load on services using DfE Analytics should be reduced and scaling to large datasets should be possible.

However, the self-hosted open-source installation and its configuration will require ongoing maintenance.