Merged
6 changes: 3 additions & 3 deletions docs/adr/0003-use-airbyte-for-database-replication.md
@@ -1,6 +1,6 @@
# 3. Connections to DTTP Dynamics
# 3. Use Airbyte for database replication

Date: 2020-09-11
Date: 2025-09-11
## Status

Accepted
@@ -65,7 +65,7 @@ In considering the options, some of the most fundamental criteria are as follows
- Parity with the current DfE Analytics solution is also an important factor
- Costs of the product and ease of maintenance should also be important considerations

The list of the option below summarises the most fundamental pros and cons , with details in the comparison table above.
The list of the options below summarises the most fundamental pros and cons, with details in the comparison table above.
### 1. GCP Products

#### [GCP BigQuery transfer service](https://cloud.google.com/bigquery/docs/dts-introduction)
32 changes: 17 additions & 15 deletions docs/airbyte.md
@@ -6,7 +6,7 @@ An alternative to sending DfE::Analytics database events is database replication

DfE::Analytics database events are also used for replicating data. However, there have been a number of issues, outlined in the [Data health import issues](https://github.com/DFE-Digital/dfe-analytics/blob/main/README.md#data-health-import-issues) section. Using Airbyte database replication avoids these issues.

Airbyte is a an open source data integration platform that handles database replication reliably. The decision to use Airbyte for database replication is documented in this [ADR](https://github.com/DFE-Digital/dfe-analytics/blob/main/docs/adr/003-use-airbyte-for-database-replication.md).
Airbyte is an open-source data integration platform that handles database replication reliably. The decision to use Airbyte for database replication is documented in this [ADR](https://github.com/DFE-Digital/dfe-analytics/blob/main/docs/adr/0003-use-airbyte-for-database-replication.md).

## Airbyte Architecture

@@ -34,13 +34,13 @@ Main configuration options for Airbyte:

- Only tables and columns specified in the `analytics.yml` config file are synchronised

- Any columns with sensitive data listed in the `analytics_blocklist.yml` config file will have the hidden policy tag applied. A rails rake task is available for this:
- Any columns with sensitive data listed in the `analytics_blocklist.yml` config file will have the hidden policy tag applied. A rails rake task is available for this:<br>
`rake dfe:analytics:big_query_apply_policy_tags`

- An airbyte configuration file (`airbyte_stream_config.json`) required by terraform for provisioning the connection can be generated from `analytics.yml`. A rails rake task is available for this:
- An airbyte configuration file (`airbyte_stream_config.json`) required by terraform for provisioning the connection can be generated from `analytics.yml`. A rails rake task is available for this:<br>
`SUPPRESS_DFE_ANALYTICS_INIT=1 rake dfe:analytics:regenerate_airbyte_stream_config`

- Following a schema migration the airbyte connection config can be regenerated by the rails rake task:
- Following a schema migration the airbyte connection config can be regenerated by the rails rake task:<br>
`rake dfe:analytics:airbyte_connection_refresh`

- DfE Analytics will still be used to stream the following event types to BigQuery:
@@ -66,7 +66,9 @@ A dedicated dataset for the final airbyte tables with a service account and perm
NOTES:
- Due to the inability to control the Airbyte destination connector process through ENV variables, we are unable to retrieve an Azure token from the Azure tenant token endpoint. To work around this we use a proxy for the credential source. The proxy is an API server that runs the Ruby server from the [dfe-azure-access-token](https://github.com/DFE-Digital/dfe-azure-access-token) repo. This is controlled through the Service Account JSON. See the section below for [WIF Configuration](#wif-configuration).

Note that the raw internal Airbyte tables are not required in the destination. However, this option is not configurable and it is not possible to
Note that the raw internal Airbyte tables are not required in the destination. However, this option is not configurable and it is not possible to remove internal Airbyte tables from the destination.

SD DevOps provide this setup through terraform configuration.

### 3. Setup Connection

@@ -82,17 +84,17 @@ We require the following datasets for the Airbyte setup.
- One required per BigQuery project
- The expiry on the table should be set to 1 day
- Encryption should be changed to the project Cloud KMS Key
- The GCP Service account used by Airbyte should have permissions below on this dataset:
`BigQuery Data Owner`
- The GCP Service account used by Airbyte should have permissions below on this dataset:<br>
`BigQuery Data Owner`<br>
`Data Catalog Admin`

2. A dataset for the final airbyte tables with naming convention:
2. A dataset for the final airbyte tables with naming convention:<br>
`<service_name>_airbyte_<environment>`, e.g. `register_airbyte_qa`
- One required per service
- An expiry should NOT be set
- Encryption should be changed to the project Cloud KMS Key
- The GCP Service account used by Airbyte should have permissions below on this dataset:
`BigQuery Data Owner`
- The GCP Service account used by Airbyte should have permissions below on this dataset:<br>
`BigQuery Data Owner`<br>
`Data Catalog Admin`
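
The naming conventions above can be sketched as a small helper — a minimal illustration only, not part of the gem; the `register`/`qa` values are examples:

```ruby
# Sketch of the dataset naming conventions described above.
# The raw internal dataset is shared per BigQuery project; the
# final-tables dataset is one per service and environment.
def airbyte_final_dataset_name(service_name, environment)
  "#{service_name}_airbyte_#{environment}"
end

puts airbyte_final_dataset_name("register", "qa") # => register_airbyte_qa
```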

## WIF Configuration
@@ -105,9 +107,9 @@ The client process connecting to GCP should have WIF enabled. SD DevOps provide

If the client process is enabled for WIF, then it will have the following properties per environment (namespace):

The following environment variables will be set:
`AZURE_CLIENT_ID`
`AZURE_FEDERATED_TOKEN_FILE`
The following environment variables will be set:<br>
`AZURE_CLIENT_ID`<br>
`AZURE_FEDERATED_TOKEN_FILE`<br>
`AZURE_TENANT_ID`
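
A client process can read these values as sketched below — variable names come from the list above, but the method name and file handling are illustrative, not the gem's actual implementation:

```ruby
# Minimal sketch: collect the WIF credentials injected into the namespace.
# AZURE_FEDERATED_TOKEN_FILE holds a path; the token itself is read from
# that file. ENV.fetch raises if a variable is missing, which surfaces
# misconfigured namespaces early.
def azure_wif_credentials
  {
    client_id: ENV.fetch("AZURE_CLIENT_ID"),
    tenant_id: ENV.fetch("AZURE_TENANT_ID"),
    federated_token: File.read(ENV.fetch("AZURE_FEDERATED_TOKEN_FILE")).strip
  }
end
```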


@@ -165,7 +167,7 @@ SD DevOps create and configure the JSON WIF Credentials file through terraform c

Migration of sending events from DfE Analytics to Airbyte will be done in several phases.

The first phase will focus on migrating the database create, update and delete events from being sent as events to database replication using Airbyte. Subsequent phase will focus on the mechanism used to send Web request events, API request events, Custom events from being emitted using queueing to database replication of an `events` database table.
The first phase will focus on migrating the database create, update and delete events from being sent as events to database replication using Airbyte. A subsequent phase will focus on migrating Web request events, API request events and custom events from being emitted via queueing to database replication of an `events` database table.

A more detailed document on the migration can be found in the [Data Ingestion Migration Plan](https://educationgovuk-my.sharepoint.com.mcas.ms/:w:/r/personal/elizabeth_karina_education_gov_uk/_layouts/15/doc2.aspx?sourcedoc=%7BBDF5AF1F-443C-4FC0-AD10-D5CB87444714%7D&file=Data%20Ingestion%20Migration%20Plan.docx&action=default&mobileredirect=true&ovuser=fad277c9-c60a-4da1-b5f3-b3b8b34a82f9%2CAmarjit.SINGH-ATWAL%40EDUCATION.GOV.UK&clickparams=eyJBcHBOYW1lIjoiVGVhbXMtRGVza3RvcCIsIkFwcFZlcnNpb24iOiIxNDE1LzI1MDgxNTAwNzE3IiwiSGFzRmVkZXJhdGVkVXNlciI6ZmFsc2V9).

@@ -185,4 +187,4 @@ In summary, the first phase of the migration will focus on the following:
| Airbyte uses internal Postgres database for operational purposes.<br>Default setup allocates minimal db storage space.<br><br> | Use larger Azure Postgres database.<br><br>Use Airbyte config:<br>`TEMPORAL_HISTORY_RETENTION_IN_DAYS=7`<br><br>Add Database size monitoring<br> | L |
| Single Airbyte instance per namespace may be overloaded for projects with numerous services. | Add CPU/Memory monitoring and alerts and resize CPU/Memory if required | M |
| Airbyte outage may cause replication log bloat and lead to a service database outage.<br><br> | Limit max replication log size to prevent log bloat.<br>This must be customised per service.<br>May result in changes being missed during an Airbyte outage.<br><br>Add replication monitoring. | M |
| Airbyte ongoing version maintenance.<br>Upgrades nay break APIs being called if APIs are not backwards compatible. | Check release notes before version upgrades.<br><br> | L |
| Airbyte ongoing version maintenance.<br>Upgrades may break APIs being called if APIs are not backwards compatible. | Check release notes before version upgrades.<br><br> | L |