diff --git a/docs/adr/0003-use-airbyte-for-database-replication.md b/docs/adr/0003-use-airbyte-for-database-replication.md
index 60be6a35..3e700f09 100644
--- a/docs/adr/0003-use-airbyte-for-database-replication.md
+++ b/docs/adr/0003-use-airbyte-for-database-replication.md
@@ -1,6 +1,6 @@
-# 3. Connections to DTTP Dynamics
+# 3. Use Airbyte for database replication
-Date: 2020-09-11
+Date: 2025-09-11
## Status
Accepted
@@ -65,7 +65,7 @@ In considering the options, some of the most fundamental criteria are as follows
- Parity with the current DfE Analytics solution is also an important factor
- Costs of the product and ease of maintenance should also be important considerations
-The list of the option below summarises the most fundamental pros and cons , with details in the comparison table above.
+The list of the options below summarises the most fundamental pros and cons, with details in the comparison table above.
### 1. GCP Products
#### [GCP BigQuery transfer service](https://cloud.google.com/bigquery/docs/dts-introduction)
diff --git a/docs/airbyte.md b/docs/airbyte.md
index 0b7aa004..c8668211 100644
--- a/docs/airbyte.md
+++ b/docs/airbyte.md
@@ -6,7 +6,7 @@ An alternative to sending DfE::Analytics database events is database replication
DfE::Analytics database events are also used for replicating data. However, there have been a number of issues outlined in the [Data health import issues](https://github.com/DFE-Digital/dfe-analytics/blob/main/README.md#data-health-import-issues) section. Using Airbyte database replication avoids these issues.
-Airbyte is a an open source data integration platform that handles database replication reliably. The decision to use Airbyte for database replication is documented in this [ADR](https://github.com/DFE-Digital/dfe-analytics/blob/main/docs/adr/003-use-airbyte-for-database-replication.md).
+Airbyte is an open source data integration platform that handles database replication reliably. The decision to use Airbyte for database replication is documented in this [ADR](https://github.com/DFE-Digital/dfe-analytics/blob/main/docs/adr/0003-use-airbyte-for-database-replication.md).
## Airbyte Architecture
@@ -34,13 +34,13 @@ Main configuration options for Airbyte:
- Only tables and columns specified in the `analytics.yml` config file are synchronised
-- Any columns with sensitive data listed in the `analytics_blocklist.yml` config file will have the hidden policy tag applied. A rails rake task is available for this:
+- Any columns with sensitive data listed in the `analytics_blocklist.yml` config file will have the hidden policy tag applied. A Rails rake task is available for this:
`rake dfe:analytics:big_query_apply_policy_tags`
-- An airbyte configuration file (`airbyte_stream_config.json`) required by terraform for provisioning the connection can be generated from `analytics.yml`. A rails rake task is available for this:
+- An Airbyte configuration file (`airbyte_stream_config.json`) required by Terraform for provisioning the connection can be generated from `analytics.yml`. A Rails rake task is available for this:
`SUPPRESS_DFE_ANALYTICS_INIT=1 rake dfe:analytics:regenerate_airbyte_stream_config`
-- Following a schema migration the airbyte connection config can be regenerated by the rails rake task:
+- Following a schema migration, the Airbyte connection config can be regenerated by the Rails rake task:
`rake dfe:analytics:airbyte_connection_refresh`
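For illustration, a blocklist entry might look like the following sketch. The model and column names here are hypothetical; check the generated file and the dfe-analytics README for the exact layout in your service:

```yaml
# analytics_blocklist.yml -- hypothetical model/column names for illustration
shared:
  candidates:
    - date_of_birth
    - email_address
  providers:
    - contact_phone_number
```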
- DfE Analytics will still be used to stream the following event types to BigQuery:
@@ -66,7 +66,9 @@ A dedicated dataset for the final airbyte tables with a service account and perm
NOTES:
- Due to the inability to control the Airbyte destination connector process through ENV variables, we are unable to retrieve an Azure token from the Azure tenant token endpoint. To work around this we use a proxy for the credential source. The proxy is an API server that runs the Ruby server from the [dfe-azure-access-token](https://github.com/DFE-Digital/dfe-azure-access-token) repo. This is controlled through the Service Account JSON. See the section below for [WIF Configuration](#wif-configuration).
-Note that the raw internal Airbyte tables are not required in the destination. However, this option is not configurable and it is not possible to
+Note that the raw internal Airbyte tables are not required in the destination. However, this option is not configurable and it is not possible to remove the internal Airbyte tables from the destination.
+
+SD DevOps provide this setup through Terraform configuration.
### 3. Setup Connection
@@ -82,17 +84,17 @@ We require the following datasets for the Airbyte setup.
- One required per BigQuery project
- The expiry on the table should be set to 1 day
- Encryption should be changed to the project Cloud KMS Key
- - The GCP Service account used by Airbyte should have permissions below on this dataset:
- `BigQuery Data Owner`
+ - The GCP Service account used by Airbyte should have permissions below on this dataset:
+ `BigQuery Data Owner`
`Data Catalog Admin`
-2. A dataset for the final airbyte tables with naming convention:
+2. A dataset for the final Airbyte tables with naming convention:
`_airbyte_`, e.g. `register_airbyte_qa`
- One required per service
- An expiry should NOT be set
- Encryption should be changed to the project Cloud KMS Key
- - The GCP Service account used by Airbyte should have permissions below on this dataset:
- `BigQuery Data Owner`
+ - The GCP Service account used by Airbyte should have permissions below on this dataset:
+ `BigQuery Data Owner`
`Data Catalog Admin`
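As an illustrative sketch only (the real datasets are provisioned by SD DevOps through Terraform), the naming convention and final-dataset creation could be expressed with the `bq` CLI. The `PROJECT`, `RING` and `KEY` values are placeholders, not real resource names:

```shell
# Builds the <service>_airbyte_<environment> dataset name.
airbyte_dataset_name() {
  printf '%s_airbyte_%s' "$1" "$2"
}

# Hypothetical helper: creates the final dataset with CMEK encryption
# via the project Cloud KMS key and no table expiry (sketch only).
create_airbyte_dataset() {
  dataset="$(airbyte_dataset_name "$1" "$2")"
  bq mk --dataset \
    --default_kms_key "projects/$PROJECT/locations/europe-west2/keyRings/$RING/cryptoKeys/$KEY" \
    "$PROJECT:$dataset"
}
```

For example, `airbyte_dataset_name register qa` yields `register_airbyte_qa`, matching the naming convention above.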
## WIF Configuration
@@ -105,9 +107,9 @@ The client process connecting to GCP should have WIF enabled. SD DevOps provide
If the client process is enabled for WIF, then it will have the following properties per environment (namespace):
-The following environment variables will be set:
-`AZURE_CLIENT_ID`
-`AZURE_FEDERATED_TOKEN_FILE`
+The following environment variables will be set:
+`AZURE_CLIENT_ID`
+`AZURE_FEDERATED_TOKEN_FILE`
`AZURE_TENANT_ID`
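The exchange these variables enable can be sketched in Ruby. This is a sketch only: in this setup the exchange is actually performed by the dfe-azure-access-token proxy, and the scope shown here is an assumption:

```ruby
require "net/http"
require "json"
require "uri"

# Sketch of an Azure workload identity federation token exchange.
# Builds the form parameters for the client credentials grant, using the
# Kubernetes-projected federated token as the client assertion.
def azure_token_request(client_id:, federated_token:)
  {
    "grant_type" => "client_credentials",
    "client_id" => client_id,
    "client_assertion_type" => "urn:ietf:params:oauth:client-assertion-type:jwt-bearer",
    "client_assertion" => federated_token,
    "scope" => "api://AzureADTokenExchange/.default" # assumed scope
  }
end

# Posts the request to the tenant token endpoint and returns the access token.
def fetch_azure_token
  tenant = ENV.fetch("AZURE_TENANT_ID")
  params = azure_token_request(
    client_id: ENV.fetch("AZURE_CLIENT_ID"),
    federated_token: File.read(ENV.fetch("AZURE_FEDERATED_TOKEN_FILE"))
  )
  uri = URI("https://login.microsoftonline.com/#{tenant}/oauth2/v2.0/token")
  response = Net::HTTP.post_form(uri, params)
  JSON.parse(response.body).fetch("access_token")
end
```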
@@ -165,7 +167,7 @@ SD DevOps create and configure the JSON WIF Credentials file through terraform c
Migration of sending events from DfE Analytics to Airbyte will be done in several phases.
-The first phase will focus on migrating the database create, update and delete events from being sent as events to database replication using Airbyte. Subsequent phase will focus on the mechanism used to send Web request events, API request events, Custom events from being emitted using queueing to database replication of an `events` database table.
+The first phase will focus on migrating the database create, update and delete events from being sent as events to database replication using Airbyte. A subsequent phase will focus on migrating Web request events, API request events and Custom events from being emitted via queueing to database replication of an `events` database table.
A more detailed document on the migration can be found in the [Data Ingestion Migration Plan](https://educationgovuk-my.sharepoint.com.mcas.ms/:w:/r/personal/elizabeth_karina_education_gov_uk/_layouts/15/doc2.aspx?sourcedoc=%7BBDF5AF1F-443C-4FC0-AD10-D5CB87444714%7D&file=Data%20Ingestion%20Migration%20Plan.docx&action=default&mobileredirect=true&ovuser=fad277c9-c60a-4da1-b5f3-b3b8b34a82f9%2CAmarjit.SINGH-ATWAL%40EDUCATION.GOV.UK&clickparams=eyJBcHBOYW1lIjoiVGVhbXMtRGVza3RvcCIsIkFwcFZlcnNpb24iOiIxNDE1LzI1MDgxNTAwNzE3IiwiSGFzRmVkZXJhdGVkVXNlciI6ZmFsc2V9).
@@ -185,4 +187,4 @@ In summary the first phase of the migration will focus on the following:
| Airbyte uses internal Postgres database for operational purposes. Default setup allocates minimal db storage space. | Use larger Azure Postgres database. Use Airbyte config: `TEMPORAL_HISTORY_RETENTION_IN_DAYS=7`. Add database size monitoring. | L |
| Single Airbyte instance per namespace may be overloaded for projects with numerous services. | Add CPU/Memory monitoring and alerts and resize CPU/Memory if required. | M |
| Airbyte outage may cause replication log bloat and lead to a service database outage. | Limit max replication log size to prevent log bloat. This must be customised per service. May result in changes being missed during an Airbyte outage. Add replication monitoring. | M |
-| Airbyte ongoing version maintenance. Upgrades nay break APIs being called if APIs are not backwards compatible. | Check release notes before version upgrades. | L |
+| Airbyte ongoing version maintenance. Upgrades may break APIs being called if APIs are not backwards compatible. | Check release notes before version upgrades. | L |