Commit d61c18b

Airbyte documentation corrections and typos (#207)
1 parent 076957c commit d61c18b

File tree

2 files changed: +20 −18 lines

docs/adr/0003-use-airbyte-for-database-replication.md

Lines changed: 3 additions & 3 deletions
@@ -1,6 +1,6 @@
-# 3. Connections to DTTP Dynamics
+# 3. Use Airbyte for database replication

-Date: 2020-09-11
+Date: 2025-09-11

 ## Status

 Accepted
@@ -65,7 +65,7 @@ In considering the options, some of the most fundamental criteria are as follows
 - Parity with the current DfE Analytics solution is also an important factor
 - Costs of the product and ease of maintenance should also be important considerations

-The list of the option below summarises the most fundamental pros and cons , with details in the comparison table above.
+The list of the options below summarises the most fundamental pros and cons, with details in the comparison table above.

 ### 1. GCP Products

 #### [GCP BigQuery transfer service](https://cloud.google.com/bigquery/docs/dts-introduction)

docs/airbyte.md

Lines changed: 17 additions & 15 deletions
@@ -6,7 +6,7 @@ An alternative to sending DfE::Analytics database events is database replication

 DfE::Analytics database events are also used for replicating data. However, there have been a number of issues outlined in the [Data health import issues](https://github.com/DFE-Digital/dfe-analytics/blob/main/README.md#data-health-import-issues) section. Using Airbyte database replication avoids these issues.

-Airbyte is a an open source data integration platform that handles database replication reliably. The decision to use Airbyte for database replication is documented in this [ADR](https://github.com/DFE-Digital/dfe-analytics/blob/main/docs/adr/003-use-airbyte-for-database-replication.md).
+Airbyte is an open-source data integration platform that handles database replication reliably. The decision to use Airbyte for database replication is documented in this [ADR](https://github.com/DFE-Digital/dfe-analytics/blob/main/docs/adr/0003-use-airbyte-for-database-replication.md).

 ## Airbyte Architecture

@@ -34,13 +34,13 @@ Main configuration options for Airbyte:

 - Only tables and columns specified in the `analytics.yml` config file are synchronised

-- Any columns with sensitive data listed in the `analytics_blocklist.yml` config file will have the hidden policy tag applied. A rails rake task is available for this:
+- Any columns with sensitive data listed in the `analytics_blocklist.yml` config file will have the hidden policy tag applied. A Rails rake task is available for this:<br>
 `rake dfe:analytics:big_query_apply_policy_tags`

-- An airbyte configuration file (`airbyte_stream_config.json`) required by terraform for provisioning the connection can be generated from `analytics.yml`. A rails rake task is available for this:
+- An Airbyte configuration file (`airbyte_stream_config.json`) required by terraform for provisioning the connection can be generated from `analytics.yml`. A Rails rake task is available for this:<br>
 `SUPPRESS_DFE_ANALYTICS_INIT=1 rake dfe:analytics:regenerate_airbyte_stream_config`

-- Following a schema migration the airbyte connection config can be regenerated by the rails rake task:
+- Following a schema migration, the Airbyte connection config can be regenerated by the Rails rake task:<br>
 `rake dfe:analytics:airbyte_connection_refresh`

 - DfE Analytics will still be used to stream the following event types to BigQuery:
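To illustrate the `regenerate_airbyte_stream_config` step above: the task derives an Airbyte stream configuration from the tables and columns listed in `analytics.yml`. The sketch below is a minimal, hypothetical version of that transformation; the field names in both the input and the generated JSON are assumptions for demonstration, not the schema dfe-analytics actually emits.

```python
import json

def build_stream_config(analytics_config: dict) -> dict:
    """Turn a table -> columns mapping (as in analytics.yml) into a
    list of Airbyte stream definitions. Field names are illustrative
    assumptions, not the real airbyte_stream_config.json schema."""
    streams = []
    for table, columns in analytics_config.items():
        streams.append({
            "name": table,
            "selected_fields": sorted(columns),
            "sync_mode": "incremental",  # assumed default
        })
    return {"streams": streams}

# Hypothetical analytics.yml content, already parsed into a dict
analytics_yml = {"candidates": ["id", "created_at"], "courses": ["id", "name"]}
config = build_stream_config(analytics_yml)
print(json.dumps(config, indent=2))
```

The real rake task also has to account for sync modes, primary keys and cursor fields per table, which this sketch does not attempt to reproduce.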
@@ -66,7 +66,9 @@ A dedicated dataset for the final airbyte tables with a service account and perm
 NOTES:
 - Due to the inability to control the Airbyte destination connector process through ENV variables, we are unable to retrieve an Azure token from the Azure tenant token endpoint. To work around this we use a proxy for the credential source. The proxy is an API server that runs the ruby server from the [dfe-azure-access-token](https://github.com/DFE-Digital/dfe-azure-access-token) repo. This is controlled through the Service Account JSON. See the section below for [WIF Configuration](#wif-configuration).

-Note that the raw internal Airbyte tables are not required in the destination. However, this option is not configurable and it is not possible to
+Note that the raw internal Airbyte tables are not required in the destination. However, this option is not configurable and it is not possible to remove internal Airbyte tables from the destination.
+
+SD DevOps provide this setup through terraform configuration.

 ### 3. Setup Connection

@@ -82,17 +84,17 @@ We require the following datasets for the Airbyte setup.
 - One required per BigQuery project
 - The expiry on the table should be set to 1 day
 - Encryption should be changed to the project Cloud KMS Key
-- The GCP Service account used by Airbyte should have permissions below on this dataset:
-`BigQuery Data Owner`
+- The GCP Service account used by Airbyte should have the permissions below on this dataset:<br>
+`BigQuery Data Owner`<br>
 `Data Catalog Admin`

-2. A dataset for the final airbyte tables with naming convention:
+2. A dataset for the final Airbyte tables with the naming convention:<br>
 `<service_name>_airbyte_<environment>` e.g. `register_airbyte_qa`
 - One required per service
 - An expiry should NOT be set
 - Encryption should be changed to the project Cloud KMS Key
-- The GCP Service account used by Airbyte should have permissions below on this dataset:
-`BigQuery Data Owner`
+- The GCP Service account used by Airbyte should have the permissions below on this dataset:<br>
+`BigQuery Data Owner`<br>
 `Data Catalog Admin`

 ## WIF Configuration
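The dataset naming convention described above is mechanical enough to capture in a one-line helper. The function name is hypothetical; only the `<service_name>_airbyte_<environment>` pattern comes from the docs.

```python
def airbyte_dataset_name(service_name: str, environment: str) -> str:
    """Build the final-tables dataset name following the documented
    <service_name>_airbyte_<environment> convention."""
    return f"{service_name}_airbyte_{environment}"

print(airbyte_dataset_name("register", "qa"))  # register_airbyte_qa
```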
@@ -105,9 +107,9 @@ The client process connecting to GCP should have WIF enabled. SD DevOps provide

 If the client process is enabled for WIF, then it will have the following properties per environment (namespace):

-The following environment variables will be set:
-`AZURE_CLIENT_ID`
-`AZURE_FEDERATED_TOKEN_FILE`
+The following environment variables will be set:<br>
+`AZURE_CLIENT_ID`<br>
+`AZURE_FEDERATED_TOKEN_FILE`<br>
 `AZURE_TENANT_ID`

@@ -165,7 +167,7 @@ SD DevOps create and configure the JSON WIF Credentials file through terraform c

 Migration of sending events from DfE Analytics to Airbyte will be done in several phases.

-The first phase will focus on migrating the database create, update and delete events from being sent as events to database replication using Airbyte. Subsequent phase will focus on the mechanism used to send Web request events, API request events, Custom events from being emitted using queueing to database replication of an `events` database table.
+The first phase will focus on migrating the database create, update and delete events from being sent as events to database replication using Airbyte. A subsequent phase will focus on moving Web request events, API request events and Custom events from being emitted via queueing to database replication of an `events` database table.

 A more detailed document on the migration can be found in the [Data Ingestion Migration Plan](https://educationgovuk-my.sharepoint.com.mcas.ms/:w:/r/personal/elizabeth_karina_education_gov_uk/_layouts/15/doc2.aspx?sourcedoc=%7BBDF5AF1F-443C-4FC0-AD10-D5CB87444714%7D&file=Data%20Ingestion%20Migration%20Plan.docx&action=default&mobileredirect=true&ovuser=fad277c9-c60a-4da1-b5f3-b3b8b34a82f9%2CAmarjit.SINGH-ATWAL%40EDUCATION.GOV.UK&clickparams=eyJBcHBOYW1lIjoiVGVhbXMtRGVza3RvcCIsIkFwcFZlcnNpb24iOiIxNDE1LzI1MDgxNTAwNzE3IiwiSGFzRmVkZXJhdGVkVXNlciI6ZmFsc2V9).


@@ -185,4 +187,4 @@ In summary the first phase of the migration will focus on the following:
 | Airbyte uses internal Postgres database for operational purposes.<br>Default setup allocates minimal db storage space.<br><br> | Use larger Azure Postgres database.<br><br>Use Airbyte config:<br>`TEMPORAL_HISTORY_RETENTION_IN_DAYS=7`<br><br>Add Database size monitoring<br> | L |
 | Single Airbyte instance per namespace may be overloaded for projects with numerous services. | Add CPU/Memory monitoring and alerts and resize CPU/Memory if required | M |
 | Airbyte outage may cause replication log bloat and lead to a service database outage.<br><br> | Limit max replication log size to prevent log bloat.<br>This must be customised per service.<br>May result in changes being missed during an Airbyte outage.<br><br>Add replication monitoring. | M |
-| Airbyte ongoing version maintenance.<br>Upgrades nay break APIs being called if APIs are not backwards compatible. | Check release notes before version upgrades.<br><br> | L |
+| Airbyte ongoing version maintenance.<br>Upgrades may break APIs being called if APIs are not backwards compatible. | Check release notes before version upgrades.<br><br> | L |
