docs/airbyte.md
Lines changed: 17 additions & 15 deletions
@@ -6,7 +6,7 @@ An alternative to sending DfE::Analytics database events is database replication

 DfE::Analytics database events are also used for replicating data. However, there have been a number of issues outlined in the [Data health import issues](https://github.com/DFE-Digital/dfe-analytics/blob/main/README.md#data-health-import-issues) section. Using Airbyte database replication avoids these issues.

-Airbyte is an open source data integration platform that handles database replication reliably. The decision to use Airbyte for database replication is documented in this [ADR](https://github.com/DFE-Digital/dfe-analytics/blob/main/docs/adr/003-use-airbyte-for-database-replication.md).
+Airbyte is an open source data integration platform that handles database replication reliably. The decision to use Airbyte for database replication is documented in this [ADR](https://github.com/DFE-Digital/dfe-analytics/blob/main/docs/adr/0003-use-airbyte-for-database-replication.md).

 ## Airbyte Architecture

@@ -34,13 +34,13 @@ Main configuration options for Airbyte:

 - Only tables and columns specified in the `analytics.yml` config file are synchronised

-- Any columns with sensitive data listed in the `analytics_blocklist.yml` config file will have the hidden policy tag applied. A rails rake task is available for this:
+- Any columns with sensitive data listed in the `analytics_blocklist.yml` config file will have the hidden policy tag applied. A rails rake task is available for this:<br>
 `rake dfe:analytics:big_query_apply_policy_tags`

-- An airbyte configuration file (`airbyte_stream_config.json`) required by terraform for provisioning the connection can be generated from `analytics.yml`. A rails rake task is available for this:
+- An airbyte configuration file (`airbyte_stream_config.json`) required by terraform for provisioning the connection can be generated from `analytics.yml`. A rails rake task is available for this:<br>
-- Following a schema migration the airbyte connection config can be regenerated by the rails rake task:
+- Following a schema migration the airbyte connection config can be regenerated by the rails rake task:<br>
 `rake dfe:analytics:airbyte_connection_refresh`

 - DfE Analytics will still be used to stream the following event types to BigQuery:
@@ -66,7 +66,9 @@ A dedicated dataset for the final airbyte tables with a service account and perm
 NOTES:
 - Due to the inability to control the Airbyte destination connector process through ENV variables we are unable to retrieve an azure token from the azure tenant token endpoint. To work around this we use a proxy for the credential source. The proxy is an API server that runs the ruby server from the [dfe-azure-access-token](https://github.com/DFE-Digital/dfe-azure-access-token) repo. This is controlled through the Service Account JSON. See the section below for [WIF Configuration](#wif-configuration).

-Note that the raw internal Airbyte tables are not required in the destination. However, this option is not configurable and it is not possible to
+Note that the raw internal Airbyte tables are not required in the destination. However, this option is not configurable and it is not possible to remove internal Airbyte tables from the destination.
+
+SD DevOps provide this setup through terraform configuration.

 ### 3. Setup Connection

@@ -82,17 +84,17 @@ We require the following datasets for the Airbyte setup.
 - One required per BigQuery project
 - The expiry on the table should be set to 1 day
 - Encryption should be changed to the project Cloud KMS Key
-- The GCP Service account used by Airbyte should have permissions below on this dataset:
-`BigQuery Data Owner`
+- The GCP Service account used by Airbyte should have permissions below on this dataset:<br>
+`BigQuery Data Owner`<br>
 `Data Catalog Admin`

-2. A dataset for the final airbyte tables with naming convention:
+2. A dataset for the final airbyte tables with naming convention:<br>
 - Encryption should be changed to the project Cloud KMS Key
-- The GCP Service account used by Airbyte should have permissions below on this dataset:
-`BigQuery Data Owner`
+- The GCP Service account used by Airbyte should have permissions below on this dataset:<br>
+`BigQuery Data Owner`<br>
 `Data Catalog Admin`

 ## WIF Configuration
@@ -105,9 +107,9 @@ The client process connecting to GCP should have WIF enabled. SD DevOps provide

 If the client process is enabled for WIF, then it will have the following properties per environment (namespace):

-The following environment variables will be set:
-`AZURE_CLIENT_ID`
-`AZURE_FEDERATED_TOKEN_FILE`
+The following environment variables will be set:<br>
+`AZURE_CLIENT_ID`<br>
+`AZURE_FEDERATED_TOKEN_FILE`<br>
 `AZURE_TENANT_ID`

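As a rough illustration of the environment variables listed above, the following Ruby sketch reports which of the three WIF variables are unset or blank in the current process. The helper name `missing_wif_env_vars` and the constant `REQUIRED_WIF_ENV_VARS` are hypothetical and are not part of dfe-analytics; only the variable names come from the documentation.

```ruby
# Variable names taken from the WIF section of the documentation.
REQUIRED_WIF_ENV_VARS = %w[
  AZURE_CLIENT_ID
  AZURE_FEDERATED_TOKEN_FILE
  AZURE_TENANT_ID
].freeze

# Hypothetical helper (an assumption, not a dfe-analytics API):
# returns the names of required variables that are unset or blank in `env`.
# `env` can be ENV or any Hash keyed by variable name.
def missing_wif_env_vars(env = ENV)
  REQUIRED_WIF_ENV_VARS.select { |name| env[name].to_s.strip.empty? }
end

missing = missing_wif_env_vars
if missing.empty?
  puts 'WIF environment variables are present'
else
  warn "Missing WIF environment variables: #{missing.join(', ')}"
end
```

A check like this only confirms that the variables exist; it does not validate the token file contents or that the client process can actually obtain a token.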
@@ -165,7 +167,7 @@ SD DevOps create and configure the JSON WIF Credentials file through terraform c

 Migration of sending events from DfE Analytics to Airbyte will be done in several phases.

-The first phase will focus on migrating the database create, update and delete events from being sent as events to database replication using Airbyte. Subsequent phase will focus on the mechanism used to send Web request events, API request events, Custom events from being emitted using queueing to database replication of an `events` database table.
+The first phase will focus on migrating the database create, update and delete events from being sent as events to database replication using Airbyte. A subsequent phase will focus on the mechanism used to send Web request events, API request events, Custom events from being emitted using queueing to database replication of an `events` database table.

 A more detailed document on the migration can be found in the [Data Ingestion Migration Plan](https://educationgovuk-my.sharepoint.com.mcas.ms/:w:/r/personal/elizabeth_karina_education_gov_uk/_layouts/15/doc2.aspx?sourcedoc=%7BBDF5AF1F-443C-4FC0-AD10-D5CB87444714%7D&file=Data%20Ingestion%20Migration%20Plan.docx&action=default&mobileredirect=true&ovuser=fad277c9-c60a-4da1-b5f3-b3b8b34a82f9%2CAmarjit.SINGH-ATWAL%40EDUCATION.GOV.UK&clickparams=eyJBcHBOYW1lIjoiVGVhbXMtRGVza3RvcCIsIkFwcFZlcnNpb24iOiIxNDE1LzI1MDgxNTAwNzE3IiwiSGFzRmVkZXJhdGVkVXNlciI6ZmFsc2V9).

@@ -185,4 +187,4 @@ In summary the first phase of the migration will focus on the following:
 | Airbyte uses internal Postgres database for operational purposes.<br>Default setup allocates minimal db storage space.<br><br> | Use larger Azure Postgres database.<br><br>Use Airbyte config:<br>`TEMPORAL_HISTORY_RETENTION_IN_DAYS=7`<br><br>Add Database size monitoring<br> | L |
 | Single Airbyte instance per namespace may be overloaded for projects with numerous services. | Add CPU/Memory monitoring and alerts and resize CPU/Memory if required | M |
 | Airbyte outage may cause replication log bloat and lead to a service database outage.<br><br> | Limit max replication log size to prevent log bloat.<br>This must be customised per service. <br>May result in changes being missed during Airbyte outage.<br><br>Add replication monitoring, | M |
-| Airbyte ongoing version maintenance.<br>Upgrades nay break APIs being called if APIs are not backwards compatible. | Check release notes before version upgrades.<br><br> | L |
+| Airbyte ongoing version maintenance.<br>Upgrades may break APIs being called if APIs are not backwards compatible. | Check release notes before version upgrades.<br><br> | L |