Data governance configuration
+++++++++++++++++++++++++++++

**dp** can send **dbt** metadata to DataHub. All related configuration is stored in the ``config/<ENV>/datahub.yml`` file.
More information about it can be found `here <https://datahubproject.io/docs/metadata-ingestion#recipes>`_ and `here <https://datahubproject.io/docs/generated/ingestion/sources/dbt>`_.
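As a hedged illustration only (not the authoritative format **dp** expects), a ``datahub.yml`` holding a standard DataHub recipe sink might look like the sketch below; the server address is hypothetical, and the linked recipe documentation is the source of truth for the available keys:

.. code-block:: yaml

   # config/<ENV>/datahub.yml -- illustrative sketch; see the linked recipe docs
   sink:
     type: datahub-rest
     config:
       server: https://datahub.example.com:8080  # hypothetical DataHub endpoint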

Data ingestion configuration
++++++++++++++++++++++++++++++

Ingestion configuration is divided into two levels:

- General: ``config/<ENV>/ingestion.yml``
- Ingestion tool related: e.g. ``config/<ENV>/airbyte.yml``

``config/<ENV>/ingestion.yml`` contains the basic ingestion configuration:

.. list-table::
   :widths: 25 20 55
   :header-rows: 1

   * - Parameter
     - Data type
     - Description
   * - enable
     - bool
     - Flag enabling/disabling the ingestion option in **dp**.
   * - engine
     - enum string
     - Ingestion tool you would like to integrate with (currently the only supported value is ``airbyte``).
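For illustration, a minimal ``config/<ENV>/ingestion.yml`` using the two parameters above could look like this:

.. code-block:: yaml

   # config/<ENV>/ingestion.yml
   enable: true     # turn the ingestion option on in dp
   engine: airbyte  # currently the only supported engine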

``config/<ENV>/airbyte.yml`` must be present if the engine of your choice is ``airbyte``. It consists of two parts:

1. The first part is required by `dbt-airflow-factory <https://github.com/getindata/dbt-airflow-factory>`_
   and must be present in order to create ingestion tasks preceding the dbt rebuild in Airflow. When you choose to manage
   Airbyte connections with the **dp** tool, ``connectionId`` is unknown at the time of coding; however, the **dp** tool
   is ready to handle this case. For detailed information, refer to the example ``airbyte.yml`` at the end of this section.

   .. list-table::
      :widths: 25 20 55
      :header-rows: 1

      * - Parameter
        - Data type
        - Description
      * - airbyte_connection_id
        - string
        - Name of the Airbyte connection in Airflow.
      * - tasks
        - array<*task*>
        - Configurations of the Airflow tasks used by `dbt-airflow-factory <https://github.com/getindata/dbt-airflow-factory>`_.
          Allowed *task* options are documented `here <https://dbt-airflow-factory.readthedocs.io/en/latest/configuration.html#id3>`_.
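   As a sketch only, this first part could look like the following; the ``task_id`` field is a hypothetical example of a
   *task* option, and the real set of allowed options is defined by the dbt-airflow-factory documentation linked above:

   .. code-block:: yaml

      # First part of config/<ENV>/airbyte.yml -- illustrative values
      airbyte_connection_id: airbyte_default  # name of the Airbyte connection in Airflow
      tasks:
        - task_id: ingest_orders  # hypothetical task option; see the linked documentation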

2. The second part is used directly by the **dp** tool to manage (insert or update) connections in Airbyte. It is **not**
   required unless you would like to manage Airbyte connections with the **dp** tool.

   .. list-table::
      :widths: 25 20 55
      :header-rows: 1

      * - Parameter
        - Data type
        - Description
      * - airbyte_url
        - string
        - HTTPS address of the Airbyte deployment used to connect to the Airbyte API.
      * - connections
        - array<*connection*>
        - Configurations of Airbyte connections that should be upserted during CI/CD. The minimal connection schema is
          documented below. These configurations are passed directly to the Airbyte API's ``connections/create`` or
          ``connections/update`` endpoint. Please refer to the
          `Airbyte API reference <https://airbyte-public-api-docs.s3.us-east-2.amazonaws.com/rapidoc-api-docs.html#post-/v1/connections/create>`_
          for more detailed configuration.

   .. code-block:: text

      YOUR_CONNECTION_NAME: string
          name: string               Optional name of the connection
          sourceId: uuid             UUID of the Airbyte source used for this connection
          destinationId: uuid        UUID of the Airbyte destination used for this connection
          namespaceDefinition: enum  Method used for computing the final namespace in the destination
          namespaceFormat: string    Used when namespaceDefinition is 'customformat'
          status: enum               `active` means that data is flowing through the connection; `inactive` means it is not
          syncCatalog: object        Describes the available schema (catalog)
              streams: array
                  - stream: object
                      name: string         Stream's name
                      jsonSchema: object   Stream schema using JSON Schema specs
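   Putting the pieces together, a single *connection* entry following the schema above might look like this sketch, in
   which every identifier and name is hypothetical:

   .. code-block:: yaml

      # Second part of config/<ENV>/airbyte.yml -- illustrative values only
      airbyte_url: https://airbyte.example.com
      connections:
        raw_orders_ingestion:
          name: raw_orders_ingestion
          sourceId: 00000000-0000-0000-0000-000000000001       # hypothetical source UUID
          destinationId: 00000000-0000-0000-0000-000000000002  # hypothetical destination UUID
          namespaceDefinition: customformat
          namespaceFormat: raw
          status: active
          syncCatalog:
            streams:
              - stream:
                  name: orders
                  jsonSchema: {}  # stream schema using JSON Schema specs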