
Commit 84945c1

Merge pull request #2 from databrickslabs/feature/pp-apply-changes-support
## [v0.0.2] - 2023-05-11

### Added
- Table properties support for bronze, quarantine and silver tables using create_streaming_live_table api call
- Support for track history column using apply_changes api
- Support for delta as source
- Validation for bronze/silver onboarding

### Fixed
- Input schema parsing issue in onboarding

### Modified
- Readme and docs to include above features
2 parents 0c85fd3 + e90c329 commit 84945c1
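For orientation, the two headline features correspond to options on the Delta Live Tables Python API that the framework calls. The sketch below is illustrative only: table names, keys and column lists are placeholders, not code generated by dlt-meta.

```python
import dlt  # available only inside a Delta Live Tables pipeline

# Table properties can be attached when the target streaming live table is declared.
dlt.create_streaming_live_table(
    name="bronze_customers",  # placeholder table name
    table_properties={"pipelines.autoOptimize.zOrderCols": "customer_id"},
)

# Track-history columns are passed to apply_changes; with SCD type 2,
# history rows are kept only when the listed columns change.
dlt.apply_changes(
    target="bronze_customers",
    source="customers_cdc_raw",  # placeholder CDC source view
    keys=["customer_id"],
    sequence_by="sequence_ts",
    stored_as_scd_type="2",
    track_history_column_list=["address", "email"],
)
```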


42 files changed, +754 -501 lines changed

CHANGELOG.md

Lines changed: 10 additions & 0 deletions
@@ -8,6 +8,16 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
**NOTE:** For CLI interfaces, we support SemVer approach. However, for API components we don't use SemVer as of now. This may lead to instability when using dbx API methods directly.

[Please read through the Keep a Changelog (~5min)](https://keepachangelog.com/en/1.0.0/).
+## [v0.0.2] - 2023-05-11
+### Added
+- Table properties support for bronze, quarantine and silver tables using create_streaming_live_table api call
+- Support for track history column using apply_changes api
+- Support for delta as source
+- Validation for bronze/silver onboarding
+### Fixed
+- Input schema parsing issue in onboarding
+### Modified
+- Readme and docs to include above features

## [v0.0.1] - 2023-03-22
### Added

README.md

Lines changed: 6 additions & 158 deletions
@@ -20,7 +20,7 @@
alt="GitHub Workflow Status (branch)"/>
</a>
<a href="https://codecov.io/gh/databrickslabs/dlt-meta">
-<img src="https://img.shields.io/codecov/c/github/databrickslabs/dlt-meta?style=for-the-badge&amp;token=KI3HFZQWF0"
+<img src="https://img.shields.io/codecov/c/github/databrickslabs/dlt-meta?style=for-the-badge&amp;token=2CxLj3YBam"
alt="codecov"/>
</a>
<a href="https://lgtm.com/projects/g/databrickslabs/dlt-meta/alerts">
@@ -63,167 +63,15 @@ With this framework you need to record the source and target metadata in an onbo
## High-Level Process Flow:
![DLT-META High-Level Process Flow](./docs/static/images/solutions_overview.png)

-## More questions
-
-Refer to the [FAQ](https://databrickslabs.github.io/dlt-meta/faq)
-and DLT-META [documentation](https://databrickslabs.github.io/dlt-meta/)
-
## Steps
![DLT-META Stages](./docs/static/images/dlt-meta_stages.png)

+## Getting Started
+Refer to the [Getting Started](https://databrickslabs.github.io/dlt-meta/getting_started)

-## 1. Metadata preparation
-1. Create ```onboarding.json``` metadata file and save to s3/adls/dbfs e.g.[onboarding file](https://github.com/databrickslabs/dlt-meta/blob/main/examples/onboarding.json)
-2. Create ```silver_transformations.json``` and save to s3/adls/dbfs e.g [Silver transformation file](https://github.com/databrickslabs/dlt-meta/blob/main/examples/silver_transformations.json)
-3. Create data quality rules json and store to s3/adls/dbfs e.g [Data Quality Rules](https://github.com/databrickslabs/dlt-meta/tree/main/examples/dqe/customers/bronze_data_quality_expectations.json)
-
-## 2. Onboarding job
-
-1. Go to your Databricks landing page and do one of the following:
-
-2. In the sidebar, click Jobs Icon Workflows and click Create Job Button.
-
-3. In the sidebar, click New Icon New and select Job from the menu.
-
-4. In the task dialog box that appears on the Tasks tab, replace Add a name for your job… with your job name, for example, Python wheel example.
-
-5. In Task name, enter a name for the task, for example, ```dlt_meta_onboarding_pythonwheel_task```.
-
-6. In Type, select Python wheel.
-
-5. In Package name, enter ```dlt_meta```.
-
-6. In Entry point, enter ``run``.
-
-7. Click Add under Dependent Libraries. In the Add dependent library dialog, under Library Type, click PyPI. Enter Package: ```dlt-meta```
-
-
-8. Click Add.
-
-9. In Parameters, select keyword argument then select JSON. Past below json parameters with :
-```
-{
-"database": "dlt_demo",
-"onboarding_file_path": "dbfs:/onboarding_files/users_onboarding.json",
-"silver_dataflowspec_table": "silver_dataflowspec_table",
-"silver_dataflowspec_path": "dbfs:/onboarding_tables_cdc/silver",
-"bronze_dataflowspec_table": "bronze_dataflowspec_table",
-"import_author": "Ravi",
-"version": "v1",
-"bronze_dataflowspec_path": "dbfs:/onboarding_tables_cdc/bronze",
-"overwrite": "True",
-"env": "dev"
-}
-```
-Alternatly you can enter keyword arguments, click + Add and enter a key and value. Click + Add again to enter more arguments.
-
-10. Click Save task.
-
-11. Run now
-
-12. Make sure job run successfully. Verify metadata in your dataflow spec tables entered in step: 9 e.g ```dlt_demo.bronze_dataflowspec_table``` , ```dlt_demo.silver_dataflowspec_table```
-
-## 3. Launch Dataflow DLT Pipeline
-### Create a dlt launch notebook
-
-1. Go to your Databricks landing page and select Create a notebook, or click New Icon New in the sidebar and select Notebook. The Create Notebook dialog appears.
-
-2. In the Create Notebook dialogue, give your notebook a name e.g ```dlt_meta_pipeline``` and select Python from the Default Language dropdown menu. You can leave Cluster set to the default value. The Delta Live Tables runtime creates a cluster before it runs your pipeline.
-
-3. Click Create.
-
-4. You can add the [example dlt pipeline](https://github.com/databrickslabs/dlt-meta/blob/main/examples/dlt_meta_pipeline.ipynb) code or import iPython notebook as is.
-
-### Create a DLT pipeline
-
-1. Click Jobs Icon Workflows in the sidebar, click the Delta Live Tables tab, and click Create Pipeline.
-
-2. Give the pipeline a name e.g. DLT_META_BRONZE and click File Picker Icon to select a notebook ```dlt_meta_pipeline``` created in step: ```Create a dlt launch notebook```.
-
-3. Optionally enter a storage location for output data from the pipeline. The system uses a default location if you leave Storage location empty.
-
-4. Select Triggered for Pipeline Mode.
-
-5. Enter Configuration parameters e.g.
-```
-"layer": "bronze",
-"bronze.dataflowspecTable": "dataflowspec table name",
-"bronze.group": "enter group name from metadata e.g. G1",
-```
-
-6. Enter target schema where you wants your bronze/silver tables to be created
-
-7. Click Create.
-
-8. Start pipeline: click the Start button on in top panel. The system returns a message confirming that your pipeline is starting
-
-
-
-# Additional
-You can run integration tests from you local with dlt-meta.
-## Run Integration Tests
-1. Clone [DLT-META](https://github.com/databrickslabs/dlt-meta)
-
-2. Open terminal and Goto root folder ```DLT-META```
-
-3. Create environment variables.
-
-```
-export DATABRICKS_HOST=<DATABRICKS HOST>
-export DATABRICKS_TOKEN=<DATABRICKS TOKEN> # Account needs permission to create clusters/dlt pipelines.
-```
-
-4. Run itegration tests for different supported input sources: cloudfiles, eventhub, kafka
-
-4a. Run the command for cloudfiles ```python integration-tests/run-integration-test.py --cloud_provider_name=aws --dbr_version=11.3.x-scala2.12 --source=cloudfiles --dbfs_path=dbfs:/tmp/DLT-META/```
-
-4b. Run the command for eventhub ```python integration-tests/run-integration-test.py --cloud_provider_name=azure --dbr_version=11.3.x-scala2.12 --source=eventhub --dbfs_path=dbfs:/tmp/DLT-META/ --eventhub_name=iot --eventhub_secrets_scope_name=eventhubs_creds --eventhub_namespace=int_test-standard --eventhub_port=9093 --eventhub_producer_accesskey_name=producer ----eventhub_consumer_accesskey_name=consumer```
-
-For eventhub integration tests, the following are the prerequisites:
-1. Needs eventhub instance running
-2. Using Databricks CLI, Create databricks secrets scope for eventhub keys
-3. Using Databricks CLI, Create databricks secrets to store producer and consumer keys using the scope created in step 2
-
-Following are the mandatory arguments for running EventHubs integration test
-1. Provide your eventhub topic name : ```--eventhub_name```
-2. Provide eventhub namespace using ```--eventhub_namespace```
-3. Provide eventhub port using ```--eventhub_port```
-4. Provide databricks secret scope name using ```----eventhub_secrets_scope_name```
-5. Provide eventhub producer access key name using ```--eventhub_producer_accesskey_name```
-6. Provide eventhub access key name using ```--eventhub_consumer_accesskey_name```
-
-
-4c. Run the command for kafka ```python3 integration-tests/run-integration-test.py --cloud_provider_name=aws --dbr_version=11.3.x-scala2.12 --source=kafka --dbfs_path=dbfs:/tmp/DLT-META/ --kafka_topic_name=dlt-meta-integration-test --kafka_broker=host:9092```
-
-For kafka integration tests, the following are the prerequisites:
-1. Needs kafka instance running
-
-Following are the mandatory arguments for running EventHubs integration test
-1. Provide your kafka topic name : ```--kafka_topic_name```
-2. Provide kafka_broker ```--kafka_broker```
-
-
-
-Once finished integration output file will be copied locally to ```integration-test-output_<run_id>.csv```
-
-5. Output of a successful run should have the following in the file
-
-```
-,0
-0,Completed Bronze DLT Pipeline.
-1,Completed Silver DLT Pipeline.
-2,Validating DLT Bronze and Silver Table Counts...
-3,Validating Counts for Table bronze_7b866603ab184c70a66805ac8043a03d.transactions_cdc.
-4,Expected: 10002 Actual: 10002. Passed!
-5,Validating Counts for Table bronze_7b866603ab184c70a66805ac8043a03d.transactions_cdc_quarantine.
-6,Expected: 9842 Actual: 9842. Passed!
-7,Validating Counts for Table bronze_7b866603ab184c70a66805ac8043a03d.customers_cdc.
-8,Expected: 98928 Actual: 98928. Passed!
-9,Validating Counts for Table silver_7b866603ab184c70a66805ac8043a03d.transactions.
-10,Expected: 8759 Actual: 8759. Passed!
-11,Validating Counts for Table silver_7b866603ab184c70a66805ac8043a03d.customers.
-12,Expected: 87256 Actual: 87256. Passed!
-```
+## More questions
+Refer to the [FAQ](https://databrickslabs.github.io/dlt-meta/faq)
+and DLT-META [documentation](https://databrickslabs.github.io/dlt-meta/)

# Project Support
Please note that all projects released under [`Databricks Labs`](https://www.databricks.com/learn/labs)
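The pipeline-launch walkthrough removed above points to the example notebook dlt_meta_pipeline.ipynb and the `layer`/`bronze.*` pipeline configuration. A rough sketch of what such a launch notebook amounts to, assuming the `src` package prefix used elsewhere in this commit and a `DataflowPipeline.invoke_dlt_pipeline` entry point (the linked notebook is the authoritative version):

```python
# Sketch only; assumes the src.dataflow_pipeline module and the
# DataflowPipeline.invoke_dlt_pipeline entry point exist as in the
# linked dlt_meta_pipeline.ipynb example.
from src.dataflow_pipeline import DataflowPipeline

# "layer" comes from the DLT pipeline configuration, e.g. {"layer": "bronze"}.
layer = spark.conf.get("layer", None)

# Generates the bronze/silver DLT tables for the configured dataflowspec group.
DataflowPipeline.invoke_dlt_pipeline(spark, layer)
```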

docs/content/faq/execution.md

Lines changed: 22 additions & 0 deletions
@@ -24,3 +24,25 @@ DLT-META translates input metadata into Delta table as DataflowSpecs
**Q. How many DLT pipelines will be launched using DLT-META?**

DLT-META uses data_flow_group to launch DLT pipelines, so all the tables belongs to same group will be executed under single DLT pipeline.
+
+**Q. Can we run onboarding for bronze layer only?**
+
+Yes! Remove silver related attributes from onboarding file and call `onboard_bronze_dataflow_spec()` API from ```OnboardDataflowspec```. Similarly you can run silver layer onboarding separately using `onboard_silver_dataflow_spec()`API from `OnboardDataflowspec` with silver parameters included in `onboarding_params_map`
+
+```
+onboarding_params_map = {
+"onboarding_file_path":onboarding_file_path,
+"database":bronze_database,
+"env":"dev",
+"bronze_dataflowspec_table":"bronze_dataflowspec_tablename",
+"bronze_dataflowspec_path": bronze_dataflowspec_path,
+"overwrite":"True",
+"version":"v1",
+"import_author":"Ravi",
+}
+print(onboarding_params_map)
+
+from src.onboard_dataflowspec import OnboardDataflowspec
+OnboardDataflowspec(spark, onboarding_params_map).onboard_bronze_dataflow_spec()
+
+```
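As the answer above notes, the silver layer can be onboarded on its own with `onboard_silver_dataflow_spec()`. A minimal sketch that mirrors the bronze example, assuming silver counterparts of the bronze parameters (`silver_dataflowspec_table`, `silver_dataflowspec_path`, as seen in the onboarding job parameters) and the same `spark` session and variables:

```python
# Silver-only onboarding; mirrors the bronze example above with silver_* keys.
onboarding_params_map = {
    "onboarding_file_path": onboarding_file_path,   # same onboarding file as bronze
    "database": silver_database,                    # assumed silver target database
    "env": "dev",
    "silver_dataflowspec_table": "silver_dataflowspec_tablename",
    "silver_dataflowspec_path": silver_dataflowspec_path,
    "overwrite": "True",
    "version": "v1",
    "import_author": "Ravi",
}

from src.onboard_dataflowspec import OnboardDataflowspec
OnboardDataflowspec(spark, onboarding_params_map).onboard_silver_dataflow_spec()
```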

docs/content/faq/general.md

Lines changed: 1 addition & 1 deletion
@@ -17,7 +17,7 @@ DLT-META is a solution/framework using Databricks Delta Live Tables aka DLT whic

**Q. What different types of reader are supported using DLT-META ?**

-DLT-META uses Databricks [Auto Loader](https://docs.databricks.com/ingestion/auto-loader/index.html) to read from s3/adls/blog stroage.
+DLT-META uses Databricks [Auto Loader](https://docs.databricks.com/ingestion/auto-loader/index.html), DELTA, KAFKA, EVENTHUB to read from s3/adls/blog stroage.

**Q. Can DLT-META support any other readers?**

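For context on what those reader types mean in practice, here is a rough, generic Spark Structured Streaming illustration; this is not dlt-meta's internal code, paths, table and topic names are placeholders, and Event Hubs is typically consumed through the same Kafka connector over its Kafka-compatible endpoint (port 9093, as in the integration-test command):

```python
# Generic Structured Streaming reads for the supported source types; assumes an
# active Databricks/Spark session bound to `spark`. All names are placeholders.

# Auto Loader (cloudFiles) reading files from cloud object storage
raw_cloudfiles = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .load("s3://my-bucket/landing/customers/")
)

# A Delta table as a streaming source
raw_delta = spark.readStream.table("landing.customers_cdc")

# Kafka (and Event Hubs via its Kafka-compatible endpoint)
raw_kafka = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker-host:9092")
    .option("subscribe", "customers_cdc")
    .load()
)
```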
Lines changed: 35 additions & 39 deletions
@@ -1,9 +1,10 @@
---
title: "Additionals"
date: 2021-08-04T14:25:26-04:00
-weight: 19
+weight: 21
draft: false
---
+This is easist way to launch dlt-meta to your databricks workspace with following steps.

## Run Integration Tests
1. Launch Terminal/Command promt
@@ -22,55 +23,50 @@ export DATABRICKS_TOKEN=<DATABRICKS TOKEN> # Account needs permission to create
5. Run integration test against cloudfile or eventhub or kafka using below options:
5a. Run the command for cloudfiles ```python integration-tests/run-integration-test.py --cloud_provider_name=aws --dbr_version=11.3.x-scala2.12 --source=cloudfiles --dbfs_path=dbfs:/tmp/DLT-META/```

-5b. Run the command for eventhub ```python integration-tests/run-integration-test.py --cloud_provider_name=azure --dbr_version=11.3.x-scala2.12 --source=eventhub --dbfs_path=dbfs:/tmp/DLT-META/ --eventhub_name=iot --eventhub_secrets_scope_name=eventhubs_creds --eventhub_namespace=int_test-standard --eventhub_port=9093 --eventhub_producer_accesskey_name=producer ----eventhub_consumer_accesskey_name=consumer```
+5b. Run the command for eventhub ```python integration-tests/run-integration-test.py --cloud_provider_name=azure --dbr_version=11.3.x-scala2.12 --source=eventhub --dbfs_path=dbfs:/tmp/DLT-META/ --eventhub_name=iot --eventhub_secrets_scope_name=eventhubs_creds --eventhub_namespace=int_test-standard --eventhub_port=9093 --eventhub_producer_accesskey_name=producer --eventhub_consumer_accesskey_name=consumer```

-For eventhub integration tests, the following are the prerequisites:
-1. Needs eventhub instance running
-2. Using Databricks CLI, Create databricks secrets scope for eventhub keys
-3. Using Databricks CLI, Create databricks secrets to store producer and consumer keys using the scope created in step 2
+For eventhub integration tests, the following are the prerequisites:
+1. Needs eventhub instance running
+2. Using Databricks CLI, Create databricks secrets scope for eventhub keys
+3. Using Databricks CLI, Create databricks secrets to store producer and consumer keys using the scope created in step 2

-Following are the mandatory arguments for running EventHubs integration test
-1. Provide your eventhub topic name : ```--eventhub_name```
-2. Provide eventhub namespace using ```--eventhub_namespace```
-3. Provide eventhub port using ```--eventhub_port```
-4. Provide databricks secret scope name using ```----eventhub_secrets_scope_name```
-5. Provide eventhub producer access key name using ```--eventhub_producer_accesskey_name```
-6. Provide eventhub access key name using ```--eventhub_consumer_accesskey_name```
+Following are the mandatory arguments for running EventHubs integration test
+1. Provide your eventhub topic : --eventhub_name
+2. Provide eventhub namespace : --eventhub_namespace
+3. Provide eventhub port : --eventhub_port
+4. Provide databricks secret scope name : --eventhub_secrets_scope_name
+5. Provide eventhub producer access key name : --eventhub_producer_accesskey_name
+6. Provide eventhub access key name : --eventhub_consumer_accesskey_name


5c. Run the command for kafka ```python3 integration-tests/run-integration-test.py --cloud_provider_name=aws --dbr_version=11.3.x-scala2.12 --source=kafka --dbfs_path=dbfs:/tmp/DLT-META/ --kafka_topic_name=dlt-meta-integration-test --kafka_broker=host:9092```

-For kafka integration tests, the following are the prerequisites:
-1. Needs kafka instance running
+For kafka integration tests, the following are the prerequisites:
+1. Needs kafka instance running

-Following are the mandatory arguments for running EventHubs integration test
-1. Provide your kafka topic name : ```--kafka_topic_name```
-2. Provide kafka_broker ```--kafka_broker```
+Following are the mandatory arguments for running EventHubs integration test
+1. Provide your kafka topic name : --kafka_topic_name
+2. Provide kafka_broker : --kafka_broker

6. Once finished integration output file will be copied locally to
```integration-test-output_<run_id>.txt```

7. Output of a successful run should have the following in the file
```
-Generating Onboarding Json file for Integration Test.
-Successfully Generated Onboarding Json file for Integration Test.
-Setting up dlt-meta metadata tables.
-Successfully setup dlt-meta metadata tables.
-Completed Bronze DLT Pipeline.
-Completed Silver DLT Pipeline.
-Validating DLT Bronze and Silver Table Counts...
-Validating Counts for Table bronze_f7d4934efe494de987f364e8d93acaba.transactions_cdc.
-Expected: 10002 Actual: 10002. Passed!
-Validating Counts for Table bronze_f7d4934efe494de987f364e8d93acaba.transactions_cdc_quarantine.
-Expected: 9842 Actual: 9842. Passed!
-Validating Counts for Table bronze_f7d4934efe494de987f364e8d93acaba.customers_cdc.
-Expected: 98928 Actual: 98928. Passed!
-Validating Counts for Table silver_f7d4934efe494de987f364e8d93acaba.transactions.
-Expected: 8759 Actual: 8759. Passed!
-Validating Counts for Table silver_f7d4934efe494de987f364e8d93acaba.customers.
-Expected: 87256 Actual: 87256. Passed!
-DROPPING DB bronze_f7d4934efe494de987f364e8d93acaba
-DROPPING DB silver_f7d4934efe494de987f364e8d93acaba
-DROPPING DB dlt_meta_framework_it_f7d4934efe494de987f364e8d93acaba_f7d4934efe494de987f364e8d93acaba
-Removed Integration test databases
+,0
+0,Completed Bronze DLT Pipeline.
+1,Completed Silver DLT Pipeline.
+2,Validating DLT Bronze and Silver Table Counts...
+3,Validating Counts for Table bronze_7d1d3ccc9e144a85b07c23110ea50133.transactions.
+4,Expected: 10002 Actual: 10002. Passed!
+5,Validating Counts for Table bronze_7d1d3ccc9e144a85b07c23110ea50133.transactions_quarantine.
+6,Expected: 7 Actual: 7. Passed!
+7,Validating Counts for Table bronze_7d1d3ccc9e144a85b07c23110ea50133.customers.
+8,Expected: 98928 Actual: 98923. Failed!
+9,Validating Counts for Table bronze_7d1d3ccc9e144a85b07c23110ea50133.customers_quarantine.
+10,Expected: 1077 Actual: 1077. Passed!
+11,Validating Counts for Table silver_7d1d3ccc9e144a85b07c23110ea50133.transactions.
+12,Expected: 8759 Actual: 8759. Passed!
+13,Validating Counts for Table silver_7d1d3ccc9e144a85b07c23110ea50133.customers.
+14,Expected: 87256 Actual: 87251. Failed!
```

docs/content/getting_started/dltpipeline.md

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
---
title: "Launch Generic DLT pipeline"
date: 2021-08-04T14:25:26-04:00
-weight: 18
+weight: 20
draft: false
---
