|
20 | 20 | alt="GitHub Workflow Status (branch)"/> |
21 | 21 | </a> |
22 | 22 | <a href="https://codecov.io/gh/databrickslabs/dlt-meta"> |
23 | | - <img src="https://img.shields.io/codecov/c/github/databrickslabs/dlt-meta?style=for-the-badge&token=KI3HFZQWF0" |
| 23 | + <img src="https://img.shields.io/codecov/c/github/databrickslabs/dlt-meta?style=for-the-badge&token=2CxLj3YBam" |
24 | 24 | alt="codecov"/> |
25 | 25 | </a> |
26 | 26 | <a href="https://lgtm.com/projects/g/databrickslabs/dlt-meta/alerts"> |
@@ -63,167 +63,15 @@ With this framework you need to record the source and target metadata in an onbo |
63 | 63 | ## High-Level Process Flow: |
64 | 64 |  |
65 | 65 |
|
66 | | -## More questions |
67 | | - |
68 | | -Refer to the [FAQ](https://databrickslabs.github.io/dlt-meta/faq) |
69 | | -and DLT-META [documentation](https://databrickslabs.github.io/dlt-meta/) |
70 | | - |
71 | 66 | ## Steps |
72 | 67 |  |
73 | 68 |
|
| 69 | +## Getting Started |
| 70 | +Refer to the [Getting Started](https://databrickslabs.github.io/dlt-meta/getting_started) |
74 | 71 |
|
75 | | -## 1. Metadata preparation |
76 | | -1. Create an ```onboarding.json``` metadata file and save it to s3/adls/dbfs, e.g. [onboarding file](https://github.com/databrickslabs/dlt-meta/blob/main/examples/onboarding.json) |
77 | | -2. Create ```silver_transformations.json``` and save it to s3/adls/dbfs, e.g. [Silver transformation file](https://github.com/databrickslabs/dlt-meta/blob/main/examples/silver_transformations.json) |
78 | | -3. Create a data quality rules JSON file and store it on s3/adls/dbfs, e.g. [Data Quality Rules](https://github.com/databrickslabs/dlt-meta/tree/main/examples/dqe/customers/bronze_data_quality_expectations.json) |
79 | | - |
80 | | -## 2. Onboarding job |
81 | | - |
82 | | -1. Go to your Databricks landing page and do one of the following: |
83 | | - |
84 | | -2. In the sidebar, click Jobs Icon Workflows and click Create Job Button. |
85 | | - |
86 | | -3. In the sidebar, click New Icon New and select Job from the menu. |
87 | | - |
88 | | -4. In the task dialog box that appears on the Tasks tab, replace Add a name for your job… with your job name, for example, Python wheel example. |
89 | | - |
90 | | -5. In Task name, enter a name for the task, for example, ```dlt_meta_onboarding_pythonwheel_task```. |
91 | | - |
92 | | -6. In Type, select Python wheel. |
93 | | - |
94 | | -7. In Package name, enter ```dlt_meta```. |
95 | | - |
96 | | -8. In Entry point, enter ```run```. |
97 | | - |
98 | | -9. Click Add under Dependent Libraries. In the Add dependent library dialog, under Library Type, click PyPI. Enter Package: ```dlt-meta``` |
99 | | - |
100 | | - |
101 | | -10. Click Add. |
102 | | - |
103 | | -11. In Parameters, select JSON and paste the JSON parameters below: |
104 | | - ``` |
105 | | - { |
106 | | - "database": "dlt_demo", |
107 | | - "onboarding_file_path": "dbfs:/onboarding_files/users_onboarding.json", |
108 | | - "silver_dataflowspec_table": "silver_dataflowspec_table", |
109 | | - "silver_dataflowspec_path": "dbfs:/onboarding_tables_cdc/silver", |
110 | | - "bronze_dataflowspec_table": "bronze_dataflowspec_table", |
111 | | - "import_author": "Ravi", |
112 | | - "version": "v1", |
113 | | - "bronze_dataflowspec_path": "dbfs:/onboarding_tables_cdc/bronze", |
114 | | - "overwrite": "True", |
115 | | - "env": "dev" |
116 | | - } |
117 | | - ``` |
118 | | - Alternatively, you can enter keyword arguments: click + Add and enter a key and value. Click + Add again to enter more arguments. |
119 | | -
|
120 | | -12. Click Save task. |
121 | | -
|
122 | | -13. Click Run now. |
123 | | -
|
124 | | -14. Make sure the job runs successfully. Verify the metadata in the dataflow spec tables entered in step 11, e.g. ```dlt_demo.bronze_dataflowspec_table```, ```dlt_demo.silver_dataflowspec_table``` |
125 | | -
|
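The onboarding parameters above are a plain JSON document, so they can be generated and sanity-checked programmatically before pasting them into the job's JSON parameter box. A minimal sketch (the key names come from the example parameters above; the helper function and its "required keys" check are illustrative assumptions, not part of dlt-meta):

```python
import json

# Keys taken from the example onboarding-job parameters above.
# Treating these as "required" is an assumption for illustration only.
REQUIRED_KEYS = {
    "database",
    "onboarding_file_path",
    "bronze_dataflowspec_table",
    "silver_dataflowspec_table",
    "env",
}

def build_onboarding_params(**overrides):
    """Return the onboarding job parameters as a dict, applying any overrides."""
    params = {
        "database": "dlt_demo",
        "onboarding_file_path": "dbfs:/onboarding_files/users_onboarding.json",
        "silver_dataflowspec_table": "silver_dataflowspec_table",
        "silver_dataflowspec_path": "dbfs:/onboarding_tables_cdc/silver",
        "bronze_dataflowspec_table": "bronze_dataflowspec_table",
        "bronze_dataflowspec_path": "dbfs:/onboarding_tables_cdc/bronze",
        "import_author": "Ravi",
        "version": "v1",
        "overwrite": "True",
        "env": "dev",
    }
    params.update(overrides)
    missing = REQUIRED_KEYS - params.keys()
    if missing:
        raise ValueError(f"missing onboarding parameters: {sorted(missing)}")
    return params

if __name__ == "__main__":
    # Serialize exactly as you would paste it into the job parameter dialog.
    print(json.dumps(build_onboarding_params(env="prod"), indent=2))
```

This keeps the parameter block under version control alongside your onboarding files instead of living only in the job UI.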
126 | | -## 3. Launch Dataflow DLT Pipeline |
127 | | -### Create a dlt launch notebook |
128 | | -
|
129 | | -1. Go to your Databricks landing page and select Create a notebook, or click New Icon New in the sidebar and select Notebook. The Create Notebook dialog appears. |
130 | | -
|
131 | | -2. In the Create Notebook dialog, give your notebook a name, e.g. ```dlt_meta_pipeline```, and select Python from the Default Language dropdown menu. You can leave Cluster set to the default value. The Delta Live Tables runtime creates a cluster before it runs your pipeline. |
132 | | -
|
133 | | -3. Click Create. |
134 | | -
|
135 | | -4. You can add the [example dlt pipeline](https://github.com/databrickslabs/dlt-meta/blob/main/examples/dlt_meta_pipeline.ipynb) code or import the iPython notebook as-is. |
136 | | -
|
137 | | -### Create a DLT pipeline |
138 | | -
|
139 | | -1. Click Jobs Icon Workflows in the sidebar, click the Delta Live Tables tab, and click Create Pipeline. |
140 | | -
|
141 | | -2. Give the pipeline a name, e.g. DLT_META_BRONZE, and click the File Picker Icon to select the notebook ```dlt_meta_pipeline``` created in the step ```Create a dlt launch notebook```. |
142 | | -
|
143 | | -3. Optionally enter a storage location for output data from the pipeline. The system uses a default location if you leave Storage location empty. |
144 | | -
|
145 | | -4. Select Triggered for Pipeline Mode. |
146 | | -
|
147 | | -5. Enter Configuration parameters e.g. |
148 | | - ``` |
149 | | - "layer": "bronze", |
150 | | - "bronze.dataflowspecTable": "dataflowspec table name", |
151 | | - "bronze.group": "enter group name from metadata e.g. G1", |
152 | | - ``` |
153 | | -
|
154 | | -6. Enter the target schema where you want your bronze/silver tables to be created. |
155 | | -
|
156 | | -7. Click Create. |
157 | | -
|
158 | | -8. Start the pipeline: click the Start button in the top panel. The system returns a message confirming that your pipeline is starting. |
159 | | -
|
160 | | -
|
161 | | -
|
162 | | -# Additional |
163 | | -You can run integration tests for dlt-meta from your local machine. |
164 | | -## Run Integration Tests |
165 | | -1. Clone [DLT-META](https://github.com/databrickslabs/dlt-meta) |
166 | | -
|
167 | | -2. Open a terminal and go to the root folder ```DLT-META``` |
168 | | -
|
169 | | -3. Set the environment variables: |
170 | | -
|
171 | | -``` |
172 | | -export DATABRICKS_HOST=<DATABRICKS HOST> |
173 | | -export DATABRICKS_TOKEN=<DATABRICKS TOKEN> # Account needs permission to create clusters/dlt pipelines. |
174 | | -``` |
175 | | -
|
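Because the integration-test script talks to a live workspace, it is worth failing fast when these variables are unset. A small sketch (the ```require_env``` helper is illustrative and not part of the repo):

```python
import os

def require_env(*names):
    """Return the values of the given environment variables, failing fast if any is unset."""
    missing = [n for n in names if not os.environ.get(n)]
    if missing:
        raise EnvironmentError(f"set these variables first: {', '.join(missing)}")
    return [os.environ[n] for n in names]

if __name__ == "__main__":
    # Same variables the integration tests expect, checked up front.
    host, token = require_env("DATABRICKS_HOST", "DATABRICKS_TOKEN")
    print(f"Running integration tests against {host}")
```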
176 | | -4. Run integration tests for the supported input sources: cloudfiles, eventhub, kafka |
177 | | -
|
178 | | - 4a. Run the command for cloudfiles ```python integration-tests/run-integration-test.py --cloud_provider_name=aws --dbr_version=11.3.x-scala2.12 --source=cloudfiles --dbfs_path=dbfs:/tmp/DLT-META/``` |
179 | | -
|
180 | | - 4b. Run the command for eventhub ```python integration-tests/run-integration-test.py --cloud_provider_name=azure --dbr_version=11.3.x-scala2.12 --source=eventhub --dbfs_path=dbfs:/tmp/DLT-META/ --eventhub_name=iot --eventhub_secrets_scope_name=eventhubs_creds --eventhub_namespace=int_test-standard --eventhub_port=9093 --eventhub_producer_accesskey_name=producer --eventhub_consumer_accesskey_name=consumer``` |
181 | | -
|
182 | | - For eventhub integration tests, the prerequisites are: |
183 | | - 1. An eventhub instance must be running |
184 | | - 2. Using the Databricks CLI, create a Databricks secrets scope for the eventhub keys |
185 | | - 3. Using the Databricks CLI, create Databricks secrets to store the producer and consumer keys, using the scope created in step 2 |
186 | | -
|
187 | | - The following arguments are mandatory for running the EventHubs integration test: |
188 | | - 1. Provide your eventhub topic name: ```--eventhub_name``` |
189 | | - 2. Provide eventhub namespace using ```--eventhub_namespace``` |
190 | | - 3. Provide eventhub port using ```--eventhub_port``` |
191 | | - 4. Provide databricks secret scope name using ```--eventhub_secrets_scope_name``` |
192 | | - 5. Provide eventhub producer access key name using ```--eventhub_producer_accesskey_name``` |
193 | | - 6. Provide eventhub consumer access key name using ```--eventhub_consumer_accesskey_name``` |
194 | | -
|
195 | | -
|
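The mandatory EventHubs flags above can be parsed with standard argument handling; the actual parser in ```run-integration-test.py``` may differ, so treat this as a sketch of the flag names only:

```python
import argparse

def parse_eventhub_args(argv):
    """Parse the mandatory EventHubs flags listed above (illustrative sketch)."""
    parser = argparse.ArgumentParser(description="EventHubs integration test (sketch)")
    for flag in (
        "--eventhub_name",
        "--eventhub_namespace",
        "--eventhub_port",
        "--eventhub_secrets_scope_name",
        "--eventhub_producer_accesskey_name",
        "--eventhub_consumer_accesskey_name",
    ):
        # Each flag is mandatory per the list above.
        parser.add_argument(flag, required=True)
    return parser.parse_args(argv)
```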
196 | | - 4c. Run the command for kafka ```python3 integration-tests/run-integration-test.py --cloud_provider_name=aws --dbr_version=11.3.x-scala2.12 --source=kafka --dbfs_path=dbfs:/tmp/DLT-META/ --kafka_topic_name=dlt-meta-integration-test --kafka_broker=host:9092``` |
197 | | -
|
198 | | - For kafka integration tests, the prerequisites are: |
199 | | - 1. A kafka instance must be running |
200 | | -
|
201 | | - The following arguments are mandatory for running the kafka integration test: |
202 | | - 1. Provide your kafka topic name: ```--kafka_topic_name``` |
203 | | - 2. Provide kafka broker using ```--kafka_broker``` |
204 | | -
|
205 | | -
|
206 | | -
|
207 | | - Once finished, the integration test output file will be copied locally to ```integration-test-output_<run_id>.csv``` |
208 | | -
|
209 | | -5. The output file of a successful run should contain the following: |
210 | | -
|
211 | | - ``` |
212 | | - ,0 |
213 | | - 0,Completed Bronze DLT Pipeline. |
214 | | - 1,Completed Silver DLT Pipeline. |
215 | | - 2,Validating DLT Bronze and Silver Table Counts... |
216 | | - 3,Validating Counts for Table bronze_7b866603ab184c70a66805ac8043a03d.transactions_cdc. |
217 | | - 4,Expected: 10002 Actual: 10002. Passed! |
218 | | - 5,Validating Counts for Table bronze_7b866603ab184c70a66805ac8043a03d.transactions_cdc_quarantine. |
219 | | - 6,Expected: 9842 Actual: 9842. Passed! |
220 | | - 7,Validating Counts for Table bronze_7b866603ab184c70a66805ac8043a03d.customers_cdc. |
221 | | - 8,Expected: 98928 Actual: 98928. Passed! |
222 | | - 9,Validating Counts for Table silver_7b866603ab184c70a66805ac8043a03d.transactions. |
223 | | - 10,Expected: 8759 Actual: 8759. Passed! |
224 | | - 11,Validating Counts for Table silver_7b866603ab184c70a66805ac8043a03d.customers. |
225 | | - 12,Expected: 87256 Actual: 87256. Passed! |
226 | | - ``` |
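If you want to gate CI on the copied ```integration-test-output_<run_id>.csv```, the two-column format above is easy to scan. A sketch (the parsing assumptions are inferred only from the sample output shown, not from the test harness itself):

```python
import csv
import io

def summarize_run(csv_text):
    """Count 'Passed!' validations and detect failures in the output CSV shown above."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    # Skip the ',0' header row; the message is the second column.
    messages = [row[1] for row in rows[1:] if len(row) > 1]
    passed = sum("Passed!" in m for m in messages)
    failed = sum("Failed" in m for m in messages)
    return passed, failed
```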
| 72 | +## More questions |
| 73 | +Refer to the [FAQ](https://databrickslabs.github.io/dlt-meta/faq) |
| 74 | +and DLT-META [documentation](https://databrickslabs.github.io/dlt-meta/) |
227 | 75 |
|
228 | 76 | # Project Support |
229 | 77 | Please note that all projects released under [`Databricks Labs`](https://www.databricks.com/learn/labs) |
|