
Commit 1bc01ba

Merge pull request #80 from databrickslabs/feature/v0.0.8
Merging feature/v0.0.8 release
2 parents 3555aaa + a15c517 commit 1bc01ba

File tree: 123 files changed (+120551 / -26981 lines)


.github/workflows/release.yml

Lines changed: 9 additions & 11 deletions
@@ -7,20 +7,21 @@ on:
 
 jobs:
   release:
-    runs-on: ${{ matrix.os }}
-    strategy:
-      max-parallel: 1
-      matrix:
-        python-version: [ 3.9 ]
-        os: [ ubuntu-latest ]
+    runs-on: ubuntu-latest
+    environment: release
+    permissions:
+      # Used to authenticate to PyPI via OIDC and sign the release's artifacts with sigstore-python.
+      id-token: write
+      # Used to attach signing artifacts to the published release.
+      contents: write
 
     steps:
       - uses: actions/checkout@v1
 
-      - name: Set up Python ${{ matrix.python-version }}
+      - name: Set up Python
         uses: actions/setup-python@v4
         with:
-          python-version: ${{ matrix.python-version }}
+          python-version: 3.9
           cache: 'pip' # caching pip dependencies
           cache-dependency-path: setup.py
 
@@ -35,6 +36,3 @@ jobs:
 
       - name: Publish a Python distribution to PyPI
         uses: pypa/gh-action-pypi-publish@release/v1
-        with:
-          user: __token__
-          password: ${{ secrets.LABS_PYPI_TOKEN }}

.gitignore

Lines changed: 0 additions & 3 deletions
@@ -151,9 +151,6 @@ deployment-merged.yaml
 .idea/
 .vscode/
 
-# ignore integration test onboarding file.
-integration-tests/conf/dlt-meta/onboarding.json
-
 .databricks
 .databricks-login.json
 demo/conf/onboarding.json

CHANGELOG.md

Lines changed: 14 additions & 0 deletions
@@ -1,4 +1,18 @@
 # Changelog
+## [v.0.0.8]
+- Added dlt append_flow api support: [PR](https://github.com/databrickslabs/dlt-meta/pull/58)
+- Added dlt append_flow api support for silver layer: [PR](https://github.com/databrickslabs/dlt-meta/pull/63)
+- Added support for file metadata columns for autoloader: [PR](https://github.com/databrickslabs/dlt-meta/pull/56)
+- Added support for Bring your own custom transformation: [Issue](https://github.com/databrickslabs/dlt-meta/issues/68)
+- Added support to unify PyPI releases with GitHub OIDC: [PR](https://github.com/databrickslabs/dlt-meta/pull/62)
+- Added demo for append_flow and file_metadata options: [PR](https://github.com/databrickslabs/dlt-meta/issues/74)
+- Added demo for silver fanout architecture: [PR](https://github.com/databrickslabs/dlt-meta/pull/83)
+- Added documentation in docs site for new features: [PR](https://github.com/databrickslabs/dlt-meta/pull/64)
+- Added unit tests to showcase silver layer fanout examples: [PR](https://github.com/databrickslabs/dlt-meta/pull/67)
+- Fixed issue "No such file or directory: '/demo'": [PR](https://github.com/databrickslabs/dlt-meta/issues/59)
+- Fixed DLT-META CLI onboard command issue for Azure (databricks.sdk.errors.platform.ResourceAlreadyExists): [PR](https://github.com/databrickslabs/dlt-meta/issues/51)
+- Fixed issue by changing dbfs.create to mkdirs for CLI: [PR](https://github.com/databrickslabs/dlt-meta/pull/53)
+- Fixed DLT-META CLI to use the pypi lib instead of whl: [PR](https://github.com/databrickslabs/dlt-meta/pull/79)
 
 ## [v.0.0.7]
 - Added dlt-meta cli documentation and readme with browser support: [PR](https://github.com/databrickslabs/dlt-meta/pull/45)

README.md

Lines changed: 10 additions & 4 deletions
@@ -40,10 +40,9 @@
 ---
 
 # Project Overview
+`DLT-META` is a metadata-driven framework designed to work with [Delta Live Tables](https://www.databricks.com/product/delta-live-tables). This framework enables the automation of bronze and silver data pipelines by leveraging metadata recorded in an onboarding JSON file. This file, known as the Dataflowspec, serves as the data flow specification, detailing the source and target metadata required for the pipelines.
 
-`DLT-META` is a metadata-driven framework based on Databricks [Delta Live Tables](https://www.databricks.com/product/delta-live-tables) (aka DLT) which lets you automate your bronze and silver data pipelines.
-
-With this framework you need to record the source and target metadata in an onboarding json file which acts as the data flow specification aka Dataflowspec. A single generic `DLT` pipeline takes the `Dataflowspec` and runs your workloads.
+In practice, a single generic DLT pipeline reads the Dataflowspec and uses it to orchestrate and run the necessary data processing workloads. This approach streamlines the development and management of data pipelines, allowing for a more efficient and scalable data processing workflow.
 
 ### Components:

@@ -128,11 +127,18 @@ If you want to run existing demo files please follow these steps before running
 ```
 
 ```commandline
-databricks labs dlt-meta onboard
+dlt_meta_home=$(pwd)
 ```
 
+```commandline
+export PYTHONPATH=$dlt_meta_home
+```
+```commandline
+databricks labs dlt-meta onboard
+```
 ![onboardingDLTMeta.gif](docs/static/images/onboardingDLTMeta.gif)
 
+
 Above commands will prompt you to provide onboarding details. If you have cloned the dlt-meta git repo then accept defaults, which will launch the config from the demo folder.
 ![onboardingDLTMeta_2.gif](docs/static/images/onboardingDLTMeta_2.gif)
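For orientation, an onboarding entry of the kind the Dataflowspec above refers to might look roughly like the sketch below. The field names are illustrative assumptions, not the framework's exact schema; the working examples live under demo/conf/onboarding.json in the repo.

```json
[
  {
    "data_flow_id": "100",
    "data_flow_group": "A1",
    "source_format": "cloudFiles",
    "source_details": {
      "source_path": "dbfs:/landing/customers"
    },
    "bronze_table": "bronze_customers",
    "silver_table": "silver_customers",
    "silver_transformation_json": "dbfs:/conf/silver_transformations.json"
  }
]
```

A single generic DLT pipeline iterates over entries like this to create and run the bronze and silver flows.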

demo/README.md

Lines changed: 185 additions & 12 deletions
@@ -1,30 +1,42 @@
 # [DLT-META](https://github.com/databrickslabs/dlt-meta) DEMO's
 1. [DAIS 2023 DEMO](#dais-2023-demo): Showcases DLT-META's capabilities of creating Bronze and Silver DLT pipelines with initial and incremental mode automatically.
 2. [Databricks Techsummit Demo](#databricks-tech-summit-fy2024-demo): 100s of data sources ingestion in bronze and silver DLT pipelines automatically.
+3. [Append FLOW Autoloader Demo](#append-flow-autoloader-file-metadata-demo): Write to same target from multiple sources using [dlt.append_flow](https://docs.databricks.com/en/delta-live-tables/flows.html#append-flows) and adding [File metadata column](https://docs.databricks.com/en/ingestion/file-metadata-column.html)
+4. [Append FLOW Eventhub Demo](#append-flow-eventhub-demo): Write to same target from multiple sources using [dlt.append_flow](https://docs.databricks.com/en/delta-live-tables/flows.html#append-flows) and adding [File metadata column](https://docs.databricks.com/en/ingestion/file-metadata-column.html)
+5. [Silver Fanout Demo](#silver-fanout-demo): This demo showcases the implementation of fanout architecture in the silver layer.
+
 
 
 # DAIS 2023 DEMO
 ## [DAIS 2023 Session Recording](https://www.youtube.com/watch?v=WYv5haxLlfA)
-This Demo launches Bronze and Silver DLT pipleines with following activities:
+This Demo launches Bronze and Silver DLT pipelines with the following activities:
 - Customer and Transactions feeds for initial load
 - Adds new feeds Product and Stores to existing Bronze and Silver DLT pipelines with metadata changes.
 - Runs Bronze and Silver DLT for incremental load for CDC events
 
 ### Steps:
-1. Launch Terminal/Command promt
+1. Launch Terminal/Command prompt
 
 2. Install [Databricks CLI](https://docs.databricks.com/dev-tools/cli/index.html)
 
-3. ```git clone https://github.com/databrickslabs/dlt-meta.git ```
+3. ```commandline
+   git clone https://github.com/databrickslabs/dlt-meta.git
+   ```
 
-4. ```cd dlt-meta```
+4. ```commandline
+   cd dlt-meta
+   ```
 
 5. Set python environment variable into terminal
+   ```commandline
+   dlt_meta_home=$(pwd)
    ```
-   export PYTHONPATH=<<local dlt-meta path>>
+
+   ```commandline
+   export PYTHONPATH=$dlt_meta_home
    ```
 
-6. Run the command ```python demo/launch_dais_demo.py --source=cloudfiles --uc_catalog_name=<<uc catalog name>> --cloud_provider_name=aws --dbr_version=13.3.x-scala2.12 --dbfs_path=dbfs:/dais-dlt-meta-demo-automated_new```
+6. Run the command ```python demo/launch_dais_demo.py --source=cloudfiles --uc_catalog_name=<<uc catalog name>> --cloud_provider_name=aws --dbr_version=15.3.x-scala2.12 --dbfs_path=dbfs:/dais-dlt-meta-demo-automated```
 - cloud_provider_name : aws or azure or gcp
 - db_version : Databricks Runtime Version
 - dbfs_path : Path on your Databricks workspace where demo will be copied for launching DLT-META Pipelines
@@ -53,17 +65,28 @@ This demo will launch auto generated tables(100s) inside single bronze and silve
 
 2. Install [Databricks CLI](https://docs.databricks.com/dev-tools/cli/index.html)
 
-3. ```git clone https://github.com/databrickslabs/dlt-meta.git ```
+3. ```commandline
+   git clone https://github.com/databrickslabs/dlt-meta.git
+   ```
 
-4. ```cd dlt-meta```
+4. ```commandline
+   cd dlt-meta
+   ```
 
 5. Set python environment variable into terminal
+   ```commandline
+   dlt_meta_home=$(pwd)
    ```
-   export PYTHONPATH=<<local dlt-meta path>>
+
+   ```commandline
+   export PYTHONPATH=$dlt_meta_home
    ```
 
-6. Run the command ```python demo/launch_techsummit_demo.py --source=cloudfiles --cloud_provider_name=aws --dbr_version=13.3.x-scala2.12 --dbfs_path=dbfs:/techsummit-dlt-meta-demo-automated ```
-   - cloud_provider_name : aws or azure or gcp
+6. Run the command
+   ```commandline
+   python demo/launch_techsummit_demo.py --source=cloudfiles --cloud_provider_name=aws --dbr_version=15.3.x-scala2.12 --dbfs_path=dbfs:/techsummit-dlt-meta-demo-automated
+   ```
+   - cloud_provider_name : aws or azure
 - db_version : Databricks Runtime Version
 - dbfs_path : Path on your Databricks workspace where demo will be copied for launching DLT-META Pipelines
 - you can provide `--profile=databricks_profile name` in case you already have databricks cli otherwise command prompt will ask host and token
@@ -82,4 +105,154 @@ This demo will launch auto generated tables(100s) inside single bronze and silve
 - Copy the displayed token
-- Paste to command prompt
+- Paste to command prompt
+
+
+# Append Flow Autoloader file metadata demo:
+This demo will perform the following tasks:
+- Read from different source paths using autoloader and write to same target using append_flow API
+- Read from different delta tables and write to same silver table using append_flow API
+- Add file_name and file_path to target bronze table for autoloader source using [File metadata column](https://docs.databricks.com/en/ingestion/file-metadata-column.html)
+
+1. Launch Terminal/Command prompt
+
+2. Install [Databricks CLI](https://docs.databricks.com/dev-tools/cli/index.html)
+
+3. ```commandline
+   git clone https://github.com/databrickslabs/dlt-meta.git
+   ```
+
+4. ```commandline
+   cd dlt-meta
+   ```
+
+5. Set python environment variable into terminal
+   ```commandline
+   dlt_meta_home=$(pwd)
+   ```
+
+   ```commandline
+   export PYTHONPATH=$dlt_meta_home
+   ```
+
+6. ```commandline
+   python demo/launch_af_cloudfiles_demo.py --cloud_provider_name=aws --dbr_version=15.3.x-scala2.12 --dbfs_path=dbfs:/tmp/DLT-META/demo/ --uc_catalog_name=ravi_dlt_meta_uc
+   ```
+
+- cloud_provider_name : aws or azure or gcp
+- db_version : Databricks Runtime Version
+- dbfs_path : Path on your Databricks workspace where demo will be copied for launching DLT-META Pipelines
+- uc_catalog_name: Unity catalog name
+- you can provide `--profile=databricks_profile name` in case you already have databricks cli otherwise command prompt will ask host and token
+
+![af_am_demo.png](docs/static/images/af_am_demo.png)
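For readers new to the pattern this demo exercises, here is a minimal, hypothetical sketch (in Delta Live Tables Python, not the code DLT-META generates) of two Auto Loader sources appending to one bronze target while capturing file metadata columns; the paths and table name are placeholders:

```python
import dlt
from pyspark.sql.functions import col

# Hypothetical bronze target; DLT-META derives the real one from the onboarding Dataflowspec.
dlt.create_streaming_table("bronze_customers")

def autoloader_source(path):
    # `spark` is provided by the DLT runtime; each row also records the source file's metadata.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load(path)
        .withColumn("source_file_path", col("_metadata.file_path"))
        .withColumn("source_file_name", col("_metadata.file_name"))
    )

@dlt.append_flow(target="bronze_customers")
def customers_feed_1():
    return autoloader_source("/landing/feed_1/customers")

@dlt.append_flow(target="bronze_customers")
def customers_feed_2():
    return autoloader_source("/landing/feed_2/customers")
```

Each decorated function becomes an independent append flow, so new sources can be added without rewriting the target table definition.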
+# Append Flow Eventhub demo:
+- Read from different eventhub topics and write to same target tables using append_flow API
+
+### Steps:
+1. Launch Terminal/Command prompt
+
+2. Install [Databricks CLI](https://docs.databricks.com/dev-tools/cli/index.html)
+
+3. ```commandline
+   git clone https://github.com/databrickslabs/dlt-meta.git
+   ```
+
+4. ```commandline
+   cd dlt-meta
+   ```
+5. Set python environment variable into terminal
+   ```commandline
+   dlt_meta_home=$(pwd)
+   ```
+   ```commandline
+   export PYTHONPATH=$dlt_meta_home
+   ```
+6. Eventhub
+   - Needs eventhub instance running
+   - Need two eventhub topics: first for main feed (eventhub_name) and second for append flow feed (eventhub_name_append_flow)
+   - Create databricks secrets scope for eventhub keys
+     ```commandline
+     databricks secrets create-scope eventhubs_dltmeta_creds
+     ```
+     ```commandline
+     databricks secrets put-secret --json '{
+         "scope": "eventhubs_dltmeta_creds",
+         "key": "RootManageSharedAccessKey",
+         "string_value": "<<value>>"
+     }'
+     ```
+   - Create databricks secrets to store producer and consumer keys using the scope created in step 2
+
+   - Following are the mandatory arguments for running EventHubs demo
+     - cloud_provider_name: Cloud provider name e.g. aws or azure
+     - dbr_version: Databricks Runtime Version e.g. 15.3.x-scala2.12
+     - uc_catalog_name: Unity catalog name e.g. ravi_dlt_meta_uc
+     - dbfs_path: Path on your Databricks workspace where demo will be copied for launching DLT-META Pipelines e.g. dbfs:/tmp/DLT-META/demo/
+     - eventhub_namespace: Eventhub namespace e.g. dltmeta
+     - eventhub_name: Primary Eventhub name e.g. dltmeta_demo
+     - eventhub_name_append_flow: Secondary eventhub name for appendflow feed e.g. dltmeta_demo_af
+     - eventhub_producer_accesskey_name: Producer databricks access keyname e.g. RootManageSharedAccessKey
+     - eventhub_consumer_accesskey_name: Consumer databricks access keyname e.g. RootManageSharedAccessKey
+     - eventhub_secrets_scope_name: Databricks secret scope name e.g. eventhubs_dltmeta_creds
+     - eventhub_port: Eventhub port
+
+7. ```commandline
+   python3 demo/launch_af_eventhub_demo.py --cloud_provider_name=aws --dbr_version=15.3.x-scala2.12 --dbfs_path=dbfs:/tmp/DLT-META/demo/ --uc_catalog_name=ravi_dlt_meta_uc --eventhub_name=dltmeta_demo --eventhub_name_append_flow=dltmeta_demo_af --eventhub_secrets_scope_name=dltmeta_eventhub_creds --eventhub_namespace=dltmeta --eventhub_port=9093 --eventhub_producer_accesskey_name=RootManageSharedAccessKey --eventhub_consumer_accesskey_name=RootManageSharedAccessKey --eventhub_accesskey_secret_name=RootManageSharedAccessKey
+   ```
+
+![af_eh_demo.png](docs/static/images/af_eh_demo.png)
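As background for step 6, Event Hubs topics are typically consumed from DLT over the Kafka protocol (hence port 9093). A rough, hypothetical sketch of one such source feeding an append_flow target follows; the namespace, topic, secret names, and the use of a full connection string as the SASL password are assumptions to adapt to your setup:

```python
import dlt

# Hypothetical connection settings; real values come from the demo arguments and the secret scope above.
EH_NAMESPACE = "dltmeta"
EH_TOPIC = "dltmeta_demo"
SECRET_SCOPE = "eventhubs_dltmeta_creds"
# Assumed to hold the Event Hubs connection string for the shared access key.
connection_string = dbutils.secrets.get(SECRET_SCOPE, "RootManageSharedAccessKey")

kafka_options = {
    "kafka.bootstrap.servers": f"{EH_NAMESPACE}.servicebus.windows.net:9093",
    "kafka.security.protocol": "SASL_SSL",
    "kafka.sasl.mechanism": "PLAIN",
    "kafka.sasl.jaas.config": (
        "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required "
        f'username="$ConnectionString" password="{connection_string}";'
    ),
    "subscribe": EH_TOPIC,
}

dlt.create_streaming_table("bronze_eventhub_feed")

@dlt.append_flow(target="bronze_eventhub_feed")
def eventhub_main_feed():
    # `spark` and `dbutils` are provided by the Databricks runtime.
    return spark.readStream.format("kafka").options(**kafka_options).load()
```

A second append_flow reading the eventhub_name_append_flow topic would write into the same target, which is the point of the demo.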
+# Silver Fanout Demo
+- This demo will showcase the onboarding process for the silver fanout pattern.
+- Run the onboarding process for the bronze cars table, which contains data from various countries.
+- Run the onboarding process for the silver tables, which have a `where_clause` based on the country condition specified in [silver_transformations_cars.json](https://github.com/databrickslabs/dlt-meta/blob/main/demo/conf/silver_transformations_cars.json).
+- Run the Bronze DLT pipeline, which will produce the cars table.
+- Run the Silver DLT pipeline, fanning out from the bronze cars table to country-specific tables such as cars_usa, cars_uk, cars_germany, and cars_japan.
+
+### Steps:
+1. Launch Terminal/Command prompt
+
+2. Install [Databricks CLI](https://docs.databricks.com/dev-tools/cli/index.html)
+
+3. ```commandline
+   git clone https://github.com/databrickslabs/dlt-meta.git
+   ```
+
+4. ```commandline
+   cd dlt-meta
+   ```
+5. Set python environment variable into terminal
+   ```commandline
+   dlt_meta_home=$(pwd)
+   ```
+   ```commandline
+   export PYTHONPATH=$dlt_meta_home
+   ```
+
+6. Run the command ```python demo/launch_silver_fanout_demo.py --source=cloudfiles --uc_catalog_name=<<uc catalog name>> --cloud_provider_name=aws --dbr_version=15.3.x-scala2.12 --dbfs_path=dbfs:/dais-dlt-meta-silver-fanout```
+   - cloud_provider_name : aws or azure
+   - db_version : Databricks Runtime Version
+   - dbfs_path : Path on your Databricks workspace where demo will be copied for launching DLT-META Pipelines
+   - you can provide `--profile=databricks_profile name` in case you already have databricks cli otherwise command prompt will ask host and token.
+
+   - 6a. Databricks Workspace URL:
+     - Enter your workspace URL, with the format https://<instance-name>.cloud.databricks.com. To get your workspace URL, see Workspace instance names, URLs, and IDs.
+
+   - 6b. Token:
+     - In your Databricks workspace, click your Databricks username in the top bar, and then select User Settings from the drop down.
+     - On the Access tokens tab, click Generate new token.
+     - (Optional) Enter a comment that helps you to identify this token in the future, and change the token's default lifetime of 90 days. To create a token with no lifetime (not recommended), leave the Lifetime (days) box empty (blank).
+     - Click Generate.
+     - Copy the displayed token
+     - Paste to command prompt
+
+![silver_fanout_workflow.png](docs/static/images/silver_fanout_workflow.png)
+
+![silver_fanout_dlt.png](docs/static/images/silver_fanout_dlt.png)
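To make the fanout concrete, a silver transformation entry of the kind silver_transformations_cars.json describes might look like the sketch below. The target table and column names are illustrative; only the `target_table`, `select_exp`, and `where_clause` fields come from the demo description and the transformation file shown later in this commit.

```json
[
  {
    "target_table": "cars_usa",
    "select_exp": ["car_id", "model", "price", "country"],
    "where_clause": ["country = 'USA'"]
  },
  {
    "target_table": "cars_uk",
    "select_exp": ["car_id", "model", "price", "country"],
    "where_clause": ["country = 'UK'"]
  }
]
```

One bronze cars table thus fans out into one silver table per country, each filtered by its `where_clause`.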
Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
+[
+    {
+        "target_table": "customers",
+        "select_exp": [
+            "address",
+            "email",
+            "firstname",
+            "id",
+            "lastname",
+            "operation_date",
+            "operation",
+            "_rescued_data"
+        ]
+    },
+    {
+        "target_table": "transactions",
+        "select_exp": [
+            "id",
+            "customer_id",
+            "amount",
+            "item_count",
+            "operation_date",
+            "operation",
+            "_rescued_data"
+        ]
+    }
+]
