
Commit 1bc01ba

Merge pull request #80 from databrickslabs/feature/v0.0.8
Merging feature/v0.0.8 release
2 parents 3555aaa + a15c517 commit 1bc01ba

File tree: 123 files changed (+120551 / -26981 lines)


.github/workflows/release.yml

Lines changed: 9 additions & 11 deletions
@@ -7,20 +7,21 @@ on:
 
 jobs:
   release:
-    runs-on: ${{ matrix.os }}
-    strategy:
-      max-parallel: 1
-      matrix:
-        python-version: [ 3.9 ]
-        os: [ ubuntu-latest ]
+    runs-on: ubuntu-latest
+    environment: release
+    permissions:
+      # Used to authenticate to PyPI via OIDC and sign the release's artifacts with sigstore-python.
+      id-token: write
+      # Used to attach signing artifacts to the published release.
+      contents: write
 
     steps:
       - uses: actions/checkout@v1
 
-      - name: Set up Python ${{ matrix.python-version }}
+      - name: Set up Python
         uses: actions/setup-python@v4
         with:
-          python-version: ${{ matrix.python-version }}
+          python-version: 3.9
           cache: 'pip' # caching pip dependencies
           cache-dependency-path: setup.py
 
@@ -35,6 +36,3 @@ jobs:
 
       - name: Publish a Python distribution to PyPI
         uses: pypa/gh-action-pypi-publish@release/v1
-        with:
-          user: __token__
-          password: ${{ secrets.LABS_PYPI_TOKEN }}

.gitignore

Lines changed: 0 additions & 3 deletions
@@ -151,9 +151,6 @@ deployment-merged.yaml
 .idea/
 .vscode/
 
-# ignore integration test onboarding file.
-integration-tests/conf/dlt-meta/onboarding.json
-
 .databricks
 .databricks-login.json
 demo/conf/onboarding.json

CHANGELOG.md

Lines changed: 14 additions & 0 deletions
@@ -1,4 +1,18 @@
 # Changelog
+## [v.0.0.8]
+- Added dlt append_flow api support: [PR](https://github.com/databrickslabs/dlt-meta/pull/58)
+- Added dlt append_flow api support for silver layer: [PR](https://github.com/databrickslabs/dlt-meta/pull/63)
+- Added support for file metadata columns for autoloader: [PR](https://github.com/databrickslabs/dlt-meta/pull/56)
+- Added support for Bring your own custom transformation: [Issue](https://github.com/databrickslabs/dlt-meta/issues/68)
+- Added support to unify PyPI releases with GitHub OIDC: [PR](https://github.com/databrickslabs/dlt-meta/pull/62)
+- Added demo for append_flow and file_metadata options: [PR](https://github.com/databrickslabs/dlt-meta/issues/74)
+- Added demo for silver fanout architecture: [PR](https://github.com/databrickslabs/dlt-meta/pull/83)
+- Added documentation in docs site for new features: [PR](https://github.com/databrickslabs/dlt-meta/pull/64)
+- Added unit tests to showcase silver layer fanout examples: [PR](https://github.com/databrickslabs/dlt-meta/pull/67)
+- Fixed issue "No such file or directory: '/demo'": [PR](https://github.com/databrickslabs/dlt-meta/issues/59)
+- Fixed DLT-META CLI onboard command issue for Azure (databricks.sdk.errors.platform.ResourceAlreadyExists): [PR](https://github.com/databrickslabs/dlt-meta/issues/51)
+- Fixed issue by changing dbfs.create to mkdirs for CLI: [PR](https://github.com/databrickslabs/dlt-meta/pull/53)
+- Fixed DLT-META CLI to use the pypi lib instead of whl: [PR](https://github.com/databrickslabs/dlt-meta/pull/79)
 
 ## [v.0.0.7]
 - Added dlt-meta cli documentation and readme with browser support: [PR](https://github.com/databrickslabs/dlt-meta/pull/45)

README.md

Lines changed: 10 additions & 4 deletions
@@ -40,10 +40,9 @@
 ---
 
 # Project Overview
+`DLT-META` is a metadata-driven framework designed to work with [Delta Live Tables](https://www.databricks.com/product/delta-live-tables). This framework enables the automation of bronze and silver data pipelines by leveraging metadata recorded in an onboarding JSON file. This file, known as the Dataflowspec, serves as the data flow specification, detailing the source and target metadata required for the pipelines.
 
-`DLT-META` is a metadata-driven framework based on Databricks [Delta Live Tables](https://www.databricks.com/product/delta-live-tables) (aka DLT) which lets you automate your bronze and silver data pipelines.
-
-With this framework you need to record the source and target metadata in an onboarding json file which acts as the data flow specification aka Dataflowspec. A single generic `DLT` pipeline takes the `Dataflowspec` and runs your workloads.
+In practice, a single generic DLT pipeline reads the Dataflowspec and uses it to orchestrate and run the necessary data processing workloads. This approach streamlines the development and management of data pipelines, allowing for a more efficient and scalable data processing workflow.
 
 ### Components:

@@ -128,11 +127,18 @@ If you want to run existing demo files please follow these steps before running
 ```
 
 ```commandline
-databricks labs dlt-meta onboard
+dlt_meta_home=$(pwd)
 ```
 
+```commandline
+export PYTHONPATH=$dlt_meta_home
+```
+```commandline
+databricks labs dlt-meta onboard
+```
 ![onboardingDLTMeta.gif](docs/static/images/onboardingDLTMeta.gif)
 
+
 Above commands will prompt you to provide onboarding details. If you have cloned the dlt-meta git repo then accept defaults, which will launch the config from the demo folder.
 ![onboardingDLTMeta_2.gif](docs/static/images/onboardingDLTMeta_2.gif)
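For orientation, an onboarding entry of the kind the Dataflowspec above refers to might look roughly like the sketch below. The field names are illustrative assumptions, not the framework's exact schema; the working examples live under demo/conf/onboarding.json in the repo.

```json
[
  {
    "data_flow_id": "100",
    "data_flow_group": "A1",
    "source_format": "cloudFiles",
    "source_details": {
      "source_path": "dbfs:/landing/customers"
    },
    "bronze_table": "bronze_customers",
    "silver_table": "silver_customers",
    "silver_transformation_json": "dbfs:/conf/silver_transformations.json"
  }
]
```

A single generic DLT pipeline iterates over entries like this to create and run the bronze and silver flows.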

demo/README.md

Lines changed: 185 additions & 12 deletions
@@ -1,30 +1,42 @@
 # [DLT-META](https://github.com/databrickslabs/dlt-meta) DEMO's
 1. [DAIS 2023 DEMO](#dais-2023-demo): Showcases DLT-META's capabilities of creating Bronze and Silver DLT pipelines with initial and incremental mode automatically.
 2. [Databricks Techsummit Demo](#databricks-tech-summit-fy2024-demo): 100s of data sources ingestion in bronze and silver DLT pipelines automatically.
+3. [Append FLOW Autoloader Demo](#append-flow-autoloader-file-metadata-demo): Write to same target from multiple sources using [dlt.append_flow](https://docs.databricks.com/en/delta-live-tables/flows.html#append-flows) and adding [File metadata column](https://docs.databricks.com/en/ingestion/file-metadata-column.html)
+4. [Append FLOW Eventhub Demo](#append-flow-eventhub-demo): Write to same target from multiple sources using [dlt.append_flow](https://docs.databricks.com/en/delta-live-tables/flows.html#append-flows) and adding [File metadata column](https://docs.databricks.com/en/ingestion/file-metadata-column.html)
+5. [Silver Fanout Demo](#silver-fanout-demo): This demo showcases the implementation of fanout architecture in the silver layer.
+
 
 
 # DAIS 2023 DEMO
 ## [DAIS 2023 Session Recording](https://www.youtube.com/watch?v=WYv5haxLlfA)
-This Demo launches Bronze and Silver DLT pipleines with following activities:
+This Demo launches Bronze and Silver DLT pipelines with the following activities:
 - Customer and Transactions feeds for initial load
 - Adds new feeds Product and Stores to existing Bronze and Silver DLT pipelines with metadata changes.
 - Runs Bronze and Silver DLT for incremental load for CDC events
 
 ### Steps:
-1. Launch Terminal/Command promt
+1. Launch Terminal/Command prompt
 
 2. Install [Databricks CLI](https://docs.databricks.com/dev-tools/cli/index.html)
 
-3. ```git clone https://github.com/databrickslabs/dlt-meta.git ```
+3. ```commandline
+   git clone https://github.com/databrickslabs/dlt-meta.git
+   ```
 
-4. ```cd dlt-meta```
+4. ```commandline
+   cd dlt-meta
+   ```
 
 5. Set python environment variable into terminal
+   ```commandline
+   dlt_meta_home=$(pwd)
    ```
-   export PYTHONPATH=<<local dlt-meta path>>
+
+   ```commandline
+   export PYTHONPATH=$dlt_meta_home
    ```
 
-6. Run the command ```python demo/launch_dais_demo.py --source=cloudfiles --uc_catalog_name=<<uc catalog name>> --cloud_provider_name=aws --dbr_version=13.3.x-scala2.12 --dbfs_path=dbfs:/dais-dlt-meta-demo-automated_new```
+6. Run the command ```python demo/launch_dais_demo.py --source=cloudfiles --uc_catalog_name=<<uc catalog name>> --cloud_provider_name=aws --dbr_version=15.3.x-scala2.12 --dbfs_path=dbfs:/dais-dlt-meta-demo-automated```
 - cloud_provider_name : aws or azure or gcp
 - db_version : Databricks Runtime Version
 - dbfs_path : Path on your Databricks workspace where demo will be copied for launching DLT-META Pipelines
@@ -53,17 +65,28 @@ This demo will launch auto generated tables(100s) inside single bronze and silve
 
 2. Install [Databricks CLI](https://docs.databricks.com/dev-tools/cli/index.html)
 
-3. ```git clone https://github.com/databrickslabs/dlt-meta.git ```
+3. ```commandline
+   git clone https://github.com/databrickslabs/dlt-meta.git
+   ```
 
-4. ```cd dlt-meta```
+4. ```commandline
+   cd dlt-meta
+   ```
 
 5. Set python environment variable into terminal
+   ```commandline
+   dlt_meta_home=$(pwd)
    ```
-   export PYTHONPATH=<<local dlt-meta path>>
+
+   ```commandline
+   export PYTHONPATH=$dlt_meta_home
    ```
 
-6. Run the command ```python demo/launch_techsummit_demo.py --source=cloudfiles --cloud_provider_name=aws --dbr_version=13.3.x-scala2.12 --dbfs_path=dbfs:/techsummit-dlt-meta-demo-automated ```
-   - cloud_provider_name : aws or azure or gcp
+6. Run the command
+   ```commandline
+   python demo/launch_techsummit_demo.py --source=cloudfiles --cloud_provider_name=aws --dbr_version=15.3.x-scala2.12 --dbfs_path=dbfs:/techsummit-dlt-meta-demo-automated
+   ```
+   - cloud_provider_name : aws or azure
 - db_version : Databricks Runtime Version
 - dbfs_path : Path on your Databricks workspace where demo will be copied for launching DLT-META Pipelines
 - you can provide `--profile=databricks_profile name` in case you already have databricks cli otherwise command prompt will ask host and token
@@ -82,4 +105,154 @@ This demo will launch auto generated tables(100s) inside single bronze and silve
 - Copy the displayed token
-- Paste to command prompt
+- Paste to command prompt
+
+
+# Append Flow Autoloader file metadata demo:
+This demo will perform the following tasks:
+- Read from different source paths using autoloader and write to same target using append_flow API
+- Read from different delta tables and write to same silver table using append_flow API
+- Add file_name and file_path to target bronze table for autoloader source using [File metadata column](https://docs.databricks.com/en/ingestion/file-metadata-column.html)
+
+1. Launch Terminal/Command prompt
+
+2. Install [Databricks CLI](https://docs.databricks.com/dev-tools/cli/index.html)
+
+3. ```commandline
+   git clone https://github.com/databrickslabs/dlt-meta.git
+   ```
+
+4. ```commandline
+   cd dlt-meta
+   ```
+
+5. Set python environment variable into terminal
+   ```commandline
+   dlt_meta_home=$(pwd)
+   ```
+
+   ```commandline
+   export PYTHONPATH=$dlt_meta_home
+   ```
+
+6. ```commandline
+   python demo/launch_af_cloudfiles_demo.py --cloud_provider_name=aws --dbr_version=15.3.x-scala2.12 --dbfs_path=dbfs:/tmp/DLT-META/demo/ --uc_catalog_name=ravi_dlt_meta_uc
+   ```
+
+- cloud_provider_name : aws or azure or gcp
+- db_version : Databricks Runtime Version
+- dbfs_path : Path on your Databricks workspace where demo will be copied for launching DLT-META Pipelines
+- uc_catalog_name: Unity catalog name
+- you can provide `--profile=databricks_profile name` in case you already have databricks cli otherwise command prompt will ask host and token
+
+![af_am_demo.png](docs/static/images/af_am_demo.png)
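For readers new to the pattern this demo exercises, here is a minimal, hypothetical sketch (in Delta Live Tables Python, not the code DLT-META generates) of two Auto Loader sources appending to one bronze target while capturing file metadata columns; the paths and table name are placeholders:

```python
import dlt
from pyspark.sql.functions import col

# Hypothetical bronze target; DLT-META derives the real one from the onboarding Dataflowspec.
dlt.create_streaming_table("bronze_customers")

def autoloader_source(path):
    # `spark` is provided by the DLT runtime; each row also records the source file's metadata.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load(path)
        .withColumn("source_file_path", col("_metadata.file_path"))
        .withColumn("source_file_name", col("_metadata.file_name"))
    )

@dlt.append_flow(target="bronze_customers")
def customers_feed_1():
    return autoloader_source("/landing/feed_1/customers")

@dlt.append_flow(target="bronze_customers")
def customers_feed_2():
    return autoloader_source("/landing/feed_2/customers")
```

Each decorated function becomes an independent append flow, so new sources can be added without rewriting the target table definition.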
+# Append Flow Eventhub demo:
+- Read from different eventhub topics and write to same target tables using append_flow API
+
+### Steps:
+1. Launch Terminal/Command prompt
+
+2. Install [Databricks CLI](https://docs.databricks.com/dev-tools/cli/index.html)
+
+3. ```commandline
+   git clone https://github.com/databrickslabs/dlt-meta.git
+   ```
+
+4. ```commandline
+   cd dlt-meta
+   ```
+5. Set python environment variable into terminal
+   ```commandline
+   dlt_meta_home=$(pwd)
+   ```
+   ```commandline
+   export PYTHONPATH=$dlt_meta_home
+   ```
+6. Eventhub
+   - Needs eventhub instance running
+   - Need two eventhub topics: first for main feed (eventhub_name) and second for append flow feed (eventhub_name_append_flow)
+   - Create databricks secrets scope for eventhub keys
+     ```commandline
+     databricks secrets create-scope eventhubs_dltmeta_creds
+     ```
+     ```commandline
+     databricks secrets put-secret --json '{
+         "scope": "eventhubs_dltmeta_creds",
+         "key": "RootManageSharedAccessKey",
+         "string_value": "<<value>>"
+     }'
+     ```
+   - Create databricks secrets to store producer and consumer keys using the scope created in step 2
+
+   - Following are the mandatory arguments for running EventHubs demo
+     - cloud_provider_name: Cloud provider name e.g. aws or azure
+     - dbr_version: Databricks Runtime Version e.g. 15.3.x-scala2.12
+     - uc_catalog_name: Unity catalog name e.g. ravi_dlt_meta_uc
+     - dbfs_path: Path on your Databricks workspace where demo will be copied for launching DLT-META Pipelines e.g. dbfs:/tmp/DLT-META/demo/
+     - eventhub_namespace: Eventhub namespace e.g. dltmeta
+     - eventhub_name: Primary Eventhub name e.g. dltmeta_demo
+     - eventhub_name_append_flow: Secondary eventhub name for appendflow feed e.g. dltmeta_demo_af
+     - eventhub_producer_accesskey_name: Producer databricks access keyname e.g. RootManageSharedAccessKey
+     - eventhub_consumer_accesskey_name: Consumer databricks access keyname e.g. RootManageSharedAccessKey
+     - eventhub_secrets_scope_name: Databricks secret scope name e.g. eventhubs_dltmeta_creds
+     - eventhub_port: Eventhub port
+
+7. ```commandline
+   python3 demo/launch_af_eventhub_demo.py --cloud_provider_name=aws --dbr_version=15.3.x-scala2.12 --dbfs_path=dbfs:/tmp/DLT-META/demo/ --uc_catalog_name=ravi_dlt_meta_uc --eventhub_name=dltmeta_demo --eventhub_name_append_flow=dltmeta_demo_af --eventhub_secrets_scope_name=dltmeta_eventhub_creds --eventhub_namespace=dltmeta --eventhub_port=9093 --eventhub_producer_accesskey_name=RootManageSharedAccessKey --eventhub_consumer_accesskey_name=RootManageSharedAccessKey --eventhub_accesskey_secret_name=RootManageSharedAccessKey
+   ```
+
+![af_eh_demo.png](docs/static/images/af_eh_demo.png)
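As background for step 6, Event Hubs topics are typically consumed from DLT over the Kafka protocol (hence port 9093). A rough, hypothetical sketch of one such source feeding an append_flow target follows; the namespace, topic, secret names, and the use of a full connection string as the SASL password are assumptions to adapt to your setup:

```python
import dlt

# Hypothetical connection settings; real values come from the demo arguments and the secret scope above.
EH_NAMESPACE = "dltmeta"
EH_TOPIC = "dltmeta_demo"
SECRET_SCOPE = "eventhubs_dltmeta_creds"
# Assumed to hold the Event Hubs connection string for the shared access key.
connection_string = dbutils.secrets.get(SECRET_SCOPE, "RootManageSharedAccessKey")

kafka_options = {
    "kafka.bootstrap.servers": f"{EH_NAMESPACE}.servicebus.windows.net:9093",
    "kafka.security.protocol": "SASL_SSL",
    "kafka.sasl.mechanism": "PLAIN",
    "kafka.sasl.jaas.config": (
        "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required "
        f'username="$ConnectionString" password="{connection_string}";'
    ),
    "subscribe": EH_TOPIC,
}

dlt.create_streaming_table("bronze_eventhub_feed")

@dlt.append_flow(target="bronze_eventhub_feed")
def eventhub_main_feed():
    # `spark` and `dbutils` are provided by the Databricks runtime.
    return spark.readStream.format("kafka").options(**kafka_options).load()
```

A second append_flow reading the eventhub_name_append_flow topic would write into the same target, which is the point of the demo.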
+# Silver Fanout Demo
+- This demo will showcase the onboarding process for the silver fanout pattern.
+- Run the onboarding process for the bronze cars table, which contains data from various countries.
+- Run the onboarding process for the silver tables, which have a `where_clause` based on the country condition specified in [silver_transformations_cars.json](https://github.com/databrickslabs/dlt-meta/blob/main/demo/conf/silver_transformations_cars.json).
+- Run the Bronze DLT pipeline, which will produce the cars table.
+- Run the Silver DLT pipeline, fanning out from the bronze cars table to country-specific tables such as cars_usa, cars_uk, cars_germany, and cars_japan.
+
+### Steps:
+1. Launch Terminal/Command prompt
+
+2. Install [Databricks CLI](https://docs.databricks.com/dev-tools/cli/index.html)
+
+3. ```commandline
+   git clone https://github.com/databrickslabs/dlt-meta.git
+   ```
+
+4. ```commandline
+   cd dlt-meta
+   ```
+5. Set python environment variable into terminal
+   ```commandline
+   dlt_meta_home=$(pwd)
+   ```
+   ```commandline
+   export PYTHONPATH=$dlt_meta_home
+   ```
+
+6. Run the command ```python demo/launch_silver_fanout_demo.py --source=cloudfiles --uc_catalog_name=<<uc catalog name>> --cloud_provider_name=aws --dbr_version=15.3.x-scala2.12 --dbfs_path=dbfs:/dais-dlt-meta-silver-fanout```
+   - cloud_provider_name : aws or azure
+   - db_version : Databricks Runtime Version
+   - dbfs_path : Path on your Databricks workspace where demo will be copied for launching DLT-META Pipelines
+   - you can provide `--profile=databricks_profile name` in case you already have databricks cli otherwise command prompt will ask host and token.
+
+   - 6a. Databricks Workspace URL:
+     - Enter your workspace URL, with the format https://<instance-name>.cloud.databricks.com. To get your workspace URL, see Workspace instance names, URLs, and IDs.
+
+   - 6b. Token:
+     - In your Databricks workspace, click your Databricks username in the top bar, and then select User Settings from the drop down.
+     - On the Access tokens tab, click Generate new token.
+     - (Optional) Enter a comment that helps you to identify this token in the future, and change the token's default lifetime of 90 days. To create a token with no lifetime (not recommended), leave the Lifetime (days) box empty (blank).
+     - Click Generate.
+     - Copy the displayed token
+     - Paste to command prompt
+
+![silver_fanout_workflow.png](docs/static/images/silver_fanout_workflow.png)
+
+![silver_fanout_dlt.png](docs/static/images/silver_fanout_dlt.png)
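To make the fanout concrete, a silver transformation entry of the kind silver_transformations_cars.json describes might look like the sketch below. The target table and column names are illustrative; only the `target_table`, `select_exp`, and `where_clause` fields come from the demo description and the transformation file shown later in this commit.

```json
[
  {
    "target_table": "cars_usa",
    "select_exp": ["car_id", "model", "price", "country"],
    "where_clause": ["country = 'USA'"]
  },
  {
    "target_table": "cars_uk",
    "select_exp": ["car_id", "model", "price", "country"],
    "where_clause": ["country = 'UK'"]
  }
]
```

One bronze cars table thus fans out into one silver table per country, each filtered by its `where_clause`.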
Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
+[
+    {
+        "target_table": "customers",
+        "select_exp": [
+            "address",
+            "email",
+            "firstname",
+            "id",
+            "lastname",
+            "operation_date",
+            "operation",
+            "_rescued_data"
+        ]
+    },
+    {
+        "target_table": "transactions",
+        "select_exp": [
+            "id",
+            "customer_id",
+            "amount",
+            "item_count",
+            "operation_date",
+            "operation",
+            "_rescued_data"
+        ]
+    }
+]
