DBSQL for Dimension ETL #127
Open
shyamraodb wants to merge 43 commits into databricks-demos:main from shyamraodb:main
Changes from 22 commits
Commits (43)
54d0407  Create readme (shyamraodb)
ed4caf6  Delete product_demos/DBSQL-Datawarehousing/dbsql-for-etl directory (shyamraodb)
a185326  Create readme (shyamraodb)
4d991f6  Create _util (shyamraodb)
6dbf4bb  Delete product_demos/DBSQL-Datawarehousing/dbsql-for-dim-etl/_util (shyamraodb)
18e47c1  Add files via upload (shyamraodb)
b15202a  Delete product_demos/DBSQL-Datawarehousing/dbsql-for-dim-etl/readme (shyamraodb)
a07de86  Delete product_demos/DBSQL-Datawarehousing/dbsql-for-dim-etl/02-Popul… (shyamraodb)
eb6a371  Add files via upload (shyamraodb)
2bd85c7  Update initialize-staging.py (shyamraodb)
58de132  Delete product_demos/DBSQL-Datawarehousing/dbsql-for-dim-etl/04-Utili… (shyamraodb)
abd76d5  Add files via upload (shyamraodb)
3d768a5  Add files via upload (shyamraodb)
c270770  Delete product_demos/DBSQL-Datawarehousing/dbsql-for-dim-etl/_images/… (shyamraodb)
a614192  Add files via upload (shyamraodb)
43e9b37  Delete product_demos/DBSQL-Datawarehousing/dbsql-for-dim-etl/_images/… (shyamraodb)
5d316c8  Add files via upload (shyamraodb)
ec83c94  Delete product_demos/DBSQL-Datawarehousing/dbsql-for-dim-etl directory (shyamraodb)
0cfe126  Add files via upload (shyamraodb)
6849d91  Add files via upload (shyamraodb)
8acac76  Added comment to COPY INTO (about streaming tables) (shyamraodb)
b92b62e  enable serverless sqlw (shyamraodb)
da21621  Updates following Quentin's review of PR
fe79363  Delete product_demos/DBSQL-Datawarehousing/dbsql-for-dim-etl/_images … (shyamraodb)
3136e7d  Refinements, bundle_config
72c513a  changed folder/dir name
08552b4  name change - etl_run_log
3016d3d  bundle config
994609c  Updates, Case, Removed Logging
64cb555  Some changes
048a288  ETL Log insert in main notebook
819d8a8  With sql scripting sample snippet
3e7e2fe  New Intro images (shyamraodb)
0e28324  Changes to commentary (including what next)
e6eff1c  backtick comment (shyamraodb)
e29f696  Comment changes in 00 (shyamraodb)
87370a2  Latest comments for 00; moved scripting example to separate notebook (shyamraodb)
ee76745  Changed start schema to star schema in Cmd 1 (shyamraodb)
89fa05a  environement to environment (shyamraodb)
6619da5  parameterize (shyamraodb)
2b75b48  Not creating catalog / schema. Has to pre-exist (shyamraodb)
490c3ab  task rename (shyamraodb)
b153f55  more changes to comments (shyamraodb)
50 changes: 50 additions & 0 deletions
product_demos/DBSQL-Datawarehousing/dbsql-for-dim-etl/00-Setup/Initialize.sql
-- Databricks notebook source
-- MAGIC %md-sandbox
-- MAGIC **Configure settings** <br>
-- MAGIC
-- MAGIC 1. Specify the catalog in which to create the demo schema(s) <br>
-- MAGIC
-- MAGIC 2. Specify the schema in which to create the data warehouse tables and staging volume <br>
-- MAGIC
-- MAGIC 3. Specify whether to enable Predictive Optimization for the DW schema
-- MAGIC
-- MAGIC <u>NOTE:</u>
-- MAGIC The catalog and schema can be created beforehand. If not, ensure that the user running the workflow has permissions to create catalogs and schemas.

-- COMMAND ----------

-- DBTITLE 1,dimension schema
/*
Manually update the following to use a different catalog / schema:
*/

declare or replace variable catalog_nm string = 'dbsqldemos';
declare or replace variable schema_nm string = 'clinical_star';

-- COMMAND ----------

-- enable PO at schema level? else inherit from account setting
declare or replace variable enable_po_for_schema boolean = true;

-- COMMAND ----------

-- MAGIC %md
-- MAGIC **Additional settings**

-- COMMAND ----------

declare or replace variable run_log_table string;
declare or replace variable code_table string;

-- COMMAND ----------

set variable (run_log_table, code_table) = (select catalog_nm || '.' || schema_nm || '.' || 'elt_run_log', catalog_nm || '.' || schema_nm || '.' || 'code_m');

-- COMMAND ----------

declare or replace variable volume_name string = 'staging';

-- COMMAND ----------

declare or replace variable staging_path string;
set variable staging_path = '/Volumes/' || catalog_nm || '/' || schema_nm || '/' || volume_name;
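
As a quick sanity check (not part of the notebook above), the resolved settings can be inspected by selecting the session variables directly:

-- illustrative only: confirm the fully qualified names and staging path resolve as expected
select catalog_nm, schema_nm, volume_name, staging_path, run_log_table, code_table;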
39 changes: 39 additions & 0 deletions
product_demos/DBSQL-Datawarehousing/dbsql-for-dim-etl/00-Setup/Setup.sql
-- Databricks notebook source
-- MAGIC %run "./Initialize"

-- COMMAND ----------

declare or replace variable sqlstr string; -- variable to hold any sql statement for EXECUTE IMMEDIATE

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Create Catalog and Schema(s) if required

-- COMMAND ----------

set variable sqlstr = "create catalog if not exists " || catalog_nm;
execute immediate sqlstr;

-- COMMAND ----------

set variable sqlstr = "create schema if not exists " || catalog_nm || "." || schema_nm;

-- COMMAND ----------

execute immediate sqlstr;

-- COMMAND ----------

set variable sqlstr = "alter schema " || catalog_nm || "." || schema_nm || if(enable_po_for_schema, ' enable', ' inherit') || ' predictive optimization';
execute immediate sqlstr;

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Create Volume for staging source data files

-- COMMAND ----------

set variable sqlstr = "create volume if not exists " || catalog_nm || "." || schema_nm || "." || volume_name;
execute immediate sqlstr;
38 changes: 38 additions & 0 deletions
product_demos/DBSQL-Datawarehousing/dbsql-for-dim-etl/01-Create/Code Table.sql
-- Databricks notebook source
-- MAGIC %run "../00-Setup/Initialize"

-- COMMAND ----------

-- MAGIC %md
-- MAGIC # Create Table

-- COMMAND ----------

-- MAGIC %md
-- MAGIC ## Master Data
-- MAGIC Standardized codes used for coded attributes

-- COMMAND ----------

drop table if exists identifier(code_table);

-- COMMAND ----------

-- LC (liquid clustering) options - m_code, m_type

create table identifier(code_table) (
  m_code string comment 'code',
  m_desc string comment 'name or description for the code',
  m_type string comment 'attribute type utilizing code'
)
comment 'master table for coded attributes'

-- COMMAND ----------

insert into identifier(code_table)
values
  ('M', 'Male', 'GENDER'),
  ('F', 'Female', 'GENDER'),
  ('hispanic', 'Hispanic', 'ETHNICITY'),
  ('nonhispanic', 'Not Hispanic', 'ETHNICITY')
;
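
For context, a hedged example of how a master code table like this is typically consumed when decoding coded attributes; the GENDER lookup below is an illustration only, not part of the PR:

select m_desc
from identifier(code_table)
where m_type = 'GENDER' and m_code = 'F';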
22 changes: 22 additions & 0 deletions
product_demos/DBSQL-Datawarehousing/dbsql-for-dim-etl/01-Create/ETL Log Table.sql
-- Databricks notebook source
-- MAGIC %run "../00-Setup/Initialize"

-- COMMAND ----------

-- MAGIC %md
-- MAGIC # Create Tables

-- COMMAND ----------

-- MAGIC %md
-- MAGIC ## Config/Log Table for ETL
-- MAGIC This table captures run metadata for each table load: the data source, table name, load start and end times, and the number of rows inserted and updated.

-- COMMAND ----------

drop table if exists identifier(run_log_table);

-- COMMAND ----------

create table identifier(run_log_table) (
  data_source string,
  table_name string,
  load_start_time timestamp,
  locked boolean,
  load_end_time timestamp,
  num_inserts int,
  num_updates int,
  process_id string
);
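
Per the commit history, the ETL log insert lives in the main ETL notebook, which is not shown in this subset of the diff. A minimal sketch of what a run-log entry might look like; the 'EHR' source code and the uuid()-based process id are placeholders, not values taken from the PR:

insert into identifier(run_log_table)
  (data_source, table_name, load_start_time, locked, process_id)
values
  ('EHR', 'patient_int', current_timestamp(), true, uuid());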
155 changes: 155 additions & 0 deletions
product_demos/DBSQL-Datawarehousing/dbsql-for-dim-etl/01-Create/Patient Tables.sql
-- Databricks notebook source
-- MAGIC %run "../00-Setup/Initialize"

-- COMMAND ----------

declare or replace variable br_table string; -- staging/bronze table identifier
declare or replace variable si_table string; -- integration/silver table identifier
declare or replace variable gd_table string; -- dimension table identifier

-- COMMAND ----------

declare or replace variable sqlstr string;

-- COMMAND ----------

set variable (br_table, si_table, gd_table) = (select catalog_nm || '.' || schema_nm || '.' || 'patient_stg', catalog_nm || '.' || schema_nm || '.' || 'patient_int', catalog_nm || '.' || schema_nm || '.' || 'g_patient_d');

-- COMMAND ----------

-- MAGIC %md
-- MAGIC
-- MAGIC # Create Tables
-- MAGIC Create the staging, integration, and dimension tables for patient.<br>
-- MAGIC The patient dimension is part of the clinical data warehouse (star schema).
-- MAGIC
-- MAGIC <u>NOTE:</u> By default, the tables are created in the catalog **dbsqldemos**. To change this, or to use an existing catalog / schema, see the [Configure notebook]($../00-Setup/Configure).

-- COMMAND ----------

-- MAGIC %md
-- MAGIC ## Create Staging Table
-- MAGIC The schema for the staging table is derived from the source data file(s).

-- COMMAND ----------

drop table if exists identifier(br_table);

-- COMMAND ----------

create table if not exists identifier(br_table)
comment 'Patient staging table ingesting initial and incremental master data from csv files'
;
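
-- COMMAND ----------

-- Illustrative sketch, not part of this file's diff: the schemaless staging
-- table above is typically loaded with COPY INTO from csv files landed in the
-- staging volume. The 'patient' subfolder and the csv options below are
-- assumptions for illustration only.
set variable sqlstr =
  'copy into ' || br_table ||
  ' from "' || staging_path || '/patient"' ||
  ' fileformat = csv' ||
  ' format_options ("header" = "true", "inferSchema" = "true")' ||
  ' copy_options ("mergeSchema" = "true")';
-- execute immediate sqlstr; -- left commented; the actual load happens elsewhere in the demo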
-- COMMAND ----------

-- MAGIC %md
-- MAGIC ## Create Integration Table

-- COMMAND ----------

drop table if exists identifier(si_table);

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Potential clustering columns - (data_source, patient_src_id) <br>
-- MAGIC Also, src_changed_on_dt is naturally ordered (ingestion-time clustering), and data_source is typically the same for all records in a source file.
-- MAGIC
-- MAGIC **Note:** Predictive Optimization intelligently optimizes your table data layouts for faster queries and reduced storage costs.

-- COMMAND ----------

create table if not exists identifier(si_table) (
  patient_src_id string not null comment 'ID of the record in the source',
  date_of_birth date comment 'date of birth',
  ssn string comment 'social security number',
  drivers_license string comment 'driver\'s license',
  name_prefix string comment 'name prefix',
  first_name string comment 'first name of patient',
  last_name string not null comment 'last name of patient',
  name_suffix string comment 'name suffix',
  maiden_name string comment 'maiden name',
  gender_cd string comment 'code for patient\'s gender',
  gender_nm string comment 'description of patient\'s gender',
  marital_status string comment 'marital status',
  ethnicity_cd string comment 'code for patient\'s ethnicity',
  ethnicity_nm string comment 'description of patient\'s ethnicity',
  src_changed_on_dt timestamp comment 'date of last change to record in source',
  data_source string not null comment 'code for source system',
  insert_dt timestamp comment 'date record inserted',
  update_dt timestamp comment 'date record updated',
  process_id string comment 'Process ID for run',
  constraint c_int_pk primary key (patient_src_id, data_source) RELY
)
comment 'curated integration table for patient data'
tblproperties (delta.enableChangeDataFeed = true)
;
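
-- COMMAND ----------

-- Illustrative alternative, not part of this file's diff: if explicit liquid
-- clustering were preferred over relying on Predictive Optimization, the
-- candidate columns noted above could be applied to the integration table.
set variable sqlstr = 'alter table ' || si_table || ' cluster by (data_source, patient_src_id)';
-- execute immediate sqlstr; -- left commented; the demo relies on Predictive Optimization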
-- COMMAND ----------

-- MAGIC %md
-- MAGIC ## Create Dimension

-- COMMAND ----------

drop table if exists identifier(gd_table);

-- COMMAND ----------

-- MAGIC %md
-- MAGIC Potential clustering columns - attributes used for filtering in end-user queries, for example last_name or gender_code.
-- MAGIC
-- MAGIC Additionally, for large dimensions, using the source ID (patient_src_id) as a cluster key may help with ETL performance.
-- MAGIC
-- MAGIC **Note:** <br>
-- MAGIC For the dimension table, take advantage of Predictive Optimization and Auto Clustering.
-- MAGIC
-- MAGIC Auto Clustering automatically clusters your tables based on your evolving workload.
-- MAGIC <br>
-- MAGIC Auto Clustering is enabled via the **CLUSTER BY AUTO** clause.

-- COMMAND ----------

create table if not exists identifier(gd_table) (
  patient_sk bigint generated always as identity comment 'Primary Key (ID)',
  last_name string NOT NULL comment 'Last name of the person',
  first_name string NOT NULL comment 'First name of the person',
  name_prefix string comment 'Prefix of person name',
  name_suffix string comment 'Suffix of person name',
  maiden_name string comment 'Maiden name',
  gender_code string comment 'Gender code',
  gender string comment 'Gender description',
  date_of_birth timestamp comment 'Birth date and time',
  marital_status string comment 'Marital status',
  ethnicity_code string comment 'Ethnicity code',
  ethnicity string comment 'Ethnicity description',
  ssn string comment 'Patient SSN',
  other_identifiers map <string, string> comment 'Identifier type and value (e.g. passport number, license number), excluding mrn and ssn',
  uda map <string, string> comment 'User Defined Attributes',
  patient_src_id string comment 'Unique reference to the source record',
  effective_start_date timestamp comment 'SCD2 effective start date for version',
  effective_end_date timestamp comment 'SCD2 effective end date for version',
  checksum string comment 'Checksum for the record',
  data_source string comment 'Code for source system',
  insert_dt timestamp comment 'record inserted time',
  update_dt timestamp comment 'record updated time',
  process_id string comment 'Process ID for run',
  constraint c_d_pk primary key (patient_sk) RELY
)
cluster by auto
comment 'Patient dimension'
tblproperties (
  delta.deletedFileRetentionDuration = 'interval 30 days'
)
;

-- COMMAND ----------

-- FK to integration table
set variable sqlstr = 'alter table ' || gd_table || ' add constraint c_d_int_source_fk foreign key (patient_src_id, data_source) references ' || si_table || '(patient_src_id, data_source) not enforced rely';
execute immediate sqlstr;
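
A hedged usage sketch, not part of this diff: assuming the load logic leaves effective_end_date null for the active SCD2 version of each patient, the current dimension rows could be read as follows.

select patient_sk, first_name, last_name, gender, effective_start_date
from identifier(gd_table)
where effective_end_date is null;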