
Commit d90d9be

Updates

committed · 1 parent 2d3cbc4 · commit d90d9be

14 files changed: +120 / -141 lines changed

Design_Patterns/Design Pattern - Data Vault - Loading Hub tables.md

Lines changed: 12 additions & 13 deletions
@@ -1,25 +1,24 @@
  # Design Pattern - Data Vault - Loading Hub tables

  ## Purpose
- This Design Pattern describes how to load data into Data Vault Hub style entities.
+ This Design Pattern describes how to load data into Data Vault Hub style tables. It is a specification of the Hub ETL process.

  ## Motivation
- Loading data into Hub tables is a relatively straightforward process with a set location in the architecture: it is applied when loading data from the Staging Layer to the Integration Layer. It is a vital component of the Data Warehouse architecture, making sure that Data Warehouse keys are distributed properly and at the right point in time.
+ Loading data into Hub tables is a relatively straightforward process with a clearly defined location in the architecture: it is applied when loading data from the Staging Layer to the Integration Layer.

- Decoupling key distribution and historical information is an essential requirement for reducing dependencies in the loading process and enabling flexible storage design in the Data Warehouse.
+ The Hub is a vital component of a Data Vault solution, making sure that Data Warehouse keys are distributed properly and at the right point in time.

- This pattern specifies how the Hub ETL process works and why it is important to follow.
-
- In a Data Vault based Enterprise Data Warehouse solution, the Hub tables (and corresponding ETL) are the only places where Data Warehouse keys are distributed.
+ Decoupling key distribution and managing historical information (changes over time) is essential to reduce loading dependencies. It also simplifies (flexible) storage design in the Data Warehouse.

  Also known as:

+ - Core Business Concept (Ensemble modelling)
  - Hub (Data Vault modelling concept)
- - Surrogate Key (SK) or Hash Key (HSH) distribution
+ - Surrogate Key (SK) or Hash Key (HSH) distribution, as commonly used implementations of the concept
  - Data Warehouse key distribution

  ## Applicability
- This pattern is applicable for the process of loading from the Staging Layer into the Integration Area Hub tables. It is used in all Hub in the Integration Layer. Derived (Business Data Vault) Hub tables follow the same pattern, but with business logic applied.
+ This pattern is applicable for the process of loading from the Staging Layer into Hub tables. It is used in all Hubs in the Integration Layer. Derived (Business Data Vault) Hub ETL processes follow the same pattern.

  ## Structure
  A Hub table contains the unique list of business keys, and the corresponding Hub ETL process can be described as an ‘insert only’ of the unique business keys that are not yet in the target Hub.
@@ -44,15 +43,15 @@ The logic to create the initial (dummy) Satellite record can both be implemented
  When modeling the Hub tables try to be conservative when defining the business keys. Not every foreign key in the source indicates a business key and therefore a Hub table. A true business key is a concept that is known and used throughout the organisation (and systems) and is ‘self-standing’ and meaningful.

- To cater for a situation where multiple OMD_INSERT_DATETIME values exist for a single business key, the minimum OMD_INSERT_DATETIME should be the value passed through with the HUB record. This can be implemented in ETL logic, or passed through to the database. When implemented at a database level, instead of using a SELECT DISTINCT, using the MIN function with a GROUP BY the business key can achieve both a distinct selection, and minimum OMD_INSERT_DATETIME in one step.
+ To cater for a situation where multiple Load Date / Time stamp values exist for a single business key, the minimum Load Date / Time stamp should be the value passed through with the Hub record. This can be implemented in ETL logic, or pushed down to the database. When implemented at database level, using the MIN function with a GROUP BY on the business key (instead of a SELECT DISTINCT) achieves both a distinct selection and the minimum Load Date / Time stamp in one step.

  ## Considerations and Consequences
  Multiple passes on the same Staging Layer data set are likely to be required: once for the Hub table(s) but also for any corresponding Link and Satellite tables.

  Defining Hub ETL processes as atomic modules, as defined in this Design Pattern, means that many Staging Layer tables load data to the same central Hub table. All processes will be very similar with the only difference being the mapping between the Staging Layer business key attribute and the target Hub business key counterpart.

  ## Related Patterns
- Design Pattern 006 – Generic – Using Start, Process and End Dates
- Design Pattern 009 – Data Vault – Loading Satellite tables
- Design Pattern 010 – Data Vault – Loading Link tables
- Design Pattern 023 – Data Vault – Missing keys and placeholders
+ * Design Pattern 006 – Generic – Using Start, Process and End Dates
+ * Design Pattern 009 – Data Vault – Loading Satellite tables
+ * Design Pattern 010 – Data Vault – Loading Link tables
+ * Design Pattern 023 – Data Vault – Missing keys and placeholders
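The pattern above describes the Hub ETL as an ‘insert only’ of business keys that are not yet in the target Hub, carrying the minimum Load Date / Time stamp per key. A minimal SQL sketch of that idea follows; the table and column names (stg_customer, hub_customer, customer_number, load_dts, record_source) are hypothetical placeholders rather than part of the pattern, and a surrogate or hash key column would be added depending on the chosen key distribution approach.

-- Sketch only: insert-only Hub load, using MIN ... GROUP BY instead of SELECT DISTINCT.
INSERT INTO hub_customer (customer_number, load_dts, record_source)
SELECT
  stg.customer_number,
  MIN(stg.load_dts)      AS load_dts,        -- earliest Load Date / Time stamp per business key
  MIN(stg.record_source) AS record_source    -- deterministic pick if multiple sources occur
FROM stg_customer stg
WHERE NOT EXISTS (                           -- only business keys not yet present in the Hub
  SELECT 1
  FROM hub_customer hub
  WHERE hub.customer_number = stg.customer_number
)
GROUP BY stg.customer_number;                -- one row per business key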

Design_Patterns/Design Pattern - Data Vault - Loading Link Satellite tables.md

Lines changed: 17 additions & 20 deletions
@@ -5,57 +5,54 @@ This Design Pattern describes how to load data into Link-Satellite tables within
  ## Motivation

- Also known as
- Link-Satellite (Data Vault modelling concept).
- History or INT tables.
+ To provide a generic approach for loading Link Satellites.

  ## Applicability

  This pattern is only applicable for loading data to Link-Satellite tables from:
- The Staging Area into the Integration Area.
- The Integration Area into the Interpretation Area.
- The only difference to the specified ETL template is any business logic required in the mappings towards the Interpretation Area tables.
+ * The Staging Area into the Integration Area.
+ * The Integration Area into the Interpretation Area.
+ * The only difference to the specified ETL template is any business logic required in the mappings towards the Interpretation Area tables.

  ## Structure
  Standard Link-Satellites use the Driving Key concept to manage the ending of ‘old’ relationships.

  ## Implementation Guidelines
  Multiple passes of the same source table or file are usually required. The first pass will insert new keys in the Hub table; the other passes are needed to populate the Satellite and Link tables.
  Select all records for the Link Satellite which have more than one open effective date / current record indicator but are not the most recent (because that record does not need to be closed):
- WITH MyCTE (<Link SK>, <Driving Key SK>, OMD_EFFECTIVE_DATE, OMD_EXPIRY_DATE, RowVersion)
+ WITH MyCTE (<Link SK>, <Driving Key SK>, <Effective Date/Time>, <Expiry Date/Time>, RowVersion)
  AS (
  SELECT
- A.<Link SK>, B.<Driving Key SK>, A.OMD_EFFECTIVE_DATE, A.OMD_EXPIRY_DATE,
- DENSE_RANK() OVER(PARTITION BY B.<Driving Key SK> ORDER BY B.<Link SK>, OMD_EFFECTIVE_DATE ASC) RowVersion
+ A.<Link SK>, B.<Driving Key SK>, A.<Effective Date/Time>, A.<Expiry Date/Time>,
+ DENSE_RANK() OVER(PARTITION BY B.<Driving Key SK> ORDER BY B.<Link SK>, <Effective Date/Time> ASC) RowVersion
  FROM <Link Sat table> A
  JOIN <Link table> B ON A.<Link SK>=B.<Link SK>
  JOIN (
  SELECT <Driving Key SK>
  FROM <Link Sat table> A
  JOIN <Link table> B ON A.<Link SK>=B.<Link SK>
- WHERE A.OMD_EXPIRY_DATE = '99991231'
+ WHERE A.<Expiry Date/Time> = '99991231'
  GROUP BY <Driving Key SK>
  HAVING COUNT(*) > 1
  ) C ON B.<Driving Key SK> = C.<Driving Key SK>
  )
  SELECT
  BASE.<Link SK>
- ,CASE WHEN LAG.OMD_EFFECTIVE_DATE IS NULL THEN '19000101' ELSE BASE.OMD_EFFECTIVE_DATE END AS OMD_EFFECTIVE_DATE
- ,CASE WHEN LEAD.OMD_EFFECTIVE_DATE IS NULL THEN '99991231' ELSE LEAD.OMD_EFFECTIVE_DATE END AS OMD_EXPIRY_DATE
- ,CASE WHEN LEAD.OMD_EFFECTIVE_DATE IS NULL THEN 'Y' ELSE 'N' END AS OMD_CURRENT_RECORD_INDICATOR
+ ,CASE WHEN LAG.<Effective Date/Time> IS NULL THEN '19000101' ELSE BASE.<Effective Date/Time> END AS <Effective Date/Time>
+ ,CASE WHEN LEAD.<Effective Date/Time> IS NULL THEN '99991231' ELSE LEAD.<Effective Date/Time> END AS <Expiry Date/Time>
+ ,CASE WHEN LEAD.<Effective Date/Time> IS NULL THEN 'Y' ELSE 'N' END AS <Current Row Indicator>
  FROM MyCTE BASE
  LEFT JOIN MyCTE LEAD ON BASE.<Driving Key SK> = LEAD.<Driving Key SK>
  AND BASE.RowVersion = LEAD.RowVersion-1
  LEFT JOIN MyCTE LAG ON BASE.<Driving Key SK> = LAG.<Driving Key SK>
  AND BASE.RowVersion = LAG.RowVersion+1
- WHERE BASE.OMD_EXPIRY_DATE = '99991231'
+ WHERE BASE.<Expiry Date/Time> = '99991231'
  ## Considerations and Consequences
  Multiple passes on source data are likely to be required.
- Known uses

  ## Related Patterns
- Design Pattern 006 – Using Start, Process and End Dates
- Design Pattern 009 – Loading Satellite tables.
- Design Pattern 010 – Loading Link tables.
- Discussion items (not yet to be implemented or used until final)
- None.
+ * Design Pattern 006 – Using Start, Process and End Dates
+ * Design Pattern 009 – Loading Satellite tables.
+ * Design Pattern 010 – Loading Link tables.
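The Driving Key query above only selects the recalculated effective and expiry dates; writing them back to the Link-Satellite is left to the ETL process. A minimal sketch of that update step is shown below, using SQL Server style UPDATE ... FROM syntax and assuming a hypothetical Link-Satellite lsat_customer_contract plus a view vw_lsat_recalculated_expiry that wraps the selection above; none of these names are part of the pattern.

-- Sketch only: close the currently open Link-Satellite rows with the recalculated expiry dates.
UPDATE lsat
SET lsat.expiry_datetime = calc.expiry_datetime,
    lsat.current_row_indicator = calc.current_row_indicator
FROM lsat_customer_contract AS lsat
JOIN vw_lsat_recalculated_expiry AS calc     -- hypothetical view wrapping the Driving Key query above
  ON calc.link_sk = lsat.link_sk
WHERE lsat.expiry_datetime = '99991231'      -- only rows that are still open
  AND calc.expiry_datetime <> '99991231';    -- and that are superseded by a more recent relationship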

Design_Patterns/Design Pattern - Data Vault - Loading Satellite tables.md

Lines changed: 3 additions & 3 deletions
@@ -32,10 +32,10 @@ Figure 2: Dependencies
  Multiple passes of the same source table or file are usually required. The first pass will insert new keys in the Hub table; the other passes are needed to populate the Satellite and Link tables.
  The process in Figure 1 shows the entire ETL in one single process. For specific tools this way of developing ETL might be relatively inefficient. Therefore, the process can also be broken up into two separate mappings; one for inserts and one for updates. Logically the same actions will be executed, but physically two separate mappings can be used. This can be done in two ways:
  Follow the same logic, with the same selects, but place filters for the update and insert branches. This leads to an extra pass on the source table, at the possible benefit of running the processes in parallel.
- Only run the insert branch and automatically update the end dates based on the existing information in the Satellite. This process selects all records in the Satellite which have more than one open OMD_EXPIRY_DATE (this is the case after running the insert branch separately), sorts the records in order and uses the OMD_EFFECTIVE_DATE from the previous record to close the next one. This introduces a dependency between the insert and update branch, but will run faster. An extra benefit is that this also closes off any previous records that were left open. As sample query for this selection is:
- SELECT satellite.DWH_ID, satellite.OMD_EXPIRY_DATE
+ Only run the insert branch and automatically update the end dates based on the existing information in the Satellite. This process selects all records in the Satellite which have more than one open EXPIRY_DATE (this is the case after running the insert branch separately), sorts the records in order and uses the EFFECTIVE_DATE from the previous record to close the next one. This introduces a dependency between the insert and update branch, but will run faster. An extra benefit is that this also closes off any previous records that were left open. A sample query for this selection is:
+ SELECT satellite.DWH_ID, satellite.<Expiry Date/Time>
  FROM satellite
- WHERE ( satellite.OMD_EXPIRY_DATE IS NULL AND
+ WHERE ( satellite.<Expiry Date/Time> IS NULL AND
  2 <= (SELECT COUNT(DWH_ID)
  FROM satellite A WHERE a.DWH_ID = satellite.DWH_ID
  AND a.FIRM_LEDTS IS NULL)
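The sample selection above appears truncated (the outer parenthesis in the WHERE clause is never closed, and it still references a FIRM_LEDTS column that does not match the renamed attributes). A self-contained sketch of the same selection is shown below, assuming dwh_id and expiry_datetime as stand-ins for the Satellite key and the Expiry Date/Time attribute.

-- Sketch only: find keys that have more than one open (not yet end-dated) Satellite row.
SELECT s.dwh_id, s.expiry_datetime
FROM satellite AS s
WHERE s.expiry_datetime IS NULL              -- row is still open
  AND 2 <= (SELECT COUNT(*)                  -- the key has at least two open rows
            FROM satellite AS a
            WHERE a.dwh_id = s.dwh_id
              AND a.expiry_datetime IS NULL);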
