Commit 93697e0

committed
WIP
1 parent 3c64d83 commit 93697e0

File tree

2 files changed: +46 −30 lines changed

Lines changed: 17 additions & 10 deletions

# Design Pattern - Data Vault - Hub table

## Purpose

This Design Pattern describes how to load data into Data Vault Hub style tables. It is a specification of the Hub ETL process.

## Motivation

Loading data into Hub tables is a relatively straightforward process with a clearly defined location in the architecture: it is applied when loading data from the Staging Layer to the Integration Layer.

The Hub is a vital component of a Data Vault solution, making sure that Data Warehouse keys are distributed properly and at the right point in time.

Decoupling key distribution and managing historical information (changes over time) is essential to reduce loading dependencies. It also simplifies (flexible) storage design in the Data Warehouse.

Also known as:

- Data Warehouse key distribution

## Applicability

This pattern is applicable for the process of loading from the Staging Layer into Hub tables. It is used in all Hubs in the Integration Layer. Derived (Business Data Vault) Hub ETL processes follow the same pattern.

## Structure

A Hub table contains the unique list of business keys, and the corresponding Hub ETL process can be described as an 'insert only' of the unique business keys that are not yet in the target Hub.

The process performs a distinct selection on the business key attribute(s) in the Staging Area table and performs a key lookup to verify whether the business keys already exist in the target Hub table. If the business key already exists the row can be discarded; if not, it can be inserted.
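
The distinct selection, key lookup and insert-only step can be sketched as a single statement. A minimal sketch using SQLite; the table and column names (`stg_customer`, `hub_customer`, `customer_bk`) are illustrative assumptions, not prescribed by the pattern:

```python
import sqlite3

# Insert-only Hub load: a distinct selection of the business key from the
# Staging Area, inserted only when the key is not yet present in the Hub.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stg_customer (customer_bk TEXT, load_dts TEXT);
    CREATE TABLE hub_customer (customer_bk TEXT PRIMARY KEY, load_dts TEXT);
    INSERT INTO stg_customer VALUES
        ('CUST-1', '2024-01-01'), ('CUST-1', '2024-01-02'), ('CUST-2', '2024-01-01');
    INSERT INTO hub_customer VALUES ('CUST-1', '2023-12-01');
""")

# Distinct selection plus key lookup: only keys absent from the Hub are inserted;
# existing keys (CUST-1) are discarded, so the Hub record is never updated.
conn.execute("""
    INSERT INTO hub_customer (customer_bk, load_dts)
    SELECT DISTINCT s.customer_bk, s.load_dts
    FROM stg_customer s
    WHERE NOT EXISTS (SELECT 1 FROM hub_customer h
                      WHERE h.customer_bk = s.customer_bk)
""")
print(sorted(r[0] for r in conn.execute("SELECT customer_bk FROM hub_customer")))
```

Because the process is insert only, re-running it against the same staging data is harmless: the key lookup filters out everything already loaded.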

During the selection the key distribution approach is implemented to make sure a dedicated Data Warehouse key is created. This can be an integer value, a hash key (i.e. MD5 or SHA1) or a natural business key.
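
As an illustration of the hash key option, the Data Warehouse key can be derived deterministically from the business key. A minimal sketch; the normalisation step (trim and upper-case) and the delimiter are assumptions, not part of the pattern, and `hashlib.sha1` can be substituted for MD5 in the same way:

```python
import hashlib

def hash_key(*business_key_parts: str, delimiter: str = "|") -> str:
    """Derive a deterministic Data Warehouse key by hashing the business key.

    Multi-part keys are joined with a delimiter so that ('A', 'BC') and
    ('AB', 'C') produce different hashes.
    """
    # Assumed sanitisation: trim and upper-case so 'cust-1 ' and 'CUST-1'
    # resolve to the same Data Warehouse key.
    normalised = delimiter.join(p.strip().upper() for p in business_key_parts)
    return hashlib.md5(normalised.encode("utf-8")).hexdigest()

# Both calls yield the same 32-character hex key.
print(hash_key("cust-1 "))
print(hash_key("CUST-1"))
```

A hash key can be computed independently by every process that needs it, which is what removes the lookup dependency an integer sequence would impose.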

## Implementation Guidelines

Loading a Hub table from a specific Staging Layer table is a single, modular, ETL process. This is a requirement for flexibility in loading information as it enables full parallel processing.

Multiple passes of the same source table or file are usually required for various tasks. The first pass will insert new keys in the Hub table; the other passes may be needed to populate the Satellite and Link tables.

The designated business key (usually the source natural key, but not always!) is the ONLY non-process or Data Warehouse related attribute in the Hub table.
@@ -41,17 +46,19 @@ By default the DISTINCT function is executed on database level to reserve resour

The logic to create the initial (dummy) Satellite record can be implemented as part of the Hub ETL process, as a separate ETL process which queries all keys that have no corresponding dummy record, or as part of the Satellite ETL process. This depends on the capabilities of the ETL software, since not all tools are able to provide and reuse sequence generators or to write to multiple targets in one process. The default and arguably most flexible way is to incorporate this concept as part of the Satellite ETL, since it does not require rework when additional Satellites are associated with the Hub. This means that each Satellite ETL must check whether a dummy record exists before starting the standard process (and be able to roll back the dummy records if required).

When modeling the Hub tables, try to be conservative when defining the business keys. Not every foreign key in the source indicates a business key and therefore a Hub table. A true business key is a concept that is known and used throughout the organisation (and systems) and is self-standing and meaningful.

To cater for a situation where multiple Load Date / Time stamp values exist for a single business key, the minimum Load Date / Time stamp should be the value passed through with the HUB record. This can be implemented in ETL logic, or passed through to the database. When implemented at a database level, instead of using a SELECT DISTINCT, using the MIN function with a GROUP BY the business key can achieve both a distinct selection, and minimum Load Date / Time stamp in one step.
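
The MIN with GROUP BY variant described above can be sketched as follows, again using SQLite with illustrative table and column names:

```python
import sqlite3

# When a business key occurs with several Load Date/Time stamps in staging,
# MIN(load_dts) with GROUP BY yields both the distinct key and its minimum
# timestamp in a single pass, replacing SELECT DISTINCT.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stg_customer (customer_bk TEXT, load_dts TEXT);
    INSERT INTO stg_customer VALUES
        ('CUST-1', '2024-01-03'), ('CUST-1', '2024-01-01'), ('CUST-2', '2024-01-02');
""")
rows = conn.execute("""
    SELECT customer_bk, MIN(load_dts) AS load_dts
    FROM stg_customer
    GROUP BY customer_bk
    ORDER BY customer_bk
""").fetchall()
print(rows)  # [('CUST-1', '2024-01-01'), ('CUST-2', '2024-01-02')]
```

Note that CUST-1 appears once, carrying its earliest Load Date / Time stamp, which is exactly the value the Hub record should receive.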

## Considerations and Consequences

Multiple passes on the same Staging Layer data set are likely to be required: once for the Hub table(s) but also for any corresponding Link and Satellite tables.

Defining Hub ETL processes as atomic modules, as defined in this Design Pattern, means that many Staging Layer tables load data to the same central Hub table. All processes will be very similar with the only difference being the mapping between the Staging Layer business key attribute and the target Hub business key counterpart.

## Related Patterns

- Design Pattern 006 - Generic - Using Start, Process and End Dates
- Design Pattern 009 - Data Vault - Loading Satellite tables
- Design Pattern 010 - Data Vault - Loading Link tables
- Design Pattern 023 - Data Vault - Missing keys and placeholders

Lines changed: 29 additions & 20 deletions

# Design Pattern - Data Vault - Satellite table

---
**NOTE**

This design pattern requires a major update.

---

## Purpose

This Design Pattern describes how to represent, or load data into, Satellite tables using the Data Vault methodology.

## Motivation

Satellite tables contain context: the descriptive properties that describe a Data Vault 'Hub' table.

## Applicability

This pattern is only applicable for loading data to Satellite tables from:

- The Staging Area into the Integration Area.
- The Integration Area into the Interpretation Area.

The only difference to the specified ETL template is any business logic required in the mappings towards the Interpretation Area Satellite tables.

## Structure

The ETL process can be described as a slowly changing dimension / history update of all attributes except the business key (which is stored in the Hub table). Most attribute values, including some of the ETL process control values, are copied from the Staging Area table. This includes:

- Load Date / Time Stamp (used for the target Effective Date / Time and potentially the Update Date / Time attributes).
- Source Row Id.

## Implementation Guidelines

Multiple passes of the same source table or file are usually required. The first pass will insert new keys in the Hub table; the other passes are needed to populate the Satellite and Link tables.

The process in Figure 1 shows the entire ETL in one single process. For specific tools this way of developing ETL might be relatively inefficient. Therefore, the process can also be broken up into two separate mappings: one for inserts and one for updates. Logically the same actions will be executed, but physically two separate mappings can be used. This can be done in two ways:

- Follow the same logic, with the same selects, but place filters for the update and insert branches. This leads to an extra pass on the source table, at the possible benefit of running the processes in parallel.
- Only run the insert branch and automatically update the end dates based on the existing information in the Satellite. This process selects all records in the Satellite which have more than one open EXPIRY_DATE (this is the case after running the insert branch separately), sorts the records in order and uses the EFFECTIVE_DATE from the previous record to close the next one. This introduces a dependency between the insert and update branches, but will run faster. An extra benefit is that this also closes off any previous records that were left open.

A sample query for this selection is:

```sql
SELECT satellite.DWH_ID, satellite.<Expiry Date/Time>
FROM satellite
WHERE ( satellite.<Expiry Date/Time> IS NULL AND
        ...
        AND a.FIRM_LEDTS IS NULL)
      )
ORDER BY 1,2 DESC
```
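
The end-dating step that follows this selection can be sketched in procedural form: per key, sort the versions by effective date and close each open record with the effective date of its successor, leaving only the most recent version open. The record layout and field names below are hypothetical:

```python
from collections import defaultdict

def close_open_records(rows):
    """Close open Satellite records using the next version's effective date.

    After running only the insert branch, a key may carry several open records
    (expiry_dts is None); sorting by effective_dts and end-dating each record
    with its successor's effective_dts leaves one open record per key.
    """
    by_key = defaultdict(list)
    for row in rows:
        by_key[row["dwh_id"]].append(row)
    for versions in by_key.values():
        versions.sort(key=lambda r: r["effective_dts"])
        for current, nxt in zip(versions, versions[1:]):
            # Also closes older records that were left open by earlier runs.
            if current["expiry_dts"] is None:
                current["expiry_dts"] = nxt["effective_dts"]
    return rows

satellite = [
    {"dwh_id": 1, "effective_dts": "2024-01-01", "expiry_dts": None},
    {"dwh_id": 1, "effective_dts": "2024-02-01", "expiry_dts": None},
    {"dwh_id": 1, "effective_dts": "2024-03-01", "expiry_dts": None},
]
close_open_records(satellite)
print([r["expiry_dts"] for r in satellite])  # ['2024-02-01', '2024-03-01', None]
```

In a database this would typically be an UPDATE driven by the selection query above; the sketch only shows the ordering logic.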

If you have a Change Data Capture based source, the attribute comparison is not required because the source system supplies the information whether the record in the Staging Area is new, updated or deleted.

Use hash values to detect changes, instead of comparing attributes separately. The hash value is created from all attributes except the business key and ETL process control values.
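
A minimal sketch of such a hash comparison, assuming hypothetical attribute names and MD5 (any stable hash works the same way):

```python
import hashlib

# ETL process control values and the business key are excluded from the hash,
# so only genuine changes to descriptive attributes trigger a new version.
EXCLUDED = {"customer_bk", "load_dts", "record_source"}

def row_hash(row: dict) -> str:
    # Sort by attribute name so the hash is stable regardless of column order.
    payload = "|".join(f"{k}={row[k]}" for k in sorted(row) if k not in EXCLUDED)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

existing = {"customer_bk": "CUST-1", "load_dts": "2024-01-01",
            "record_source": "CRM", "name": "Jane", "city": "Utrecht"}
incoming = dict(existing, load_dts="2024-01-02", city="Amsterdam")

# Differing hashes signal a change, so a new Satellite version is inserted.
print(row_hash(existing) != row_hash(incoming))  # True
```

The hash is computed once per row and compared once, instead of comparing every descriptive attribute separately in the ETL tool.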

## Considerations and Consequences

Multiple passes on source data are likely to be required.

## Related Patterns

- Design Pattern 006 - Generic - Using Start, Process and End Dates
- Design Pattern 009 - Data Vault - Loading Satellite tables
- Design Pattern 010 - Data Vault - Loading Link tables
