
Commit 6922783

Major re-review
Still in progress. Need to redo all markups, go through documentation end to end.
1 parent 7a34c07 commit 6922783

17 files changed: +248 −189 lines

1000_Design_Patterns/Design Pattern - Data Vault - Loading Hub tables.md

Lines changed: 13 additions & 10 deletions
@@ -2,15 +2,18 @@
 
 ## Purpose
 This Design Pattern describes how to load data into Data Vault Hub style entities.
-Motivation
+
+## Motivation
 Loading data into Hub tables is a relatively straightforward process with a fixed location in the scheduling of loading data from the Staging Layer to the Integration Layer. It is a vital component of the Data Warehouse architecture, making sure that Data Warehouse keys are distributed properly and at the right point in time. Decoupling key distribution and historical information is an essential requirement for parallel processing and for reducing dependencies in the loading process. This pattern specifies how this process works and why it is important to follow it. In a Data Vault based Enterprise Data Warehouse solution, the Hub tables (and corresponding ETL) are the only places where Data Warehouse keys are distributed.
 Also known as
 Hub (Data Vault modelling concept)
 Surrogate Key (SK) or Hash Key (HSH) distribution
 Data Warehouse key distribution
-Applicability
+
+## Applicability
 This pattern is applicable for the process of loading from the Staging Layer into the Integration Area Hub tables only.
-Structure
+
+## Structure
 The ETL process can be described as an ‘insert only’ set of the unique business keys. The process performs a SELECT DISTINCT on the Staging Area table and a key lookup to retrieve the OMD_RECORD_SOURCE_ID based on the value in the Staging Layer table. If no entry for the record source is found, the ETL process is set to fail because this indicates a major error in the ETL Framework configuration (i.e. this must be tested during unit and UAT testing).
 Using this value and the source business key, the process performs a key lookup (outer join) to verify whether that specific business key already exists in the target Hub table (for that particular record source). If it exists, the row can be discarded; if not, it can be inserted.
 Business Insights > Design Pattern 008 - Data Vault - Loading Hub tables > image2015-4-29 14:54:58.png
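Reviewer's aside, not part of the diff: the ‘insert only’ Hub load described above can be sketched with Python and sqlite3. All table and column names (STG_CUSTOMER, HUB_CUSTOMER, CUSTOMER_ID) are hypothetical placeholders, and the record-source lookup failure path is omitted for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE STG_CUSTOMER (
        CUSTOMER_ID TEXT, OMD_RECORD_SOURCE_ID INTEGER, OMD_INSERT_DATETIME TEXT);
    CREATE TABLE HUB_CUSTOMER (
        CUSTOMER_SK INTEGER PRIMARY KEY,  -- Data Warehouse key, distributed here only
        CUSTOMER_ID TEXT, OMD_RECORD_SOURCE_ID INTEGER, OMD_INSERT_DATETIME TEXT);
""")
conn.executemany("INSERT INTO STG_CUSTOMER VALUES (?, ?, ?)", [
    ("A1", 1, "2015-01-01"), ("A1", 1, "2015-01-01"), ("B2", 1, "2015-01-01"),
])

def load_hub(conn):
    # SELECT DISTINCT on the staged business keys, key lookup (outer join)
    # against the target Hub, and insert only keys that do not yet exist.
    conn.execute("""
        INSERT INTO HUB_CUSTOMER (CUSTOMER_ID, OMD_RECORD_SOURCE_ID, OMD_INSERT_DATETIME)
        SELECT DISTINCT stg.CUSTOMER_ID, stg.OMD_RECORD_SOURCE_ID, stg.OMD_INSERT_DATETIME
        FROM STG_CUSTOMER stg
        LEFT JOIN HUB_CUSTOMER hub
               ON hub.CUSTOMER_ID = stg.CUSTOMER_ID
              AND hub.OMD_RECORD_SOURCE_ID = stg.OMD_RECORD_SOURCE_ID
        WHERE hub.CUSTOMER_ID IS NULL  -- discard rows whose key already exists
    """)

load_hub(conn)
load_hub(conn)  # a second run inserts nothing: the process is insert-only and idempotent
hub_keys = conn.execute(
    "SELECT CUSTOMER_ID FROM HUB_CUSTOMER ORDER BY CUSTOMER_ID").fetchall()
```

Rerunning the load is safe because the outer join filters out every key that is already present, which is exactly why no other ETL process may distribute Hub keys.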
@@ -20,7 +23,8 @@ The Hub ETL processes are the first ones that need to be executed in the Integra
 Business Insights > Design Pattern 008 - Data Vault - Loading Hub tables > BI2.png
 Figure 2: Dependencies
 Logically the creation of the initial Satellite record is part of the ETL process for Hub tables and is a prerequisite for further processing of the Satellites.
-Implementation guidelines
+
+## Implementation Guidelines
 Use a single ETL process, module or mapping to load the Hub table, thus improving flexibility in processing. This means that no Hub keys will be distributed as part of another ETL process.
 Multiple passes of the same source table or file are usually required for various tasks. The first pass will insert new keys in the Hub table; the other passes may be needed to populate the Satellite and Link tables.
 The designated business key (usually the source natural key, but not always!) is the ONLY non-process or Data Warehouse related attribute in the Hub table.
@@ -30,16 +34,15 @@ By default the DISTINCT function is executed on database level to reserve resour
 The logic to create the initial (dummy) Satellite record can be implemented as part of the Hub ETL process, as a separate ETL process which queries all keys that have no corresponding dummy, or as part of the Satellite ETL process. This depends on the capabilities of the ETL software, since not all tools are able to provide and reuse sequence generators or to write to multiple targets in one process. The default and arguably most flexible way is to incorporate this concept as part of the Satellite ETL since it does not require rework when additional Satellites are associated with the Hub. This means that each Satellite ETL must check whether a dummy record exists before starting the standard process (and be able to roll back the dummy records if required).
 When modeling the Hub tables, try to be conservative when defining the business keys. Not every foreign key in the source indicates a business key and therefore a Hub table. A true business key is a concept that is known and used throughout the organisation (and systems) and is ‘self-standing’ and meaningful.
 To cater for a situation where multiple OMD_INSERT_DATETIME values exist for a single business key, the minimum OMD_INSERT_DATETIME should be the value passed through with the Hub record. This can be implemented in ETL logic, or passed through to the database. When implemented at database level, instead of using a SELECT DISTINCT, using the MIN function with a GROUP BY on the business key can achieve both a distinct selection and the minimum OMD_INSERT_DATETIME in one step.
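Reviewer's illustration of the MIN / GROUP BY variant above, as a sketch with hypothetical table and column names; any SQL engine would behave the same:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE STG (BUSINESS_KEY TEXT, OMD_INSERT_DATETIME TEXT)")
conn.executemany("INSERT INTO STG VALUES (?, ?)", [
    ("A1", "2015-01-02"), ("A1", "2015-01-01"), ("B2", "2015-01-03"),
])
# MIN with GROUP BY replaces SELECT DISTINCT: one row per business key,
# carrying the minimum OMD_INSERT_DATETIME, in a single step.
rows = conn.execute("""
    SELECT BUSINESS_KEY, MIN(OMD_INSERT_DATETIME)
    FROM STG
    GROUP BY BUSINESS_KEY
    ORDER BY BUSINESS_KEY
""").fetchall()
```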
-Consequences
+
+## Considerations and Consequences
 Multiple passes on source data are likely to be required: once for the Hub tables and subsequently for the Link and Satellite tables. Defining Hub ETL processes in the atomic way defined in this Design Pattern means that many files load data to the same central Hub table; all processes will be very similar, the only difference being the mapping between the source attribute that represents the business key and its Hub counterpart.
 A single Hub may be loaded by many Modules from a single source system, and there may be several Satellites for the source system hanging off this Hub. It needs to be ensured that all corresponding Satellites are populated by the Hub ETL.
 Known uses
 This type of ETL process is to be used in all Hub or Surrogate Key tables in the Integration Area. The Interpretation Area Hub tables, if used, have similar characteristics but the ETL process contains business logic.
-Related patterns
+
+## Related Patterns
 Design Pattern 006 – Generic – Using Start, Process and End Dates
 Design Pattern 009 – Data Vault – Loading Satellite tables
 Design Pattern 010 – Data Vault – Loading Link tables
-Design Pattern 023 – Data Vault – Missing keys and placeholders
-
-Discussion items (not yet to be implemented or used until final)
-
-The OMD_INSERT_DATETIME that represents the implementation of the Event Date/Time concept is currently populated in a different way than similar OMD information (such as the OMD_UPDATE_DATETIME). It may be easier to introduce an OMD_EVENT_DATETIME attribute that captures this information.
+Design Pattern 023 – Data Vault – Missing keys and placeholders

1000_Design_Patterns/Design Pattern - Data Vault - Loading Link Satellite tables.md

Lines changed: 12 additions & 6 deletions
@@ -2,19 +2,23 @@
 
 ## Purpose
 This Design Pattern describes how to load data into Link-Satellite tables within a ‘Data Vault’ EDW architecture. In Data Vault, Link-Satellite tables manage the change for relationships over time.
-Motivation
+
+## Motivation
 
 Also known as
 Link-Satellite (Data Vault modelling concept).
 History or INT tables.
-Applicability
+
+## Applicability
 This pattern is only applicable for loading data to Link-Satellite tables from:
 The Staging Area into the Integration Area.
 The Integration Area into the Interpretation Area.
 The only difference to the specified ETL template is any business logic required in the mappings towards the Interpretation Area tables.
-Structure
+
+## Structure
 Standard Link-Satellites use the Driving Key concept to manage the ending of ‘old’ relationships.
-Implementation guidelines
+
+## Implementation Guidelines
 Multiple passes of the same source table or file are usually required. The first pass will insert new keys in the Hub table; the other passes are needed to populate the Satellite and Link tables.
 Select all records for the Link Satellite which have more than one open effective date / current record indicator but are not the most recent (because that record does not need to be closed).
 WITH MyCTE (<Link SK>, <Driving Key SK>, OMD_EFFECTIVE_DATE, OMD_EXPIRY_DATE, RowVersion)
@@ -44,10 +48,12 @@ LEFT JOIN MyCTE LEAD ON BASE.<Driving Key SK> = LEAD.<Driving Key SK>
 LEFT JOIN MyCTE LAG ON BASE.<Driving Key SK> = LAG.<Driving Key SK>
 AND BASE.RowVersion = LAG.RowVersion+1
 WHERE BASE.OMD_EXPIRY_DATE = '99991231'
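Reviewer's sketch of what the LEAD/LAG pairing in this SQL fragment computes, in plain Python with hypothetical data (field names mirror the fragment): per driving key, records are ordered by effective date (the RowVersion), each record is end-dated with the effective date of the next row version, and only the most recent record keeps the open expiry date.

```python
from datetime import date

HIGH_DATE = date(9999, 12, 31)  # the '99991231' open expiry date in the fragment

def end_date_link_satellite(rows):
    # rows: dicts with 'driving_key', 'effective_date', 'expiry_date'
    by_key = {}
    for row in rows:
        by_key.setdefault(row["driving_key"], []).append(row)
    for versions in by_key.values():
        versions.sort(key=lambda r: r["effective_date"])  # RowVersion order
        for base, lead in zip(versions, versions[1:]):
            # BASE joined to LEAD (RowVersion + 1): expiry = next effective date
            base["expiry_date"] = lead["effective_date"]
        versions[-1]["expiry_date"] = HIGH_DATE  # the newest record stays open
    return rows

records = [
    {"driving_key": 12, "effective_date": date(2015, 1, 1), "expiry_date": HIGH_DATE},
    {"driving_key": 12, "effective_date": date(2015, 3, 1), "expiry_date": HIGH_DATE},
]
end_date_link_satellite(records)
```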
-Consequences
+
+## Considerations and Consequences
 Multiple passes on source data are likely to be required.
 Known uses
-Related patterns
+
+## Related Patterns
 Design Pattern 006 – Using Start, Process and End Dates
 Design Pattern 009 – Loading Satellite tables.
 Design Pattern 010 – Loading Link tables.

1000_Design_Patterns/Design Pattern - Data Vault - Loading Link tables.md

Lines changed: 12 additions & 6 deletions
@@ -2,14 +2,17 @@
 
 ## Purpose
 This design pattern describes the loading process for ‘Link’ tables in the Data Vault concept.
-Motivation
+
+## Motivation
 The Link concept in Data Vault provides the flexibility of this data modelling approach. Links are sets of (Hub) keys that indicate that a relationship between those Hubs has existed at some point in time. A Link table is similar in concept to the Hub table, but only stores key pairs. The structure of the relationship tables (including the Link table) is documented in the Integration Framework A120 – Integration Layer document of the Outline Architecture. Even though the Data Vault concept allows for adding attributes in the Link table, it is strongly recommended (for flexibility reasons) to only store the generated hash key (meaningless key), the Hub surrogate hash keys and the date/time information. Doing this will ensure compatibility with both stationary facts (time dependent facts such as balances) and pure transactions.
 Link to file Eru Marumaru
 Also known as
 Relationship table
-Applicability
+
+## Applicability
 This pattern is only applicable for loading processes from the Staging Area into the Integration Area and from the Integration Area to the Interpretation Area. The pattern varies slightly for the type of Link table specified (such as Transactional Link, Same-As Link, Hierarchical Link or Low-Value Links) or whether it contains a degenerate attribute. In most cases the Link table will contain the default attributes (Link Key, Hub Keys and metadata attributes), but in the case of a pure transactional Link table it can contain the transaction attributes as well.
-Structure
+
+## Structure
 The ETL process can be described as an ‘insert only’ set of the unique combination of Data Warehouse keys. Depending on the type of source table, the process will do the following:
 Source Area to Integration Area: the process executes a SELECT DISTINCT query on business keys and performs key lookups (outer join) on the corresponding Hub tables to obtain the Hub Data Warehouse keys. The resulting key combination is then checked with a key lookup against the target Link table to verify whether that specific combination of Data Warehouse keys already exists. If it exists, the row can be discarded; if not, it can be inserted.
 Integration Area to Interpretation Area: the process executes a SELECT DISTINCT query on Data Warehouse keys (likely after combining multiple tables first) and performs a key lookup against the target Link table to verify whether that specific combination of Data Warehouse keys already exists. If it exists, the row can be discarded; if not, it can be inserted.
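Reviewer's aside: the Source Area to Integration Area branch above can be sketched with Python and sqlite3. All names (HUB_CUSTOMER, HUB_PRODUCT, LNK_ORDER, STG_ORDER) are hypothetical, and the Hub lookups are simplified to inner joins on the assumption that every business key was already loaded by the Hub ETL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE HUB_CUSTOMER (CUSTOMER_SK INTEGER, CUSTOMER_ID TEXT);
    CREATE TABLE HUB_PRODUCT  (PRODUCT_SK INTEGER, PRODUCT_ID TEXT);
    CREATE TABLE LNK_ORDER    (CUSTOMER_SK INTEGER, PRODUCT_SK INTEGER);
    CREATE TABLE STG_ORDER    (CUSTOMER_ID TEXT, PRODUCT_ID TEXT);
    INSERT INTO HUB_CUSTOMER VALUES (1, 'A1');
    INSERT INTO HUB_PRODUCT  VALUES (7, 'P9');
    INSERT INTO STG_ORDER    VALUES ('A1', 'P9'), ('A1', 'P9');
""")

def load_link(conn):
    # SELECT DISTINCT business keys, key lookups to the Hubs for the DW keys,
    # then verify the combination against the target Link table; insert only
    # combinations that are missing.
    conn.execute("""
        INSERT INTO LNK_ORDER (CUSTOMER_SK, PRODUCT_SK)
        SELECT DISTINCT hc.CUSTOMER_SK, hp.PRODUCT_SK
        FROM STG_ORDER stg
        JOIN HUB_CUSTOMER hc ON hc.CUSTOMER_ID = stg.CUSTOMER_ID
        JOIN HUB_PRODUCT  hp ON hp.PRODUCT_ID  = stg.PRODUCT_ID
        LEFT JOIN LNK_ORDER lnk
               ON lnk.CUSTOMER_SK = hc.CUSTOMER_SK
              AND lnk.PRODUCT_SK  = hp.PRODUCT_SK
        WHERE lnk.CUSTOMER_SK IS NULL  -- discard existing key combinations
    """)

load_link(conn)
load_link(conn)  # second run inserts nothing: the combination already exists
link_rows = conn.execute("SELECT * FROM LNK_ORDER").fetchall()
```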
@@ -18,7 +21,8 @@ The following diagram displays the ETL process for Link tables;
 Business Insights > Design Pattern 010 - Data Vault - Loading Link tables > image2015-4-29 16:24:14.png
 This image needs updating to reflect DV 2.0 (hashkey) usage Eru Marumaru
 In a pure relational Link it is required that a dummy key is available in each corresponding Link-Satellite to complete the timelines. This is handled as part of the Link-Satellite processing, as a Link can contain multiple Link-Satellites. Dummy records only need to be inserted for each driving key, as a view in time across the driving key is ultimately required. Inserting a dummy record for every Link key will cause issues in the timeline. This is explained in more detail in the Link-Satellite Design Pattern.
-Implementation Guidelines
+
+## Implementation Guidelines
 Use a single ETL process, module or mapping to load the Link table, thus improving flexibility in processing. Every ETL process should have a distinct function.
 Multiple passes of the same source table or file are usually required. The first pass will insert new keys in the Link table; the other passes are needed to populate the Link Satellite tables (if any).
 By default, create a sequence / meaningless key for each unique key combination in a Link table.
@@ -27,11 +31,13 @@ Date/time information is copied from the Staging Area tables and not generated b
 The logic to create the initial (dummy) Satellite record can be implemented as part of the Link ETL process, as a separate ETL process which queries all keys that have no corresponding dummy, or as part of the Link-Satellite ETL process. This depends on the capabilities of the ETL software, since not all tools are able to provide and reuse sequence generators or to write to multiple targets in one process. The default and arguably most flexible way is to incorporate this concept as part of the Link-Satellite ETL since it does not require rework when additional Link-Satellites are associated with the Link. This means that each Link-Satellite ETL must check whether a dummy record exists before starting the standard process (and be able to roll back the dummy records if required).
 Depending on how the Link table is modelled (what kind of relationship it manages), the Link table may contain a relationship type attribute. If a Link table contains multiple, or changing, relationship types, this attribute is moved to the Link-Satellite table.
 Ending / closing relationships is always done in the Link-Satellite table, typically using a separate ETL process.
-Consequences
+
+## Considerations and Consequences
 Multiple passes on source data are likely to be required. In extreme cases a single source table might branch out to Hubs, Satellites, Links and Link Satellites.
 Known Uses
 This type of ETL process is to be used for loading all Link tables in both the Integration Area as well as the Interpretation Area. This is because the Link table is also used to relate raw (Integration Area) data and cleansed (Interpretation Area) data together.
-Related Patterns
+
+## Related Patterns
 Design Pattern 006 – Generic – Using Start, Process and End Dates
 Design Pattern 008 – Data Vault – Loading Hub tables
 Design Pattern 009 – Data Vault – Loading Satellite tables

1000_Design_Patterns/Design Pattern - Data Vault - Loading Satellite tables.md

Lines changed: 13 additions & 9 deletions
@@ -2,17 +2,20 @@
 
 ## Purpose
 This Design Pattern describes how to load data into Satellite tables within a ‘Data Vault’ EDW architecture. The concept can be applied to any SCD2 mechanism as well.
-Motivation
+
+## Motivation
 The Design Pattern to load data into Satellite style tables aims to simplify and streamline the way ETL design is done for these tables. The process is essentially straightforward and does not require any business logic other than the definition of the business key. This is already done as part of the data modelling and Hub definition steps.
 Also known as
 Satellite (Data Vault modelling concept).
 History or INT tables.
-Applicability
+
+## Applicability
 This pattern is only applicable for loading data to Satellite tables from:
 The Staging Area into the Integration Area.
 The Integration Area into the Interpretation Area.
 The only difference to the specified ETL template is any business logic required in the mappings towards the Interpretation Area Satellite tables.
-Structure
+
+## Structure
 The ETL process can be described as a slowly changing dimension / history update of all attributes except the business key (which is stored in the Hub table). This is explained in the following diagram. Most attribute values, including some of the OMD values, are copied from the Staging Area table. This includes:
 OMD_INSERT_DATETIME (used for the target OMD_EFFECTIVE_DATE and OMD_UPDATE_DATETIME attributes).
 OMD_SOURCE_ROW_ID.
@@ -24,7 +27,8 @@ Figure 1: Satellite ETL process
 The Satellite ETL processes can only be run after the Hub process has finished, but can run in parallel with the Link ETL process. This is displayed in the following diagram:
 Business Insights > Design Pattern 009 - Data Vault - Loading Satellite Tables > BI.png
 Figure 2: Dependencies
-Implementation guidelines
+
+## Implementation Guidelines
 Multiple passes of the same source table or file are usually required. The first pass will insert new keys in the Hub table; the other passes are needed to populate the Satellite and Link tables.
 The process in Figure 1 shows the entire ETL in one single process. For specific tools this way of developing ETL might be relatively inefficient. Therefore, the process can also be broken up into two separate mappings: one for inserts and one for updates. Logically the same actions will be executed, but physically two separate mappings can be used. This can be done in two ways:
 Follow the same logic, with the same selects, but place filters for the update and insert branches. This leads to an extra pass on the source table, at the possible benefit of running the processes in parallel.
@@ -39,13 +43,13 @@ WHERE ( satellite.OMD_EXPIRY_DATE IS NULL AND
 ORDER BY 1,2 DESC
 If you have a Change Data Capture based source, the attribute comparison is not required because the source system supplies the information whether the record in the Staging Area is new, updated or deleted.
 Use hash values to detect changes, instead of comparing attributes separately. The hash value is created from all attributes except the business key and OMD values.
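Reviewer's sketch of the hash-based change detection above, with hypothetical attribute names: all attributes except the business key and the OMD values are concatenated in a fixed order and hashed, so change detection compares one checksum instead of every attribute (the hash algorithm itself is an assumption; the pattern does not prescribe one).

```python
import hashlib

def row_checksum(row, business_keys=("CUSTOMER_ID",)):
    # Concatenate the comparable attributes in a fixed (sorted) order with a
    # delimiter unlikely to occur in the data, excluding the business key and
    # OMD metadata attributes, then hash the result.
    payload = "|".join(
        str(row[key]) for key in sorted(row)
        if key not in business_keys and not key.startswith("OMD_")
    )
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

existing  = {"CUSTOMER_ID": "A1", "NAME": "Jones", "CITY": "Wellington",
             "OMD_INSERT_DATETIME": "2015-01-01"}
incoming  = {"CUSTOMER_ID": "A1", "NAME": "Jones", "CITY": "Auckland",
             "OMD_INSERT_DATETIME": "2015-02-01"}
unchanged = {"CUSTOMER_ID": "A1", "NAME": "Jones", "CITY": "Wellington",
             "OMD_INSERT_DATETIME": "2015-02-01"}

changed = row_checksum(incoming) != row_checksum(existing)   # attribute change detected
same = row_checksum(unchanged) == row_checksum(existing)     # OMD difference ignored
```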
-Consequences
+
+## Considerations and Consequences
 Multiple passes on source data are likely to be required.
 Known uses
 This type of ETL process is to be used in all Hub or SK tables in the Integration Area. The Cleansing Area Hub tables, if used, have similar characteristics but the ETL process contains business logic.
-Related patterns
+
+## Related Patterns
 Design Pattern 006 – Generic – Using Start, Process and End Dates
 Design Pattern 009 – Data Vault – Loading Satellite tables.
-Design Pattern 010 – Data Vault – Loading Link tables.
-
-Discussion items (not yet to be implemented or used until final)
-
-None.
+
+Design Pattern 010 – Data Vault – Loading Link tables.
