
Commit 4a88313

Minor updates; many, many updates are needed.
1 parent aa90d4d commit 4a88313

3 files changed: +66 / -30 lines changed


design-patterns/design-pattern-data-vault-hub.md

Lines changed: 9 additions & 1 deletion
@@ -1,3 +1,7 @@
+ ---
+ uid: design-pattern-data-vault-hub-table
+ ---
+
# Design Pattern - Data Vault - Hub table

---
@@ -9,10 +13,12 @@ This design pattern requires a major update to refresh the content.

## Purpose

- This Design Pattern describes how to load data into Data Vault Hub style tables. It is a specification of the Hub ETL process.
+ This design pattern describes how to define, and load data into, Data Vault Hub style tables.

## Motivation

+ A Data Vault Hub is the physical implementation of a Core Business Concept. These are the key 'things' that can be meaningfully identified as part of an organization's business processes.
+
Loading data into Hub tables is a relatively straightforward process with a clearly defined location in the architecture: it is applied when loading data from the Staging Layer to the Integration Layer.

The Hub is a vital component of a Data Vault solution, making sure that Data Warehouse keys are distributed properly and at the right point in time.
@@ -40,6 +46,8 @@ During the selection the key distribution approach is implemented to make sure a

## Implementation Guidelines

+ Hubs are core business concepts which must be immediately and uniquely identifiable through their name.
+
Loading a Hub table from a specific Staging Layer table is a single, modular, ETL process. This is a requirement for flexibility in loading information as it enables full parallel processing.

Multiple passes of the same source table or file are usually required for various tasks. The first pass will insert new keys in the Hub table; the other passes may be needed to populate the Satellite and Link tables.
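
As a rough illustration of the Hub loading pattern described above, a minimal insert-only Hub load could look like the SQL sketch below. All object names (`stg_customer`, `hub_customer`, `customer_code`) are hypothetical, the `HASHBYTES` call is SQL Server syntax, and deriving the key by hashing the business key follows common Data Vault 2.0 practice; none of this is prescribed by the pattern itself.

```sql
-- Sketch of an insert-only Hub load from a Staging Layer table.
-- Only business keys that are not yet present in the Hub are inserted.
INSERT INTO hub_customer (customer_hash_key, customer_code, load_datetime, record_source)
SELECT
    HASHBYTES('MD5', UPPER(TRIM(stg.customer_code))) AS customer_hash_key, -- key distribution via hashing
    stg.customer_code,
    MIN(stg.load_datetime) AS load_datetime,   -- first time the key was seen in this delta
    MIN(stg.record_source) AS record_source
FROM stg_customer stg
WHERE stg.customer_code IS NOT NULL
  AND NOT EXISTS (
        SELECT 1
        FROM hub_customer hub
        WHERE hub.customer_code = stg.customer_code)
GROUP BY stg.customer_code;
```

Running one such statement (or mapping) per staging table keeps the Hub load a single, modular process, in line with the guideline above.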
Lines changed: 41 additions & 15 deletions
@@ -1,46 +1,72 @@
+ ---
+ uid: design-pattern-data-vault-link-table
+ ---
+
# Design Pattern - Data Vault - Loading Link tables

+ ---
+ **NOTE**
+
+ This design pattern requires a major update to refresh the content.
+
+ ---
+
## Purpose
- This design pattern describes the loading process for 'Link' tables in the Data Vault concept.
+
+ This design pattern describes how to define, and load data into, Data Vault Link style tables.

## Motivation
- The Link concept in Data Vault provides the flexibility of this data modeling approach. Links are sets of (hub) keys that indicate that a relationship between those Hubs has existed at some point in time. A Link table is similar in concept to the Hub table, but only stores key pairs. The structure of the relationship tables (including the Link table) is documented in the Integration Framework A120 – Integration Layer document of the Outline Architecture. Even though the Data Vault concept allows for adding attributes in the Link table it is strongly recommended (for flexibility reasons) to only store the generated hashkey (meaningless key), the Hub surrogate hashkeys and the date/time information. Doing this will ensure compatibility with both stationary facts (time dependent facts such as balances) and pure transactions.
- Link to file Eru Marumaru
- Also known As
- Relationship table
+
+ A Link table in Data Vault is the physical implementation of a Natural Business Relationship. A Link uniquely identifies a relationship between Core Business Concepts (Hub tables in Data Vault).
+
+ The Link concept in Data Vault provides the flexibility of this data modeling approach. Links are sets of (hub) keys that indicate that a relationship between those Hubs has existed at some point in time.
+
+ A Link table is similar in concept to the Hub table, but only stores key pairs.
+
+ Even though the Data Vault concept allows for adding attributes in the Link table, it is strongly recommended (for flexibility reasons) to only store the generated hash key (meaningless key), the Hub surrogate hash keys and the date/time information. Doing this will ensure compatibility with both stationary facts (time-dependent facts such as balances) and pure transactions.

## Applicability
+
This pattern is only applicable for loading processes from the Staging Area into the Integration Area and from the Integration Area to the Interpretation Area. The pattern varies slightly for the type of Link table specified (such as Transactional Link, Same-As Link, Hierarchical Link or Low-Value Links) or whether it contains a degenerate attribute. In most cases the Link table will contain the default attributes (Link Key, Hub Keys and metadata attributes) but in the case of a pure transactional Link table it can contain the transaction attributes as well.

## Structure
- The ETL process can be described as an ‘insert only’ set of the unique combination of Data Warehouse keys. Depending on the type of source table, the process will do the following:
+
+ The ETL process can be described as an 'insert only' set of the unique combination of Data Warehouse keys. Depending on the type of source table, the process will do the following:
+
Source Area to Integration Area: the process executes a SELECT DISTINCT query on business keys and performs key lookups (outer join) on the corresponding Hub tables to obtain the Hub Data Warehouse keys. The resulting key combination is then checked using a key lookup against the target Link table to verify whether that specific combination of Data Warehouse keys already exists. If it exists, the row can be discarded; if not, it can be inserted.
+
Integration Area to Interpretation Area: the process executes a SELECT DISTINCT query on Data Warehouse keys (likely after combining multiple tables first) and performs a key lookup against the target Link table to verify whether that specific combination of Data Warehouse keys already exists. If it exists, the row can be discarded; if not, it can be inserted.
The maintenance of the Interpretation Area can also be done as part of an (external) process or through Master Data Management. In this context, Link tables between Integration and Interpretation Area tables are very similar to cross-referencing tables.
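
As a sketch of the Staging Area to Integration Area variant described above, and assuming a hash-key based (Data Vault 2.0 style) implementation, the Link load could look roughly as follows. The table and column names (`stg_order_line`, `hub_customer`, `hub_product`, `lnk_customer_product`) are made up for illustration, and the hashing and string concatenation syntax is SQL Server flavoured.

```sql
-- Sketch of an insert-only Link load (Staging Area -> Integration Area):
-- select the distinct business key combinations, look up the Hub keys
-- (outer join), and insert only combinations not yet present in the Link.
INSERT INTO lnk_customer_product (customer_product_hash_key, customer_hash_key, product_hash_key, load_datetime, record_source)
SELECT
    HASHBYTES('MD5', UPPER(TRIM(stg.customer_code)) + '|' + UPPER(TRIM(stg.product_code))) AS customer_product_hash_key,
    hub_c.customer_hash_key,
    hub_p.product_hash_key,
    MIN(stg.load_datetime) AS load_datetime,
    MIN(stg.record_source) AS record_source
FROM stg_order_line stg
LEFT OUTER JOIN hub_customer hub_c
    ON hub_c.customer_code = stg.customer_code   -- key lookup against the Hub
LEFT OUTER JOIN hub_product hub_p
    ON hub_p.product_code = stg.product_code
WHERE NOT EXISTS (
        SELECT 1
        FROM lnk_customer_product lnk
        WHERE lnk.customer_hash_key = hub_c.customer_hash_key
          AND lnk.product_hash_key  = hub_p.product_hash_key)
GROUP BY stg.customer_code, stg.product_code, hub_c.customer_hash_key, hub_p.product_hash_key;
```

The Integration Area to Interpretation Area variant is the same idea, except that the SELECT DISTINCT runs on Data Warehouse keys directly and no Hub lookups are needed.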
+
The following diagram displays the ETL process for Link tables;
Business Insights > Design Pattern 010 - Data Vault - Loading Link tables > image2015-4-29 16:24:14.png
- This image needs updating to reflect DV 2.0 (hashkey) useage Eru Marumaru
+
In a pure relational Link it is required that a dummy key is available in each corresponding Link-Satellite to complete the timelines. This is handled as part of the Link-Satellite processing as a Link can contain multiple Link-Satellites. Dummy records are only required to be inserted for each driving key as a view in time across the driving key is ultimately required. Inserting a dummy record for every Link key will cause issues in the timeline. This is explained in more detail in the Link-Satellite Design Pattern.

## Implementation Guidelines
+
Use a single ETL process, module or mapping to load the Link table, thus improving flexibility in processing. Every ETL process should have a distinct function.
+
Multiple passes of the same source table or file are usually required. The first pass will insert new keys in the Link table; the other passes are needed to populate the Link Satellite tables (if any).
+
By default, create a sequence / meaningless key for each unique key combination in a Link table.
Link tables can be seen as the relationship equivalent of Hub tables; only distinct new key pairs are inserted.
Date/time information is copied from the Staging Area tables and not generated by the ETL process.
- The logic to create the initial (dummy) Satellite record can both be implemented as part of the Link ETL process, as a separate ETL process which queries all keys that have no corresponding dummy or as part of the Link-Satellite ETL process. This depends on the capabilities of the ETL software since not all are able to provide and reuse sequence generators or able to write to multiple targets in one process. The default and arguably most flexible way is to incorporate this concept as part of the Link-Satellite ETL since it does not require rework when additional Link-Satellites are associated with the Link. This means that each Link-Satellite ETL must perform a check if a dummy record exists before starting the standard process (and be able to roll back the dummy records if required).
+
+ The logic to create the initial (dummy) Satellite record can be implemented as part of the Link ETL process, as a separate ETL process which queries all keys that have no corresponding dummy, or as part of the Link-Satellite ETL process. This depends on the capabilities of the ETL software, since not all tools are able to provide and reuse sequence generators or able to write to multiple targets in one process.
+
+ The default and arguably most flexible way is to incorporate this concept as part of the Link-Satellite ETL since it does not require rework when additional Link-Satellites are associated with the Link. This means that each Link-Satellite ETL must perform a check if a dummy record exists before starting the standard process (and be able to roll back the dummy records if required).
+
Depending on how the Link table is modelled (what kind of relationship it manages) the Link table may contain a relationship type attribute. If a Link table contains multiple, or changing, relationship types, this attribute is moved to the Link-Satellite table.
Ending / closing relationships is always done in the Link-Satellite table, typically using a separate ETL process.

## Considerations and Consequences
+
Multiple passes on source data are likely to be required. In extreme cases a single source table might be used (branch out) to Hubs, Satellites, Links and Link Satellites.
- Known Uses
+
This type of ETL process is to be used for loading all Link tables in both the Integration Area as well as the Interpretation Area. This is because the Link table is also used to relate raw (Integration Area) data and cleansed (Interpretation Area) data together.

## Related Patterns
- Design Pattern 006 – Generic – Using Start, Process and End Dates
- Design Pattern 008 – Data Vault – Loading Hub tables
- Design Pattern 009 – Data Vault – Loading Satellite tables
- Design Pattern 013 – Data Vault – Loading Link Satellite tables
- Discussion items (not yet to be implemented or used until final)
- None.
+
+ * Design Pattern - Generic - Using Start, Process and End Dates
+ * [Design Pattern - Data Vault - Hub tables](xref:design-pattern-data-vault-hub-table)

design-patterns/design-pattern-generic-types-of-history.md

Lines changed: 16 additions & 14 deletions
@@ -1,17 +1,19 @@
# Design Pattern - Generic - Types of History

## Purpose
+
This design pattern describes the definitions for the commonly used history storage concepts.

## Motivation
+
Due to definitions changing over time and different definitions being made by different parties, there usually is a lot of discussion about what exactly constitutes the different types of history. This design pattern aims to define these history types in order to provide common ground for discussion.

This is also known as:
* SCD; Slowly Changing Dimensions
* Type 1, 2, 3, 4 etc.

## Applicability
- Every situation where historical data is needed / stored or a discussion arises.
+ Every situation where historical data is needed / stored or a discussion arises.

Depending on the Data Warehouse architecture, this can be needed in a variety of situations, but typically these concepts are applied in the integration and presentation layer of the Data Warehouse.


@@ -20,7 +22,7 @@ The following history types are defined, some distinction is made where there ar

**Type 0**. No change; while uncommon, it has to be mentioned that this passive approach sometimes is implemented when storage space is to be saved or only the initial state has to be preserved.

- **Type 1 A**. Change only the latest record. This implementation of type 1 is implemented if there is limited interest in keeping a specific kind of history. A good example is spelling errors; only the latest record is updated in that case (if youre not interested in the wrong spelling for data quality purposes).
+ **Type 1 - A**. Change only the latest record. This implementation of type 1 is used if there is limited interest in keeping a specific kind of history. A good example is spelling errors; only the latest record is updated in that case (if you're not interested in the wrong spelling for data quality purposes).

An example of the first instance of a type 1-A change:
Old situation; a record exists for the logical key CHS (Cheese). The attribute Name is defined as a type 1(A) attribute.
@@ -39,7 +41,7 @@ DWH Key | Logical Key | Name | Colour | Start date | End date | Update date
2 | CHS | Cheese | Yellow | 11-01-1996 | 04-01-2000 | 11-01-1996
1 | CHS | Cheese | Yellow | 07-03-1994 | 10-01-1996 | 10-01-1996

- **Type 1 B**. Update the entire history based on the latest situation. The previous example for the second version of type 1 is as follows:
+ **Type 1 - B**. Update the entire history based on the latest situation. The previous example for the second version of type 1 is as follows:
Old situation; a record exists for the logical key CHS (Cheese). The attribute Name is defined as a type 1(B) attribute.

DWH Key | Logical Key | Name | Colour | Start date | End date | Update date
@@ -48,7 +50,7 @@ DWH Key | Logical Key | Name | Colour | Start date | End date | Update date
2 | CHS | Cheese | Yellow | 11-01-1996 | 04-01-2000 | 11-01-1996
1 | CHS | Cheese | Yellow | 07-03-1994 | 10-01-1996 | 10-01-1996

- When at some point (at 24-06-2006) the name is changed to Old Cheese and the Name attribute is defined as type 1(B) the name is overwritten, resulting in the following:
+ When at some point (at 24-06-2006) the name is changed to Old Cheese and the Name attribute is defined as type 1(B) the name is overwritten, resulting in the following:

DWH Key | Logical Key | Name | Colour | Start date | End date | Update date
--- | --- | --- | --- | --- | --- | ---
@@ -85,41 +87,41 @@ DWH Key | Logical Key | Name | Previous Name | Colour | Update date

**Type 4**. This history tracking mechanism operates by using separate tables to store the history. One table contains the most recent version of the record and the history table contains some or all history.

- **Type 5**. The type 5 method of tracking history uses versions of tables for every period in time. Also known as snapshotting. No example is supplied since its basically a copy of the entire table.
+ **Type 5**. The type 5 method of tracking history uses versions of tables for every period in time. Also known as snapshotting. No example is supplied since it's basically a copy of the entire table.

- **Type 6 / hybrid**. Also known as twin time stamping, the type 6 approach combines the concepts of type 1-B, type 2 and type 3 mechanisms (1+2+3=6!). In the following example the attribute combination is the name. It consists of two attributes.
+ **Type 6 / hybrid**. Also known as twin time stamping, the type 6 approach combines the concepts of type 1-B, type 2 and type 3 mechanisms (1+2+3=6!). In the following example the attribute combination is the name. It consists of two attributes.
A new record is inserted in the Data Warehouse table.

- DWH Key | Logical Key | Name | Current Name | Colour | Start date | End date
+ DWH Key | Logical Key | Name | Current Name | Colour | Start date | End date
--- | --- | --- | --- | --- | --- | ---
1 | CHS | Cheese | Cheese | Golden | 05-01-2000 | 31-12-9999

- After some time the name is changed to Old Cheese. This leads to a SCD2 event where a new record is inserted and an old one is closed off. At the same time, the history of the existing type 3 attribute is overwritten by a type 1-B event.
+ After some time the name is changed to Old Cheese. This leads to a SCD2 event where a new record is inserted and an old one is closed off. At the same time, the history of the existing type 3 attribute is overwritten by a type 1-B event.

- DWH Key | Logical Key | Name | Current Name | Colour | Start date | End date
+ DWH Key | Logical Key | Name | Current Name | Colour | Start date | End date
--- | --- | --- | --- | --- | --- | ---
2 | CHS | Old Cheese | Old Cheese | Golden | 20-07-2008 | 31-12-9999
1 | CHS | Cheese | Old Cheese | Golden | 05-01-2000 | 19-07-2008

Now you can see the previous record and all related facts against both the current and historical name. When a new change occurs, the following happens:

- DWH Key | Logical Key | Name | Current Name | Colour | Start date | End date
+ DWH Key | Logical Key | Name | Current Name | Colour | Start date | End date
--- | --- | --- | --- | --- | --- | ---
3 | CHS | A+ Cheese | A+ Cheese | Golden | 13-03-2010 | 31-12-9999
2 | CHS | Old Cheese | A+ Cheese | Golden | 20-07-2008 | 12-03-2010
1 | CHS | Cheese | A+ Cheese | Golden | 05-01-2000 | 19-07-2008
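
A minimal sketch of the SCD2 part of such a Type 6 event, using the worked example above: the open record for the logical key is closed off and a new version is inserted. The table name `dim_product`, the column names and the `9999-12-31` open end date are illustrative assumptions; date literals are written in ISO format even though the example tables above use dd-mm-yyyy.

```sql
-- Sketch of the SCD2 part of the Type 6 change in the example above:
-- close the currently open record for the logical key, then insert the new version.
UPDATE dim_product
SET EndDate = '2010-03-12'                 -- day before the new record starts
WHERE LogicalKey = 'CHS'
  AND EndDate = '9999-12-31';              -- the currently open record

INSERT INTO dim_product (DwhKey, LogicalKey, Name, CurrentName, Colour, StartDate, EndDate)
VALUES (3, 'CHS', 'A+ Cheese', 'A+ Cheese', 'Golden', '2010-03-13', '9999-12-31');
```

The Type 1-B half of the same event, overwriting the historical Current Name values, is typically handled by a separate mapping; see the Implementation Guidelines below.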

## Implementation Guidelines
+
* Obviously, corresponding records are identified by the logical key.
* Type 1-B and the corresponding concept in Type 6 usually require separate mappings to update the entire history. Special care is needed from a performance perspective, because it has to be avoided that the entire history is rewritten over and over again when really only the latest situation for that logical key has changed. This mapping will have to aggregate the dataset to merge the latest state per natural key with the target table, and it will have to run after the regular Type 2 processes (see the sketch after this list).
* Avoid using NULL in the end date attribute of the most recent record to indicate an open / recent record date. Some databases have trouble handling NULL values and it is best practice to avoid NULL values wherever possible, especially in dimensions.
- * It is advised to add an 'current record indicator' for quick querying and easy understanding.
* Depending on the location in the Data Warehouse either tables or attributes may be defined for a specific history type. For instance, defining a table as SCD Type 2 means that a change in every attribute will lead to a new record (and closing an old one). In Data Marts the common approach is often to specify a history type per attribute. So a change in one attribute may lead to an SCD Type 2 event, but a change in another one may cause the history to be overwritten.
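
A hedged sketch of such a separate Type 1-B mapping, assuming the same illustrative `dim_product` table and `9999-12-31` as the open end date: it takes the latest value per logical key and pushes it into the historical rows, and it only touches rows that actually differ, so history is not rewritten on every run. The `UPDATE ... FROM` join syntax is SQL Server flavoured.

```sql
-- Sketch of a Type 1-B mapping: overwrite CurrentName in all historical rows
-- with the value of the most recent record per logical key.
UPDATE dim
SET dim.CurrentName = latest.Name
FROM dim_product dim
JOIN (
    SELECT LogicalKey, Name                -- latest state per logical key
    FROM dim_product
    WHERE EndDate = '9999-12-31'
) latest
    ON latest.LogicalKey = dim.LogicalKey
WHERE dim.CurrentName <> latest.Name;      -- avoid rewriting unchanged history
```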

## Considerations and Consequences
Not applicable.

## Related Patterns
- * Design Pattern 011 Kimball Multiple SCD2 time periods.
- * Design Pattern 005 Generic Current view on historical data.
- * Design Pattern 007 Kimball Receiving order of information and late and early arrivals.
+ * Design Pattern 011 - Kimball - Multiple SCD2 time periods.
+ * Design Pattern 005 - Generic - Current view on historical data.
+ * Design Pattern 007 - Kimball - Receiving order of information and late and early arrivals.
