design-patterns/design-pattern-data-vault-hub.md (9 additions, 1 deletion)
@@ -1,3 +1,7 @@
+---
+uid: design-pattern-data-vault-hub-table
+---
+
# Design Pattern - Data Vault - Hub table

---
@@ -9,10 +13,12 @@ This design pattern requires a major update to refresh the content.

## Purpose

-This Design Pattern describes how to load data into Data Vault Hub style tables. It is a specification of the Hub ETL process.
+This design pattern describes how to define, and load data into, Data Vault Hub style tables.

## Motivation

+A Data Vault Hub is the physical implementation of a Core Business Concept. These are the key 'things' that can be meaningfully identified as part of an organization's business processes.
+
Loading data into Hub tables is a relatively straightforward process with a clearly defined location in the architecture: it is applied when loading data from the Staging Layer to the Integration Layer.

The Hub is a vital component of a Data Vault solution, making sure that Data Warehouse keys are distributed properly and at the right point in time.
@@ -40,6 +46,8 @@ During the selection the key distribution approach is implemented to make sure a

## Implementation Guidelines

+Hubs are core business concepts which must be immediately and uniquely identifiable through their name.
+
Loading a Hub table from a specific Staging Layer table is a single, modular, ETL process. This is a requirement for flexibility in loading information as it enables full parallel processing.

Multiple passes of the same source table or file are usually required for various tasks. The first pass will insert new keys in the Hub table; the other passes may be needed to populate the Satellite and Link tables.
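To make the Hub loading step above concrete, a minimal SQL sketch follows. All object names (`stg_customer`, `hub_customer`) and the pre-calculated hash key column are illustrative assumptions rather than part of the pattern text; the point is the "insert previously unseen business keys only" behaviour.

```sql
-- Hypothetical sketch: first pass over a Staging table, inserting only business
-- keys that are not yet present in the Hub. Names and hash key handling are assumed.
INSERT INTO hub_customer (customer_hash_key, customer_business_key, load_datetime, record_source)
SELECT
    stg.customer_hash_key,
    stg.customer_business_key,
    MIN(stg.load_datetime) AS load_datetime,   -- earliest sighting of the key in this load
    MIN(stg.record_source) AS record_source    -- arbitrary pick when multiple sources supply the key
FROM stg_customer stg
LEFT JOIN hub_customer hub
       ON hub.customer_hash_key = stg.customer_hash_key
WHERE hub.customer_hash_key IS NULL            -- key lookup: only keys not yet in the Hub
GROUP BY stg.customer_hash_key, stg.customer_business_key;
```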
# Design Pattern - Data Vault - Loading Link tables

+---
+**NOTE**
+
+This design pattern requires a major update to refresh the content.
+
+---

## Purpose

-This design pattern describes the loading process for 'Link' tables in the Data Vault concept.
+This design pattern describes how to define, and load data into, Data Vault Link style tables.

## Motivation

+A Link table in Data Vault is the physical implementation of a Natural Business Relationship. A Link uniquely identifies a relationship between Core Business Concepts (Hub tables in Data Vault).

The Link concept in Data Vault provides the flexibility of this data modeling approach. Links are sets of (hub) keys that indicate that a relationship between those Hubs has existed at some point in time.

A Link table is similar in concept to the Hub table, but only stores key pairs.

-The structure of the relationship tables (including the Link table) is documented in the Integration Framework A120 – Integration Layer document of the Outline Architecture.
-Link to file Eru Marumaru
-Also known As
-Relationship table

Even though the Data Vault concept allows for adding attributes in the Link table, it is strongly recommended (for flexibility reasons) to only store the generated hash key (meaningless key), the Hub surrogate hash keys and the date/time information. Doing this will ensure compatibility with both stationary facts (time-dependent facts such as balances) and pure transactions.
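As an illustration of this recommendation, a hypothetical DDL sketch of such a narrow Link table is shown below; the table and column names are invented for the example, and only the Link hash key, the Hub hash keys and the load metadata are stored.

```sql
-- Hypothetical sketch of a narrow Link table: no descriptive attributes,
-- only the generated hash key, the referenced Hub hash keys and load metadata.
CREATE TABLE lnk_customer_order (
    customer_order_hash_key  CHAR(32)     NOT NULL,  -- hash of the combined business keys
    customer_hash_key        CHAR(32)     NOT NULL,  -- surrogate hash key of hub_customer
    order_hash_key           CHAR(32)     NOT NULL,  -- surrogate hash key of hub_order
    load_datetime            TIMESTAMP    NOT NULL,  -- copied from the Staging Area
    record_source            VARCHAR(100) NOT NULL,
    PRIMARY KEY (customer_order_hash_key)
);
```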
## Applicability

This pattern is only applicable for loading processes from the Staging Area into the Integration Area and from the Integration Area to the Interpretation Area. The pattern varies slightly for the type of Link table specified (such as Transactional Link, Same-As Link, Hierarchical Link or Low-Value Links) or whether it contains a degenerate attribute. In most cases the Link table will contain the default attributes (Link Key, Hub Keys and metadata attributes), but in the case of a pure transactional Link table it can contain the transaction attributes as well.

## Structure

The ETL process can be described as an 'insert only' set of the unique combination of Data Warehouse keys. Depending on the type of source table, the process will do the following:

Source Area to Integration Area: the process executes a SELECT DISTINCT query on business keys and performs key lookups (outer join) on the corresponding Hub tables to obtain the Hub Data Warehouse keys. The resulting key combination is then checked, using a key lookup against the target Link table, to verify whether that specific combination of Data Warehouse keys already exists. If it exists, the row can be discarded; if not, it can be inserted.

Integration Area to Interpretation Area: the process executes a SELECT DISTINCT query on Data Warehouse keys (likely after combining multiple tables first) and performs a key lookup against the target Link table to verify whether that specific combination of Data Warehouse keys already exists. If it exists, the row can be discarded; if not, it can be inserted.
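A minimal SQL sketch of the Source Area to Integration Area variant is added here for clarity. The object names (`stg_order`, `hub_customer`, `hub_order`, `lnk_customer_order`) are assumptions; the logic shown is the SELECT DISTINCT on business keys, the outer-join key lookups against the Hubs, and the existence check against the target Link table.

```sql
-- Hypothetical sketch: insert only key combinations that do not yet exist in the Link.
INSERT INTO lnk_customer_order (customer_order_hash_key, customer_hash_key, order_hash_key, load_datetime, record_source)
SELECT DISTINCT
    stg.customer_order_hash_key,
    hub_c.customer_hash_key,
    hub_o.order_hash_key,
    stg.load_datetime,
    stg.record_source
FROM stg_order stg
LEFT JOIN hub_customer hub_c                             -- key lookup (outer join) on the Hub
       ON hub_c.customer_business_key = stg.customer_business_key
LEFT JOIN hub_order hub_o                                -- key lookup (outer join) on the Hub
       ON hub_o.order_business_key = stg.order_business_key
LEFT JOIN lnk_customer_order lnk                         -- does this key combination already exist?
       ON lnk.customer_order_hash_key = stg.customer_order_hash_key
WHERE lnk.customer_order_hash_key IS NULL;
```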
The maintenance of the Interpretation Area can also be done as part of an (external) process or through Master Data Management. In this context, Link tables between Integration and Interpretation Area tables are very similar to cross-referencing tables.

The following diagram displays the ETL process for Link tables:

Business Insights > Design Pattern 010 - Data Vault - Loading Link tables > image2015-4-29 16:24:14.png

-This image needs updating to reflect DV 2.0 (hash key) usage Eru Marumaru

In a pure relational Link it is required that a dummy key is available in each corresponding Link-Satellite to complete the timelines. This is handled as part of the Link-Satellite processing, as a Link can contain multiple Link-Satellites. Dummy records only need to be inserted for each driving key, as a view in time across the driving key is ultimately required; inserting a dummy record for every Link key will cause issues in the timeline. This is explained in more detail in the Link-Satellite Design Pattern.

## Implementation Guidelines

Use a single ETL process, module or mapping to load the Link table, thus improving flexibility in processing. Every ETL process should have a distinct function.

Multiple passes of the same source table or file are usually required. The first pass will insert new keys in the Link table; the other passes are needed to populate the Link Satellite tables (if any).

By default, create a sequence / meaningless key for each unique key combination in a Link table.

Link tables can be seen as the relationship equivalent of Hub tables; only distinct new key pairs are inserted.

Date/time information is copied from the Staging Area tables and not generated by the ETL process.

The logic to create the initial (dummy) Satellite record can be implemented as part of the Link ETL process, as a separate ETL process which queries all keys that have no corresponding dummy, or as part of the Link-Satellite ETL process. This depends on the capabilities of the ETL software, since not all tools are able to provide and reuse sequence generators or to write to multiple targets in one process.

The default and arguably most flexible way is to incorporate this concept as part of the Link-Satellite ETL, since it does not require rework when additional Link-Satellites are associated with the Link. This means that each Link-Satellite ETL must check whether a dummy record exists before starting the standard process (and be able to roll back the dummy records if required).
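A simplified sketch of that dummy-record check is included below. Table names are hypothetical, and the restriction of dummy creation to the driving key (described in the Link-Satellite pattern) is deliberately left out to keep the example short.

```sql
-- Hypothetical sketch, run at the start of a Link-Satellite ETL: create an initial
-- (dummy) record for every Link key that has no Link-Satellite record yet.
-- Note: the pattern restricts dummy creation to the driving key; that refinement is omitted here.
INSERT INTO lsat_customer_order (customer_order_hash_key, load_datetime, load_end_datetime, record_source)
SELECT
    lnk.customer_order_hash_key,
    TIMESTAMP '1900-01-01 00:00:00',   -- dummy start to complete the timeline
    TIMESTAMP '9999-12-31 00:00:00',   -- open end date, closed off by later loads
    'DUMMY'
FROM lnk_customer_order lnk
WHERE NOT EXISTS (
    SELECT 1
    FROM lsat_customer_order lsat
    WHERE lsat.customer_order_hash_key = lnk.customer_order_hash_key   -- no record for this key yet
);
```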
Depending on how the Link table is modelled (what kind of relationship it manages), the Link table may contain a relationship type attribute. If a Link table contains multiple, or changing, relationship types, this attribute is moved to the Link-Satellite table.

Ending / closing relationships is always done in the Link-Satellite table, typically using a separate ETL process.

## Considerations and Consequences

Multiple passes on source data are likely to be required. In extreme cases a single source table might be used (branched out) to Hubs, Satellites, Links and Link Satellites.

-Known Uses

This type of ETL process is to be used for loading all Link tables in both the Integration Area as well as the Interpretation Area. This is because the Link table is also used to relate raw (Integration Area) data and cleansed (Interpretation Area) data together.

## Related Patterns

-Design Pattern 006 – Generic – Using Start, Process and End Dates
-Design Pattern 008 – Data Vault – Loading Hub tables
-Design Pattern 009 – Data Vault – Loading Satellite tables
-Design Pattern 013 – Data Vault – Loading Link Satellite tables
-Discussion items (not yet to be implemented or used until final)
-None.
+* Design Pattern - Generic - Using Start, Process and End Dates
+* [Design Pattern - Data Vault - Hub tables](xref:design-pattern-data-vault-hub-table)
design-patterns/design-pattern-generic-types-of-history.md (16 additions, 14 deletions)
@@ -1,17 +1,19 @@
# Design Pattern - Generic - Types of History

## Purpose

This design pattern describes the definitions for the commonly used history storage concepts.

## Motivation

Due to definitions changing over time and different definitions being made by different parties, there usually is a lot of discussion about what exactly constitutes the different types of history. This design pattern aims to define these history types in order to provide a common ground for discussion.

This is also known as:
* SCD; Slowly Changing Dimensions
* Type 1, 2, 3, 4 etc.

## Applicability

Every situation where historical data is needed / stored or a discussion arises.

Depending on the Data Warehouse architecture, this can be needed in a variety of situations, but typically these concepts are applied in the integration and presentation layers of the Data Warehouse.
@@ -20,7 +22,7 @@ The following history types are defined, some distinction is made where there ar

**Type 0**. No change. While uncommon, it has to be mentioned that this passive approach is sometimes implemented when storage space is to be saved or only the initial state has to be preserved.

**Type 1 - A**. Change only the latest record. This implementation of type 1 is used if there is limited interest in keeping a specific kind of history. A good example is spelling errors; only the latest record is updated in that case (if you're not interested in the wrong spelling for data quality purposes).

An example of the first instance of a type 1-A change:

Old situation; a record exists for the logical key CHS (Cheese). The attribute Name is defined as a type 1(A) attribute.
@@ -39,7 +41,7 @@ DWH Key | Logical Key | Name | Colour | Start date | End date | Update date

When at some point (at 24-06-2006) the name is changed to Old Cheese and the Name attribute is defined as type 1(B), the name is overwritten, resulting in the following:

DWH Key | Logical Key | Name | Colour | Start date | End date | Update date
--- | --- | --- | --- | --- | --- | ---
@@ -85,41 +87,41 @@ DWH Key | Logical Key | Name | Previous Name | Colour | Update date

**Type 4**. This history tracking mechanism operates by using separate tables to store the history. One table contains the most recent version of the record and the history table contains some or all history.

**Type 5**. The type 5 method of tracking history uses versions of tables for every period in time. Also known as 'snapshotting'. No example is supplied since it's basically a copy of the entire table.

**Type 6 / hybrid**. Also known as 'twin time stamping', the type 6 approach combines the concepts of type 1-B, type 2 and type 3 mechanisms (1+2+3=6!). In the following example the attribute combination is the name. It consists of two attributes.

A new record is inserted in the Data Warehouse table.

DWH Key | Logical Key | Name | Current Name | Colour | Start date | End date

After some time the name is changed to Old Cheese. This leads to an SCD2 event where a new record is inserted and an old one is closed off. At the same time, the history of the existing type 3 attribute is overwritten by a type 1-B event.

DWH Key | Logical Key | Name | Current Name | Colour | Start date | End date
--- | --- | --- | --- | --- | --- | ---
2 | CHS | Old Cheese | Old Cheese | Golden | 20-07-2008 | 31-12-9999
1 | CHS | Cheese | Old Cheese | Golden | 05-01-2000 | 19-07-2008

Now you can see the previous record and all related facts against both the current and historical name. When a new change occurs, the following happens:

DWH Key | Logical Key | Name | Current Name | Colour | Start date | End date
* Obviously, corresponding records are identified by the logical key.
* Type 1-B and the corresponding concept in Type 6 usually require separate mappings to update the entire history. Special care is needed from a performance perspective, because it has to be avoided that the entire history is rewritten over and over again when really only the latest situation for that logical key needs to be applied. This mapping will have to aggregate the dataset to merge the latest state per natural key with the target table, and it will have to run after the regular Type 2 processes (see the sketch after this list).
* Avoid using NULL in the end date attribute of the most recent record to indicate an open / recent record date. Some databases have trouble handling NULL values and it is best practice to avoid NULL values wherever possible, especially in dimensions.
-* It is advised to add a 'current record indicator' for quick querying and easy understanding.
* Depending on the location in the Data Warehouse either tables or attributes may be defined for a specific history type. For instance, defining a table as SCD Type 2 means that a change in every attribute will lead to a new record (and closing an old one). In Data Marts the common approach is often to specify a history type per attribute, so a change in one attribute may lead to an SCD Type 2 event, but a change in another one may cause the history to be overwritten.
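To illustrate the separate Type 1-B / Type 6 mapping mentioned in the list above, a hypothetical SQL sketch follows. The dimension and column names are invented, the open records are assumed to carry the high end date 31-12-9999 used in the examples, and dialect details (such as updates with correlated subqueries) will vary per database.

```sql
-- Hypothetical sketch of the separate Type 1-B / Type 6 mapping: copy the Name of the
-- current (open) record per logical key onto all historical rows, but only touch rows
-- where the value actually differs, so the history is not rewritten on every run.
UPDATE dim_product
SET current_name = (
        SELECT cur.name
        FROM dim_product cur
        WHERE cur.logical_key = dim_product.logical_key
          AND cur.end_date = DATE '9999-12-31'          -- the open / most recent record
    )
WHERE current_name <> (
        SELECT cur.name
        FROM dim_product cur
        WHERE cur.logical_key = dim_product.logical_key
          AND cur.end_date = DATE '9999-12-31'
    );
```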