# Design Pattern - Data Vault - Creating Dimensions from Hub tables
## Purpose
This design pattern describes how to create a typical ‘Type 2 Dimension’ table (Dimensional Modelling) from a Data Vault or Hybrid EDW model.
## Motivation
Moving from a Data Vault (or other hybrid) model to a Kimball-style Star Schema or similar requires joining the various tables that store historical data to each other. This is a recurring step which, if done properly, makes it easy to change dimension structures without losing history. Merging the various historical data sets is seen as one of the more complex steps in a Data Vault (or similar) environment. The pattern is called ‘creating Dimensions from Hub tables’ because Hubs are the main entities that are linked together, using their historical information and relationships, to form a Dimension.
## Also known as

- Dimensions / Dimensional Modelling
- Gaps and islands
- Timelines
## Applicability
This pattern is only applicable to loading processes from source systems or files to the Reporting Structure Area (of the Presentation Layer). The Helper Area may use similar concepts, but since it is a ‘free-for-all’ part of the ETL Framework it is not mandatory to follow this Design Pattern there.
## Structure
Creating Dimensions from a Data Vault model essentially means joining the various Hub, Link and Satellite tables together to create a certain hierarchy. In the example shown in the following diagram, the Dimension that can be generated is a ‘Product’ dimension, with the Distribution Channel as a higher level in that dimension.
[Image: BI7.png]
Figure 1: Example Data Vault model
Creating dimensions by joining tables with history means that the overlaps in timelines (effective and expiry dates) are ‘cut’ into multiple records with smaller intervals. This is explained using the following sample data sets; only the tables that contain history are shown.
SAT Product:
| Key | Product         | Effective Date | Expiry Date |
|-----|-----------------|----------------|-------------|
| 73  | Cheese          | 01-01-2009     | 05-06-2010  |
| 73  | Cheese – Yellow | 05-06-2010     | 04-04-2011  |
| 73  | Cheese – Gold   | 04-04-2011     | 31-12-9999  |

Before being joined to the other sets, this Satellite table is first joined to the Hub table. The Hub table maps the Data Warehouse key ‘73’ to the business key ‘CHS’.

SAT Product – Channel (Link-Satellite):
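The Hub lookup described above can be sketched with an in-memory SQLite database. The table and column names (`HUB_PRODUCT`, `SAT_PRODUCT`, and so on) are assumptions for this illustration, not names prescribed by the pattern, and the sample dates are rewritten in ISO format so they order correctly as strings.

```python
# Sketch: resolve the Data Warehouse key to its business key by joining
# the Hub to the Satellite. Table/column names are illustrative only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE HUB_PRODUCT (PRODUCT_KEY INT, PRODUCT_BK TEXT);
CREATE TABLE SAT_PRODUCT (PRODUCT_KEY INT, PRODUCT TEXT,
                          EFFECTIVE_DATE TEXT, EXPIRY_DATE TEXT);
INSERT INTO HUB_PRODUCT VALUES (73, 'CHS');
INSERT INTO SAT_PRODUCT VALUES
  (73, 'Cheese',          '2009-01-01', '2010-06-05'),
  (73, 'Cheese - Yellow', '2010-06-05', '2011-04-04'),
  (73, 'Cheese - Gold',   '2011-04-04', '9999-12-31');
""")

# The Hub maps Data Warehouse key 73 to business key 'CHS'.
rows = conn.execute("""
  SELECT h.PRODUCT_BK, s.PRODUCT, s.EFFECTIVE_DATE, s.EXPIRY_DATE
  FROM HUB_PRODUCT h
  JOIN SAT_PRODUCT s ON s.PRODUCT_KEY = h.PRODUCT_KEY
  ORDER BY s.EFFECTIVE_DATE
""").fetchall()
for r in rows:
    print(r)
```

Every Satellite record now carries the business key, which is what the subsequent timeline merge works with.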
| Link Key | Product Key | Channel Key | … |
|----------|-------------|-------------|---|

When merging these two data sets into a dimension, the overlaps in time are calculated.
[Image: BI8.png]
Figure 2: Timelines
In other words, merging the two historic data sets, where one has 4 records (time periods) and the other has 3, results in a new set that has 6 (‘smaller’) records. This gives the following result data set (changes are highlighted):
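The timeline ‘cut’ can be sketched as a pairwise intersection of the two histories. The product history is the sample data from above (in ISO date format); the channel history here is hypothetical, invented to mirror the 4-record versus 3-record merge the text describes, since the article's Link-Satellite rows are not reproduced in this excerpt.

```python
# Minimal sketch of merging two histories: each overlap between a product
# interval and a channel interval becomes one smaller record.
product_history = [                      # from the sample SAT Product table
    ("Cheese",          "2009-01-01", "2010-06-05"),
    ("Cheese - Yellow", "2010-06-05", "2011-04-04"),
    ("Cheese - Gold",   "2011-04-04", "9999-12-31"),
]
channel_history = [                      # hypothetical 4-record history
    ("Retail",    "2009-01-01", "2010-01-01"),
    ("Wholesale", "2010-01-01", "2011-01-01"),
    ("Online",    "2011-01-01", "2012-01-01"),
    ("Omni",      "2012-01-01", "9999-12-31"),
]

def merge_timelines(a, b):
    """Cut overlapping (effective, expiry) intervals into smaller records."""
    result = []
    for name_a, eff_a, exp_a in a:
        for name_b, eff_b, exp_b in b:
            eff = max(eff_a, eff_b)      # 'greatest' of the effective dates
            exp = min(exp_a, exp_b)      # 'least' of the expiry dates
            if eff < exp:                # keep only genuine overlaps
                result.append((name_a, name_b, eff, exp))
    return sorted(result, key=lambda r: r[2])

merged = merge_timelines(product_history, channel_history)
for row in merged:
    print(row)
```

With these inputs the 3-record and 4-record histories merge into 6 smaller records, matching the count in the text.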
| Dimension Key | Product Key | Product | … |
|---------------|-------------|---------|---|

When creating a standard Dimension table it is recommended to assign new surrogate keys.
- The original Integration Layer keys remain attributes of the new Dimension table.
- Creating a Type 1 Dimension is easier; only the most recent records need to be joined.
- Joining has to be done with < and > selections, which not every ETL tool supports (easily); this may require SQL overrides.
- Some ETL tools or databases make the WHERE clause more readable by providing a GREATEST or LEAST function.
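The range join with < and > comparisons can be sketched in SQL as follows, here run through SQLite, whose scalar MAX/MIN functions play the role of GREATEST/LEAST. The table names and the channel history rows are assumptions for this example; dates are in ISO format so they compare correctly as strings.

```python
# Sketch: overlap join between two history tables using < and > in the
# JOIN clause, with MAX/MIN (SQLite's GREATEST/LEAST equivalents) cutting
# each overlap down to the smaller interval. Names are illustrative only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE SAT_PRODUCT (PRODUCT_KEY INT, PRODUCT TEXT,
                          EFFECTIVE_DATE TEXT, EXPIRY_DATE TEXT);
CREATE TABLE SAT_PRODUCT_CHANNEL (PRODUCT_KEY INT, CHANNEL TEXT,
                                  EFFECTIVE_DATE TEXT, EXPIRY_DATE TEXT);
INSERT INTO SAT_PRODUCT VALUES
  (73, 'Cheese',          '2009-01-01', '2010-06-05'),
  (73, 'Cheese - Yellow', '2010-06-05', '2011-04-04'),
  (73, 'Cheese - Gold',   '2011-04-04', '9999-12-31');
INSERT INTO SAT_PRODUCT_CHANNEL VALUES   -- hypothetical channel history
  (73, 'Retail', '2009-01-01', '2011-01-01'),
  (73, 'Online', '2011-01-01', '9999-12-31');
""")

rows = conn.execute("""
  SELECT p.PRODUCT, c.CHANNEL,
         MAX(p.EFFECTIVE_DATE, c.EFFECTIVE_DATE) AS EFFECTIVE_DATE,
         MIN(p.EXPIRY_DATE,  c.EXPIRY_DATE)      AS EXPIRY_DATE
  FROM SAT_PRODUCT p
  JOIN SAT_PRODUCT_CHANNEL c
    ON  p.PRODUCT_KEY    = c.PRODUCT_KEY
    AND p.EFFECTIVE_DATE < c.EXPIRY_DATE
    AND c.EFFECTIVE_DATE < p.EXPIRY_DATE
  ORDER BY 3
""").fetchall()
for r in rows:
    print(r)
```

The two inequality predicates in the JOIN clause select exactly the overlapping periods, and GREATEST/LEAST (MAX/MIN here) derive the effective and expiry dates of each cut record.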
## Consequences

This approach requires the timelines in all tables to be complete, ensuring referential integrity in the central Data Vault model. This means that every Hub has to have a record in the Satellite table with a start date of ‘01-01-1900’ and one that ends at ‘31-12-9999’ (this can be the same record if there is no history yet). Without these dummy records to complete the timelines, the query to calculate the overlaps becomes very complex: SQL filters records in the WHERE clause before joining to the other history set, so the selection on the date range has to be done in the JOIN clause, which makes it impossible to get the EXPIRY_DATE correct in one pass. The workaround in that case is to select only the EFFECTIVE_DATE values, order them, and join this data set back to itself in order to compare each row with the previous (or next, depending on the sort order) and derive the EXPIRY_DATE. In this context, adding dummy records to complete the timelines is the easier solution, and it also improves the integrity of the data in the Data Vault model.
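The self-join workaround described above can be sketched as follows: collect the distinct EFFECTIVE_DATE values, order them, and pair every date with the next one to derive the EXPIRY_DATE (the same effect a self-join or a LEAD() window function would achieve). The dates here are a mix of the sample product dates and hypothetical values, in ISO format.

```python
# Sketch: derive EXPIRY_DATE by pairing each ordered EFFECTIVE_DATE with
# the next one; the last period is closed with the high-end dummy date.
effective_dates = sorted({"2009-01-01", "2010-01-01", "2010-06-05",
                          "2011-01-01", "2011-04-04"})

# zip each date with its successor; '9999-12-31' closes the open period
periods = list(zip(effective_dates, effective_dates[1:] + ["9999-12-31"]))
for period in periods:
    print(period)
```

Each (effective, expiry) pair forms one gap-free period, which is exactly what the dummy records at ‘01-01-1900’ and ‘31-12-9999’ guarantee without needing this extra pass.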
## Known uses
This type of ETL process is to be used to join historical tables together in the Integration Layer.
## Related patterns
- Design Pattern 002 – Generic – Types of history
- Design Pattern 006 – Generic – Using Start, Process and End dates
- Design Pattern 008 – Data Vault – Loading Hub tables
- Design Pattern 009 – Data Vault – Loading Satellite tables
- Design Pattern 010 – Data Vault – Loading Link tables
## Discussion items (not yet to be implemented or used until final)