1000_Design_Patterns/Design Pattern - Data Vault - Creating Dimensions from Hub tables.md
10 additions & 5 deletions
@@ -1,4 +1,4 @@
-# Design Pattern - Data Vault - Creating Dimensions from Hub tables
+# Design Pattern - Data Vault - Simple Date Math (Joining two Time-Variant Tables)
## Purpose
This design pattern describes how to create a typical ‘Type 2 Dimension’ table (Dimensional Modelling) from a Data Vault or Hybrid EDW model.
@@ -9,9 +9,11 @@ Also known as
Dimensions / Dimensional Modelling
Gaps and islands
Timelines
## Applicability
This pattern is only applicable to loading processes from source systems or files to the Reporting Structure Area (of the Presentation Layer). The Helper Area may use similar concepts, but since the Helper Area is a ‘free-for-all’ part of the ETL Framework, following this Design Pattern there is not mandatory.
-Structure
+
+## Structure
Creating Dimensions from a Data Vault model essentially means joining the various Hub, Link and Satellite tables together to create a certain hierarchy. In the example shown in the following diagram, the Dimension that can be generated is a ‘Product’ dimension, with the Distribution Channel as a higher level in that hierarchy.
(Diagram: Business Insights > Design Pattern 019 - Creating Dimensions from Hub tables > BI7.png)
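The diagram itself is not reproduced here, but the structures it implies can be sketched as follows. This is an illustration only: all table and column names (HUB_PRODUCT, LNK_PRODUCT_CHANNEL, and so on) are assumptions, not taken from the pattern, and SQLite via Python stands in for the actual database platform.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A minimal Hub/Link/Satellite shape for a Product dimension with
# Distribution Channel as the higher level (names are illustrative).
conn.executescript("""
CREATE TABLE HUB_PRODUCT        (PRODUCT_SK INTEGER PRIMARY KEY, PRODUCT_CODE TEXT);
CREATE TABLE SAT_PRODUCT        (PRODUCT_SK INTEGER, PRODUCT_NAME TEXT,
                                 EFFECTIVE_DATE TEXT, EXPIRY_DATE TEXT);
CREATE TABLE HUB_DISTR_CHANNEL  (CHANNEL_SK INTEGER PRIMARY KEY, CHANNEL_CODE TEXT);
CREATE TABLE LNK_PRODUCT_CHANNEL(LINK_SK INTEGER PRIMARY KEY,
                                 PRODUCT_SK INTEGER, CHANNEL_SK INTEGER);
CREATE TABLE LSAT_PRODUCT_CHANNEL(LINK_SK INTEGER,
                                  EFFECTIVE_DATE TEXT, EXPIRY_DATE TEXT);
""")
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)
```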
@@ -150,19 +152,22 @@ WHERE
ELSE D.EXPIRY_DATE -- smallest of the two expiry dates
END)
-Implementation guidelines
+## Implementation guidelines
The easiest way to join multiple tables is a cascading set-based approach. This is done by joining the Hub and Satellite and treating the result as a single set, which is joined against another similar set of data (for instance a Link and Link-Satellite). The result is a new set of consistent timelines for a certain grain of information. This set can in turn be treated as a single set and joined with the next set (for instance another Hub and Satellite), and so forth.
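As a rough sketch of one step of this cascading set-based join, the following uses SQLite from Python. The table names, keys, and dates are invented for illustration, and ISO-formatted date strings are used so that `<`/`>` comparisons work; SQLite's two-argument MAX/MIN play the role of GREATEST/LEAST. Each pair of rows is joined where the date ranges overlap, and the combined timeline runs from the greatest of the two EFFECTIVE_DATEs to the smallest of the two EXPIRY_DATEs:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Two time-variant sets: a Hub+Satellite set and a Link+Link-Satellite set.
conn.executescript("""
CREATE TABLE hub_sat  (product_key INT, attr    TEXT, effective_date TEXT, expiry_date TEXT);
CREATE TABLE link_sat (product_key INT, channel TEXT, effective_date TEXT, expiry_date TEXT);
INSERT INTO hub_sat VALUES
  (1, 'Widget v1', '1900-01-01', '2021-06-01'),
  (1, 'Widget v2', '2021-06-01', '9999-12-31');
INSERT INTO link_sat VALUES
  (1, 'Online', '1900-01-01', '2021-03-01'),
  (1, 'Retail', '2021-03-01', '9999-12-31');
""")

# Join where the ranges overlap; the merged timeline starts at the greatest
# of the two effective dates and ends at the smallest of the two expiry dates.
rows = conn.execute("""
SELECT h.product_key, h.attr, l.channel,
       MAX(h.effective_date, l.effective_date) AS effective_date,
       MIN(h.expiry_date,    l.expiry_date)    AS expiry_date
FROM hub_sat h
JOIN link_sat l
  ON  h.product_key    = l.product_key
  AND h.effective_date < l.expiry_date
  AND l.effective_date < h.expiry_date
ORDER BY effective_date
""").fetchall()
for r in rows:
    print(r)
```

The result set is itself a consistent timeline and can be joined to the next Hub/Satellite pair in the same way.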
When creating a standard Dimension table it is recommended to assign new surrogate keys for every dimension record. The main reason is to prevent a combination of Integration Layer surrogate keys from being present in the associated Fact table; such a composite key range can become very wide. This also fits the classic approach to loading Facts and Dimensions, where the Fact table ETL performs a key lookup against the Dimension table. Using Data Vault as the Integration Layer opens up other options as well, but this is a well-known (and well-understood) type of ETL.
The original Integration Layer keys remain attributes of the new Dimension table.
Creating a Type 1 Dimension is easier; only the most recent records need to be joined.
Joining has to be done with < and > selections, which not every ETL tool supports (easily). This may require SQL overrides.
Some ETL tools or databases make the WHERE clause more readable by providing a ‘greatest’ or ‘least’ (smallest) function.
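For instance, the CASE expression in the code fragment above picks the smallest of two expiry dates; on a platform with a LEAST function (or SQLite's two-argument MIN, used in this illustrative sketch) the same result can be written more compactly. The date values here are invented, in ISO format so string comparison with `<` is valid:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Both expressions pick the smaller of the two dates.
case_expr, least_expr = conn.execute("""
SELECT CASE WHEN '2021-06-01' < '9999-12-31'
            THEN '2021-06-01' ELSE '9999-12-31'
       END,
       MIN('2021-06-01', '9999-12-31')
""").fetchone()
print(case_expr, least_expr)
```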
-Consequences
+
+## Considerations and consequences
This approach requires the timelines in all tables to be complete, which in turn ensures referential integrity in the central Data Vault model. Every Hub has to have a record in its Satellite table with a start date of ‘01-01-1900’ and one that ends at ‘31-12-9999’ (this can be the same record if there is no history yet). Without these dummy records to complete the timelines, the query to calculate the overlaps becomes very complex: SQL filters the records in the original WHERE clause before joining to the other history set, so the selection on the date range has to be done in the JOIN clause, which makes it impossible to derive the EXPIRY_DATE correctly in one pass. The workaround is to select only the EFFECTIVE_DATE values, order them, and join this dataset back to itself to compare each row with the previous one (or the next, depending on the sort order) and so derive the EXPIRY_DATE. In this context, adding dummy records to complete the timelines is the easier solution, and it also improves the integrity of the data in the Data Vault model.
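The self-join described above can also be expressed with a window function on platforms that support one. This sketch (illustrative names, SQLite via Python; window functions require SQLite 3.25 or later) derives each row's EXPIRY_DATE from the next row's EFFECTIVE_DATE, closing the open-ended row with the high-end dummy date:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sat_product (product_key INT, effective_date TEXT);
INSERT INTO sat_product VALUES
  (1, '2020-01-01'), (1, '2020-06-01'), (1, '2021-01-01');
""")

# LEAD() looks at the next EFFECTIVE_DATE per key; COALESCE closes the
# timeline with the dummy high date when there is no next row.
rows = conn.execute("""
SELECT product_key, effective_date,
       COALESCE(LEAD(effective_date) OVER
                  (PARTITION BY product_key ORDER BY effective_date),
                '9999-12-31') AS expiry_date
FROM sat_product
ORDER BY effective_date
""").fetchall()
for r in rows:
    print(r)
```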
## Known uses
This type of ETL process is used to join historical (time-variant) tables together in the Integration Layer.
-Related patterns
+
+## Related patterns
Design Pattern 002 – Generic – Types of history
Design Pattern 006 – Generic – Using Start, Process and End dates
Design Pattern 008 – Data Vault – Loading Hub tables