
Commit 580de03

committed
Initial creation
1 parent 66876b0 commit 580de03

File tree

37 files changed: +1462 additions, 0 deletions

Lines changed: 171 additions & 0 deletions
@@ -0,0 +1,171 @@
# Design Pattern - Data Vault - Creating Dimensions from Hub tables

## Purpose

This design pattern describes how to create a typical 'Type 2 Dimension' table (Dimensional Modelling) from a Data Vault or Hybrid EDW model.
## Motivation

To move from a Data Vault (or other Hybrid) model to a Kimball-style Star Schema or similar, various tables that store historical data need to be joined to each other. This is a recurring step which, if done properly, makes it easy to change dimension structures without losing history. Merging various historic sets of data is seen as one of the more complex steps in a Data Vault (or similar) environment. The pattern is called 'creating Dimensions from Hub tables' because Hubs are the main entities which are linked together, using their historical information and relationships, to form a Dimension.
## Also known as

- Dimensions / Dimensional Modelling
- Gaps and islands
- Timelines

## Applicability

This pattern is only applicable for loading processes from source systems or files to the Reporting Structure Area (of the Presentation Layer). The Helper Area may use similar concepts, but since this is a 'free-for-all' part of the ETL Framework it is not mandatory to follow this Design Pattern.
## Structure

Creating Dimensions from a Data Vault model essentially means joining the various Hub, Link and Satellite tables together to create a certain hierarchy. In the example displayed in the following diagram, the Dimension that can be generated is a 'Product' dimension, with the Distribution Channel as a higher level in this dimension.

Figure 1: Example Data Vault model (BI7.png)

Creating dimensions by joining tables with history means that the overlap in timelines (effective and expiry dates) will be 'cut' into multiple records with smaller intervals. This is explained using the following sample data sets; only the tables which contain history are shown.
SAT Product:

| Key | Product Name    | Effective Date | Expiry Date |
|-----|-----------------|----------------|-------------|
| 73  | - (dummy)       | 01-01-1900     | 01-01-2009  |
| 73  | Cheese          | 01-01-2009     | 05-06-2010  |
| 73  | Cheese – Yellow | 05-06-2010     | 04-04-2011  |
| 73  | Cheese – Gold   | 04-04-2011     | 31-12-9999  |

The first record is a dummy record created together with the Hub record. This was updated as part of the history / SCD updates. Before being joined to the other sets, this Satellite table is joined to the Hub table first. The Hub table maps the Data Warehouse key '73' to the business key 'CHS'.
SAT Product – Channel (Link-Satellite):

| Link Key | Product Key | Channel Key | Effective Date | Expiry Date |
|----------|-------------|-------------|----------------|-------------|
| 1        | 73          | -1 (dummy)  | 01-01-1900     | 01-01-2010  |
| 2        | 73          | 1           | 01-01-2010     | 04-03-2011  |
| 3        | 73          | 2           | 04-03-2011     | 31-12-9999  |

This set indicates that the product has been moved to a different sales channel over time.
When merging these two data sets into a dimension, the overlaps in time are calculated:

Figure 2: Timelines (BI8.png)
In other words, merging the two historic data sets, where one has 4 records (time periods) and the other has 3 records (time periods), results in a new set that has 6 ('smaller') records. This gives the following result data set:

| Dimension Key | Product Key | Product         | Channel Key | Effective Date | Expiry Date |
|---------------|-------------|-----------------|-------------|----------------|-------------|
| 1             | 73          | -               | -1          | 01-01-1900     | 01-01-2009  |
| 2             | 73          | Cheese          | -1          | 01-01-2009     | 01-01-2010  |
| 3             | 73          | Cheese          | 1           | 01-01-2010     | 05-06-2010  |
| 4             | 73          | Cheese – Yellow | 1           | 05-06-2010     | 04-03-2011  |
| 5             | 73          | Cheese – Yellow | 2           | 04-03-2011     | 04-04-2011  |
| 6             | 73          | Cheese – Gold   | 2           | 04-04-2011     | 31-12-9999  |
This result can be achieved by joining the tables on their usual keys and calculating the overlapping time ranges:

```sql
SELECT
    B.PRODUCT_NAME,
    C.CHANNEL_KEY,
    (CASE
        WHEN B.EFFECTIVE_DATE > D.EFFECTIVE_DATE
        THEN B.EFFECTIVE_DATE
        ELSE D.EFFECTIVE_DATE
     END) AS EFFECTIVE_DATE, -- greatest of the two effective dates
    (CASE
        WHEN B.EXPIRY_DATE < D.EXPIRY_DATE
        THEN B.EXPIRY_DATE
        ELSE D.EXPIRY_DATE
     END) AS EXPIRY_DATE -- smallest of the two expiry dates
FROM HUB_PRODUCT A
JOIN SAT_PRODUCT B ON A.PRODUCT_SK = B.PRODUCT_SK
JOIN LINK_PRODUCT_CHANNEL C ON A.PRODUCT_SK = C.PRODUCT_SK
JOIN SAT_LINK_PRODUCT_CHANNEL D ON D.PRODUCT_CHANNEL_SK = C.PRODUCT_CHANNEL_SK
WHERE
    (CASE
        WHEN B.EFFECTIVE_DATE > D.EFFECTIVE_DATE
        THEN B.EFFECTIVE_DATE
        ELSE D.EFFECTIVE_DATE
     END) -- greatest of the two effective dates
    <
    (CASE
        WHEN B.EXPIRY_DATE < D.EXPIRY_DATE
        THEN B.EXPIRY_DATE
        ELSE D.EXPIRY_DATE
     END) -- smallest of the two expiry dates
```
## Implementation guidelines
The easiest way to join multiple tables is a cascading, set-based approach. This is done by joining the Hub and Satellite and treating this as a single set, which is joined against another similar set of data (for instance a Link and Link-Satellite). The result is a new set of consistent timelines for a certain grain of information. This set can be treated as a single set again and joined with the next set (for instance a Hub and Satellite), and so forth, as sketched below.
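As an illustration, a minimal sketch of this cascading approach, reusing the Product/Channel merge from the Structure section as the first set. The SAT_CHANNEL table and its CHANNEL_NAME attribute are assumptions introduced for the next cascade and are not part of the sample data above:

```sql
-- Sketch only: cascade the merged Product/Channel timeline with a further
-- (assumed) SAT_CHANNEL timeline. Each cascade re-applies the same overlap logic.
WITH PRODUCT_CHANNEL_TIMELINE AS (
    SELECT
        A.PRODUCT_SK,
        B.PRODUCT_NAME,
        C.CHANNEL_KEY,
        CASE WHEN B.EFFECTIVE_DATE > D.EFFECTIVE_DATE THEN B.EFFECTIVE_DATE ELSE D.EFFECTIVE_DATE END AS EFFECTIVE_DATE,
        CASE WHEN B.EXPIRY_DATE    < D.EXPIRY_DATE    THEN B.EXPIRY_DATE    ELSE D.EXPIRY_DATE    END AS EXPIRY_DATE
    FROM HUB_PRODUCT A
    JOIN SAT_PRODUCT B ON A.PRODUCT_SK = B.PRODUCT_SK
    JOIN LINK_PRODUCT_CHANNEL C ON A.PRODUCT_SK = C.PRODUCT_SK
    JOIN SAT_LINK_PRODUCT_CHANNEL D ON D.PRODUCT_CHANNEL_SK = C.PRODUCT_CHANNEL_SK
    WHERE CASE WHEN B.EFFECTIVE_DATE > D.EFFECTIVE_DATE THEN B.EFFECTIVE_DATE ELSE D.EFFECTIVE_DATE END
        < CASE WHEN B.EXPIRY_DATE    < D.EXPIRY_DATE    THEN B.EXPIRY_DATE    ELSE D.EXPIRY_DATE    END
)
-- The merged result is treated as a single set and joined against the next historical set.
SELECT
    PC.PRODUCT_SK,
    PC.PRODUCT_NAME,
    PC.CHANNEL_KEY,
    E.CHANNEL_NAME,
    CASE WHEN PC.EFFECTIVE_DATE > E.EFFECTIVE_DATE THEN PC.EFFECTIVE_DATE ELSE E.EFFECTIVE_DATE END AS EFFECTIVE_DATE,
    CASE WHEN PC.EXPIRY_DATE    < E.EXPIRY_DATE    THEN PC.EXPIRY_DATE    ELSE E.EXPIRY_DATE    END AS EXPIRY_DATE
FROM PRODUCT_CHANNEL_TIMELINE PC
JOIN SAT_CHANNEL E ON PC.CHANNEL_KEY = E.CHANNEL_KEY
WHERE CASE WHEN PC.EFFECTIVE_DATE > E.EFFECTIVE_DATE THEN PC.EFFECTIVE_DATE ELSE E.EFFECTIVE_DATE END
    < CASE WHEN PC.EXPIRY_DATE    < E.EXPIRY_DATE    THEN PC.EXPIRY_DATE    ELSE E.EXPIRY_DATE    END;
```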
When creating a standard Dimension table it is recommended to assign new surrogate keys for every dimension record. The only reason for this is to prevent a combination of Integration Layer surrogate keys from being present in the associated Fact table; the range of keys can become very wide. This also fits in with the classic approach towards loading Facts and Dimensions, where the Fact table ETL performs a key lookup against the Dimension table. Using Data Vault as the Integration Layer opens up other options as well, but this is a well-known (and well-understood) type of ETL. The original Integration Layer keys remain attributes of the new Dimension table, as sketched below.
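A minimal sketch of this, assuming the merged timeline result has been materialised as MERGED_TIMELINES and that the new key is generated at load time (an identity column or sequence would be the more common choice in practice); all names are illustrative:

```sql
-- Sketch only: assign a new Dimension key and keep the Integration Layer key
-- (PRODUCT_SK) as a regular attribute of the Dimension.
INSERT INTO DIM_PRODUCT (DIM_PRODUCT_KEY, PRODUCT_SK, PRODUCT_NAME, CHANNEL_KEY, EFFECTIVE_DATE, EXPIRY_DATE)
SELECT
    ROW_NUMBER() OVER (ORDER BY PRODUCT_SK, EFFECTIVE_DATE) AS DIM_PRODUCT_KEY, -- new surrogate key (initial load only)
    PRODUCT_SK,   -- original Integration Layer key, kept as an attribute
    PRODUCT_NAME,
    CHANNEL_KEY,
    EFFECTIVE_DATE,
    EXPIRY_DATE
FROM MERGED_TIMELINES;
```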
Creating a Type 1 Dimension is easier; only the most recent records need to be joined.

Joining has to be done with < and > selections, which not every ETL tool supports (easily). This may require SQL overrides.

Some ETL tools or databases make the WHERE clause a bit more readable by providing GREATEST and LEAST functions, as shown in the sketch below.
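For example, on a platform that provides these functions (e.g. Oracle, PostgreSQL, or SQL Server 2022 and later), the query from the Structure section can be written more compactly; a sketch:

```sql
-- Same overlap logic as the CASE-based query above, expressed with GREATEST / LEAST.
SELECT
    B.PRODUCT_NAME,
    C.CHANNEL_KEY,
    GREATEST(B.EFFECTIVE_DATE, D.EFFECTIVE_DATE) AS EFFECTIVE_DATE,
    LEAST(B.EXPIRY_DATE, D.EXPIRY_DATE)          AS EXPIRY_DATE
FROM HUB_PRODUCT A
JOIN SAT_PRODUCT B ON A.PRODUCT_SK = B.PRODUCT_SK
JOIN LINK_PRODUCT_CHANNEL C ON A.PRODUCT_SK = C.PRODUCT_SK
JOIN SAT_LINK_PRODUCT_CHANNEL D ON D.PRODUCT_CHANNEL_SK = C.PRODUCT_CHANNEL_SK
WHERE GREATEST(B.EFFECTIVE_DATE, D.EFFECTIVE_DATE) < LEAST(B.EXPIRY_DATE, D.EXPIRY_DATE);
```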
## Consequences

This approach requires the timelines in all tables to be complete, which also supports referential integrity in the central Data Vault model. It means that every Hub key has to have a record in the Satellite table with a start date of '01-01-1900' and one which ends at '31-12-9999' (this can be the same record if there is no history yet). Without these dummy records to complete the timelines, the query to calculate the overlaps becomes much more complex: SQL filters the records in the WHERE clause before joining to the other history set, so the selection on the date range has to be done in the JOIN clause, which makes it impossible to get the EXPIRY_DATE correct in one pass. The workaround in that case is to select only the EFFECTIVE_DATE values, order these, and join this data set back to itself in order to compare each row with the previous (or next, depending on the sort order) and derive the EXPIRY_DATE, as sketched below. In this context, adding dummy records to complete the timelines is the easier solution, and it also improves the integrity of the data in the Data Vault model.
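As an illustration of that workaround, a sketch using a window function; the LEAD function and the UNION of effective dates from both sets are assumptions about the platform and the model, not part of the pattern itself:

```sql
-- Sketch only: derive EXPIRY_DATE per product by looking ahead to the next
-- EFFECTIVE_DATE across both historical sets, closing the last record with the high date.
SELECT
    PRODUCT_SK,
    EFFECTIVE_DATE,
    COALESCE(
        LEAD(EFFECTIVE_DATE) OVER (PARTITION BY PRODUCT_SK ORDER BY EFFECTIVE_DATE),
        CAST('9999-12-31' AS DATE)
    ) AS EXPIRY_DATE
FROM (
    SELECT B.PRODUCT_SK, B.EFFECTIVE_DATE
    FROM SAT_PRODUCT B
    UNION
    SELECT C.PRODUCT_SK, D.EFFECTIVE_DATE
    FROM LINK_PRODUCT_CHANNEL C
    JOIN SAT_LINK_PRODUCT_CHANNEL D ON D.PRODUCT_CHANNEL_SK = C.PRODUCT_CHANNEL_SK
) ALL_EFFECTIVE_DATES;
```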
## Known uses

This type of ETL process is to be used to join historical tables together in the Integration Layer.

## Related patterns

- Design Pattern 002 – Generic – Types of history
- Design Pattern 006 – Generic – Using Start, Process and End dates
- Design Pattern 008 – Data Vault – Loading Hub tables
- Design Pattern 009 – Data Vault – Loading Satellite tables
- Design Pattern 010 – Data Vault – Loading Link tables

## Discussion items (not yet to be implemented or used until final)

None.
Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
# Design Pattern - Data Vault - Loading Hub tables

## Purpose

This Design Pattern describes how to load data into Data Vault Hub style entities.
## Motivation

Loading data into Hub tables is a relatively straightforward process with a fixed location in the scheduling of loading data from the Staging Layer to the Integration Layer. It is a vital component of the Data Warehouse architecture, making sure that Data Warehouse keys are distributed properly and at the right point in time. Decoupling key distribution from historical information is an essential requirement for parallel processing and for reducing dependencies in the loading process. This pattern specifies how this process works and why it is important to follow it. In a Data Vault based Enterprise Data Warehouse solution, the Hub tables (and corresponding ETL) are the only places where Data Warehouse keys are distributed.
## Also known as

- Hub (Data Vault modelling concept)
- Surrogate Key (SK) or Hash Key (HSH) distribution
- Data Warehouse key distribution

## Applicability

This pattern is applicable for the process of loading from the Staging Layer into the Integration Area Hub tables only.
## Structure

The ETL process can be described as an 'insert only' set of the unique business keys. The process performs a SELECT DISTINCT on the Staging Area table and a key lookup to retrieve the OMD_RECORD_SOURCE_ID based on the value in the Staging Layer table. If no entry for the record source is found, the ETL process is set to fail, because this indicates a major error in the ETL Framework configuration (i.e. this must be tested during unit and UAT testing).

Using this value and the source business key, the process performs a key lookup (outer join) to verify whether that specific business key already exists in the target Hub table (for that particular record source). If it exists the row can be discarded; if not, it can be inserted. A set-based sketch of this logic is included after the diagram below.
Figure 1: Hub ETL process (image2015-4-29 14:54:58.png)
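A minimal set-based sketch of the Hub loading logic. The STG_CUSTOMER / HUB_CUSTOMER tables, the CUSTOMER_ID business key and the OMD_RECORD_SOURCE reference table are illustrative names only, and the 'fail when the record source is unknown' behaviour is left to the ETL framework:

```sql
-- Insert-only load of new business keys into a Hub table (sketch; names are illustrative).
-- The Hub surrogate key (CUSTOMER_SK) is assumed to be generated by an identity column or sequence.
INSERT INTO HUB_CUSTOMER (CUSTOMER_ID, OMD_RECORD_SOURCE_ID, OMD_INSERT_DATETIME)
SELECT DISTINCT
    stg.CUSTOMER_ID,                 -- the designated business key
    rs.OMD_RECORD_SOURCE_ID,         -- key lookup against the record source reference table
    stg.OMD_INSERT_DATETIME          -- copied from the Staging Area, not sysdate
FROM STG_CUSTOMER stg
JOIN OMD_RECORD_SOURCE rs
    ON rs.RECORD_SOURCE_CODE = stg.OMD_RECORD_SOURCE
WHERE NOT EXISTS (                   -- key lookup: discard business keys already present in the Hub
    SELECT 1
    FROM HUB_CUSTOMER hub
    WHERE hub.CUSTOMER_ID = stg.CUSTOMER_ID
      AND hub.OMD_RECORD_SOURCE_ID = rs.OMD_RECORD_SOURCE_ID -- per the pattern, the check is per record source
);
```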
Additionally, for every new Data Warehouse key a corresponding initial (dummy) Satellite record must be created to ensure complete timelines. Depending on the available technology this can be implemented as part of the Hub or the Satellite Module, but as each Hub can contain multiple Satellites it is recommended to implement it in the Satellite process. This is explained in more detail in the Implementation Guidelines and Consequences sections.

The Hub ETL processes are the first ones that need to be executed in the Integration Area. Once the Hub tables have been populated or updated, the related Satellite and Link tables can be run in parallel. This is displayed in the following diagram:
Figure 2: Dependencies (BI2.png)
Logically, the creation of the initial Satellite record is part of the ETL process for Hub tables and is a prerequisite for further processing of the Satellites.
## Implementation guidelines
Use a single ETL process, module or mapping to load the Hub table, thus improving flexibility in processing. This means that no Hub keys will be distributed as part of another ETL process.

Multiple passes of the same source table or file are usually required for various tasks. The first pass will insert new keys in the Hub table; the other passes may be needed to populate the Satellite and Link tables.
The designated business key (usually the source natural key, but not always!) is the ONLY non-process or Data Warehouse related attribute in the Hub table, as illustrated in the sketch below.
Do not tag every record with the system date/time (sysdate as OMD_INSERT_DATETIME) but copy this value from the Staging Area. This improves ETL flexibility. The Staging Area ETL is designed to label every record which is processed by the same module with the correct date/time: the date/time the record has been loaded into the Data Warehouse environment (event date/time). The OMD model will track when records have been loaded physically through the Insert Module Instance ID.
Multiple ETL processes may load the same business key into the corresponding Hub table if the business key exists in more than one source table. This also means that the ETL software must implement dynamic caching to avoid duplicate inserts when similar processes run in parallel.

By default the DISTINCT is executed at database level to reserve resources for the ETL engine, but it can be executed in the ETL tool as well if that supports proper resource distribution (i.e. a light database server but a powerful ETL server).
The logic to create the initial (dummy) Satellite record can be implemented as part of the Hub ETL process, as a separate ETL process which queries all keys that have no corresponding dummy record, or as part of the Satellite ETL process. This depends on the capabilities of the ETL software, since not all tools are able to provide and reuse sequence generators or to write to multiple targets in one process. The default and arguably most flexible way is to incorporate this concept as part of the Satellite ETL, since it does not require rework when additional Satellites are associated with the Hub. This means that each Satellite ETL must check whether a dummy record exists before starting the standard process (and must be able to roll back the dummy records if required); a sketch of this check is shown below.
When modeling the Hub tables, try to be conservative when defining the business keys. Not every foreign key in the source indicates a business key and therefore a Hub table. A true business key is a concept that is known and used throughout the organisation (and systems) and is 'self-standing' and meaningful.
To cater for a situation where multiple OMD_INSERT_DATETIME values exist for a single business key, the minimum OMD_INSERT_DATETIME should be the value passed through with the Hub record. This can be implemented in ETL logic, or pushed down to the database. When implemented at database level, using the MIN function with a GROUP BY on the business key (instead of a SELECT DISTINCT) achieves both the distinct selection and the minimum OMD_INSERT_DATETIME in one step, as sketched below.
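A sketch of this database-level variant (illustrative names), combining the distinct selection and the earliest load date/time in a single pass:

```sql
-- One pass over the Staging Area: distinct business keys with the earliest OMD_INSERT_DATETIME.
SELECT
    stg.CUSTOMER_ID,
    MIN(stg.OMD_INSERT_DATETIME) AS OMD_INSERT_DATETIME
FROM STG_CUSTOMER stg
GROUP BY stg.CUSTOMER_ID;
```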
## Consequences
Multiple passes on the source data are likely to be required: once for Hub tables and subsequently for Link and Satellite tables. Defining Hub ETL processes in the atomic way described in this Design Pattern means that many files load data to the same central Hub table; all processes will be very similar, with the only difference being the mapping between the source attribute which represents the business key and its Hub counterpart.

A single Hub may be loaded by many Modules from a single source system, and there may be several Satellites for the source system hanging off this Hub. It needs to be ensured that all corresponding Satellites are populated by the Hub ETL.
## Known uses

This type of ETL process is to be used for all Hub or Surrogate Key tables in the Integration Area. The Interpretation Area Hub tables, if used, have similar characteristics, but the ETL process contains business logic.
## Related patterns

- Design Pattern 006 – Generic – Using Start, Process and End Dates
- Design Pattern 009 – Data Vault – Loading Satellite tables
- Design Pattern 010 – Data Vault – Loading Link tables
- Design Pattern 023 – Data Vault – Missing keys and placeholders
## Discussion items (not yet to be implemented or used until final)

The OMD_INSERT_DATETIME attribute, which implements the Event Date/Time concept, is currently populated in a different way than similar OMD information (such as OMD_UPDATE_DATETIME). It may be easier to introduce a dedicated OMD_EVENT_DATETIME attribute that captures this information.
Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
# Design Pattern - Data Vault - Loading Link Satellite tables

## Purpose

This Design Pattern describes how to load data into Link-Satellite tables within a 'Data Vault' EDW architecture. In Data Vault, Link-Satellite tables manage the changes in relationships over time.
## Motivation
## Also known as

- Link-Satellite (Data Vault modelling concept)
- History or INT tables

## Applicability
This pattern is only applicable for loading data to Link-Satellite tables from:

- The Staging Area into the Integration Area.
- The Integration Area into the Interpretation Area.

The only difference to the specified ETL template is any business logic required in the mappings towards the Interpretation Area tables.
## Structure

Standard Link-Satellites use the Driving Key concept to manage the ending of 'old' relationships: when a new relationship is recorded for the same driving key, the previous relationship for that key is closed off.
## Implementation guidelines
Multiple passes of the same source table or file are usually required. The first pass will insert new keys in the Hub table; the other passes are needed to populate the Satellite and Link tables.

Select all records for the Link-Satellite which have more than one open effective date / current record indicator but are not the most recent (because that record does not need to be closed), for example:
```sql
-- Identify Driving Keys with more than one open Link-Satellite record and
-- recalculate the effective / expiry dates so that superseded relationships can be closed.
WITH MyCTE (<Link SK>, <Driving Key SK>, OMD_EFFECTIVE_DATE, OMD_EXPIRY_DATE, RowVersion)
AS (
    SELECT
        A.<Link SK>, B.<Driving Key SK>, A.OMD_EFFECTIVE_DATE, A.OMD_EXPIRY_DATE,
        DENSE_RANK() OVER (PARTITION BY B.<Driving Key SK> ORDER BY B.<Link SK>, OMD_EFFECTIVE_DATE ASC) AS RowVersion
    FROM <Link Sat table> A
    JOIN <Link table> B ON A.<Link SK> = B.<Link SK>
    JOIN (
        -- Driving Keys that have more than one open (unexpired) record
        SELECT <Driving Key SK>
        FROM <Link Sat table> A
        JOIN <Link table> B ON A.<Link SK> = B.<Link SK>
        WHERE A.OMD_EXPIRY_DATE = '99991231'
        GROUP BY <Driving Key SK>
        HAVING COUNT(*) > 1
    ) C ON B.<Driving Key SK> = C.<Driving Key SK>
)
SELECT
    BASE.<Link SK>
    ,CASE WHEN LAG.OMD_EFFECTIVE_DATE IS NULL THEN '19000101' ELSE BASE.OMD_EFFECTIVE_DATE END AS OMD_EFFECTIVE_DATE
    ,CASE WHEN LEAD.OMD_EFFECTIVE_DATE IS NULL THEN '99991231' ELSE LEAD.OMD_EFFECTIVE_DATE END AS OMD_EXPIRY_DATE
    ,CASE WHEN LEAD.OMD_EFFECTIVE_DATE IS NULL THEN 'Y' ELSE 'N' END AS OMD_CURRENT_RECORD_INDICATOR
FROM MyCTE BASE
LEFT JOIN MyCTE LEAD ON BASE.<Driving Key SK> = LEAD.<Driving Key SK>
    AND BASE.RowVersion = LEAD.RowVersion - 1   -- the next version for the same Driving Key
LEFT JOIN MyCTE LAG ON BASE.<Driving Key SK> = LAG.<Driving Key SK>
    AND BASE.RowVersion = LAG.RowVersion + 1    -- the previous version for the same Driving Key
WHERE BASE.OMD_EXPIRY_DATE = '99991231'
```
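The output of this query can then be applied back to the Link-Satellite to close the superseded relationships. A sketch in SQL Server style syntax, assuming the result above has been materialised as RECALCULATED_TIMELINES and that each Link key has at most one open record:

```sql
-- Sketch only: end-date the open Link-Satellite records with the recalculated values.
UPDATE lsat
SET lsat.OMD_EXPIRY_DATE              = rec.OMD_EXPIRY_DATE,
    lsat.OMD_CURRENT_RECORD_INDICATOR = rec.OMD_CURRENT_RECORD_INDICATOR
FROM <Link Sat table> lsat
JOIN RECALCULATED_TIMELINES rec
    ON lsat.<Link SK> = rec.<Link SK>
WHERE lsat.OMD_EXPIRY_DATE = '99991231';
```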
## Consequences

Multiple passes on source data are likely to be required.
## Known uses
## Related patterns

- Design Pattern 006 – Generic – Using Start, Process and End Dates
- Design Pattern 009 – Data Vault – Loading Satellite tables
- Design Pattern 010 – Data Vault – Loading Link tables
## Discussion items (not yet to be implemented or used until final)

None.
