Design_Patterns/Design Pattern - Data Vault - Loading Hub tables.md (+12 -13)
@@ -1,25 +1,24 @@
# Design Pattern - Data Vault - Loading Hub tables
## Purpose
- This Design Pattern describes how to load data into Data Vault Hub style entities.
+ This Design Pattern describes how to load data into Data Vault Hub style tables. It is a specification of the Hub ETL process.
## Motivation
- Loading data into Hub tables is a relatively straightforward process with a set location in the architecture: it is applied when loading data from the Staging Layer to the Integration Layer. It is a vital component of the Data Warehouse architecture, making sure that Data Warehouse keys are distributed properly and at the right point in time.
+ Loading data into Hub tables is a relatively straightforward process with a clearly defined location in the architecture: it is applied when loading data from the Staging Layer to the Integration Layer.
- Decoupling key distribution and historical information is an essential requirement for reducing dependencies in the loading process and enabling flexible storage design in the Data Warehouse.
+ The Hub is a vital component of a Data Vault solution, making sure that Data Warehouse keys are distributed properly and at the right point in time.
- This pattern specifies how the Hub ETL process works and why it is important to follow.
-
- In a Data Vault based Enterprise Data Warehouse solution, the Hub tables (and corresponding ETL) are the only places where Data Warehouse keys are distributed.
+ Decoupling key distribution from the management of historical information (changes over time) is essential to reduce loading dependencies. It also simplifies (flexible) storage design in the Data Warehouse.
Also known as:
+ - Core Business Concept (Ensemble modelling)
- Hub (Data Vault modelling concept)
- - Surrogate Key (SK) or Hash Key (HSH) distribution
+ - Surrogate Key (SK) or Hash Key (HSH) distribution, as commonly used implementations of the concept
- Data Warehouse key distribution
## Applicability
- This pattern is applicable for the process of loading from the Staging Layer into the Integration Area Hub tables. It is used in all Hub in the Integration Layer. Derived (Business Data Vault) Hub tables follow the same pattern, but with business logic applied.
+ This pattern is applicable for the process of loading from the Staging Layer into Hub tables. It is used in all Hubs in the Integration Layer. Derived (Business Data Vault) Hub ETL processes follow the same pattern.
## Structure
A Hub table contains the unique list of business keys, and the corresponding Hub ETL process can be described as an ‘insert only’ of the unique business keys that are not yet in the target Hub.
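To make the ‘insert only’ behaviour concrete, a minimal SQL sketch follows. It is not part of the change set; the table and column names (STG_CUSTOMER, HUB_CUSTOMER, CUSTOMER_HSH, CUSTOMER_CODE) are assumed for illustration, and the staging set is assumed to hold one record per business key (the multi-timestamp case is covered further down).

```sql
-- Hedged sketch of the 'insert only' Hub pattern described above.
-- STG_CUSTOMER, HUB_CUSTOMER and all column names are assumed for illustration;
-- the staging set is assumed to hold one record per business key.
INSERT INTO HUB_CUSTOMER (CUSTOMER_HSH, CUSTOMER_CODE, LOAD_DATETIME, RECORD_SOURCE)
SELECT DISTINCT
  stg.CUSTOMER_HSH,   -- hash key derived from the business key
  stg.CUSTOMER_CODE,  -- the business key itself
  stg.LOAD_DATETIME,
  stg.RECORD_SOURCE
FROM STG_CUSTOMER stg
WHERE NOT EXISTS (
  -- only business keys that are not yet in the target Hub
  SELECT 1
  FROM HUB_CUSTOMER hub
  WHERE hub.CUSTOMER_HSH = stg.CUSTOMER_HSH
);
```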
@@ -44,15 +43,15 @@ The logic to create the initial (dummy) Satellite record can both be implemented
When modeling the Hub tables, try to be conservative when defining the business keys. Not every foreign key in the source indicates a business key and therefore a Hub table. A true business key is a concept that is known and used throughout the organisation (and systems) and is ‘self-standing’ and meaningful.
- To cater for a situation where multiple OMD_INSERT_DATETIME values exist for a single business key, the minimum OMD_INSERT_DATETIME should be the value passed through with the HUB record. This can be implemented in ETL logic, or passed through to the database. When implemented at a database level, instead of using a SELECT DISTINCT, using the MIN function with a GROUP BY the business key can achieve both a distinct selection, and minimum OMD_INSERT_DATETIME in one step.
+ To cater for a situation where multiple Load Date / Time stamp values exist for a single business key, the minimum Load Date / Time stamp should be the value passed through with the Hub record. This can be implemented in ETL logic, or pushed down to the database. When implemented at database level, using the MIN function with a GROUP BY on the business key (instead of a SELECT DISTINCT) achieves both a distinct selection and the minimum Load Date / Time stamp in one step.
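A sketch of this database-level variant, reusing the assumed names from the previous example; the GROUP BY / MIN combination replaces both the SELECT DISTINCT and a separate minimum-timestamp step:

```sql
-- Hedged sketch: GROUP BY with MIN() achieves the distinct selection and the
-- minimum Load Date / Time stamp in a single step.
-- STG_CUSTOMER / HUB_CUSTOMER and the column names are assumed for illustration.
INSERT INTO HUB_CUSTOMER (CUSTOMER_HSH, CUSTOMER_CODE, LOAD_DATETIME, RECORD_SOURCE)
SELECT
  stg.CUSTOMER_HSH,
  stg.CUSTOMER_CODE,
  MIN(stg.LOAD_DATETIME) AS LOAD_DATETIME,  -- earliest time the key was received
  MIN(stg.RECORD_SOURCE) AS RECORD_SOURCE   -- deterministic pick if sources differ
FROM STG_CUSTOMER stg
WHERE NOT EXISTS (
  SELECT 1
  FROM HUB_CUSTOMER hub
  WHERE hub.CUSTOMER_HSH = stg.CUSTOMER_HSH
)
GROUP BY
  stg.CUSTOMER_HSH,
  stg.CUSTOMER_CODE;
```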
## Considerations and Consequences
Multiple passes on the same Staging Layer data set are likely to be required: once for the Hub table(s) but also for any corresponding Link and Satellite tables.
Defining Hub ETL processes as atomic modules, as defined in this Design Pattern, means that many Staging Layer tables load data to the same central Hub table. All processes will be very similar with the only difference being the mapping between the Staging Layer business key attribute and the target Hub business key counterpart.
## Related Patterns
- Design Pattern 006 – Generic – Using Start, Process and End Dates
- Design Pattern 009 – Data Vault – Loading Satellite tables
- Design Pattern 010 – Data Vault – Loading Link tables
- Design Pattern 023 – Data Vault – Missing keys and placeholders
+ * Design Pattern 006 – Generic – Using Start, Process and End Dates
+ * Design Pattern 009 – Data Vault – Loading Satellite tables
+ * Design Pattern 010 – Data Vault – Loading Link tables
+ * Design Pattern 023 – Data Vault – Missing keys and placeholders
Design_Patterns/Design Pattern - Data Vault - Loading Link Satellite tables.md (+17 -20)
@@ -5,57 +5,54 @@ This Design Pattern describes how to load data into Link-Satellite tables within
## Motivation
- Also known as
- Link-Satellite (Data Vault modelling concept).
- History or INT tables.
+ To provide a generic approach for loading Link Satellites.
## Applicability
+
This pattern is only applicable for loading data to Link-Satellite tables from:
- The Staging Area into the Integration Area.
- The Integration Area into the Interpretation Area.
- The only difference to the specified ETL template is any business logic required in the mappings towards the Interpretation Area tables.
+
+ * The Staging Area into the Integration Area.
+ * The Integration Area into the Interpretation Area.
+
+ The only difference to the specified ETL template is any business logic required in the mappings towards the Interpretation Area tables.
## Structure
Standard Link-Satellites use the Driving Key concept to manage the ending of ‘old’ relationships.
## Implementation Guidelines
Multiple passes of the same source table or file are usually required. The first pass will insert new keys in the Hub table; the other passes are needed to populate the Satellite and Link tables.
Select all records for the Link Satellite which have more than one open effective date / current record indicator but are not the most recent (because that record does not need to be closed).
- WITH MyCTE (<LinkSK>, <DrivingKeySK>, OMD_EFFECTIVE_DATE, OMD_EXPIRY_DATE, RowVersion)
+ WITH MyCTE (<LinkSK>, <DrivingKeySK>, <Effective Date/Time>, <Expiry Date/Time>, RowVersion)
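The diff truncates after the CTE header, so the body below is an assumed completion rather than the repository's actual query: it numbers the open records per Driving Key and returns every open record that is not the most recent, i.e. the relationships that must be end-dated. LINK_SAT_EXAMPLE and the column names are placeholders.

```sql
-- Assumed completion of the truncated CTE above (not the repository's actual query).
-- LINK_SAT_EXAMPLE, LINK_SK, DRIVING_KEY_SK and the date columns are placeholders.
WITH MyCTE (LINK_SK, DRIVING_KEY_SK, EFFECTIVE_DATETIME, EXPIRY_DATETIME, RowVersion) AS
(
  SELECT
    LINK_SK,
    DRIVING_KEY_SK,
    EFFECTIVE_DATETIME,
    EXPIRY_DATETIME,
    -- number the open records per Driving Key, most recent first
    ROW_NUMBER() OVER (PARTITION BY DRIVING_KEY_SK
                       ORDER BY EFFECTIVE_DATETIME DESC) AS RowVersion
  FROM LINK_SAT_EXAMPLE
  WHERE EXPIRY_DATETIME = '9999-12-31'  -- open (current) records only
)
SELECT LINK_SK, DRIVING_KEY_SK, EFFECTIVE_DATETIME, EXPIRY_DATETIME
FROM MyCTE
WHERE RowVersion > 1;  -- open records that are not the most recent must be closed
```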
Design_Patterns/Design Pattern - Data Vault - Loading Satellite tables.md (+3 -3)
@@ -32,10 +32,10 @@ Figure 2: Dependencies
Multiple passes of the same source table or file are usually required. The first pass will insert new keys in the Hub table; the other passes are needed to populate the Satellite and Link tables.
The process in Figure 1 shows the entire ETL in one single process. For specific tools this way of developing ETL might be relatively inefficient. Therefore, the process can also be broken up into two separate mappings; one for inserts and one for updates. Logically the same actions will be executed, but physically two separate mappings can be used. This can be done in two ways:
Follow the same logic, with the same selects, but place filters for the update and insert branches. This leads to an extra pass on the source table, at the possible benefit of running the processes in parallel.
- Only run the insert branch and automatically update the end dates based on the existing information in the Satellite. This process selects all records in the Satellite which have more than one open OMD_EXPIRY_DATE (this is the case after running the insert branch separately), sorts the records in order and uses the OMD_EFFECTIVE_DATE from the previous record to close the next one. This introduces a dependency between the insert and update branch, but will run faster. An extra benefit is that this also closes off any previous records that were left open. As sample query for this selection is:
+ Only run the insert branch and automatically update the end dates based on the existing information in the Satellite. This process selects all records in the Satellite which have more than one open EXPIRY_DATE (this is the case after running the insert branch separately), sorts the records in order and uses the EFFECTIVE_DATE from the previous record to close the next one. This introduces a dependency between the insert and update branches, but will run faster. An extra benefit is that this also closes off any previous records that were left open. A sample query for this selection is:
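The sample query itself is cut off in this extract. A plausible reconstruction under the stated logic is sketched below; SAT_EXAMPLE, HUB_SK and the date columns are assumed names, and LEAD() is used to derive each record's end date from the next record's EFFECTIVE_DATE for the same key.

```sql
-- Assumed reconstruction of the missing sample query (cut off in this extract).
-- SAT_EXAMPLE, HUB_SK and the date columns are placeholder names.
SELECT
  HUB_SK,
  EFFECTIVE_DATE,
  -- close each record with the EFFECTIVE_DATE of the next record for the same key;
  -- the most recent record keeps the open-ended expiry date
  LEAD(EFFECTIVE_DATE, 1, '9999-12-31')
    OVER (PARTITION BY HUB_SK ORDER BY EFFECTIVE_DATE) AS NEW_EXPIRY_DATE
FROM SAT_EXAMPLE
WHERE EXPIRY_DATE = '9999-12-31';  -- records still left open after the insert branch
```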