# Design Pattern - Data Vault - Loading Hub tables
1
+
# Design Pattern - Data Vault - Hub table
2
2
3
3
## Purpose
4
+
4
5
This Design Pattern describes how to load data into Data Vault Hub style tables. It is a specification of the Hub ETL process.
5
6
6
7
## Motivation
7
-
Loading data into Hub tables is a relatively straightforward process with a clearly defined location in the architecture: it is applied when loading data from the Staging Layer to the Integration Layer.
8
+
9
+
Loading data into Hub tables is a relatively straightforward process with a clearly defined location in the architecture: it is applied when loading data from the Staging Layer to the Integration Layer.
8
10
9
11
The Hub is a vital component of a Data Vault solution, making sure that Data Warehouse keys are distributed properly and at the right point in time.
10
12
11
-
Decoupling key distribution and managing historical information (changes over time) is essential to reduce loading dependencies. It also simplifies (flexible) storage design in the Data Warehouse.
13
+
Decoupling key distribution and managing historical information (changes over time) is essential to reduce loading dependencies. It also simplifies (flexible) storage design in the Data Warehouse.
12
14
13
15
Also known as:
14
16
@@ -18,17 +20,20 @@ Also known as:
18
20
- Data Warehouse key distribution
19
21
20
22
## Applicability
23
+
21
24
This pattern is applicable for the process of loading from the Staging Layer into Hub tables. It is used in all Hubs in the Integration Layer. Derived (Business Data Vault) Hub ETL processes follow the same pattern.
22
25
23
26
## Structure
24
-
A Hub table contains the unique list of business keys, and the corresponding Hub ETL process can be described as an ‘insert only’ of the unique business keys that are not yet in the target Hub.
27
+
28
+
A Hub table contains the unique list of business keys, and the corresponding Hub ETL process can be described as an 'insert only' of the unique business keys that are not yet in the target Hub.
25
29
26
30
The process performs a distinct selection on the business key attribute(s) in the Staging Area table and performs a key lookup to verify whether each business key already exists in the target Hub table. If the business key already exists, the row can be discarded; if not, it is inserted.
27
31
28
32
During the selection the key distribution approach is implemented to make sure a dedicated Data Warehouse key is created. This can be an integer value, a hash key (e.g. MD5 or SHA1) or a natural business key.
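The Hub load described above (distinct selection, key lookup, insert of new keys only) can be sketched as follows. This is a minimal illustration rather than a reference implementation: the table and column names (`stg_customer`, `hub_customer`, `customer_code`, `customer_hk`, `load_dts`) are hypothetical, an in-memory SQLite database stands in for the Staging and Integration Layers, and MD5 is used only because it is one of the key options named above.

```python
import hashlib
import sqlite3


def hash_key(business_key: str) -> str:
    """Derive a Data Warehouse key from the business key (MD5 variant)."""
    return hashlib.md5(business_key.encode("utf-8")).hexdigest()


def load_hub(conn: sqlite3.Connection, load_dts: str) -> int:
    """Insert business keys that are not yet in the Hub; return the insert count."""
    # Distinct selection on the business key in the staging table.
    staged = [
        row[0]
        for row in conn.execute(
            "SELECT DISTINCT customer_code FROM stg_customer "
            "WHERE customer_code IS NOT NULL"
        )
    ]
    # Key lookup against the target Hub.
    existing = {row[0] for row in conn.execute("SELECT customer_code FROM hub_customer")}
    # Existing keys are discarded, new keys are inserted ('insert only').
    new_keys = [key for key in staged if key not in existing]
    conn.executemany(
        "INSERT INTO hub_customer (customer_hk, customer_code, load_dts) VALUES (?, ?, ?)",
        [(hash_key(key), key, load_dts) for key in new_keys],
    )
    return len(new_keys)


# Hypothetical staging and Hub tables for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stg_customer (customer_code TEXT, name TEXT)")
conn.execute(
    "CREATE TABLE hub_customer ("
    "customer_hk TEXT PRIMARY KEY, customer_code TEXT UNIQUE, load_dts TEXT)"
)
conn.executemany(
    "INSERT INTO stg_customer VALUES (?, ?)",
    [("C001", "Alice"), ("C001", "Alice B."), ("C002", "Bob")],
)

first_run = load_hub(conn, "2024-01-01T00:00:00")   # new keys are inserted
second_run = load_hub(conn, "2024-01-02T00:00:00")  # nothing new to insert
hub_rows = conn.execute("SELECT COUNT(*) FROM hub_customer").fetchone()[0]
```

Running the load twice against the same staging data leaves the Hub unchanged on the second run, which is the property that makes the process safe to rerun.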
29
33
30
34
## Implementation Guidelines
31
-
Loading a Hub table from a specific Staging Layer table is a single, modular, ETL process. This is a requirement for flexibility in loading information as it enables full parallel processing.
35
+
36
+
Loading a Hub table from a specific Staging Layer table is a single, modular, ETL process. This is a requirement for flexibility in loading information as it enables full parallel processing.
32
37
33
38
Multiple passes of the same source table or file are usually required for various tasks. The first pass will insert new keys in the Hub table; the other passes may be needed to populate the Satellite and Link tables.
34
39
The designated business key (usually the source natural key, but not always!) is the ONLY non-process or Data Warehouse related attribute in the Hub table.
@@ -41,17 +46,19 @@ By default the DISTINCT function is executed on database level to reserve resour
41
46
42
47
The logic to create the initial (dummy) Satellite record can be implemented in one of three ways: as part of the Hub ETL process, as a separate ETL process that queries all keys that have no corresponding dummy record, or as part of the Satellite ETL process. The choice depends on the capabilities of the ETL software, since not all tools can provide and reuse sequence generators or write to multiple targets in one process. The default and arguably most flexible way is to incorporate this concept into the Satellite ETL, since it does not require rework when additional Satellites are associated with the Hub. This means that each Satellite ETL must check whether a dummy record exists before starting the standard process (and be able to roll back the dummy records if required).
43
48
44
-
When modeling the Hub tables try to be conservative when defining the business keys. Not every foreign key in the source indicates a business key and therefore a Hub table. A true business key is a concept that is known and used throughout the organisation (and systems) and is ‘self-standing’ and meaningful.
49
+
When modeling the Hub tables try to be conservative when defining the business keys. Not every foreign key in the source indicates a business key and therefore a Hub table. A true business key is a concept that is known and used throughout the organisation (and systems) and is 'self-standing' and meaningful.
45
50
46
51
To cater for a situation where multiple Load Date / Time stamp values exist for a single business key, the minimum Load Date / Time stamp should be the value passed through with the Hub record. This can be implemented in ETL logic, or delegated to the database. When implemented at the database level, replacing SELECT DISTINCT with the MIN function and a GROUP BY on the business key achieves both a distinct selection and the minimum Load Date / Time stamp in one step.
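As a rough illustration of the MIN / GROUP BY approach described above, the sketch below runs the selection against an in-memory SQLite table. The names `stg_customer`, `customer_code` and `load_dts` are hypothetical, and the Load Date / Time stamps are stored as ISO 8601 strings so that MIN orders them chronologically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stg_customer (customer_code TEXT, load_dts TEXT)")
conn.executemany(
    "INSERT INTO stg_customer VALUES (?, ?)",
    [
        ("C001", "2024-01-03T08:00:00"),
        ("C001", "2024-01-01T08:00:00"),  # earliest stamp for C001
        ("C002", "2024-01-02T08:00:00"),
    ],
)

# One row per business key, carrying the minimum Load Date / Time stamp.
rows = conn.execute(
    """
    SELECT customer_code, MIN(load_dts) AS load_dts
    FROM stg_customer
    GROUP BY customer_code
    ORDER BY customer_code
    """
).fetchall()
```

Here `rows` holds one entry per business key, so the grouped query replaces the SELECT DISTINCT and the separate minimum calculation in a single pass.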
47
52
48
53
## Considerations and Consequences
54
+
49
55
Multiple passes on the same Staging Layer data set are likely to be required: once for the Hub table(s) but also for any corresponding Link and Satellite tables.
50
56
51
57
Defining Hub ETL processes as atomic modules, as defined in this Design Pattern, means that many Staging Layer tables load data to the same central Hub table. All processes will be very similar with the only difference being the mapping between the Staging Layer business key attribute and the target Hub business key counterpart.
52
58
53
59
## Related Patterns
54
-
* Design Pattern 006 – Generic – Using Start, Process and End Dates
# Design Pattern - Data Vault - Loading Satellites tables
1
+
# Design Pattern - Data Vault - Satellites table
2
+
3
+
---
4
+
**NOTE**
5
+
6
+
This design pattern requires a major update.
7
+
8
+
---
2
9
3
10
## Purpose
4
-
This Design Pattern describes how to load data into Satellite tables within a ‘Data Vault’ EDW architecture. The concept can be applied to any SCD2 mechanism as well.
11
+
12
+
This Design Pattern describes how to represent, or load data into, Satellite tables using Data Vault methodology.
5
13
6
14
## Motivation
7
-
The Design Pattern to load data into Satellite style tables aims to simplify and streamline the way ETL design is done for these tables. The process is essentially straightforward and does not require any business logic other than the definition of the business key. This is already done as part of the data modelling and Hub definition steps.
8
-
Also known as
9
-
Satellite (Data Vault modelling concept).
10
-
History or INT tables.
15
+
16
+
Satellite tables contain the context: descriptive properties that describe a Data Vault 'Hub' table. They
11
17
12
18
## Applicability
19
+
13
20
This pattern is only applicable for loading data to Satellite tables from:
14
21
The Staging Area into the Integration Area.
15
22
The Integration Area into the Interpretation Area.
16
23
The only difference to the specified ETL template is any business logic required in the mappings towards the Interpretation Area Satellite tables.
17
24
18
25
## Structure
26
+
19
27
The ETL process can be described as a slowly changing dimension / history update of all attributes except the business key (which is stored in the Hub table). This is explained in the following diagram. Most attribute values, including some of the ETL process control values are copied from the Staging Area table. This includes:
20
28
Load Date / Time Stamp (used for the target Effective Date / Time and potentially the Update Date / Time attributes).
21
29
Source Row Id.
22
-
The following diagram will detail this process and address how the other ETL process control attributes are handled.
23
-
24
-
Business Insights > Design Pattern 009 - Data Vault - Loading Satellite Tables > BI Docs.png
25
-
26
-
Figure 1: Satellite ETL process
27
-
The Satellite ETL processes can only be run after the Hub process has finished, but can run in parallel with the Link ETL process. This is displayed in the following diagram:
28
-
Business Insights > Design Pattern 009 - Data Vault - Loading Satellite Tables > BI.png
29
-
Figure 2: Dependencies
30
30
31
31
## Implementation Guidelines
32
+
32
33
Multiple passes of the same source table or file are usually required. The first pass will insert new keys in the Hub table; the other passes are needed to populate the Satellite and Link tables.
34
+
33
35
The process in Figure 1 shows the entire ETL in one single process. For specific tools this way of developing ETL might be relatively inefficient. Therefore, the process can also be broken up into two separate mappings; one for inserts and one for updates. Logically the same actions will be executed, but physically two separate mappings can be used. This can be done in two ways:
36
+
34
37
Follow the same logic, with the same selects, but place filters for the update and insert branches. This leads to an extra pass on the source table, at the possible benefit of running the processes in parallel.
35
-
Only run the insert branch and automatically update the end dates based on the existing information in the Satellite. This process selects all records in the Satellite which have more than one open EXPIRY_DATE (this is the case after running the insert branch separately), sorts the records in order and uses the EFFECTIVE_DATE from the previous record to close the next one. This introduces a dependency between the insert and update branch, but will run faster. An extra benefit is that this also closes off any previous records that were left open. A sample query for this selection is:
38
+
39
+
Only run the insert branch and automatically update the end dates based on the existing information in the Satellite. This process selects all records in the Satellite which have more than one open EXPIRY_DATE (this is the case after running the insert branch separately), sorts the records in order and uses the EFFECTIVE_DATE from the previous record to close the next one. This introduces a dependency between the insert and update branch, but will run faster. An extra benefit is that this also closes off any previous records that were left open.
@@ -41,15 +48,17 @@ WHERE ( satellite.<Expiry Date/Time> IS NULL AND
41
48
AND a.FIRM_LEDTS IS NULL)
42
49
)
43
50
ORDER BY 1,2 DESC
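The end-dating step described above (closing every open Satellite record except the most recent one per key) can be sketched in plain Python. This is an illustration under stated assumptions, not the query from the pattern: the record layout (`key`, `effective_date`, `expiry_date`) is hypothetical, an open record is one whose `expiry_date` is `None`, and dates are ISO 8601 strings.

```python
from itertools import groupby
from operator import itemgetter


def close_open_records(records):
    """Set the expiry date of each open record to the effective date of the
    next open record for the same key; the latest record per key stays open."""
    open_records = sorted(
        (r for r in records if r["expiry_date"] is None),
        key=itemgetter("key", "effective_date"),
    )
    for _, group in groupby(open_records, key=itemgetter("key")):
        ordered = list(group)
        # Pair each record with its successor in effective-date order.
        for older, newer in zip(ordered, ordered[1:]):
            older["expiry_date"] = newer["effective_date"]
    return records


# Hypothetical Satellite content after running only the insert branch.
satellite = [
    {"key": "H1", "effective_date": "2024-01-01", "expiry_date": None},
    {"key": "H1", "effective_date": "2024-03-01", "expiry_date": None},
    {"key": "H1", "effective_date": "2024-02-01", "expiry_date": None},
    {"key": "H2", "effective_date": "2024-01-15", "expiry_date": None},
]
close_open_records(satellite)
expiry_by_row = {(r["key"], r["effective_date"]): r["expiry_date"] for r in satellite}
```

Because the function only looks at records that are still open, rerunning it after a later insert also closes any records that an earlier run left open.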
51
+
44
52
If you have a Change Data Capture based source, the attribute comparison is not required because the source system indicates whether the record in the Staging Area is new, updated or deleted.
53
+
45
54
Use hash values to detect changes, instead of comparing attributes separately. The hash value is created from all attributes except the business key and ETL process control values.
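A minimal sketch of hash-based change detection follows. It assumes rows are available as dictionaries, and the column names (`customer_code` as the business key, `load_dts` and `source_row_id` as ETL process control values) are hypothetical. MD5 and the `|` delimiter are illustrative choices, not requirements of the pattern.

```python
import hashlib


def row_hash(row, exclude):
    """Hash all attributes except the excluded ones, in a fixed column order."""
    parts = []
    for name in sorted(row):
        if name in exclude:
            continue
        value = row[name]
        # Simplification: None and the empty string hash identically here.
        parts.append("" if value is None else str(value))
    return hashlib.md5("|".join(parts).encode("utf-8")).hexdigest()


# Business key and ETL process control values are left out of the hash.
EXCLUDED = {"customer_code", "load_dts", "source_row_id"}

current = {"customer_code": "C001", "name": "Alice", "city": "Utrecht",
           "load_dts": "2024-01-01T08:00:00", "source_row_id": 1}
incoming_same = {"customer_code": "C001", "name": "Alice", "city": "Utrecht",
                 "load_dts": "2024-01-02T08:00:00", "source_row_id": 7}
incoming_changed = {"customer_code": "C001", "name": "Alice", "city": "Delft",
                    "load_dts": "2024-01-02T08:00:00", "source_row_id": 8}

unchanged = row_hash(current, EXCLUDED) == row_hash(incoming_same, EXCLUDED)
changed = row_hash(current, EXCLUDED) != row_hash(incoming_changed, EXCLUDED)
```

Only a differing hash triggers a new Satellite record, so a row that differs solely in its control values is treated as unchanged.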
46
55
47
56
## Considerations and Consequences
57
+
48
58
Multiple passes on source data are likely to be required.
49
-
Known uses
50
-
This type of ETL process is to be used in all Hub or SK tables in the Integration Area. The Cleansing Area Hub tables, if used, have similar characteristics but the ETL process contains business logic.
51
59
52
60
## Related Patterns
53
-
Design Pattern 006 – Generic – Using Start, Process and End Dates
54
-
Design Pattern 009 – Data Vault – Loading Satellite tables.
55
-
Design Pattern 010 – Data Vault – Loading Link tables.
61
+
62
+
Design Pattern 006 - Generic - Using Start, Process and End Dates
63
+
Design Pattern 009 - Data Vault - Loading Satellite tables
64
+
Design Pattern 010 - Data Vault - Loading Link tables