Commit 257d90e (1 parent: afa17ab)

Revisiting Hub pattern - very old and needs refresh

2 files changed: +40 −29 lines

1000_Design_Patterns/Design Pattern - Data Vault - Loading Hub tables.md

Lines changed: 30 additions & 20 deletions
@@ -4,42 +4,52 @@
 This Design Pattern describes how to load data into Data Vault Hub style entities.

 ## Motivation
-Loading data into Hub tables is a relatively straightforward process with a fixed location in the scheduling of loading data from the Staging Layer to the Integration Layer. It is a vital component of the Data Warehouse architecture, making sure that Data Warehouse keys are distributed properly and at the right point in time. Decoupling key distribution and historical information is an essential requirement for parallel processing and for reducing dependencies in the loading process. This pattern specifies how this process works and why it is important to follow. In a Data Vault based Enterprise Data Warehouse solution, the Hub tables (and corresponding ETL) are the only places where Data Warehouse keys are distributed.
-Also known as
-Hub (Data Vault modelling concept)
-Surrogate Key (SK) or Hash Key (HSH) distribution
-Data Warehouse key distribution
+Loading data into Hub tables is a relatively straightforward process with a set location in the architecture: it is applied when loading data from the Staging Layer to the Integration Layer. It is a vital component of the Data Warehouse architecture, making sure that Data Warehouse keys are distributed properly and at the right point in time.
+
+Decoupling key distribution and historical information is an essential requirement for reducing dependencies in the loading process and enabling flexible storage design in the Data Warehouse.
+
+This pattern specifies how the Hub ETL process works and why it is important to follow.
+
+In a Data Vault based Enterprise Data Warehouse solution, the Hub tables (and corresponding ETL) are the only places where Data Warehouse keys are distributed.
+
+Also known as:
+
+- Hub (Data Vault modelling concept)
+- Surrogate Key (SK) or Hash Key (HSH) distribution
+- Data Warehouse key distribution

 ## Applicability
-This pattern is applicable for the process of loading from the Staging Layer into the Integration Area Hub tables only.
+This pattern is applicable for the process of loading from the Staging Layer into the Integration Area Hub tables. It is used for all Hub tables in the Integration Layer. Derived (Business Data Vault) Hub tables follow the same pattern, but with business logic applied.

 ## Structure
-The ETL process can be described as an ‘insert only’ set of the unique business keys. The process performs a SELECT DISTINCT on the Staging Area table and a key lookup to retrieve the OMD_RECORD_SOURCE_ID based on the value in the Staging Layer table. If no entry for the record source is found the ETL process is set to fail because this indicates a major error in the ETL Framework configuration (i.e. this must be tested during unit and UAT testing).
-Using this value and the source business key the process performs a key lookup (outer join) to verify if that specific business key already exists in the target Hub table (for that particular record source). If it exists, the row can be discarded, if not it can be inserted.
-Business Insights > Design Pattern 008 - Data Vault - Loading Hub tables > image2015-4-29 14:54:58.png
+A Hub table contains the unique list of business keys, and the corresponding Hub ETL process can be described as an ‘insert only’ process for the unique business keys that are not yet in the target Hub.
+
+The process performs a distinct selection on the business key attribute(s) in the Staging Area table and a key lookup to verify whether the available business keys already exist in the target Hub table. If a business key already exists the row is discarded; if not, it is inserted.
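The distinct selection with key lookup described above can be sketched as a single insert-only statement; a minimal illustration using SQLite, where the table and column names (STG_CUSTOMER, HUB_CUSTOMER, CUSTOMER_ID) are assumptions for the example, not prescribed by the pattern:

```python
import sqlite3

# Illustrative schema; CUSTOMER_SK stands in for the distributed
# Data Warehouse key, CUSTOMER_ID for the business key.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE STG_CUSTOMER (CUSTOMER_ID TEXT, OMD_INSERT_DATETIME TEXT);
    CREATE TABLE HUB_CUSTOMER (
        CUSTOMER_SK INTEGER PRIMARY KEY,  -- Data Warehouse key
        CUSTOMER_ID TEXT UNIQUE           -- business key
    );
""")
con.executemany(
    "INSERT INTO STG_CUSTOMER VALUES (?, ?)",
    [("C-001", "2024-01-01"), ("C-002", "2024-01-01"), ("C-001", "2024-01-02")],
)

def load_hub(con):
    # Distinct selection of the business key plus a key lookup against the
    # target Hub; only keys that do not yet exist are inserted.
    con.execute("""
        INSERT INTO HUB_CUSTOMER (CUSTOMER_ID)
        SELECT DISTINCT stg.CUSTOMER_ID
        FROM STG_CUSTOMER stg
        WHERE NOT EXISTS (
            SELECT 1 FROM HUB_CUSTOMER hub
            WHERE hub.CUSTOMER_ID = stg.CUSTOMER_ID)
    """)

load_hub(con)
load_hub(con)  # rerunning is safe: existing keys are discarded, none duplicated
```

Because the Hub is the only place where the Data Warehouse key is distributed, every other process can look the key up without ever generating one.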

-Additionally, for every new Data Warehouse key a corresponding initial (dummy) Satellite record must be created to ensure complete timelines. Depending on the available technology this can be implemented as part of the Hub or Satellite Module but as each Hub can contain multiple Satellites it is recommended to be implemented in the Satellite process. This is explained in more detail in the Implementation Guidelines and Consequences section.
-The Hub ETL processes are the first ones that need to be executed in the Integration Area. Once the Hub tables have been populated or updated, the related Satellite and Link tables can be run in parallel. This is displayed in the following diagram:
-Business Insights > Design Pattern 008 - Data Vault - Loading Hub tables > BI2.png
-Figure 2: Dependencies
-Logically the creation of the initial Satellite record is part of the ETL process for Hub tables and is a prerequisite for further processing of the Satellites.
+During the selection the key distribution approach is implemented to make sure a dedicated Data Warehouse key is created. This can be an integer value, a hash key (e.g. MD5 or SHA1) or a natural business key.
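For the hash key option, key distribution becomes a deterministic calculation over the business key. A sketch follows; the normalisation rules (trimming, upper-casing, the `|` delimiter) are illustrative design choices, not prescribed by the pattern:

```python
import hashlib

def hash_key(*business_key_parts, delimiter="|"):
    """Derive a deterministic Data Warehouse hash key (MD5) from one or
    more business key parts. Trimming, upper-casing and the delimiter are
    illustrative normalisation choices."""
    normalised = delimiter.join(str(part).strip().upper()
                                for part in business_key_parts)
    return hashlib.md5(normalised.encode("utf-8")).hexdigest()

# The same business key always produces the same key value, so independent
# or parallel loads derive identical Hub keys without a sequence generator.
key = hash_key(" c-001 ")
```

The delimiter matters for composite business keys: without it, distinct part combinations could concatenate to the same string and collide.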


 ## Implementation Guidelines
-Use a single ETL process, module or mapping to load the Hub table, thus improving flexibility in processing. This means that no Hub keys will be distributed as part of another ETL process.
+Loading a Hub table from a specific Staging Layer table is a single, modular ETL process. This is a requirement for flexibility in loading information as it enables full parallel processing.
+
 Multiple passes of the same source table or file are usually required for various tasks. The first pass will insert new keys in the Hub table; the other passes may be needed to populate the Satellite and Link tables.
 The designated business key (usually the source natural key, but not always!) is the ONLY non-process or Data Warehouse related attribute in the Hub table.
-Do not tag every record with the system date/time (sysdate as OMD_INSERT_DATETIME) but copy this value from the Staging Area. This improves ETL flexibility. The Staging Area ETL is designed to label every record which is processed by the same module with the correct date/time: the date/time the record has been loaded into the Data Warehouse environment (event date/time). The OMD model will track when records have been loaded physically through the Insert Module Instance ID.
+
+The Load Date / Time Stamp (LDTS) is copied (inherited) from the Staging Layer. This improves ETL flexibility. The Staging Area ETL is designed to label every record which is processed by the same module with the correct date/time: the date/time the record has been loaded into the Data Warehouse environment (event date/time). The ETL process control framework will track when records have been loaded physically through the Insert Module Instance ID.
+
 Multiple ETL processes may load the same business key into the corresponding Hub table if the business key exists in more than one table. This also means that ETL software must implement dynamic caching to avoid duplicate inserts when running similar processes in parallel.
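The dynamic caching idea can be illustrated in miniature: parallel loads share one key cache, so a business key arriving from two sources is inserted exactly once. The class and names below are hypothetical; a real implementation would pre-load the cache from the target Hub table:

```python
import threading

class HubKeyCache:
    """Illustrative 'dynamic cache': parallel Hub loads consult and update a
    shared set of business keys under a lock, so each key is claimed by
    exactly one process even when it arrives from several tables at once."""

    def __init__(self):
        self._known = set()
        self._lock = threading.Lock()

    def try_claim(self, business_key):
        # Returns True for exactly one caller per business key.
        with self._lock:
            if business_key in self._known:
                return False
            self._known.add(business_key)
            return True

cache = HubKeyCache()
inserted = []  # stands in for the target Hub table

def load(batch):
    for key in batch:
        if cache.try_claim(key):
            inserted.append(key)  # the actual Hub insert would happen here

# Two parallel loads that overlap on business key C-002.
threads = [threading.Thread(target=load, args=(batch,))
           for batch in (["C-001", "C-002"], ["C-002", "C-003"])]
for t in threads:
    t.start()
for t in threads:
    t.join()
```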
 By default the DISTINCT function is executed on database level to reserve resources for the ETL engine, but this can be executed in the ETL tool as well if this supports proper resource distribution (i.e. a light database server but a powerful ETL server).

 The logic to create the initial (dummy) Satellite record can be implemented as part of the Hub ETL process, as a separate ETL process which queries all keys that have no corresponding dummy, or as part of the Satellite ETL process. This depends on the capabilities of the ETL software, since not all are able to provide and reuse sequence generators or to write to multiple targets in one process. The default and arguably most flexible way is to incorporate this concept as part of the Satellite ETL, since it does not require rework when additional Satellites are associated with the Hub. This means that each Satellite ETL must check whether a dummy record exists before starting the standard process (and be able to roll back the dummy records if required).
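The recommended Satellite-side variant can be sketched as a pre-check at the start of each Satellite ETL. The record layout and the DUMMY_LDTS sentinel are assumptions for illustration:

```python
from datetime import datetime

# Assumed 'beginning of time' sentinel for the initial (dummy) record.
DUMMY_LDTS = datetime(1900, 1, 1)

def ensure_dummy_record(satellite_rows, hub_key):
    """Pre-check at the start of a Satellite ETL: if the Hub key has no
    initial (dummy) record yet, create one so its timeline is complete
    from the earliest possible date."""
    has_dummy = any(r["hub_key"] == hub_key and r["ldts"] == DUMMY_LDTS
                    for r in satellite_rows)
    if not has_dummy:
        satellite_rows.append({"hub_key": hub_key, "ldts": DUMMY_LDTS,
                               "attributes": None})

sat = []  # stands in for the Satellite table
ensure_dummy_record(sat, "C-001")
ensure_dummy_record(sat, "C-001")  # idempotent: a second call adds nothing
```

Placing this check in the Satellite ETL means no rework is needed when another Satellite is later attached to the same Hub.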
 When modelling the Hub tables try to be conservative when defining the business keys. Not every foreign key in the source indicates a business key and therefore a Hub table. A true business key is a concept that is known and used throughout the organisation (and systems) and is ‘self-standing’ and meaningful.

 To cater for a situation where multiple OMD_INSERT_DATETIME values exist for a single business key, the minimum OMD_INSERT_DATETIME should be the value passed through with the Hub record. This can be implemented in ETL logic, or passed through to the database. When implemented at database level, instead of using a SELECT DISTINCT, using the MIN function with a GROUP BY on the business key achieves both the distinct selection and the minimum OMD_INSERT_DATETIME in one step.
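The MIN-with-GROUP-BY variant can be illustrated with SQLite; the table and column names other than OMD_INSERT_DATETIME are assumptions for the example:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE STG_CUSTOMER (CUSTOMER_ID TEXT, OMD_INSERT_DATETIME TEXT)")
con.executemany("INSERT INTO STG_CUSTOMER VALUES (?, ?)", [
    ("C-001", "2024-01-02 10:00:00"),
    ("C-001", "2024-01-01 09:00:00"),  # earliest occurrence of C-001
    ("C-002", "2024-01-03 08:00:00"),
])

# MIN with GROUP BY replaces SELECT DISTINCT: one pass yields the distinct
# business keys together with the minimum OMD_INSERT_DATETIME per key.
rows = con.execute("""
    SELECT CUSTOMER_ID, MIN(OMD_INSERT_DATETIME)
    FROM STG_CUSTOMER
    GROUP BY CUSTOMER_ID
    ORDER BY CUSTOMER_ID
""").fetchall()
```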

 ## Considerations and Consequences
-Multiple passes on source data is likely to be required: once for Hub tables and subsequently for Link and Satellite tables. Defining Hub ETL processes in the atomic way as defined in this Design Pattern means that many files load data to the same central Hub table; all processes will be very similar with the only difference the mapping between the source attribute which represents the business key and the Hub counterpart.
-A single Hub may be loaded by many Modules from a single source system, and there may be several Satellites for the source system hanging off this Hub. It needs to be ensured that all corresponding Satellites are populated by the Hub ETL.
-Known uses
-This type of ETL process is to be used in all Hub or Surrogate Key tables in the Integration Area. The Interpretation Area Hub tables, if used, have similar characteristics but the ETL process contains business logic.
+Multiple passes on the same Staging Layer data set are likely to be required: once for the Hub table(s) but also for any corresponding Link and Satellite tables.
+
+Defining Hub ETL processes as atomic modules, as defined in this Design Pattern, means that many Staging Layer tables load data to the same central Hub table. All processes will be very similar with the only difference being the mapping between the Staging Layer business key attribute and the target Hub business key counterpart.

 ## Related Patterns
 Design Pattern 006 – Generic – Using Start, Process and End Dates

README.md

Lines changed: 10 additions & 9 deletions
@@ -1,15 +1,16 @@
 # Data Integration Framework
-Architecture and patterns for Data Integration. Working on collaboratively maintaining architecture patterns.
+Standards for Data Integration architecture and patterns, to enable collaborative maintenance of the pattern body of knowledge.

 The pattern structure (Design and Solution Pattern layout) is always as follows:

-* Title, the name of the patterns
-* Purpose, a short statement what the pattern is trying to achieve or explain. What is the intent?
-* Motivation, a short overview of the background and relevance of the pattern. Why is there a need?
-* Applicability, a listing of where this pattern can be expected to play a role.
-* Structure, the main section with the pattern details.
-* Implementation Guidelines, any references to how to implement this pattern (Design Patterns only). Note that the Solution Pattern is intended to explain the specifics in a technical context. This is meant to capture any generic topics.
-* Considerations and consequences, meant to offer some alternative views and experiences as to what it means to take a certain decision.
-* Related Patterns, any references towards futher reading and related content.
+* **Title**, the name of the pattern
+* **Purpose**, a short statement of what the pattern is trying to achieve or explain. What is the intent?
+* **Motivation**, a short overview of the background and relevance of the pattern. Why is there a need?
+* **Applicability**, a listing of where this pattern can be expected to play a role.
+* **Structure**, the main section with the pattern details.
+* **Implementation guidelines**, any references on how to implement this pattern (Design Patterns only). Note that the Solution Pattern is intended to explain the specifics in a technical context. This is meant to capture any generic topics.
+* **Considerations and consequences**, meant to offer some alternative views and experiences as to what it means to take a certain decision.
+* **Related patterns**, any references towards further reading and related content.

 The Title is in Header 1 format, the sections are in Header 2 format.
+