README.md

# Design and Solution Patterns
A documentation library containing reusable design and solution patterns, supporting [Data Engine Thinking](https://dataenginethinking.com/en/).

## Getting started

Please have a look at [the introduction documentation](./docs/index.md) to get started!
## Implementation

This repository is intended to be cloned and modified for organization-specific scenarios. All files are text-based (Markdown format, by default) for convenient editing and collaboration using Git. A DocFX file is also provided to generate static HTML from the repository's contents.

To generate the content as a website (on localhost port 8081), please run the following from the 'docs' directory of the repository:
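For example, with the DocFX command-line tool (a sketch; the exact serve options, including the `-p` port flag, are assumptions that may vary between DocFX versions):

```shell
# Build the documentation and serve it locally; '-p 8081' is assumed to select the port.
docfx docfx.json --serve -p 8081
```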
> This design pattern requires a major update to refresh the content.

## Purpose

This design pattern describes how to define, and load data into, Data Vault Hub style tables.
## Motivation

A Data Vault Hub is the physical implementation of a Core Business Concept (CBC). These are the essential 'things' that can be meaningfully identified as part of an organization's business processes.

Loading data into Hub tables is a relatively straightforward process with a clearly defined location in the architecture: it is applied when loading data from the Staging Layer to the Integration Layer.

The Hub is a vital component of a Data Vault solution, making sure that Data Warehouse keys are distributed properly, and at the right point in time.

Decoupling key distribution and managing historical information (changes over time) is essential to reduce loading dependencies. It also simplifies (flexible) storage design in the Data Warehouse.

Also known as:
- Core Business Concept (Ensemble Modeling).
- Hub (Data Vault Modeling concept).
- Surrogate Key (SK) or Hash Key (HSH) distribution, as commonly used implementations of the concept.
- Data Warehouse key distribution.

## Applicability
This pattern is applicable for the process of loading from the Staging Layer into Hub tables. It is used in all Hubs in the Integration Layer. Derived (Business Data Vault) Hub data logistics processes follow the same pattern.

## Structure
A Hub table contains the unique list of business keys, and the corresponding Hub data logistics process can be described as an 'insert only' of the unique business keys that are not yet in the target Hub.
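As an illustration, a minimal Hub table could be defined as follows (a sketch only; the 'Customer' concept, the table and column names, the data types, and the MD5-sized hash key are assumptions for the example, not prescribed by this pattern):

```sql
-- Illustrative Hub table for a 'Customer' Core Business Concept.
-- The business key (CUSTOMER_CODE) is the only non-process attribute.
CREATE TABLE HUB_CUSTOMER (
    CUSTOMER_HSH          CHAR(32)     NOT NULL, -- Data Warehouse key (e.g. an MD5 hash key)
    CUSTOMER_CODE         VARCHAR(100) NOT NULL, -- the designated business key
    INSCRIPTION_TIMESTAMP TIMESTAMP    NOT NULL, -- inherited from the Staging Layer
    AUDIT_TRAIL_ID        INTEGER      NOT NULL, -- reference to the process control framework
    PRIMARY KEY (CUSTOMER_HSH),
    UNIQUE (CUSTOMER_CODE)
);
```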
The process performs a distinct selection on the business key attribute(s) in the Landing Area table and performs a key lookup to verify if the available business keys already exist in the target Hub table. If a business key already exists, the row is discarded; if not, it is inserted.

During the selection the key distribution approach is implemented to make sure a dedicated Data Warehouse key is created. This can be an integer value, a hash key (e.g. MD5 or SHA1), or a natural business key.
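A minimal sketch of this insert-only selection in SQL, reusing the illustrative HUB_CUSTOMER table above and an assumed Landing Area table LND_CUSTOMER (the MD5 function name and the literal Audit Trail Id are placeholders; both vary by platform and framework):

```sql
-- Insert-only Hub load: distinct business keys that are not yet in the target Hub.
-- Handling of multiple timestamps per key is refined under 'Implementation guidelines'.
INSERT INTO HUB_CUSTOMER (CUSTOMER_HSH, CUSTOMER_CODE, INSCRIPTION_TIMESTAMP, AUDIT_TRAIL_ID)
SELECT DISTINCT
    MD5(lnd.CUSTOMER_CODE),    -- key distribution: hash key example
    lnd.CUSTOMER_CODE,         -- the designated business key
    lnd.INSCRIPTION_TIMESTAMP, -- inherited from the Staging Layer
    42                         -- placeholder; supplied by the process control framework
FROM LND_CUSTOMER lnd
WHERE NOT EXISTS (
    SELECT 1
    FROM HUB_CUSTOMER hub
    WHERE hub.CUSTOMER_CODE = lnd.CUSTOMER_CODE -- key lookup against the target Hub
);
```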
## Implementation guidelines

Hubs must be immediately and uniquely identifiable through their name.

Loading a Hub table from a specific Staging Layer table is a single, modular, data logistics process. This is a requirement for flexibility in loading information as it enables full parallel processing.

Multiple passes of the same source table or file are usually required for various tasks. The first pass will insert new keys in the Hub table; the other passes may be needed to populate the Satellite and Link tables.

The designated business key (sometimes the source natural key, but not always!) is the ONLY non-process or Data Warehouse related attribute in the Hub table.

The Inscription Timestamp is copied (inherited) from the Staging Layer. This improves data logistics flexibility. The Landing Area data logistics process is designed to label every record processed by the same module with the correct timestamp, indicating when the record was loaded into the Data Warehouse environment. The data logistics process control framework tracks when records have been physically loaded through the Audit Trail Id.

Multiple data logistics processes may load the same business key into the corresponding Hub table if the business key exists in more than one table. This also means that data logistics software must implement dynamic caching to avoid duplicate inserts when running similar processes in parallel.

By default the DISTINCT operation is executed at the database level to reserve resources for the data logistics engine, but it can also be executed inline in the data logistics process if this supports proper resource distribution (e.g. a light database server combined with a powerful data logistics server).

The logic to create the initial (dummy) zero key record can be implemented as part of the Hub data logistics process, as a separate data logistics process that queries all keys without a corresponding dummy, or as a one-off insert issued when the Hub table is created.
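As a sketch, such a one-off zero key record could be issued as follows (the all-zeros hash, the 'Unknown' key, the epoch-style timestamp, and the Audit Trail Id of 0 are illustrative conventions, not prescribed by this pattern):

```sql
-- Illustrative zero key ('dummy') record, inserted once when the Hub table is created.
INSERT INTO HUB_CUSTOMER (CUSTOMER_HSH, CUSTOMER_CODE, INSCRIPTION_TIMESTAMP, AUDIT_TRAIL_ID)
VALUES ('00000000000000000000000000000000', 'Unknown', TIMESTAMP '1900-01-01 00:00:00', 0);
```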
When modeling the Hub tables try to be conservative when defining the business keys. Not every foreign key in the source indicates a business key and therefore a Hub table. A true business key is a concept that is known and used throughout the organization (and systems) and is self-standing and meaningful.

To cater for a situation where multiple Inscription Timestamp values exist for a single business key, the minimum Inscription Timestamp should be the value passed through with the Hub record. This can be implemented in data logistics logic, or passed through to the database. When implemented at the database level, using the MIN function with a GROUP BY on the business key, instead of a SELECT DISTINCT, achieves both the distinct selection and the minimum Inscription Timestamp in one step.
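A sketch of this database-level variant, reusing the illustrative tables from the Structure section:

```sql
-- Distinct key selection and minimum Inscription Timestamp in a single step.
SELECT
    lnd.CUSTOMER_CODE,
    MIN(lnd.INSCRIPTION_TIMESTAMP) AS INSCRIPTION_TIMESTAMP
FROM LND_CUSTOMER lnd
GROUP BY lnd.CUSTOMER_CODE;
```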
## Considerations and consequences

Multiple passes on the same Staging Layer data set are likely to be required: once for the Hub table(s) but also for any corresponding Link and Satellite tables.

Defining Hub data logistics processes as atomic modules, as defined in this Design Pattern, means that many Staging Layer tables load data to the same central Hub table. All processes will be very similar with the only difference being the mapping between the Staging Layer business key attribute and the target Hub business key counterpart.

Misidentifying business keys or over-keying leads to proliferation of Hubs and downstream complexity; validate business key selection with domain experts.

Hash-key generation must be deterministic and collision-resistant; standardize hash inputs (case, trim, date formats) to avoid false duplicates.
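For example, a standardized hash input could be composed as follows (a sketch; the composite key, the 'N/A' placeholder for NULLs, the '|' delimiter, and the MD5 and concatenation syntax are illustrative and vary by platform):

```sql
-- Standardized hash input: consistent casing, trimming, NULL handling, and delimiting,
-- so that the same business key always produces the same hash value.
SELECT MD5(
         COALESCE(UPPER(TRIM(lnd.CUSTOMER_CODE)), 'N/A')
         || '|' ||
         COALESCE(UPPER(TRIM(lnd.COMPANY_CODE)), 'N/A')
       ) AS CUSTOMER_HSH
FROM LND_CUSTOMER lnd;
```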
## Related patterns

- [Design Pattern - Logical - Core Business Concept](xref:design-pattern-logical-core-business-concept)
- Design Pattern 006 - Generic - Using Start, Process and End Dates

docs/design-patterns/design-pattern-data-vault-link-satellite.md

To provide a generic approach for loading Link Satellites.

This pattern is only applicable for loading data to Link-Satellite tables from:
* The Landing Area into the Integration Area.
* The Integration Area into the Interpretation Area.
* The only difference to the specified data logistics template is any business logic required in the mappings towards the Interpretation Area tables.

## Structure
Standard Link-Satellites use the Driving Key concept to manage the ending of old relationships. The driving key defines which key in the Link controls history (for example, the transaction id) so that related attributes expire correctly when that driving key changes.
## Implementation guidelines

* Identify the driving key for the Link; use it to manage effective/expiry dates in the Link-Satellite.
* Apply an SCD2-style pattern: close the current record (set its expiry) and insert a new record when the driving key/value combination changes (see the sketch after this list).
* Carry metadata from the Link (load timestamps, source identifiers) to keep lineage intact.
* Use hash keys/checksums to detect attribute changes when applicable; avoid unnecessary updates.
* Keep Link-Satellites narrow: store only relationship attributes (e.g., status, type, reason) and not Hub-level attributes.
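A minimal sketch of the driving-key end-dating step referenced above (all table, column, and parameter names are illustrative assumptions; end-dating is shown as a physical update for clarity, although it can also be derived virtually in an insert-only setup):

```sql
-- Close the currently open Link-Satellite record for the driving key...
UPDATE LSAT_CUSTOMER_CONTRACT
SET    EXPIRY_TIMESTAMP = :new_inscription_timestamp
WHERE  CONTRACT_HSH = :driving_key_hsh                      -- the driving key of the Link
  AND  EXPIRY_TIMESTAMP = TIMESTAMP '9999-12-31 00:00:00';  -- the open record

-- ...and insert the new version of the relationship.
INSERT INTO LSAT_CUSTOMER_CONTRACT
    (LINK_HSH, CONTRACT_HSH, INSCRIPTION_TIMESTAMP, EXPIRY_TIMESTAMP, STATUS)
VALUES
    (:link_hsh, :driving_key_hsh, :new_inscription_timestamp,
     TIMESTAMP '9999-12-31 00:00:00', :status);
```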
## Considerations and consequences

* Choosing the wrong driving key results in incorrect timelines; validate with business owners.
* Link-Satellites are optional; if relationship attributes are stable or modeled elsewhere, they may be unnecessary.
* Overuse of Link-Satellites can add joins and complexity; apply only when relationship history is required.
## Related patterns

* Design Pattern - Using Start, Process and End Dates.