Commit 8db2ee6

committed
Tidy-ups
1 parent 64c218f commit 8db2ee6

66 files changed

+1008
-1990
lines changed

README.md

Lines changed: 3 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -1,16 +1,14 @@
1-
# Data Solution Framework
1+
# Design and Solution Patterns
22

3-
A documentation library containing reusable design- and solution patterns.
3+
A documentation library containing reusable design- and solution patterns, supporting [Data Engine Thinking](https://dataenginethinking.com/en/).
44

55
## Getting started
66

77
Please have a look at [the introduction documentation](./docs/index.md) to get started!
88

9-
The automatically-generated documentation site is available on [GitHub pages](https://data-solution-automation-engine.github.io/data-solution-framework/).
10-
119
## Implementation
1210

13-
This repository is intended to be cloned and modified for organisation-specific scenarios. All files are text-based (MarkDown format, by default) for convenient editing and collaboration using Git. A DocFX file is also provided to generate static HTML from the repository's contents.
11+
This repository is intended to be cloned and modified for organization-specific scenarios. All files are text-based (Markdown format, by default) for convenient editing and collaboration using Git. A DocFX file is also provided to generate static HTML from the repository's contents.
1412

1513
To generate the content as a website (on localhost port 8081), please run the following from the 'docs' directory of the repository:
1614

docs/design-patterns/design-pattern-data-vault-hub.md

Lines changed: 29 additions & 27 deletions
Original file line numberDiff line numberDiff line change
@@ -4,73 +4,75 @@ uid: design-pattern-data-vault-hub
44

55
# Design Pattern - Data Vault - Hub
66

7-
> [!WARNING]
8-
> This design pattern requires a major update to refresh the content.
9-
107
## Purpose
118

129
This design pattern describes how to define, and load data into, Data Vault Hub style tables.
1310

1411
## Motivation
1512

16-
A Data Vault Hub is the physical implementation of a Core Business Concept. These are the the key identified 'things' that can be meaningfully identified as part of an organization's business processes.
13+
A Data Vault Hub is the physical implementation of a Core Business Concept (CBC). These are the essential 'things' that can be meaningfully identified as part of an organization's business processes.
1714

1815
Loading data into Hub tables is a relatively straightforward process with a clearly defined location in the architecture: it is applied when loading data from the Staging Layer to the Integration Layer.
1916

20-
The Hub is a vital component of a Data Vault solution, making sure that Data Warehouse keys are distributed properly and at the right point in time.
17+
The Hub is a vital component of a Data Vault solution, making sure that Data Warehouse keys are distributed properly, and at the right point in time.
2118

2219
Decoupling key distribution and managing historical information (changes over time) is essential to reduce loading dependencies. It also simplifies (flexible) storage design in the Data Warehouse.
2320

2421
Also known as:
2522

26-
- Core Business Concept (Ensemble modelling)
27-
- Hub (Data Vault modelling concept)
28-
- Surrogate Key (SK) or Hash Key (HSH) distribution, as commonly used implementations of the concept
29-
- Data Warehouse key distribution
23+
- Core Business Concept (Ensemble Modeling).
24+
- Hub (Data Vault Modeling concept).
25+
- Surrogate Key (SK) or Hash Key (HSH) distribution, as commonly used implementations of the concept.
26+
- Data Warehouse key distribution.
3027

3128
## Applicability
3229

33-
This pattern is applicable for the process of loading from the Staging Layer into Hub tables. It is used in all Hubs in the Integration Layer. Derived (Business Data Vault) Hub ETL processes follow the same pattern.
30+
This pattern is applicable for the process of loading from the Staging Layer into Hub tables. It is used in all Hubs in the Integration Layer. Derived (Business Data Vault) Hub data logistics processes follow the same pattern.
3431

3532
## Structure
3633

37-
A Hub table contains the unique list of business key, and the corresponding Hub ETL process can be described as an 'insert only' of the unique business keys that are not yet in the the target Hub.
34+
A Hub table contains the unique list of business keys, and the corresponding Hub data logistics process can be described as an 'insert only' of the unique business keys that are not yet in the target Hub.
3835

39-
The process performs a distinct selection on the business key attribute(s) in the Staging Area table and performs a key lookup to verify if the available business keys already exists in the target Hub table. If the business key already exists the row can be discarded, if not it can be inserted.
36+
The process performs a distinct selection on the business key attribute(s) in the Landing Area table and performs a key lookup to verify whether the available business keys already exist in the target Hub table. If the business key already exists, the row can be discarded; if not, it can be inserted.
4037

4138
During the selection, the key distribution approach is implemented to make sure a dedicated Data Warehouse key is created. This can be an integer value, a hash key (e.g. MD5 or SHA1) or a natural business key.
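
As a minimal sketch of the selection and key lookup described above, assuming Python's built-in `sqlite3` module and hypothetical table and column names (`landing_customer`, `hub_customer`, `customer_code`, `customer_hash_key`), an insert-only Hub load could look roughly like this. The MD5-based `hash_key` helper is only one of the key distribution options mentioned, and timestamps and audit columns are left out for brevity.

```python
import hashlib
import sqlite3


def hash_key(business_key: str) -> str:
    # Assumed convention (not from the pattern): MD5 over the trimmed,
    # upper-cased business key.
    return hashlib.md5(business_key.strip().upper().encode("utf-8")).hexdigest()


conn = sqlite3.connect(":memory:")
conn.create_function("hash_key", 1, hash_key)
conn.executescript(
    """
    CREATE TABLE landing_customer (customer_code TEXT NOT NULL);
    CREATE TABLE hub_customer (
        customer_hash_key TEXT PRIMARY KEY,
        customer_code     TEXT NOT NULL UNIQUE
    );
    INSERT INTO landing_customer VALUES ('C001'), ('C002'), ('C002'), ('C003');
    """
)
# One business key is already present in the Hub before the load runs.
conn.execute("INSERT INTO hub_customer VALUES (?, ?)", (hash_key("C001"), "C001"))

# Distinct selection on the business key, lookup against the Hub,
# and insert of only those keys that are not there yet.
LOAD_HUB = """
    INSERT INTO hub_customer (customer_hash_key, customer_code)
    SELECT hash_key(src.customer_code), src.customer_code
    FROM (SELECT DISTINCT customer_code FROM landing_customer) AS src
    LEFT JOIN hub_customer AS hub
           ON hub.customer_code = src.customer_code
    WHERE hub.customer_code IS NULL
"""


def hub_row_count() -> int:
    return conn.execute("SELECT COUNT(*) FROM hub_customer").fetchone()[0]


conn.execute(LOAD_HUB)
count_after_first_run = hub_row_count()
conn.execute(LOAD_HUB)  # re-running the load should not add anything
count_after_second_run = hub_row_count()

hub_keys = [
    row[0]
    for row in conn.execute(
        "SELECT customer_code FROM hub_customer ORDER BY customer_code"
    )
]
```

Because the lookup filters out keys that already exist, the same statement can be re-run without creating duplicates, which is what makes the process 'insert only'.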
4239

4340
## Implementation guidelines
4441

45-
Hubs are core business concepts which must be immediately and uniquely identifiable through their name.
42+
Hubs must be immediately and uniquely identifiable through their name.
4643

47-
Loading a Hub table from a specific Staging Layer table is a single, modular, ETL process. This is a requirement for flexibility in loading information as it enables full parallel processing.
44+
Loading a Hub table from a specific Staging Layer table is a single, modular, data logistics process. This is a requirement for flexibility in loading information as it enables full parallel processing.
4845

4946
Multiple passes of the same source table or file are usually required for various tasks. The first pass will insert new keys in the Hub table; the other passes may be needed to populate the Satellite and Link tables.
50-
The designated business key (usually the source natural key, but not always!) is the ONLY non-process or Data Warehouse related attribute in the Hub table.
47+
The designated business key (sometimes the source natural key, but not always!) is the ONLY attribute in the Hub table that is not process- or Data Warehouse-related.
5148

52-
The Load Date / Time Stamp (LDTS) is copied (inherited) from the Staging Layer. This improves ETL flexibility. The Staging Area ETL is designed to label every record which is processed by the same module with the correct date/time: the date/time the record has been loaded into the Data Warehouse environment (event date/time). The ETL process control framework will track when records have been loaded physically through the Insert Module Instance ID.
49+
The Inscription Timestamp is copied (inherited) from the Staging Layer. This improves data logistics flexibility. The Landing Area data logistics is designed to label every record which is processed by the same module with the correct timestamp, indicating when the record has been loaded into the Data Warehouse environment. The data logistics process control framework will track when records have been loaded physically through the Audit Trail Id.
5350

54-
Multiple ETL processes may load the same business key into the corresponding Hub table if the business key exists in more than one table. This also means that ETL software must implement dynamic caching to avoid duplicate inserts when running similar processes in parallel.
51+
Multiple data logistics processes may load the same business key into the corresponding Hub table if the business key exists in more than one table. This also means that data logistics software must implement dynamic caching to avoid duplicate inserts when running similar processes in parallel.
5552

56-
By default the DISTINCT function is executed on database level to reserve resources for the ETL engine but this can be executed in ETL as well if this supports proper resource distribution (i.e. light database server but powerful ETL server).
53+
By default, the DISTINCT function is executed at the database level to reserve resources for the data logistics engine, but it can be executed inline in data logistics as well if this supports proper resource distribution (e.g. a light database server but a powerful data logistics server).
5754

58-
The logic to create the initial (dummy) Satellite record can both be implemented as part of the Hub ETL process, as a separate ETL process which queries all keys that have no corresponding dummy or as part of the Satellite ETL process. This depends on the capabilities of the ETL software since not all are able to provide and reuse sequence generators or able to write to multiple targets in one process. The default and arguably most flexible way is to incorporate this concept as part of the Satellite ETL since it does not require rework when additional Satellites are associated with the Hub. This means that each Satellite ETL must perform a check if a dummy record exists before starting the standard process (and be able to roll back the dummy records if required).
55+
The logic to create the initial (dummy) zero key record can be implemented as part of the Hub data logistics process, as a separate data logistics process which queries all keys that have no corresponding dummy, or issued when the Hub table is created.
5956

60-
When modeling the Hub tables try to be conservative when defining the business keys. Not every foreign key in the source indicates a business key and therefore a Hub table. A true business key is a concept that is known and used throughout the organisation (and systems) and is self-standing and meaningful.
57+
When modeling the Hub tables try to be conservative when defining the business keys. Not every foreign key in the source indicates a business key and therefore a Hub table. A true business key is a concept that is known and used throughout the organization (and systems) and is self-standing and meaningful.
6158

62-
To cater for a situation where multiple Load Date / Time stamp values exist for a single business key, the minimum Load Date / Time stamp should be the value passed through with the HUB record. This can be implemented in ETL logic, or passed through to the database. When implemented at a database level, instead of using a SELECT DISTINCT, using the MIN function with a GROUP BY the business key can achieve both a distinct selection, and minimum Load Date / Time stamp in one step.
59+
To cater for a situation where multiple Load Date / Time stamp values exist for a single business key, the minimum Load Date / Time stamp should be the value passed through with the Hub record. This can be implemented in data logistics logic, or passed through to the database. When implemented at the database level, using the MIN function with a GROUP BY on the business key, instead of a SELECT DISTINCT, achieves both a distinct selection and the minimum Load Date / Time stamp in one step.
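
The database-level variant can be sketched as follows, again assuming Python's built-in `sqlite3` module and hypothetical names (`landing_customer`, `customer_code`, `load_timestamp`). Timestamps are stored as ISO 8601 text here, an assumption that lets `MIN` compare them correctly as strings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE landing_customer (
        customer_code  TEXT NOT NULL,
        load_timestamp TEXT NOT NULL
    );
    INSERT INTO landing_customer VALUES
        ('C001', '2024-01-03T09:00:00'),
        ('C001', '2024-01-01T09:00:00'),
        ('C002', '2024-01-02T09:00:00');
    """
)

# GROUP BY yields one row per business key (the distinct selection),
# while MIN picks the earliest timestamp for that key in the same pass.
hub_candidates = conn.execute(
    """
    SELECT customer_code, MIN(load_timestamp) AS load_timestamp
    FROM landing_customer
    GROUP BY customer_code
    ORDER BY customer_code
    """
).fetchall()
```

A plain `SELECT DISTINCT customer_code, load_timestamp` would return `C001` twice in this data set, which is the situation the paragraph above is guarding against.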
6360

6461
## Considerations and consequences
6562

6663
Multiple passes on the same Staging Layer data set are likely to be required: once for the Hub table(s) but also for any corresponding Link and Satellite tables.
6764

68-
Defining Hub ETL processes as atomic modules, as defined in this Design Pattern, means that many Staging Layer tables load data to the same central Hub table. All processes will be very similar with the only difference being the mapping between the Staging Layer business key attribute and the target Hub business key counterpart.
65+
Defining Hub data logistics processes as atomic modules, as defined in this Design Pattern, means that many Staging Layer tables load data to the same central Hub table. All processes will be very similar with the only difference being the mapping between the Staging Layer business key attribute and the target Hub business key counterpart.
66+
67+
Misidentifying business keys or over-keying leads to proliferation of Hubs and downstream complexity; validate business key selection with domain experts.
68+
69+
Hash-key generation must be deterministic and collision-resistant; standardize hash inputs (case, trim, date formats) to avoid false duplicates.
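
A minimal sketch of standardized hash inputs, assuming Python's `hashlib`. The function names, the `|` delimiter, the upper-casing rule and the choice of SHA-256 are illustrative assumptions, not conventions prescribed by this pattern.

```python
import hashlib
from datetime import date

DELIMITER = "|"  # assumed separator between business key parts


def standardize(value) -> str:
    """Bring one business key part into a canonical text form."""
    if value is None:
        return ""
    if isinstance(value, date):  # also covers datetime, a subclass of date
        return value.isoformat()
    return str(value).strip().upper()


def hub_hash_key(*business_key_parts) -> str:
    """Deterministic hash over the standardized, delimited key parts."""
    joined = DELIMITER.join(standardize(part) for part in business_key_parts)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Differences in case and padding no longer produce different keys.
same_key = hub_hash_key(" c001 ") == hub_hash_key("C001")

# The delimiter keeps ('A', 'BC') and ('AB', 'C') apart.
distinct_keys = hub_hash_key("A", "BC") != hub_hash_key("AB", "C")
```

Whatever rules are chosen, they need to be applied identically by every process that loads the same Hub, otherwise the same business key can arrive under two different hash values.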
6970

7071
## Related patterns
7172

72-
- [Design Pattern - Logical - Core Business Concept](xref:design-pattern-logical-core-business-concept)
73-
- Design Pattern 006 - Generic - Using Start, Process and End Dates
74-
- Design Pattern 009 - Data Vault - Loading Satellite tables
75-
- Design Pattern 010 - Data Vault - Loading Link tables
76-
- Design Pattern 023 - Data Vault - Missing keys and placeholders
73+
- [Design Pattern - Logical - Core Business Concept](xref:design-pattern-logical-core-business-concept).
74+
- Design Pattern 006 - Generic - Using Start, Process and End Dates.
75+
- Design Pattern 009 - Data Vault - Loading Satellite tables.
76+
- Design Pattern 010 - Data Vault - Loading Link tables.
77+
- Design Pattern 023 - Data Vault - Missing keys and placeholders.
78+

docs/design-patterns/design-pattern-data-vault-link-satellite-driving-key.md

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -42,6 +42,6 @@ To avoid data redundancy, it is recommended to manage this process into the targ
4242

4343
## Related patterns
4444

45-
* Design Pattern - Using Start, Process and End Dates
46-
* Design Pattern - Satellite
47-
* Design Pattern - Link
45+
* Design Pattern - Using Start, Process and End Dates.
46+
* Design Pattern - Satellite.
47+
* Design Pattern - Link.

docs/design-patterns/design-pattern-data-vault-link-satellite.md

Lines changed: 16 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -23,20 +23,30 @@ To provide a generic approach for loading Link Satellites.
2323

2424
This pattern is only applicable for loading data to Link-Satellite tables from:
2525

26-
* The Staging Area into the Integration Area.
26+
* The Landing Area into the Integration Area.
2727
* The Integration Area into the Interpretation Area.
28-
* The only difference to the specified ETL template is any business logic required in the mappings towards the Interpretation Area tables.
28+
* The only difference to the specified data logistics template is any business logic required in the mappings towards the Interpretation Area tables.
2929

3030
## Structure
3131

32-
Standard Link-Satellites use the Driving Key concept to manage the ending of old relationships.
32+
Standard Link-Satellites use the Driving Key concept to manage the ending of old relationships. The driving key defines which key in the Link controls history (for example, the transaction id) so that related attributes expire correctly when that driving key changes.
3333

3434
## Implementation guidelines
3535

36+
* Identify the driving key for the Link; use it to manage effective/expiry dates in the Link-Satellite.
37+
* Insert-only SCD2 pattern: close the current record (set expiry), insert a new record when the driving key/value combination changes.
38+
* Carry metadata from the Link (load timestamps, source identifiers) to keep lineage intact.
39+
* Use hash keys/checksums to detect attribute changes when applicable; avoid unnecessary updates.
40+
* Keep Link-Satellites narrow: store only relationship attributes (e.g., status, type, reason) and not Hub-level attributes.
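
The driving key guidelines above can be sketched in plain Python. The `LinkSatelliteRow` class, its field names and the `apply_relationship` function are hypothetical, and the in-memory list stands in for the Link-Satellite table; this is an illustration under those assumptions rather than the framework's implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LinkSatelliteRow:
    driving_key: str                    # e.g. the transaction id
    related_key: str                    # the key the driving key currently points to
    effective_from: str
    effective_to: Optional[str] = None  # None marks the current record


def apply_relationship(
    rows: List[LinkSatelliteRow],
    driving_key: str,
    related_key: str,
    timestamp: str,
) -> None:
    """Close the current record and add a new one if the relationship changed."""
    current = next(
        (r for r in rows if r.driving_key == driving_key and r.effective_to is None),
        None,
    )
    if current is not None:
        if current.related_key == related_key:
            return  # nothing changed, so no new record is needed
        current.effective_to = timestamp  # expire the old relationship
    rows.append(LinkSatelliteRow(driving_key, related_key, timestamp))


timeline: List[LinkSatelliteRow] = []
apply_relationship(timeline, "ORDER-1", "CUSTOMER-A", "2024-01-01T00:00:00")
apply_relationship(timeline, "ORDER-1", "CUSTOMER-A", "2024-01-02T00:00:00")  # unchanged
apply_relationship(timeline, "ORDER-1", "CUSTOMER-B", "2024-01-03T00:00:00")  # changed
```

After these three calls the timeline holds two records for `ORDER-1`: the relationship to `CUSTOMER-A`, expired at the third timestamp, and the current relationship to `CUSTOMER-B`.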
41+
3642
## Considerations and consequences
3743

44+
* Choosing the wrong driving key results in incorrect timelines; validate with business owners.
45+
* Link-Satellites are optional; if relationship attributes are stable or modeled elsewhere, they may be unnecessary.
46+
* Overuse of Link-Satellites can add joins and complexity; apply only when relationship history is required.
47+
3848
## Related patterns
3949

40-
* Design Pattern - Using Start, Process and End Dates
41-
* Design Pattern - Satellite
42-
* Design Pattern - Link
50+
* Design Pattern - Using Start, Process and End Dates.
51+
* Design Pattern - Satellite.
52+
* Design Pattern - Link.
