docs/design-patterns/design-pattern-data-vault-hub.md
Lines changed: 1 addition & 1 deletion
@@ -57,7 +57,7 @@ By default the DISTINCT function is executed on database level to reserve resour
The logic to create the initial (dummy) Satellite record can be implemented in one of three ways: as part of the Hub ETL process, as a separate ETL process which queries all keys that have no corresponding dummy, or as part of the Satellite ETL process. The choice depends on the capabilities of the ETL software, since not all tools can provide and reuse sequence generators or write to multiple targets in one process. The default and arguably most flexible way is to incorporate this concept into the Satellite ETL, since it does not require rework when additional Satellites are associated with the Hub. This means that each Satellite ETL must check whether a dummy record exists before starting the standard process (and be able to roll back the dummy records if required).
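As a rough illustration of the Satellite-side check described above, the following sketch inserts a dummy row for every Hub key that lacks one, inside a transaction that rolls back on failure. The table names (`hub_customer`, `sat_customer`), the columns, and the `1900-01-01` marker for dummy rows are all hypothetical and are not taken from the pattern itself; SQLite is used only so the example runs standalone.

```python
import sqlite3

# Assumed marker for the dummy record; the real convention may differ.
DUMMY_EFFECTIVE = "1900-01-01 00:00:00"


def ensure_dummy_records(conn: sqlite3.Connection) -> None:
    """Insert a dummy Satellite row for every Hub key that lacks one."""
    with conn:  # commits on success, rolls back if the insert raises
        conn.execute(
            """
            INSERT INTO sat_customer (customer_sk, effective_datetime, name)
            SELECT h.customer_sk, ?, 'Unknown'
            FROM hub_customer AS h
            WHERE NOT EXISTS (
                SELECT 1
                FROM sat_customer AS s
                WHERE s.customer_sk = h.customer_sk
                  AND s.effective_datetime = ?
            )
            """,
            (DUMMY_EFFECTIVE, DUMMY_EFFECTIVE),
        )


# Hypothetical Hub and Satellite tables with two keys and one real record.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hub_customer (customer_sk INTEGER PRIMARY KEY, customer_code TEXT)"
)
conn.execute(
    "CREATE TABLE sat_customer (customer_sk INTEGER, effective_datetime TEXT, "
    "name TEXT, PRIMARY KEY (customer_sk, effective_datetime))"
)
conn.executemany("INSERT INTO hub_customer VALUES (?, ?)", [(1, "C001"), (2, "C002")])
conn.execute("INSERT INTO sat_customer VALUES (1, '2024-01-01 00:00:00', 'Alice')")
conn.commit()

ensure_dummy_records(conn)
ensure_dummy_records(conn)  # second run finds nothing left to insert
sat_row_count = conn.execute("SELECT COUNT(*) FROM sat_customer").fetchone()[0]
```

Because the check is driven by the Hub keys rather than by a fixed list, the same function would apply unchanged to any additional Satellite attached to the Hub, which is the flexibility argument made above.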
-
When modeling the Hub tables try to be conservative when defining the business keys. Not every foreign key in the source indicates a business key and therefore a Hub table. A true business key is a concept that is known and used throughout the organisation (and systems) and is �self-standing� and meaningful.
+
When modeling the Hub tables try to be conservative when defining the business keys. Not every foreign key in the source indicates a business key and therefore a Hub table. A true business key is a concept that is known and used throughout the organisation (and systems) and is self-standing and meaningful.
To cater for a situation where multiple Load Date / Time stamp values exist for a single business key, the minimum Load Date / Time stamp should be the value passed through with the HUB record. This can be implemented in ETL logic, or passed through to the database. When implemented at a database level, instead of using a SELECT DISTINCT, using the MIN function with a GROUP BY the business key can achieve both a distinct selection, and minimum Load Date / Time stamp in one step.
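The database-level variant can be sketched as follows. The staging table `stg_customer` and its columns are hypothetical, and SQLite stands in for whatever database the ETL runs against; the point is only that `MIN` with `GROUP BY` yields one row per business key together with its earliest Load Date / Time stamp.

```python
import sqlite3

# Hypothetical staging table with duplicate business keys.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stg_customer (customer_code TEXT, load_datetime TEXT)")
conn.executemany(
    "INSERT INTO stg_customer VALUES (?, ?)",
    [
        ("C001", "2024-01-02 00:00:00"),
        ("C001", "2024-01-01 00:00:00"),
        ("C002", "2024-01-03 00:00:00"),
    ],
)

# GROUP BY collapses duplicates (replacing SELECT DISTINCT) and MIN picks
# the earliest load timestamp for each business key in the same pass.
hub_rows = conn.execute(
    """
    SELECT customer_code, MIN(load_datetime) AS load_datetime
    FROM stg_customer
    GROUP BY customer_code
    ORDER BY customer_code
    """
).fetchall()

print(hub_rows)
# [('C001', '2024-01-01 00:00:00'), ('C002', '2024-01-03 00:00:00')]
```

The timestamps here are ISO-8601 strings, so their text ordering matches chronological ordering; a database with a native timestamp type would not need that assumption.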
docs/design-patterns/design-pattern-data-vault-missing-keys-and-placeholders.md
Lines changed: 8 additions & 8 deletions
@@ -19,22 +19,22 @@ This pattern is only applicable for loading data into the Integration Area table
## Structure
-
The Enterprise Data Warehouse architecture specifies that �hard� business rules are implemented on the way into the Data Warehouse (the process from the Staging Area into the Integration Area) whereas �soft� business rules are implemented from the Integration Layer to the Interpretation Area and/or the Presentation Layer (on the way out).
-
Using placeholders is a �hard� business rule because no-one can interpret the meaning of a NULL value. SQL cannot deal with NULL values very well and because of this allowing NULL values increases the complexity of the queries against the Integration Area (potentially using outer joins). This is the reason why NULL values are remapped on the way into the Integration Area and ultimately why this kind of (hard) business logic is allowed here.
+
The Enterprise Data Warehouse architecture specifies that hard business rules are implemented on the way into the Data Warehouse (the process from the Staging Area into the Integration Area) whereas soft business rules are implemented from the Integration Layer to the Interpretation Area and/or the Presentation Layer (on the way out).
+
Using placeholders is a hard business rule because no-one can interpret the meaning of a NULL value. SQL cannot deal with NULL values very well and because of this allowing NULL values increases the complexity of the queries against the Integration Area (potentially using outer joins). This is the reason why NULL values are remapped on the way into the Integration Area and ultimately why this kind of (hard) business logic is allowed here.
For example, here are some reasons how NULL values can be presented instead of business keys:
-
The source declares them as optional Foreign Keys; for instance when �X� is true, then the business key is populated. Otherwise the business key remains NULL.
+
The source declares them as optional Foreign Keys; for instance when X is true, then the business key is populated. Otherwise the business key remains NULL.
The source declares them as required but the declaration is broken or not enforced (there is an error in the source application that allows NULLS when it shouldn't).
Implementation guidelines
-
NULL/unknown/undefined business key values can be mapped to various placeholder surrogate key values (-1 to -7 surrogate key values) with descriptions like �Not Applicable�, �Unknown� or anything that fits the business key domain. The taxonomy usable for most situations is (not all values are applicable in all situations):
-
Missing (-1): the root node and supertype of all �missing� information, it encompasses:
-
Missing value (-2): supertype of all missing values. Can be �Unknown� or �Not Applicable�:
+
NULL/unknown/undefined business key values can be mapped to various placeholder surrogate key values (-1 to -7 surrogate key values) with descriptions like Not Applicable, Unknown or anything that fits the business key domain. The taxonomy usable for most situations is (not all values are applicable in all situations):
+
Missing (-1): the root node and supertype of all missing information, it encompasses:
+
Missing value (-2): supertype of all missing values. Can be Unknown or Not Applicable:
Not Applicable (-3).
Unknown (-4).
Missing Attribute/Column (-5): supertype of all missing values due to missing attributes:
Missing Source Attribute (Non recordable Source) (-6). Used when source fails to supply attribute/column
Missing Target Attribute (Non recordable DWH Attribute) (-7). Used for temporal data that falls before the deployment of the attribute.
-
Deciding between the various types of �unknown� is a business question that is decided based on how the source database works.
+
Deciding between the various types of unknown is a business question that is decided based on how the source database works.
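The placeholder taxonomy above can be written down as a small lookup structure. This is a minimal sketch: the surrogate key values and their hierarchy come from the list above, while the names `Placeholder`, `PARENT`, `is_a`, and `resolve_hub_key` are hypothetical. Which placeholder a NULL maps to is left as a parameter, since that choice is described as a business decision.

```python
from enum import IntEnum
from typing import Dict, Optional


class Placeholder(IntEnum):
    """Placeholder surrogate keys, values as listed in the taxonomy."""
    MISSING = -1
    MISSING_VALUE = -2
    NOT_APPLICABLE = -3
    UNKNOWN = -4
    MISSING_ATTRIBUTE = -5
    MISSING_SOURCE_ATTRIBUTE = -6
    MISSING_TARGET_ATTRIBUTE = -7


# Supertype of each node; MISSING is the root and has no parent.
PARENT: Dict[Placeholder, Placeholder] = {
    Placeholder.MISSING_VALUE: Placeholder.MISSING,
    Placeholder.NOT_APPLICABLE: Placeholder.MISSING_VALUE,
    Placeholder.UNKNOWN: Placeholder.MISSING_VALUE,
    Placeholder.MISSING_ATTRIBUTE: Placeholder.MISSING,
    Placeholder.MISSING_SOURCE_ATTRIBUTE: Placeholder.MISSING_ATTRIBUTE,
    Placeholder.MISSING_TARGET_ATTRIBUTE: Placeholder.MISSING_ATTRIBUTE,
}


def is_a(node: Placeholder, supertype: Placeholder) -> bool:
    """Return True if node equals supertype or descends from it."""
    current: Optional[Placeholder] = node
    while current is not None:
        if current == supertype:
            return True
        current = PARENT.get(current)
    return False


def resolve_hub_key(
    business_key: Optional[str],
    hub_lookup: Dict[str, int],
    when_null: Placeholder = Placeholder.UNKNOWN,  # assumed default, a business choice
) -> int:
    """Map a business key to its surrogate key, remapping NULL to a placeholder."""
    if business_key is None:
        return int(when_null)
    return hub_lookup[business_key]
```

For example, a source that declares the key as an optional Foreign Key might call `resolve_hub_key(None, lookup, Placeholder.NOT_APPLICABLE)`, while a source with a broken required constraint might keep the `UNKNOWN` default.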
## Considerations and Consequences
The Hubs must be pre-populated with the placeholder values (records).
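One way to satisfy this requirement is an idempotent seeding step that runs before any Hub load. The sketch below assumes a hypothetical `hub_customer` table and uses the placeholder description as the business key value; neither is prescribed by the pattern. `INSERT OR IGNORE` is SQLite syntax, so another database would need its own equivalent.

```python
import sqlite3

# Surrogate key and description for each placeholder in the taxonomy.
PLACEHOLDERS = [
    (-1, "Missing"),
    (-2, "Missing value"),
    (-3, "Not Applicable"),
    (-4, "Unknown"),
    (-5, "Missing Attribute/Column"),
    (-6, "Missing Source Attribute"),
    (-7, "Missing Target Attribute"),
]


def prepopulate_hub(conn: sqlite3.Connection) -> None:
    """Seed the Hub with placeholder rows; safe to run repeatedly."""
    conn.executemany(
        "INSERT OR IGNORE INTO hub_customer (customer_sk, customer_code) "
        "VALUES (?, ?)",
        PLACEHOLDERS,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hub_customer ("
    "customer_sk INTEGER PRIMARY KEY, customer_code TEXT NOT NULL UNIQUE)"
)

prepopulate_hub(conn)
prepopulate_hub(conn)  # rerun is a no-op because the keys already exist
placeholder_count = conn.execute("SELECT COUNT(*) FROM hub_customer").fetchone()[0]
```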
@@ -44,4 +44,4 @@ Known uses
This type of ETL process is to be used in all Hub or Surrogate Key tables in the Integration Area. The Interpretation Area Hub tables, if used, have similar characteristics but the ETL process contains business logic.
## Related Patterns
-
Design Pattern 008 � Data Vault � Loading Hub tables.