
Commit 71f5e6a

Updated landing & fundamental requirements patterns
1 parent 2465552 commit 71f5e6a

2 files changed: 72 additions, 15 deletions

Lines changed: 17 additions & 15 deletions
@@ -1,24 +1,26 @@
-# Design Pattern - Generic - Essential ETL Requirements
+# Design Pattern - Generic - Fundamental ETL Requirements

## Purpose
-The purpose of this Design Pattern is to define a set of minimal requirements every single ETL process (mapping, module or package) should conform to. These essential design guidelines impact how all ETL processes behave under the architecture of the Data Integration Framework.
+The purpose of this Design Pattern is to define a set of minimal requirements that every single data integration ('Extract, Transform, Load', or ETL) process (i.e. a mapping, module, package or data pipeline) should conform to. These essential design guidelines direct how all ETL processes behave under the architecture of the Data Integration Framework.

## Motivation
-Regardless of (place in the) architecture or purpose every ETL process should be created to follow a distinct set of base rules. Essential concepts such as having ETL processes check if they have already run before inserting duplicates or corrupting data makes testing, maintenance and troubleshooting a more straightforward task. The ultimate motivation is to develop ETL which cannot cause errors due to unplanned or unwanted execution. Essentially, ETL must be able to be run and re-run at any point in time to support a fully flexible scheduling and implementation.
+Regardless of its place in the architecture or its purpose, every data integration process should be created to follow a distinct set of base rules. Essential concepts such as making sure ETL processes check whether they have already run before inserting duplicates or corrupting data will simplify testing, maintenance and troubleshooting.

+The overarching motivation is to develop data integration processes that cannot cause errors due to unplanned or unwanted execution. Essentially, data integration processes must be able to be run and re-run at any point in time to support fully flexible scheduling and implementation.

## Applicability
-This Design Pattern applies to every ETL process.
+This Design Pattern applies to every data integration process.

## Structure
The requirements are as follows:
-* ETL processes contain transformation logic for a single specific function or purpose (atomicity). This design decision follow ETL best practices to create many ETL processes that each address a specific function as opposed to few ETL processes that perform a range of activities. Every ETL process attempts to execute an atomic functionality. Examples of atomic Data Warehouse processes (which therefore are implemented as separate ETL processes) are key distribution, detecting changes and inserting records.
-* Related to the above definition of designing ETL to suit atomic Data Warehouse processes, every ETL process can read from one or more sources but only write to a single target.
-* ETL processes detect whether they should insert records or not, i.e. should not fail on constraints. This is the design decision that ETL handles referential integrity and constraints (and not the RDBMS).
-* ETL can always be rerun without the need to manually change settings. This manifests itself in many ways depending on the purpose of the process (i.e. its place in the overall architecture). An ETL process that truncates a target table to load form another table is the most straightforward example since this will invariably run successfully every time. Another example is the distribution of surrogate keys by checking if keys are already present before they are inserted, or pre-emptively perform checksum comparisons to manage history. This requirement is also be valid for Presentation Layer tables which merge mutations into an aggregate. Not only does this requirement make testing and maintenance easier, it also ensures that no data is corrupted when an ETL process is run by accident.
-* Source data for any ETL can always be related to, or be recovered. This is covered by correctly implementing an ETL control / metadata framework and concepts such as the Persistent Staging Area. This metadata model covers the audit trail and its ability to follow data through the Data Warehouse while the Persistent Staging Area enables a new initial load in case of disaster or to reload (parts of) the Data Vault.
-* The direction of data is always ‘up’. The typical process of data is from a source, to Staging, Integration and ultimately Presentation. No regular ETL process should write data back to an earlier layer, or use access this information using a (key) lookup. This implies that the required information is always available in the same Layer of the reference architecture.
-* ETL processes must be able to process multiple intervals (changes) in one run. In other words, every time an ETL process runs it needs to process all data that it can (is available). This is an important requirement for ETL to be able to be run at any point in time and to support real-time processing. It means that ETL should not just be able to load a single snapshot or change for a single business keym but to correctly handle multiple changes in a single data set. For instance if the address of an employee changes multiple times during the day and ETL is run daily, all changes are still captures and correctly processed in a single run of the ETL process. This requirement also prevents ETL to be run many times for catch-up processing and makes it possible to easily change loading frequencies.
-* ETL processes should automatically recover /rollback when failed. This means that if an error has been detected the ETL automatically repairs the information from the erroneous run and inserts the correct data along with any new information that has been sourced.
+* **Atomicity**: ETL processes contain transformation logic for a single, specific function or purpose. This design decision follows ETL best practices by creating many ETL processes that each address a specific function, as opposed to a few ETL processes that perform a range of activities. Every ETL process executes a single piece of atomic functionality. Examples of atomic Data Warehouse processes (which are therefore implemented as separate ETL processes) are key distribution, detecting changes and inserting records.
+* An ETL process can read from one or more sources but **only write to a single target**. This further supports keeping ETL processes atomic, as outlined in the first bullet point.
+* The direction of data loading and selection is **always related to the target area or layer**. In other words, it is always 'upstream'. The typical processing of data is from a source to the Staging, Integration and ultimately Presentation layers. No regular ETL process should write data back to an earlier layer, or access information from an earlier layer using a (key) lookup. This implies that the required information is always available in the same layer of the reference architecture.
+* ETL processes **detect whether they should insert records or not**, i.e. they should not fail on constraints (a sketch of this follows the list). This reflects the design decision that ETL, and not the RDBMS, handles referential integrity and constraints. An ETL process that truncates a target table to load from another table is the most straightforward example, since this will invariably run successfully every time. Another example is the distribution of surrogate keys by checking if keys are already present before they are inserted, or by pre-emptively performing checksum comparisons to manage history. This requirement is also valid for Presentation Layer tables which merge mutations into an aggregate. Not only does this requirement make testing and maintenance easier, it also ensures that no data is corrupted when an ETL process is run by accident.
+* ETL processes should be able to **rerun without the need to manually change settings**. This manifests itself in many ways depending on the purpose of the process (i.e. its place in the overall architecture).
+* **Auditability**: source data for any ETL process can always be related back to, or recovered. This is covered by correctly implementing an ETL control / metadata framework and concepts such as the Persistent Staging Area. The ETL control framework metadata model covers the audit trail and the ability to follow data through the Data Warehouse, while the Persistent Staging Area enables a new initial load in case of disaster, or a reload of (parts of) the data solution.
+* ETL processes must be able to **process multiple intervals** (changes) in one run. In other words, every time an ETL process runs it needs to process all the data that is available. This is an important requirement for ETL to be able to be run at any point in time and to support near real-time processing. It means that ETL should not just be able to load a single snapshot or change for a single business key, but should correctly handle multiple changes in a single data set (see the second sketch after this list). For instance, if the address of an employee changes multiple times during the day and the ETL is run daily, all changes are still captured and correctly processed in a single run of the ETL process. This requirement also removes the need to run ETL many times for catch-up processing and makes it possible to easily change loading frequencies.
+* ETL processes should **automatically recover / roll back when they fail**. This means that if an error is detected, the ETL process automatically repairs the information from the erroneous run and inserts the correct data along with any new information that has been sourced since.
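
As an illustration of the 'detect whether to insert' requirement, the sketch below shows a key-distribution style load that only inserts business keys that are not already present in the target, so it can be re-run at any time without failing on constraints or creating duplicates. The table and column names (staging.CUSTOMER, integration.H_CUSTOMER, CUSTOMER_CODE) are assumptions purely for illustration, not part of the pattern itself.

```sql
-- Illustrative sketch only; table and column names are hypothetical.
-- Only business keys that do not yet exist in the target are inserted,
-- so the process can be re-run at any point in time without side effects.
INSERT INTO integration.H_CUSTOMER (CUSTOMER_CODE, LOAD_DATETIME, RECORD_SOURCE)
SELECT
    stg.CUSTOMER_CODE,
    MIN(stg.LOAD_DATETIME)  AS LOAD_DATETIME,
    MIN(stg.RECORD_SOURCE)  AS RECORD_SOURCE
FROM staging.CUSTOMER stg
WHERE NOT EXISTS
(
    SELECT 1
    FROM integration.H_CUSTOMER tgt
    WHERE tgt.CUSTOMER_CODE = stg.CUSTOMER_CODE
)
GROUP BY stg.CUSTOMER_CODE;
```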

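As a second sketch, the 'process multiple intervals' requirement can be supported by ordering all available changes per business key before applying them, rather than assuming a single change per key per run. Again, the names used below are hypothetical.

```sql
-- Illustrative sketch only; table and column names are hypothetical.
-- All changes per business key are ordered by their event date/time, so a
-- single run can correctly process any number of changes for the same key.
SELECT
    stg.CUSTOMER_CODE,
    stg.CUSTOMER_ADDRESS,
    stg.EVENT_DATETIME,
    ROW_NUMBER() OVER
    (
        PARTITION BY stg.CUSTOMER_CODE
        ORDER BY stg.EVENT_DATETIME ASC
    ) AS CHANGE_ORDER
FROM staging.CUSTOMER stg;
-- Downstream logic applies the changes in CHANGE_ORDER sequence, for example
-- to end-date and insert multiple history records in a single pass.
```
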
## Implementation guidelines
It is recommended to follow a 'sortable' folder structure, so that the containers / folders in which ETL processes are stored are visibly ordered in a way that represents the flow of data.
@@ -33,9 +35,9 @@ An example is as follows:
ETL processes are recommended to be placed in the directory/folder that they pull data _to_. For instance, the ETL logic for ‘Staging to History’ exists in the ‘150_History_Area’ folder and loads data from the ‘100_Staging_Area’.

## Consequences and considerations
-In some situations specific properties of the ETL process may seem overkill or perhaps even redundant. This (perceived) additional effort will have its impact on developing duration.
+In some situations, adding some of the listed functionality to ETL processes may seem like overkill, or perhaps even redundant. This (perceived) additional effort will have an impact on development duration.

-But in the context of maintaining a generic design (e.g. to support ETL generation and maintenance) this will still be necessary. Concessions may be made per architectural Layer (all ETL processes within a certain architecture step) but this is recommended to be motivated in the customised (i.e. project specific) Solution Architecture documentation.
+But in the context of maintaining a generic design (e.g. to support ETL generation and maintenance) this is still necessary. Concessions can be made per architectural layer (i.e. for all ETL processes within a certain architecture step). It is recommended to document any pattern exceptions in the Solution Architecture documentation.

## Related patterns
-In the various Design and Implementation Patterns where detailed ETL design for a specific task is documented the requirements in this pattern will be adhered to.
+All ETL-related patterns.

Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
+# Design Pattern - Generic - Loading Landing tables

+## Purpose
+This Design Pattern describes how to load data from source systems into the Landing Area of the Staging Layer.

+## Motivation
+A first step in data processing on the Data Platform is bringing operational ('source') data into the data platform environment. In some cases, data needs to be 'staged' first for further processing. This may happen, for example, when receiving full data copies or files.

+The intent is to ensure a data delta (differential) can be derived for further processing into the Persistent Staging Area. This pattern describes a consistent way to develop ETL for this purpose.

+Also known as:

+* Data Staging
+* Source to Staging

+## Applicability
+This pattern is only applicable for loading processes from source systems or files to the Landing Area (of the Staging Layer).

+## Structure
+The Landing Area is an optional area in the Data Platform architecture. If correct data deltas are already received, these can be written directly to the Persistent Staging Area (PSA) and can therefore bypass the Landing Area.

+However, it is possible to store data in the Landing Area even when a correct data delta is received. In this case, the delta will be overwritten once the next delta arrives (and the current delta has been successfully committed for further processing, such as into the PSA).

+The only reasons to use the Landing Area are for pre-processing required to derive a data delta, or when receiving data via flat files or equivalent.

+No data transformations are done in the Landing Area.

+The ETL process can be described as a truncate/insert process which copies all source data. The process essentially copies all data into the Landing table *as-is*, while at the same time adding the operational metadata attributes as specified by the ETL process control framework.
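
A minimal sketch of such a truncate/insert process is shown below, assuming a hypothetical landing.CUSTOMER table and a control framework that requires a load date/time and a record source attribute; the actual metadata attributes are defined by the ETL process control framework in use.

```sql
-- Illustrative sketch only; table names and metadata attributes are assumptions.
TRUNCATE TABLE landing.CUSTOMER;

INSERT INTO landing.CUSTOMER (CUSTOMER_CODE, CUSTOMER_NAME, LOAD_DATETIME, RECORD_SOURCE)
SELECT
    src.CUSTOMER_CODE,
    src.CUSTOMER_NAME,
    SYSUTCDATETIME()    AS LOAD_DATETIME,   -- event date/time set by the ETL process
    'SOURCE_SYSTEM_A'   AS RECORD_SOURCE    -- character string indicating the record source
FROM SOURCE_SYSTEM_A.dbo.CUSTOMER src;      -- source table copied as-is, no transformation
```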

+The key points of interest are:

+* Defining the correct event date/time as part of the ETL process
+* Using a character string to indicate the record source
+* Performing data type streamlining (a sketch follows this list):
+  * Text fields of 100 characters or less will be mapped to NVARCHAR(100)
+  * Text fields of more than 100 but at most 1000 characters will be mapped to NVARCHAR(1000)
+  * All remaining text attributes will be mapped to NVARCHAR(4000)
+  * Dates, times and datetimes will be mapped to DATETIME2(7)
+  * All decimals or numeric values will be mapped to NUMBER(38,20)
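
As a sketch of what this data type streamlining can look like in a Landing table definition (column names are hypothetical; NUMERIC(38,20) is used here as a generic stand-in for the NUMBER(38,20) listed above):

```sql
-- Illustrative sketch only; column names are hypothetical.
CREATE TABLE landing.CUSTOMER
(
    CUSTOMER_CODE    NVARCHAR(100),   -- source text of 100 characters or less
    CUSTOMER_COMMENT NVARCHAR(1000),  -- source text > 100 and <= 1000 characters
    CUSTOMER_NOTES   NVARCHAR(4000),  -- all remaining text attributes
    DATE_OF_BIRTH    DATETIME2(7),    -- dates, times and datetimes
    ACCOUNT_BALANCE  NUMERIC(38,20),  -- decimals / numeric values
    LOAD_DATETIME    DATETIME2(7),    -- operational metadata attributes
    RECORD_SOURCE    NVARCHAR(100)
);
```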

+## Implementation Guidelines

+* Use a single ETL process, module or mapping to load data from a single source system table into the corresponding Landing table.
+* A control table, parameter or restartable sequence in a mapping can be used to generate the Source Row ID numbers (a sketch follows this list).
+* The data type conversion has many uses (as detailed in the Data Integration Framework Staging Layer document); most notably limiting the variety of data types in the Integration Layer and creating a buffer against changes in the source system.
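
For example, a restartable sequence could provide the Source Row ID values during the Landing insert. The object names below are assumptions for illustration only.

```sql
-- Illustrative sketch only; object names are assumptions.
CREATE SEQUENCE landing.CUSTOMER_SOURCE_ROW_ID START WITH 1 INCREMENT BY 1;

-- Optionally restart the sequence at the beginning of each load.
ALTER SEQUENCE landing.CUSTOMER_SOURCE_ROW_ID RESTART WITH 1;

-- Use the sequence while loading the Landing table.
INSERT INTO landing.CUSTOMER (SOURCE_ROW_ID, CUSTOMER_CODE, CUSTOMER_NAME)
SELECT
    NEXT VALUE FOR landing.CUSTOMER_SOURCE_ROW_ID,
    src.CUSTOMER_CODE,
    src.CUSTOMER_NAME
FROM SOURCE_SYSTEM_A.dbo.CUSTOMER src;
```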

+## Considerations and Consequences

+* Resolving the Record Source ID will be done in the Integration Layer, because disk space is less of an issue in the Staging Layer.
+* Adding key lookups in the Staging Area would also overcomplicate the ETL design and negatively impact performance. Alternatively, the identifier can be hard-coded instead of the Source System name (as the RECORD_SOURCE). This removes the need for the key lookup, but reduces visibility over the data.
+* For Staging Area ETL processes that use a CDC-based source, an extra step is added to control the CDC deltas (using a load window table). This is explained in the ‘Using CDC’ Design Pattern and subsequent Implementation Patterns.

+## Related Patterns

+* Design Pattern 003 – Mapping requirements
+* Design Pattern 006 – Using Start, Process and End dates
+* Design Pattern 016 – Delta calculation
+* Design Pattern 021 – Using CDC
