Commit 7a34c07

Update Design Pattern - Generic - Fundamental ETL Requirements.md
1 parent 1d1f3bf commit 7a34c07

File tree

1 file changed: 1 addition, 1 deletion


1000_Design_Patterns/Design Pattern - Generic - Fundamental ETL Requirements.md

Lines changed: 1 addition & 1 deletion
@@ -17,7 +17,7 @@ The requirements are as follows:
* ETL can always be rerun without the need to manually change settings. This manifests itself in many ways depending on the purpose of the process (i.e. its place in the overall architecture). An ETL process that truncates a target table to load from another table is the most straightforward example, since this will invariably run successfully every time. Another example is the distribution of surrogate keys by checking if keys are already present before they are inserted, or pre-emptively performing checksum comparisons to manage history. This requirement is also valid for Presentation Layer tables which merge mutations into an aggregate. Not only does this requirement make testing and maintenance easier, it also ensures that no data is corrupted when an ETL process is accidentally run.
* Source data for any ETL can always be related to, or recovered. This is covered by correctly implementing an ETL control / metadata framework and concepts such as the Persistent Staging Area. The metadata model covers the audit trail and the ability to follow data through the Data Warehouse, while the Persistent Staging Area enables a new initial load in case of disaster, or a reload of (parts of) the Data Vault.
* The direction of data is always ‘up’. The typical flow of data is from a source, to Staging, Integration and ultimately Presentation. No regular ETL process should write data back to an earlier layer, or access this information using a (key) lookup. This implies that the required information is always available in the same Layer of the reference architecture.
-* ETL processes must be able to process multiple intervals (changes) in one run. This is an important requirement for ETL to be able to be run at any point in time and to support real-time processing. It means that ETL should not only be able to load a single snapshot or change for a single business key but to correctly handle multiple changes in a single data set. For instance if the address of an employee changes multiple times during the day and ETL is run daily, all changes are still captures and correctly processed in a single run of the ETL process. This requirement also prevents ETL to be run many times for catch-up processing and makes it possible to easily change loading frequencies.
+* ETL processes must be able to process multiple intervals (changes) in one run. In other words, every time an ETL process runs it needs to process all the data that is available to it. This is an important requirement for ETL to be able to be run at any point in time and to support real-time processing. It means that ETL should not just be able to load a single snapshot or change for a single business key, but should correctly handle multiple changes in a single data set. For instance, if the address of an employee changes multiple times during the day and ETL is run daily, all changes are still captured and correctly processed in a single run of the ETL process. This requirement also removes the need to run ETL many times for catch-up processing and makes it possible to easily change loading frequencies.
* ETL processes should automatically recover / roll back when they fail. This means that if an error has been detected, the ETL process automatically repairs the information from the erroneous run and inserts the correct data along with any new information that has been sourced since.

## Implementation guidelines
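
The rerunnability requirement above (keys are only distributed when not already present) can be sketched as a key check before insert. A minimal illustration in Python, using a hypothetical in-memory hub table keyed on business key — all names are assumptions for illustration, not part of the pattern document:

```python
# Sketch of a rerunnable (idempotent) key-distribution step: surrogate keys
# are only handed out for business keys that are not already present, so an
# accidental rerun cannot corrupt the target. All names are illustrative.

def load_keys(target: dict, source_keys: list) -> dict:
    """Insert surrogate keys for new business keys only; safe to rerun."""
    next_key = max(target.values(), default=0) + 1
    for business_key in source_keys:
        if business_key not in target:  # the key check makes the step idempotent
            target[business_key] = next_key
            next_key += 1
    return target

hub = {}
load_keys(hub, ["CUST-1", "CUST-2"])
load_keys(hub, ["CUST-1", "CUST-2"])  # accidental rerun: no duplicates, no change
```

The same shape applies to checksum-based history management: compare before you insert, so the process converges to the same result however often it runs.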

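The multi-interval requirement changed in this commit can also be illustrated with a small sketch: every change for a business key in the delta set is ordered and applied in one pass, rather than assuming at most one change per key per run. The employee/address data below is hypothetical, matching the example in the requirement text:

```python
from itertools import groupby

# Hypothetical delta set: (business_key, change_timestamp, address) rows,
# possibly containing several changes for the same key in a single run.
delta = [
    ("EMP-1", "2024-01-01T09:00", "Oak St 1"),
    ("EMP-1", "2024-01-01T17:00", "Elm St 2"),   # second change, same day
    ("EMP-2", "2024-01-01T10:00", "Main St 9"),
]

def process_run(delta):
    """Apply every interval per key in timestamp order; keep all versions."""
    history = []
    ordered = sorted(delta, key=lambda r: (r[0], r[1]))  # groupby needs sorted input
    for key, rows in groupby(ordered, key=lambda r: r[0]):
        for _, ts, address in rows:  # all changes handled in one run
            history.append({"key": key, "from": ts, "address": address})
    return history

history = process_run(delta)
# Both EMP-1 changes are captured, in order, in a single run.
```

Because the process always consumes everything available, catch-up processing is just a normal run, and the loading frequency can be changed without touching the ETL logic.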