Commit f51d684

Cleanup again

1 parent db217e0 commit f51d684

6 files changed: +172 −184 lines changed

Data Integration Framework - Reference Solution Architecture - 1 - Overview.md

Lines changed: 45 additions & 179 deletions
Large diffs are not rendered by default.

Data Integration Framework - Reference Solution Architecture - 2 - Staging Layer.md

Lines changed: 21 additions & 3 deletions
@@ -1,6 +1,6 @@
 # Staging Layer overview

-The Staging Layer covers the first major series of ETL process steps within the Data Warehouse reference architecture. The processes involved with the Staging Layer introduce data from many (often disparate) source applications into the Data Warehouse environment. In this sense the Staging Layer is for the most part literally a place where the data is collected onto the Data Warehouse environment before being integrated in the core Data Warehouse or loaded for other use-cases (i.e. analytics, ad-hoc reporting).
+The Staging Layer covers the first series of ETL process steps within the reference architecture. The processes involved with the Staging Layer introduce data from many (often disparate) source applications into the Data Warehouse environment. In this sense the Staging Layer is largely the place where data is collected into the Data Warehouse environment before being integrated into the core Data Warehouse or loaded for other use cases (e.g. analytics, ad-hoc reporting).

 But even then many fundamental decisions are required that have repercussions throughout the rest of the design. This document defines the Staging Layer and describes the required process steps and available solutions.

@@ -12,6 +12,20 @@ The position of the Staging Layer in the overall architecture is outlined in the

 ![1547519184139](./Images/Staging_Layer_1_Overview.png)

+## Staging Layer
+
+The Staging Layer consists of the **Staging Area** and the **Persistent Staging Area**. The main purpose of this layer is to collect source data and optionally store it in a source data archive. The Staging Layer collects and prepares data for further processing into the Integration Layer.
+
+The Staging Area within the Staging Layer streamlines data types and loads source data into the Data Warehouse environment. This is done by utilising different Change Data Capture (CDC) techniques depending on the source system, files or the options / restrictions of the available technology. Another important role for the Staging Area is the correct definition of time in the Data Warehouse. Depending on the type of source and interface dynamics, extreme care has to be taken to ensure timelines are set up correctly for proper management of historical information in the subsequent steps.
+
+An option in the Data Warehouse design is to load the source data delta into a History Area. Here the data is stored in the structure of the providing source, but changes are tracked using the Slowly Changing Dimensions (SCD type 2) mechanism. The History Area is an important component in Disaster Recovery (DR) and re-initialisation of data (initial loads). When Change Data Capture, Change Tracking or messaging sources are part of the design the addition of a History Area is strongly recommended. A History Area can also be used for a Full Outer Join comparison against the source system and/or a full data dump interface.
+
+Objects in the Staging Layer are not accessible to end-users or Business Intelligence and analytics software (e.g. Cognos), because in most scenarios the information has not yet been prepared for consumption. There is an exception to this rule: for specific data mining or statistical analysis it is often preferable for analysts to access the raw / unprocessed data. Access can therefore be granted to the Staging Layer, which contains essentially raw, time variant data. Allowing this access also serves a purpose in prototyping and local self-service BI / visualisation.
+
 The Staging Layer, or the process from source to staging, consists of two separate parts (areas):

 * The Staging Area, and
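The SCD type 2 change-tracking behaviour described for the History Area can be sketched as follows. This is a minimal illustration, not part of the original document; the column names (`effective_date`, `expiry_date`) and the `9999-12-31` high date are assumptions for the example.

```python
from datetime import datetime

HIGH_DATE = "9999-12-31"  # open-ended expiry marking the current record version


def apply_scd2(history, delta, key, now=None):
    """Merge a source data delta into an SCD type 2 history (lists of dicts).

    A changed record closes (expires) the current version and inserts a new
    one; unchanged records are left untouched, so full history is preserved.
    """
    now = now or datetime.now().strftime("%Y-%m-%d")
    # Index the currently open version per business key
    current = {r[key]: r for r in history if r["expiry_date"] == HIGH_DATE}
    for row in delta:
        existing = current.get(row[key])
        attrs = {k: v for k, v in row.items() if k != key}
        if existing is None or any(existing.get(k) != v for k, v in attrs.items()):
            if existing is not None:
                existing["expiry_date"] = now  # close the previous version
            new_version = {key: row[key], **attrs,
                           "effective_date": now, "expiry_date": HIGH_DATE}
            history.append(new_version)
            current[row[key]] = new_version
    return history
```

Because old versions are only closed, never overwritten, the History Area can replay any point in time for re-initialisation or Disaster Recovery.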
@@ -66,6 +80,10 @@ The PSA provides these benefits at the cost of extra disk space, ETL development

 Due to the generic nature of this design, these ETL processes and table structures can (depending on the ETL software used) be created using development patterns / automation.

+### Principles
+
+The Staging Layer always uses the same structure as the providing operational system, but all attributes are nullable to avoid load errors.
+
 ### Implementing Change Data Capture

 The way data is loaded into the staging tables depends very much on the source system, company guidelines and general availability. Different ways of approaching change data capture and acquiring data in general are:
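One of the change data capture approaches mentioned in the document is a Full Outer Join comparison of a full data dump against the previous snapshot. A minimal sketch of that comparison (illustrative only; record layout and key names are assumptions):

```python
def detect_delta(previous, latest, key):
    """Full outer join comparison of two full data dumps (lists of dicts),
    keyed on the business key; classifies rows as inserts, updates or deletes."""
    prev = {r[key]: r for r in previous}
    curr = {r[key]: r for r in latest}
    inserts = [curr[k] for k in curr.keys() - prev.keys()]   # only in latest
    deletes = [prev[k] for k in prev.keys() - curr.keys()]   # only in previous
    updates = [curr[k] for k in curr.keys() & prev.keys()    # in both, changed
               if curr[k] != prev[k]]
    return inserts, updates, deletes
```

The same comparison can be run against a History Area instead of a retained snapshot, which is one reason the document recommends one for full data dump interfaces.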
@@ -163,7 +181,7 @@ The following is a list of conventions for the Staging Area:
 * Source to Staging Area ETL processes use the truncate/insert load strategy. When delta detection is handled by the DWH (i.e. using a Full Outer Join) a Landing Area table can be incorporated.
 * Everything is copied as-is; no transformations are done other than formatting data types. The Staging Area processing may never lead to errors!

-# The Persistent Staging
+# Persistent Staging Area

 The structure of the PSA is the same as the Staging Area (including the metadata attributes). The following attributes are mandatory for the PSA tables:

@@ -191,7 +209,7 @@ Note: there are other suitable approaches towards a PSA. Depending on the requir

 When loading the data delta directly into the PSA (i.e. when the Staging Area is not adopted) the same rules apply as for the Staging Area.

-## 4.1 Persistent Staging Area development guidelines
+## Persistent Staging Area development guidelines

 The following is a list of development conventions for the Persistent Staging Area (PSA):
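The truncate/insert convention for the Staging Area mentioned above can be sketched as follows. The metadata attribute names (`LOAD_DATETIME`, `RECORD_SOURCE`) are assumptions for illustration, since the mandatory attribute list itself is not shown in this diff.

```python
from datetime import datetime, timezone


def load_staging(source_rows, staging_table):
    """Truncate/insert pattern for a Staging Area table: everything is copied
    as-is (no transformations), only metadata attributes are added."""
    staging_table.clear()  # truncate before every load
    load_dts = datetime.now(timezone.utc).isoformat()
    for row in source_rows:
        staging_table.append({**row,
                              "LOAD_DATETIME": load_dts,    # when it arrived
                              "RECORD_SOURCE": "SRC_CRM"})  # hypothetical source code
    return staging_table
```

Because nothing is transformed beyond formatting, this step can never reject records, matching the convention that Staging Area processing may never lead to errors.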

Data Integration Framework - Reference Solution Architecture - 3 - Integration Layer.md

Lines changed: 26 additions & 0 deletions
@@ -34,6 +34,32 @@ The intention of the Interpretation Area is to reduce replication of the rules i

 In separating the state of the data into two physical forms (original and modified), it gives the flexibility of applying multiple sets of rules to a specific data set, in order to suit specific needs.

+## Integration Layer
+
+The Integration Layer consists of the **Raw Data Vault** area and the **Business Data Vault** area. The main purpose of this layer is to function as the ‘core Data Warehouse layer’ where all the data is collected in a normalised Data Warehouse model. To achieve optimal flexibility and error handling, business rules are implemented as late as possible in the ETL process.
+
+The Raw Data Vault stores the source data in the core Data Warehouse model without changing the contents. The system collects the data from all source systems in a generic way which is suitable for further expansion. The main Data Warehouse functionalities, such as surrogate key distribution, storing history and maintaining relationships, are handled in this area.
+
+The Business Data Vault uses the same modelling standards as the Raw Data Vault but provides interpretations or alternate views on the granular data. Both areas link closely to each other and in most cases the Business Data Vault provides separate cleaned or changed instances of tables that already exist in the Raw Data Vault.
+
+The Business Data Vault is not a full copy of the Raw Data Vault. In most cases the Interpretation Area tables will refer to Integration Area surrogate key tables and provide an alternative perspective on Integration Area historical tables.
+
+Examples of logic that can be applied in the Business Data Vault are generic business rules such as de-duplication or determining a single customer view. Additionally, the Business Data Vault is also used to design cross-references between similar datasets from different source systems. These cross-references are essentially recursive or intersection entities between business entities in the Raw Data Vault, but contain (business) rules to identify the main keys.
+
+The important factor is that in this layer, business rules that alter the contents of the data are not yet applied. In the case of derivations, for example in the Business Data Vault, this means the original values will always need to stay available. Also, records are not checked for errors, to keep the system as flexible as possible towards the Information Marts.
+
+The Integration Layer (Integration Area and Interpretation Area) will be created using a **Data Vault 2.0** model, which decouples key distribution using main entities (Hubs) but de-normalises reference information (Satellites) for these entities. Relationships between the main entities (Links) can be managed and tracked over time. This is a loosely-coupled data modelling approach which reduces the dependencies and timing issues that are expected to occur in the data delivery.
+
+As an example, this approach allows information related to the same customer or prospect to be delivered and integrated independently. It also supports ongoing linking of customer information to tie various elements of information to the unique prospect or customer over time without losing flexibility; the logic for de-duplication can be changed and/or recalculated across historical information if required.
+
+![1547521558900](./Images/558900.png)
+
+Objects in the Integration Layer are not accessible to end-users or Business Intelligence and analytics software, because in most scenarios the information has not yet been prepared for consumption; only Data Warehouse logic is implemented. There is an exception to this rule: for specific data mining or statistical analysis it is often preferable for analysts to access the raw / unprocessed data. Access can therefore be granted to the Integration Layer, which contains essentially raw, but indexed and time variant data in the right context (e.g. related to the correct business keys). This is an ideal structure for statistical analysis.
+
+## Principles
+
+The Integration Layer can be modelled using a hybrid (Data Vault, Anchor Modelling) technique. For the Enterprise Data Warehouse, which integrates many sources and is subject to change, a Data Vault 2.0 approach is adopted.
+
 # The Integration Area

 The Integration Area is modelled differently from the Staging Area. In the Integration Area data is divided into common entities which form the core of the Data Warehouse model. Various modelling techniques can be applied for the Integration Layer (3NF, Data Vault, Anchor) as long as the same technique is used for both areas. Regardless of the approach the tables in the Integration Area are either:
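The Hub / Link / Satellite separation of the Data Vault 2.0 model described above can be sketched minimally. All entity and attribute names here are illustrative assumptions, as is the use of an MD5 hash for key distribution; this is not the document's own implementation.

```python
import hashlib


def hash_key(*business_keys):
    """Data Vault 2.0 style surrogate key: a hash of the business key(s)."""
    normalised = "||".join(str(k).strip().upper() for k in business_keys)
    return hashlib.md5(normalised.encode()).hexdigest()


hub_customer = {}           # Hub: hash key -> unique business key
sat_customer = []           # Satellite: descriptive attributes over time
lnk_customer_order = set()  # Link: relationships between hub keys


def load_customer(business_key, load_date, **attributes):
    """Load a customer into the Hub (insert once) and Satellite (track history)."""
    hk = hash_key(business_key)
    hub_customer.setdefault(hk, business_key)
    sat_customer.append({"hk": hk, "load_date": load_date, **attributes})
    return hk


def link_customer_order(customer_key, order_key):
    """Record a customer-order relationship, keyed by a combined hash."""
    lnk_customer_order.add((hash_key(customer_key, order_key),
                            hash_key(customer_key), hash_key(order_key)))
```

Because the Hub only holds keys and the Satellite holds history, information about the same customer can arrive from different sources independently, as the document notes.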

Data Integration Framework - Reference Solution Architecture - 4 - Presentation Layer.md

Lines changed: 35 additions & 1 deletion
@@ -24,6 +24,16 @@ As documented in the solution architecture overview, data from the Integration L

 The most important aspect of the Presentation Layer is that it inherits its Data Warehouse keys from the Integration Layer, thus enabling backtracking of information. The detailed (raw or cleaned) data from the Integration Layer is still applicable to the keys in the Presentation Layer.

+The Presentation Layer consists of the **Helper Area** and the **Reporting Structure Area**. This layer provides the data in a structure that is suitable for reporting and applies any specific business logic. By design, information can be provided in any format and/or historical view, since the presentation itself is decoupled from the core data store. Where the Integration Layer focuses on optimally storing everything that happens to the data (managing the data itself) and its relationships, the Presentation Layer combines these relationships to form Facts and Dimensions. Since historical information is maintained in the previous layer, these structures can easily be changed or re-deployed. Deriving dimensional models from a properly structured Integration Layer is very straightforward, and development is made easy because templates are provided and both facts and dimensions can be emptied (truncated) and reloaded at any point in time without losing information.
+
+The Helper Area of the Presentation Layer is an optional area where semi-aggregates or other useful tables can be stored to simplify or speed up processing. These types of tables are usually added either for performance reasons or from the wish to implement the same business logic in as few places as possible. Helper tables can be modelled in any way as long as they benefit the Reporting Structure Area. They are not accessible by users or front-end reporting and analysis software.
+
+By thoughtfully creating aggregate tables which can be shared by the Information Marts one could, for instance, create a fact table at a certain aggregate level and have different Information Marts aggregate this table further depending on their needs. This way the business logic and performance-demanding calculations only have to be done once.
+
+The Reporting Structure Area is the final part of the reference architecture. An Information Mart is modelled for a specific purpose, audience and technical requirement. The complete Data Warehouse can contain very different Information Marts with different models and different ‘versions of the truth’ depending on the business needs.
+
+In the process of loading the data from the Integration Layer into the Presentation Layer most of the business logic is implemented.
+
 ## Load strategies

 ### Loading from Integration to Presentation
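The shared-aggregate idea described above (one Helper Area fact table that different Information Marts aggregate further) can be illustrated with a small sketch. The grain (month, region, product) and the measure name are hypothetical.

```python
from collections import defaultdict


def aggregate(rows, group_by, measure):
    """Aggregate fact rows (dicts) to the requested grain by summing a measure."""
    totals = defaultdict(float)
    for r in rows:
        totals[tuple(r[c] for c in group_by)] += r[measure]
    return dict(totals)

# The Helper Area holds the fact pre-aggregated once, e.g. to
# (month, region, product); each Information Mart then rolls this shared
# table up further to its own grain, e.g. (month, region), so the
# performance-demanding calculation is only done once.
```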
@@ -279,4 +289,28 @@ Error handling for this area is documented part of the ‘A160 – Error handlin

 · If the ETL platform supports it, prefix the ‘area’ or ‘folder’ in the ETL tool with ‘300_’ because this is the first area in the third layer in the architecture. This forces most ETL software to sort the folders in the way the architecture handles the data

-· Reuse the Integration Area surrogate keys wherever possible. This further strengthens the audit capabilities and provides a standard level key in a Dimension
+· Reuse the Integration Area surrogate keys wherever possible. This further strengthens the audit capabilities and provides a standard level key in a Dimension
+
+## Use of Views
+
+### Decoupling views
+
+The Data Warehouse design incorporates views ‘on top of’ the Presentation Layer (Information Mart). This is applied for the following reasons:
+
+- Views allow a more flexible implementation of data access security (in addition to the security applied in the BI Layer)
+- Views act as a ‘decoupling’ mechanism between the physical table structure and the Semantic Layer (business model)
+- Views allow for flexible changes to information delivery (historical views)
+
+These views are meant to be 1-to-1, meaning that they represent the physical table structure of the Information Mart. However, during development and upgrades these views can be altered to temporarily reduce the impact of changes in the table structure from the perspective of the BI platform. This way changes in the Information Mart can be made without the necessity to immediately change the Semantic Layer and/or reports. In this approach normal reporting can continue and the switch to the new structure can be made at a convenient moment.
+
+This is always meant as a temporary solution to mitigate the impact of these changes, and the end state after the change should always include the return to the 1-to-1 relationship with the physical table.
+
+A very specific use, which covers the only type of functionality allowed to be implemented in the views, is the way they deliver historical information. Initially these views will be restricted to Type 1 information by adding the restriction of showing only the most recent state of the information (where the Expiry Date/Time = ‘9999-12-31’). Over time, however, it will be possible to change these views to provide historical information if required. On a full Type 2 Information Mart, views can be used to deliver any type of history without changing the underlying data or applying business logic.
+
+## Views for virtualisation
+
+Another use case for views is virtualising the Presentation Layer. As all granular and historical information is stored in the Integration Layer it is possible, if the hardware allows it, to use views to present information in any specific format. This removes the need for ETL – physically moving data – from the solution design. The applicability of virtualisation depends largely on the way the information is accessed and the infrastructure that is in place. Possible applications include cases where the BI platform uses the information to create cubes, where information is infrequently accessed, or where there is a smaller user base.
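The decoupling views described above only add the current-state restriction (Expiry Date/Time = ‘9999-12-31’) on top of the physical table. As a sketch, generating such a 1-to-1 view; the table, view and column names are hypothetical:

```python
CURRENT_ROW = "9999-12-31"  # open-ended expiry marking the current record


def decoupling_view_sql(mart_table, view_name):
    """Generate a 1-to-1 decoupling view exposing only the current (Type 1)
    state of a Type 2 Information Mart table. Names are illustrative."""
    return (f"CREATE VIEW {view_name} AS\n"
            f"SELECT *\n"
            f"FROM {mart_table}\n"
            f"WHERE EXPIRY_DATETIME = '{CURRENT_ROW}'")
```

Relaxing or changing the `WHERE` clause is then all that is needed to switch the delivered history type, without touching the underlying data.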
