
Commit 7542382

Cleanup
1 parent f51d684 commit 7542382

9 files changed: 147 additions and 133 deletions

Data Integration Framework - Reference Solution Architecture - 1 - Overview.md

Lines changed: 8 additions & 8 deletions
@@ -168,18 +168,18 @@ For tables that store history the metadata includes the regular start and end da
 ## Context of the reference architecture

-The reference architecture itself exists in a larger system of components and functionality. While these are considered out of scope for the data integration framework, it can be useful to display this wider context.
+The reference architecture itself exists in a larger system of components and functionality. While these are considered out of scope for the data integration framework, it can be useful to mention this wider context.

 ![1547521593410](.\Images\93410.png)

-This diagram shows the related concepts that interact with the reference architecture.
+This diagram shows some of the key concepts that relate to the reference architecture.

-- BI Semantic Layer. This is essentially a business friendly view of the underlying physical database. A Semantic Layer prepares the joins (relationships) between database tables and defines for instance which measures can be summarised and which ones would result in double counting. Another purpose is to rename technical database attribute names to a name more suitable for reporting. A Semantic Layer aims to support the data exploration by defining these central requirements once so they do not have to become part of each query
+- Business Intelligence Semantic Layer. The ‘Semantic Layer’ is essentially a business-friendly view of the underlying (physical) database for use in Business Intelligence environments. It includes renaming attributes to more suitable business-friendly names, preparing joins (relationships) between database tables, and defining aggregation rules. For example: which measures can be aggregated (and which ones would result in double counting), or which metrics can be used only in conjunction with certain elements (dimensions). A minimal sketch of such a definition follows below.
 - In most software platforms, the definition of a Semantic Layer adds intelligence and awareness of neighbouring data entities to assist users in the creation of reports
 - BI Views. The reference architecture incorporates views only ‘on top of’ the Presentation Layer (Information Marts) to act as a ‘decoupling’ mechanism between the physical table structure and the Business Objects semantic layer (business model). This is explained in more detail in the Modelling section
 - Specialist Applications. In some cases applications are defined that ‘live’ in the Presentation Layer, and/or (typically) OLAP applications that are updated through the Presentation Layer. This occurs often in Finance-related scenarios where OLAP is used as a Forecasting application and various Forecast scenarios are saved back into the OLAP cube or Presentation Layer structure
-- Dependent / Independent Information Marts. In some cases not all or even none of the data is provided by the Data Warehouse. This occurs typically in prototype scenarios or for purposes which have limited requirements including a short term use. While the directive is to integrate all data into the Data Warehouse it is acknowledged that in some cases Information Marts exist in the Presentation Layer that are sourced directly from operational systems. Similarly, dependent Information Marts are always updated via the Data Warehouse (Integration Layer)
-- ODS. Provisions may be taken to introduce an Operational Data Store (ODS) which is an integrated repository that enables operational use and maintenance of data. This may be applicable to one or more source systems, or for information that does not have an authoritative source. It is important to note that not all information needs to pass through the ODS before entering the Data Warehouse. The ODS is a fit-for-purpose solution that provides a new operational use for a subset of information that may be integrated from various sources that fit the requirement. The remaining data can be sourced in the conventional way
-- Data Models / Data Modelling. The design supports different approaches for modelling of information both in the context of Information Modelling (Logical Models and Physical Models) of the information for an organisation and for specific techniques in Data Warehouse modelling. Ultimately, data models, as well as a selected technique, are a core requirement for developing under the Reference Architecture
-- Metadata. Information about the data is a key design input and describes how, when and by whom the data is collected, formatted and used. Metadata is essential for understanding information stored in the data layer. Metadata is vital to the understanding the impact that results when data or its meanings is altered
-- SDLC. The Data Warehouse is essentially a custom developed application and is subject to the same Software Development Life Cycle (SDLC) processes as any other application. This also defines the involved DTAP environments (Development, Test, Acceptance, and Production) and processes around the use of these environments.
+- Dependent / Independent Information Marts. In some cases not all of the data (or even none of it) is provided by the Data Warehouse. This occurs typically in prototype scenarios or for purposes which have limited requirements, including short-term use. While the directive is to integrate all data into the Data Warehouse, it is acknowledged that in some cases Information Marts exist in the Presentation Layer that are sourced directly from operational systems. By contrast, dependent Information Marts are always updated via the Data Warehouse (Integration Layer).
+- ODS. Provisions may be taken to introduce an Operational Data Store (ODS), which is an integrated repository that enables operational use and maintenance of data. This may be applicable to one or more source systems, or for information that does not have an authoritative source. It is important to note that not all information needs to pass through the ODS before entering the Data Warehouse. The ODS is a fit-for-purpose solution that provides a new operational use for a subset of information that may be integrated from various sources that fit the requirement. The remaining data can be sourced in the conventional way.
+- Data Models / Data Modelling. The design supports different approaches for modelling of information, both in the context of Information Modelling (Logical Models and Physical Models) of the information for an organisation and for specific techniques in Data Warehouse modelling. Ultimately, data models, as well as a selected technique, are a core requirement for developing under the Reference Architecture.
+- Metadata. Information about the data is a key design input and describes how, when and by whom the data is collected, formatted and used. Metadata is essential for understanding information stored in the data layer. Metadata is vital to understanding the impact that results when data or its meaning is altered.
+- Software Development Life Cycle (SDLC). The data solution is essentially a custom-developed application and is subject to SDLC processes just like any other application. A defined approach covers the involved environments (for example Development, Test, Acceptance, and Production) and processes around the use of these environments (e.g. DevOps, deployment, change management, version control).
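To make the Semantic Layer concept above a little more tangible, the sketch below captures a hypothetical definition as plain metadata: friendly names for technical attributes, a prepared join, and aggregation rules. All names (CUST_NO, ORDERS, "Order Amount", and so on) are invented for illustration and are not part of the reference architecture.

```python
# Illustrative only: a hypothetical Semantic Layer definition captured as metadata.
# Attribute names, the join, and the aggregation rules are invented for this sketch.
semantic_layer = {
    "friendly_names": {
        "CUST_NO": "Customer Number",        # rename technical attribute names
        "ORD_AMT": "Order Amount",
    },
    "joins": [
        # prepared relationship so report authors do not have to define it per query
        {"left": "ORDERS.CUST_NO", "right": "CUSTOMER.CUST_NO", "type": "inner"},
    ],
    "aggregation_rules": {
        "Order Amount": "SUM",               # safe to aggregate
        "Customer Count": "COUNT DISTINCT",  # summing would double-count
    },
}

def is_summable(measure: str) -> bool:
    """Return True if the measure can be summed without double counting."""
    return semantic_layer["aggregation_rules"].get(measure) == "SUM"

print(is_summable("Order Amount"))    # True
print(is_summable("Customer Count"))  # False
```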

Data Integration Framework - Reference Solution Architecture - 2 - Staging Layer.md

Lines changed: 3 additions & 70 deletions
@@ -1,6 +1,8 @@
 # Staging Layer overview

-The Staging Layer covers the first series of ETL process steps within the reference architecture. The processes involved with the Staging Layer introduce data from many (often disparate) source applications into the Data Warehouse environment. In this sense the Staging Layer is for the most part literally a place where the data is collected onto the Data Warehouse environment before being integrated in the core Data Warehouse or loaded for other use-cases (i.e. analytics, ad-hoc reporting).
+The Staging Layer covers the first series of ETL process steps within the reference architecture. The processes involved with the Staging Layer introduce data from many (often disparate) source applications into the Data Warehouse environment.
+
+In this sense, the Staging Layer is for the most part literally a place where the data is collected onto the Data Warehouse environment before being integrated in the core Data Warehouse or loaded for other use-cases (i.e. analytics, ad-hoc reporting).

 But even then many fundamental decisions are required that have repercussions throughout the rest of the design. This document defines the Staging Layer and describes the required process steps and available solutions.

@@ -149,72 +151,3 @@ There are a number of choices that can be made on this topic, depending on the r
 * Where there is a mix of data marts where data completeness and accuracy is vital to some of the data marts and not others (e.g. a Finance data mart versus a CRM data mart), strategies can be put in place to permit a partial load while minimising its impact on all data marts. A possible approach is to filter out partial loads (based on an ETL Process Id), as well as all subsequent loads, when populating data marts that require data to be complete and accurate. A sketch of this filter follows below.

 The complete error handling for each layer in the architecture is documented in detail in the error handling and recycling approach.
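As a minimal sketch of the filtering approach above: assuming the ETL Process Id executions are tracked with a completion status (the status values and the ordering are assumptions for this example), a mart load could include only the deltas up to the first partial load and exclude everything after it.

```python
# Hypothetical sketch: exclude partial loads (and all subsequent loads) from a
# data mart that requires complete and accurate data. Names are illustrative.

def eligible_process_ids(executions):
    """executions: list of (process_id, status) tuples ordered by process_id.
    Returns the process ids up to (but excluding) the first partial/failed load."""
    eligible = []
    for process_id, status in executions:
        if status != "SUCCEEDED":
            break                      # stop at the first partial load
        eligible.append(process_id)    # everything after it is excluded as well
    return eligible

executions = [(101, "SUCCEEDED"), (102, "SUCCEEDED"), (103, "FAILED"), (104, "SUCCEEDED")]
print(eligible_process_ids(executions))  # [101, 102] - 103 and 104 are filtered out
```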
-
-## The Staging Area
-
-This section describes both the concepts and structure of the Staging Area in detail.
-
-### Staging Area table structure
-
-The Staging Area is modelled after the structure of the application that supplies the data. This excludes all constraints or referential integrity the source tables might have; the Data Warehouse assumes the source systems handle this, and otherwise this will be handled by the Integration / Presentation Layer processes of the Data Warehouse.
-
-The structure of the Staging Area therefore is the same as the source table, but always contains the following process attributes:
-
-| **Column Name** | **Required / Optional** | **Data Type / constraint** | **Reasoning** | **DIRECT equivalent** |
-| --- | --- | --- | --- | --- |
-| Load Date/Time Stamp | Required | High precision date/time – not null | The date/time that the record has been presented to the Data Warehouse environment. | |
-| Event Date/Time | Required | High precision date/time – not null | The date/time the change occurred in the source system. | Event Datetime |
-| Source System ID / Code | Required | Varchar(100) – not null | The code or ID for the source system that supplied the data. | Record Source |
-| Source Row ID | Required | Integer – not null | Audit attribute that captures the row order within the data delta as provided by a unique ETL execution. The combination of the unique execution instance and the row ID will always relate back to a single record. Also used to distinguish order if the effective date/time itself is not unique for a given key (due to fast-changing data). | Source Row ID |
-| CDC Operation | Required | Varchar(100) – not null | Information derived or received by the ETL process to derive logical deletes. | CDC Operation |
-| Full row hash | Optional | Character(32), when using MD5 – not null | Using a checksum for record comparison requires storing a checksum value as an attribute. Can be made optional if column-by-column comparison is implemented instead. | Hash Full Record |
-| ETL Process Execution ID | Required | Integer – not null | Logging which unique ETL process has inserted the record. | Insert Module Instance ID |
-| Upstream Hash Values | Optional | Character(32), when using MD5 – not null | Any pre-calculated hash values one may like to add to optimize upstream parallel loading (i.e. pre-hashing business keys). | Hash <> |
-| <source attributes> | Required | According to data type conversion table – nullable | The source attributes as available. Note that if a primary hash key is not used the natural key (source primary key) needs to be set to NOT NULL. All other attributes are nullable. | N/A |
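The 'Full row hash' and 'Upstream Hash Values' attributes in the table above can be illustrated with a short sketch. It assumes MD5 hashing over a delimited concatenation of values; the delimiter, the NULL handling, and the sample source record and attribute names are assumptions for this example only.

```python
import hashlib
from datetime import datetime, timezone

def md5_hash(values, delimiter="|"):
    """Concatenate the values with a delimiter and return a 32-character MD5 hash.
    The delimiter and the treatment of NULLs are assumptions for this sketch."""
    parts = ["" if v is None else str(v) for v in values]
    return hashlib.md5(delimiter.join(parts).encode("utf-8")).hexdigest()

# A hypothetical source record from an 'ORDERS' source table.
source_record = {"ORDER_ID": 42, "CUST_NO": "C-001", "ORD_AMT": 150.00}

staging_record = {
    **source_record,
    "LOAD_DATETIME": datetime.now(timezone.utc),            # Load Date/Time Stamp
    "EVENT_DATETIME": datetime(2019, 1, 15, 8, 30, tzinfo=timezone.utc),
    "RECORD_SOURCE": "ORDERS_SYSTEM",                        # Source System ID / Code
    "SOURCE_ROW_ID": 1,
    "CDC_OPERATION": "Insert",
    "HASH_FULL_RECORD": md5_hash(source_record.values()),    # Full row hash
    "HASH_CUST_NO": md5_hash([source_record["CUST_NO"]]),    # pre-hashed business key
}
print(staging_record["HASH_FULL_RECORD"])  # 32 hexadecimal characters
```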
-
-## Staging Area development guidelines
-
-The following is a list of conventions for the Staging Area:
-
-* When loading data from the source, always load the lowest level (grain) of data available. If a summary is required and even available as a source, load the detail data anyway
-* If the ETL platform allows it, prefix the ‘area’, ‘folder’ or ‘namespace’ in the ETL platform with ‘100_’ because this is the first Layer in the architecture. Source definition folders, if applicable, are labelled ‘000_’. This forces most ETL tools to sort the folders in the way the architecture handles the data.
-* Source to Staging Area ETL processes use the truncate/insert load strategy (sketched below). When delta detection is handled by the DWH (i.e. using a Full Outer Join) a Landing Area table can be incorporated.
-* Everything is copied as-is; no transformations are done other than formatting data types. The Staging Area processing may never lead to errors!
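As a minimal sketch of the truncate/insert load strategy referenced in the guidelines above, assuming a generic DB-API style connection; the table name, column list and parameter placeholder style are hypothetical and depend on the platform in use.

```python
# Minimal sketch of a truncate/insert load into a Staging Area table.
# The connection, table name and column list are hypothetical.

def load_staging_table(connection, table, columns, delta_rows):
    """Empty the Staging Area table, then insert the latest data delta as-is."""
    placeholders = ", ".join(["?"] * len(columns))   # paramstyle depends on the driver
    cursor = connection.cursor()
    cursor.execute(f"TRUNCATE TABLE {table}")        # truncate ...
    cursor.executemany(                              # ... then insert the new delta
        f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})",
        delta_rows,
    )
    connection.commit()
```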
-
-# Persistent Staging Area
-
-The structure of the PSA is the same as the Staging Area (including the metadata attributes). The following attributes are defined for the PSA tables:
-
-| **Column Name** | **Required / Optional** | **Data Type / constraint** | **Reasoning** | **DIRECT equivalent** |
-| --- | --- | --- | --- | --- |
-| Primary hash key i.e. <entity>_SK | Optional | Character(32), when using MD5 – not null | The hashed value of the source (natural) key. Part of the primary key which is issued for each record in the history table. Can be used instead of a composite primary key. | N/A |
-| Effective Date/Time | Required | High precision date/time – not null | The date/time that the record has been presented to the Data Warehouse environment. If a Staging Area is used these values will be inherited. | Insert Datetime |
-| Event Date/Time | Required | High precision date/time – not null | The date/time the change occurred in the source system. If a Staging Area is used these values will be inherited. | Event Datetime |
-| Source System ID / Code | Required | Varchar(100) – not null | The code or ID for the source system that supplied the data. | Record Source |
-| Source Row ID | Required | Integer – not null | Audit attribute that captures the row order within the data delta as provided by a unique ETL execution. The combination of the unique execution instance and the row ID will always relate back to a single record. Also used to distinguish order if the effective date/time itself is not unique for a given key (due to fast-changing data). If a Staging Area is used these values will be inherited. | Source Row ID |
-| CDC Operation | Required | Varchar(100) – not null | Information derived or received by the ETL process to derive logical deletes. If a Staging Area is used these values will be inherited. | CDC Operation |
-| Full row hash | Optional | Character(32), when using MD5 – not null | Using a checksum for record comparison requires storing a checksum value as an attribute. Can be made optional if column-by-column comparison is implemented instead. If a Staging Area is used these values will be inherited. | Hash Full Record |
-| ETL Process Execution ID | Required | Integer – not null | Logging which unique ETL process has inserted the record. | Insert Module Instance ID |
-| Current Row Indicator | Optional | Varchar(100) – not null | A flag or Boolean to indicate if the record is the most current one (in relation to the effective date). | Current Record Indicator |
-| Change Date/Time | Optional | High precision date/time – nullable | A derived date/time field to standardise the main business effective date/time for more harmonised upstream processing. | Change Datetime |
-| <source attributes> | Required | According to data type conversion table – nullable | The source attributes as available. Note that if a primary hash key is not used the natural key (source primary key) needs to be set to NOT NULL. All other attributes are nullable. | N/A |
-
-The ETL process from the Staging Area to the PSA checks the data based on the source key and the date/time information, and compares all the attribute values. This can result in the following actions (a sketch of this comparison follows below):
-
-* No history record is found: insert a new record in the history table.
-* A record is found and the source attribute values are different from the history attribute values: insert a new record. Update current row indicators if adopted.
-* A record is found but there are no changes found in the attribute comparison: ignore.
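A minimal sketch of the comparison outcomes listed above, using the full row hash for the attribute comparison. The in-memory 'history' structure and the attribute names are assumptions for illustration; in practice this logic runs against the PSA table itself.

```python
# Illustrative sketch of the Staging-to-PSA comparison. 'history' maps a source
# key to the list of PSA records for that key; names are hypothetical.

def apply_to_psa(history, staging_record, source_key):
    """Insert a new record only when the key is new or the attribute values changed."""
    records = history.setdefault(source_key, [])
    if not records:
        # No history record found: insert a new record in the history table.
        records.append({**staging_record, "CURRENT_ROW_INDICATOR": "Y"})
        return "inserted (new key)"
    current = records[-1]
    if current["HASH_FULL_RECORD"] != staging_record["HASH_FULL_RECORD"]:
        # Attribute values differ: insert a new record and update the current row indicator.
        current["CURRENT_ROW_INDICATOR"] = "N"
        records.append({**staging_record, "CURRENT_ROW_INDICATOR": "Y"})
        return "inserted (changed)"
    # No changes found in the attribute comparison: ignore.
    return "ignored"

history = {}
print(apply_to_psa(history, {"HASH_FULL_RECORD": "abc", "ORD_AMT": 100}, "C-001"))  # inserted (new key)
print(apply_to_psa(history, {"HASH_FULL_RECORD": "abc", "ORD_AMT": 100}, "C-001"))  # ignored
print(apply_to_psa(history, {"HASH_FULL_RECORD": "def", "ORD_AMT": 120}, "C-001"))  # inserted (changed)
```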
-
-Note: there are other suitable approaches towards a PSA. Depending on the requirements, a snapshot PSA can also be adopted, where every run inserts a new record (instance) of the source data. This creates a more redundant dataset but arguably makes reloading data to the Integration Layer easier.
-
-When loading data delta directly into the PSA (i.e. when the Staging Area is not adopted) the same rules apply as for the Staging Area.
-
-## Persistent Staging Area development guidelines
-
-The following is a list of development conventions for the Persistent Staging Area (PSA):
-
-* When loading the PSA from the Staging Area, always start a PSA ETL process as soon as the Staging Area is finished to ensure that there are no ‘gaps’ in the history. Since the Staging Area has the ‘truncate/insert’ load strategy, PSA data has to be processed before the next Staging Area run. During normal loads, the Integration Area has no dependency on the History Area, and loading into history and integration can be done in parallel if the Data Warehouse has capacity for concurrent jobs. This is handled by the ‘Batch’ concept, which guarantees the unit of work; e.g. making sure all data delta has been processed
-* If the ETL platform supports it, prefix the ‘schema’, ‘area’ or ‘folder’ in the RDBMS and ETL software with ‘150_’. The PSA is part of the first layer in the architecture, but is updated after the Staging Area (if adopted). This forces most ETL tools to sort the folders in the way the architecture handles the data
-* Everything is copied as-is; no transformations are done.
-* There is no requirement for error handling in PSA ETL.
-* When using files, changes in file formats (schema) over time can be handled in separate file mask metadata versions.
