# Staging Layer overview

The Staging Layer covers the first series of ETL process steps within the reference architecture. The processes involved with the Staging Layer introduce data from many (often disparate) source applications into the Data Warehouse environment.

In this sense, the Staging Layer is for the most part literally the place where data is collected into the Data Warehouse environment before being integrated into the core Data Warehouse or loaded for other use-cases (e.g. analytics or ad-hoc reporting).

Even at this early stage, many fundamental decisions are required that have repercussions throughout the rest of the design. This document defines the Staging Layer and describes the required process steps and the available solutions.

There are a number of choices that can be made on this topic, depending on the requirements:

* Where there is a mix of data marts in which data completeness and accuracy are vital to some of the data marts and not to others (for instance a Finance data mart versus a CRM data mart), strategies can be put in place to permit a partial load while minimizing its impact on all data marts. A possible approach is to filter out partial loads (based on an ETL Process Id), as well as all subsequent loads, when populating the data marts that require data to be complete and accurate; a sketch of this filter follows below.

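As an illustration of this approach, the sketch below populates a data mart query from a hypothetical Integration Layer table while excluding data from ETL executions that did not complete successfully. The control table MODULE_INSTANCE, its EXECUTION_STATUS_CODE column and the source/target names are assumptions made for the purpose of the example, not a prescribed part of the architecture.

```sql
-- A minimal sketch of filtering out partial loads, assuming a SQL Server-style
-- dialect and a hypothetical control table MODULE_INSTANCE that records an
-- EXECUTION_STATUS_CODE ('S' for succeeded) per ETL Process Execution ID.
SELECT src.ACCOUNT_CODE,
       src.TRANSACTION_AMOUNT
FROM   INT_FINANCE_TRANSACTION AS src -- hypothetical Integration Layer table
WHERE  src.MODULE_INSTANCE_ID IN
(
    -- Only take data from ETL executions that completed successfully, so
    -- partially loaded data delta is excluded from this data mart.
    SELECT ctrl.MODULE_INSTANCE_ID
    FROM   MODULE_INSTANCE AS ctrl
    WHERE  ctrl.EXECUTION_STATUS_CODE = 'S'
);
```
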
The complete error handling for each layer in the architecture is documented in detail in the error handling and recycling approach.

## The Staging Area

This section describes both the concepts and the structure of the Staging Area in detail.

### Staging Area table structure

The Staging Area is modelled after the structure of the application that supplies the data. This excludes any constraints or referential integrity the source tables might have; the Data Warehouse assumes that the source systems handle this, and otherwise it will be handled by the Integration / Presentation Layer processes of the Data Warehouse.

The structure of a Staging Area table is therefore the same as that of the source table, but always contains the following process attributes:

| **Column Name** | **Required / Optional** | **Data Type / constraint** | **Reasoning** | **DIRECT equivalent** |
| --- | --- | --- | --- | --- |
| Load Date/Time Stamp | Required | High precision date/time – not null | The date/time that the record has been presented to the Data Warehouse environment. | |
| Event Date/Time | Required | High precision date/time – not null | The date/time the change occurred in the source system. | Event Datetime |
| Source System ID / Code | Required | Varchar(100) – not null | The code or ID for the source system that supplied the data. | Record Source |
| Source Row ID | Required | Integer – not null | Audit attribute that captures the row order within the data delta as provided by a unique ETL execution. The combination of the unique execution instance and the row ID will always relate back to a single record. Also used to distinguish the order of changes if the effective date/time itself is not unique for a given key (due to fast-changing data). | Source Row ID |
| CDC Operation | Required | Varchar(100) – not null | Information derived or received by the ETL process, used to derive logical deletes. | CDC Operation |
| Full row hash | Optional | Character(32) when using MD5 – not null | Using a checksum for record comparison requires storing a checksum value as an attribute. The attribute can be omitted if column-by-column comparison is implemented instead. | Hash Full Record |
| ETL Process Execution ID | Required | Integer – not null | Logs which unique ETL process execution has inserted the record. | Insert Module Instance ID |
| Upstream Hash Values | Optional | Character(32) when using MD5 – not null | Any pre-calculated hash values that may be added to optimize upstream parallel loading (e.g. pre-hashing business keys). | Hash <> |
| <source attributes> | Required | According to the data type conversion table – nullable | The source attributes as available. Note that if a primary hash key is not used, the natural key (the source primary key) needs to be set to NOT NULL. All other attributes are nullable. | N/A |

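To make this structure concrete, the sketch below shows what a Staging Area table could look like for a hypothetical CUSTOMER table delivered by a CRM source system. A SQL Server-style dialect is assumed, with DATETIME2 as the high precision date/time type; all table and column names are illustrative rather than prescribed, and the optional Upstream Hash Values attribute is left out.

```sql
-- A minimal sketch of a Staging Area table, assuming a SQL Server-style
-- dialect and a hypothetical CUSTOMER table provided by a CRM source system.
CREATE TABLE STG_CRM_CUSTOMER
(
    LOAD_DATETIME      DATETIME2(7)  NOT NULL, -- Load Date/Time Stamp
    EVENT_DATETIME     DATETIME2(7)  NOT NULL, -- Event Date/Time
    RECORD_SOURCE      VARCHAR(100)  NOT NULL, -- Source System ID / Code
    SOURCE_ROW_ID      INT           NOT NULL, -- Source Row ID
    CDC_OPERATION      VARCHAR(100)  NOT NULL, -- CDC Operation
    HASH_FULL_RECORD   CHAR(32)      NOT NULL, -- Full row hash (optional attribute)
    MODULE_INSTANCE_ID INT           NOT NULL, -- ETL Process Execution ID
    -- Source attributes, copied as-is. The natural key is NOT NULL because
    -- no primary hash key is used in this sketch; all other source
    -- attributes are nullable.
    CUSTOMER_ID        INT           NOT NULL,
    CUSTOMER_NAME      VARCHAR(100)  NULL
);
```
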
## Staging Area development guidelines

The following is a list of conventions for the Staging Area:

* When loading data from the source, always load the lowest level (grain) of data available. If a summary is required, and even available as a source, load the detail data anyway.
* If the ETL platform allows it, prefix the ‘area’, ‘folder’ or ‘namespace’ in the ETL platform with ‘100_’ because this is the first layer in the architecture. Source definition folders, if applicable, are labelled ‘000_’. This forces most ETL tools to sort the folders in the order in which the architecture handles the data.
* Source to Staging Area ETL processes use the truncate/insert load strategy; a sketch of this strategy is shown after this list. When delta detection is handled by the Data Warehouse (i.e. using a Full Outer Join) a Landing Area table can be incorporated.
* Everything is copied as-is; no transformations are done other than the formatting of data types. The Staging Area processing may never lead to errors!

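The truncate/insert strategy can be illustrated with the sketch below, reusing the hypothetical STG_CRM_CUSTOMER table from the previous example and a hypothetical source view CRM_SOURCE.CUSTOMER. How the ETL Process Execution ID is issued and how the CDC Operation is derived depend on the ETL control framework, so both are simplified here.

```sql
-- A minimal sketch of a Source to Staging Area truncate/insert process,
-- assuming a SQL Server-style dialect and illustrative object names.
DECLARE @ModuleInstanceId INT = 1; -- normally issued by the ETL control framework

TRUNCATE TABLE STG_CRM_CUSTOMER;

INSERT INTO STG_CRM_CUSTOMER
    (LOAD_DATETIME, EVENT_DATETIME, RECORD_SOURCE, SOURCE_ROW_ID,
     CDC_OPERATION, HASH_FULL_RECORD, MODULE_INSTANCE_ID,
     CUSTOMER_ID, CUSTOMER_NAME)
SELECT
    SYSUTCDATETIME(),                             -- Load Date/Time Stamp
    src.LAST_MODIFIED_DATETIME,                   -- Event Date/Time from the source
    'CRM',                                        -- Source System ID / Code
    ROW_NUMBER() OVER (ORDER BY src.CUSTOMER_ID), -- Source Row ID within this delta
    'Insert',                                     -- CDC Operation; a full implementation derives this
    CONVERT(CHAR(32),
        HASHBYTES('MD5', CONCAT(src.CUSTOMER_ID, '|', src.CUSTOMER_NAME)),
        2),                                       -- Full row hash as a 32-character hex string
    @ModuleInstanceId,                            -- ETL Process Execution ID
    src.CUSTOMER_ID,
    src.CUSTOMER_NAME
FROM CRM_SOURCE.CUSTOMER AS src;
```
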
## Persistent Staging Area

The structure of the PSA is the same as that of the Staging Area (including the process attributes). The following attributes are mandatory for the PSA tables:

| **Column Name** | **Required / Optional** | **Data Type / constraint** | **Reasoning** | **DIRECT equivalent** |
| --- | --- | --- | --- | --- |
| Primary hash key, i.e. <entity>_SK | Optional | Character(32) when using MD5 – not null | The hashed value of the source (natural) key. Part of the primary key which is issued for each record in the history table. Can be used instead of a composite primary key. | N/A |
| Effective Date/Time | Required | High precision date/time – not null | The date/time that the record has been presented to the Data Warehouse environment. If a Staging Area is used these values will be inherited. | Insert Datetime |
| Event Date/Time | Required | High precision date/time – not null | The date/time the change occurred in the source system. If a Staging Area is used these values will be inherited. | Event Datetime |
| Source System ID / Code | Required | Varchar(100) – not null | The code or ID for the source system that supplied the data. | Record Source |
| Source Row ID | Required | Integer – not null | Audit attribute that captures the row order within the data delta as provided by a unique ETL execution. The combination of the unique execution instance and the row ID will always relate back to a single record. Also used to distinguish the order of changes if the effective date/time itself is not unique for a given key (due to fast-changing data). If a Staging Area is used these values will be inherited. | Source Row ID |
| CDC Operation | Required | Varchar(100) – not null | Information derived or received by the ETL process, used to derive logical deletes. If a Staging Area is used these values will be inherited. | CDC Operation |
| Full row hash | Optional | Character(32) when using MD5 – not null | Using a checksum for record comparison requires storing a checksum value as an attribute. The attribute can be omitted if column-by-column comparison is implemented instead. If a Staging Area is used these values will be inherited. | Hash Full Record |
| ETL Process Execution ID | Required | Integer – not null | Logs which unique ETL process execution has inserted the record. | Insert Module Instance ID |
| Current Row Indicator | Optional | Varchar(100) – not null | A flag or Boolean that indicates whether the record is the most current one (in relation to the effective date). | Current Record Indicator |
| Change Date/Time | Optional | High precision date/time – nullable | A derived date/time field to standardize the main business effective date/time for more harmonised upstream processing. | Change Datetime |
| <source attributes> | Required | According to the data type conversion table – nullable | The source attributes as available. Note that if a primary hash key is not used, the natural key (the source primary key) needs to be set to NOT NULL. All other attributes are nullable. | N/A |

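Continuing the hypothetical CRM CUSTOMER example, a corresponding PSA table could look as sketched below. This sketch opts for a composite primary key on the natural key, the Effective Date/Time and the Source Row ID rather than a primary hash key; again a SQL Server-style dialect and illustrative names are assumed.

```sql
-- A minimal sketch of a PSA table for the hypothetical CRM CUSTOMER source.
-- A primary hash key (e.g. CUSTOMER_SK CHAR(32)) could be used instead of
-- the composite primary key shown here.
CREATE TABLE PSA_CRM_CUSTOMER
(
    INSERT_DATETIME          DATETIME2(7)  NOT NULL, -- Effective Date/Time
    EVENT_DATETIME           DATETIME2(7)  NOT NULL, -- Event Date/Time
    RECORD_SOURCE            VARCHAR(100)  NOT NULL, -- Source System ID / Code
    SOURCE_ROW_ID            INT           NOT NULL, -- Source Row ID
    CDC_OPERATION            VARCHAR(100)  NOT NULL, -- CDC Operation
    HASH_FULL_RECORD         CHAR(32)      NOT NULL, -- Full row hash (optional attribute)
    MODULE_INSTANCE_ID       INT           NOT NULL, -- ETL Process Execution ID
    CURRENT_RECORD_INDICATOR VARCHAR(100)  NOT NULL, -- Current Row Indicator (optional attribute)
    CHANGE_DATETIME          DATETIME2(7)  NULL,     -- Change Date/Time (optional attribute)
    CUSTOMER_ID              INT           NOT NULL, -- natural key from the source
    CUSTOMER_NAME            VARCHAR(100)  NULL,
    CONSTRAINT PK_PSA_CRM_CUSTOMER
        PRIMARY KEY (CUSTOMER_ID, INSERT_DATETIME, SOURCE_ROW_ID)
);
```
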
The ETL process from the Staging Area to the PSA checks the data based on the source key and the date/time information, and compares all the attribute values. This can result in the following actions (a sketch follows this list):

* No history record is found: insert a new record in the history table.
* A record is found and the source attribute values differ from the history attribute values: insert a new record. Update the current row indicators, if adopted.
* A record is found but no changes are found in the attribute comparison: ignore.

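This comparison could be implemented along the lines of the sketch below, which reuses the hypothetical STG_CRM_CUSTOMER and PSA_CRM_CUSTOMER tables from the earlier examples. For brevity it detects changes via the full row hash (rather than column-by-column) and assumes at most one change per key within a data delta; a production implementation would use the Source Row ID to order multiple changes per key.

```sql
-- A minimal sketch of the Staging Area to PSA load, assuming a SQL Server-style
-- dialect and the hypothetical tables from the earlier examples.

-- Insert records that are new, or whose current PSA version has a different
-- full row hash; unchanged records are ignored.
INSERT INTO PSA_CRM_CUSTOMER
    (INSERT_DATETIME, EVENT_DATETIME, RECORD_SOURCE, SOURCE_ROW_ID,
     CDC_OPERATION, HASH_FULL_RECORD, MODULE_INSTANCE_ID,
     CURRENT_RECORD_INDICATOR, CUSTOMER_ID, CUSTOMER_NAME)
SELECT
    stg.LOAD_DATETIME, stg.EVENT_DATETIME, stg.RECORD_SOURCE, stg.SOURCE_ROW_ID,
    stg.CDC_OPERATION, stg.HASH_FULL_RECORD, stg.MODULE_INSTANCE_ID,
    'Y', stg.CUSTOMER_ID, stg.CUSTOMER_NAME
FROM STG_CRM_CUSTOMER AS stg
WHERE NOT EXISTS
(
    SELECT 1
    FROM PSA_CRM_CUSTOMER AS psa
    WHERE psa.CUSTOMER_ID = stg.CUSTOMER_ID
      AND psa.CURRENT_RECORD_INDICATOR = 'Y'
      AND psa.HASH_FULL_RECORD = stg.HASH_FULL_RECORD
);

-- Update the current row indicators: any record for which a newer record
-- exists for the same key is no longer the current one.
UPDATE psa
SET    CURRENT_RECORD_INDICATOR = 'N'
FROM   PSA_CRM_CUSTOMER AS psa
WHERE  psa.CURRENT_RECORD_INDICATOR = 'Y'
  AND  EXISTS
(
    SELECT 1
    FROM PSA_CRM_CUSTOMER AS newer
    WHERE newer.CUSTOMER_ID = psa.CUSTOMER_ID
      AND newer.INSERT_DATETIME > psa.INSERT_DATETIME
);
```
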
Note that there are other suitable approaches towards a PSA. Depending on the requirements, it is also possible to opt for a snapshot PSA in which every run inserts a new record (instance) of the source data. This creates a more redundant dataset, but arguably makes reloading data into the Integration Layer easier.

When loading data delta directly into the PSA (i.e. when the Staging Area is not adopted) the same rules apply as for the Staging Area.

## Persistent Staging Area development guidelines

The following is a list of development conventions for the Persistent Staging Area (PSA):

* When loading the PSA from the Staging Area, always start the PSA ETL process as soon as the Staging Area is finished, to ensure that there are no ‘gaps’ in the history. Since the Staging Area uses the ‘truncate/insert’ load strategy, PSA data has to be processed before the next Staging Area run. During normal loads the Integration Area has no dependency on the PSA, so loading into history and integration can be done in parallel if the Data Warehouse has capacity for concurrent jobs. This is handled by the ‘Batch’ concept, which guarantees the unit of work; e.g. making sure all data delta has been processed.
* If the ETL platform supports it, prefix the ‘schema’, ‘area’ or ‘folder’ in the RDBMS and ETL software with ‘150_’. The PSA is part of the first layer in the architecture, but is updated after the Staging Area (if adopted). This forces most ETL tools to sort the folders in the order in which the architecture handles the data.
* Everything is copied as-is; no transformations are done.
* There is no requirement for error handling in PSA ETL.
* When using files, changes in file formats (schema) over time can be handled in separate file mask metadata versions.