000_Documentation/DIRECT_Functional_Design.md (29 additions, 34 deletions)
@@ -1,20 +1,27 @@
# Introduction of DIRECT

- DIRECT, the Data Integration & Execution Control Tool, is a data integration control and execution metadata model. It is a core and stand-alone component of the Data Integration Framework. Every Extract, Transform and Load (ETL) process is linked to this model, which provides the orchestration and management capabilities for data integration. ETL in this context is a broad definition covering various related data integration approaches such as ELT (Extract, Load, Transform - pushdown into SQL or underlying processing) and LETS (Load-Extract-Transform-Store). ETL in this document essentially covers all processes that 'touch' data.
+ DIRECT, the Data Integration & Execution Control Tool, is a data integration control and execution metadata model. It is a core and stand-alone component of the Data Integration Framework.

- The repository essentially captures process information about the ETL and is an invaluable source of information to monitor how the system is expanding (time, size), but also to drive and monitor ETL processes.
+ Every Data Integration / Extract, Transform and Load (ETL) process is linked to this model, which provides the orchestration and management capabilities for data integration.

- This document references all other architectural documents because the metadata model is an integral part of a fully implemented system. For functionality such as rollback and recovery, information about the individual ETL processes, including the related layers and areas as defined in the Outline Architecture, is retrieved from the repository.
+ Data Integration in this context is a broad definition covering various implementation techniques such as ELT (Extract, Load, Transform - pushdown into SQL or underlying processing) and LETS (Load-Extract-Transform-Store).

- The objective of the DIRECT Framework is to provide a structured approach to describing and recording ETL processes that can be made up of many separate components. This is to be done in such a way that they can be represented and managed as a coherent system.
+ Data Integration in this document essentially covers all processes that 'touch' data.
+
+ The DIRECT repository captures Data Integration process information and is an invaluable source of information to monitor how the system is expanding (time, size), but also to drive and monitor processes - a fundamental requirement for parallel processing and transaction control.
+
+ The objective of the DIRECT Framework is to provide a structured approach to describing and recording Data Integration processes that can be made up of many separate components. This is to be done in such a way that they can be represented and managed as a coherent system.
# Overview

- This document covers the design and specifications for the metadata repository and the integration (events) for data integration processes. The documentation also includes the available (logical) scripted components for controlled execution of ETL within the Enterprise Data Warehouse. The DIRECT framework covers a broad variety of process information, including (but not limited to):
- * What process information will be stored and how
- * How this is integrated into the various defined Layers and Areas
- * Of what entities the metadata model consists
- * The available procedures for managing the Data Warehouse environment
- * Concepts and principles
- * The logic which can be used to control the processes
- * Housekeeping functions
+ This document covers the design and specifications for the metadata repository and the integration (events) for data integration processes.
+
+ The documentation also includes the available (logical) scripted components for controlled execution of Data Integration processes. The DIRECT framework covers a broad variety of process information, including (but not limited to):
+
+ * What process information will be stored and how.
+ * How this is integrated into the various defined Layers and Areas.
+ * The entities of which the metadata model consists.
+ * The available procedures for managing the data solution.
+ * Concepts and principles.
+ * The logic which can be used to control the processes.
+ * Housekeeping functions.

## Positioning of DIRECT

The position of the control and execution framework in the overall architecture is:
@@ -24,19 +31,19 @@ The position of the control and execution framework in the overall architecture
# Concepts

## Purpose

- In general, the process control framework supports the ability to trace back what data has been loaded, when and in what way for every interface.
+ The process control framework supports the ability to trace back what data has been loaded, when and in what way for every interface.

- A single attribute that has been populated in any location of the overall architecture should be auditable - and be able to be traced back to the originating source system.
+ Any single data element (e.g. an attribute value in a table) should be auditable. It should be possible to track which processes have been run that have led to the visible result.

This means that the following information must be available:

- - When a record was inserted
- - When a record was updated
- - What the source was where the record originated from
- - When the event took place that changed the source data
- - Which process has loaded the data
- - Within which workflow the data was loaded
- - On which platform the ETL took place
+ - When a record was inserted.
+ - When a record was updated.
+ - What the source was where the record originated from.
+ - When the event took place that changed the source data.
+ - Which process has loaded the data.
+ - Within which workflow the data was loaded.
+ - On which platform the process took place.
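As a minimal sketch of how this audit trail can surface on a target table, the definition below shows the kind of control attributes involved. The table and column names (e.g. INSERT_MODULE_INSTANCE_ID) are illustrative assumptions for this example, not names prescribed by the framework.

```sql
-- Illustrative sketch only: control attributes on a DIRECT-managed target table.
-- Table and column names are assumptions for this example, not prescribed names.
CREATE TABLE [psa].[CUSTOMER] (
    CUSTOMER_KEY              VARCHAR(100) NOT NULL,
    CUSTOMER_NAME             VARCHAR(100) NULL,
    INSERT_DATETIME           DATETIME2(7) NOT NULL, -- when the record was inserted
    UPDATE_DATETIME           DATETIME2(7) NULL,     -- when the record was last updated
    EVENT_DATETIME            DATETIME2(7) NOT NULL, -- when the change occurred in the source
    RECORD_SOURCE             VARCHAR(100) NOT NULL, -- the source the record originated from
    INSERT_MODULE_INSTANCE_ID INT          NOT NULL, -- the process (Module Instance) that loaded the record
    BATCH_INSTANCE_ID         INT          NOT NULL  -- the workflow (Batch Instance) the record was loaded in
);
```

The platform on which the process ran can then be resolved by joining the instance identifiers back to the repository, rather than being stored on every record.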
## Elements of process information

@@ -105,7 +112,7 @@ The following is a high level overview of the reprocessing strategy. These actio
* **Staging Area**; the target table is truncated. This essentially is a redundant step, because the Staging Area is truncated by the Module Instance, but the step is added for consistency reasons and to be on the safe side for reprocessing.
* **Staging Area**; if the Source Control table is implemented, this information is corrected by deleting the entries that were inserted by the failed Module Instances.
* **Persistent Staging Area**; all records that have been inserted by the failed Module Instances are deleted. Due to the default (mandatory) structure of the History Area tables, the delete statement alone is sufficient.
- * **Integration Layer**; rollback varies depending on the type of model, but rollback usually is a combination of inserts and deletes depending on the types of tables in the Data Warehouse (in turn dependant on the data modelling technique). An example of recovery using Data Vault is added below:
+ * **Integration Layer**; rollback varies depending on the type of model, but rollback usually is a combination of inserts and deletes depending on the types of tables in the Data Warehouse (in turn dependent on the data modelling technique). An example of recovery using Data Vault is added below:
* Hub table: deletion of all records inserted by the Module Instance.
* Link table: deletion of all records inserted by the Module Instance.
* Satellite table: deletion of all records inserted based on the Insert Module Instance ID attribute. Also included is an update of all records to set these to be the active record again (repair timelines), using the Update Module Instance ID information.
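A minimal sketch of such a rollback in SQL follows. The table names (HUB_CUSTOMER, SAT_CUSTOMER), the end-dating columns and the schema are assumptions for illustration; only the use of the Insert and Update Module Instance ID attributes follows the description above.

```sql
-- Minimal rollback sketch; table, schema and column names are illustrative assumptions.
DECLARE @FailedModuleInstanceId INT = 12345;

-- Hub and Link (and Persistent Staging Area): remove everything the failed instance inserted.
DELETE FROM [int].[HUB_CUSTOMER]
WHERE INSERT_MODULE_INSTANCE_ID = @FailedModuleInstanceId;

-- Satellite: remove the rows inserted by the failed instance ...
DELETE FROM [int].[SAT_CUSTOMER]
WHERE INSERT_MODULE_INSTANCE_ID = @FailedModuleInstanceId;

-- ... and make the rows it end-dated the active records again (repair timelines).
UPDATE [int].[SAT_CUSTOMER]
SET EXPIRY_DATETIME = '9999-12-31',
    CURRENT_RECORD_INDICATOR = 'Y' -- assumed current-record flag
WHERE UPDATE_MODULE_INSTANCE_ID = @FailedModuleInstanceId;
```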
@@ -126,29 +133,17 @@ AREA | The Area table contains the list of architecture areas as defined in the
BATCH | The Batch table contains the unique list of Batches as registered in the framework. To be able to run successfully, each Batch must be present in this table with its own unique Batch ID. Batch IDs are generated keys.
BATCH_INSTANCE | At runtime, the framework generates a new Batch Instance ID for the Batch execution. This information is stored in this table along with ETL process statistics. The Batch Instance table is the driving table for process control and recovery, as it contains information about the status and results of the Batch run.
BATCH_MODULE | The Batch Module table contains the relationships between Batches and Modules. It is a many-to-many relationship, i.e. one Batch can contain multiple Modules, and one Module can be utilised by multiple Batches.
- DATA_AUDIT | The Data Audit table provides a location for custom functionality to perform sanity checks and/or housekeeping on a specific data store. These processes should be run separately from the main ETL processes and can be configured to perform a range of supporting functionality such as clean-ups and reconciliation.
- DATA_AUDIT_TYPE | The Data Audit Type table was added to allow for a classification of Data Store Audits, and to provide additional handling and descriptive information about these housekeeping processes.
- DATA_STORE | The Data Store table contains descriptive information of data stores that are read from or loaded by the ETL process. The ‘Allow Truncate Indicator’ attribute can be used in custom Stored Procedures to prevent accidental truncation of tables (safety catch).
- DATA_STORE_TYPE | The Data Store Type table contains optional descriptive information: the type of data stores, such as flat file or table.
- ERROR_BITMAP | The Error Bitmap table contains the master list of possible errors. One or more errors from this list may be detected and logged as an Error Bitmap in the target tables. A bitwise join will enable this bitmap to relate back to the various errors as defined in this table.
- ERROR_TYPE | The Error Type table contains descriptive information about types of events or errors for reporting purposes. By default all errors are associated with the Error Bitmap, but additional errors and error types can be added.
EVENT_LOG | The Event Log table is a generic logging table which is used to track and record events that happen during ETL execution. The Event Log table can contain informative details (i.e. ‘Batch Instance was created’) or information related to issues or errors provided by the ETL platform.
EVENT_TYPE | The Event Type table contains descriptive information about types of events or errors for reporting purposes, such as process logs, environment related issues, and custom defined errors or ETL process errors.
EXECUTION_STATUS | The Execution Status table contains descriptive attributes about the Execution Status codes that the framework uses during the ETL process.
- FREQUENCY | The Frequency table contains descriptive information about the frequency codes of a Batch run.
LAYER | The Layer table contains the list of Layers as defined in the ETL Framework Outline Architecture. Unlike the Areas, this information is not queried during Module execution and is purely descriptive for use in reporting. The Layer is the higher level classification of ETL processes in the ETL Framework.
MODULE | The Module table contains the unique list of Modules as registered in the framework. To be able to run successfully, each Module must be present in this table with its own unique Module ID. Module IDs are not generated keys; they are consistent across environments and represent a single ETL process.
- MODULE_DATA_STORE | The Module Data Store table contains the relationships between Modules and the Data Stores used in the Modules, for instance the target (mandatory) and source (optional) for each Module.
MODULE_INSTANCE | At runtime, the framework generates a new Module Instance ID for the Module execution. This information is stored in this table along with ETL process statistics. The Module Instance table is the driving table for process control and recovery, as it contains information about the status and results of the Module run. The generated Module Instance ID is stored in the target tables for audit trail purposes. It also contains additional runtime details, including the number of rows read (selected), inserted, updated, deleted, discarded or rejected.
MODULE_PARAMETER | The Module Parameter table creates a relationship between specific parameters and the Modules for which they are applicable. It is best practice to ‘register’ the Modules that require certain parameters in their processing using this table.
- MODULE_TYPE | The Module Type table contains optional descriptive information for reporting purposes. As the Module is defined as the smallest executable component, typically more than one type of Module is used, for instance ETL programs and operating scripts.
NEXT_RUN_INDICATOR | The Next Run Indicator table contains descriptive attributes about the Next Run Indicator codes that the framework uses during the ETL process.
PARAMETER | The Parameter table provides the option to define parameters that can be queried by custom code in the ETL process. This can include (but is not limited to) flags (Initial Load Y/N) or tracking date ranges for moving loading windows into the Presentation Layer.
PROCESSING_INDICATOR | The Processing Indicator table contains descriptive attributes about the Processing Indicator codes that the framework uses during the ETL process.
- RECORD_SOURCE | The Record Source table contains abbreviations and descriptions of the source systems that interface to the Data Warehouse. Depending on the Staging Layer design decisions, the Record Source Code is resolved to the ID during the Integration Layer ETL, or the ID is hard-coded in the Staging Area. Either way, the Record Source provides the option to load datasets from different systems that may contain similar information (i.e. the same keys) with different meaning.
- SEVERITY | Severity is an optional descriptive attribute that can be used to classify the level of Errors defined in the Bitmap Error table. It can be used for reporting purposes and to select (a certain quality of) data into the Presentation Layer.
SOURCE_CONTROL | The Source Control table is used in source-to-staging interfaces that require the administration of load windows. Examples are CDC-based interfaces, pull-delta interfaces, or when only a certain range from a full dataset is required but all data is provided. It is designed to track the load window for each individual Module.
- VERSION | Administrative information to record the DIRECT version used.
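To illustrate how these entities relate at design time and run time, the snippet below registers a hypothetical Module, links it to a Batch, and queries its run history. The schema name ([omd]) and the column names are assumptions derived from the descriptions above, not the framework's authoritative DDL.

```sql
-- Illustrative sketch only; schema, table and column details are assumptions
-- based on the entity descriptions above, not the framework's authoritative DDL.
INSERT INTO [omd].[MODULE] (MODULE_ID, MODULE_CODE, MODULE_DESCRIPTION, AREA_CODE)
VALUES (100, 'STG_CUSTOMER', 'Loads customer data into the Staging Area', 'STG');

INSERT INTO [omd].[BATCH_MODULE] (BATCH_ID, MODULE_ID)
VALUES (1, 100); -- link the Module to an existing Batch (many-to-many)

-- Audit trail: which instances of this Module have run, and with what results?
SELECT mi.MODULE_INSTANCE_ID,
       mi.EXECUTION_STATUS_CODE,
       mi.ROWS_INSERTED,
       mi.ROWS_UPDATED,
       mi.ROWS_REJECTED
FROM [omd].[MODULE_INSTANCE] mi
JOIN [omd].[MODULE] m ON m.MODULE_ID = mi.MODULE_ID
WHERE m.MODULE_CODE = 'STG_CUSTOMER'
ORDER BY mi.MODULE_INSTANCE_ID DESC;
```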
## Events

In order to provide a common, reusable means of interacting with the repository, the framework includes a number of processes which collectively serve as the logic tier. The implementation of these events varies depending on the ETL software used in the various projects. This information is captured using Implementation Patterns, documenting how these concepts can be implemented using specific software. The following events, or functions, are defined as part of the framework:
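The event list itself is not shown in this excerpt. Purely as an illustration of how such events typically bracket a Module run, the sketch below uses hypothetical procedure names and parameters; the actual event implementations are documented in the relevant Implementation Patterns.

```sql
-- Hypothetical event wrappers around a Module run; procedure and parameter names
-- are illustrative assumptions, not the framework's actual signatures.
DECLARE @ModuleInstanceId INT;

-- Register the start of the Module run (creates a MODULE_INSTANCE record).
EXECUTE [omd].[CreateModuleInstance]
    @ModuleCode       = 'STG_CUSTOMER',
    @ModuleInstanceId = @ModuleInstanceId OUTPUT;

-- ... the data integration process itself runs here, stamping @ModuleInstanceId
--     on its target rows for audit trail purposes ...

-- Register the outcome (updates the MODULE_INSTANCE record and logs an event).
EXECUTE [omd].[FinishModuleInstance]
    @ModuleInstanceId    = @ModuleInstanceId,
    @ExecutionStatusCode = 'S';
```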
000_Documentation/DIRECT_Setup_Tips.md (0 additions, 4 deletions)
@@ -1,7 +1,3 @@
- Can be used as a dacpac file.

## Using Direct as a database project reference

If required, Direct can be installed in each database project as a reference, which means the Direct content will be installed in the hosting database when published.