The Data Integration framework provides a software and methodology independent, structured approach to developing data processes.

The framework is designed to facilitate a platform-independent, flexible and manageable development cycle. It contains pre-defined documents, templates, design decisions, implementation approaches, as well as auditing and process control (orchestration).

The framework is defined in a modular way, allowing different elements to be selected to suit the needs of individual data solutions. The components can be used in conjunction with each other or as stand-alone additions to existing data solutions.

The fundamental principle of the framework is to design for change by decoupling 'technical' logic from 'business' logic, and by ensuring each data integration process can run independently and in parallel, with built-in recovery mechanisms.
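
To make this principle more concrete, the sketch below shows one way this separation can look in practice: a 'business logic' function that only transforms data, wrapped in a 'technical' load routine that is deterministic and safe to re-run. This is an illustrative example only; the table names, columns and the use of SQLite are assumptions made for the sketch and are not prescribed by the framework.

```python
# Illustrative sketch only: separating business logic from technical (load) logic,
# and making the load deterministic so it can be re-run at any point in time.
import sqlite3

def derive_full_name(row):
    """Business logic only: derive a presentable full name from raw source attributes."""
    customer_id, first_name, last_name = row
    return customer_id, f"{first_name.strip()} {last_name.strip()}".title()

def load_customer(connection, batch_id):
    """Technical logic only: a deterministic, rerunnable load for a single batch.
    Re-running the same batch first removes its earlier output, so recovery after a
    failure never requires manual clean-up or changes to other, parallel processes."""
    connection.execute("DELETE FROM int_customer WHERE batch_id = ?", (batch_id,))
    rows = connection.execute(
        "SELECT customer_id, first_name, last_name FROM stg_customer WHERE batch_id = ?",
        (batch_id,),
    ).fetchall()
    connection.executemany(
        "INSERT INTO int_customer (customer_id, full_name, batch_id) VALUES (?, ?, ?)",
        [(*derive_full_name(row), batch_id) for row in rows],
    )
    connection.commit()

if __name__ == "__main__":
    # Hypothetical staging and integration tables, created only to make the sketch runnable.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE stg_customer (customer_id, first_name, last_name, batch_id)")
    conn.execute("CREATE TABLE int_customer (customer_id, full_name, batch_id)")
    conn.execute("INSERT INTO stg_customer VALUES (1, ' john', 'doe ', 100)")
    load_customer(conn, 100)
    load_customer(conn, 100)  # re-running the same batch is safe and produces no duplicates
    print(conn.execute("SELECT * FROM int_customer").fetchall())  # [(1, 'John Doe', 100)]
```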

The framework aims to provide guidelines for decoupling (functional separation) of the various elements of the data solution, so that new or changed requirements can be incorporated without re-engineering the data solution foundations.

To enable collective maintenance of this body of knowledge, these standards are developed and maintained in Markdown format on GitHub.

On several occasions, the Data Integration framework makes mention of an ETL process control framework. Although other control frameworks can be used, the default option is the DIRECT framework as maintained in the [DIRECT GitHub repository](https://github.com/RoelantVos/DIRECT) (private at the moment while being finalised).
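
As an illustration of what such process control involves, the sketch below wraps a data integration process so that its start, end and outcome are recorded in a control table. This is a minimal, hypothetical sketch; the table, column and function names are assumptions made for the example and do not represent the actual DIRECT repository model or API.

```python
# Illustrative sketch only: a minimal process-control wrapper that records the start,
# end and outcome of a data integration process. Names are assumptions, not DIRECT's API.
import sqlite3
from datetime import datetime, timezone

control_db = sqlite3.connect(":memory:")
control_db.execute("""
    CREATE TABLE module_instance (
        instance_id    INTEGER PRIMARY KEY AUTOINCREMENT,
        module_name    TEXT,
        start_datetime TEXT,
        end_datetime   TEXT,
        status         TEXT,
        rows_processed INTEGER,
        message        TEXT
    )""")

def run_with_process_control(module_name, process):
    """Register the execution, run the process, and record the outcome for auditing."""
    cursor = control_db.execute(
        "INSERT INTO module_instance (module_name, start_datetime, status) VALUES (?, ?, ?)",
        (module_name, datetime.now(timezone.utc).isoformat(), "Executing"),
    )
    instance_id = cursor.lastrowid
    control_db.commit()
    try:
        rows_processed = process()
        control_db.execute(
            "UPDATE module_instance SET end_datetime = ?, status = ?, rows_processed = ? "
            "WHERE instance_id = ?",
            (datetime.now(timezone.utc).isoformat(), "Succeeded", rows_processed, instance_id),
        )
    except Exception as exc:
        control_db.execute(
            "UPDATE module_instance SET end_datetime = ?, status = ?, message = ? "
            "WHERE instance_id = ?",
            (datetime.now(timezone.utc).isoformat(), "Failed", str(exc), instance_id),
        )
        raise
    finally:
        control_db.commit()

# Example usage: wrap an (assumed) staging load so its execution becomes auditable.
run_with_process_control("stg_customer_load", lambda: 42)
print(control_db.execute(
    "SELECT module_name, status, rows_processed FROM module_instance").fetchall())
```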

## Why a Data Integration framework?

*‘If we want better performance we can buy better hardware; unfortunately, we cannot buy a more maintainable or reliable system.’*

The design and implementation of data integration can be a labour-intensive activity that typically consumes large amounts of effort in Data Warehouse and data integration projects.

Over time, as requirements change and enterprises become more data-driven, the architecture faces challenges in the complexity, consistency and flexibility of the design (and maintenance) of data integration flows.

These changes can include changes in latency and availability requirements, a greater variety of sources, or the need to expose information in different ways. This typically occurs when the adoption of data and information products (e.g. BI, Analytics) matures within an organisation and having up-to-date information becomes more mission-critical.

A standard data integration approach addresses these challenges by providing structure, flexibility and scalability in the design of data flows.

In a more traditional configuration, data solutions are often designed to store structured data for strategic decision making. This type of solution allows a small number of (expert) users to analyse (historical) data and define reports.

Data is periodically extracted, cleansed, integrated and transformed from a heterogeneous set of sources into a centralised Data Warehouse. The focus for ETL in these designs is typically on ‘correct functionality’ and ‘adequate performance’, but not necessarily on design elements that are equally important for success.

These elements, including consistency, degree of atomicity, ability to rerun, scalability and durability, are addressed in the Data Integration framework.

For example, data solutions may be required to cater for sending cleansed or interpreted data back to the operational (feeding, or source) systems. They may also need to handle unstructured data in addition to structured data, as well as being able to respond quickly to changes in (business) requirements. Lastly, they may need to support a ‘feedback loop’ to incorporate changes made by (authorised) end-users in front-end environments.

The Data Integration framework intends to provide architecture patterns and templates, design decisions, and guidelines for error handling and process control, supporting a flexible and manageable development cycle.

## Key contents

The framework contains a reference Solution Architecture that provides an overview of the definitions and intent of the various layers and areas that can be considered in the design.
37 | 43 |
|
38 | 44 | The core body of knowledge sits in the various *Design Patterns* (details of specific concepts) and *Solution Patterns* (implementation guides at technical level).
|
39 | 45 |
|
40 |
| -The idea is that Design- and Solution patterns are continuously updated and added to. A typical solution design would select the relevant patterns to define the architecture. |
| 46 | +The idea is that Design- and Solution patterns are continuously updated and added to. A typical solution design would select the relevant patterns to define the architecture - captured in the Solution Architecture design artefact. |

# Data Integration framework components

The diagram below outlines the Data Integration framework components. These are all required to define a data solution that supports Data Warehouse Automation.

The idea is to enable a standard and structured way of documenting decisions related to system design and operation.

![Data Integration Framework Overview](images/5C33E1E3-9AAA-4F58-B84E-D2F2C5C946A1.png)

- **Reference Solution Architecture**; a blueprint for a common data solution architecture such as Data Warehouses, Data Hubs, etc. The corresponding documents outline the various layers and areas that define the data solution.
- **Reference Technical Architecture**; capturing the technical details relevant to the Solution Architecture. The intent for this template is to capture the infrastructure and software specifics, as well as context for the physical data models and database / data platform configuration. The Technical Architecture also covers details around the implementation of security, encryption and retention approaches.
- **Design Patterns**; documentation of key design decisions and backgrounds on design principles: the 'how-to's'. This includes the application of data integration and modelling concepts. Design Patterns follow a defined template and are centrally stored and managed.
- **Solution Patterns**; the practical details on how to implement concepts explained in a Design Pattern for a given technology. Similar to Design Patterns, the Solution Patterns all follow the same template. In many cases a single Design Pattern is referred to by multiple Solution Patterns, each of which documents how to implement the concept for a specific technology.
- **Documentation templates, standards and conventions**; modelling and technical conventions.
- **ETL templates & patterns**; technical templates that can be used as blueprints to generate data integration processes with or against.
- **ETL mapping metadata**; approaches for managing the source-to-target mappings - vital ETL metadata to enable Data Warehouse Automation / ETL generation. A small illustrative sketch of such a mapping follows this list.
- **ETL process control framework**; the runtime execution, logging and monitoring of data integration processes, including recovery and orchestration. This is further detailed in the DIRECT GitHub repository (Data Integration Runtime Execution and Control framework). DIRECT includes a repository for ETL control, integration hooks for ETL processes and automation scripts.
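
As a minimal illustration of ETL mapping metadata, the sketch below captures a single source-to-target mapping as plain metadata and generates a load statement from it. The mapping structure, names and generator are assumptions made for the example; they do not represent a prescribed metadata model or any specific automation tool.

```python
# A single source-to-target mapping captured as metadata (structure is illustrative only).
mapping = {
    "mapping_name": "stg_customer_to_int_customer",
    "source_table": "stg_customer",
    "target_table": "int_customer",
    "column_mappings": [
        {"source_column": "customer_id", "target_column": "customer_id"},
        {"source_column": "first_name",  "target_column": "given_name"},
        {"source_column": "last_name",   "target_column": "family_name"},
    ],
}

def generate_insert_select(m):
    """Generate a basic INSERT .. SELECT statement from the mapping metadata."""
    target_columns = ", ".join(c["target_column"] for c in m["column_mappings"])
    source_columns = ", ".join(c["source_column"] for c in m["column_mappings"])
    return (
        f"INSERT INTO {m['target_table']} ({target_columns})\n"
        f"SELECT {source_columns}\n"
        f"FROM {m['source_table']};"
    )

print(generate_insert_select(mapping))
```

In practice, mappings like this would be stored in a metadata repository and used to generate the corresponding data integration processes for the selected target platform.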