Commit 440cf77: Added some solution patterns
# Implementation Pattern - SQL Server 2008 / 2012 / 2014 - CDC and Replication

## Purpose

This implementation pattern documents how SQL Server Replication and native Change Data Capture (CDC) are configured to be consistent with the ETL Framework.

## Structure

The aim is to be as unintrusive on the source systems as possible when using Replication and CDC. This is achieved by:
- Configuring a replication Publication Agent on the source system for the required tables and databases. This is a transactional replication.
- Configuring the Distribution Agent (distribution database) on the Data Warehouse server.
- Configuring the Subscribing Agent on the Data Warehouse server as a pull mechanism, for more flexibility (when splitting the location of the Distribution and Subscribing Agents).

The main reason for this configuration is to minimise the impact on the source system. The services (agents) can later be hosted on servers other than the Data Warehouse server if required (for example in a central distribution hub).
By creating a Subscribing Agent on the Data Warehouse server you automatically create a replicated table, on which native CDC can be enabled. The resulting database structure is as follows:

- Source database (typically on another server).
- Replicated source database (on the Data Warehouse server). This database has CDC enabled and will therefore contain both the replicated source tables and the log of changes on this source (the corresponding CDC tables).
- Staging Area database (on the Data Warehouse server) as part of the default ETL Framework (100_Staging_Area). This database will ultimately receive the CDC delta.
The following diagram shows the overview of this implementation of replication and CDC:

Figure 1: Replication and CDC configuration
SQL Server's native CDC functionality reads the transaction log to record changes for each table for which CDC is enabled. It writes those changes to system tables in the same database, and those system tables are accessible through direct queries or system functions. CDC can be enabled using the available procedures in SQL Server:

- Execute sys.sp_cdc_enable_db on the Replicated Source database.
- Execute sys.sp_cdc_enable_table on the table that should have CDC enabled. The following minimum parameters are required for use in the ETL Framework:
  - @source_schema: the database schema if available, otherwise 'dbo' as the default schema.
  - @source_name: the name of the source table.
  - @supports_net_changes: 1 (enabled).

The newly created CDC table is created under the 'cdc' schema as part of the System Tables.
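The two steps above can be combined into a single script. A minimal sketch, assuming a hypothetical replicated source database and a hypothetical table dbo.Employee (note that @supports_net_changes = 1 requires the table to have a primary key, or a unique index specified via @index_name):

```sql
-- Run in the context of the replicated source database (hypothetical name)
USE [ReplicatedSourceDatabase];
GO

-- Step 1: enable CDC at database level
EXEC sys.sp_cdc_enable_db;
GO

-- Step 2: enable CDC for an individual table
EXEC sys.sp_cdc_enable_table
    @source_schema        = N'dbo',       -- schema, defaulting to 'dbo'
    @source_name          = N'Employee',  -- hypothetical source table
    @role_name            = NULL,         -- no gating database role
    @supports_net_changes = 1;            -- enable net change queries
GO
```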
## Related Design Patterns

None.

## Consequences

This approach requires changes to the source systems, which may not always be possible or allowed. It has to be verified whether you are allowed to install the Publication Agent on the source system.
Information about the state of CDC, or disabling the mechanism, is available using procedures similar to the creation statement:

```sql
EXEC sys.sp_cdc_disable_table
    @source_schema    = N'dbo',
    @source_name      = N'Employee',
    @capture_instance = N'dbo_Employee';

EXEC sys.sp_cdc_help_change_data_capture;
```
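To illustrate how the recorded changes can be retrieved for the Staging Area, a hedged sketch using the net-changes table-valued function that CDC generates per capture instance (the capture instance name dbo_Employee is an assumption following the example above):

```sql
-- Determine the LSN range to query (here: everything captured so far)
DECLARE @from_lsn BINARY(10) = sys.fn_cdc_get_min_lsn(N'dbo_Employee');
DECLARE @to_lsn   BINARY(10) = sys.fn_cdc_get_max_lsn();

-- Retrieve the net changes (one row per changed key) as the CDC delta
SELECT *
FROM cdc.fn_cdc_get_net_changes_dbo_Employee(@from_lsn, @to_lsn, N'all');
```

In the ETL Framework this delta would be landed in the 100_Staging_Area database.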
## Discussion items (not yet to be implemented or used until final)

None.
# Solution Pattern - Data Modelling - Data Vault Integration Layer

## Purpose

This Solution Pattern describes the data modelling conventions for a Data Vault based Integration Layer. It documents how the OMD attributes are used to support Data Vault based ETL.

## Structure

The modelling conventions for Data Vault are as follows:
| Table Type | Table name convention | Mandatory attributes | Comments |
| --- | --- | --- | --- |
| Hub | `HUB_<name>` | `<table name without HUB_>_SK`<br>`OMD_INSERT_MODULE_INSTANCE_ID`<br>`OMD_INSERT_DATETIME`<br>`OMD_RECORD_SOURCE_ID`<br>`<Business Key>` | The first attribute (SK) is the primary key.<br>A unique key / index is placed on the `<Business Key>` attribute (5). |
| Link | `LNK_<name>` | `<table name without LNK_>_SK`<br>`OMD_INSERT_MODULE_INSTANCE_ID`<br>`OMD_INSERT_DATETIME`<br>`OMD_RECORD_SOURCE_ID`<br>`<Hub Keys>` | The first attribute (SK) is the primary key.<br>A unique key / index is placed on the combination of Hub keys. |
| Satellite | `SAT_<name>` | `<Hub key, inherited from the Hub table>`<br>`OMD_EFFECTIVE_DATETIME`<br>`OMD_CURRENT_RECORD_INDICATOR`<br>`OMD_EXPIRY_DATETIME`<br>`OMD_INSERT_MODULE_INSTANCE_ID`<br>`OMD_UPDATE_MODULE_INSTANCE_ID`<br>`OMD_DELETED_RECORD_INDICATOR`<br>`OMD_SOURCE_ROW_ID`<br>`OMD_CHECKSUM` | The first 3 attributes compose the primary key.<br>The Hub Key attribute is not set in this table, but inherited from the parent Hub table. |
| Link Satellite | `LSAT_<name>` | `<Link key, inherited from the Link table>`<br>`OMD_EFFECTIVE_DATETIME`<br>`OMD_CURRENT_RECORD_INDICATOR`<br>`OMD_EXPIRY_DATETIME`<br>`OMD_INSERT_MODULE_INSTANCE_ID`<br>`OMD_UPDATE_MODULE_INSTANCE_ID`<br>`OMD_DELETED_RECORD_INDICATOR`<br>`OMD_SOURCE_ROW_ID`<br>`OMD_CHECKSUM` | The first 3 attributes compose the primary key.<br>The Link Key attribute is not set in this table, but inherited from the parent Link table. |
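As an illustration of the Hub convention above, a minimal sketch of a hypothetical Customer Hub (the table name, Business Key name, and data types are assumptions; the pattern itself does not prescribe specific types for the SK or Business Key):

```sql
CREATE TABLE HUB_CUSTOMER (
    CUSTOMER_SK                   INT IDENTITY(1,1) NOT NULL, -- surrogate key
    OMD_INSERT_MODULE_INSTANCE_ID INT               NOT NULL,
    OMD_INSERT_DATETIME           DATETIME2(7)      NOT NULL,
    OMD_RECORD_SOURCE_ID          INT               NOT NULL,
    CUSTOMER_NUMBER               NVARCHAR(100)     NOT NULL, -- the Business Key
    -- The first attribute (SK) is the primary key
    CONSTRAINT PK_HUB_CUSTOMER PRIMARY KEY (CUSTOMER_SK),
    -- A unique key is placed on the Business Key attribute
    CONSTRAINT UK_HUB_CUSTOMER UNIQUE (CUSTOMER_NUMBER)
);
```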
The following data types apply:

- OMD_EFFECTIVE_DATETIME: high precision datetime (e.g. datetime2(7), date)
- OMD_CURRENT_RECORD_INDICATOR: integer
- OMD_EXPIRY_DATETIME
- OMD_INSERT_MODULE_INSTANCE_ID
- OMD_UPDATE_MODULE_INSTANCE_ID
- OMD_DELETED_RECORD_INDICATOR
- OMD_SOURCE_ROW_ID
- OMD_CHECKSUM
## Color schema

The following color schema modelling convention for Data Vault is used:

| Table Type | Color |
| --- | --- |
| Hub | Blue |
| Link | Red |
| Satellite | Light yellow |
| Link Satellite | Dark yellow |
TODO (Kim Vigors): fix the colours in the table, or insert as a picture. Check with Roelant Vos.
## Related Design Patterns

None.

## Consequences

Index strategy is documented in dedicated RDBMS Solution Patterns.

## Discussion items (not yet to be implemented or used until final)

None.
# Solution Pattern - Data Modelling - Presentation Layer

## Purpose

This Solution Pattern describes the data modelling conventions and architecture for the Presentation Layer.

## Structure

In principle, there are two mechanisms for preparing information for consumption in the Presentation Layer (created in the Presentation Layer database):
- Direct view on top of the Integration Layer (virtual information mart). In by far most scenarios this first option (direct view / virtual) is preferred, as the subsequent layers in the (BI) architecture are typically MOLAP or in-memory.
- Table / persistence / physical storage, using a view to join and prepare the data in the format that matches the table (logic) and can be used to incrementally load the table.

In both cases these Presentation Layer objects will require one or more views to decouple the Business Intelligence (BI) and Data Warehouse (DWH) environments. These decoupling views are also intended to apply the history perspective at attribute level; i.e. how every attribute is displayed in time (e.g. Type 1, Type 2, Type 6).
The following are general guidelines:

- The logic views and tables contain all history (Type 2) by default. Any interpretation of history, such as a 'current state view', can be queried using the decoupling views.
- There may be multiple decoupling views, but two views is the guideline, to represent history using different perspectives (e.g. current state / Type 1 or mixed / Type 2).
- The decoupling views face the BI / business side and aim to present information in the way that is easiest to consume.
- The decoupling views should be generated from metadata, and use Extended Properties defined on the logic view to generate an accurate representation of history.
- The logic views used to populate tables are geared towards ETL and follow the rigour of naming conventions to support automation. BIML scripts are available to generate SSIS packages from the logic view to the Presentation Layer tables.
- The pres schema is the enterprise information / mart schema and contains the decoupling views, so it contains what is effectively the complete dimensional model exposed to the BI environment. This also allows the decoupling views to have the same name as the accompanying Dimension or Fact table.
- The ben schema contains the logic views and tables, since objects (views and tables in this case) cannot be named identically (as they would be in the pres schema). The views are named with the '_VW' suffix.
- Normal casing is used, with underscores (no spaces), for all tables and attributes.
- Definitions are maintained in the Confluence Business Glossary.
- Logic views are primarily manually developed (with some history merge scripts to assist), as these views handle the change from data handling to business use.
- Tables and decoupling views are generated from metadata.
- Decoupling views are used to expose history using additional metadata ('extended properties').
- The extract schema is used for data provision to support external systems (e.g. non-BI) and is therefore considered not to be part of the standard Presentation Layer.
- There is also a va schema which is specifically there to expose information to SAS Visual Analytics.
- There is also a temp schema which is strictly used only to store information required by the ETL / to support the performance and workings of the ETL.
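The attribute-level history classification can, for example, be registered as an Extended Property on the logic view. A hedged sketch using SQL Server's sp_addextendedproperty (the property name 'HistoryType' and the object and column names are assumptions, not prescribed by this pattern):

```sql
-- Register the history type (e.g. Type2) for one attribute of a logic view,
-- so decoupling views can be generated from this metadata.
EXEC sys.sp_addextendedproperty
    @name       = N'HistoryType',             -- hypothetical property name
    @value      = N'Type2',                   -- history classification
    @level0type = N'SCHEMA', @level0name = N'ben',
    @level1type = N'VIEW',   @level1name = N'DIM_CUSTOMER_VW',
    @level2type = N'COLUMN', @level2name = N'CUSTOMER_NAME';
```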
This is displayed in the following diagram:

The modelling conventions for the Presentation Layer tables are outlined in the table below.
| Table Type | Table name convention | Mandatory attributes | Comments |
| --- | --- | --- | --- |
| Dimension | `ben.DIM_<name>` | `<table name>_SK`<br>`OMD_INSERT_MODULE_INSTANCE_ID`<br>`OMD_DELETED_RECORD_INDICATOR`<br>`OMD_UPDATE_MODULE_INSTANCE_ID`<br>`OMD_CHECKSUM_TYPE_1`<br>`OMD_CHECKSUM_TYPE_2`<br>`OMD_EFFECTIVE_DATETIME`<br>`OMD_EXPIRY_DATETIME`<br>`OMD_CURRENT_RECORD_INDICATOR`<br>`<attributes>` | The first attribute (SK) is the primary key, and is a hash value (32 byte character).<br>Optionally, a unique key / index is placed on the combination of level natural keys and the OMD_EFFECTIVE_DATETIME. This represents a unique point-in-time record; see the Consequences section for more details.<br>Every attribute is specified as Type 0, Type 1 or Type 2 (can be combined to Type 3 or 6; check the relevant pattern). This is specified in the model / database as an extended property. |
| Fact Table | `ben.FACT_<name>` | `<table name>_SK`<br>`<Dimension Keys>`<br>`OMD_INSERT_MODULE_INSTANCE_ID`<br>`OMD_INSERT_DATETIME`<br>`OMD_DELETED_RECORD_INDICATOR`<br>`OMD_UPDATE_MODULE_INSTANCE_ID`<br>`OMD_CHECKSUM_TYPE_1`<br>`OMD_CHECKSUM_TYPE_2`<br>`OMD_EFFECTIVE_DATETIME`<br>`OMD_EXPIRY_DATETIME`<br>`OMD_CURRENT_RECORD_INDICATOR`<br>`<attributes>` | The first attribute (SK) is the primary key.<br>A unique key / index is placed on the combination of Dimension keys. |
| Other | `ben.<name>` | `<table name>_SK`<br>`OMD_INSERT_MODULE_INSTANCE_ID`<br>`OMD_INSERT_DATETIME`<br>`<any OMD attributes required>`<br>`<attributes>` | Not every delivery of information is necessarily in the form of a Star Schema / Dimensional Model. If a dataset is better delivered in a different format (wide table, normalised) this is preferred. |
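To make the Dimension convention concrete, a hedged sketch of a hypothetical Customer Dimension (the business attribute and most data types are assumptions; the surrogate key is a 32 byte character hash as stated in the convention):

```sql
CREATE TABLE ben.DIM_CUSTOMER (
    DIM_CUSTOMER_SK               CHAR(32)      NOT NULL, -- hash surrogate key
    OMD_INSERT_MODULE_INSTANCE_ID INT           NOT NULL,
    OMD_DELETED_RECORD_INDICATOR  INT           NOT NULL,
    OMD_UPDATE_MODULE_INSTANCE_ID INT           NULL,
    OMD_CHECKSUM_TYPE_1           CHAR(32)      NOT NULL,
    OMD_CHECKSUM_TYPE_2           CHAR(32)      NOT NULL,
    OMD_EFFECTIVE_DATETIME        DATETIME2(7)  NOT NULL,
    OMD_EXPIRY_DATETIME           DATETIME2(7)  NOT NULL,
    OMD_CURRENT_RECORD_INDICATOR  INT           NOT NULL,
    CUSTOMER_NAME                 NVARCHAR(100) NULL,     -- example attribute
    -- The first attribute (SK) is the primary key
    CONSTRAINT PK_DIM_CUSTOMER PRIMARY KEY (DIM_CUSTOMER_SK)
);
```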
The modelling conventions for the Presentation Layer views are outlined in the table below.
| View Type | View name convention | Mandatory attributes | Comments |
| --- | --- | --- | --- |
| Logic View | `ben.<name>_VW` | `<view name>_SK`<br>`OMD_INSERT_MODULE_INSTANCE_ID`<br>`OMD_INSERT_DATETIME`<br>`OMD_DELETED_RECORD_INDICATOR`<br>`OMD_UPDATE_MODULE_INSTANCE_ID`<br>`OMD_CHECKSUM_TYPE_1`<br>`OMD_CHECKSUM_TYPE_2`<br>`OMD_EFFECTIVE_DATETIME`<br>`OMD_EXPIRY_DATETIME`<br>`OMD_CURRENT_RECORD_INDICATOR`<br>`<attributes>` | Used to load a standard Dimension or Fact table supported by BIML.<br>The name of the view needs to match the name of the target table (except for the _VW suffix).<br>The _VW suffix is required as there may be a table with the original name in the ben schema.<br>The checksums for Type 1 and Type 2 calculations are handled by the BIML and do not need to be present in the views. This allows for a more automated update if required.<br>All other OMD attributes required in the target table are handled by the BIML scripts. |
| Decoupling View | `pres.<name>` (for regular views)<br>`pres.<name>_history` (for history or mixed-history views) | The attributes of the underlying 'ben' table or logic view, but without the OMD attributes.<br>Surrogate keys are optional. | Business-facing, e.g. DIM_CUSTOMER or DIM_CUSTOMER_HISTORY. |
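To illustrate the decoupling view convention, a hedged sketch of a current-state (Type 1 perspective) decoupling view over a hypothetical ben.DIM_CUSTOMER table; the OMD attributes are hidden from the business-facing pres schema (the indicator value is an assumption):

```sql
CREATE VIEW pres.DIM_CUSTOMER
AS
SELECT
    DIM_CUSTOMER_SK,  -- surrogate key (optional in decoupling views)
    CUSTOMER_NAME     -- business attributes only; OMD attributes excluded
FROM ben.DIM_CUSTOMER
WHERE OMD_CURRENT_RECORD_INDICATOR = 1;  -- expose the current state only
```

A companion pres.DIM_CUSTOMER_HISTORY view would omit the current-record filter to expose the full (Type 2) history.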
## Related Design Patterns

Design Pattern 002 - Generic - Types of History

## Consequences

Related to having Data Vault Surrogate (Hub) Keys (SK) in the Dimensional Model: it is OK to add Hub keys (Surrogate Keys) in the Presentation Layer for tracing and auditability purposes. However, they cannot be adequately used as level keys, as a level in a Dimension may not map 100% to a business concept. For instance, a 'business unit type' may not be modelled as a Hub in the Data Vault, but could be a level in a Dimension. By using Hub Keys for Dimension lookups, a dependency between the Integration and Presentation Layers is created that should be avoided. An example is where you have Business Unit Type, State, Counter and Ownership in the same Satellite (e.g. SAT_BUSINESS_UNIT). If these attributes are modelled in separate Dimensions in the Presentation Layer, the Hub Key (from HUB_BUSINESS_UNIT) cannot be used; rather, a separate Dimension Key must be created and a dedicated natural key must be selected appropriate for the dimension. In other words, lookups and constraints should use natural keys.

## Discussion items (not yet to be implemented or used until final)

None.
