
Commit 7a2a81c

Merge pull request #88509 from djpmsft/updates

Update Schema drift docs

2 parents 63e2769 + 75ba519 commit 7a2a81c

File tree

9 files changed: +31 / -42 lines changed
Lines changed: 30 additions & 41 deletions
@@ -1,78 +1,67 @@
---
-title: Azure Data Factory Mapping Data Flow Schema Drift
+title: Schema drift in Mapping Data Flow | Azure Data Factory
description: Build resilient Data Flows in Azure Data Factory with Schema Drift
author: kromerm
ms.author: makromer
+ms.reviewer: daperlov
ms.service: data-factory
ms.topic: conceptual
-ms.date: 10/04/2018
+ms.date: 09/12/2019
---

-# Mapping data flow schema drift
+# Schema drift in Mapping Data Flow

[!INCLUDE [notes](../../includes/data-factory-data-flow-preview.md)]

-The concept of Schema Drift is the case where your sources often change metadata. Fields, columns, types, etc. can be added, removed or changed on the fly. Without handling for Schema Drift, your Data Flow becomes vulnerable to changes in upstream data source changes. When incoming columns and fields change, typical ETL patterns fail because they tend to be tied to those source names.
+Schema drift is the case where your sources often change metadata. Fields, columns, and types can be added, removed, or changed on the fly. Without handling for schema drift, your data flow becomes vulnerable to upstream data source changes. Typical ETL patterns fail when incoming columns and fields change because they tend to be tied to those source names.

-In order to protect against Schema Drift, it is important to have the facilities in a Data Flow tool to allow you, as a Data Engineer, to:
+To protect against schema drift, it's important to have the facilities in a data flow tool to allow you, as a Data Engineer, to:

-* Define sources that have mutable field names, data types, values and sizes
+* Define sources that have mutable field names, data types, values, and sizes
* Define transformation parameters that can work with data patterns instead of hard-coded fields and values
* Define expressions that understand patterns to match incoming fields, instead of using named fields

-## How to implement schema drift in ADF Mapping Data Flows
-ADF natively supports flexible schemas that change from execution to execution so that you can build generic data transformation logic without the need to recompile your data flows.
+Azure Data Factory natively supports flexible schemas that change from execution to execution so that you can build generic data transformation logic without the need to recompile your data flows.

-* Choose "Allow Schema Drift" in your Source Transformation
+You need to make an architectural decision in your data flow to accept schema drift throughout your flow. When you do this, you can protect against schema changes from the sources. However, you'll lose early-binding of your columns and types throughout your data flow. Azure Data Factory treats schema drift flows as late-binding flows, so when you build your transformations, the drifted column names won't be available to you in the schema views throughout the flow.

-<img src="media/data-flow/schemadrift001.png" width="400">
+## Schema drift in source

-* When you've selected this option, all incoming fields will be read from your source on every Data Flow execution and will be passed through the entire flow to the Sink.
+In a source transformation, schema drift is defined as reading columns that aren't defined in your dataset schema. To enable schema drift, check **Allow schema drift** in your source transformation.

-* All newly detected columns (drifted columns) will arrive as String data type by default. In your Source Transformation, choose "Infer drifted column types" if you wish to have ADF automatically infer data types from the source.
+![Schema drift source](media/data-flow/schemadrift001.png "Schema drift source")

-* Make sure to use "Auto-Map" to map all new fields in the Sink Transformation so that all new fields get picked-up and landed in your destination and set "Allow Schema Drift" on the Sink as well.
+When schema drift is enabled, all incoming fields are read from your source during execution and passed through the entire flow to the Sink. By default, all newly detected columns, known as *drifted columns*, arrive as a string data type. If you wish for your data flow to automatically infer data types of drifted columns, check **Infer drifted column types** in your source settings.

-<img src="media/data-flow/automap.png" width="400">
+## Schema drift in sink

-* Everything will work when new fields are introduced in that scenario with a simple Source -> Sink (Copy) mapping.
+In a sink transformation, schema drift is when you write additional columns on top of what is defined in the sink data schema. To enable schema drift, check **Allow schema drift** in your sink transformation.

-* To add transformations in that workflow that handles schema drift, you can use pattern matching to match columns by name, type, and value.
+![Schema drift sink](media/data-flow/schemadrift002.png "Schema drift sink")

-* Click on "Add Column Pattern" in the Derived Column or Aggregate transformation if you wish to create a transformation that understands "Schema Drift".
+If schema drift is enabled, make sure the **Auto-mapping** slider in the Mapping tab is turned on. With this slider on, all incoming columns are written to your destination. Otherwise, you must use rule-based mapping to write drifted columns.
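
As a rough sketch of the rule-based alternative (an illustrative rule, not part of this change; `type` and `$$` are the pattern constructs this article already uses, where `$$` stands for each matched column), a single rule could write every drifted string column to the destination under its original name:

```
Matching condition:  type == 'string'
Output column name:  $$
```

A broader matching condition such as `true()` would pass every drifted column through, regardless of type.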

-<img src="media/data-flow/columnpattern.png" width="400">
+![Sink auto mapping](media/data-flow/automap.png "Sink auto mapping")

-> [!NOTE]
-> You need to make an architectural decision in your data flow to accept schema drift throughout your flow. When you do this, you can protect against schema changes from the sources. However, you will lose early-binding of your columns and types throughout your data flow. Azure Data Factory treats schema drift flows as late-binding flows, so when you build your transformations, the column names will not be available to you in the schema views throughout the flow.
+## Transforming drifted columns

-<img src="media/data-flow/taxidrift1.png" width="400">
+When your data flow has drifted columns, you can access them in your transformations with the following methods (a short example follows the list):

-In the Taxi Demo sample Data Flow, there is a sample Schema Drift in the bottom data flow with the TripFare source. In the Aggregate transformation, notice that we are using the "column pattern" design for the aggregation fields. Instead of naming specific columns, or looking for columns by position, we assume that the data can change and may not appear in the same order between runs.
+* Use the `byPosition` and `byName` expressions to explicitly reference a column by name or position number.
+* Add a column pattern in a Derived Column or Aggregate transformation to match on any combination of name, stream, position, or type.
+* Add rule-based mapping in a Select or Sink transformation to match drifted columns to column aliases via a pattern.
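
For instance (a sketch only, reusing the `byName`, `byPosition`, and `$$` constructs named above; the position number 12 is hypothetical), a drifted column can be referenced explicitly in an expression:

```
toInteger(byName('movieId'))
toString(byPosition(12))
```

And a column pattern in an Aggregate transformation can act on every matched column of a given type, with `$$` standing for each matched column:

```
Match on:              type == 'double'
New column name:       concat($$, '_total')
Aggregate expression:  round(sum($$))
```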

-In this example of Azure Data Factory Data Flow schema drift handling, we've built and aggregation that scans for columns of type 'double', knowing that the data domain contains prices for each trip. We can then perform an aggregate math calculation across all double fields in the source, regardless of where the column lands and regardless of the column's naming.
+For more information on how to implement column patterns, see [Column patterns in Mapping Data Flow](concepts-data-flow-column-pattern.md).

-The Azure Data Factory Data Flow syntax uses $$ to represent each matched column from your matching pattern. You can also match on column names using complex string search and regular expression functions. In this case, we are going to create a new aggregated field name based on each match of a 'double' type of column and append the text ```_total``` to each of those matched names:
+### Map drifted columns quick action

-```concat($$, '_total')```
+To explicitly reference drifted columns, you can quickly generate mappings for these columns via a data preview quick action. Once [debug mode](concepts-data-flow-debug-mode.md) is on, go to the Data Preview tab and click **Refresh** to fetch a data preview. If Data Factory detects that drifted columns exist, you can click **Map Drifted** and generate a derived column that allows you to reference all drifted columns in schema views downstream.

-Then, we will round and sum the values for each of those matched columns:
+![Map drifted](media/data-flow/mapdrifted1.png "Map drifted")

-```round(sum ($$))```
+In the generated Derived Column transformation, each drifted column is mapped to its detected name and data type. In the above data preview, the column 'movieId' is detected as an integer. After **Map Drifted** is clicked, movieId is defined in the Derived Column as `toInteger(byName('movieId'))` and included in schema views in downstream transformations.
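
The generated mapping takes roughly this shape (only 'movieId' appears in the preview described above; 'rating' is a hypothetical second drifted column added for illustration):

```
Column: movieId    Expression: toInteger(byName('movieId'))
Column: rating     Expression: toDouble(byName('rating'))
```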

-You can see this schema drift functionality at work with the Azure Data Factory Data Flow sample "Taxi Demo". Switch on the Debug session using the Debug toggle at the top of the Data Flow design surface so that you can see your results interactively:
-
-<img src="media/data-flow/taxidrift2.png" width="800">
-
-## Access new columns downstream
-When you generate new columns with column patterns, you can access those new columns later in your data flow transformations with these methods:
-
-* Use "byPosition" to identify the new columns by position number.
-* Use "byName" to identify the new columns by their name.
-* In Column Patterns, use "Name", "Stream", "Position", or "Type" or any combination of those to match new columns.
-
-## Rule-based mapping
-The Select and Sink transformation support pattern matching via rule-based mapping. This will allow you to build rules that can map drifted columns to column aliases and to sink those columns to your destination.
+![Map drifted](media/data-flow/mapdrifted2.png "Map drifted")

## Next steps
-In the [Data Flow Expression Language](data-flow-expression-functions.md) you will find additional facilities for column patterns and schema drift including "byName" and "byPosition".
+In the [Data Flow Expression Language](data-flow-expression-functions.md), you'll find additional facilities for column patterns and schema drift including "byName" and "byPosition".

articles/data-factory/continuous-integration-deployment.md

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@ ms.author: daperlov
ms.reviewer: maghan
manager: jroth
ms.topic: conceptual
-ms.date: 01/17/2019
+ms.date: 08/14/2019
---

# Continuous integration and delivery (CI/CD) in Azure Data Factory
(7 binary image files changed; previews: 51.5 KB, 170 KB, 169 KB, 51.5 KB, 62.2 KB; 2 files not shown)

0 commit comments