---
title: Schema drift in Mapping Data Flow | Azure Data Factory
description: Build resilient Data Flows in Azure Data Factory with Schema Drift
author: kromerm
ms.author: makromer
ms.reviewer: daperlov
ms.service: data-factory
ms.topic: conceptual
ms.date: 09/12/2019
---

# Schema drift in Mapping Data Flow

[!INCLUDE [notes](../../includes/data-factory-data-flow-preview.md)]

Schema drift is the case where your sources often change metadata. Fields, columns, and types can be added, removed, or changed on the fly. Without handling for schema drift, your data flow becomes vulnerable to upstream data source changes. Typical ETL patterns fail when incoming columns and fields change because they tend to be tied to those source names.

To protect against schema drift, it's important to have the facilities in a data flow tool that allow you, as a data engineer, to:

* Define sources that have mutable field names, data types, values, and sizes
* Define transformation parameters that can work with data patterns instead of hard-coded fields and values
* Define expressions that understand patterns to match incoming fields, instead of using named fields

Azure Data Factory natively supports flexible schemas that change from execution to execution so that you can build generic data transformation logic without the need to recompile your data flows.

You need to make an architectural decision in your data flow to accept schema drift throughout your flow. When you do, you can protect against schema changes from the sources. However, you'll lose early binding of your columns and types throughout your data flow. Azure Data Factory treats schema drift flows as late-binding flows, so when you build your transformations, the drifted column names won't be available to you in the schema views throughout the flow.

## Schema drift in source

In a source transformation, schema drift is defined as reading columns that aren't defined in your dataset schema. To enable schema drift, check **Allow schema drift** in your source transformation.

<img src="media/data-flow/schemadrift001.png" width="400">

When schema drift is enabled, all incoming fields are read from your source during execution and passed through the entire flow to the sink. By default, all newly detected columns, known as *drifted columns*, arrive as a string data type. If you wish for your data flow to automatically infer data types of drifted columns, check **Infer drifted column types** in your source settings.
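
These two options also surface as properties on the source when you look at the underlying data flow script. The following is a minimal sketch rather than script copied from a real flow: the stream name `MoviesSource`, the two declared columns, and the exact property spellings are illustrative.

```
source(output(
        movieId as string,
        title as string
    ),
    allowSchemaDrift: true,
    validateSchema: false,
    inferDriftedColumnTypes: true) ~> MoviesSource
```

Any columns that arrive beyond the two declared ones are treated as drifted columns; with type inference turned off they pass through as strings.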

## Schema drift in sink

In a sink transformation, schema drift is when you write additional columns on top of what is defined in the sink data schema. To enable schema drift, check **Allow schema drift** in your sink transformation.

If schema drift is enabled, make sure the **Auto-mapping** slider in the Mapping tab is turned on. With this slider on, all incoming columns are written to your destination. Otherwise, you must use rule-based mapping to write drifted columns.

<img src="media/data-flow/automap.png" width="400">
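
A rule-based mapping is essentially a pair of expressions: a matching condition that's evaluated against every incoming column, and a name expression for the output column, where `$$` stands for the matched column's name. The sketch below is just one illustrative rule that writes every incoming column, drifted or not, under its original name:

```
Matching condition:   true()
Name as:              $$
```

To map only certain drifted columns, swap `true()` for a narrower condition such as `type == 'double'`.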

## Transforming drifted columns

When your data flow has drifted columns, you can access them in your transformations with the following methods:

* Use the `byPosition` and `byName` expressions to explicitly reference a column by position number or name.
* Add a column pattern in a Derived Column or Aggregate transformation to match on any combination of name, stream, position, or type, as shown in the sketch after this list.
* Add rule-based mapping in a Select or Sink transformation to match drifted columns to column aliases via a pattern.
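
For example, suppose the incoming data contains a variable set of price columns that are always doubles, but whose names and order can change between runs. A column pattern in an Aggregate transformation can total every one of them without naming any of them. In the sketch below, `$$` represents each column matched by the pattern; matching on `type == 'double'` and appending `_total` to each matched name are illustrative choices:

```
Matching condition:     type == 'double'
Output column name:     concat($$, '_total')
Aggregate expression:   round(sum($$))
```

Because the pattern matches on type rather than on specific names, the aggregation keeps working even when price columns drift in, drop out, or arrive in a different order.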

For more information on how to implement column patterns, see [Column patterns in Mapping Data Flow](concepts-data-flow-column-pattern.md).

### Map drifted columns quick action

To explicitly reference drifted columns, you can quickly generate mappings for these columns via a data preview quick action. Once [debug mode](concepts-data-flow-debug-mode.md) is on, go to the Data Preview tab and click **Refresh** to fetch a data preview. If Data Factory detects that drifted columns exist, you can click **Map Drifted** to generate a Derived Column transformation that allows you to reference all drifted columns in schema views downstream.

In the generated Derived Column transformation, each drifted column is mapped to its detected name and data type. For example, if the data preview detects a drifted column 'movieId' as an integer, clicking **Map Drifted** defines movieId in the Derived Column as `toInteger(byName('movieId'))` so that it's included in schema views in downstream transformations.
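
The quick action simply writes these expressions for you; you can author the same kind of mappings by hand in a Derived Column transformation. A small sketch in data flow script form, where the drifted column names `movieId` and `rating`, the chosen casts, and the stream name `MapDriftedColumns` are all hypothetical:

```
derive(movieId = toInteger(byName('movieId')),
       rating = toDouble(byName('rating')),
       firstColumn = toString(byPosition(1))) ~> MapDriftedColumns
```

`byName` resolves a column at run time by its name and `byPosition` by its ordinal, so both work for columns that aren't part of the defined schema; wrapping them in a cast such as `toInteger` gives each mapped column a concrete type that downstream schema views can show.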

## Next steps
In the [Data Flow Expression Language](data-flow-expression-functions.md), you'll find additional facilities for column patterns and schema drift, including `byName` and `byPosition`.