
Commit 3a593f3

Merge pull request #92939 from kromerm/dataflow-1
Dataflow 1
2 parents afe6714 + fe4048e commit 3a593f3

5 files changed: 34 additions & 61 deletions


articles/data-factory/data-flow-sink.md

Lines changed: 8 additions & 2 deletions
@@ -48,8 +48,8 @@ Select **Validate schema** to fail the sink if the schema changes.

  Select **Clear the folder** to truncate the contents of the sink folder before writing the destination files in that target folder.

- ## Rule-based mapping
- When turn-off auto-mapping, you will have the option to add either column-based mapping (fixed mapping) or rule-based mapping. Rule-based mapping will allow you to write expressions with pattern matching.
+ ## Fixed mapping vs. rule-based mapping
+ When you turn off auto-mapping, you can add either column-based mapping (fixed mapping) or rule-based mapping. Rule-based mapping lets you write expressions with pattern matching, while fixed mapping maps logical and physical column names.

  ![Rule-based Mapping](media/data-flow/rules4.png "Rule-based mapping")

@@ -61,6 +61,12 @@ You can also enter regular expression patterns when using rule based matching by

  ![Regex Mapping](media/data-flow/scdt1g4.png "Regex mapping")

+ A basic example of rule-based versus fixed mapping is the case where you want to map all incoming fields to the same name in your target. With fixed mapping, you would list each individual column in the table. With rule-based mapping, you would write a single rule that matches every field using ```true()``` and maps it to the incoming field name, represented by ```$$```.
+
+ ### Sink association with dataset
+
+ The dataset that you select for your sink may or may not have a schema defined in the dataset definition. If it does not have a defined schema, you must allow schema drift. When you define a fixed mapping, the logical-to-physical name mapping persists in the sink transformation. If you then change the schema definition of the dataset, you can break your sink mapping. To avoid this, use rule-based mapping. Rule-based mappings are generalized, so schema changes in your dataset will not break the mapping.
+
  ## File name options

  Set up file naming:
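
To make the `true()` and `$$` rule above concrete, here is a minimal Python sketch of the idea. It is conceptual only, not ADF Data Flow Script; the column names and the helper function are hypothetical.

```python
# Conceptual sketch of fixed vs. rule-based mapping; not an ADF API.
incoming_columns = ["movieId", "title", "genre", "rating"]

# Fixed mapping: every logical-to-physical column pair is listed explicitly.
fixed_mapping = {
    "movieId": "movieId",
    "title": "title",
    "genre": "genre",
    "rating": "rating",
}

# Rule-based mapping: one rule made of a match predicate and a name expression.
# The predicate plays the role of true(); the name expression plays the role of $$.
rules = [
    (lambda col: True,   # match every incoming column, like true()
     lambda col: col),   # keep the incoming name, like $$
]

def apply_rules(columns, rules):
    """Return an {incoming_name: target_name} map using the first matching rule."""
    mapping = {}
    for col in columns:
        for matches, target_name in rules:
            if matches(col):
                mapping[col] = target_name(col)
                break
    return mapping

# One generalized rule reproduces the explicit fixed mapping,
# and it keeps working if new columns drift into the schema.
assert apply_rules(incoming_columns, rules) == fixed_mapping
```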

articles/data-factory/frequently-asked-questions.md

Lines changed: 6 additions & 20 deletions
@@ -5,8 +5,6 @@ services: data-factory
  documentationcenter: ''
  author: djpmsft
  ms.author: daperlov
- manager: jroth
- ms.reviewer: maghan
  ms.service: data-factory
  ms.workload: data-services
  ms.topic: conceptual

@@ -175,32 +173,20 @@ You can use the `@coalesce` construct in the expressions to handle null values g

  ## Mapping data flows

- ### Which Data Factory version do I use to create mapping data flows?
- Use the V2 version of Data Factory to create mapping data flows.
-
- ### I was a previous private preview customer who used data flows, and I used the Data Factory V2 preview version for data flows.
- This version is now obsolete. Use Data Factory V2 for data flows.
-
- ### What has changed from private preview to limited public preview in regard to data flows?
- You'll no longer have to bring your own Azure Databricks clusters. Data Factory manages cluster creation and tear-down when running mapping data flows. Blob datasets and Azure Data Lake Storage Gen2 datasets are separated into delimited text and Apache Parquet datasets. You can still use Data Lake Storage Gen2 and Blob storage to store those files. Use the appropriate linked service for those storage engines.
-
- ### Can I migrate my private preview factories to Data Factory V2?
-
- Yes. [Follow these instructions](https://www.slideshare.net/kromerm/adf-mapping-data-flow-private-preview-migration).
-
  ### I need help troubleshooting my data flow logic. What info do I need to provide to get help?

- When Microsoft provides help or troubleshooting with data flows, please provide the data flow script. To do this, follow these steps:
+ When Microsoft provides help or troubleshooting with data flows, please provide the data flow script, which is the code-behind script from your data flow graph. In the ADF UI, open your data flow and select the **Script** button in the top-right corner. Copy and paste this script or save it in a text file.

- 1. From the data flow canvas, select **Script** in the top-right corner. This will display the editable data flow script.
- 3. Copy and paste this script or save it in a text file.
-
- ### How do I access data by using the other 80 dataset types in Data Factory?
+ ### How do I access data by using the other 90 dataset types in Data Factory?

  The mapping data flow feature currently allows Azure SQL Database, Azure SQL Data Warehouse, delimited text files from Azure Blob storage or Azure Data Lake Storage Gen2, and Parquet files from Blob storage or Data Lake Storage Gen2 natively for source and sink.

  Use the Copy activity to stage data from any of the other connectors, and then execute a Data Flow activity to transform data after it's been staged. For example, your pipeline will first copy into Blob storage, and then a Data Flow activity will use a dataset in source to transform that data.

+ ### Is the self-hosted integration runtime available for data flows?
+
+ The self-hosted IR is an ADF pipeline construct that you can use with the Copy activity to acquire or move data to and from on-premises or VM-based data sources and sinks. Stage the data first with a Copy activity, then use a Data Flow activity to transform it, and add a subsequent Copy activity if you need to move the transformed data back to the on-premises store.
+
  ## Next steps
  For step-by-step instructions to create a data factory, see the following tutorials:
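
The two answers above describe the same staging pattern: copy the data into Blob storage or Data Lake Storage Gen2 first, transform it with a data flow, then optionally copy it onward. A minimal Python sketch of that ordering follows; the function names and paths are hypothetical placeholders, not Azure Data Factory SDK calls.

```python
# Conceptual sketch of the stage-then-transform pattern described above.
# These functions and paths are placeholders, not Azure Data Factory APIs.

def copy_activity(source: str, sink: str) -> str:
    """Stage data by copying it from any supported connector to cloud storage."""
    print(f"Copying {source} -> {sink}")
    return sink

def data_flow_activity(staged_path: str, output_path: str) -> str:
    """Transform the staged data with a mapping data flow."""
    print(f"Transforming {staged_path} -> {output_path}")
    return output_path

# 1. Copy from a connector not natively supported by data flows (or from an
#    on-premises source via the self-hosted IR) into Blob storage / ADLS Gen2.
staged = copy_activity("onprem-sql://sales", "abfss://staging/sales")

# 2. Run the Data Flow activity against the staged dataset.
transformed = data_flow_activity(staged, "abfss://curated/sales")

# 3. Optionally copy the transformed data back to the on-premises store.
copy_activity(transformed, "onprem-sql://sales_curated")
```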

articles/data-factory/introduction.md

Lines changed: 17 additions & 36 deletions
@@ -5,8 +5,6 @@ services: data-factory
  documentationcenter: ''
  author: djpmsft
  ms.author: daperlov
- manager: jroth
- ms.reviewer: maghan
  ms.service: data-factory
  ms.workload: data-services
  ms.topic: overview

@@ -15,10 +13,6 @@ ms.date: 09/30/2019

  # What is Azure Data Factory?

- > [!div class="op_single_selector" title1="Select the version of Data Factory service you are using:"]
- > * [Version 1](v1/data-factory-introduction.md)
- > * [Current version](introduction.md)
-
  In the world of big data, raw, unorganized data is often stored in relational, non-relational, and other storage systems. However, on its own, raw data doesn't have the proper context or meaning to provide meaningful insights to analysts, data scientists, or business decision makers.

  Big data requires service that can orchestrate and operationalize processes to refine these enormous stores of raw data into actionable business insights. Azure Data Factory is a managed cloud service that's built for these complex hybrid extract-transform-load (ETL), extract-load-transform (ELT), and data integration projects.

@@ -29,16 +23,15 @@ To analyze these logs, the company needs to use reference data such as customer

  To extract insights, it hopes to process the joined data by using a Spark cluster in the cloud (Azure HDInsight), and publish the transformed data into a cloud data warehouse such as Azure SQL Data Warehouse to easily build a report on top of it. They want to automate this workflow, and monitor and manage it on a daily schedule. They also want to execute it when files land in a blob store container.

- Azure Data Factory is the platform that solves such data scenarios. It is a *cloud-based data integration service that allows you to create data-driven workflows in the cloud for orchestrating and automating data movement and data transformation*. Using Azure Data Factory, you can create and schedule data-driven workflows (called pipelines) that can ingest data from disparate data stores. It can process and transform the data by using compute services such as Azure HDInsight Hadoop, Spark, Azure Data Lake Analytics, and Azure Machine Learning.
+ Azure Data Factory is the platform that solves such data scenarios. It is the *cloud-based ETL and data integration service that allows you to create data-driven workflows for orchestrating data movement and transforming data at scale*. Using Azure Data Factory, you can create and schedule data-driven workflows (called pipelines) that can ingest data from disparate data stores. You can build complex ETL processes that transform data visually with data flows or by using compute services such as Azure HDInsight Hadoop, Azure Databricks, and Azure SQL Database.

- Additionally, you can publish output data to data stores such as Azure SQL Data Warehouse for business intelligence (BI) applications to consume. Ultimately, through Azure Data Factory, raw data can be organized into meaningful data stores and data lakes for better business decisions.
+ Additionally, you can publish your transformed data to data stores such as Azure SQL Data Warehouse for business intelligence (BI) applications to consume. Ultimately, through Azure Data Factory, raw data can be organized into meaningful data stores and data lakes for better business decisions.

- ![Top-level view of Data Factory](media/introduction/big-picture.png)
+ ![Top-level view of Data Factory](media/data-flow/overview.png)

  ## How does it work?
- The pipelines (data-driven workflows) in Azure Data Factory typically perform the following four steps:

- ![Four steps of a data-driven workflow](media/introduction/four-steps-of-a-workflow.png)
+ Data Factory contains a series of interconnected systems that provide a complete end-to-end platform for data engineers.

  ### Connect and collect

@@ -51,10 +44,12 @@ Without Data Factory, enterprises must build custom data movement components or

  With Data Factory, you can use the [Copy Activity](copy-activity-overview.md) in a data pipeline to move data from both on-premises and cloud source data stores to a centralization data store in the cloud for further analysis. For example, you can collect data in Azure Data Lake Storage and transform the data later by using an Azure Data Lake Analytics compute service. You can also collect data in Azure Blob storage and transform it later by using an Azure HDInsight Hadoop cluster.

  ### Transform and enrich
- After data is present in a centralized data store in the cloud, process or transform the collected data by using compute services such as HDInsight Hadoop, Spark, Data Lake Analytics, and Machine Learning. You want to reliably produce transformed data on a maintainable and controlled schedule to feed production environments with trusted data.
+ After data is present in a centralized data store in the cloud, process or transform the collected data by using ADF mapping data flows. Data flows enable data engineers to build and maintain data transformation graphs that execute on Spark without needing to understand Spark clusters or Spark programming.
+
+ If you prefer to code transformations by hand, ADF supports external activities for executing your transformations on compute services such as HDInsight Hadoop, Spark, Data Lake Analytics, and Machine Learning.

- ### Publish
- After the raw data has been refined into a business-ready consumable form, load the data into Azure Data Warehouse, Azure SQL Database, Azure CosmosDB, or whichever analytics engine your business users can point to from their business intelligence tools.
+ ### CI/CD and publish
+ Data Factory offers full support for CI/CD of your data pipelines using Azure DevOps and GitHub. This allows you to incrementally develop and deliver your ETL processes before publishing the finished product. After the raw data has been refined into a business-ready consumable form, load the data into Azure Data Warehouse, Azure SQL Database, Azure CosmosDB, or whichever analytics engine your business users can point to from their business intelligence tools.

  ### Monitor
  After you have successfully built and deployed your data integration pipeline, providing business value from refined data, monitor the scheduled activities and pipelines for success and failure rates. Azure Data Factory has built-in support for pipeline monitoring via Azure Monitor, API, PowerShell, Azure Monitor logs, and health panels on the Azure portal.

@@ -67,6 +62,9 @@ A data factory might have one or more pipelines. A pipeline is a logical groupin

  The benefit of this is that the pipeline allows you to manage the activities as a set instead of managing each one individually. The activities in a pipeline can be chained together to operate sequentially, or they can operate independently in parallel.

+ ### Mapping data flows
+ Create and manage graphs of data transformation logic that you can use to transform any-sized data. You can build up a reusable library of data transformation routines and execute those processes in a scaled-out manner from your ADF pipelines. Data Factory will execute your logic on a Spark cluster that spins up and spins down when you need it. You won't ever have to manage or maintain clusters.
+
  ### Activity
  Activities represent a processing step in a pipeline. For example, you might use a copy activity to copy data from one data store to another data store. Similarly, you might use a Hive activity, which runs a Hive query on an Azure HDInsight cluster, to transform or analyze your data. Data Factory supports three types of activities: data movement activities, data transformation activities, and control activities.

@@ -98,33 +96,16 @@ A linked service is also a strongly typed parameter that contains the connection
  ### Control flow
  Control flow is an orchestration of pipeline activities that includes chaining activities in a sequence, branching, defining parameters at the pipeline level, and passing arguments while invoking the pipeline on-demand or from a trigger. It also includes custom-state passing and looping containers, that is, For-each iterators.

+ ### Variables
+ Variables can be used inside pipelines to store temporary values, and they can be used together with parameters to pass values between pipelines, data flows, and other activities.

- For more information about Data Factory concepts, see the following articles:
+ ## Next steps
+ Here are important next step documents to explore:

  - [Dataset and linked services](concepts-datasets-linked-services.md)
  - [Pipelines and activities](concepts-pipelines-activities.md)
  - [Integration runtime](concepts-integration-runtime.md)
-
- ## Supported regions
-
- For a list of Azure regions in which Data Factory is currently available, select the regions that interest you on the following page, and then expand **Analytics** to locate **Data Factory**: [Products available by region](https://azure.microsoft.com/global-infrastructure/services/). However, a data factory can access data stores and compute services in other Azure regions to move data between data stores or process data using compute services.
-
- Azure Data Factory itself does not store any data. It lets you create data-driven workflows to orchestrate the movement of data between supported data stores and the processing of data using compute services in other regions or in an on-premises environment. It also allows you to monitor and manage workflows by using both programmatic and UI mechanisms.
-
- Although Data Factory is available only in certain regions, the service that powers the data movement in Data Factory is available globally in several regions. If a data store is behind a firewall, then a Self-hosted Integration Runtime that's installed in your on-premises environment moves the data instead.
-
- For an example, let's assume that your compute environments such as Azure HDInsight cluster and Azure Machine Learning are running out of the West Europe region. You can create and use an Azure Data Factory instance in East US or East US 2 and use it to schedule jobs on your compute environments in West Europe. It takes a few milliseconds for Data Factory to trigger the job on your compute environment, but the time for running the job on your computing environment does not change.
-
- ## Accessibility
-
- The Data Factory user experience in the Azure portal is accessible.
-
- ## Compare with version 1
- For a list of differences between version 1 and the current version of the Data Factory service, see [Compare with version 1](compare-versions.md).
-
- ## Next steps
- Get started with creating a Data Factory pipeline by using one of the following tools/SDKs:
-
+ - [Mapping Data Flows](concepts-data-flow-overview.md)
  - [Data Factory UI in the Azure portal](quickstart-create-data-factory-portal.md)
  - [Copy Data tool in the Azure portal](quickstart-create-data-factory-copy-data-tool.md)
  - [PowerShell](quickstart-create-data-factory-powershell.md)
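
The **Variables** concept added in this file pairs variables with parameters to pass values between activities and pipelines. Below is a minimal Python sketch of that idea; it is conceptual only, not ADF pipeline JSON or expression syntax, and all names are hypothetical.

```python
# Conceptual sketch of pipeline variables and parameters; not ADF syntax.
# A variable holds a temporary value inside a pipeline run; a parameter
# passes a value into a downstream activity or child pipeline.

def data_flow_activity(run_id: str, watermark: str) -> str:
    """Placeholder data flow that receives values via parameters."""
    print(f"Running data flow for run {run_id} with watermark {watermark}")
    return "2019-10-01"

def pipeline(run_id: str) -> None:
    # Pipeline-scoped variable, set at runtime (like a Set Variable activity).
    watermark = "2019-09-30"

    # Pass the variable into the next activity as a parameter.
    new_watermark = data_flow_activity(run_id, watermark)

    # Update the variable with the activity's output for later activities to use.
    watermark = new_watermark
    print(f"Watermark is now {watermark}")

pipeline(run_id="demo-run-001")
```
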
(Binary image file changed, 54 KB; preview not shown.)
