
Commit 016c174

Merge pull request #173984 from jonburchel/2021-09-29-breaks-up-concepts-data-flow-performance
Breaks up excessively long concepts-data-flow-performance.md article
2 parents 3824e94 + 274761b commit 016c174

7 files changed: +309 -181 lines changed

articles/data-factory/TOC.yml

Lines changed: 12 additions & 2 deletions
@@ -210,8 +210,18 @@ items:
   - name: Data flow monitoring
     href: concepts-data-flow-monitoring.md
   - name: Data flow performance
-    href: concepts-data-flow-performance.md
-    displayName: merge, timeout
+    items:
+      - name: Overview
+        href: concepts-data-flow-performance.md
+        displayName: merge, timeout
+      - name: Optimizing sources
+        href: concepts-data-flow-performance-sources.md
+      - name: Optimizing sinks
+        href: concepts-data-flow-performance-sinks.md
+      - name: Optimizing transformations
+        href: concepts-data-flow-performance-transformations.md
+      - name: Using data flows in pipelines
+        href: concepts-data-flow-performance-pipelines.md
   - name: Integration Runtime performance
     href: concepts-integration-runtime-performance.md
   - name: Manage data flow canvas

articles/data-factory/concepts-data-flow-performance-pipelines.md

Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
---
title: Optimizing pipeline performance in mapping data flow
titleSuffix: Azure Data Factory & Azure Synapse
description: Learn about optimizing data flow execution in pipelines in Azure Data Factory and Azure Synapse Analytics pipelines.
author: kromerm
ms.topic: conceptual
ms.author: makromer
ms.service: data-factory
ms.subservice: data-flows
ms.custom: synapse
ms.date: 09/29/2021
---

# Using data flows in pipelines

When building complex pipelines with multiple data flows, your logical flow can have a big impact on timing and cost. This section covers the impact of different architecture strategies.

## Executing data flows in parallel

If you execute multiple data flows in parallel, the service spins up separate Spark clusters for each activity. This allows each job to be isolated and run in parallel, but leads to multiple clusters running at the same time.

If your data flows execute in parallel, we recommend that you don't enable the Azure IR time to live (TTL) property, because it will lead to multiple unused warm pools.

> [!TIP]
> Instead of running the same data flow multiple times in a ForEach activity, stage your data in a data lake and use wildcard paths to process the data in a single data flow.

## Execute data flows sequentially

If you execute your data flow activities in sequence, we recommend that you set a TTL in the Azure IR configuration. The service will reuse the compute resources, resulting in a faster cluster start-up time. Each activity will still be isolated and receive a new Spark context for each execution. To reduce the time between sequential activities even more, select the **quick re-use** checkbox on the Azure IR to tell the service to reuse the existing cluster.

## Overloading a single data flow

If you put all of your logic inside a single data flow, the service executes the entire job on a single Spark instance. While this may seem like a way to reduce costs, it mixes together different logical flows and can be difficult to monitor and debug. If one component fails, all other parts of the job fail as well. Organizing data flows by independent flows of business logic is recommended. If your data flow becomes too large, splitting it into separate components will make monitoring and debugging easier. While there is no hard limit on the number of transformations in a data flow, having too many will make the job complex.

## Execute sinks in parallel

The default behavior of data flow sinks is to execute each sink sequentially, in a serial manner, and to fail the data flow when an error is encountered in the sink. Additionally, all sinks default to the same group unless you go into the data flow properties and set different priorities for the sinks.

Data flows allow you to group sinks together from the data flow properties tab in the UI designer. You can both set the order of execution of your sinks and group sinks together using the same group number. To help manage groups, you can ask the service to run sinks in the same group in parallel.

On the pipeline execute data flow activity, under the **Sink Properties** section, there is an option to turn on parallel sink loading. When you enable **Run in parallel**, you instruct data flows to write to connected sinks at the same time rather than sequentially. To use the parallel option, the sinks must be grouped together and connected to the same stream via a New Branch or Conditional Split.

## Next steps

- [Data flow performance overview](concepts-data-flow-performance.md)
- [Optimizing sources](concepts-data-flow-performance-sources.md)
- [Optimizing sinks](concepts-data-flow-performance-sinks.md)
- [Optimizing transformations](concepts-data-flow-performance-transformations.md)

See other Data Flow articles related to performance:

- [Data Flow activity](control-flow-execute-data-flow-activity.md)
- [Monitor Data Flow performance](concepts-data-flow-monitoring.md)
- [Integration Runtime performance](concepts-integration-runtime-performance.md)

articles/data-factory/concepts-data-flow-performance-sinks.md

Lines changed: 94 additions & 0 deletions
@@ -0,0 +1,94 @@
---
title: Optimizing sink performance in mapping data flow
titleSuffix: Azure Data Factory & Azure Synapse
description: Learn about optimizing sink performance in mapping data flows in Azure Data Factory and Azure Synapse Analytics pipelines.
author: kromerm
ms.topic: conceptual
ms.author: makromer
ms.service: data-factory
ms.subservice: data-flows
ms.custom: synapse
ms.date: 09/29/2021
---

# Optimizing sinks

When data flows write to sinks, any custom partitioning happens immediately before the write. Like the source, in most cases it is recommended that you keep **Use current partitioning** as the selected partition option. Partitioned data will write significantly quicker than unpartitioned data, even if your destination is not partitioned. Below are the individual considerations for various sink types.

## Azure SQL Database sinks

With Azure SQL Database, the default partitioning should work in most cases. There is a chance that your sink may have too many partitions for your SQL database to handle. If you are running into this, reduce the number of partitions output by your SQL Database sink.

### Impact of error row handling on performance

When you enable error row handling ("continue on error") in the sink transformation, the service takes an additional step before writing the compatible rows to your destination table. This additional step incurs a small performance penalty, typically in the range of 5%, with a further small performance hit if you also set the option to write the incompatible rows to a log file.

### Disabling indexes using a SQL script

Disabling indexes before a load in a SQL database can greatly improve the performance of writing to the table. Run the following command before writing to your SQL sink:

`ALTER INDEX ALL ON dbo.[Table Name] DISABLE`

After the write has completed, rebuild the indexes using the following command:

`ALTER INDEX ALL ON dbo.[Table Name] REBUILD`

These can both be done natively using pre- and post-SQL scripts within an Azure SQL Database or Synapse sink in mapping data flows.
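
For illustration, here's a minimal sketch of what those scripts might contain, assuming a hypothetical target table named `dbo.SalesStaging`:

```sql
-- Pre-SQL script (runs before the data flow writes to the sink):
ALTER INDEX ALL ON dbo.SalesStaging DISABLE;

-- Post-SQL script (runs after the write completes):
ALTER INDEX ALL ON dbo.SalesStaging REBUILD;
```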

:::image type="content" source="media/data-flow/disable-indexes-sql.png" alt-text="Disable indexes":::

> [!WARNING]
> When disabling indexes, the data flow is effectively taking control of a database, and queries are unlikely to succeed at this time. As a result, many ETL jobs are triggered in the middle of the night to avoid this conflict. For more information, learn about the [constraints of disabling SQL indexes](/sql/relational-databases/indexes/disable-indexes-and-constraints).

### Scaling up your database

Schedule a resizing of your source and sink Azure SQL Database and data warehouse before your pipeline run to increase throughput and minimize Azure throttling once you reach DTU limits. After your pipeline execution is complete, resize your databases back to their normal run rate.
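
For example, a minimal T-SQL sketch of the resize you might schedule around the run, assuming a hypothetical database named `SalesDb` and illustrative service objectives:

```sql
-- Before the pipeline run: scale up for higher throughput
-- (the scale operation completes asynchronously).
ALTER DATABASE SalesDb MODIFY (SERVICE_OBJECTIVE = 'S6');

-- After the pipeline run completes: scale back down to the normal tier.
ALTER DATABASE SalesDb MODIFY (SERVICE_OBJECTIVE = 'S2');
```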

## Azure Synapse Analytics sinks

When writing to Azure Synapse Analytics, make sure that **Enable staging** is set to true. This enables the service to write using the [SQL COPY command](/sql/t-sql/statements/copy-into-transact-sql), which effectively loads the data in bulk. You will need to reference an Azure Data Lake Storage Gen2 or Azure Blob Storage account to stage the data when using staging.

Other than staging, the same best practices apply to Azure Synapse Analytics as to Azure SQL Database.

## File-based sinks

While data flows support a variety of file types, the Spark-native Parquet format is recommended for optimal read and write times.

If the data is evenly distributed, **Use current partitioning** will be the fastest partitioning option for writing files.

### File name options

When writing files, you have a choice of naming options, each of which has a performance impact.

:::image type="content" source="media/data-flow/file-sink-settings.png" alt-text="Sink options":::

Selecting the **Default** option will write the fastest. Each partition will equate to a file with the Spark default name. This is useful if you are just reading from the folder of data.

Setting a naming **Pattern** will rename each partition file to a more user-friendly name. This operation happens after the write and is slightly slower than choosing the default. **Per partition** allows you to name each individual partition manually.

If a column corresponds to how you wish to output the data, you can select **As data in column**. This reshuffles the data and can impact performance if the columns are not evenly distributed.

**Output to single file** combines all the data into a single partition. This leads to long write times, especially for large datasets. This option is strongly discouraged unless there is an explicit business reason to use it.

## Azure Cosmos DB sinks

When writing to Azure Cosmos DB, altering throughput and batch size during data flow execution can improve performance. These changes only take effect during the data flow activity run and will return to the original collection settings after conclusion.

**Batch size:** Usually, starting with the default batch size is sufficient. To further tune this value, calculate the rough object size of your data, and make sure that object size * batch size is less than 2 MB. If it is, you can increase the batch size to get better throughput. For example, if your documents average roughly 4 KB each, a batch size of about 450 keeps object size * batch size comfortably under 2 MB.

**Throughput:** Set a higher throughput setting here to allow documents to write faster to Azure Cosmos DB. Keep in mind the higher RU costs based upon a high throughput setting.

**Write throughput budget:** Use a value that is smaller than total RUs per minute. If you have a data flow with a high number of Spark partitions, setting a budget throughput will allow more balance across those partitions.

## Next steps

- [Data flow performance overview](concepts-data-flow-performance.md)
- [Optimizing sources](concepts-data-flow-performance-sources.md)
- [Optimizing transformations](concepts-data-flow-performance-transformations.md)
- [Using data flows in pipelines](concepts-data-flow-performance-pipelines.md)

See other Data Flow articles related to performance:

- [Data Flow activity](control-flow-execute-data-flow-activity.md)
- [Monitor Data Flow performance](concepts-data-flow-monitoring.md)
- [Integration Runtime performance](concepts-integration-runtime-performance.md)

articles/data-factory/concepts-data-flow-performance-sources.md

Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
---
title: Optimizing source performance in mapping data flow
titleSuffix: Azure Data Factory & Azure Synapse
description: Learn about optimizing source performance in mapping data flows in Azure Data Factory and Azure Synapse Analytics pipelines.
author: kromerm
ms.topic: conceptual
ms.author: makromer
ms.service: data-factory
ms.subservice: data-flows
ms.custom: synapse
ms.date: 09/29/2021
---

# Optimizing sources

For every source except Azure SQL Database, it is recommended that you keep **Use current partitioning** as the selected value. When reading from all other source systems, data flows automatically partition data evenly based upon the size of the data. A new partition is created for about every 128 MB of data, so, for example, a 1-GB source yields roughly eight partitions. As your data size increases, the number of partitions increases.

Any custom partitioning happens *after* Spark reads in the data and will negatively impact your data flow performance. As the data is evenly partitioned on read, custom partitioning is not recommended.

> [!NOTE]
> Read speeds can be limited by the throughput of your source system.

## Azure SQL Database sources

Azure SQL Database has a unique partitioning option called 'Source' partitioning. Enabling source partitioning can improve your read times from Azure SQL Database by enabling parallel connections on the source system. Specify the number of partitions and how to partition your data. Use a partition column with high cardinality. You can also enter a query that matches the partitioning scheme of your source table.

> [!TIP]
> For source partitioning, the I/O of the SQL Server is the bottleneck. Adding too many partitions may saturate your source database. Generally, four or five partitions are ideal when using this option.

:::image type="content" source="media/data-flow/sourcepart3.png" alt-text="Source partitioning":::

### Isolation level

The isolation level of the read on an Azure SQL source system has an impact on performance. Choosing 'Read uncommitted' will provide the fastest performance and prevent any database locks. To learn more about SQL isolation levels, see [Understanding isolation levels](/sql/connect/jdbc/understanding-isolation-levels).
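
For reference only, 'Read uncommitted' corresponds to the following T-SQL session setting; in mapping data flows you select it through the source's isolation level option rather than running a script:

```sql
-- Equivalent session-level setting in T-SQL; data flows apply this for you
-- when 'Read uncommitted' is selected on the source.
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
```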

### Read using query

You can read from Azure SQL Database using a table or a SQL query. If you are executing a SQL query, the query must complete before the transformation can start. SQL queries can be useful for pushing down operations such as SELECT, WHERE, and JOIN clauses that may execute faster in the database and reduce the amount of data read from SQL Server. When pushing down operations, you lose the ability to track the lineage and performance of the transformations before the data comes into the data flow.
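
As a sketch, assuming hypothetical `SalesOrderHeader` and `Customer` tables, a source query like this pushes the filter and join down to the database so only the needed rows and columns reach the data flow:

```sql
-- Illustrative pushdown query: the filter and join run in the database,
-- reducing the data the data flow has to read.
SELECT c.CustomerName,
       o.OrderId,
       o.OrderDate,
       o.TotalDue
FROM dbo.SalesOrderHeader AS o
JOIN dbo.Customer AS c
    ON c.CustomerId = o.CustomerId
WHERE o.OrderDate >= '2021-01-01';
```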

## Azure Synapse Analytics sources

When using Azure Synapse Analytics, a setting called **Enable staging** exists in the source options. This allows the service to read from Synapse using ```Staging```, which greatly improves read performance by using the [Synapse COPY statement](/sql/t-sql/statements/copy-into-transact-sql) for the most performant bulk loading capability. Enabling ```Staging``` requires you to specify an Azure Blob Storage or Azure Data Lake Storage Gen2 staging location in the data flow activity settings.

:::image type="content" source="media/data-flow/enable-staging.png" alt-text="Enable staging":::

## File-based sources

While data flows support a variety of file types, the Spark-native Parquet format is recommended for optimal read and write times.

If you're running the same data flow on a set of files, we recommend reading from a folder, using wildcard paths, or reading from a list of files. A single data flow activity run can process all of your files in batch. More information on how to configure these settings can be found in the **Source transformation** section of the [Azure Blob Storage connector](connector-azure-blob-storage.md#source-transformation) documentation.

If possible, avoid using the ForEach activity to run data flows over a set of files. This causes each iteration of the ForEach to spin up its own Spark cluster, which is often not necessary and can be expensive.

## Next steps

- [Data flow performance overview](concepts-data-flow-performance.md)
- [Optimizing sinks](concepts-data-flow-performance-sinks.md)
- [Optimizing transformations](concepts-data-flow-performance-transformations.md)
- [Using data flows in pipelines](concepts-data-flow-performance-pipelines.md)

See other Data Flow articles related to performance:

- [Data Flow activity](control-flow-execute-data-flow-activity.md)
- [Monitor Data Flow performance](concepts-data-flow-monitoring.md)
- [Integration Runtime performance](concepts-integration-runtime-performance.md)
