articles/data-factory/concepts-data-flow-performance.md
ms.topic: conceptual
ms.author: makromer
ms.service: data-factory
ms.custom: seo-lt-2019
ms.date: 02/24/2020
---
# Mapping data flows performance and tuning guide
## Increasing compute size in Azure Integration Runtime
An Integration Runtime with more cores increases the number of nodes in the Spark compute environments and provides more processing power to read, write, and transform your data.
* Try a **Compute Optimized** cluster if you want your processing rate to be higher than your input rate.
* Try a **Memory Optimized** cluster if you want to cache more data in memory. Memory Optimized clusters have a higher price point per core than Compute Optimized clusters, but they'll likely result in faster transformation speeds.
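The two rules of thumb above can be encoded as a tiny decision helper. This is purely illustrative, not an ADF API; the function name and return strings are just labels for the Azure IR compute types:

```python
def suggest_compute_type(processing_rate_bound: bool, caching_bound: bool) -> str:
    """Illustrative encoding of the cluster-type guidance above.

    Not an ADF API -- just the heuristics as code: Memory Optimized to
    cache more data in memory, Compute Optimized to push the processing
    rate above the input rate, General Purpose otherwise.
    """
    if caching_bound:
        return "MemoryOptimized"   # higher price per core, likely faster transforms
    if processing_rate_bound:
        return "ComputeOptimized"
    return "GeneralPurpose"

print(suggest_compute_type(processing_rate_bound=True, caching_bound=False))
# ComputeOptimized
```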
Schedule a resize of your source and sink Azure SQL DB and DW before your pipeline runs to increase throughput and to minimize Azure throttling once you reach DTU limits. After your pipeline execution is complete, resize your databases back to their normal rate.
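The resize itself can be issued from a pre-run script or stored procedure using T-SQL's `ALTER DATABASE ... MODIFY (SERVICE_OBJECTIVE = ...)`. A minimal sketch that only builds those statements; the database name and the `P2`/`S3` tiers are placeholders, not recommendations:

```python
def scale_statement(database: str, service_objective: str) -> str:
    """Build the T-SQL that resizes an Azure SQL database.

    ALTER DATABASE ... MODIFY (SERVICE_OBJECTIVE = ...) is standard
    Azure SQL T-SQL; the names used below are illustrative.
    """
    return (f"ALTER DATABASE [{database}] "
            f"MODIFY (SERVICE_OBJECTIVE = '{service_objective}');")

# Scale up before the pipeline run, then back down once it completes.
print(scale_statement("SalesDb", "P2"))
print(scale_statement("SalesDb", "S3"))
```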
* A SQL DB source table with 887k rows and 74 columns, written to a SQL DB table through a single derived column transformation, takes about 3 minutes end-to-end using a memory optimized 80-core debug Azure IR.
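As a back-of-envelope check, that benchmark works out to roughly 5,000 rows per second. Note it's an end-to-end figure, so it includes read and write time, not only the transformation:

```python
# 887k rows in ~3 minutes, end-to-end (read + transform + write).
rows = 887_000
seconds = 3 * 60
print(round(rows / seconds), "rows/sec")  # ~4928 rows/sec
```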
### [Azure Synapse SQL DW only] Use staging to load data in bulk via Polybase
To avoid row-by-row inserts into your DW, check **Enable staging** in your Sink settings so that ADF can use [PolyBase](https://docs.microsoft.com/sql/relational-databases/polybase/polybase-guide). PolyBase allows ADF to load the data in bulk.
* When you execute your data flow activity from a pipeline, you'll need to select a Blob or ADLS Gen2 storage location to stage your data during bulk loading.
* A file source of a 421 MB file with 74 columns, written to a Synapse table through a single derived column transformation, takes about 4 minutes end-to-end using a memory optimized 80-core debug Azure IR.
## Optimizing for files
At each transformation, you can set the partitioning scheme you want data factory to use on the Optimize tab. It's a good practice to first test file-based sinks with the default partitioning and optimization settings.
* For smaller files, you may find that *Single Partition* sometimes works better and faster than asking Spark to partition your small files.
* If you don't have enough information about your source data, choose *Round Robin* partitioning and set the number of partitions.
* If your data has columns that can be good hash keys, choose *Hash partitioning*.
* A file source with a file sink of a 421 MB file with 74 columns and a single derived column transformation takes about 2 minutes end-to-end using a memory optimized 80-core debug Azure IR.
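To see how the two schemes distribute rows, here's a small simulation of Round Robin versus Hash partitioning. This is plain Python for intuition only; Spark uses its own hash function and distribution logic:

```python
from collections import Counter

def round_robin(rows, num_partitions):
    """Deal rows out in turn: partition sizes stay even regardless of content."""
    parts = {i: [] for i in range(num_partitions)}
    for j, row in enumerate(rows):
        parts[j % num_partitions].append(row)
    return parts

def hash_partition(rows, key, num_partitions):
    """Route each row by hashing a key column: equal keys share a partition."""
    parts = {i: [] for i in range(num_partitions)}
    for row in rows:
        parts[hash(row[key]) % num_partitions].append(row)
    return parts

rows = [{"id": i, "country": c}
        for i, c in enumerate(["US", "DE", "US", "JP", "DE", "US"])]

rr = round_robin(rows, 3)
print([len(p) for p in rr.values()])  # [2, 2, 2] -- always balanced

hp = hash_partition(rows, "country", 3)
# Each country lands in exactly one partition; sizes can skew if keys repeat.
for i, part in hp.items():
    print(i, Counter(r["country"] for r in part))
```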
When debugging in data preview and pipeline debug, the limit and sampling sizes for file-based source datasets only apply to the number of rows returned, not the number of rows read. This can affect the performance of your debug executions and possibly cause the flow to fail.
* Debug clusters are small, single-node clusters by default, and we recommend using small sample files for debugging. Go to Debug Settings and point to a small subset of your data using a temporary file.
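Creating that small temporary subset takes only a few lines of Python before you point Debug Settings at it. The file names and row count below are illustrative:

```python
import os
import tempfile
from itertools import islice
from pathlib import Path

def sample_file(source: Path, n_rows: int) -> Path:
    """Copy the header plus the first n_rows data rows into a temp file."""
    fd, tmp_name = tempfile.mkstemp(suffix=source.suffix)
    os.close(fd)
    tmp = Path(tmp_name)
    with source.open() as src, tmp.open("w") as dst:
        dst.writelines(islice(src, n_rows + 1))  # +1 keeps the header row
    return tmp

# Demo: build a tiny CSV, then take a 2-row sample of it.
fd, name = tempfile.mkstemp(suffix=".csv")
os.close(fd)
full = Path(name)
full.write_text("id,name\n1,a\n2,b\n3,c\n4,d\n")
small = sample_file(full, 2)
print(small.read_text())
```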