
Commit 1a7d2f4

Merge pull request #105297 from kromerm/dataflow-1
Dataflow 1
2 parents cc5c120 + 9874fef commit 1a7d2f4

File tree: 2 files changed (+14, -5 lines)


articles/data-factory/concepts-data-flow-performance.md

Lines changed: 12 additions & 5 deletions
@@ -6,7 +6,7 @@ ms.topic: conceptual
 ms.author: makromer
 ms.service: data-factory
 ms.custom: seo-lt-2019
-ms.date: 01/25/2020
+ms.date: 02/24/2020
 ---
 
 # Mapping data flows performance and tuning guide
@@ -30,8 +30,8 @@ While designing mapping data flows, you can unit test each transformation by cli
 ## Increasing compute size in Azure Integration Runtime
 
 An Integration Runtime with more cores increases the number of nodes in the Spark compute environments and provides more processing power to read, write, and transform your data.
-* Try a **Compute Optimized** cluster if you want your processing rate to be higher than your input rate
-* Try a **Memory Optimized** cluster if you want to cache more data in memory.
+* Try a **Compute Optimized** cluster if you want your processing rate to be higher than your input rate.
+* Try a **Memory Optimized** cluster if you want to cache more data in memory. Memory optimized has a higher price-point per core than Compute Optimized, but will likely result in faster transformation speeds.
 
 ![New IR](media/data-flow/ir-new.png "New IR")
 
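As an illustration of the cluster-type choice discussed in this hunk (not part of the commit itself), an Azure Integration Runtime with a Memory Optimized data flow cluster is defined through `computeType` and `coreCount` in the IR's JSON; the resource name and `timeToLive` value below are assumptions:

```json
{
  "name": "DataFlowMemOptIR",
  "properties": {
    "type": "Managed",
    "typeProperties": {
      "computeProperties": {
        "location": "AutoResolve",
        "dataFlowProperties": {
          "computeType": "MemoryOptimized",
          "coreCount": 80,
          "timeToLive": 10
        }
      }
    }
  }
}
```

Switching to `"computeType": "ComputeOptimized"` with the same core count gives the cheaper, throughput-oriented alternative described in the first bullet.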
@@ -82,18 +82,25 @@ In your pipeline, add a [Stored Procedure activity](transform-data-using-stored-
 
 Schedule a resizing of your source and sink Azure SQL DB and DW before your pipeline run to increase the throughput and minimize Azure throttling once you reach DTU limits. After your pipeline execution is complete, resize your databases back to their normal run rate.
 
-### [Azure SQL DW only] Use staging to load data in bulk via Polybase
+* SQL DB source table with 887k rows and 74 columns to a SQL DB table with a single derived column transformation takes about 3 mins end-to-end using memory optimized 80-core debug Azure IRs.
+
+### [Azure Synapse SQL DW only] Use staging to load data in bulk via Polybase
 
 To avoid row-by-row inserts into your DW, check **Enable staging** in your Sink settings so that ADF can use [PolyBase](https://docs.microsoft.com/sql/relational-databases/polybase/polybase-guide). PolyBase allows ADF to load the data in bulk.
 * When you execute your data flow activity from a pipeline, you'll need to select a Blob or ADLS Gen2 storage location to stage your data during bulk loading.
 
+* File source of 421Mb file with 74 columns to a Synapse table and a single derived column transformation takes about 4 mins end-to-end using memory optimized 80-core debug Azure IRs.
+
 ## Optimizing for files
 
-At each transformation, you can set the partitioning scheme you wish data factory to use in the Optimize tab.
+At each transformation, you can set the partitioning scheme you wish data factory to use in the Optimize tab. It is a good practice to first test file-based sinks keeping the default partitioning and optimizations.
+
 * For smaller files, you may find selecting *Single Partition* can sometimes work better and faster than asking Spark to partition your small files.
 * If you don't have enough information about your source data, choose *Round Robin* partitioning and set the number of partitions.
 * If your data has columns that can be good hash keys, choose *Hash partitioning*.
 
+* File source with file sink of a 421Mb file with 74 columns and a single derived column transformation takes about 2 mins end-to-end using memory optimized 80-core debug Azure IRs.
+
 When debugging in data preview and pipeline debug, the limit and sampling sizes for file-based source datasets only apply to the number of rows returned, not the number of rows read. This can affect the performance of your debug executions and possibly cause the flow to fail.
 * Debug clusters are small single-node clusters by default and we recommend using sample small files for debugging. Go to Debug Settings and point to a small subset of your data using a temporary file.
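The partitioning choices made in the Optimize tab are persisted in the data flow's underlying script. A rough sketch of what a Round Robin setting with 20 partitions might look like on a sink; the transformation names (`source1`, `sink1`) and surrounding properties are illustrative, not taken from this commit:

```
source1 sink(allowSchemaDrift: true,
	validateSchema: false,
	partitionBy('roundRobin', 20)) ~> sink1
```

For hash partitioning, the script form takes the key column as well, along the lines of `partitionBy('hash', 20, keyColumn)`.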

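Tying this hunk's recommendations together, a hypothetical pipeline definition could scale the database up with a Stored Procedure activity, run the data flow with PolyBase staging enabled, then scale back down. All names here (`AzureSqlDb`, `dbo.ScaleUp`, `TransformToDW`, `AdlsGen2Staging`, the folder path) are assumptions for illustration, not values from this commit:

```json
{
  "name": "ScaleRunScaleBack",
  "properties": {
    "activities": [
      {
        "name": "ScaleUpSqlDb",
        "type": "SqlServerStoredProcedure",
        "linkedServiceName": { "referenceName": "AzureSqlDb", "type": "LinkedServiceReference" },
        "typeProperties": { "storedProcedureName": "dbo.ScaleUp" }
      },
      {
        "name": "RunMappingDataFlow",
        "type": "ExecuteDataFlow",
        "dependsOn": [ { "activity": "ScaleUpSqlDb", "dependencyConditions": [ "Succeeded" ] } ],
        "typeProperties": {
          "dataFlow": { "referenceName": "TransformToDW", "type": "DataFlowReference" },
          "staging": {
            "linkedService": { "referenceName": "AdlsGen2Staging", "type": "LinkedServiceReference" },
            "folderPath": "staging/polybase"
          }
        }
      },
      {
        "name": "ScaleDownSqlDb",
        "type": "SqlServerStoredProcedure",
        "dependsOn": [ { "activity": "RunMappingDataFlow", "dependencyConditions": [ "Succeeded" ] } ],
        "linkedServiceName": { "referenceName": "AzureSqlDb", "type": "LinkedServiceReference" },
        "typeProperties": { "storedProcedureName": "dbo.ScaleDown" }
      }
    ]
  }
}
```

The hypothetical `dbo.ScaleUp` procedure could, for example, issue a T-SQL `ALTER DATABASE ... MODIFY (SERVICE_OBJECTIVE = ...)` against the target Azure SQL DB, matching the "resize before the run, resize back after" pattern the article describes.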
articles/data-factory/data-flow-tutorials.md

Lines changed: 2 additions & 0 deletions
@@ -28,6 +28,8 @@ As updates are constantly made to the product, some features have added or diffe
 
 [Monitor and manage mapping data flow performance](https://www.youtube.com/watch?v=fktIWdJiqTk)
 
+[Benchmark timings](http://youtu.be/6CSbWm4lRhw?hd=1)
+
 ## Transformation overviews
 
 [Aggregate transformation](http://youtu.be/jdL75xIr98I)

0 commit comments
