articles/data-factory/concepts-data-flow-performance.md
ms.topic: conceptual
ms.author: makromer
ms.service: data-factory
ms.custom: seo-lt-2019
ms.date: 02/24/2020
---
# Mapping data flows performance and tuning guide
## Increasing compute size in Azure Integration Runtime
An Integration Runtime with more cores increases the number of nodes in the Spark compute environments and provides more processing power to read, write, and transform your data.
* Try a **Compute Optimized** cluster if you want your processing rate to be higher than your input rate.
* Try a **Memory Optimized** cluster if you want to cache more data in memory. Memory Optimized clusters have a higher price point per core than Compute Optimized clusters, but they'll likely result in faster transformation speeds.
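The two rules of thumb above can be encoded as a tiny decision helper. This is purely illustrative, not an ADF API; the function name and return strings are just labels for the Azure IR compute types:

```python
def suggest_compute_type(processing_rate_bound: bool, caching_bound: bool) -> str:
    """Illustrative encoding of the cluster-type guidance above.

    Not an ADF API -- just the heuristics as code: Memory Optimized to
    cache more data in memory, Compute Optimized to push the processing
    rate above the input rate, General Purpose otherwise.
    """
    if caching_bound:
        return "MemoryOptimized"   # higher price per core, likely faster transforms
    if processing_rate_bound:
        return "ComputeOptimized"
    return "GeneralPurpose"

print(suggest_compute_type(processing_rate_bound=True, caching_bound=False))
# ComputeOptimized
```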
Schedule a resize of your source and sink Azure SQL DB and DW before your pipeline runs to increase throughput and to minimize Azure throttling once you reach DTU limits. After your pipeline execution is complete, resize your databases back to their normal rate.
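The resize itself can be issued from a pre-run script or stored procedure using T-SQL's `ALTER DATABASE ... MODIFY (SERVICE_OBJECTIVE = ...)`. A minimal sketch that only builds those statements; the database name and the `P2`/`S3` tiers are placeholders, not recommendations:

```python
def scale_statement(database: str, service_objective: str) -> str:
    """Build the T-SQL that resizes an Azure SQL database.

    ALTER DATABASE ... MODIFY (SERVICE_OBJECTIVE = ...) is standard
    Azure SQL T-SQL; the names used below are illustrative.
    """
    return (f"ALTER DATABASE [{database}] "
            f"MODIFY (SERVICE_OBJECTIVE = '{service_objective}');")

# Scale up before the pipeline run, then back down once it completes.
print(scale_statement("SalesDb", "P2"))
print(scale_statement("SalesDb", "S3"))
```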
* A SQL DB source table with 887k rows and 74 columns, written to a SQL DB table through a single derived column transformation, takes about 3 minutes end-to-end using a memory optimized 80-core debug Azure IR.
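As a back-of-envelope check, that benchmark works out to roughly 5,000 rows per second. Note it's an end-to-end figure, so it includes read and write time, not only the transformation:

```python
# 887k rows in ~3 minutes, end-to-end (read + transform + write).
rows = 887_000
seconds = 3 * 60
print(round(rows / seconds), "rows/sec")  # ~4928 rows/sec
```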
### [Azure Synapse SQL DW only] Use staging to load data in bulk via Polybase
To avoid row-by-row inserts into your DW, check **Enable staging** in your Sink settings so that ADF can use [PolyBase](https://docs.microsoft.com/sql/relational-databases/polybase/polybase-guide). PolyBase allows ADF to load the data in bulk.
* When you execute your data flow activity from a pipeline, you'll need to select a Blob or ADLS Gen2 storage location to stage your data during bulk loading.
* A file source of a 421 MB file with 74 columns, written to a Synapse table through a single derived column transformation, takes about 4 minutes end-to-end using a memory optimized 80-core debug Azure IR.
## Optimizing for files
At each transformation, you can set the partitioning scheme you want data factory to use on the Optimize tab. It's a good practice to first test file-based sinks with the default partitioning and optimization settings.
* For smaller files, you may find that *Single Partition* sometimes works better and faster than asking Spark to partition your small files.
* If you don't have enough information about your source data, choose *Round Robin* partitioning and set the number of partitions.
* If your data has columns that can be good hash keys, choose *Hash partitioning*.
* A file source with a file sink of a 421 MB file with 74 columns and a single derived column transformation takes about 2 minutes end-to-end using a memory optimized 80-core debug Azure IR.
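To see how the two schemes distribute rows, here's a small simulation of Round Robin versus Hash partitioning. This is plain Python for intuition only; Spark uses its own hash function and distribution logic:

```python
from collections import Counter

def round_robin(rows, num_partitions):
    """Deal rows out in turn: partition sizes stay even regardless of content."""
    parts = {i: [] for i in range(num_partitions)}
    for j, row in enumerate(rows):
        parts[j % num_partitions].append(row)
    return parts

def hash_partition(rows, key, num_partitions):
    """Route each row by hashing a key column: equal keys share a partition."""
    parts = {i: [] for i in range(num_partitions)}
    for row in rows:
        parts[hash(row[key]) % num_partitions].append(row)
    return parts

rows = [{"id": i, "country": c}
        for i, c in enumerate(["US", "DE", "US", "JP", "DE", "US"])]

rr = round_robin(rows, 3)
print([len(p) for p in rr.values()])  # [2, 2, 2] -- always balanced

hp = hash_partition(rows, "country", 3)
# Each country lands in exactly one partition; sizes can skew if keys repeat.
for i, part in hp.items():
    print(i, Counter(r["country"] for r in part))
```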
When debugging in data preview and pipeline debug, the limit and sampling sizes for file-based source datasets only apply to the number of rows returned, not the number of rows read. This can affect the performance of your debug executions and possibly cause the flow to fail.
* Debug clusters are small, single-node clusters by default, and we recommend using small sample files for debugging. Go to Debug Settings and point to a small subset of your data using a temporary file.
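Creating that small temporary subset takes only a few lines of Python before you point Debug Settings at it. The file names and row count below are illustrative:

```python
import os
import tempfile
from itertools import islice
from pathlib import Path

def sample_file(source: Path, n_rows: int) -> Path:
    """Copy the header plus the first n_rows data rows into a temp file."""
    fd, tmp_name = tempfile.mkstemp(suffix=source.suffix)
    os.close(fd)
    tmp = Path(tmp_name)
    with source.open() as src, tmp.open("w") as dst:
        dst.writelines(islice(src, n_rows + 1))  # +1 keeps the header row
    return tmp

# Demo: build a tiny CSV, then take a 2-row sample of it.
fd, name = tempfile.mkstemp(suffix=".csv")
os.close(fd)
full = Path(name)
full.write_text("id,name\n1,a\n2,b\n3,c\n4,d\n")
small = sample_file(full, 2)
print(small.read_text())
```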