Commit b2f8b1d

Update concepts-data-flow-performance.md
1 parent 3cd4ce3 commit b2f8b1d

1 file changed: +8 -2 lines changed

articles/data-factory/concepts-data-flow-performance.md

Lines changed: 8 additions & 2 deletions
@@ -6,7 +6,7 @@ ms.topic: conceptual
 ms.author: makromer
 ms.service: data-factory
 ms.custom: seo-lt-2019
-ms.date: 04/14/2020
+ms.date: 04/27/2020
 ---
 
 # Mapping data flows performance and tuning guide
@@ -144,7 +144,13 @@ Setting throughput and batch properties on CosmosDB sinks only take effect durin
 
 ## Join performance
 
-Managing the performance of joins in your data flow is a very common operation that you will perform throughout the lifecycle of your data transformations. In ADF, data flows do not require data to be sorted prior to joins as these operations are performed as hash joins in Spark. However, you can benefit from improved performance with the "Broadcast" Join optimization. This will avoid shuffles by pushing down the contents of either side of your join relationship into the Spark node. This works well for smaller tables that are used for reference lookups. Larger tables that may not fit into the node's memory are not good candidates for broadcast optimization.
+Managing the performance of joins in your data flow is a very common operation that you will perform throughout the lifecycle of your data transformations. In ADF, data flows do not require data to be sorted prior to joins as these operations are performed as hash joins in Spark. However, you can benefit from improved performance with the "Broadcast" Join optimization that applies to Joins, Exists, and Lookup transformations.
+
+This will avoid on-the-fly shuffles by pushing down the contents of either side of your join relationship into the Spark node. This works well for smaller tables that are used for reference lookups. Larger tables that may not fit into the node's memory are not good candidates for broadcast optimization.
+
+The recommended configuration for data flows with many join operations is to keep the optimization set to "Auto" for "Broadcast" and use a Memory Optimized Azure Integration Runtime configuration. If you are experiencing out-of-memory errors or broadcast timeouts during data flow executions, you can switch off the broadcast optimization. However, this will result in slower performing data flows. Optionally, you can instruct data flow to push down only the left or right side of the join, or both.
+
+![Broadcast Settings](media/data-flow/newbroad.png "Broadcast Settings")
 
 Another Join optimization is to build your joins in such a way that it avoids Spark's tendency to implement cross joins. For example, when you include literal values in your join conditions, Spark may see that as a requirement to perform a full cartesian product first, then filter out the joined values. But if you ensure that you have column values on both sides of your join condition, you can avoid this Spark-induced cartesian product and improve the performance of your joins and data flows.
 
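
Review note: the two join behaviors described in this hunk can be illustrated with a minimal plain-Python sketch. It is not Spark or ADF code, and every name in it (`broadcast_hash_join`, `cartesian_then_filter`, the `orders` and `countries` sample rows) is hypothetical. It assumes the small side fits in memory, which is the condition the text gives for broadcast being a good fit, and it contrasts a hash lookup on a shared key column with a full cartesian product that is filtered afterwards.

```python
from collections import defaultdict


def broadcast_hash_join(large_rows, small_rows, key):
    """Inner join: hash the small side once, then probe it per large-side row."""
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[key]].append(row)

    joined = []
    for row in large_rows:
        # One hash probe per row instead of scanning the whole small side.
        for match in lookup.get(row[key], []):
            joined.append({**row, **match})
    return joined


def cartesian_then_filter(left_rows, right_rows, predicate):
    """Pair every left row with every right row, then keep pairs passing the predicate."""
    comparisons = 0
    joined = []
    for left in left_rows:
        for right in right_rows:
            comparisons += 1
            if predicate(left, right):
                joined.append({**left, **right})
    return joined, comparisons


# Hypothetical sample data: a larger fact-like table and a small reference table.
orders = [
    {"order_id": 1, "country_code": "US"},
    {"order_id": 2, "country_code": "DE"},
    {"order_id": 3, "country_code": "US"},
    {"order_id": 4, "country_code": "FR"},
]
countries = [
    {"country_code": "US", "country": "United States"},
    {"country_code": "DE", "country": "Germany"},
]

if __name__ == "__main__":
    fast = broadcast_hash_join(orders, countries, "country_code")
    slow, comparisons = cartesian_then_filter(
        orders, countries, lambda l, r: l["country_code"] == r["country_code"]
    )
    print(fast == slow, comparisons)
```

Both functions return the same rows for an equality condition on a column from each side, but the cartesian version evaluates `len(orders) * len(countries)` pairs, which is the cost the last paragraph of the hunk warns about.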
