Commit 73fdf2f (2 parents: 55152a5 + cffed48)

Merge pull request #112897 from kromerm/adfdocsmark

Broadcast optimization and perf tips updates

File tree: 5 files changed (+14 −8 lines)

articles/data-factory/concepts-data-flow-overview.md

Lines changed: 2 additions & 2 deletions
```diff
@@ -66,9 +66,9 @@ If you put all of your logic inside a single data flow, ADF executes that same j
 This option can be more challenging to follow and troubleshoot because your business rules and business logic can be jumbled together. This option also doesn't provide much reusability.
 
-##### Execute data flows serially
+##### Execute data flows sequentially
 
-If you execute your data flow activities in serial in the pipeline and you have set a TTL on the Azure IR configuration, then ADF reuses the compute resources (VMs), resulting in faster subsequent execution times. You still receive a new Spark context for each execution.
+If you execute your data flow activities in sequence in the pipeline and you have set a TTL on the Azure IR configuration, then ADF will reuse the compute resources (VMs) resulting in faster subsequent execution times. You will still receive a new Spark context for each execution.
 
 Of these three options, this action likely takes the longest time to execute end-to-end. But it does provide a clean separation of logical operations in each data flow step.
```
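The TTL reuse described in this change is configured on the Azure Integration Runtime itself, not on the pipeline. As a rough sketch (the property names follow the commonly documented ADF integration-runtime JSON schema, but the IR name, core count, and TTL value here are illustrative, not from this commit):

```json
{
  "name": "DataFlowAzureIR",
  "properties": {
    "type": "Managed",
    "typeProperties": {
      "computeProperties": {
        "location": "AutoResolve",
        "dataFlowProperties": {
          "computeType": "MemoryOptimized",
          "coreCount": 8,
          "timeToLive": 10
        }
      }
    }
  }
}
```

With a `timeToLive` of 10 minutes, a subsequent Execute Data Flow activity that starts within that window can reuse the warm VM pool instead of provisioning a new cluster, though (as the doc notes) each execution still gets a fresh Spark context.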

articles/data-factory/concepts-data-flow-performance.md

Lines changed: 8 additions & 2 deletions
```diff
@@ -6,7 +6,7 @@ ms.topic: conceptual
 ms.author: makromer
 ms.service: data-factory
 ms.custom: seo-lt-2019
-ms.date: 04/14/2020
+ms.date: 04/27/2020
 ---
 
 # Mapping data flows performance and tuning guide
```
```diff
@@ -146,7 +146,13 @@ Setting throughput and batch properties on CosmosDB sinks only take effect durin
 ## Join performance
 
-Managing the performance of joins in your data flow is a very common operation that you will perform throughout the lifecycle of your data transformations. In ADF, data flows do not require data to be sorted prior to joins as these operations are performed as hash joins in Spark. However, you can benefit from improved performance with the "Broadcast" Join optimization. This will avoid shuffles by pushing down the contents of either side of your join relationship into the Spark node. This works well for smaller tables that are used for reference lookups. Larger tables that may not fit into the node's memory are not good candidates for broadcast optimization.
+Managing the performance of joins in your data flow is a very common operation that you will perform throughout the lifecycle of your data transformations. In ADF, data flows do not require data to be sorted prior to joins as these operations are performed as hash joins in Spark. However, you can benefit from improved performance with the "Broadcast" Join optimization that applies to Joins, Exists, and Lookup transformations.
+
+This will avoid on-the-fly shuffles by pushing down the contents of either side of your join relationship into the Spark node. This works well for smaller tables that are used for reference lookups. Larger tables that may not fit into the node's memory are not good candidates for broadcast optimization.
+
+The recommended configuration for data flows with many join operations is to keep the optimization set to "Auto" for "Broadcast" and use a Memory Optimized Azure Integration Runtime configuration. If you are experiencing out of memory errors or broadcast timeouts during data flow executions, you can switch off the broadcast optimization. However, this will result in slower performing data flows. Optionally, you can instruct data flow to pushdown only the left or right side of the join, or both.
+
+![Broadcast Settings](media/data-flow/newbroad.png "Broadcast Settings")
```
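The mechanism the new text describes can be illustrated with a simplified model of a broadcast hash join in plain Python (a sketch only; Spark distributes the hash table to every worker, which this single-process example does not show):

```python
# Simplified model of a broadcast hash join: the small ("broadcast") side is
# materialized as an in-memory hash table, so the large side can be joined by
# streaming it once, with no sort and no shuffle of the large table.

def broadcast_hash_join(large_rows, small_rows, key):
    """Inner-join two lists of dicts on `key`, hashing the small side."""
    # Build the hash table from the small side (this is what gets broadcast).
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    # Stream the large side and probe the hash table.
    joined = []
    for row in large_rows:
        for match in lookup.get(row[key], []):
            joined.append({**row, **match})
    return joined

facts = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}, {"id": 1, "amount": 5}]
dims = [{"id": 1, "region": "EU"}, {"id": 2, "region": "US"}]
print(broadcast_hash_join(facts, dims, "id"))
# Each fact row picks up the attributes of its matching reference ("lookup") row.
```

This is also why only the smaller, reference-style table is a good broadcast candidate: the entire hashed side must fit in each node's memory.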

```diff
 Another Join optimization is to build your joins in such a way that it avoids Spark's tendency to implement cross joins. For example, when you include literal values in your join conditions, Spark may see that as a requirement to perform a full cartesian product first, then filter out the joined values. But if you ensure that you have column values on both sides of your join condition, you can avoid this Spark-induced cartesian product and improve the performance of your joins and data flows.
```
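The cross-join pitfall in that paragraph can be sketched in plain Python (a conceptual model, not Spark code): a condition that compares one side to a literal gives the engine no column-to-column equality to hash on, so it falls back to pairing every row combination and filtering afterward; deriving the literal as a column on both sides restores an equi-join key.

```python
left = [{"k": 1}, {"k": 2}]
right = [{"k": 1, "type": "A"}, {"k": 1, "type": "B"}]

# Anti-pattern: the literal condition (r["type"] == "A") forces examining
# every (left, right) pair first, then filtering -- a cartesian product.
cartesian_then_filter = [
    {**l, **r} for l in left for r in right
    if l["k"] == r["k"] and r["type"] == "A"
]

# Better: derive the literal as a column on the left side, so the whole
# condition becomes a column-to-column equality the engine can hash on.
for l in left:
    l["type"] = "A"  # derived constant column
hashed = {}
for r in right:
    hashed.setdefault((r["k"], r["type"]), []).append(r)
equi_join = [{**l, **r} for l in left for r in hashed.get((l["k"], l["type"]), [])]

print(cartesian_then_filter == equi_join)  # same rows, far cheaper plan
```

Both versions produce identical rows; the second only ever probes a hash table instead of enumerating all row pairs.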

articles/data-factory/control-flow-execute-data-flow-activity.md

Lines changed: 2 additions & 2 deletions
```diff
@@ -8,7 +8,7 @@ ms.service: data-factory
 ms.workload: data-services
 ms.topic: conceptual
 ms.author: makromer
-ms.date: 03/16/2020
+ms.date: 04/25/2020
 ---
 
 # Data Flow activity in Azure Data Factory
```
```diff
@@ -93,7 +93,7 @@ If your data flow uses parameterized datasets, set the parameter values in the *
 ### Parameterized data flows
 
-If your data flow is parameterized, set the dynamic values of the data flow parameters in the **Parameters** tab. You can use either the ADF pipeline expression language (only for String types) or the data flow expression language to assign dynamic or literal parameter values. For more information, see [Data Flow Parameters](parameters-data-flow.md).
+If your data flow is parameterized, set the dynamic values of the data flow parameters in the **Parameters** tab. You can use either the ADF pipeline expression language or the data flow expression language to assign dynamic or literal parameter values. For more information, see [Data Flow Parameters](parameters-data-flow.md). If you wish to include pipeline properties as part of your expression to pass into a data flow parameter, then choose pipeline expressions.
 
 ![Execute Data Flow Parameter Example](media/data-flow/parameter-example.png "Parameter Example")
```
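As a hedged example of the pipeline-expression option the new sentence recommends (the parameter name `runLabel` is hypothetical; the `"type": "Expression"` wrapper and the `pipeline().RunId` system variable are standard ADF pipeline-expression constructs), a parameter assignment in the Execute Data Flow activity JSON might look like:

```json
"parameters": {
    "runLabel": {
        "value": "@concat('load-', pipeline().RunId)",
        "type": "Expression"
    }
}
```

Because the expression references a pipeline property (`pipeline().RunId`), it must be a pipeline expression; the data flow expression language cannot see pipeline-scope values.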

articles/data-factory/data-flow-troubleshoot-guide.md

Lines changed: 2 additions & 2 deletions
```diff
@@ -7,7 +7,7 @@ author: kromerm
 manager: anandsub
 ms.service: data-factory
 ms.topic: troubleshooting
-ms.date: 04/02/2020
+ms.date: 04/27/2020
 ---
 # Troubleshoot data flows in Azure Data Factory
```

```diff
@@ -38,7 +38,7 @@ This article explores common troubleshooting methods for data flows in Azure Dat
 - **Message**: Broadcast join timeout error, make sure broadcast stream produces data within 60 secs in debug runs and 300 secs in job runs
 - **Causes**: Broadcast has a default timeout of 60 secs in debug runs and 300 secs in job runs. Stream chosen for broadcast seems too large to produce data within this limit.
-- **Recommendation**: Avoid broadcasting large data streams where the processing can take more than 60 secs. Choose a smaller stream to broadcast instead. Large SQL/DW tables and source files are typically bad candidates.
+- **Recommendation**: Check the Optimize tab on your data flow transformations for Join, Exists, and Lookup. The default option for Broadcast is "Auto". If this is set, or if you are manually setting the left or right side to broadcast under "Fixed", then you can either set a larger Azure Integration Runtime configuration, or switch off broadcast. The recommended approach for best performance in data flows is to allow Spark to broadcast using "Auto" and use a Memory Optimized Azure IR.
 
 ### Error code: DF-Executor-Conversion
```
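The Optimize-tab broadcast choice from the recommendation above is persisted in the underlying data flow script. The fragment below is a sketch only: the `join(...)` shape follows the general data flow script format, but the stream and column names are hypothetical and the exact `broadcast` keyword values (`'auto'`, `'left'`, `'right'`, `'off'`) are assumptions that should be verified against the current data flow script reference.

```
source1, source2 join(source1@customerId == source2@customerId,
    joinType: 'inner',
    broadcast: 'off') ~> JoinNoBroadcast
```

Switching broadcast off this way trades the timeout risk for slower, shuffle-based join execution, as the updated guidance notes.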

(binary file, 6.34 KB, not shown)