Commit 8f3c15f

Merge pull request #116193 from kromerm/adfdocsmark

Updated perf and conditional split

2 parents 5e9912b + de3468d

2 files changed: +9 -3 lines changed


articles/data-factory/concepts-data-flow-performance.md

Lines changed: 6 additions & 2 deletions
@@ -6,7 +6,7 @@ ms.topic: conceptual
 ms.author: makromer
 ms.service: data-factory
 ms.custom: seo-lt-2019
-ms.date: 04/27/2020
+ms.date: 05/21/2020
 ---
 
 # Mapping data flows performance and tuning guide
@@ -36,7 +36,7 @@ While designing mapping data flows, you can unit test each transformation by cli
 
 An Integration Runtime with more cores increases the number of nodes in the Spark compute environments and provides more processing power to read, write, and transform your data. ADF Data Flows utilizes Spark for the compute engine. The Spark environment works very well on memory-optimized resources.
 * Try a **Compute Optimized** cluster if you want your processing rate to be higher than your input rate.
-* Try a **Memory Optimized** cluster if you want to cache more data in memory. Memory optimized has a higher price-point per core than Compute Optimized, but will likely result in faster transformation speeds.
+* Try a **Memory Optimized** cluster if you want to cache more data in memory. Memory optimized has a higher price-point per core than Compute Optimized, but will likely result in faster transformation speeds. If you experience out-of-memory errors when executing your data flows, switch to a memory-optimized Azure IR configuration.
 
 ![New IR](media/data-flow/ir-new.png "New IR")
 
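As editorial context for this hunk: the compute type the added sentence refers to is chosen on the Azure Integration Runtime definition. Below is a minimal sketch of what that might look like in IR JSON, assuming the managed IR schema with `dataFlowProperties`; the IR name, core count, and TTL value are hypothetical placeholders, not part of this commit:

```json
{
    "name": "MemoryOptimizedIR",
    "properties": {
        "type": "Managed",
        "typeProperties": {
            "computeProperties": {
                "location": "AutoResolve",
                "dataFlowProperties": {
                    "computeType": "MemoryOptimized",
                    "coreCount": 16,
                    "timeToLive": 10
                }
            }
        }
    }
}
```

Here `timeToLive` is in minutes and keeps the Spark cluster warm between runs, which also matters for the sequential ForEach pattern discussed in the next hunk.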
@@ -136,6 +136,10 @@ For example, if you have a list of data files from July 2019 that you wish to pr
 
 By using wildcarding, your pipeline will only contain one Data Flow activity. This will perform better than a Lookup against the Blob Store that then iterates across all matched files using a ForEach with an Execute Data Flow activity inside.
 
+A pipeline ForEach in parallel mode spawns a separate job cluster for every executed data flow activity, which can cause Azure service throttling at high numbers of concurrent executions. Using Execute Data Flow inside a ForEach with **Sequential** set in the pipeline avoids throttling and resource exhaustion by forcing Data Factory to execute each of your files against the data flow one at a time.
+
+If you use ForEach with a data flow in sequence, it is recommended that you utilize the TTL setting in the Azure Integration Runtime; otherwise, each file incurs the full five-minute cluster startup time inside your iterator.
+
 ### Optimizing for CosmosDB
 
 Setting throughput and batch properties on CosmosDB sinks only take effect during the execution of that data flow from a pipeline data flow activity. The original collection settings will be honored by CosmosDB after your data flow execution.
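To make the sequential pattern the added paragraphs describe concrete, here is a hedged sketch of the pipeline fragment: a ForEach with `isSequential` set to true wrapping an Execute Data Flow activity. The activity names, the `FileList` parameter, and the `TransformFiles` data flow reference are hypothetical:

```json
{
    "name": "ForEachFile",
    "type": "ForEach",
    "typeProperties": {
        "isSequential": true,
        "items": {
            "value": "@pipeline().parameters.FileList",
            "type": "Expression"
        },
        "activities": [
            {
                "name": "RunDataFlowPerFile",
                "type": "ExecuteDataFlow",
                "typeProperties": {
                    "dataFlow": {
                        "referenceName": "TransformFiles",
                        "type": "DataFlowReference"
                    }
                }
            }
        ]
    }
}
```

With sequential execution plus a TTL on the Azure IR, only the first iteration should pay the full cluster startup cost.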

articles/data-factory/data-flow-conditional-split.md

Lines changed: 3 additions & 1 deletion
@@ -7,7 +7,7 @@ ms.reviewer: daperlov
 ms.service: data-factory
 ms.topic: conceptual
 ms.custom: seo-lt-2019
-ms.date: 10/16/2019
+ms.date: 05/21/2020
 ---
 
 # Conditional split transformation in mapping data flow
@@ -16,6 +16,8 @@ ms.date: 10/16/2019
 
 The conditional split transformation routes data rows to different streams based on matching conditions. The conditional split transformation is similar to a CASE decision structure in a programming language. The transformation evaluates expressions, and based on the results, directs the data row to the specified stream.
 
+> [!VIDEO https://www.microsoft.com/en-us/videoplayer/embed/RE4wKCX]
+
 ## Configuration
 
 The **Split on** setting determines whether the row of data flows to the first matching stream or every stream it matches to.
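For readers of this diff, a quick illustration of the transformation the new video covers, sketched in data flow script form under the assumption of the usual `split(...) ~> name@(streams)` syntax; the `year` column and the stream names are hypothetical, and `disjoint: false` is the script equivalent of splitting on the first matching condition:

```
CleanData
    split(
        year < 1960,
        year > 1980,
        disjoint: false
    ) ~> SplitByYear@(moviesBefore1960, moviesAfter1980, allOtherMovies)
```

Rows matching neither expression fall through to the final stream, which acts as the default.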
