
Commit d32c71c

Merge pull request #230339 from ssabat/patch-2
Updated with shuffle partition.
2 parents: c0adf7a + b2eabc0

1 file changed: +15 −1 lines changed


articles/data-factory/concepts-integration-runtime-performance.md

@@ -8,7 +8,7 @@ ms.author: makromer
ms.service: data-factory
ms.subservice: data-flows
ms.custom: synapse
-ms.date: 07/20/2022
+ms.date: 03/10/2023
---

# Optimizing performance of the Azure Integration Runtime
@@ -49,6 +49,20 @@ Data flows are priced at vcore-hrs meaning that both cluster size and execution-
> There is a ceiling on how much the size of a cluster affects the performance of a data flow. Depending on the size of your data, there is a point where increasing the size of a cluster will stop improving performance. For example, if you have more nodes than partitions of data, adding additional nodes won't help.
A best practice is to start small and scale up to meet your performance needs.

## Custom shuffle partition

A data flow divides the data into partitions and transforms it using different processes. If the data in a partition is larger than the process can hold in memory, the process fails with out-of-memory (OOM) errors. If a data flow contains huge amounts of data and uses joins or aggregations, you may want to change the number of shuffle partitions incrementally. You can set the value from 50 up to 2000 to avoid OOM errors. **Compute custom properties** in the data flow runtime is a way to control your compute requirements. The property name is **Shuffle partitions**, and it's an integer type. Use this customization only in known scenarios; otherwise it can cause unnecessary data flow failures.

While increasing the shuffle partitions, make sure the data is spread well across them. A rough guideline is to have approximately 1.5 GB of data per partition. If the data is skewed, increasing the **Shuffle partitions** value won't help. For example, if you have 500 GB of data, a value between 400 and 500 should work. The default limit for shuffle partitions is 200, which works well for approximately 300 GB of data.

Here's how to set the value on a custom integration runtime. You can't set it for the auto-resolve integration runtime.

1. From the ADF portal, under **Manage**, select a custom integration runtime and go to edit mode.
2. Under the data flow runtime tab, go to the **Compute custom properties** section.
3. Select **Shuffle partitions** under Property name and enter a value of your choice, such as 250 or 500.

You can do the same by editing the JSON file of the runtime: add an array with the property name and value after the *cleanup* property, as in the sketch below.
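
The equivalent JSON edit might look like the following minimal sketch. The `dataFlowProperties` layout (computeType, coreCount, timeToLive, cleanup) reflects a typical integration runtime definition, but the `customProperties` array name and the exact `"name"` string are assumptions here; verify both against the property name shown in the portal. The comment is for explanation only and isn't valid in strict JSON.

```json
"dataFlowProperties": {
    "computeType": "General",
    "coreCount": 8,
    "timeToLive": 0,
    "cleanup": false,
    // Hypothetical: the array is added after the cleanup property, per the step above.
    "customProperties": [
        {
            "name": "Shuffle partitions",
            "value": "250"
        }
    ]
}
```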
## Time to live

By default, every data flow activity spins up a new Spark cluster based upon the Azure IR configuration. Cold-cluster start-up time takes a few minutes, and data processing can't start until it's complete. If your pipelines contain multiple **sequential** data flows, you can enable a time to live (TTL) value. Specifying a TTL value keeps a cluster alive for a certain period of time after its execution completes. If a new job starts using the IR during the TTL time, it reuses the existing cluster, and start-up time is greatly reduced. After the second job completes, the cluster again stays alive for the TTL time.
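
For comparison, a minimal sketch of where a TTL value would sit in the same assumed `dataFlowProperties` layout as above; the 10-minute value is only an example, and the comment again isn't valid in strict JSON.

```json
"dataFlowProperties": {
    "computeType": "General",
    "coreCount": 8,
    // Hypothetical example: keep the cluster alive for 10 minutes after a run completes.
    "timeToLive": 10
}
```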
