You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: articles/synapse-analytics/spark/low-shuffle-merge-for-apache-spark.md
+9-6Lines changed: 9 additions & 6 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -12,17 +12,20 @@ ms.reviewer: dacoelho
12
12
13
13
# Low Shuffle Merge optimization on Delta tables
14
14
15
-
Delta Lake [MERGE command](https://docs.delta.io/latest/delta-update.html#upsert-into-a-table-using-merge) allows users to update a delta table with advanced conditions. It can update data from a source table, view or DataFrame into a target table by using MERGE command. However, the current algorithm is not fully optimized for handling *unmodified* rows. With Low Shuffle Merge optimization, unmodified rows are excluded from an expensive shuffling operation which is needed for updating matched rows.
15
+
Delta Lake [MERGE command](https://docs.delta.io/latest/delta-update.html#upsert-into-a-table-using-merge) allows users to update a delta table with advanced conditions. It can update data from a source table, view or DataFrame into a target table by using MERGE command. However, the current algorithm isn't fully optimized for handling *unmodified* rows. With Low Shuffle Merge optimization, unmodified rows are excluded from an expensive shuffling operation that is needed for updating matched rows.
16
16
17
-
Currently MERGE operation is done by 2 Join operations. The first join is using the whole target table and source data, to find a list of *touched* files of the target table including any matched rows. After that, it performs the second join reading only those *touched* files and source data, to do actual table update. Even though the first join is to reduce the amount of data for the second join, there could still be a huge number of *unmodified* rows in *touched* files. The first Join query is lighter because it only reads columns in the given matching condition, but the second one needs to load all columns of each file which incurs a heavy shuffling process.
17
+
## Why we need Low Shuffle Merge
18
+
19
+
Currently MERGE operation is done by two Join executions. The first join is using the whole target table and source data, to find a list of *touched* files of the target table including any matched rows. After that, it performs the second join reading only those *touched* files and source data, to do actual table update. Even though the first join is to reduce the amount of data for the second join, there could still be a huge number of *unmodified* rows in *touched* files. The first join query is lighter as it only reads columns in the given matching condition. The second one for table update needs to load all columns, which incurs an expensive shuffling process.
20
+
21
+
With Low Shuffle Merge optimization, Delta keeps the matched row result from the first join temporarily and utilizes it for the second join. Using the result, it excludes *unmodified* rows from the heavy shuffling process. There would be two separate write jobs for *matched* rows and *unmodified* rows, thus it could result in 2x number of output files compared to the previous behavior. However, the expected performance gain outweighs the possible small files problem.
18
22
19
-
With Low Shuffle Merge optimization, Delta keeps the matched row result from the first join temporarily and utilizes it for the second join. Using the result, it excludes *unmodified* rows from the heavy shuffling process. There would be 2 separate write jobs for *matched* rows and *unmodified* rows, thus it could result in 2x number of output files compared to the previous behavior. However, the expected performance gain outweighs the possible small files problem.
20
23
## Availability
21
24
22
25
> [!NOTE]
23
26
> - Low Shuffle Merge is available as a Preview feature.
24
27
25
-
It is available on Synapse Pools for Apache Spark versions 3.2 and 3.3.
28
+
It's available on Synapse Pools for Apache Spark versions 3.2 and 3.3.
26
29
27
30
|Version| Availability | Default |
28
31
|--|--|--|
@@ -33,9 +36,9 @@ It is available on Synapse Pools for Apache Spark versions 3.2 and 3.3.
33
36
34
37
## Benefits of Low Shuffle Merge
35
38
36
-
* Unmodified rows in *touched* files are handled separately and not going through the actual MERGE operation. It can save the overall MERGE execution time and compute resources. The gain would be larger when a lot of rows are just copied and only a small number of rows are updated.
39
+
* Unmodified rows in *touched* files are handled separately and not going through the actual MERGE operation. It can save the overall MERGE execution time and compute resources. The gain would be larger when many rows are copied and only a small number of rows are updated.
37
40
* Row orderings are preserved for unmodified rows. Therefore, the output files of unmodified rows could be still efficient for data skipping if the file was sorted or Z-ORDERED.
38
-
*Very little overhead even for the worst case which is matching all rows.
41
+
*There would be tiny overhead even for the worst case when MERGE condition matches all rows in touched files.
0 commit comments