
Commit addaefc (1 parent: 4443d72)

Update low-shuffle-merge-for-apache-spark.md

1 file changed: +2 −2 lines

articles/synapse-analytics/spark/low-shuffle-merge-for-apache-spark.md

Lines changed: 2 additions & 2 deletions
@@ -18,7 +18,7 @@ Delta Lake [MERGE command](https://docs.delta.io/latest/delta-update.html#upsert
Currently, the MERGE operation is performed by two join executions. The first join uses the whole target table and the source data to find the list of *touched* files in the target table, that is, those containing any matched rows. It then performs a second join, reading only those *touched* files and the source data, to do the actual table update. Even though the first join reduces the amount of data for the second join, there can still be a huge number of *unmodified* rows in the *touched* files. The first join query is lighter because it reads only the columns in the given matching condition. The second one, for the table update, needs to load all columns, which incurs an expensive shuffle.
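The two-phase flow described above can be sketched in plain Python, with dicts standing in for Parquet files. Everything here (`merge_sketch`, the file model, the `key` column) is illustrative, not a Delta API:

```python
# Toy model of Delta MERGE's two-phase execution. A dict of lists stands
# in for the target table's data files; rows are plain dicts.

def merge_sketch(target_files, source, key="id"):
    """target_files: {file_name: [row, ...]}, source: [row, ...]."""
    source_keys = {row[key] for row in source}

    # Phase 1 ("first join"): scan only the match column to find the
    # *touched* files -- those containing at least one matched row.
    touched = {
        name for name, rows in target_files.items()
        if any(row[key] in source_keys for row in rows)
    }

    # Phase 2 ("second join"): read full rows, but only from touched files.
    # Modeling Low Shuffle Merge: matched and unmodified rows go to
    # separate outputs, and unmodified rows keep their original order
    # (they skip the shuffle entirely).
    updates = {row[key]: row for row in source}
    matched_out, unmodified_out = [], []
    for name in touched:
        for row in target_files[name]:
            if row[key] in updates:
                matched_out.append(updates[row[key]])  # updated row
            else:
                unmodified_out.append(row)             # preserved as-is
    return touched, matched_out, unmodified_out
```

Untouched files (here, any file with no matched rows) are never rewritten at all; that is the saving the first join provides, while the matched/unmodified split is the additional saving Low Shuffle Merge adds within touched files.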
- With Low Shuffle Merge optimization, Delta keeps the matched row result from the first join temporarily and utilizes it for the second join. Using the result, it excludes *unmodified* rows from the heavy shuffling process. There would be two separate write jobs for *matched* rows and *unmodified* rows, thus it could result in 2x number of output files compared to the previous behavior. However, the expected performance gain outweighs the possible small files problem.
+ With Low Shuffle Merge optimization, Delta keeps the matched row result from the first join temporarily and utilizes it for the second join. Based on the result, it excludes *unmodified* rows from the heavy shuffling process. There would be two separate write jobs for *matched* rows and *unmodified* rows, thus it could result in 2x number of output files compared to the previous behavior. However, the expected performance gain outweighs the possible small files problem.
## Availability
@@ -36,7 +36,7 @@ It's available on Synapse Pools for Apache Spark versions 3.2 and 3.3.
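On Synapse pools, turning the feature on is a session-configuration change. A config-only sketch follows; the property name `spark.microsoft.delta.merge.lowShuffle.enabled` is an assumption to verify against your runtime's documentation, and the snippet needs a live Synapse Spark session (`spark`), so it is not runnable locally:

```python
# Config fragment -- assumes an existing Synapse Spark session `spark`.
# Assumption: the property name below is the Synapse switch for Low
# Shuffle Merge; check your runtime's documentation before relying on it.
spark.conf.set("spark.microsoft.delta.merge.lowShuffle.enabled", "true")
```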
## Benefits of Low Shuffle Merge
- * Unmodified rows in *touched* files are handled separately and not going through the actual MERGE operation. It can save the overall MERGE execution time and compute resources. The gain would be larger when many rows are copied and only a small number of rows are updated.
+ * Unmodified rows in *touched* files are handled separately and not going through the actual MERGE operation. It can save the overall MERGE execution time and compute resources. The gain would be larger when many rows are copied and only a few rows are updated.
* Row orderings are preserved for unmodified rows. Therefore, the output files of unmodified rows can still be efficient for data skipping if the files were sorted or Z-ORDERED.
* There would be only a tiny overhead even in the worst case, when the MERGE condition matches all rows in the touched files.