Commit 318127d

Merge pull request #234201 from sezruby/lsmdoc

Low Shuffle Merge doc

2 parents 1aca517 + addaefc

File tree

3 files changed: +93 -4 lines changed

articles/synapse-analytics/spark/low-shuffle-merge-for-apache-spark.md

Lines changed: 88 additions & 0 deletions
@@ -0,0 +1,88 @@
---
title: Low Shuffle Merge optimization on Delta tables
description: Low Shuffle Merge optimization on Delta tables for Apache Spark
author: sezruby
ms.service: synapse-analytics
ms.topic: reference
ms.subservice: spark
ms.date: 04/11/2023
ms.author: eunjinsong
ms.reviewer: dacoelho
---

# Low Shuffle Merge optimization on Delta tables

The Delta Lake [MERGE command](https://docs.delta.io/latest/delta-update.html#upsert-into-a-table-using-merge) allows users to update a Delta table with advanced conditions, merging data from a source table, view, or DataFrame into a target table. However, the current algorithm isn't fully optimized for handling *unmodified* rows. With the Low Shuffle Merge optimization, unmodified rows are excluded from the expensive shuffling operation that's needed for updating matched rows.
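
For illustration, here's a minimal upsert sketch using the Delta Lake Scala API; the table names `target_table` and `updates` and the `id` join key are hypothetical:

```scala
import io.delta.tables.DeltaTable

// Hypothetical upsert: update rows of the target Delta table that match
// the source on `id`, and insert the rest. Low Shuffle Merge optimizes
// how this operation rewrites the target files.
val target = DeltaTable.forName(spark, "target_table")
val updates = spark.table("updates")

target.as("t")
  .merge(updates.as("s"), "t.id = s.id")
  .whenMatched().updateAll()
  .whenNotMatched().insertAll()
  .execute()
```
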
## Why we need Low Shuffle Merge
Currently, the MERGE operation is performed by two join executions. The first join uses the whole target table and the source data to find the list of *touched* files of the target table, that is, the files that include any matched rows. After that, it performs a second join, reading only those *touched* files and the source data, to do the actual table update. Even though the first join reduces the amount of data for the second join, there can still be a huge number of *unmodified* rows in the *touched* files. The first join query is lighter because it reads only the columns in the given matching condition. The second one, for the table update, needs to load all columns, which incurs an expensive shuffling process.

With the Low Shuffle Merge optimization, Delta temporarily keeps the matched-row result from the first join and uses it for the second join. Based on that result, it excludes *unmodified* rows from the heavy shuffling process. There are two separate write jobs, one for *matched* rows and one for *unmodified* rows, so the merge can produce twice as many output files as the previous behavior. However, the expected performance gain outweighs the possible small-files problem.
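
As a rough conceptual sketch of that first, lighter join (this isn't the actual Delta implementation; the paths and the `id` match column are hypothetical):

```scala
import org.apache.spark.sql.functions.input_file_name

// Conceptual sketch only -- not the actual Delta internals.
val targetDf = spark.read.format("delta").load("/tmp/delta/target")  // hypothetical path
val sourceDf = spark.read.format("delta").load("/tmp/delta/source")  // hypothetical path

// First join: match target and source on the join key, reading little more
// than the condition column, and record which target file each matched row
// came from -- these are the *touched* files.
val touchedFiles = targetDf
  .withColumn("file", input_file_name())
  .join(sourceDf.select("id"), Seq("id"))
  .select("file")
  .distinct()

// The second join then rewrites only the touched files. In the classic
// algorithm every row in them -- matched or not -- goes through a
// full-width shuffle; Low Shuffle Merge reuses the first join's match
// result so unmodified rows skip that shuffle entirely.
```
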
## Availability

> [!NOTE]
> - Low Shuffle Merge is available as a Preview feature.

It's available on Synapse Pools for Apache Spark versions 3.2 and 3.3.

|Version| Availability | Default |
|--|--|--|
| Delta 0.6 / Spark 2.4 | No | - |
| Delta 1.2 / Spark 3.2 | Yes | false |
| Delta 2.2 / Spark 3.3 | Yes | true |

## Benefits of Low Shuffle Merge

* Unmodified rows in *touched* files are handled separately and don't go through the actual MERGE operation. This can save overall MERGE execution time and compute resources. The gain is larger when many rows are copied and only a few rows are updated.
* Row orderings are preserved for unmodified rows. Therefore, the output files of unmodified rows can still be efficient for data skipping if the files were sorted or Z-ORDERED.
* There's only a tiny overhead even in the worst case, when the MERGE condition matches all rows in the *touched* files.

## How to enable and disable Low Shuffle Merge

Once the configuration is set for the pool or session, all Spark write patterns use the functionality.

To use the Low Shuffle Merge optimization, enable it using the following configuration:

1. Scala and PySpark

```scala
spark.conf.set("spark.microsoft.delta.merge.lowShuffle.enabled", "true")
```

2. Spark SQL

```SQL
SET `spark.microsoft.delta.merge.lowShuffle.enabled` = true
```

To check the current configuration value, use the following command:

1. Scala and PySpark

```scala
spark.conf.get("spark.microsoft.delta.merge.lowShuffle.enabled")
```

2. Spark SQL

```SQL
SET `spark.microsoft.delta.merge.lowShuffle.enabled`
```

To disable the feature, change the configuration as follows:

1. Scala and PySpark

```scala
spark.conf.set("spark.microsoft.delta.merge.lowShuffle.enabled", "false")
```

2. Spark SQL

```SQL
SET `spark.microsoft.delta.merge.lowShuffle.enabled` = false
```
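
Putting it together, here's a hypothetical session sketch that enables the feature and then runs an ordinary MERGE; the MERGE statement itself (table names are hypothetical) needs no changes:

```scala
// Hypothetical session sketch: enable Low Shuffle Merge for this session,
// then run a normal MERGE. The optimization is applied transparently.
spark.conf.set("spark.microsoft.delta.merge.lowShuffle.enabled", "true")

spark.sql("""
  MERGE INTO target_table AS t
  USING updates AS s
  ON t.id = s.id
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")
```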

articles/synapse-analytics/spark/optimize-write-for-apache-spark.md

Lines changed: 3 additions & 4 deletions
@@ -19,8 +19,7 @@ Optimize Write is a Delta Lake on Synapse feature that reduces the number of fil
 This feature achieves the file size by using an extra data shuffle phase over partitions, causing an extra processing cost while writing the data. The small write penalty should be outweighed by read efficiency on the tables.
 
 > [!NOTE]
-> - Optimize write is available as a Preview feature.
-> - It is available on Synapse Pools for Apache Spark versions 3.1 and 3.2.
+> - It is available on Synapse Pools for Apache Spark versions above 3.1.
 
 ## Benefits of Optimize Writes
 
@@ -48,7 +47,7 @@ This feature achieves the file size by using an extra data shuffle phase over pa
 
 ## How to enable and disable the optimize write feature
 
-The optimize write feature is disabled by default.
+The optimize write feature is disabled by default. In Spark 3.3 Pool, it is enabled by default for partitioned tables.
 
 Once the configuration is set for the pool or session, all Spark write patterns will use the functionality.
 
@@ -172,4 +171,4 @@ SET `spark.microsoft.delta.optimizeWrite.binSize` = 134217728
 - [Use serverless Apache Spark pool in Synapse Studio](../quickstart-create-apache-spark-pool-studio.md).
 - [Run a Spark application in notebook](./apache-spark-development-using-notebooks.md).
 - [Create Apache Spark job definition in Azure Studio](./apache-spark-job-definitions.md).
-
+

articles/synapse-analytics/toc.yml

Lines changed: 2 additions & 0 deletions
@@ -722,6 +722,8 @@ items:
       href: ./spark/apache-spark-what-is-delta-lake.md
     - name: Optimize Apache Spark writes on Delta Lake
       href: ./spark/optimize-write-for-apache-spark.md
+    - name: Low Shuffle Merge on Delta Lake
+      href: ./spark/low-shuffle-merge-for-apache-spark.md
     - name: Apache Spark autoscale behavior
       href: ./spark/apache-spark-autoscale.md
     - name: Intelligent Cache
