Skip to content

Commit 3e609b8

Browse files
authored
Create low-shuffle-merge-for-apache-spark.md
draft
1 parent 9fb5587 commit 3e609b8

File tree

1 file changed

+88
-0
lines changed

1 file changed

+88
-0
lines changed
Lines changed: 88 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,88 @@
1+
---
2+
title: Low Shuffle Merge optimization on Delta tables
3+
description: Low Shuffle Merge optimization on Delta tables for Apache Spark
4+
author: sezruby
5+
ms.service: synapse-analytics
6+
ms.topic: reference
7+
ms.subservice: spark
8+
ms.date: 04/11/2023
9+
ms.author: eunjinsong
10+
ms.reviewer: dacoelho
11+
---
12+
13+
# Introduction
14+
15+
Delta Lake [MERGE command](https://docs.delta.io/latest/delta-update.html#upsert-into-a-table-using-merge) allows users to update a delta table with advanced conditions. It can update data from a source table, view or DataFrame into a target table by using MERGE command. However, the current algorithm of MERGE command is not optimized for handling *unmodified* rows. With Low Shuffle Merge optimization, unmodified rows are excluded from expensive shuffling execution and written separately.
16+
17+
To execute MERGE, 2 join operations are required. The first one is joining whole target table and source data, to find *touched* files including any matched row. The other one is for actual MERGE operation only with *touched* files of the target table. The first join is lighter as it only reads columns in matching condition. Although Delta performs the first join to reduce the amount of data for the actual merge process, huge amount of *unmodified* rows in *touched* files goes through the second join process. With Low Shuffle Merge, Delta retreives "matched" rows result from the first join and utilizes it for classifying *matched* rows. Based on that, there are 2 separate write jobs for *matched* rows and *unmodified* rows, so it could result in 2x number of output files compared to default MERGE operation. The expected performance gain outweighs the possible small files problem.
18+
19+
## Availability
20+
21+
> [!NOTE]
22+
> - Low shuffle merge is available as a Preview feature.
23+
24+
It is available on Synapse Pools for Apache Spark versions 3.2 and 3.3.
25+
26+
|Version| Availability | Default |
27+
|--|--|--|
28+
| Delta 0.6 / Spark 2.4 | No | - |
29+
| Delta 1.2 / Spark 3.2 | Yes | false |
30+
| Delta 2.2 / Spark 3.3 | Yes | true |
31+
32+
33+
## Benefits of Low Shuffle Merge
34+
35+
* Unmodified rows in *touched* files are handled separately and not going through the actual MERGE operation. It can save the overall MERGE execution time and resources.
36+
* Row orderings are preserved for unmodified rows. Therefore, the output files of unmodified rows are still efficient for data skipping if the file was sorted or Z-ORDERED.
37+
* Not much overhead even for the worst cast which is matching all rows.
38+
39+
40+
## How to enable and disable Low Shuffle Merge
41+
42+
Once the configuration is set for the pool or session, all Spark write patterns will use the functionality.
43+
44+
To use Low Shuffle Merge optimization, enable it using the following configuration:
45+
46+
1. Scala and PySpark
47+
48+
```scala
49+
spark.conf.set("spark.microsoft.delta.merge.lowShuffle.enabled", "true")
50+
```
51+
52+
2. Spark SQL
53+
54+
```SQL
55+
SET `spark.microsoft.delta.merge.lowShuffle.enabled` = true
56+
```
57+
58+
To check the current configuration value, use the command as shown below:
59+
60+
1. Scala and PySpark
61+
62+
```scala
63+
spark.conf.get("spark.microsoft.delta.merge.lowShuffle.enabled")
64+
```
65+
66+
2. Spark SQL
67+
68+
```SQL
69+
SET `spark.microsoft.delta.merge.lowShuffle.enabled`
70+
```
71+
72+
To disable the feature, change the following configuration as shown below:
73+
74+
1. Scala and PySpark
75+
76+
```scala
77+
spark.conf.set("spark.microsoft.delta.merge.lowShuffle.enabled", "false")
78+
```
79+
80+
2. Spark SQL
81+
82+
```SQL
83+
SET `spark.microsoft.delta.merge.lowShuffle.enabled` = false
84+
```
85+
86+
## Future improvement
87+
88+
With Low Shuffle Merge feature, rewriting unmodified rows still takes a long time and resources. There is ongoing work in OSS Delta Lake for deletion vector. Once it is delivered, we would be able to skip rewriting unmodified rows.

0 commit comments

Comments
 (0)