
Commit 03d664b

Merge pull request #223620 from SharonZhang1/updatesparkkadvisor
update spark advisor
2 parents 153dc87 + 5e81cb1

1 file changed: 32 additions & 21 deletions
@@ -1,6 +1,6 @@
 ---
-title: Troubleshoot Spark application issues with Spark Advisor
-description: Learn how to troubleshoot Spark application issues with Spark Advisor. The advisor automatically analyzes queries and commands, and offers advice.
+title: Spark Advisor
+description: Spark Advisor is a system that automatically analyzes commands and queries, and shows appropriate advice when a customer executes code or a query.
 services: synapse-analytics
 author: jejiang
 ms.author: jejiang
@@ -11,51 +11,62 @@ ms.subservice: spark
 ms.date: 06/23/2022
 ---
 
-# Troubleshoot Spark application issues with Spark Advisor
+# Spark Advisor
 
-Spark Advisor is a system that automatically analyzes your code, queries, and commands, and advises you about them. By following this advice, you can improve your execution performance, fix execution failures, and decrease costs. This article helps you solve common problems with Spark Advisor.
+Spark Advisor is a system that automatically analyzes commands and queries, and shows appropriate advice when a customer executes code or a query. By applying the advice, you can improve execution performance, decrease cost, and fix execution failures.
 
-## Advice on query hints
 
-### May return inconsistent results when using 'randomsplit'
-Verify that the hint is spelled correctly.
+## May return inconsistent results when using 'randomSplit'
+Inconsistent or inaccurate results may be returned when working with the results of the 'randomSplit' method. Use Apache Spark (RDD) caching before calling the 'randomSplit' method.
+
+The randomSplit() method is equivalent to performing sample() on your DataFrame multiple times, with each sample refetching, partitioning, and sorting the DataFrame within partitions. The data distribution across partitions and the sorting order matter for both randomSplit() and sample(). If either changes when the data is refetched, there may be duplicate or missing values across splits, and the same sample with the same seed may produce different results.
+
+These inconsistencies may not happen on every run, but to eliminate them completely, cache your DataFrame, repartition on a column or columns, or apply aggregate functions such as groupBy.
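The caching advice above can be sketched in Scala as follows. This is a minimal sketch, not part of the original article: the DataFrame, split weights, and seed are hypothetical, and an active SparkSession named `spark` is assumed.

```scala
// Hypothetical sketch: cache the DataFrame before splitting so that both
// splits sample the same materialized data. Assumes an active SparkSession.
val df = spark.range(0, 10000).toDF("id")

// Without caching, each split may refetch and re-sort the source,
// which can yield overlapping or missing rows across splits.
df.cache()

// Fixing the seed alone is not enough; caching keeps the partitioning
// and ordering stable across the underlying samples.
val Array(train, test) = df.randomSplit(Array(0.8, 0.2), seed = 42)
```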
+
+## Table/view name is already in use
+A view already exists with the same name as the created table, or a table already exists with the same name as the created view.
+When this name is used in queries or applications, only the view is returned, no matter which one was created first. To avoid conflicts, rename either the table or the view.
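A minimal sketch of the name collision described above; the table and view names are hypothetical, and an active SparkSession named `spark` is assumed:

```scala
// Hypothetical sketch: a table and a temp view share the name "items".
spark.sql("CREATE TABLE items (id INT)")

// A temp view created with the same name shadows the table:
spark.range(5).toDF("id").createOrReplaceTempView("items")

// This now resolves to the view, not the table, regardless of which
// was created first. Renaming one of them avoids the conflict.
spark.sql("SELECT * FROM items").show()
```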
+
+## Hint-related advice
+
+### Unable to recognize a hint
+The selected query contains a hint that isn't recognized. Verify that the hint is spelled correctly.
 
 ```scala
 spark.sql("SELECT /*+ unknownHint */ * FROM t1")
 ```

-### Unable to find specified relation names
-Verify that the relations are spelled correctly and are accessible within the scope of the hint.
+### Unable to find a specified relation name
+Unable to find the relation or relations specified in the hint. Verify that the relations are spelled correctly and accessible within the scope of the hint.

 ```scala
 spark.sql("SELECT /*+ BROADCAST(unknownTable) */ * FROM t1 INNER JOIN t2 ON t1.str = t2.str")
 ```
 
 ### A hint in the query prevents another hint from being applied
+The selected query contains a hint that prevents another hint from being applied.
 
 ```scala
 spark.sql("SELECT /*+ BROADCAST(t1), MERGE(t1, t2) */ * FROM t1 INNER JOIN t2 ON t1.str = t2.str")
 ```
 
-### Reduce rounding error propagation caused by division
-This query contains the expression with the `double` type. We recommend that you enable the configuration `spark.advise.divisionExprConvertRule.enable`, which can help reduce the division expressions and the rounding error propagation.
+## Enable 'spark.advise.divisionExprConvertRule.enable' to reduce rounding error propagation
+This query contains an expression with the Double type. We recommend that you enable the configuration 'spark.advise.divisionExprConvertRule.enable', which can help reduce the number of division expressions and the rounding error propagation.
 
 ```text
 "t.a/t.b/t.c" convert into "t.a/(t.b * t.c)"
 ```

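As an illustration only: `spark.advise.divisionExprConvertRule.enable` is a Synapse-specific setting, and whether it can be toggled per session is an assumption here; Spark SQL configurations are commonly set with `spark.conf.set`, which might look like:

```scala
// Hypothetical sketch: enable the advisor's division-conversion rule.
// Assumes the setting is session-scoped (an assumption, not documented here).
spark.conf.set("spark.advise.divisionExprConvertRule.enable", "true")

// A chained division such as t.a / t.b / t.c can then be rewritten as
// t.a / (t.b * t.c), performing one division instead of two and
// propagating less rounding error.
spark.sql("SELECT a / b / c AS ratio FROM t")
```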
-### Improve query performance for non-equal join
-This query contains a time-consuming join because of an `Or` condition within the query. We recommend that you enable the configuration `spark.advise.nonEqJoinConvertRule.enable`. It can help convert the join triggered by the `Or` condition to shuffle sort merge join (SMJ) or broadcast hash join (BHJ) to accelerate this query.
+## Enable 'spark.advise.nonEqJoinConvertRule.enable' to improve query performance
+This query contains a time-consuming join due to an "Or" condition within the query. We recommend that you enable the configuration 'spark.advise.nonEqJoinConvertRule.enable', which can help convert the join triggered by the "Or" condition to shuffle sort merge join (SMJ) or broadcast hash join (BHJ) to accelerate this query.
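As with the division rule, the mechanism for enabling this Synapse-specific setting is an assumption; one plausible form uses the standard Spark SQL `SET` command:

```scala
// Hypothetical sketch: enable the non-equi-join conversion rule via SQL SET
// (equivalent to spark.conf.set; session scope is an assumption).
spark.sql("SET spark.advise.nonEqJoinConvertRule.enable=true")

// A join whose condition contains OR would otherwise fall back to a slow
// nested-loop style join; the rule can convert it to SMJ or BHJ.
spark.sql("SELECT * FROM t1 JOIN t2 ON t1.a = t2.a OR t1.b = t2.b")
```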

-The `randomSplit()` method is equivalent to performing a `sample()` action on your DataFrame multiple times, with each sample refetching, partitioning, and sorting your DataFrame within partitions. The data distribution across partitions and sort order is important for both `randomSplit()` and `sample()` methods. If either changes upon data refetch, there might be duplicates or missing values across splits, and the same sample that uses the same seed might produce different results.
+## Optimize Delta table with small files compaction
 
-These inconsistencies might not happen on every run. To eliminate them completely, cache your DataFrame, repartition on columns, or apply aggregate functions such as `groupBy`.
+This query is on a Delta table with many small files. To improve query performance, run the OPTIMIZE command on the Delta table. More details can be found in this [article](https://aka.ms/small-file-advise-delta).
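A sketch of the compaction step using standard Delta Lake SQL; the table name and path are hypothetical:

```scala
// Compact small files in a Delta table with the OPTIMIZE command
// (standard Delta Lake SQL; "events" is a hypothetical table name).
spark.sql("OPTIMIZE events")

// A path-based Delta table works too:
spark.sql("OPTIMIZE delta.`/data/events`")
```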

-### A table or view name might already be in use
-A view already exists with the same name as the created table, or a table already exists with the same name as the created view. When you use this name in queries or applications, Spark Advisor returns only the view, regardless of which one was created first. To avoid conflicts, rename either the table or the view.
+## Optimize Delta table with ZOrder
 
+This query is on a Delta table and contains a highly selective filter. To improve query performance, run the OPTIMIZE ZORDER BY command on the Delta table. More details can be found in this [article](https://aka.ms/small-file-advise-delta).
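A sketch of Z-ordering using standard Delta Lake SQL; the table and column names are hypothetical:

```scala
// Co-locate rows on the frequently filtered column with Z-ordering
// (standard Delta Lake SQL; "events" and "eventId" are hypothetical names).
spark.sql("OPTIMIZE events ZORDER BY (eventId)")

// Subsequent highly selective filters on eventId can skip more files:
spark.sql("SELECT * FROM events WHERE eventId = 123")
```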
 ## Next steps
-For more information on monitoring pipeline runs, see [Monitor pipeline runs using Synapse Studio](how-to-monitor-pipeline-runs.md).
+
+For more information on monitoring pipeline runs, see the [Monitor pipeline runs using Synapse Studio](how-to-monitor-pipeline-runs.md) article.
