Commit c04a8e5

Merge pull request #202570 from v-lanjli/newdocforSparkAdvisor

new file spark advisor

2 parents 2e8bb6d + 1f17afc

File tree

2 files changed: +71 −3 lines changed
articles/synapse-analytics/monitoring/apache-spark-advisor.md

Lines changed: 66 additions & 0 deletions

@@ -0,0 +1,66 @@
---
title: Spark Advisor
description: Spark Advisor is a system that automatically analyzes commands and queries, and shows the appropriate advice when a customer executes code or a query.
services: synapse-analytics
author: jejiang
ms.author: jejiang
ms.reviewer: sngun
ms.service: synapse-analytics
ms.topic: tutorial
ms.subservice: spark
ms.date: 06/23/2022
---
# Spark Advisor

Spark Advisor is a system that automatically analyzes commands and queries, and shows the appropriate advice when you execute code or a query. After applying the advice, you have the chance to improve execution performance, decrease cost, and fix execution failures.
## Advice provided

### May return inconsistent results when using 'randomSplit'

Inconsistent or inaccurate results may be returned when working with the results of the 'randomSplit' method. Use Apache Spark (RDD) caching before using the 'randomSplit' method.

The randomSplit() method is equivalent to performing sample() on your DataFrame multiple times, with each sample refetching, partitioning, and sorting the DataFrame within partitions. The data distribution across partitions and the sorting order matter for both randomSplit() and sample(): if either changes upon data refetch, there may be duplicates or missing values across splits, and the same sample using the same seed may produce different results.

These inconsistencies may not happen on every run, but to eliminate them completely, cache your DataFrame, repartition on a column or columns, or apply aggregate functions such as groupBy.
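A minimal sketch of the caching approach, assuming a predefined SparkSession named `spark` (as in Synapse notebooks) and a hypothetical input path:

```scala
// Read some input data (the path is hypothetical).
val df = spark.read.parquet("/data/events")

// Cache and materialize the DataFrame so every split samples the same,
// stable snapshot of the data.
val cached = df.cache()
cached.count() // force materialization

// With a cached input, the same seed now yields consistent splits.
val Array(train, test) = cached.randomSplit(Array(0.8, 0.2), seed = 42)
```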
### Table/view name is already in use

A view already exists with the same name as the created table, or a table already exists with the same name as the created view.

When this name is used in queries or applications, only the view is returned, no matter which one was created first. To avoid conflicts, rename either the table or the view.
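As an illustration (a minimal sketch; the name `items` and the sample values are made up), a temporary view that shares its name with a table shadows the table for unqualified lookups:

```scala
// Illustrative only: a table and a temporary view sharing the name "items".
import spark.implicits._

spark.sql("CREATE TABLE items (id INT) USING parquet")
Seq(1, 2, 3).toDF("id").createOrReplaceTempView("items")

// Unqualified lookups resolve to the temporary view, not the table.
spark.sql("SELECT * FROM items").show()
```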
## Hint-related advice

### Unable to recognize a hint

The selected query contains a hint that isn't recognized. Verify that the hint is spelled correctly.

```scala
spark.sql("SELECT /*+ unknownHint */ * FROM t1")
```
### Unable to find specified relation names

The relations specified in the hint can't be found. Verify that the relations are spelled correctly and accessible within the scope of the hint.

```scala
spark.sql("SELECT /*+ BROADCAST(unknownTable) */ * FROM t1 INNER JOIN t2 ON t1.str = t2.str")
```
### A hint in the query prevents another hint from being applied

The selected query contains a hint that prevents another hint from being applied.

```scala
spark.sql("SELECT /*+ BROADCAST(t1), MERGE(t1, t2) */ * FROM t1 INNER JOIN t2 ON t1.str = t2.str")
```
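One illustrative fix is to keep only the join strategy you intend, so the remaining hint can be applied:

```scala
// Dropping the conflicting BROADCAST hint lets the MERGE hint take effect.
spark.sql("SELECT /*+ MERGE(t1, t2) */ * FROM t1 INNER JOIN t2 ON t1.str = t2.str")
```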
## Enable 'spark.advise.divisionExprConvertRule.enable' to reduce rounding error propagation

This query contains an expression with the Double type. We recommend that you enable the configuration 'spark.advise.divisionExprConvertRule.enable', which can rewrite chained division expressions to reduce rounding error propagation:

```text
"t.a/t.b/t.c" converts into "t.a/(t.b * t.c)"
```
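A minimal sketch of enabling the rule for the current session; the configuration name comes from this article, while the table t and its columns are illustrative:

```scala
// Enable the division-expression rewrite for this session.
spark.conf.set("spark.advise.divisionExprConvertRule.enable", "true")

// With the rule enabled, a chained division such as t.a/t.b/t.c can be
// rewritten as t.a/(t.b * t.c), reducing rounding error propagation.
spark.sql("SELECT t.a / t.b / t.c AS ratio FROM t")
```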
## Enable 'spark.advise.nonEqJoinConvertRule.enable' to improve query performance

This query contains a time-consuming join caused by an "Or" condition within the query. We recommend that you enable the configuration 'spark.advise.nonEqJoinConvertRule.enable', which can convert the join triggered by the "Or" condition to a sort-merge join (SMJ) or broadcast hash join (BHJ) to accelerate the query.
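A minimal sketch of enabling the rule for the current session; the configuration name comes from this article, while the tables t1 and t2 are illustrative:

```scala
// Enable the non-equi-join conversion for this session.
spark.conf.set("spark.advise.nonEqJoinConvertRule.enable", "true")

// A join whose "Or" condition would otherwise force a slow join strategy.
spark.sql("SELECT * FROM t1 JOIN t2 ON t1.a = t2.a OR t1.b = t2.b")
```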
## Next steps

For more information on monitoring pipeline runs, see the [Monitor pipeline runs using Synapse Studio](how-to-monitor-pipeline-runs.md) article.

articles/synapse-analytics/toc.yml

Lines changed: 5 additions & 3 deletions
```diff
@@ -775,9 +775,9 @@ items:
   - name: Navigate the Apache Spark pool history server
     href: ./spark/apache-spark-history-server.md
   - name: Monitor Spark applications
-    href: monitoring/apache-spark-applications.md
+    href: ./monitoring/apache-spark-applications.md
   - name: Monitor Apache Spark pools
-    href: monitoring/how-to-monitor-spark-pools.md
+    href: ./monitoring/how-to-monitor-spark-pools.md
   - name: Collect Apache Spark applications metrics using APIs
     href: ./spark/connect-monitor-azure-synapse-spark-application-level-metrics.md
   - name: Monitor Apache Spark Applications metrics with Prometheus and Grafana
```
```diff
@@ -789,7 +789,9 @@ items:
   - name: Collect Apache Spark applications logs and metrics with Azure Event Hubs
     href: ./spark/azure-synapse-diagnostic-emitters-azure-eventhub.md
   - name: Manage Apache Spark configuration
-    href: ./spark/apache-spark-azure-create-spark-configuration.md
+    href: ./spark/apache-spark-azure-create-spark-configuration.md
+  - name: Apache Spark Advisor
+    href: ./monitoring/apache-spark-advisor.md
   - name: Data sources
     items:
     - name: Azure Cosmos DB Spark 3
```
