articles/hdinsight/spark/apache-spark-perf.md (+12 −4)
```diff
@@ -7,12 +7,12 @@ ms.reviewer: jasonh
 ms.service: hdinsight
 ms.custom: hdinsightactive
 ms.topic: conceptual
-ms.date: 10/01/2019
+ms.date: 02/12/2020
 ---

 # Optimize Apache Spark jobs in HDInsight

-Learn how to optimize [Apache Spark](https://spark.apache.org/) cluster configuration for your particular workload. The most common challenge is memory pressure, because of improper configurations (particularly wrong-sized executors), long-running operations, and tasks that result in Cartesian operations. You can speed up jobs with appropriate caching, and by allowing for [data skew](#optimize-joins-and-shuffles). For the best performance, monitor and review long-running and resource-consuming Spark job executions.
+Learn how to optimize [Apache Spark](https://spark.apache.org/) cluster configuration for your particular workload. The most common challenge is memory pressure, because of improper configurations (particularly wrong-sized executors), long-running operations, and tasks that result in Cartesian operations. You can speed up jobs with appropriate caching, and by allowing for [data skew](#optimize-joins-and-shuffles). For the best performance, monitor and review long-running and resource-consuming Spark job executions. For information on getting started with Apache Spark on HDInsight, see [Create Apache Spark cluster using Azure portal](apache-spark-jupyter-spark-sql-use-portal.md).

 The following sections describe common Spark job optimizations and recommendations.
```
```diff
@@ -60,6 +60,8 @@ When you create a new Spark cluster, you can select Azure Blob Storage or Azure
 | Azure Data Lake Storage Gen 1|**adl:**//url/ |**Faster**| Yes | Transient cluster |
 | Local HDFS |**hdfs:**//url/ |**Fastest**| No | Interactive 24/7 cluster |

+For a full description of the storage options available for HDInsight clusters, see [Compare storage options for use with Azure HDInsight clusters](../hdinsight-hadoop-compare-storage-options.md).
+
 ## Use the cache

 Spark provides its own native caching mechanisms, which can be used through different methods such as `.persist()`, `.cache()`, and `CACHE TABLE`. This native caching is effective with small data sets as well as in ETL pipelines where you need to cache intermediate results. However, Spark native caching currently doesn't work well with partitioning, since a cached table doesn't keep the partitioning data. A more generic and reliable caching technique is *storage layer caching*.
```
```diff
@@ -69,7 +71,7 @@ Spark provides its own native caching mechanisms, which can be used through diff
 * Doesn't work with partitioning, which may change in future Spark releases.

 * Storage level caching (recommended)
-  * Can be implemented using [Alluxio](https://www.alluxio.io/).
+  * Can be implemented on HDInsight using the [IO Cache](apache-spark-improve-performance-iocache.md) feature.
   * Uses in-memory and SSD caching.

 * Local HDFS (recommended)
```
```diff
@@ -102,6 +104,8 @@ To address 'out of memory' messages, try:
 * Leverage DataFrames rather than the lower-level RDD objects.
 * Create ComplexTypes that encapsulate actions, such as "Top N", various aggregations, or windowing operations.

+For additional troubleshooting steps, see [OutOfMemoryError exceptions for Apache Spark in Azure HDInsight](apache-spark-troubleshoot-outofmemory.md).
+
 ## Optimize data serialization

 Spark jobs are distributed, so appropriate data serialization is important for the best performance. There are two serialization options for Spark:
```
```diff
@@ -188,7 +192,11 @@ When running concurrent queries, consider the following:
 3. Distribute queries across parallel applications.
 4. Modify size based both on trial runs and on the preceding factors such as GC overhead.

-Monitor your query performance for outliers or other performance issues, by looking at the timeline view, SQL graph, job statistics, and so forth. Sometimes one or a few of the executors are slower than the others, and tasks take much longer to execute. This frequently happens on larger clusters (> 30 nodes). In this case, divide the work into a larger number of tasks so the scheduler can compensate for slow tasks. For example, have at least twice as many tasks as the number of executor cores in the application. You can also enable speculative execution of tasks with `conf: spark.speculation = true`.
+For more information on using Ambari to configure executors, see [Apache Spark settings - Spark executors](apache-spark-settings.md#configuring-spark-executors).
+
+Monitor your query performance for outliers or other performance issues, by looking at the timeline view, SQL graph, job statistics, and so forth. For information on debugging Spark jobs using YARN and the Spark History server, see [Debug Apache Spark jobs running on Azure HDInsight](apache-spark-job-debugging.md). For tips on using YARN Timeline Server, see [Access Apache Hadoop YARN application logs](../hdinsight-hadoop-access-yarn-app-logs-linux.md).
+
+Sometimes one or a few of the executors are slower than the others, and tasks take much longer to execute. This frequently happens on larger clusters (> 30 nodes). In this case, divide the work into a larger number of tasks so the scheduler can compensate for slow tasks. For example, have at least twice as many tasks as the number of executor cores in the application. You can also enable speculative execution of tasks with `conf: spark.speculation = true`.
```