Commit 4d24dd1

Merge pull request #104125 from hrasheed-msft/hdinsight_content_improvement

adding links to supporting documents

2 parents: 3bcae41 + 6945e51

File tree

1 file changed: +12 -4 lines


articles/hdinsight/spark/apache-spark-perf.md

Lines changed: 12 additions & 4 deletions
@@ -7,12 +7,12 @@ ms.reviewer: jasonh
 ms.service: hdinsight
 ms.custom: hdinsightactive
 ms.topic: conceptual
-ms.date: 10/01/2019
+ms.date: 02/12/2020
 ---
 
 # Optimize Apache Spark jobs in HDInsight
 
-Learn how to optimize [Apache Spark](https://spark.apache.org/) cluster configuration for your particular workload. The most common challenge is memory pressure, because of improper configurations (particularly wrong-sized executors), long-running operations, and tasks that result in Cartesian operations. You can speed up jobs with appropriate caching, and by allowing for [data skew](#optimize-joins-and-shuffles). For the best performance, monitor and review long-running and resource-consuming Spark job executions.
+Learn how to optimize [Apache Spark](https://spark.apache.org/) cluster configuration for your particular workload. The most common challenge is memory pressure, because of improper configurations (particularly wrong-sized executors), long-running operations, and tasks that result in Cartesian operations. You can speed up jobs with appropriate caching, and by allowing for [data skew](#optimize-joins-and-shuffles). For the best performance, monitor and review long-running and resource-consuming Spark job executions. For information on getting started with Apache Spark on HDInsight, see [Create Apache Spark cluster using Azure portal](apache-spark-jupyter-spark-sql-use-portal.md).
 
 The following sections describe common Spark job optimizations and recommendations.

@@ -60,6 +60,8 @@ When you create a new Spark cluster, you can select Azure Blob Storage or Azure
 | Azure Data Lake Storage Gen 1| **adl:**//url/ | **Faster** | Yes | Transient cluster |
 | Local HDFS | **hdfs:**//url/ | **Fastest** | No | Interactive 24/7 cluster |
 
+For a full description of the storage options available for HDInsight clusters, see [Compare storage options for use with Azure HDInsight clusters](../hdinsight-hadoop-compare-storage-options.md).
+
 ## Use the cache
 
 Spark provides its own native caching mechanisms, which can be used through different methods such as `.persist()`, `.cache()`, and `CACHE TABLE`. This native caching is effective with small data sets as well as in ETL pipelines where you need to cache intermediate results. However, Spark native caching currently doesn't work well with partitioning, since a cached table doesn't keep the partitioning data. A more generic and reliable caching technique is *storage layer caching*.
@@ -69,7 +71,7 @@ Spark provides its own native caching mechanisms, which can be used through diff
   * Doesn't work with partitioning, which may change in future Spark releases.
 
 * Storage level caching (recommended)
-  * Can be implemented using [Alluxio](https://www.alluxio.io/).
+  * Can be implemented on HDInsight using the [IO Cache](apache-spark-improve-performance-iocache.md) feature.
   * Uses in-memory and SSD caching.
 
 * Local HDFS (recommended)
@@ -102,6 +104,8 @@ To address 'out of memory' messages, try:
 * Leverage DataFrames rather than the lower-level RDD objects.
 * Create ComplexTypes that encapsulate actions, such as "Top N", various aggregations, or windowing operations.
 
+For additional troubleshooting steps, see [OutOfMemoryError exceptions for Apache Spark in Azure HDInsight](apache-spark-troubleshoot-outofmemory.md).
+
 ## Optimize data serialization
 
 Spark jobs are distributed, so appropriate data serialization is important for the best performance. There are two serialization options for Spark:
@@ -188,7 +192,11 @@ When running concurrent queries, consider the following:
 3. Distribute queries across parallel applications.
 4. Modify size based both on trial runs and on the preceding factors such as GC overhead.
 
-Monitor your query performance for outliers or other performance issues, by looking at the timeline view, SQL graph, job statistics, and so forth. Sometimes one or a few of the executors are slower than the others, and tasks take much longer to execute. This frequently happens on larger clusters (> 30 nodes). In this case, divide the work into a larger number of tasks so the scheduler can compensate for slow tasks. For example, have at least twice as many tasks as the number of executor cores in the application. You can also enable speculative execution of tasks with `conf: spark.speculation = true`.
+For more information on using Ambari to configure executors, see [Apache Spark settings - Spark executors](apache-spark-settings.md#configuring-spark-executors).
+
+Monitor your query performance for outliers or other performance issues, by looking at the timeline view, SQL graph, job statistics, and so forth. For information on debugging Spark jobs using YARN and the Spark History server, see [Debug Apache Spark jobs running on Azure HDInsight](apache-spark-job-debugging.md). For tips on using YARN Timeline Server, see [Access Apache Hadoop YARN application logs](../hdinsight-hadoop-access-yarn-app-logs-linux.md).
+
+Sometimes one or a few of the executors are slower than the others, and tasks take much longer to execute. This frequently happens on larger clusters (> 30 nodes). In this case, divide the work into a larger number of tasks so the scheduler can compensate for slow tasks. For example, have at least twice as many tasks as the number of executor cores in the application. You can also enable speculative execution of tasks with `conf: spark.speculation = true`.
 
 ## Optimize job execution
0 commit comments
