Commit 6352c0b

Merge pull request #111789 from dagiro/freshness_c2
freshness_c2
2 parents 31eaa69 + 968e722 commit 6352c0b

File tree

1 file changed: +28 -26 lines changed

articles/hdinsight/spark/apache-spark-perf.md

Lines changed: 28 additions & 26 deletions
@@ -5,14 +5,14 @@ author: hrasheed-msft
ms.author: hrasheed
ms.reviewer: jasonh
ms.service: hdinsight
-ms.custom: hdinsightactive
ms.topic: conceptual
-ms.date: 02/12/2020
+ms.custom: hdinsightactive
+ms.date: 04/17/2020
---

# Optimize Apache Spark jobs in HDInsight

-Learn how to optimize [Apache Spark](https://spark.apache.org/) cluster configuration for your particular workload. The most common challenge is memory pressure, because of improper configurations (particularly wrong-sized executors), long-running operations, and tasks that result in Cartesian operations. You can speed up jobs with appropriate caching, and by allowing for [data skew](#optimize-joins-and-shuffles). For the best performance, monitor and review long-running and resource-consuming Spark job executions. For information on getting started with Apache Spark on HDInsight, see [Create Apache Spark cluster using Azure portal](apache-spark-jupyter-spark-sql-use-portal.md).
+Learn how to optimize Apache Spark cluster configuration for your particular workload. The most common challenge is memory pressure, because of improper configurations (such as wrong-sized executors), long-running operations, and tasks that result in Cartesian operations. You can speed up jobs with appropriate caching, and by allowing for [data skew](#optimize-joins-and-shuffles). For best performance, monitor and review long-running and resource-consuming Spark job executions. For information on getting started with Apache Spark on HDInsight, see [Create Apache Spark cluster using Azure portal](apache-spark-jupyter-spark-sql-use-portal.md).

The following sections describe common Spark job optimizations and recommendations.

@@ -50,7 +50,7 @@ The best format for performance is parquet with *snappy compression*, which is t

## Select default storage

-When you create a new Spark cluster, you can select Azure Blob Storage or Azure Data Lake Storage as your cluster's default storage. Both options give you the benefit of long-term storage for transient clusters, so your data doesn't get automatically deleted when you delete your cluster. You can recreate a transient cluster and still access your data.
+When you create a new Spark cluster, you can select Azure Blob Storage or Azure Data Lake Storage as your cluster's default storage. Both options give you the benefit of long-term storage for transient clusters. So your data doesn't get automatically deleted when you delete your cluster. You can recreate a transient cluster and still access your data.

| Store Type | File System | Speed | Transient | Use Cases |
| --- | --- | --- | --- | --- |
@@ -60,11 +60,11 @@ When you create a new Spark cluster, you can select Azure Blob Storage or Azure
| Azure Data Lake Storage Gen 1| **adl:**//url/ | **Faster** | Yes | Transient cluster |
| Local HDFS | **hdfs:**//url/ | **Fastest** | No | Interactive 24/7 cluster |
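
As a rough illustration of these URI schemes, the following minimal PySpark sketch reads the same dataset from the cluster's default storage and from an explicit Blob Storage path. The paths, storage account, and container names are hypothetical placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A relative path resolves against the cluster's default storage.
flights = spark.read.parquet("/example/data/flights")

# A fully qualified URI targets a specific store; the container and
# storage account names here are placeholders.
flights_blob = spark.read.parquet(
    "wasbs://mycontainer@myaccount.blob.core.windows.net/example/data/flights")
```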

-For a full description of the storage options available for HDInsight clusters, see [Compare storage options for use with Azure HDInsight clusters](../hdinsight-hadoop-compare-storage-options.md).
+For a full description of storage options, see [Compare storage options for use with Azure HDInsight clusters](../hdinsight-hadoop-compare-storage-options.md).

## Use the cache

-Spark provides its own native caching mechanisms, which can be used through different methods such as `.persist()`, `.cache()`, and `CACHE TABLE`. This native caching is effective with small data sets as well as in ETL pipelines where you need to cache intermediate results. However, Spark native caching currently doesn't work well with partitioning, since a cached table doesn't keep the partitioning data. A more generic and reliable caching technique is *storage layer caching*.
+Spark provides its own native caching mechanisms, which can be used through different methods such as `.persist()`, `.cache()`, and `CACHE TABLE`. This native caching is effective with small data sets and in ETL pipelines where you need to cache intermediate results. However, Spark native caching currently doesn't work well with partitioning, since a cached table doesn't keep the partitioning data. A more generic and reliable caching technique is *storage layer caching*.
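
For illustration, here's a minimal PySpark sketch of the native caching methods mentioned above (`.cache()`, `.persist()`, and `CACHE TABLE`). The dataset path, filter column, and view name are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.getOrCreate()

# Hypothetical intermediate results; replace with your own transformations.
flights = spark.read.parquet("/example/data/flights")
delayed = flights.where("DepDelay > 15")

# cache() uses the default storage level for DataFrames (MEMORY_AND_DISK).
delayed.cache()
delayed.count()  # run an action to materialize the cache

# persist() accepts an explicit storage level.
flights.persist(StorageLevel.DISK_ONLY)

# SQL equivalent on a temporary view.
delayed.createOrReplaceTempView("delayed_flights")
spark.sql("CACHE TABLE delayed_flights")
spark.sql("UNCACHE TABLE delayed_flights")

# Release the cache when the intermediate result is no longer needed.
delayed.unpersist()
```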

* Native Spark caching (not recommended)
    * Good for small datasets.
@@ -81,18 +81,18 @@ Spark provides its own native caching mechanisms, which can be used through diff

## Use memory efficiently

-Spark operates by placing data in memory, so managing memory resources is a key aspect of optimizing the execution of Spark jobs. There are several techniques you can apply to use your cluster's memory efficiently.
+Spark operates by placing data in memory. So managing memory resources is a key aspect of optimizing the execution of Spark jobs. There are several techniques you can apply to use your cluster's memory efficiently.

* Prefer smaller data partitions and account for data size, types, and distribution in your partitioning strategy.
-* Consider the newer, more efficient [Kryo data serialization](https://github.com/EsotericSoftware/kryo), rather than the default Java serialization.
+* Consider the newer, more efficient [`Kryo data serialization`](https://github.com/EsotericSoftware/kryo), rather than the default Java serialization.
* Prefer using YARN, as it separates `spark-submit` by batch.
* Monitor and tune Spark configuration settings.
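
As a minimal sketch of the partition-sizing guidance in the first bullet, the following PySpark snippet inspects and adjusts partition counts. The dataset path, column names, and partition numbers are hypothetical and should come from your own workload.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

flights = spark.read.parquet("/example/data/flights")  # hypothetical dataset

# Inspect how the data is currently split.
print(flights.rdd.getNumPartitions())

# Shuffle into more, smaller partitions keyed by a join column...
repartitioned = flights.repartition(200, "OriginAirportID")

# ...or reduce the partition count without a full shuffle after heavy filtering.
january = flights.where("Month = 1").coalesce(16)

# Partition count used by shuffles in joins and aggregations (default is 200).
spark.conf.set("spark.sql.shuffle.partitions", 200)
```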

For your reference, the Spark memory structure and some key executor memory parameters are shown in the next image.

### Spark memory considerations

-If you're using [Apache Hadoop YARN](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html), then YARN controls the maximum sum of memory used by all containers on each Spark node. The following diagram shows the key objects and their relationships.
+If you're using Apache Hadoop YARN, then YARN controls the memory used by all containers on each Spark node. The following diagram shows the key objects and their relationships.

![YARN Spark Memory Management](./media/apache-spark-perf/apache-yarn-spark-memory.png)

@@ -101,7 +101,7 @@ To address 'out of memory' messages, try:
* Review DAG Management Shuffles. Reduce by map-side reducing, pre-partition (or bucketize) source data, maximize single shuffles, and reduce the amount of data sent.
* Prefer `ReduceByKey` with its fixed memory limit to `GroupByKey`, which provides aggregations, windowing, and other functions but it has an unbounded memory limit (see the sketch after this list).
* Prefer `TreeReduce`, which does more work on the executors or partitions, to `Reduce`, which does all work on the driver.
-* Leverage DataFrames rather than the lower-level RDD objects.
+* Use DataFrames rather than the lower-level RDD objects.
* Create ComplexTypes that encapsulate actions, such as "Top N", various aggregations, or windowing operations.
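
A minimal PySpark sketch of the `ReduceByKey` versus `GroupByKey` point above (and of `TreeReduce`), using a small hypothetical pair RDD:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Hypothetical (word, count) pairs.
pairs = sc.parallelize([("spark", 1), ("yarn", 1), ("spark", 1)])

# reduceByKey combines values on the map side first, so only partial sums are
# shuffled and per-key memory stays bounded.
counts = pairs.reduceByKey(lambda a, b: a + b)

# groupByKey ships every value for a key to a single executor before any
# aggregation runs, which can exhaust memory for hot keys.
counts_grouped = pairs.groupByKey().mapValues(sum)

# treeReduce aggregates in stages on the executors instead of pulling all
# partial results to the driver the way reduce() does.
total = pairs.values().treeReduce(lambda a, b: a + b)

print(counts.collect(), total)
```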

For additional troubleshooting steps, see [OutOfMemoryError exceptions for Apache Spark in Azure HDInsight](apache-spark-troubleshoot-outofmemory.md).
@@ -111,11 +111,11 @@ For additional troubleshooting steps, see [OutOfMemoryError exceptions for Apach
Spark jobs are distributed, so appropriate data serialization is important for the best performance. There are two serialization options for Spark:

* Java serialization is the default.
-* Kryo serialization is a newer format and can result in faster and more compact serialization than Java. Kryo requires that you register the classes in your program, and it doesn't yet support all Serializable types.
+* `Kryo` serialization is a newer format and can result in faster and more compact serialization than Java. `Kryo` requires that you register the classes in your program, and it doesn't yet support all Serializable types.
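
A minimal sketch of switching to `Kryo` serialization when building a PySpark session; the registered class name is a hypothetical example and the buffer size is illustrative.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Switch the JVM-side serializer from the default Java serializer to Kryo.
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    # Register frequently shuffled JVM classes by name (hypothetical class).
    .config("spark.kryo.classesToRegister", "com.example.FlightRecord")
    # Raise the buffer if you serialize large objects.
    .config("spark.kryoserializer.buffer.max", "128m")
    .getOrCreate()
)
```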

## Use bucketing

-Bucketing is similar to data partitioning, but each bucket can hold a set of column values rather than just one. Bucketing works well for partitioning on large (in the millions or more) numbers of values, such as product identifiers. A bucket is determined by hashing the bucket key of the row. Bucketed tables offer unique optimizations because they store metadata about how they were bucketed and sorted.
+Bucketing is similar to data partitioning. But each bucket can hold a set of column values rather than just one. This method works well for partitioning on large (in the millions or more) numbers of values, such as product identifiers. A bucket is determined by hashing the bucket key of the row. Bucketed tables offer unique optimizations because they store metadata about how they were bucketed and sorted.
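
For illustration, a minimal PySpark sketch that writes a bucketed, sorted table; the source path, column name, table name, and bucket count are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

products = spark.read.parquet("/example/data/products")  # hypothetical source

# Hash the bucket key (ProductId) into 200 buckets and record the layout in
# the table metadata so later joins and aggregations can take advantage of it.
(products.write
    .bucketBy(200, "ProductId")
    .sortBy("ProductId")
    .format("parquet")
    .saveAsTable("products_bucketed"))
```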

Some advanced bucketing features are:

@@ -127,9 +127,9 @@ You can use partitioning and bucketing at the same time.

## Optimize joins and shuffles

-If you have slow jobs on a Join or Shuffle, the cause is probably *data skew*, which is asymmetry in your job data. For example, a map job may take 20 seconds, but running a job where the data is joined or shuffled takes hours. To fix data skew, you should salt the entire key, or use an *isolated salt* for only some subset of keys. If you're using an isolated salt, you should further filter to isolate your subset of salted keys in map joins. Another option is to introduce a bucket column and pre-aggregate in buckets first.
+If you have slow jobs on a Join or Shuffle, the cause is probably *data skew*. Data skew is asymmetry in your job data. For example, a map job may take 20 seconds. But running a job where the data is joined or shuffled takes hours. To fix data skew, you should salt the entire key, or use an *isolated salt* for only some subset of keys. If you're using an isolated salt, you should further filter to isolate your subset of salted keys in map joins. Another option is to introduce a bucket column and pre-aggregate in buckets first.

-Another factor causing slow joins could be the join type. By default, Spark uses the `SortMerge` join type. This type of join is best suited for large data sets, but is otherwise computationally expensive because it must first sort the left and right sides of data before merging them.
+Another factor causing slow joins could be the join type. By default, Spark uses the `SortMerge` join type. This type of join is best suited for large data sets. But it's otherwise computationally expensive because it must first sort the left and right sides of data before merging them.

A `Broadcast` join is best suited for smaller data sets, or where one side of the join is much smaller than the other side. This type of join broadcasts one side to all executors, and so requires more memory for broadcasts in general.
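
A minimal PySpark sketch of a `Broadcast` join hint on a hypothetical small dimension table; the paths, column names, and threshold value are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

flights = spark.read.parquet("/example/data/flights")    # large side (hypothetical)
airports = spark.read.parquet("/example/data/airports")  # small side (hypothetical)

# Hint Spark to ship the small table to every executor instead of
# shuffle-sorting both sides, which is the default SortMerge behavior.
joined = flights.join(broadcast(airports),
                      flights.OriginAirportID == airports.AirportID)

# Tables below this size (in bytes) are broadcast automatically; -1 disables it.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10 * 1024 * 1024)
```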

@@ -156,13 +156,15 @@ To manage parallelism for Cartesian joins, you can add nested structures, window

## Customize cluster configuration

-Depending on your Spark cluster workload, you may determine that a non-default Spark configuration would result in more optimized Spark job execution. Perform benchmark testing with sample workloads to validate any non-default cluster configurations.
+Depending on your Spark cluster workload, you may determine that a non-default Spark configuration would result in more optimized Spark job execution. Do benchmark testing with sample workloads to validate any non-default cluster configurations.

Here are some common parameters you can adjust:

-* `--num-executors` sets the appropriate number of executors.
-* `--executor-cores` sets the number of cores for each executor. Typically you should have middle-sized executors, as other processes consume some of the available memory.
-* `--executor-memory` sets the memory size for each executor, which controls the heap size on YARN. You should leave some memory for execution overhead.
+|Parameter |Description |
+|---|---|
+|`--num-executors`|Sets the appropriate number of executors.|
+|`--executor-cores`|Sets the number of cores for each executor. Typically you should have middle-sized executors, as other processes consume some of the available memory.|
+|`--executor-memory`|Sets the memory size for each executor, which controls the heap size on YARN. Leave some memory for execution overhead.|
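
As a rough illustration, the `spark-submit` flags above correspond to the following Spark configuration properties when building a session programmatically; the values shown are placeholders to be tuned through benchmarking.

```python
from pyspark.sql import SparkSession

# Illustrative values only; derive real numbers from your own workload tests.
spark = (
    SparkSession.builder
    .config("spark.executor.instances", "6")   # --num-executors
    .config("spark.executor.cores", "4")       # --executor-cores
    .config("spark.executor.memory", "25g")    # --executor-memory (YARN heap per executor)
    .getOrCreate()
)
```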

### Select the correct executor size

@@ -177,15 +179,15 @@ When deciding your executor configuration, consider the Java garbage collection
2. Reduce the number of open connections between executors (N2) on larger clusters (>100 executors).
3. Increase heap size to accommodate for memory-intensive tasks.
4. Optional: Reduce per-executor memory overhead.
-5. Optional: Increase utilization and concurrency by oversubscribing CPU.
+5. Optional: Increase usage and concurrency by oversubscribing CPU.

-As a general rule of thumb when selecting the executor size:
+As a general rule, when selecting the executor size:

1. Start with 30 GB per executor and distribute available machine cores.
2. Increase the number of executor cores for larger clusters (> 100 executors).
3. Modify size based both on trial runs and on the preceding factors such as GC overhead.

-When running concurrent queries, consider the following:
+When running concurrent queries, consider:

1. Start with 30 GB per executor and all machine cores.
2. Create multiple parallel Spark applications by oversubscribing CPU (around 30% latency improvement).
@@ -194,9 +196,9 @@ When running concurrent queries, consider the following:

For more information on using Ambari to configure executors, see [Apache Spark settings - Spark executors](apache-spark-settings.md#configuring-spark-executors).

-Monitor your query performance for outliers or other performance issues, by looking at the timeline view, SQL graph, job statistics, and so forth. For information on debugging Spark jobs using YARN and the Spark History server, see [Debug Apache Spark jobs running on Azure HDInsight](apache-spark-job-debugging.md). For tips on using YARN Timeline Server, see [Access Apache Hadoop YARN application logs](../hdinsight-hadoop-access-yarn-app-logs-linux.md).
+Monitor query performance for outliers or other performance issues by looking at the timeline view, SQL graph, job statistics, and so forth. For information on debugging Spark jobs using YARN and the Spark History server, see [Debug Apache Spark jobs running on Azure HDInsight](apache-spark-job-debugging.md). For tips on using YARN Timeline Server, see [Access Apache Hadoop YARN application logs](../hdinsight-hadoop-access-yarn-app-logs-linux.md).

-Sometimes one or a few of the executors are slower than the others, and tasks take much longer to execute. This frequently happens on larger clusters (> 30 nodes). In this case, divide the work into a larger number of tasks so the scheduler can compensate for slow tasks. For example, have at least twice as many tasks as the number of executor cores in the application. You can also enable speculative execution of tasks with `conf: spark.speculation = true`.
+Sometimes one or a few of the executors are slower than the others, and tasks take much longer to execute. This slowness frequently happens on larger clusters (> 30 nodes). In this case, divide the work into a larger number of tasks so the scheduler can compensate for slow tasks. For example, have at least twice as many tasks as the number of executor cores in the application. You can also enable speculative execution of tasks with `conf: spark.speculation = true`.
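
A minimal sketch of enabling speculative execution and raising the task count as described above; the partition counts and dataset path are illustrative.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Re-launch suspiciously slow tasks on another executor.
    .config("spark.speculation", "true")
    .getOrCreate()
)

# Aim for at least 2x as many tasks as total executor cores; 400 is
# illustrative for a cluster with roughly 200 cores.
spark.conf.set("spark.sql.shuffle.partitions", 400)

flights = spark.read.parquet("/example/data/flights")  # hypothetical dataset
flights = flights.repartition(400)  # more, smaller tasks for the wide stages
```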

## Optimize job execution

@@ -206,7 +208,7 @@ Sometimes one or a few of the executors are slower than the others, and tasks ta

Monitor your running jobs regularly for performance issues. If you need more insight into certain issues, consider one of the following performance profiling tools:

-* [Intel PAL Tool](https://github.com/intel-hadoop/PAT) monitors CPU, storage, and network bandwidth utilization.
+* [Intel PAL Tool](https://github.com/intel-hadoop/PAT) monitors CPU, storage, and network bandwidth usage.
* [Oracle Java 8 Mission Control](https://www.oracle.com/technetwork/java/javaseproducts/mission-control/java-mission-control-1998576.html) profiles Spark and executor code.

Key to Spark 2.x query performance is the Tungsten engine, which depends on whole-stage code generation. In some cases, whole-stage code generation may be disabled. For example, if you use a non-mutable type (`string`) in the aggregation expression, `SortAggregate` appears instead of `HashAggregate`. For better performance, try the following and then re-enable code generation:
@@ -219,7 +221,7 @@ MAX(AMOUNT) -> MAX(cast(AMOUNT as DOUBLE))
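
A minimal PySpark sketch of the cast workaround shown above, using a hypothetical `sales` table whose `AMOUNT` column is stored as a string; `explain()` lets you confirm whether `HashAggregate` or `SortAggregate` appears in the plan.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical table where AMOUNT is stored as a string.
sales = spark.read.parquet("/example/data/sales")
sales.createOrReplaceTempView("sales")

# Aggregating the string column falls back to SortAggregate.
spark.sql("SELECT MAX(AMOUNT) FROM sales").explain()

# Casting to a mutable numeric type lets Spark use HashAggregate instead.
spark.sql("SELECT MAX(CAST(AMOUNT AS DOUBLE)) FROM sales").explain()
```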

* [Debug Apache Spark jobs running on Azure HDInsight](apache-spark-job-debugging.md)
* [Manage resources for an Apache Spark cluster on HDInsight](apache-spark-resource-manager.md)
-* [Use the Apache Spark REST API to submit remote jobs to an Apache Spark cluster](apache-spark-livy-rest-interface.md)
+* [Configure Apache Spark settings](apache-spark-settings.md)
* [Tuning Apache Spark](https://spark.apache.org/docs/latest/tuning.html)
* [How to Actually Tune Your Apache Spark Jobs So They Work](https://www.slideshare.net/ilganeli/how-to-actually-tune-your-spark-jobs-so-they-work)
-* [Kryo Serialization](https://github.com/EsotericSoftware/kryo)
+* [`Kryo Serialization`](https://github.com/EsotericSoftware/kryo)
