Commit ee80364

Merge pull request #88505 from dagiro/cats92
cats92
2 parents fe9dde3 + 5060325 commit ee80364

File tree

6 files changed (+13 −12 lines changed)


articles/hdinsight/spark/apache-spark-settings.md

Lines changed: 13 additions & 12 deletions
@@ -9,6 +9,7 @@ ms.custom: hdinsightactive
 ms.topic: conceptual
 ms.date: 06/17/2019
 ---
+
 # Configure Apache Spark settings
 
 An HDInsight Spark cluster includes an installation of the [Apache Spark](https://spark.apache.org/) library. Each HDInsight cluster includes default configuration parameters for all its installed services, including Spark. A key aspect of managing an HDInsight Apache Hadoop cluster is monitoring workload, including Spark jobs, to make sure the jobs are running in a predictable manner. To best run Spark jobs, consider the physical cluster configuration when determining how to optimize the cluster's logical configuration.
@@ -38,11 +39,11 @@ Apache Spark has three system configuration locations:
 When you select a particular version of Spark, your cluster includes the default configuration settings. You can change the default Spark configuration values by using a custom Spark configuration file. An example is shown below.
 
 ```
-spark.hadoop.io.compression.codecs org.apache.hadoop.io.compress.GzipCodec
-spark.hadoop.mapreduce.input.fileinputformat.split.minsize 1099511627776
-spark.hadoop.parquet.block.size 1099511627776
-spark.sql.files.maxPartitionBytes 1099511627776
-spark.sql.files.openCostInBytes 1099511627776
+spark.hadoop.io.compression.codecs org.apache.hadoop.io.compress.GzipCodec
+spark.hadoop.mapreduce.input.fileinputformat.split.minsize 1099511627776
+spark.hadoop.parquet.block.size 1099511627776
+spark.sql.files.maxPartitionBytes 1099511627776
+spark.sql.files.openCostInBytes 1099511627776
 ```
 
 The example shown above overrides the default values for five Spark configuration parameters: the compression codec, the Apache Hadoop MapReduce split minimum size, the Parquet block size, and the Spark SQL partition and open file size defaults. These configuration changes are chosen because the associated data and jobs (in this example, genomic data) have particular characteristics, which perform better with these custom configuration settings.
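As an aside (an editor's sketch, not part of this diff or the article): the same five overrides can also be applied per application through the SparkSession builder, for example from PySpark, instead of through a cluster-wide custom configuration file. The application name is a placeholder.

```
from pyspark.sql import SparkSession

# Sketch only: apply the five overrides from the example above per application.
spark = (
    SparkSession.builder
    .appName("genomics-tuned")  # placeholder name
    .config("spark.hadoop.io.compression.codecs",
            "org.apache.hadoop.io.compress.GzipCodec")
    .config("spark.hadoop.mapreduce.input.fileinputformat.split.minsize",
            "1099511627776")
    .config("spark.hadoop.parquet.block.size", "1099511627776")
    .config("spark.sql.files.maxPartitionBytes", "1099511627776")
    .config("spark.sql.files.openCostInBytes", "1099511627776")
    .getOrCreate()
)
```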
@@ -57,7 +58,7 @@ The Apache Ambari Web UI appears, with a dashboard view of key cluster resource
 
 To see configuration values for Apache Spark, select **Config History**, then select **Spark2**. Select the **Configs** tab, then select the `Spark` (or `Spark2`, depending on your version) link in the service list. You see a list of configuration values for your cluster:
 
-![Spark Configurations](./media/apache-spark-settings/spark-config.png)
+![Spark Configurations](./media/apache-spark-settings/spark-configurations.png)
 
 To see and change individual Spark configuration values, select any link with the word "spark" in the link title. Configurations for Spark include both custom and advanced configuration values in these categories:
 
@@ -76,7 +77,7 @@ If you create a non-default set of configuration values, then you can also see t
 
 The following diagram shows key Spark objects: the driver program and its associated Spark Context, and the cluster manager and its *n* worker nodes. Each worker node includes an Executor, a cache, and *n* task instances.
 
-![Cluster objects](./media/apache-spark-settings/spark-arch.png)
+![Cluster objects](./media/apache-spark-settings/hdi-spark-architecture.png)
 
 Spark jobs use worker resources, particularly memory, so it's common to adjust Spark configuration values for worker node Executors.
 
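As an illustration of that executor tuning (an editor's sketch, not in the diff; `spark.executor.memory`, `spark.executor.cores`, and `spark.executor.instances` are standard Spark configuration keys, and the values shown are placeholders):

```
from pyspark.sql import SparkSession

# Sketch only: per-application executor sizing with illustrative values.
spark = (
    SparkSession.builder
    .config("spark.executor.memory", "4g")     # heap per executor
    .config("spark.executor.cores", "2")       # concurrent tasks per executor
    .config("spark.executor.instances", "10")  # executors for this application
    .getOrCreate()
)
```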
@@ -87,7 +88,7 @@ Three key parameters that are often adjusted to tune Spark configurations to imp
 
 Another source of information about the resources being used by the Spark Executors is the Spark Application UI. In the Spark UI, select the **Executors** tab to display Summary and Detail views of the configuration and resources consumed by the executors. These views can help you determine whether to change default values for Spark executors for the entire cluster, or a particular set of job executions.
 
-![Spark Executors](./media/apache-spark-settings/spark-executors.png)
+![Spark Executors](./media/apache-spark-settings/apache-spark-executors.png)
 
 Alternatively, you can use the Ambari REST API to programmatically verify HDInsight and Spark cluster configuration settings. More information is available at the [Apache Ambari API reference on GitHub](https://github.com/apache/ambari/blob/trunk/ambari-server/docs/api/v1/index.md).
 
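A minimal sketch of that programmatic check (an editor's illustration, not from the article; `CLUSTERNAME`, `USER`, and `PASSWORD` are placeholders, and `spark2-defaults` is assumed to be the configuration type of interest):

```
import requests

cluster = "CLUSTERNAME"  # placeholder HDInsight cluster name
base = f"https://{cluster}.azurehdinsight.net/api/v1/clusters/{cluster}"
auth = ("USER", "PASSWORD")  # placeholder Ambari admin credentials

# Look up the tag of the currently active spark2-defaults configuration.
desired = requests.get(f"{base}?fields=Clusters/desired_configs", auth=auth).json()
tag = desired["Clusters"]["desired_configs"]["spark2-defaults"]["tag"]

# Fetch and print the properties of that configuration version.
config = requests.get(f"{base}/configurations",
                      params={"type": "spark2-defaults", "tag": tag},
                      auth=auth).json()
for key, value in sorted(config["items"][0]["properties"].items()):
    print(key, "=", value)
```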
@@ -99,7 +100,7 @@ Depending on your Spark workload, you may determine that a non-default Spark con
 
 Here is an example of two worker nodes with different configuration values:
 
-![Two node configurations](./media/apache-spark-settings/executor-config.png)
+![Two node configurations](./media/apache-spark-settings/executor-configuration.png)
 
 The following list shows key Spark executor memory parameters.
 
@@ -110,7 +111,7 @@ The following list shows key Spark executor memory parameters.
 
 YARN controls the maximum sum of memory used by the containers on each Spark node. The following diagram shows the per-node relationships between YARN configuration objects and Spark objects.
 
-![YARN Spark Memory Management](./media/apache-spark-settings/yarn-spark-memory.png)
+![YARN Spark Memory Management](./media/apache-spark-settings/hdi-yarn-spark-memory.png)
 
 ## Change parameters for an application running in Jupyter notebook
 
@@ -130,8 +131,8 @@ For applications running in the Jupyter notebook, use the `%%configure` command
 The code below shows how to change the configuration for an application running in a Jupyter notebook.
 
 ```
-%%configure
-{"executorMemory": "3072M", "executorCores": 4, "numExecutors":10}
+%%configure
+{"executorMemory": "3072M", "executorCores": 4, "numExecutors":10}
 ```
 
 ## Conclusion
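For reference (an editor's sketch, not part of the diff): `%%configure` hands these settings to Livy when the notebook session is created, and roughly the same request can be made directly against the cluster's Livy endpoint. `CLUSTERNAME` and the credentials are placeholders.

```
import requests

cluster = "CLUSTERNAME"  # placeholder HDInsight cluster name
url = f"https://{cluster}.azurehdinsight.net/livy/sessions"
auth = ("USER", "PASSWORD")  # placeholder cluster login

# Ask Livy for a PySpark session sized like the %%configure example above.
payload = {
    "kind": "pyspark",
    "executorMemory": "3072M",
    "executorCores": 4,
    "numExecutors": 10,
}
resp = requests.post(url, json=payload, auth=auth,
                     headers={"X-Requested-By": "admin"})
print(resp.json())
```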
