Commit bd0fbc0

freshness58
1 parent 25b928d commit bd0fbc0

articles/hdinsight/spark/apache-spark-settings.md

Lines changed: 33 additions & 28 deletions
@@ -5,28 +5,27 @@ author: hrasheed-msft
ms.author: hrasheed
ms.reviewer: jasonh
ms.service: hdinsight
-ms.custom: hdinsightactive
ms.topic: conceptual
-ms.date: 06/17/2019
+ms.custom: hdinsightactive
+ms.date: 04/15/2020
---

# Configure Apache Spark settings

-An HDInsight Spark cluster includes an installation of the [Apache Spark](https://spark.apache.org/) library. Each HDInsight cluster includes default configuration parameters for all its installed services, including Spark. A key aspect of managing an HDInsight Apache Hadoop cluster is monitoring workload, including Spark Jobs, to make sure the jobs are running in a predictable manner. To best run Spark jobs, consider the physical cluster configuration when determining how to optimize the cluster's logical configuration.
+An HDInsight Spark cluster includes an installation of the [Apache Spark](https://spark.apache.org/) library. Each HDInsight cluster includes default configuration parameters for all its installed services, including Spark. A key aspect of managing an HDInsight Apache Hadoop cluster is monitoring workload, including Spark jobs. To best run Spark jobs, consider the physical cluster configuration when determining the cluster's logical configuration.

The default HDInsight Apache Spark cluster includes the following nodes: three [Apache ZooKeeper](https://zookeeper.apache.org/) nodes, two head nodes, and one or more worker nodes:

![Spark HDInsight Architecture](./media/apache-spark-settings/spark-hdinsight-arch.png)

-The number of VMs, and the VM sizes, for the nodes in your HDInsight cluster can also affect your Spark configuration. Non-default HDInsight configuration values often require non-default Spark configuration values. When you create an HDInsight Spark cluster, you are shown suggested VM sizes for each of the components. Currently the [Memory-optimized Linux VM sizes](../../virtual-machines/linux/sizes-memory.md) for Azure are D12 v2 or greater.
+The number and sizes of the VMs for the nodes in your HDInsight cluster can affect your Spark configuration. Non-default HDInsight configuration values often require non-default Spark configuration values. When you create an HDInsight Spark cluster, you're shown suggested VM sizes for each of the components. Currently the [Memory-optimized Linux VM sizes](../../virtual-machines/linux/sizes-memory.md) for Azure are D12 v2 or greater.

## Apache Spark versions

Use the best Spark version for your cluster. The HDInsight service includes several versions of both Spark and HDInsight itself. Each version of Spark includes a set of default cluster settings.

When you create a new cluster, there are multiple Spark versions to choose from. For the full list, see [HDInsight Components and Versions](https://docs.microsoft.com/azure/hdinsight/hdinsight-component-versioning).

-
> [!NOTE]
> The default version of Apache Spark in the HDInsight service may change without notice. If you have a version dependency, Microsoft recommends that you specify that particular version when you create clusters using .NET SDK, Azure PowerShell, and Azure Classic CLI.
@@ -46,29 +45,29 @@ spark.sql.files.maxPartitionBytes 1099511627776
spark.sql.files.openCostInBytes 1099511627776
```

-The example shown above overrides several default values for five Spark configuration parameters. These are the compression codec, Apache Hadoop MapReduce split minimum size and parquet block sizes, and also the Spar SQL partition and open file sizes default values. These configuration changes are chosen because the associated data and jobs (in this example, genomic data) have particular characteristics, which will perform better using these custom configuration settings.
+The example shown above overrides several default values for five Spark configuration parameters: the compression codec, the Apache Hadoop MapReduce split minimum size, the Parquet block size, and the Spark SQL partition and open file size defaults. These configuration changes are chosen because the associated data and jobs (in this example, genomic data) have particular characteristics, which perform better with these custom configuration settings.
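After you apply overrides like these and restart the affected services, you can confirm from any Spark application that a setting is active. A minimal sketch, assuming a PySpark session on the cluster and using one of the keys from the example above:

```python
from pyspark.sql import SparkSession

# Attach to (or create) a Spark session on the cluster.
spark = SparkSession.builder.getOrCreate()

# Read back a value overridden in spark-defaults to confirm it took effect.
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))
```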

---

## View cluster configuration settings

-Verify the current HDInsight cluster configuration settings before you perform performance optimization on the cluster. Launch the HDInsight Dashboard from the Azure portal by clicking the **Dashboard** link on the Spark cluster pane. Sign in with the cluster administrator's username and password.
+Verify the current HDInsight cluster configuration settings before you do performance optimization on the cluster. Launch the HDInsight Dashboard from the Azure portal by clicking the **Dashboard** link on the Spark cluster pane. Sign in with the cluster administrator's username and password.

-The Apache Ambari Web UI appears, with a dashboard view of key cluster resource utilization metrics. The Ambari Dashboard shows you the Apache Spark configuration, and other services that you have installed. The Dashboard includes a **Config History** tab, where you can view configuration information for all installed services, including Spark.
+The Apache Ambari Web UI appears, with a dashboard of key cluster resource usage metrics. The Ambari Dashboard shows you the Apache Spark configuration and other installed services. The Dashboard includes a **Config History** tab, where you can view configuration information for installed services, including Spark.

To see configuration values for Apache Spark, select **Config History**, then select **Spark2**. Select the **Configs** tab, then select the `Spark` (or `Spark2`, depending on your version) link in the service list. You see a list of configuration values for your cluster:

![Spark Configurations](./media/apache-spark-settings/spark-configurations.png)

-To see and change individual Spark configuration values, select any link with the word "spark" in the link title. Configurations for Spark include both custom and advanced configuration values in these categories:
+To see and change individual Spark configuration values, select any link with "spark" in the title. Configurations for Spark include both custom and advanced configuration values in these categories:

* Custom Spark2-defaults
* Custom Spark2-metrics-properties
* Advanced Spark2-defaults
* Advanced Spark2-env
* Advanced spark2-hive-site-override

-If you create a non-default set of configuration values, then you can also see the history of your configuration updates. This configuration history can be helpful to see which non-default configuration has optimal performance.
+If you create a non-default set of configuration values, your update history is visible. This configuration history can be helpful to see which non-default configuration has optimal performance.

> [!NOTE]
> To see, but not change, common Spark cluster configuration settings, select the **Environment** tab on the top-level **Spark Job UI** interface.
@@ -81,33 +80,37 @@ The following diagram shows key Spark objects: the driver program and its associ

Spark jobs use worker resources, particularly memory, so it's common to adjust Spark configuration values for worker node Executors.

-Three key parameters that are often adjusted to tune Spark configurations to improve application requirements are `spark.executor.instances`, `spark.executor.cores`, and `spark.executor.memory`. An Executor is a process launched for a Spark application. An Executor runs on the worker node and is responsible for the tasks for the application. For each cluster, the default number of executors, and the executor sizes, is calculated based on the number of worker nodes and the worker node size. These are stored in `spark-defaults.conf` on the cluster head nodes. You can edit these values in a running cluster by selecting the **Custom spark-defaults** link in the Ambari web UI. After you make changes, you're prompted by the UI to **Restart** all the affected services.
+Three key parameters that are often adjusted to tune Spark configurations to improve application requirements are `spark.executor.instances`, `spark.executor.cores`, and `spark.executor.memory`. An Executor is a process launched for a Spark application. An Executor runs on the worker node and is responsible for the tasks for the application. The number of worker nodes and the worker node size determine the default number of executors and the executor sizes. These values are stored in `spark-defaults.conf` on the cluster head nodes. You can edit these values in a running cluster by selecting **Custom spark-defaults** in the Ambari web UI. After you make changes, you're prompted by the UI to **Restart** all the affected services.

> [!NOTE]
> These three configuration parameters can be configured at the cluster level (for all applications that run on the cluster) and also specified for each individual application.
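A minimal sketch of setting these values for an individual application rather than at the cluster level, assuming PySpark and illustrative placeholder values (not tuning recommendations):

```python
from pyspark.sql import SparkSession

# Per-application executor sizing; the values are placeholders, not recommendations.
# These settings must be supplied before the application's SparkContext starts.
spark = (
    SparkSession.builder
    .appName("executor-sizing-example")
    .config("spark.executor.instances", "4")
    .config("spark.executor.cores", "2")
    .config("spark.executor.memory", "4g")
    .getOrCreate()
)
```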

-Another source of information about the resources being used by the Spark Executors is the Spark Application UI. In the Spark UI, select the **Executors** tab to display Summary and Detail views of the configuration and resources consumed by the executors. These views can help you determine whether to change default values for Spark executors for the entire cluster, or a particular set of job executions.
+Another source of information about resources used by Spark Executors is the Spark Application UI. In the Spark UI, select the **Executors** tab to display Summary and Detail views of the configuration and resources consumed by the executors. These views help you determine whether to change executor values for the entire cluster or for a particular set of job executions.

![Spark Executors](./media/apache-spark-settings/apache-spark-executors.png)

-Alternatively, you can use the Ambari REST API to programmatically verify HDInsight and Spark cluster configuration settings. More information is available at the [Apache Ambari API reference on GitHub](https://github.com/apache/ambari/blob/trunk/ambari-server/docs/api/v1/index.md).
+Or you can use the Ambari REST API to programmatically verify HDInsight and Spark cluster configuration settings. More information is available at the [Apache Ambari API reference on GitHub](https://github.com/apache/ambari/blob/trunk/ambari-server/docs/api/v1/index.md).
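A minimal sketch of such a programmatic check, assuming Python with the `requests` package, a hypothetical cluster name, and placeholder credentials:

```python
import requests

# Hypothetical cluster name and placeholder credentials; replace with your own.
cluster = "mysparkcluster"
base = f"https://{cluster}.azurehdinsight.net/api/v1/clusters/{cluster}"
auth = ("admin", "PASSWORD")

# Look up the tag of the currently desired spark2-defaults configuration.
desired = requests.get(base, params={"fields": "Clusters/desired_configs"}, auth=auth).json()
tag = desired["Clusters"]["desired_configs"]["spark2-defaults"]["tag"]

# Fetch that configuration version and print the executor-related values.
configs = requests.get(
    f"{base}/configurations", params={"type": "spark2-defaults", "tag": tag}, auth=auth
).json()
properties = configs["items"][0]["properties"]
for key in ("spark.executor.instances", "spark.executor.cores", "spark.executor.memory"):
    print(key, properties.get(key))
```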

-Depending on your Spark workload, you may determine that a non-default Spark configuration provides more optimized Spark job executions. You should perform benchmark testing with sample workloads to validate any non-default cluster configurations. Some of the common parameters that you may consider adjusting are:
+Depending on your Spark workload, you may determine that a non-default Spark configuration provides more optimized Spark job executions. Do benchmark testing with sample workloads to validate any non-default cluster configurations. Some of the common parameters that you may consider adjusting are shown in the following table (an example follows the table):

-* `--num-executors` sets the number of executors.
-* `--executor-cores` sets the number of cores for each executor. We recommend using middle-sized executors, as other processes also consume some portion of the available memory.
-* `--executor-memory` controls the memory size (heap size) of each executor on [Apache Hadoop YARN](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html), and you'll need to leave some memory for execution overhead.
+|Parameter |Description|
+|---|---|
+|`--num-executors`|Sets the number of executors.|
+|`--executor-cores`|Sets the number of cores for each executor. We recommend using middle-sized executors, as other processes also consume some portion of the available memory.|
+|`--executor-memory`|Controls the memory size (heap size) of each executor on [Apache Hadoop YARN](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html), and you'll need to leave some memory for execution overhead.|
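These flags are `spark-submit` options; when you submit a remote batch job through Apache Livy (introduced later in the article), the equivalent fields can be passed in the request body. A minimal sketch, assuming Python with `requests`, a hypothetical cluster, placeholder credentials, and a placeholder application path:

```python
import json
import requests

# Hypothetical cluster, placeholder credentials, and placeholder application path.
cluster = "mysparkcluster"
livy_url = f"https://{cluster}.azurehdinsight.net/livy/batches"
auth = ("admin", "PASSWORD")

# Livy batch request using the REST equivalents of the spark-submit flags above.
batch = {
    "file": "wasbs:///example/jars/my-spark-app.jar",  # placeholder path
    "className": "com.example.MyApp",                  # placeholder class
    "numExecutors": 4,
    "executorCores": 2,
    "executorMemory": "4g",
}

response = requests.post(
    livy_url,
    data=json.dumps(batch),
    headers={"Content-Type": "application/json"},
    auth=auth,
)
print(response.status_code, response.json())
```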

Here is an example of two worker nodes with different configuration values:

![Two node configurations](./media/apache-spark-settings/executor-configuration.png)

The following table shows key Spark executor memory parameters.

-* `spark.executor.memory` defines the total amount of memory available for an executor.
-* `spark.storage.memoryFraction` (default ~60%) defines the amount of memory available for storing persisted RDDs.
-* `spark.shuffle.memoryFraction` (default ~20%) defines the amount of memory reserved for shuffle.
-* `spark.storage.unrollFraction` and `spark.storage.safetyFraction` (totaling ~30% of total memory) - these values are used internally by Spark and shouldn't be changed.
+|Parameter |Description|
+|---|---|
+|`spark.executor.memory`|Defines the total amount of memory available for an executor.|
+|`spark.storage.memoryFraction`|Defines the amount of memory available for storing persisted RDDs (default ~60%).|
+|`spark.shuffle.memoryFraction`|Defines the amount of memory reserved for shuffle (default ~20%).|
+|`spark.storage.unrollFraction` and `spark.storage.safetyFraction`|Used internally by Spark and shouldn't be changed (totaling ~30% of total memory).|

YARN controls the maximum sum of memory used by the containers on each Spark node. The following diagram shows the per-node relationships between YARN configuration objects and Spark objects.

@@ -117,13 +120,15 @@ YARN controls the maximum sum of memory used by the containers on each Spark nod

Spark clusters in HDInsight include a number of components by default. Each of these components includes default configuration values, which can be overridden as needed.

-* Spark Core - Spark Core, Spark SQL, Spark streaming APIs, GraphX, and Apache Spark MLlib.
-* Anaconda - a python package manager.
-* [Apache Livy](https://livy.incubator.apache.org/) - the Apache Spark REST API, used to submit remote jobs to an HDInsight Spark cluster.
-* [Jupyter](https://jupyter.org/) and [Apache Zeppelin](https://zeppelin.apache.org/) notebooks - interactive browser-based UI for interacting with your Spark cluster.
-* ODBC driver - connects Spark clusters in HDInsight to business intelligence (BI) tools such as Microsoft Power BI and Tableau.
+|Component |Description|
+|---|---|
+|Spark Core|Spark Core, Spark SQL, Spark streaming APIs, GraphX, and Apache Spark MLlib.|
+|Anaconda|A Python package manager.|
+|[Apache Livy](https://livy.incubator.apache.org/)|The Apache Spark REST API, used to submit remote jobs to an HDInsight Spark cluster.|
+|[Jupyter](https://jupyter.org/) and [Apache Zeppelin](https://zeppelin.apache.org/) notebooks|Interactive browser-based UIs for interacting with your Spark cluster.|
+|ODBC driver|Connects Spark clusters in HDInsight to business intelligence (BI) tools such as Microsoft Power BI and Tableau.|

-For applications running in the Jupyter notebook, use the `%%configure` command to make configuration changes from within the notebook itself. These configuration changes will be applied to the Spark jobs run from your notebook instance. You should make such changes at the beginning of the application, before you run your first code cell. The changed configuration is applied to the Livy session when it gets created.
+For applications running in the Jupyter notebook, use the `%%configure` command to make configuration changes from within the notebook itself. These configuration changes are applied to the Spark jobs run from your notebook instance. Make such changes at the beginning of the application, before you run your first code cell. The changed configuration is applied to the Livy session when it gets created.

> [!NOTE]
> To change the configuration at a later stage in the application, use the `-f` (force) parameter. However, all progress in the application will be lost.
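The article's own `%%configure` example falls between these diff hunks and isn't shown here. A minimal sketch of such a notebook cell, with hypothetical values, run before the first code cell as described above:

```
%%configure
{ "executorMemory": "4g", "executorCores": 2, "numExecutors": 4 }
```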
@@ -137,7 +142,7 @@ The code below shows how to change the configuration for an application running

## Conclusion

-There are a number of core configuration settings that you need to monitor and adjust to ensure your Spark jobs run in a predictable and performant way. These settings help determine the best Spark cluster configuration for your particular workloads. You'll also need to monitor the execution of long-running and/or resource-consuming Spark job executions. The most common challenges center around memory pressure due to improper configurations (particularly incorrectly-sized executors), long-running operations, and tasks, which result in Cartesian operations.
+Monitor core configuration settings to ensure your Spark jobs run in a predictable and performant way. These settings help determine the best Spark cluster configuration for your particular workloads. You'll also need to monitor the execution of long-running or resource-consuming Spark jobs. The most common challenges center around memory pressure from improper configurations, such as incorrectly sized executors, long-running operations, and tasks that result in Cartesian operations.

## Next steps