articles/hdinsight/spark/optimize-cluster-configuration.md (1 addition, 1 deletion)
@@ -9,7 +9,7 @@ ms.topic: conceptual
ms.custom: hdinsightactive,seomay2020
ms.date: 05/18/2020
---
-# Optimize cluster configuration
+# Cluster configuration optimization

Depending on your Spark cluster workload, you may determine a non-default Spark configuration would result in more optimized Spark job execution. Do benchmark testing with sample workloads to validate any non-default cluster configurations.
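To make that benchmarking advice concrete, here is a minimal PySpark sketch of trying one non-default configuration against a sample workload. The executor sizes, shuffle partition count, storage path, and column name are hypothetical placeholders, not values taken from the article.

```python
# Hypothetical benchmark of a non-default Spark configuration (example values only).
import time

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("config-benchmark")
    .config("spark.executor.memory", "8g")          # non-default executor heap (example value)
    .config("spark.executor.cores", "4")            # cores per executor (example value)
    .config("spark.sql.shuffle.partitions", "200")  # shuffle parallelism to compare across runs
    .getOrCreate()
)

# Time a representative sample workload; repeat with other settings and compare.
start = time.time()
sample = spark.read.parquet("wasbs://data@youraccount.blob.core.windows.net/sample/")  # placeholder path
sample.groupBy("some_column").count().collect()  # placeholder column
print(f"Elapsed: {time.time() - start:.1f} s")
```

Run the same workload under each candidate configuration and keep the non-default settings only if the measured improvement holds up.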
articles/hdinsight/spark/optimize-data-storage.md (11 additions, 7 deletions)
@@ -9,7 +9,17 @@ ms.topic: conceptual
ms.custom: hdinsightactive,seomay2020
ms.date: 05/18/2020
---
-# Optimize data storage
+# Data storage optimization
+
+This article discusses strategies to optimize data storage for efficient Apache Spark job execution.
+
+## Use optimal data format
+
+Spark supports many formats, such as csv, json, xml, parquet, orc, and avro. Spark can be extended to support many more formats with external data sources - for more information, see [Apache Spark packages](https://spark-packages.org).
+
+The best format for performance is parquet with *snappy compression*, which is the default in Spark 2.x. Parquet stores data in columnar format, and is highly optimized in Spark.
+
+## Choose data abstraction

Earlier Spark versions use RDDs to abstract data, Spark 1.3, and 1.6 introduced DataFrames and DataSets, respectively. Consider the following relative merits:
@@ -35,12 +45,6 @@ Earlier Spark versions use RDDs to abstract data, Spark 1.3, and 1.6 introduced
* High GC overhead.
* Must use Spark 1.x legacy APIs.

-## Use optimal data format
-
-Spark supports many formats, such as csv, json, xml, parquet, orc, and avro. Spark can be extended to support many more formats with external data sources - for more information, see [Apache Spark packages](https://spark-packages.org).
-
-The best format for performance is parquet with *snappy compression*, which is the default in Spark 2.x. Parquet stores data in columnar format, and is highly optimized in Spark.
-

## Select default storage

When you create a new Spark cluster, you can select Azure Blob Storage or Azure Data Lake Storage as your cluster's default storage. Both options give you the benefit of long-term storage for transient clusters. So your data doesn't get automatically deleted when you delete your cluster. You can recreate a transient cluster and still access your data.
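As a small illustration of the parquet-with-snappy guidance in the diff above, the following PySpark sketch converts a CSV dataset to snappy-compressed Parquet. The storage paths are hypothetical placeholders, and snappy is named explicitly even though it is already the Spark 2.x default.

```python
# Sketch: land raw CSV data as snappy-compressed Parquet (paths are placeholders).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Read with the DataFrame API rather than RDDs.
events = spark.read.csv(
    "abfss://data@youraccount.dfs.core.windows.net/raw/events/",  # placeholder source
    header=True,
    inferSchema=True,
)

# Columnar Parquet with snappy compression is the format recommended for performance.
events.write.mode("overwrite").parquet(
    "abfss://data@youraccount.dfs.core.windows.net/curated/events/",  # placeholder target
    compression="snappy",
)
```

Downstream jobs that read the Parquet output benefit from its columnar layout through column pruning and predicate pushdown.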
articles/hdinsight/spark/optimize-memory-usage.md (2 additions, 2 deletions)
@@ -9,7 +9,7 @@ ms.topic: conceptual
ms.custom: hdinsightactive,seomay2020
ms.date: 05/18/2020
---
-# Optimize memory usage
+# Memory usage optimization

Spark operates by placing data in memory. So managing memory resources is a key aspect of optimizing the execution of Spark jobs. There are several techniques you can apply to use your cluster's memory efficiently.
@@ -20,7 +20,7 @@ Spark operates by placing data in memory. So managing memory resources is a key
For your reference, the Spark memory structure and some key executor memory parameters are shown in the next image.

-###Spark memory considerations
+## Spark memory considerations

If you're using Apache Hadoop YARN, then YARN controls the memory used by all containers on each Spark node. The following diagram shows the key objects and their relationships.
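To show roughly where those executor memory parameters are set, here is a hypothetical PySpark sketch. The sizes are example values only; under YARN, the executor heap plus its overhead must fit within the container size YARN allows on each node.

```python
# Example executor memory settings (hypothetical values, not recommendations).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("memory-tuning")
    .config("spark.executor.memory", "6g")           # executor JVM heap (example value)
    .config("spark.executor.memoryOverhead", "1g")   # off-heap overhead added to the YARN container
    .config("spark.memory.fraction", "0.6")          # heap share for execution + storage (Spark default)
    .config("spark.memory.storageFraction", "0.5")   # portion of that share protected for cached data
    .getOrCreate()
)

# Cache only data that is reused, and release it when done, to keep executor memory available.
df = spark.range(10_000_000)
df.cache()
df.count()      # materializes the cache
df.unpersist()
```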