
Commit 1239746

adding toc entries
1 parent 6af8f08 commit 1239746

File tree: 4 files changed (+26 additions, -12 deletions)

articles/hdinsight/TOC.yml

Lines changed: 12 additions & 2 deletions

@@ -328,8 +328,18 @@
       href: ./spark/spark-best-practices.md
     - name: Configure Apache Spark settings
       href: ./spark/apache-spark-settings.md
-    - name: Optimize Apache Spark jobs
-      href: ./spark/apache-spark-perf.md
+    - name: Optimization
+      items:
+      - name: Optimize Apache Spark jobs
+        href: ./spark/apache-spark-perf.md
+      - name: Optimize data processing
+        href: ./spark/optimize-data-processing.md
+      - name: Optimize data storage
+        href: ./spark/optimize-data-storage.md
+      - name: Optimize memory usage
+        href: ./spark/optimize-memory-usage.md
+      - name: Optimize cluster configuration
+        href: ./spark/optimize-cluster-configuration.md
     - name: How to
       items:
       - name: Use tools

articles/hdinsight/spark/optimize-cluster-configuration.md

Lines changed: 1 addition & 1 deletion

@@ -9,7 +9,7 @@ ms.topic: conceptual
 ms.custom: hdinsightactive,seomay2020
 ms.date: 05/18/2020
 ---
-# Optimize cluster configuration
+# Cluster configuration optimization

 Depending on your Spark cluster workload, you may determine that a non-default Spark configuration would result in more optimized Spark job execution. Do benchmark testing with sample workloads to validate any non-default cluster configurations.
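For reference, a non-default configuration of this kind is usually applied when the Spark session is created. The following is a minimal PySpark sketch; the values are illustrative only (they do not come from this commit) and would need the benchmark testing described above:

```python
from pyspark.sql import SparkSession

# Hypothetical non-default settings; validate with sample workloads first.
spark = (
    SparkSession.builder
    .appName("config-benchmark")
    .config("spark.executor.memory", "6g")           # per-executor JVM heap
    .config("spark.executor.cores", "3")             # concurrent tasks per executor
    .config("spark.sql.shuffle.partitions", "200")   # partition count after wide shuffles
    .getOrCreate()
)
```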
articles/hdinsight/spark/optimize-data-storage.md

Lines changed: 11 additions & 7 deletions

@@ -9,7 +9,17 @@ ms.topic: conceptual
 ms.custom: hdinsightactive,seomay2020
 ms.date: 05/18/2020
 ---
-# Optimize data storage
+# Data storage optimization
+
+This article discusses strategies to optimize data storage for efficient Apache Spark job execution.
+
+## Use optimal data format
+
+Spark supports many formats, such as csv, json, xml, parquet, orc, and avro. Spark can be extended to support many more formats with external data sources - for more information, see [Apache Spark packages](https://spark-packages.org).
+
+The best format for performance is parquet with *snappy compression*, which is the default in Spark 2.x. Parquet stores data in columnar format, and is highly optimized in Spark.
+
+## Choose data abstraction

 Earlier Spark versions use RDDs to abstract data; Spark 1.3 and 1.6 introduced DataFrames and DataSets, respectively. Consider the following relative merits:
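To illustrate the "Use optimal data format" section added above: writing snappy-compressed Parquet takes one option on the DataFrame writer. A minimal PySpark sketch, assuming a hypothetical input path and an existing SparkSession named `spark`:

```python
# Hypothetical paths, for illustration only; assumes an existing SparkSession `spark`.
df = spark.read.csv("/data/raw/events.csv", header=True, inferSchema=True)

# Snappy is already the default Parquet codec in Spark 2.x; shown explicitly here.
df.write.option("compression", "snappy").parquet("/data/curated/events.parquet")
```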
@@ -35,12 +45,6 @@ Earlier Spark versions use RDDs to abstract data; Spark 1.3 and 1.6 introduced
 * High GC overhead.
 * Must use Spark 1.x legacy APIs.
 
-## Use optimal data format
-
-Spark supports many formats, such as csv, json, xml, parquet, orc, and avro. Spark can be extended to support many more formats with external data sources - for more information, see [Apache Spark packages](https://spark-packages.org).
-
-The best format for performance is parquet with *snappy compression*, which is the default in Spark 2.x. Parquet stores data in columnar format, and is highly optimized in Spark.
-
 ## Select default storage
 
 When you create a new Spark cluster, you can select Azure Blob Storage or Azure Data Lake Storage as your cluster's default storage. Both options give you the benefit of long-term storage for transient clusters, so your data isn't automatically deleted when you delete your cluster. You can recreate a transient cluster and still access your data.
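To illustrate the default-storage choice: data on either store is addressed with a scheme-qualified URI, so jobs read it the same way regardless of which option was selected. A hypothetical sketch (the container, file system, and account names are placeholders):

```python
# Placeholder names; substitute your own container/file system and storage account.
blob_path = "wasbs://mycontainer@myaccount.blob.core.windows.net/data/events.parquet"
adls_path = "abfss://myfilesystem@myaccount.dfs.core.windows.net/data/events.parquet"

# Reads work the same way against either default store; assumes SparkSession `spark`.
df = spark.read.parquet(adls_path)
```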

articles/hdinsight/spark/optimize-memory-usage.md

Lines changed: 2 additions & 2 deletions

@@ -9,7 +9,7 @@ ms.topic: conceptual
 ms.custom: hdinsightactive,seomay2020
 ms.date: 05/18/2020
 ---
-# Optimize memory usage
+# Memory usage optimization

 Spark operates by placing data in memory, so managing memory resources is a key aspect of optimizing the execution of Spark jobs. There are several techniques you can apply to use your cluster's memory efficiently.
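Two such techniques, sketched below in PySpark under illustrative assumptions (the input path is hypothetical): using Kryo serialization for more compact serialized data, and persisting with a storage level that spills to disk rather than recomputing.

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("memory-tuning")
    # Kryo is more compact and faster than the default Java serialization.
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

df = spark.read.parquet("/data/curated/events.parquet")  # hypothetical path

# Keep hot data cached, spilling to disk when it no longer fits in memory.
df.persist(StorageLevel.MEMORY_AND_DISK)
```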
@@ -20,7 +20,7 @@ Spark operates by placing data in memory, so managing memory resources is a key
 
 For your reference, the Spark memory structure and some key executor memory parameters are shown in the next image.
 
-### Spark memory considerations
+## Spark memory considerations
 
 If you're using Apache Hadoop YARN, then YARN controls the memory used by all containers on each Spark node. The following diagram shows the key objects and their relationships.
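Because YARN caps each container, an executor's total memory request (JVM heap plus off-heap overhead) must fit within YARN's per-container limit. A hypothetical PySpark sketch of the key executor memory parameters; the values are illustrative, and the overhead property name shown is the Spark 2.x form:

```python
from pyspark.sql import SparkSession

# Illustrative values only; YARN rejects executors whose total request
# (heap + overhead) exceeds its per-container maximum.
spark = (
    SparkSession.builder
    .appName("yarn-memory")
    .config("spark.executor.memory", "4g")                  # executor JVM heap
    .config("spark.yarn.executor.memoryOverhead", "512m")   # off-heap allowance (Spark 2.x name)
    .config("spark.executor.instances", "6")                # executors per application
    .getOrCreate()
)
```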
0 commit comments