Commit 36ac838

committed: updates
1 parent 1239746 commit 36ac838

File tree

4 files changed: +19, -7 lines

articles/hdinsight/spark/optimize-cluster-configuration.md

Lines changed: 6 additions & 2 deletions
@@ -7,11 +7,15 @@ ms.reviewer: jasonh
 ms.service: hdinsight
 ms.topic: conceptual
 ms.custom: hdinsightactive,seomay2020
-ms.date: 05/18/2020
+ms.date: 05/20/2020
 ---
 # Cluster configuration optimization

-Depending on your Spark cluster workload, you may determine a non-default Spark configuration would result in more optimized Spark job execution. Do benchmark testing with sample workloads to validate any non-default cluster configurations.
+This article discusses how to optimize the configuration of your Apache Spark cluster for best performance on Azure HDInsight.
+
+## Overview
+
+Depending on your Spark cluster workload, you may determine that a non-default Spark configuration would result in more optimized Spark job execution. Do benchmark testing with sample workloads to validate any non-default cluster configurations.

 Here are some common parameters you can adjust:

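The parameters this article refers to are typically set when the Spark session is created. A minimal PySpark sketch of overriding a few common defaults; all values are illustrative assumptions, not recommendations, and should be validated with the benchmark testing the article calls for:

```python
# Minimal sketch: overriding default Spark configuration at session startup.
# Every value below is an illustrative assumption; benchmark against your
# own workload before adopting it. The same settings can also be passed
# via spark-submit --conf.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("config-benchmark")
    # Executor sizing: tune to your node size and workload.
    .config("spark.executor.memory", "4g")
    .config("spark.executor.cores", "2")
    # Shuffle parallelism: the Spark SQL default is 200 partitions.
    .config("spark.sql.shuffle.partitions", "400")
    .getOrCreate()
)
```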

articles/hdinsight/spark/optimize-data-processing.md

Lines changed: 5 additions & 1 deletion
@@ -7,10 +7,14 @@ ms.reviewer: jasonh
 ms.service: hdinsight
 ms.topic: conceptual
 ms.custom: hdinsightactive,seomay2020
-ms.date: 05/18/2020
+ms.date: 05/20/2020
 ---
 # Data processing optimization

+This article discusses how to optimize the configuration of your Apache Spark cluster for best performance on Azure HDInsight.
+
+## Overview
+
 If you have slow jobs on a Join or Shuffle, the cause is probably *data skew*. Data skew is asymmetry in your job data. For example, a map job may take 20 seconds. But running a job where the data is joined or shuffled takes hours. To fix data skew, you should salt the entire key, or use an *isolated salt* for only some subset of keys. If you're using an isolated salt, you should further filter to isolate your subset of salted keys in map joins. Another option is to introduce a bucket column and pre-aggregate in buckets first.

 Another factor causing slow joins could be the join type. By default, Spark uses the `SortMerge` join type. This type of join is best suited for large data sets. But is otherwise computationally expensive because it must first sort the left and right sides of data before merging them.
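The key-salting technique the data-skew paragraph describes can be sketched as follows. The table and column names (`facts`, `dims`, `key`) are hypothetical, and the bucket count is an arbitrary assumption:

```python
# Minimal sketch of salting a skewed join key: the skewed (large) side gets
# a random salt appended to its key, and the small side is replicated once
# per salt value so every salted key still finds its match.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
SALT_BUCKETS = 16  # arbitrary; size to how hot the skewed keys are

facts = spark.table("facts")  # large side, skewed on `key` (hypothetical)
dims = spark.table("dims")    # small side of the join (hypothetical)

# Append a random salt to the skewed side's key.
salted_facts = facts.withColumn(
    "salted_key",
    F.concat_ws("_", F.col("key"), (F.rand() * SALT_BUCKETS).cast("int")),
)

# Replicate each small-side row once per salt value.
salts = spark.range(SALT_BUCKETS).withColumnRenamed("id", "salt")
salted_dims = dims.crossJoin(salts).withColumn(
    "salted_key", F.concat_ws("_", F.col("key"), F.col("salt"))
)

joined = salted_facts.join(salted_dims, "salted_key")
```

An isolated salt applies the same idea only to the subset of keys known to be hot, leaving the rest of the data unsalted.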

articles/hdinsight/spark/optimize-data-storage.md

Lines changed: 3 additions & 3 deletions
@@ -7,13 +7,13 @@ ms.reviewer: jasonh
 ms.service: hdinsight
 ms.topic: conceptual
 ms.custom: hdinsightactive,seomay2020
-ms.date: 05/18/2020
+ms.date: 05/20/2020
 ---
 # Data storage optimization

-This article discusses strategies to optimize data storage for efficient Apache Spark job execution.
+This article discusses strategies to optimize data storage for efficient Apache Spark job execution on Azure HDInsight.

-## Use optimal data format
+## Overview

 Spark supports many formats, such as csv, json, xml, parquet, orc, and avro. Spark can be extended to support many more formats with external data sources - for more information, see [Apache Spark packages](https://spark-packages.org).
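As a companion to the format discussion, a minimal PySpark sketch of converting CSV input to Parquet, one of the columnar formats the article lists; the storage paths are hypothetical placeholders:

```python
# Minimal sketch: read CSV, persist as Parquet (columnar, compressed,
# splittable). Paths below are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# inferSchema costs an extra pass over the data; supply an explicit schema
# for production jobs.
df = spark.read.csv(
    "abfss://container@account.dfs.core.windows.net/raw/events.csv",
    header=True,
    inferSchema=True,
)

# Later jobs read only the columns and row groups they need.
df.write.mode("overwrite").parquet(
    "abfss://container@account.dfs.core.windows.net/curated/events"
)
```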

articles/hdinsight/spark/optimize-memory-usage.md

Lines changed: 5 additions & 1 deletion
@@ -7,10 +7,14 @@ ms.reviewer: jasonh
 ms.service: hdinsight
 ms.topic: conceptual
 ms.custom: hdinsightactive,seomay2020
-ms.date: 05/18/2020
+ms.date: 05/20/2020
 ---
 # Memory usage optimization

+This article discusses how to optimize memory management of your Apache Spark cluster for best performance on Azure HDInsight.
+
+## Overview
+
 Spark operates by placing data in memory. So managing memory resources is a key aspect of optimizing the execution of Spark jobs. There are several techniques you can apply to use your cluster's memory efficiently.

 * Prefer smaller data partitions and account for data size, types, and distribution in your partitioning strategy.
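The partitioning advice in that first bullet (the hunk is truncated here) can be sketched briefly; the partition count, column name, and paths are assumptions for illustration:

```python
# Minimal sketch of the partition-sizing advice: spread data into smaller,
# evenly sized partitions so no single task holds too much in memory.
# Counts and names below are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet(
    "abfss://container@account.dfs.core.windows.net/curated/events"
)

# Repartition by a well-distributed column; a common rule of thumb is
# roughly 100-200 MB of data per partition.
df = df.repartition(200, "event_date")

# coalesce() shrinks the partition count without a full shuffle, which is
# useful just before writing output.
df.coalesce(50).write.mode("overwrite").parquet(
    "abfss://container@account.dfs.core.windows.net/output/events"
)
```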

0 commit comments