Commit 36ac838

committed: updates
1 parent 1239746 commit 36ac838

File tree

4 files changed: +19, -7 lines

articles/hdinsight/spark/optimize-cluster-configuration.md

Lines changed: 6 additions & 2 deletions
@@ -7,11 +7,15 @@ ms.reviewer: jasonh
 ms.service: hdinsight
 ms.topic: conceptual
 ms.custom: hdinsightactive,seomay2020
-ms.date: 05/18/2020
+ms.date: 05/20/2020
 ---
 # Cluster configuration optimization

-Depending on your Spark cluster workload, you may determine a non-default Spark configuration would result in more optimized Spark job execution. Do benchmark testing with sample workloads to validate any non-default cluster configurations.
+This article discusses how to optimize the configuration of your Apache Spark cluster for best performance on Azure HDInsight.
+
+## Overview
+
+Depending on your Spark cluster workload, you may determine that a non-default Spark configuration would result in more optimized Spark job execution. Do benchmark testing with sample workloads to validate any non-default cluster configurations.

 Here are some common parameters you can adjust:

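The parameters this article refers to are typically set when the Spark session is created. A minimal PySpark sketch of overriding a few common defaults; all values are illustrative assumptions, not recommendations, and should be validated with the benchmark testing the article calls for:

```python
# Minimal sketch: overriding default Spark configuration at session startup.
# Every value below is an illustrative assumption; benchmark against your
# own workload before adopting it. The same settings can also be passed
# via spark-submit --conf.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("config-benchmark")
    # Executor sizing: tune to your node size and workload.
    .config("spark.executor.memory", "4g")
    .config("spark.executor.cores", "2")
    # Shuffle parallelism: the Spark SQL default is 200 partitions.
    .config("spark.sql.shuffle.partitions", "400")
    .getOrCreate()
)
```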

articles/hdinsight/spark/optimize-data-processing.md

Lines changed: 5 additions & 1 deletion
@@ -7,10 +7,14 @@ ms.reviewer: jasonh
 ms.service: hdinsight
 ms.topic: conceptual
 ms.custom: hdinsightactive,seomay2020
-ms.date: 05/18/2020
+ms.date: 05/20/2020
 ---
 # Data processing optimization

+This article discusses how to optimize the configuration of your Apache Spark cluster for best performance on Azure HDInsight.
+
+## Overview
+
 If you have slow jobs on a Join or Shuffle, the cause is probably *data skew*. Data skew is asymmetry in your job data. For example, a map job may take 20 seconds. But running a job where the data is joined or shuffled takes hours. To fix data skew, you should salt the entire key, or use an *isolated salt* for only some subset of keys. If you're using an isolated salt, you should further filter to isolate your subset of salted keys in map joins. Another option is to introduce a bucket column and pre-aggregate in buckets first.

 Another factor causing slow joins could be the join type. By default, Spark uses the `SortMerge` join type. This type of join is best suited for large data sets. But is otherwise computationally expensive because it must first sort the left and right sides of data before merging them.
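The key-salting technique the data-skew paragraph describes can be sketched as follows. The table and column names (`facts`, `dims`, `key`) are hypothetical, and the bucket count is an arbitrary assumption:

```python
# Minimal sketch of salting a skewed join key: the skewed (large) side gets
# a random salt appended to its key, and the small side is replicated once
# per salt value so every salted key still finds its match.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
SALT_BUCKETS = 16  # arbitrary; size to how hot the skewed keys are

facts = spark.table("facts")  # large side, skewed on `key` (hypothetical)
dims = spark.table("dims")    # small side of the join (hypothetical)

# Append a random salt to the skewed side's key.
salted_facts = facts.withColumn(
    "salted_key",
    F.concat_ws("_", F.col("key"), (F.rand() * SALT_BUCKETS).cast("int")),
)

# Replicate each small-side row once per salt value.
salts = spark.range(SALT_BUCKETS).withColumnRenamed("id", "salt")
salted_dims = dims.crossJoin(salts).withColumn(
    "salted_key", F.concat_ws("_", F.col("key"), F.col("salt"))
)

joined = salted_facts.join(salted_dims, "salted_key")
```

An isolated salt applies the same idea only to the subset of keys known to be hot, leaving the rest of the data unsalted.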

articles/hdinsight/spark/optimize-data-storage.md

Lines changed: 3 additions & 3 deletions
@@ -7,13 +7,13 @@ ms.reviewer: jasonh
 ms.service: hdinsight
 ms.topic: conceptual
 ms.custom: hdinsightactive,seomay2020
-ms.date: 05/18/2020
+ms.date: 05/20/2020
 ---
 # Data storage optimization

-This article discusses strategies to optimize data storage for efficient Apache Spark job execution.
+This article discusses strategies to optimize data storage for efficient Apache Spark job execution on Azure HDInsight.

-## Use optimal data format
+## Overview

 Spark supports many formats, such as csv, json, xml, parquet, orc, and avro. Spark can be extended to support many more formats with external data sources - for more information, see [Apache Spark packages](https://spark-packages.org).
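As a companion to the format discussion, a minimal PySpark sketch of converting CSV input to Parquet, one of the columnar formats the article lists; the storage paths are hypothetical placeholders:

```python
# Minimal sketch: read CSV, persist as Parquet (columnar, compressed,
# splittable). Paths below are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# inferSchema costs an extra pass over the data; supply an explicit schema
# for production jobs.
df = spark.read.csv(
    "abfss://container@account.dfs.core.windows.net/raw/events.csv",
    header=True,
    inferSchema=True,
)

# Later jobs read only the columns and row groups they need.
df.write.mode("overwrite").parquet(
    "abfss://container@account.dfs.core.windows.net/curated/events"
)
```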

articles/hdinsight/spark/optimize-memory-usage.md

Lines changed: 5 additions & 1 deletion
@@ -7,10 +7,14 @@ ms.reviewer: jasonh
 ms.service: hdinsight
 ms.topic: conceptual
 ms.custom: hdinsightactive,seomay2020
-ms.date: 05/18/2020
+ms.date: 05/20/2020
 ---
 # Memory usage optimization

+This article discusses how to optimize memory management of your Apache Spark cluster for best performance on Azure HDInsight.
+
+## Overview
+
 Spark operates by placing data in memory. So managing memory resources is a key aspect of optimizing the execution of Spark jobs. There are several techniques you can apply to use your cluster's memory efficiently.

 * Prefer smaller data partitions and account for data size, types, and distribution in your partitioning strategy.
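The partitioning advice in that first bullet (the hunk is truncated here) can be sketched briefly; the partition count, column name, and paths are assumptions for illustration:

```python
# Minimal sketch of the partition-sizing advice: spread data into smaller,
# evenly sized partitions so no single task holds too much in memory.
# Counts and names below are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet(
    "abfss://container@account.dfs.core.windows.net/curated/events"
)

# Repartition by a well-distributed column; a common rule of thumb is
# roughly 100-200 MB of data per partition.
df = df.repartition(200, "event_date")

# coalesce() shrinks the partition count without a full shuffle, which is
# useful just before writing output.
df.coalesce(50).write.mode("overwrite").parquet(
    "abfss://container@account.dfs.core.windows.net/output/events"
)
```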

0 commit comments