Commit 551b7a5

Added updated image.
Added updated image after removing Storm.
1 parent cb087a5 commit 551b7a5

1 file changed: +3 -2 lines changed

articles/hdinsight/hdinsight-hadoop-optimize-hive-query.md

Lines changed: 3 additions & 2 deletions
@@ -19,7 +19,7 @@ Choose the appropriate cluster type to help optimize performance for your worklo

 * Choose **Interactive Query** cluster type to optimize for `ad hoc`, interactive queries.
 * Choose Apache **Hadoop** cluster type to optimize for Hive queries used as a batch process.
-* **Spark** and **HBase** cluster types can also run Hive queries, and might be appropriate if you are running those workloads.
+* **Spark** and **HBase** cluster types can also run Hive queries, and might be appropriate if you're running those workloads.

 For more information on running Hive queries on various HDInsight cluster types, see [What is Apache Hive and HiveQL on Azure HDInsight?](hadoop/hdinsight-use-hive.md).

@@ -41,6 +41,7 @@ For more information about scaling HDInsight, see [Scale HDInsight clusters](hdi

 [Apache Tez](https://tez.apache.org/) is an alternative execution engine to the MapReduce engine. Linux-based HDInsight clusters have Tez enabled by default.

+:::image type="content" source="./media/hdinsight-hadoop-optimize-hive-query/hdinsight-tez-engine-new.png" alt-text="HDInsight Apache Tez overview diagram":::

 Tez is faster because:

@@ -70,7 +71,7 @@ Some partitioning considerations:

 * **Don't under partition** - Partitioning on columns with only a few values can cause few partitions. For example, partitioning on gender only creates two partitions to be created (male and female), so reduce the latency by a maximum of half.
 * **Don't over partition** - On the other extreme, creating a partition on a column with a unique value (for example, userid) causes multiple partitions. Over partition causes much stress on the cluster namenode as it has to handle the large number of directories.
-* **Avoid data skew** - Choose your partitioning key wisely so that all partitions are even size. For example, partitioning on *State* column may skew the distribution of data. Since the state of California has a population almost 30x that of Vermont, the partition size is potentially skewed and performance may vary tremendously.
+* **Avoid data skew** - Choose your partitioning key wisely so that all partitions are even size. For example, partitioning on *State* column may skew the distribution of data. Since the state of California has a population almost 30x that of Vermont, the partition size is potentially skewed, and performance may vary tremendously.

 To create a partition table, use the *Partitioned By* clause:
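
The last context line of this hunk introduces the *Partitioned By* clause, but the diff ends before the article's own example. As a rough illustration only, a minimal HiveQL sketch of that clause might look like the following. The table and column names (`page_views`, `page_views_staging`, `view_date`) are hypothetical and do not come from the article or this commit.

```sql
-- Hypothetical table: all names here are illustrative assumptions.
-- The partition column (view_date) is declared in PARTITIONED BY,
-- not in the regular column list.
CREATE TABLE page_views (
    user_id STRING,
    url     STRING
)
PARTITIONED BY (view_date STRING)
STORED AS ORC;

-- Static partition insert: rows are written into the directory for
-- the single partition named in the PARTITION clause.
-- page_views_staging is an assumed, already existing source table.
INSERT OVERWRITE TABLE page_views PARTITION (view_date = '2020-01-01')
SELECT user_id, url
FROM page_views_staging
WHERE view_date = '2020-01-01';
```

A date-like key such as the assumed `view_date` usually produces a moderate number of partitions, which is consistent with the under-partitioning and over-partitioning considerations in the hunk, though whether the partitions end up evenly sized still depends on the data.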
