Commit 847dd8e

committed
hive21
1 parent 6dfc04c commit 847dd8e

1 file changed (+11 -12 lines changed)

articles/hdinsight/interactive-query/gateway-best-practices.md

Lines changed: 11 additions & 12 deletions
@@ -9,14 +9,13 @@ ms.topic: conceptual
 ms.date: 04/01/2020
 ---
 
-
 # Gateway deep dive and best practices for Apache Hive in Azure HDInsight
 
-The Azure HDInsight gateway (Gateway) is the HTTPS frontend for HDInsight clusters. The Gateway is responsible for authentication, host resolution, service discovery, and other helpful features necessary for a modern distributed system. The features provided by the Gateway result in some overhead for which this document will describe the best practices to navigate. Gateway troubleshooting techniques are also discussed.
+The Azure HDInsight gateway (Gateway) is the HTTPS frontend for HDInsight clusters. The Gateway is responsible for: authentication, host resolution, service discovery, and other helpful features necessary for a modern distributed system. The features provided by the Gateway result in some overhead for which this document will describe the best practices to navigate. Gateway troubleshooting techniques are also discussed.
 
 ## The HDInsight gateway
 
-The HDInsight gateway is the only part of an HDInsight cluster that is publicly accessible over the internet. The Gateway service can be accessed without going over the public internet by using the `clustername-int.azurehdinsight.net` internal gateway endpoint. The internal gateway endpoint allows connections to be established to the gateway service without exiting the cluster's virtual network. The Gateway is responsible for handling all requests that are sent to the cluster, and for forwarding such requests to the correct components and cluster hosts.
+The HDInsight gateway is the only part of an HDInsight cluster that is publicly accessible over the internet. The Gateway service can be accessed without going over the public internet by using the `clustername-int.azurehdinsight.net` internal gateway endpoint. The internal gateway endpoint allows connections to be established to the gateway service without exiting the cluster's virtual network. The Gateway handles all requests that are sent to the cluster, and forwards such requests to the correct components and cluster hosts.
 
 The following diagram provides a rough illustration of how the Gateway provides an abstraction in front of all the different host resolution possibilities within HDInsight.
 
@@ -30,29 +29,29 @@ For service discovery, the advantage of the gateway is that each component withi
 
 For authentication, the Gateway allows users to authenticate using a `username:password` credential pair. For ESP-enabled clusters, this credential would be the user's domain username and password. Authentication to HDInsight clusters via the Gateway doesn't require the client to acquire a kerberos ticket. Since the Gateway accepts `username:password` credentials and acquires the user's Kerberos ticket on the user's behalf, secure connections can be made to the Gateway from any client host, including clients joined to different AA-DDS domains than the (ESP) cluster.
 
-## Best Practices
+## Best practices
 
-Since the Gateway is a single service (load balanced across two hosts) responsible for request forwarding and authentication, the Gateway may become a throughput bottleneck for Hive queries exceeding a certain size. In particular, query performance degradation may be observed when very large **SELECT** queries are executed on the Gateway via ODBC or JDBC. "Very large" means queries that make up more than 5 GB of data across rows or columns. This could be a query with a long list of rows and, or a wide column count.
+The Gateway is a single service (load balanced across two hosts) responsible for request forwarding and authentication. The Gateway may become a throughput bottleneck for Hive queries exceeding a certain size. Query performance degradation may be observed when very large **SELECT** queries are executed on the Gateway via ODBC or JDBC. "Very large" means queries that make up more than 5 GB of data across rows or columns. This query could include a long list of rows and, or a wide column count.
 
-The reason for the Gateway's performance degradation around queries of a large magnitude is that this data must ultimately be transferred from the underlying data store (ADLS Gen2), to the HDInsight Hive Server, to the Gateway, and finally via the JDBC or ODBC drivers to the client host.
+The Gateway's performance degradation around queries of a large size is because the data must be transferred from the underlying data store (ADLS Gen2) to: the HDInsight Hive Server, the Gateway, and finally via the JDBC or ODBC drivers to the client host.
 
 The following diagram illustrates the steps involved in a SELECT query.
 
 ![Result Diagram](./media/gateway-best-practices/result-retrieval-diagram.png "Result Diagram")
 
-Apache Hive is a relational abstraction on top of an HDFS-compatible filesystem. This means that **SELECT** statements in Hive correspond to READ operations on the filesystem, which are translated into the appropriate schema before being reported to the user. The latency of this process increases with data size and total hops required to reach the end user.
+Apache Hive is a relational abstraction on top of an HDFS-compatible filesystem. This abstraction means **SELECT** statements in Hive correspond to **READ** operations on the filesystem. The **READ** operations are translated into the appropriate schema before reported to the user. The latency of this process increases with data size and total hops required to reach the end user.
 
-Similar behavior may be noticed when executing **CREATE** or **INSERT** statements of large data size, as these commands will correspond to **WRITE** operations in the underlying filesystem. In some cases, it may be prudent to write data such as raw ORC data to the filesystem/datalake instead of loading data to the filesystem using Hive **INSERT** or **LOAD** commands.
+Similar behavior may occur when executing **CREATE** or **INSERT** statements of large data, as these commands will correspond to **WRITE** operations in the underlying filesystem. Consider writing data, such as raw ORC, to the filesystem/datalake instead of loading it using **INSERT** or **LOAD**.
 
-In Enterprise Security Pack-enabled clusters, sufficiently complex Apache Ranger policies may cause a slowdown in query compilation time, which may lead to a gateway timeout. If a gateway timeout is noticed in an ESP cluster, consider reducing or combining the number of ranger policies that must be evaluated for the statement that leads to the timeout.
+In Enterprise Security Pack-enabled clusters, sufficiently complex Apache Ranger policies may cause a slowdown in query compilation time, which may lead to a gateway timeout. If a gateway timeout is noticed in an ESP cluster, consider reducing or combining the number of ranger policies.
 
-## Troubleshooting Techniques
+## Troubleshooting techniques
 
-There are multiple venues for mitigating and understanding performance issues met as part of the above behavior. Use the following checklist when experiencing query performance degradation over the HDInsight gateway to determine what options may be appropriate to improve your scenario:
+There are multiple venues for mitigating and understanding performance issues met as part of the above behavior. Use the following checklist when experiencing query performance degradation over the HDInsight gateway:
 
 * Use the **LIMIT** clause when executing large **SELECT** queries. The **LIMIT** clause will reduce the total rows reported to the client host. The **LIMIT** clause only affects result generation and doesn't change the query plan. To apply the **LIMIT** clause to the query plan, use the configuration `hive.limit.optimize.enable`. **LIMIT** can be combined with an offset using the argument form **LIMIT x,y**.
 
-* Name your columns of interest when running **SELECT** queries instead of using **SELECT \***. Selecting fewer columns will decrease the amount of data read.
+* Name your columns of interest when running **SELECT** queries instead of using **SELECT \***. Selecting fewer columns will lower the amount of data read.
 
 * Try running the query of interest through Apache Beeline. If result retrieval via Apache Beeline takes an extended period of time,
 expect delays when retrieving the same results via external tools.
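
Note (illustrative, not part of the commit): the first two checklist items in the diff above, naming columns and using **LIMIT**, can be sketched as a small query builder. This is a minimal sketch under stated assumptions. The helper `build_select`, the table `sales`, and its column names are hypothetical. The two-argument form is assumed to mean `LIMIT offset,row_count`, which is how Hive documents it. Identifiers are not validated or escaped, so this is not safe for untrusted input.

```python
from typing import Optional, Sequence


def build_select(table: str, columns: Sequence[str], limit: int,
                 offset: Optional[int] = None) -> str:
    """Build a Hive SELECT that names its columns and caps the rows returned.

    Hypothetical helper for illustration; it does no identifier escaping.
    """
    if not columns:
        raise ValueError("name at least one column instead of using SELECT *")
    if limit <= 0:
        raise ValueError("limit must be positive")
    # Assumed Hive form: LIMIT offset,row_count when an offset is given.
    if offset is None:
        limit_clause = f"LIMIT {limit}"
    else:
        limit_clause = f"LIMIT {offset},{limit}"
    return f"SELECT {', '.join(columns)} FROM {table} {limit_clause}"


# Session setting named in the checklist; whether to enable it is a judgment call.
LIMIT_OPTIMIZATION = "SET hive.limit.optimize.enable=true"

print(build_select("sales", ["order_id", "amount"], limit=1000))
print(build_select("sales", ["order_id", "amount"], limit=1000, offset=5000))
```

Running the generated statement through Apache Beeline first, as the last checklist item suggests, gives a baseline before trying the same query over ODBC or JDBC.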
