articles/hdinsight/interactive-query/gateway-best-practices.md
+11 -12 (11 additions & 12 deletions)
Original file line number
Diff line number
Diff line change
@@ -9,14 +9,13 @@ ms.topic: conceptual
9
9
ms.date: 04/01/2020
10
10
---
11
11
12
-
13
12
# Gateway deep dive and best practices for Apache Hive in Azure HDInsight
14
13
15
-
The Azure HDInsight gateway (Gateway) is the HTTPS frontend for HDInsight clusters. The Gateway is responsible for authentication, host resolution, service discovery, and other helpful features necessary for a modern distributed system. The features provided by the Gateway result in some overhead for which this document will describe the best practices to navigate. Gateway troubleshooting techniques are also discussed.
14
+
The Azure HDInsight gateway (Gateway) is the HTTPS frontend for HDInsight clusters. The Gateway is responsible for authentication, host resolution, service discovery, and other features necessary for a modern distributed system. These features add some overhead. This document describes best practices for working within that overhead and covers Gateway troubleshooting techniques.
16
15
17
16
## The HDInsight gateway
18
17
19
-
The HDInsight gateway is the only part of an HDInsight cluster that is publicly accessible over the internet. The Gateway service can be accessed without going over the public internet by using the `clustername-int.azurehdinsight.net` internal gateway endpoint. The internal gateway endpoint allows connections to be established to the gateway service without exiting the cluster's virtual network. The Gateway is responsible for handling all requests that are sent to the cluster, and for forwarding such requests to the correct components and cluster hosts.
18
+
The HDInsight gateway is the only part of an HDInsight cluster that is publicly accessible over the internet. The Gateway service can be accessed without going over the public internet by using the `clustername-int.azurehdinsight.net` internal gateway endpoint. The internal gateway endpoint allows connections to be established to the gateway service without exiting the cluster's virtual network. The Gateway handles all requests that are sent to the cluster, and forwards such requests to the correct components and cluster hosts.
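As a minimal sketch of the two endpoints described above, the following Python helper derives both hostnames from a cluster name. The internal `-int` pattern comes from this article. The public pattern `clustername.azurehdinsight.net` and the function name `gateway_hosts` are assumptions made for illustration.

```python
def gateway_hosts(cluster_name: str) -> dict:
    """Return the public and internal Gateway hostnames for a cluster.

    The "-int" suffix follows the pattern given in the article; the
    public hostname pattern is an assumption made by analogy.
    """
    name = cluster_name.strip()
    if not name:
        raise ValueError("cluster_name must not be empty")
    return {
        "public": f"{name}.azurehdinsight.net",
        "internal": f"{name}-int.azurehdinsight.net",
    }


hosts = gateway_hosts("mycluster")
print(hosts["internal"])
```

A client running inside the cluster's virtual network would typically prefer the `internal` hostname so that traffic stays off the public internet.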
20
19
21
20
The following diagram provides a rough illustration of how the Gateway provides an abstraction in front of all the different host resolution possibilities within HDInsight.
22
21
@@ -30,29 +29,29 @@ For service discovery, the advantage of the gateway is that each component withi
30
29
31
30
For authentication, the Gateway allows users to authenticate using a `username:password` credential pair. For ESP-enabled clusters, this credential would be the user's domain username and password. Authentication to HDInsight clusters via the Gateway doesn't require the client to acquire a Kerberos ticket. Because the Gateway accepts `username:password` credentials and acquires the user's Kerberos ticket on the user's behalf, secure connections can be made to the Gateway from any client host, including clients joined to different Azure AD DS domains than the (ESP) cluster.
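The following Python sketch shows one way a client might package a `username:password` pair for an HTTPS request. It assumes the pair is sent as an HTTP Basic `Authorization` header, which this article doesn't state explicitly. The function name `basic_auth_header` and the sample credentials are hypothetical.

```python
import base64


def basic_auth_header(username: str, password: str) -> dict:
    """Encode a username:password pair as an HTTP Basic Authorization header.

    Assumption: the Gateway accepts the pair via HTTP Basic authentication
    over HTTPS. No Kerberos ticket is requested by the client here.
    """
    pair = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(pair).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Placeholder credentials for illustration only.
headers = basic_auth_header("admin", "secret")
print(headers["Authorization"])
```

Basic credentials are only encoded, not encrypted, so a header like this should only ever travel over the Gateway's HTTPS endpoint.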
32
31
33
-
## Best Practices
32
+
## Best practices
34
33
35
-
Since the Gateway is a single service (load balanced across two hosts) responsible for request forwarding and authentication, the Gateway may become a throughput bottleneck for Hive queries exceeding a certain size. In particular, query performance degradation may be observed when very large **SELECT** queries are executed on the Gateway via ODBC or JDBC. "Very large" means queries that make up more than 5 GB of data across rows or columns. This could be a query with a long list of rows and, or a wide column count.
34
+
The Gateway is a single service (load balanced across two hosts) responsible for request forwarding and authentication. The Gateway may become a throughput bottleneck for Hive queries exceeding a certain size. Query performance degradation may be observed when very large **SELECT** queries are executed through the Gateway via ODBC or JDBC. "Very large" means queries whose results exceed 5 GB of data across rows or columns. Such a query could return a large number of rows, a large number of columns, or both.
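To judge whether a query is likely to fall into the "very large" category, a rough estimate can be made before running it. The Python sketch below multiplies row count, column count, and an average value size, then compares the result to the 5 GB figure. The function names, the average-size parameter, and the decimal interpretation of GB are assumptions for illustration, and the estimate ignores serialization overhead.

```python
# 5 GB threshold from the article; treating GB as 10**9 bytes is an assumption.
VERY_LARGE_BYTES = 5 * 10**9


def estimated_result_bytes(rows: int, columns: int, avg_bytes_per_value: int) -> int:
    """Rough result-set size: rows x columns x average bytes per value."""
    return rows * columns * avg_bytes_per_value


def is_very_large(rows: int, columns: int, avg_bytes_per_value: int) -> bool:
    """True when the estimated result exceeds the 5 GB threshold."""
    return estimated_result_bytes(rows, columns, avg_bytes_per_value) > VERY_LARGE_BYTES


# Hypothetical query shape: 100 million rows, 10 columns, about 8 bytes per value.
print(is_very_large(100_000_000, 10, 8))
```

An estimate like this only indicates when the result set is worth trimming. It doesn't predict actual Gateway throughput.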
36
35
37
-
The reason for the Gateway's performance degradation around queries of a large magnitude is that this data must ultimately be transferred from the underlying data store (ADLS Gen2), to the HDInsight Hive Server, to the Gateway, and finally via the JDBC or ODBC drivers to the client host.
36
+
The Gateway's performance degrades for large queries because the data must travel from the underlying data store (ADLS Gen2) to the HDInsight Hive Server, then to the Gateway, and finally via the JDBC or ODBC drivers to the client host.
38
37
39
38
The following diagram illustrates the steps involved in a SELECT query.
Apache Hive is a relational abstraction on top of an HDFS-compatible filesystem. This means that **SELECT** statements in Hive correspond to READ operations on the filesystem, which are translated into the appropriate schema before being reported to the user. The latency of this process increases with data size and total hops required to reach the end user.
42
+
Apache Hive is a relational abstraction on top of an HDFS-compatible filesystem. This abstraction means **SELECT** statements in Hive correspond to **READ** operations on the filesystem. The results of the **READ** operations are translated into the appropriate schema before being reported to the user. The latency of this process increases with data size and total hops required to reach the end user.
44
43
45
-
Similar behavior may be noticed when executing **CREATE** or **INSERT** statements of large data size, as these commands will correspond to **WRITE** operations in the underlying filesystem. In some cases, it may be prudent to write data such as raw ORC data to the filesystem/datalake instead of loading data to the filesystem using Hive **INSERT** or **LOAD** commands.
44
+
Similar behavior may occur when executing **CREATE** or **INSERT** statements that involve large amounts of data, as these commands correspond to **WRITE** operations in the underlying filesystem. Consider writing data, such as raw ORC, directly to the filesystem/datalake instead of loading it using **INSERT** or **LOAD**.
46
45
47
-
In Enterprise Security Pack-enabled clusters, sufficiently complex Apache Ranger policies may cause a slowdown in query compilation time, which may lead to a gateway timeout. If a gateway timeout is noticed in an ESP cluster, consider reducing or combining the number of ranger policies that must be evaluated for the statement that leads to the timeout.
46
+
In Enterprise Security Pack-enabled clusters, sufficiently complex Apache Ranger policies may cause a slowdown in query compilation time, which may lead to a gateway timeout. If a gateway timeout is noticed in an ESP cluster, consider reducing the number of Ranger policies or combining them.
48
47
49
-
## Troubleshooting Techniques
48
+
## Troubleshooting techniques
50
49
51
-
There are multiple venues for mitigating and understanding performance issues met as part of the above behavior. Use the following checklist when experiencing query performance degradation over the HDInsight gateway to determine what options may be appropriate to improve your scenario:
50
+
There are multiple avenues for understanding and mitigating the performance issues described above. Use the following checklist when experiencing query performance degradation over the HDInsight gateway:
52
51
53
52
* Use the **LIMIT** clause when executing large **SELECT** queries. The **LIMIT** clause will reduce the total rows reported to the client host. The **LIMIT** clause only affects result generation and doesn't change the query plan. To apply the **LIMIT** clause to the query plan, use the configuration `hive.limit.optimize.enable`. **LIMIT** can be combined with an offset using the argument form **LIMIT x,y**.
54
53
55
-
* Name your columns of interest when running **SELECT** queries instead of using **SELECT \***. Selecting fewer columns will decrease the amount of data read.
54
+
* Name your columns of interest when running **SELECT** queries instead of using **SELECT \***. Selecting fewer columns will lower the amount of data read.
56
55
57
56
* Try running the query of interest through Apache Beeline. If result retrieval via Apache Beeline takes an extended period of time,
58
57
expect delays when retrieving the same results via external tools.
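The first two checklist items can be sketched as a small Python helper that builds a HiveQL **SELECT** string with named columns and a **LIMIT** clause. The helper name `build_select` and the table and column names are illustrative assumptions. The helper doesn't escape identifiers, so it isn't suitable for untrusted input.

```python
def build_select(table: str, columns: list, limit: int, offset: int = 0) -> str:
    """Build a SELECT with explicit columns and a LIMIT clause.

    Uses the "LIMIT x,y" form (offset x, row count y) when an offset is given.
    Identifiers are inserted as-is; callers must pass trusted names only.
    """
    if not columns:
        raise ValueError("name the columns of interest instead of using SELECT *")
    if limit <= 0 or offset < 0:
        raise ValueError("limit must be positive and offset must not be negative")
    column_list = ", ".join(columns)
    limit_clause = f"LIMIT {offset},{limit}" if offset else f"LIMIT {limit}"
    return f"SELECT {column_list} FROM {table} {limit_clause}"


# Illustrative table and column names.
print(build_select("hivesampletable", ["clientid", "devicemake"], 100))
print(build_select("hivesampletable", ["clientid", "devicemake"], 100, offset=50))
```

To have the limit considered in the query plan as well, a session could issue `SET hive.limit.optimize.enable=true;` before the generated statement, as noted in the checklist.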