articles/hdinsight/hadoop/apache-hadoop-use-hive-beeline.md
ms.author: hrasheed
ms.reviewer: jasonh
ms.service: hdinsight
ms.topic: conceptual
ms.date: 02/25/2020
---
# Use the Apache Beeline client with Apache Hive
Learn how to use [Apache Beeline](https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients#HiveServer2Clients-Beeline–NewCommandLineShell) to run Apache Hive queries on HDInsight.
Beeline is a Hive client that is included on the head nodes of your HDInsight cluster. To install Beeline locally, see [Install beeline client](#install-beeline-client), below. Beeline uses JDBC to connect to HiveServer2, a service hosted on your HDInsight cluster. You can also use Beeline to access Hive on HDInsight remotely over the internet. The following examples provide the most common connection strings used to connect to HDInsight from Beeline.
## Types of connections
### Over public or private endpoints
When connecting to a cluster using the public or private endpoints, you must provide the cluster login account name (default `admin`) and password. For example, you can use Beeline from a client system to connect to the `clustername.azurehdinsight.net` address. This connection is made over port `443` and is encrypted using SSL.
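A representative command is sketched below; treat it as an illustration of the connection string's shape rather than the exact command for your cluster:

```bash
# Connects over HTTPS (port 443) to HiveServer2 behind the cluster gateway.
beeline -u 'jdbc:hive2://clustername.azurehdinsight.net:443/;ssl=true;transportMode=http;httpPath=/hive2' -n admin -p 'password'
```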
Replace `clustername` with the name of your HDInsight cluster. Replace `admin` with the cluster login account for your cluster. For ESP clusters, use the full UPN (for example, [email protected]). Replace `password` with the password for the cluster login account.
Private endpoints point to a basic load balancer, which can only be accessed from the VNETs peered in the same region. See [constraints on global VNet peering and load balancers](../../virtual-network/virtual-networks-faq.md#what-are-the-constraints-related-to-global-vnet-peering-and-load-balancers) for more information. You can use the `curl` command with the `-v` option to troubleshoot any connectivity problems with public or private endpoints before using Beeline.
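For example, a minimal reachability check might look like the following sketch; the endpoint host is a placeholder:

```bash
# -v prints the TLS handshake and HTTP exchange, which helps separate
# network problems from authentication problems.
curl -v https://clustername.azurehdinsight.net/
```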
---
### Use Beeline with Apache Spark
Apache Spark provides its own implementation of HiveServer2, which is sometimes referred to as the Spark Thrift server. This service uses Spark SQL to resolve queries instead of Hive, and may provide better performance depending on your query.
#### Through public or private endpoints
The connection string used is slightly different. Instead of `httpPath=/hive2`, it's `httpPath=/sparkhive2`. Replace `clustername` with the name of your HDInsight cluster. Replace `admin` with the cluster login account for your cluster. For ESP clusters, use the full UPN (for example, [email protected]). Replace `password` with the password for the cluster login account.
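A sketch of the Spark Thrift server variant of the connection string follows; as above, treat the values as placeholders:

```bash
# Identical to the Hive connection string except for the httpPath value.
beeline -u 'jdbc:hive2://clustername.azurehdinsight.net:443/;ssl=true;transportMode=http;httpPath=/sparkhive2' -n admin -p 'password'
```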
Private endpoints point to a basic load balancer, which can only be accessed from the VNETs peered in the same region. See [constraints on global VNet peering and load balancers](../../virtual-network/virtual-networks-faq.md#what-are-the-constraints-related-to-global-vnet-peering-and-load-balancers) for more information. You can use the `curl` command with the `-v` option to troubleshoot any connectivity problems with public or private endpoints before using Beeline.
---
## Prerequisites for examples
* A Hadoop cluster on HDInsight. See [Get Started with HDInsight on Linux](./apache-hadoop-linux-tutorial-get-started.md).
* Option 2: A local Beeline client.
## Run a Hive query
This example is based on using the Beeline client from an SSH connection.
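From the open SSH session, the connection to HiveServer2 on the head node generally looks like the following sketch (`headnodehost` resolves only from within the cluster). Once connected, you can enter HiveQL statements such as the block that follows:

```bash
# HiveServer2 in HTTP transport mode listens on port 10001 on the head node.
beeline -u 'jdbc:hive2://headnodehost:10001/;transportMode=http'
```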
```hiveql
DROP TABLE log4jLogs;
CREATE EXTERNAL TABLE log4jLogs (
    t1 string,
    t2 string,
    t3 string,
    t4 string,
    t5 string,
    t6 string,
    t7 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE LOCATION 'wasbs:///example/data/';
SELECT t4 AS sev, COUNT(*) AS count FROM log4jLogs
    WHERE t4 = '[ERROR]' AND INPUT__FILE__NAME LIKE '%.log'
    GROUP BY t4;
```
These statements do the following actions:

|Statement |Description |
|---|---|
|DROP TABLE|If the table exists, it's deleted.|
|CREATE EXTERNAL TABLE|Creates an **external** table in Hive. External tables only store the table definition in Hive. The data is left in the original location.|
|ROW FORMAT|How the data is formatted. In this case, the fields in each log are separated by a space.|
|STORED AS TEXTFILE LOCATION|Where the data is stored, and in what file format.|
|SELECT|Selects a count of all rows where column **t4** contains the value **[ERROR]**. This query returns a value of **3** because three rows contain this value.|
|INPUT__FILE__NAME LIKE '%.log'|Hive attempts to apply the schema to all files in the directory. In this case, the directory contains files that don't match the schema. To prevent garbage data in the results, this statement tells Hive that it should only return data from files ending in .log.|
> [!NOTE]
> External tables should be used when you expect the underlying data to be updated by an external source. For example, an automated data upload process or a MapReduce operation.

```
+----------+--------+--+
1 row selected (47.351 seconds)
```
6. Exit Beeline:

    ```bash
    !exit
    ```
## Run a HiveQL file
This is a continuation from the prior example. Use the following steps to create and run the file.

1. Open a new file named `query.hql`:

    ```bash
    nano query.hql
    ```
1. Use the following text as the contents of the file. This query creates a new 'internal' table named **errorLogs**:

    ```hiveql
    CREATE TABLE IF NOT EXISTS errorLogs (t1 string, t2 string, t3 string, t4 string, t5 string, t6 string, t7 string) STORED AS ORC;
    INSERT OVERWRITE TABLE errorLogs SELECT t1, t2, t3, t4, t5, t6, t7 FROM log4jLogs WHERE t4 = '[ERROR]';
    ```
These statements do the following actions:

|Statement |Description |
|---|---|
|CREATE TABLE IF NOT EXISTS|If the table doesn't already exist, it's created. Since the **EXTERNAL** keyword isn't used, this statement creates an internal table. Internal tables are stored in the Hive data warehouse and are managed completely by Hive.|
|STORED AS ORC|Stores the data in Optimized Row Columnar (ORC) format. ORC is a highly optimized and efficient format for storing Hive data.|
|INSERT OVERWRITE ... SELECT|Selects rows from the **log4jLogs** table that contain **[ERROR]**, then inserts the data into the **errorLogs** table.|
> [!NOTE]
> Unlike external tables, dropping an internal table deletes the underlying data as well.

1. To save the file, use **Ctrl**+**X**, then enter **Y**, and finally **Enter**.
1. Use the following to run the file using Beeline:
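    Assuming the head node connection shown earlier in this article, the command is sketched below (the `-i` behavior is described in the note that follows):

    ```bash
    # -i runs the statements in query.hql, then leaves Beeline open at the prompt.
    beeline -u 'jdbc:hive2://headnodehost:10001/;transportMode=http' -i query.hql
    ```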
> [!NOTE]
> The `-i` parameter starts Beeline and runs the statements in the `query.hql` file. Once the query completes, you arrive at the `jdbc:hive2://headnodehost:10001/>` prompt. You can also run a file using the `-f` parameter, which exits Beeline after the query completes.

1. To verify that the **errorLogs** table was created, use the following statement to return all the rows from **errorLogs**:
    ```hiveql
    SELECT * from errorLogs;
    ```
## Install beeline client

Although Beeline is included on the head nodes of your HDInsight cluster, you may want to install it locally.
1. Install a Java runtime if one isn't already present:

    ```bash
    sudo apt install openjdk-11-jre-headless
    ```
1. Open the bashrc file (usually found in ~/.bashrc): `nano ~/.bashrc`.
1. Amend the bashrc file. Add the following line at the end of the file:
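    On a stock Ubuntu or WSL system where the `openjdk-11-jre-headless` package above was installed, the line would typically look like the following sketch; the JDK path is an assumption, so verify it on your machine:

    ```bash
    # Adjust if `ls /usr/lib/jvm` shows a different directory name.
    export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
    ```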
1. Further amend the bashrc file. You'll need to identify the path to where the archives were unpacked. If you're using the [Windows Subsystem for Linux](https://docs.microsoft.com/windows/wsl/install-win10) and followed the steps exactly, your path would be `/mnt/c/Users/user/`, where `user` is your user name.
1. Open the file: `nano ~/.bashrc`.
1. Modify the commands below with the appropriate path and then enter them at the end of the bashrc file:
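    As a sketch only: the archive directory names below are hypothetical and depend on the Hadoop and Hive versions you downloaded, so substitute the directories actually created when you unpacked the archives:

    ```bash
    # Hypothetical unpack locations under the WSL-mounted Windows profile.
    export HADOOP_HOME=/mnt/c/Users/user/hadoop-2.7.3
    export HIVE_HOME=/mnt/c/Users/user/apache-hive-1.2.1-bin
    # Make the beeline wrapper script available on the PATH.
    PATH=$PATH:$HIVE_HOME/bin
    ```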