
Commit e670ea7

Merge pull request #78926 from dagiro/freshness115

freshness115

2 parents c995133 + a9d1d01

File tree

1 file changed: +63 -39 lines changed

articles/hdinsight/hdinsight-using-spark-query-hbase.md

Lines changed: 63 additions & 39 deletions
@@ -7,19 +7,20 @@ ms.reviewer: jasonh
ms.service: hdinsight
ms.custom: hdinsightactive
ms.topic: conceptual
ms.date: 06/06/2019
---

# Use Apache Spark to read and write Apache HBase data

Apache HBase is typically queried either with its low-level API (scans, gets, and puts) or with a SQL syntax using Apache Phoenix. Apache also provides the Apache Spark HBase Connector, a convenient and performant alternative for querying and modifying data stored by HBase.
## Prerequisites

* Two separate HDInsight clusters deployed in the same virtual network: one HBase cluster, and one Spark cluster with at least Spark 2.1 (HDInsight 3.6) installed. For more information, see [Create Linux-based clusters in HDInsight using the Azure portal](hdinsight-hadoop-create-linux-clusters-portal.md).

* An SSH client. For more information, see [Connect to HDInsight (Apache Hadoop) using SSH](hdinsight-hadoop-linux-use-ssh-unix.md).

* The [URI scheme](hdinsight-hadoop-linux-information.md#URI-and-scheme) for your cluster's primary storage. This scheme is wasb:// for Azure Blob Storage, abfs:// for Azure Data Lake Storage Gen2, or adl:// for Azure Data Lake Storage Gen1. If secure transfer is enabled for Blob Storage or Data Lake Storage Gen2, the URI is wasbs:// or abfss://, respectively. See also [secure transfer](../storage/common/storage-require-secure-transfer.md). An example URI follows this list.
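For example, a fully formed path on secure-transfer-enabled Blob Storage might look like the following; the container and account names here are hypothetical, for illustration only:

```
wasbs://mycontainer@mystorageaccount.blob.core.windows.net/example/data/
```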
## Overall process
@@ -34,38 +35,47 @@ The high-level process for enabling your Spark cluster to query your HDInsight c

## Prepare sample data in Apache HBase

In this step, you create and populate a table in Apache HBase that you can then query using Spark.

1. Use the `ssh` command to connect to your HBase cluster. Edit the command below by replacing `HBASECLUSTER` with the name of your HBase cluster, and then enter the command:

    ```cmd
    ssh sshuser@HBASECLUSTER-ssh.azurehdinsight.net
    ```

2. Use the `hbase shell` command to start the HBase interactive shell. Enter the following command in your SSH connection:

    ```bash
    hbase shell
    ```

3. Use the `create` command to create an HBase table with two column families. Enter the following command:

    ```hbase
    create 'Contacts', 'Personal', 'Office'
    ```
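
    Optionally, you can confirm the table and its column families with the `describe` command (an extra check, not one of the original steps):

    ```hbase
    describe 'Contacts'
    ```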

4. Use the `put` command to insert values at a specified column in a specified row in a particular table. Enter the following commands:

    ```hbase
    put 'Contacts', '1000', 'Personal:Name', 'John Dole'
    put 'Contacts', '1000', 'Personal:Phone', '1-425-000-0001'
    put 'Contacts', '1000', 'Office:Phone', '1-425-000-0002'
    put 'Contacts', '1000', 'Office:Address', '1111 San Gabriel Dr.'
    put 'Contacts', '8396', 'Personal:Name', 'Calvin Raji'
    put 'Contacts', '8396', 'Personal:Phone', '230-555-0191'
    put 'Contacts', '8396', 'Office:Phone', '230-555-0191'
    put 'Contacts', '8396', 'Office:Address', '5415 San Gabriel Dr.'
    ```
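
    Optionally, use the `scan` command to verify the rows you just wrote (an extra check, not one of the original steps):

    ```hbase
    scan 'Contacts'
    ```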

5. Use the `exit` command to stop the HBase interactive shell. Enter the following command:

    ```hbase
    exit
    ```

## Copy hbase-site.xml to Spark cluster

Copy `hbase-site.xml` from local storage to the root of your Spark cluster's default storage. Edit the command below to reflect your configuration. Then, from your open SSH session to the HBase cluster, enter the command:
@@ -74,23 +84,27 @@ Copy the hbase-site.xml from local storage to the root of your Spark cluster's d

| Syntax value | New value |
|---|---|
|`SPARK_STORAGE_CONTAINER`|Replace with the default storage container name used for the Spark cluster.|
|`SPARK_STORAGE_ACCOUNT`|Replace with the default storage account name used for the Spark cluster.|

```bash
hdfs dfs -copyFromLocal /etc/hbase/conf/hbase-site.xml wasbs://SPARK_STORAGE_CONTAINER@SPARK_STORAGE_ACCOUNT.blob.core.windows.net/
```

Then exit your SSH connection to your HBase cluster.
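For example, with a hypothetical container named `mycontainer` and a hypothetical storage account named `mystorageaccount`, the edited command would read:

```bash
hdfs dfs -copyFromLocal /etc/hbase/conf/hbase-site.xml wasbs://mycontainer@mystorageaccount.blob.core.windows.net/
```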

## Put hbase-site.xml on your Spark cluster

1. Connect to the head node of your Spark cluster using SSH.

2. Enter the command below to copy `hbase-site.xml` from your Spark cluster's default storage to the Spark 2 configuration folder on the cluster's local storage:

    ```bash
    sudo hdfs dfs -copyToLocal /hbase-site.xml /etc/spark2/conf
    ```
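
    You can confirm the copy succeeded by listing the file (an optional check, not part of the original steps):

    ```bash
    ls -l /etc/spark2/conf/hbase-site.xml
    ```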

## Run Spark Shell referencing the Spark HBase Connector

1. From your open SSH session to the Spark cluster, enter the command below to start a Spark shell:

    ```bash
    spark-shell --packages com.hortonworks:shc-core:1.1.1-2.1-s_2.11 --repositories https://repo.hortonworks.com/content/groups/public/
    ```
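
    In the full article, the next steps define a catalog object that maps a Spark schema onto the HBase table and then query it as a DataFrame. The following is a minimal sketch of that pattern with the shc-core connector, assuming the `Contacts` table created earlier; it is illustrative rather than the article's exact code, and it ends with the query whose results appear in step 9:

    ```scala
    // Paste into the Spark shell started above. Sketch only: the catalog JSON
    // maps Spark column names to the 'Contacts' table's column families/qualifiers.
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

    def catalog = s"""{
        |"table":{"namespace":"default", "name":"Contacts"},
        |"rowkey":"key",
        |"columns":{
        |"rowkey":{"cf":"rowkey", "col":"key", "type":"string"},
        |"officeAddress":{"cf":"Office", "col":"Address", "type":"string"},
        |"officePhone":{"cf":"Office", "col":"Phone", "type":"string"},
        |"personalName":{"cf":"Personal", "col":"Name", "type":"string"},
        |"personalPhone":{"cf":"Personal", "col":"Phone", "type":"string"}
        |}
    |}""".stripMargin

    // Read the HBase table through the connector as a DataFrame.
    def withCatalog(cat: String): DataFrame = {
      spark.sqlContext
        .read
        .options(Map(HBaseTableCatalog.tableCatalog -> cat))
        .format("org.apache.spark.sql.execution.datasources.hbase")
        .load()
    }

    val df = withCatalog(catalog)

    // Register a temp view and run the query whose output step 9 shows.
    df.createOrReplaceTempView("contacts")
    spark.sqlContext.sql("select personalName, officeAddress from contacts").show()
    ```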

@@ -179,12 +193,14 @@ In this step, you define a catalog object that maps the schema from Apache Spark

9. You should see results like these:

    ```output
    +-------------+--------------------+
    | personalName|       officeAddress|
    +-------------+--------------------+
    |    John Dole|1111 San Gabriel Dr.|
    |  Calvin Raji|5415 San Gabriel Dr.|
    +-------------+--------------------+
    ```

## Insert new data

@@ -223,13 +239,21 @@ In this step, you define a catalog object that maps the schema from Apache Spark
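In the full article, steps 1 through 4 build a DataFrame holding a new contact and write it to HBase through the connector. A minimal sketch of that write path, reusing the hypothetical `catalog` from the earlier sketch; the record values here are illustrative and simply mirror the output in step 5:

```scala
// Sketch only: append one row through the connector from the same Spark shell
// session; assumes the catalog defined in the earlier sketch is still in scope.
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

case class ContactRecord(
  rowkey: String,
  officeAddress: String,
  officePhone: String,
  personalName: String,
  personalPhone: String)

// Illustrative new contact matching the row that appears in the output below.
val newContact = ContactRecord("16891", "40 Ellis St.", "674-555-0110", "John Jackson", "230-555-0194")

// HBaseTableCatalog.newTable is the region count used only if the table
// must be created; writes to the existing table leave it unchanged.
spark.createDataFrame(Seq(newContact)).write
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable -> "5"))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .save()
```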

5. You should see output like this:

    ```output
    +------+--------------------+--------------+------------+--------------+
    |rowkey|       officeAddress|   officePhone|personalName| personalPhone|
    +------+--------------------+--------------+------------+--------------+
    |  1000|1111 San Gabriel Dr.|1-425-000-0002|   John Dole|1-425-000-0001|
    | 16891|        40 Ellis St.|  674-555-0110|John Jackson|  230-555-0194|
    |  8396|5415 San Gabriel Dr.|  230-555-0191| Calvin Raji|  230-555-0191|
    +------+--------------------+--------------+------------+--------------+
    ```

6. Close the Spark shell by entering the following command:

    ```scala
    :q
    ```

## Next steps
