articles/hdinsight/hdinsight-using-spark-query-hbase.md
+11 −11 (11 additions, 11 deletions)
@@ -4,7 +4,7 @@ description: Use the Spark HBase Connector to read and write data from a Spark c
ms.service: hdinsight
ms.topic: how-to
ms.custom: hdinsightactive,seoapr2020
- ms.date: 04/01/2022
+ ms.date: 12/09/2022
---
# Use Apache Spark to read and write Apache HBase data
@@ -31,7 +31,7 @@ The high-level process for enabling your Spark cluster to query your HBase clust
In this step, you create and populate a table in Apache HBase that you can then query using Spark.
- 1. Use the `ssh` command to connect to your HBase cluster. Edit the command below by replacing `HBASECLUSTER` with the name of your HBase cluster, and then enter the command:
+ 1. Use the `ssh` command to connect to your HBase cluster. Edit the command by replacing `HBASECLUSTER` with the name of your HBase cluster, and then enter the command:
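A sketch of the command, assuming the default `sshuser` SSH account (replace `HBASECLUSTER` with your cluster name):

```bash
ssh sshuser@HBASECLUSTER-ssh.azurehdinsight.net
```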
@@ -70,14 +70,14 @@ In this step, you create and populate a table in Apache HBase that you can then
## Run scripts to set up connection between clusters
- To set up the communication between clusters, follow the below steps to run two scripts on your clusters. These scripts will automate the process of file copying described in 'Set up communication manually' section below.
+ To set up communication between the clusters, follow the steps below to run two scripts on your clusters. These scripts automate the file copying described in the 'Set up communication manually' section.
* The script you run from the HBase cluster will upload `hbase-site.xml` and HBase IP-mapping information to the default storage attached to your Spark cluster.
* The script that you run from the Spark cluster sets up two cron jobs to run two helper scripts periodically:
1. HBase cron job – downloads new `hbase-site.xml` files and HBase IP mapping from the Spark default storage account to the local node
2. Spark cron job – checks whether a Spark scaling occurred and whether the cluster is secure. If so, edits `/etc/hosts` to include the HBase IP mapping stored locally
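As a rough illustration of what the Spark-side script sets up (the helper script names and paths here are hypothetical; the real entries are created by the Script Action), the resulting crontab might resemble:

```bash
# Hypothetical crontab entries; actual helper names/paths are set by the Script Action.
*/1 * * * * /usr/local/bin/spark-cron-helper.sh   # Spark cron: refresh /etc/hosts after scaling
*/30 * * * * /usr/local/bin/hbase-cron-helper.sh  # HBase cron: fetch new hbase-site.xml and IP mapping
```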
- __NOTE__: Before proceeding, make sure you have added the Spark cluster’s storage account to your HBase cluster as secondary storage account. Make sure you the scripts in order as indicated below.
+ __NOTE__: Before proceeding, make sure you've added the Spark cluster's storage account to your HBase cluster as a secondary storage account. Make sure you run the scripts in the order indicated.
1. Use [Script Action](hdinsight-hadoop-customize-cluster-linux.md#script-action-to-a-running-cluster) on your HBase cluster to apply the changes with the following considerations:
@@ -104,19 +104,19 @@ __NOTE__: Before proceeding, make sure you have added the Spark cluster’s stor
* You can specify how often you want this cluster to automatically check for updates. Default: `-s "*/1 * * * *" -h 0` (in this example, the Spark cron runs every minute, while the HBase cron doesn't run).
- * Since HBase cron is not set up by default, you need to rerun this script when perform scaling to your HBase cluster. If your HBase cluster scales often, you may choose to set up HBase cron job automatically. For example: `-h "*/30 * * * *"` configures the script to perform checks every 30 minutes. This will run HBase cron schedule periodically to automate downloading of new HBase information on the common storage account to local node.
+ * Since the HBase cron isn't set up by default, you need to rerun this script whenever you scale your HBase cluster. If your HBase cluster scales often, you may choose to set up the HBase cron job automatically. For example, `-h "*/30 * * * *"` configures the script to perform checks every 30 minutes, running the HBase cron schedule periodically to automate downloading of new HBase information from the common storage account to the local node.
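For instance, a hypothetical parameters string for the Script Action combining both flags (the values here are illustrative, not defaults) might look like:

```
-s "*/5 * * * *" -h "*/30 * * * *"
```

This would have the Spark cron check every 5 minutes and the HBase cron download updates every 30 minutes.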
## Set up communication manually (Optional, if the script in the above step fails)
__NOTE:__ These steps need to be performed every time one of the clusters undergoes a scaling activity.
- 1. Copy the hbase-site.xml from local storage to the root of your Spark cluster's default storage. Edit the command below to reflect your configuration. Then, from your open SSH session to the HBase cluster, enter the command:
+ 1. Copy `hbase-site.xml` from local storage to the root of your Spark cluster's default storage. Edit the command to reflect your configuration. Then, from your open SSH session to the HBase cluster, enter the command; a sketch of the assembled command follows the table:
| Syntax value | New value |
|---|---|
- |[URI scheme](hdinsight-hadoop-linux-information.md#URI-and-scheme) | Modify to reflect your storage. The syntax below is for blob storage with secure transfer enabled.|
+ |[URI scheme](hdinsight-hadoop-linux-information.md#URI-and-scheme) | Modify to reflect your storage. The syntax shown is for blob storage with secure transfer enabled.|
|`SPARK_STORAGE_CONTAINER`|Replace with the default storage container name used for the Spark cluster.|
|`SPARK_STORAGE_ACCOUNT`|Replace with the default storage account name used for the Spark cluster.|
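Assembled with the placeholders from the table, the command might look like this sketch (assuming `hbase-site.xml` sits at the usual `/etc/hbase/conf` path and `wasbs` for blob storage with secure transfer enabled):

```bash
hdfs dfs -copyFromLocal /etc/hbase/conf/hbase-site.xml wasbs://SPARK_STORAGE_CONTAINER@SPARK_STORAGE_ACCOUNT.blob.core.windows.net/
```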
@@ -131,13 +131,13 @@ __NOTE:__ These steps need to perform every time one of the clusters undergoes a
```
- 3. Connect to the head node of your Spark cluster using SSH. Edit the command below by replacing `SPARKCLUSTER` with the name of your Spark cluster, and then enter the command:
+ 3. Connect to the head node of your Spark cluster using SSH. Edit the command by replacing `SPARKCLUSTER` with the name of your Spark cluster, and then enter the command:
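A sketch of the command, again assuming the default `sshuser` account:

```bash
ssh sshuser@SPARKCLUSTER-ssh.azurehdinsight.net
```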
- 4. Enter the command below to copy `hbase-site.xml` from your Spark cluster's default storage to the Spark 2 configuration folder on the cluster's local storage:
+ 4. Enter the command to copy `hbase-site.xml` from your Spark cluster's default storage to the Spark 2 configuration folder on the cluster's local storage:
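A sketch of the copy, assuming the file was uploaded to the storage root in step 1 and the standard Spark 2 configuration path:

```bash
sudo hdfs dfs -copyToLocal /hbase-site.xml /etc/spark2/conf
```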
@@ -159,7 +159,7 @@ As an example, the following table lists two versions and the corresponding comm
2. Keep this Spark shell instance open and continue to [Define a catalog and query](#define-a-catalog-and-query). If you don't find the jars that correspond to your versions in the SHC Core repository, continue reading.
- For subsequent combinations of Spark and HBase versions, these artifacts are no longer published at above repo. You can build the jars directly from the [spark-hbase-connector](https://github.com/hortonworks-spark/shc) GitHub branch. For example, if you are running with Spark 2.4 and HBase 2.1, complete these steps:
+ For subsequent combinations of Spark and HBase versions, these artifacts are no longer published in the above repo. You can build the jars directly from the [spark-hbase-connector](https://github.com/hortonworks-spark/shc) GitHub branch. For example, if you're running with Spark 2.4 and HBase 2.1, complete these steps:
1. Clone the repo:
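A minimal sketch, using the repo URL linked above (checking out a branch matching your Spark/HBase versions and adding Maven profile flags may be needed; exact branch and profile names vary):

```bash
git clone https://github.com/hortonworks-spark/shc
cd shc
# Build the connector jar; add -P profile flags for your Spark/HBase versions if required.
mvn clean package -DskipTests
```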
@@ -224,7 +224,7 @@ In this step, you define a catalog object that maps the schema from Apache Spark
1. Identifies the rowkey as `key`, and maps the column names used in Spark to the column family, column name, and column type as used in HBase.
1. Defines the rowkey in detail as a named column (`rowkey`), which has a specific column family `cf` of `rowkey`.
- 1. Enter the command below to define a method that provides a DataFrame around your `Contacts` table in HBase:
+ 1. Enter the command to define a method that provides a DataFrame around your `Contacts` table in HBase:
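A sketch of what such a method typically looks like with the SHC connector, assuming the Spark shell was started with the connector package on the classpath and `catalog` holds the catalog JSON defined earlier:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

// Returns a DataFrame backed by the HBase table described by the catalog JSON.
// `spark` is the SparkSession predefined in the Spark shell.
def withCatalog(cat: String): DataFrame = {
  spark.sqlContext
    .read
    .options(Map(HBaseTableCatalog.tableCatalog -> cat)) // hand the schema mapping to SHC
    .format("org.apache.spark.sql.execution.datasources.hbase") // SHC data source
    .load()
}

// Example usage: val df = withCatalog(catalog); df.show()
```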