Commit 0a56ed0

Merge pull request #100204 from dagiro/freshness165
freshness165
2 parents a734bbd + 5ea5072 commit 0a56ed0

1 file changed

+38
-19
lines changed

articles/hdinsight/hadoop/apache-hadoop-use-sqoop-curl.md

Lines changed: 38 additions & 19 deletions
@@ -6,41 +6,56 @@ ms.author: hrasheed
66
ms.reviewer: jasonh
77
ms.service: hdinsight
88
ms.topic: conceptual
9-
ms.date: 04/15/2019
9+
ms.date: 01/06/2020
1010
---
1111

1212
# Run Apache Sqoop jobs in HDInsight with Curl
13+
1314
[!INCLUDE [sqoop-selector](../../../includes/hdinsight-selector-use-sqoop.md)]
1415

15-
Learn how to use Curl to run Apache Sqoop jobs on an Apache Hadoop cluster in HDInsight. This article demonstrates how to export data from Azure storage and import it into a SQL Server database using Curl. This article is a continuation of [Use Apache Sqoop with Hadoop in HDInsight](./hdinsight-use-sqoop.md).
16+
Learn how to use Curl to run Apache Sqoop jobs on an Apache Hadoop cluster in HDInsight. This article demonstrates how to export data from Azure Storage and import it into a SQL Server database using Curl. This article is a continuation of [Use Apache Sqoop with Hadoop in HDInsight](./hdinsight-use-sqoop.md).
1617

1718
Curl demonstrates how you can interact with HDInsight by using raw HTTP requests to run, monitor, and retrieve the results of Sqoop jobs. These requests go to the WebHCat REST API (formerly known as Templeton) provided by your HDInsight cluster.
1819

1920
## Prerequisites
2021

2122
* Completion of [Set up test environment](./hdinsight-use-sqoop.md#create-cluster-and-sql-database) from [Use Apache Sqoop with Hadoop in HDInsight](./hdinsight-use-sqoop.md).
2223

23-
* A client to query the Azure SQL database. Consider using [SQL Server Management Studio](../../sql-database/sql-database-connect-query-ssms.md) or [Visual Studio Code](../../sql-database/sql-database-connect-query-vscode.md).
24+
* A client to query the Azure SQL Database. Consider using [SQL Server Management Studio](../../sql-database/sql-database-connect-query-ssms.md) or [Visual Studio Code](../../sql-database/sql-database-connect-query-vscode.md).
2425

2526
* [Curl](https://curl.haxx.se/). Curl is a tool to transfer data from or to an HDInsight cluster.
2627

2728
* [jq](https://stedolan.github.io/jq/). The jq utility is used to process the JSON data returned from REST requests.
2829

30+
* Familiarity with Sqoop. For more information, see [Sqoop User Guide](https://sqoop.apache.org/docs/1.4.7/SqoopUserGuide.html).
31+
2932
## Submit Apache Sqoop jobs by using Curl
3033

31-
Use Curl to export data using Apache Sqoop jobs from Azure storage to SQL Server.
34+
Use Curl to export data using Apache Sqoop jobs from Azure Storage to SQL Server.
3235

3336
> [!NOTE]
3437
> When using Curl or any other REST communication with WebHCat, you must authenticate the requests by providing the user name and password for the HDInsight cluster administrator. You must also use the cluster name as part of the Uniform Resource Identifier (URI) used to send the requests to the server.
3538
3639
For the commands in this section, replace `USERNAME` with the user to authenticate to the cluster, and replace `PASSWORD` with the password for the user account. Replace `CLUSTERNAME` with the name of your cluster.
37-
40+
3841
The REST API is secured via [basic authentication](https://en.wikipedia.org/wiki/Basic_access_authentication). You should always make requests by using Secure HTTP (HTTPS) to help ensure that your credentials are securely sent to the server.
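
As a rough illustration of what basic authentication over HTTPS involves, the Python sketch below builds the `Authorization` header value that a client such as Curl derives from a user name and password, and rejects URLs that are not HTTPS. The function names `basic_auth_header` and `require_https` are hypothetical and exist only for this example; no request is sent.

```python
import base64
from urllib.parse import urlsplit


def basic_auth_header(username: str, password: str) -> str:
    """Return the Authorization header value for HTTP basic authentication."""
    # Basic authentication is base64("username:password"): an encoding, not encryption,
    # which is why the request itself should travel over HTTPS.
    credentials = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


def require_https(url: str) -> str:
    """Return the URL unchanged, or raise if credentials would go over plain HTTP."""
    if urlsplit(url).scheme != "https":
        raise ValueError(f"refusing to send credentials to a non-HTTPS URL: {url}")
    return url
```

Because the header is only base64-encoded, anyone who can read the request can recover the password, so checking the URL scheme before sending is a cheap safeguard.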
3942

43+
1. For ease of use, set the variables below. This example is based on a Windows environment; revise as needed for your environment.
44+
45+
```cmd
46+
set CLUSTERNAME=
47+
set USERNAME=admin
48+
set PASSWORD=
49+
set SQLDATABASESERVERNAME=
50+
set SQLDATABASENAME=
51+
set SQLPASSWORD=
52+
set SQLUSER=sqluser
53+
```
54+
4055
1. From a command line, use the following command to verify that you can connect to your HDInsight cluster:
4156
4257
```cmd
43-
curl -u USERNAME:PASSWORD -G https://CLUSTERNAME.azurehdinsight.net/templeton/v1/status
58+
curl -u %USERNAME%:%PASSWORD% -G https://%CLUSTERNAME%.azurehdinsight.net/templeton/v1/status
4459
```
4560
4661
You should receive a response similar to the following:
@@ -49,65 +64,69 @@ The REST API is secured via [basic authentication](https://en.wikipedia.org/wiki
4964
{"status":"ok","version":"v1"}
5065
```
5166
52-
2. Replace `SQLDATABASESERVERNAME`, `USERNAME@SQLDATABASESERVERNAME`, `PASSWORD`, `SQLDATABASENAME` with the appropriate values from the prerequisites. Use the following to submit a sqoop job:
67+
1. Use the following command to submit a Sqoop job:
5368
5469
```cmd
55-
curl -u USERNAME:PASSWORD -d user.name=USERNAME -d command="export --connect jdbc:sqlserver://SQLDATABASESERVERNAME.database.windows.net;user=USERNAME@SQLDATABASESERVERNAME;password=PASSWORD;database=SQLDATABASENAME --table log4jlogs --export-dir /example/data/sample.log --input-fields-terminated-by \0x20 -m 1" -d statusdir="wasb:///example/data/sqoop/curl" https://CLUSTERNAME.azurehdinsight.net/templeton/v1/sqoop
70+
curl -u %USERNAME%:%PASSWORD% -d user.name=%USERNAME% -d command="export --connect jdbc:sqlserver://%SQLDATABASESERVERNAME%.database.windows.net;user=%SQLUSER%@%SQLDATABASESERVERNAME%;password=%SQLPASSWORD%;database=%SQLDATABASENAME% --table log4jlogs --export-dir /example/data/sample.log --input-fields-terminated-by \0x20 -m 1" -d statusdir="wasb:///example/data/sqoop/curl" https://%CLUSTERNAME%.azurehdinsight.net/templeton/v1/sqoop
5671
```
5772
5873
The parameters used in this command are as follows:
5974
60-
* **-d** - Since `-G` is not used, the request defaults to the POST method. `-d` specifies the data values that are sent with the request.
75+
* **-d** - Since `-G` isn't used, the request defaults to the POST method. `-d` specifies the data values that are sent with the request.
6176
6277
* **user.name** - The user that is running the command.
6378
6479
* **command** - The Sqoop command to execute.
6580
6681
* **statusdir** - The directory that the status for this job will be written to.
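
The three form fields listed above can also be assembled programmatically. The following Python sketch, under the assumption that you want the same `user.name`, `command`, and `statusdir` values as the Curl example, builds the Sqoop command string and the POST body. The helper names and the sample server, database, and password values are hypothetical. Note one difference from Curl: `-d` sends values as typed, whereas `urlencode` percent-encodes them.

```python
from urllib.parse import parse_qs, urlencode


def build_export_command(server, database, sql_user, sql_password,
                         table="log4jlogs", export_dir="/example/data/sample.log"):
    """Build the Sqoop export command string used in the Curl example."""
    connect = (
        f"jdbc:sqlserver://{server}.database.windows.net;"
        f"user={sql_user}@{server};password={sql_password};database={database}"
    )
    return (
        f"export --connect {connect} --table {table} --export-dir {export_dir} "
        "--input-fields-terminated-by \\0x20 -m 1"
    )


def build_sqoop_form(user_name, sqoop_command, status_dir):
    """Return the form-encoded POST body with the three fields sent via -d."""
    return urlencode({
        "user.name": user_name,      # the user that is running the command
        "command": sqoop_command,    # the Sqoop command to execute
        "statusdir": status_dir,     # where the job status is written
    })


# Placeholder values for illustration only.
command = build_export_command("myserver", "mydb", "sqluser", "example-password")
body = build_sqoop_form("admin", command, "wasb:///example/data/sqoop/curl")
print(parse_qs(body)["statusdir"])
```

Percent-encoding matters when a password contains characters such as `&`, `+`, or `%`, which would otherwise be misread as form syntax.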
6782
68-
This command shall return a job ID that can be used to check the status of the job.
83+
This command will return a job ID that can be used to check the status of the job.
6984
7085
```json
7186
{"id":"job_1415651640909_0026"}
7287
```
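
If you script these steps rather than copy the job ID by hand, the response can be parsed directly. This Python sketch assumes the response body has the `{"id": ...}` shape shown above; `extract_job_id` and `job_status_url` are hypothetical helper names, and the URL pattern is the one used by the status command in this article.

```python
import json


def extract_job_id(response_text: str) -> str:
    """Return the job ID from the JSON body returned when the job is submitted."""
    payload = json.loads(response_text)
    return payload["id"]


def job_status_url(cluster_name: str, job_id: str) -> str:
    """Build the WebHCat URL for checking the status of a job."""
    return f"https://{cluster_name}.azurehdinsight.net/templeton/v1/jobs/{job_id}"


job_id = extract_job_id('{"id":"job_1415651640909_0026"}')
print(job_status_url("CLUSTERNAME", job_id))
```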
7388
74-
3. To check the status of the job, use the following command. Replace `JOBID` with the value returned in the previous step. For example, if the return value was `{"id":"job_1415651640909_0026"}`, then `JOBID` would be `job_1415651640909_0026`.
89+
1. To check the status of the job, use the following command. Replace `JOBID` with the value returned in the previous step. For example, if the return value was `{"id":"job_1415651640909_0026"}`, then `JOBID` would be `job_1415651640909_0026`. Revise the location of `jq` as needed.
7590
7691
```cmd
77-
curl -G -u USERNAME:PASSWORD -d user.name=USERNAME https://CLUSTERNAME.azurehdinsight.net/templeton/v1/jobs/JOBID | jq .status.state
92+
set JOBID=job_1415651640909_0026
93+
94+
curl -G -u %USERNAME%:%PASSWORD% -d user.name=%USERNAME% https://%CLUSTERNAME%.azurehdinsight.net/templeton/v1/jobs/%JOBID% | C:\HDI\jq-win64.exe .status.state
7895
```
7996
8097
If the job has finished, the state will be **SUCCEEDED**.
81-
98+
8299
> [!NOTE]
83100
> This Curl request returns a JavaScript Object Notation (JSON) document with information about the job; jq is used to retrieve only the state value.
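
Where jq isn't available, the same field can be read with a few lines of Python. The sketch below mirrors the `.status.state` lookup; the sample document is a trimmed, hypothetical stand-in for the much larger JSON that the jobs endpoint returns, and `job_state` is a name introduced only for this example.

```python
import json


def job_state(response_text: str):
    """Return the value at status.state, or None when the field is absent."""
    payload = json.loads(response_text)
    status = payload.get("status") or {}
    return status.get("state")


# Trimmed, hypothetical sample; a real response carries many more fields.
sample = '{"id": "job_1415651640909_0026", "status": {"state": "SUCCEEDED"}}'
print(job_state(sample))
```

Returning `None` for a missing field lets a polling loop distinguish "not reported yet" from a terminal state without raising an exception.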
84101
85-
4. Once the state of the job has changed to **SUCCEEDED**, you can retrieve the results of the job from Azure Blob storage. The `statusdir` parameter passed with the query contains the location of the output file; in this case, `wasb:///example/data/sqoop/curl`. This address stores the output of the job in the `example/data/sqoop/curl` directory on the default storage container used by your HDInsight cluster.
102+
1. Once the state of the job has changed to **SUCCEEDED**, you can retrieve the results of the job from Azure Blob storage. The `statusdir` parameter passed with the query contains the location of the output file; in this case, `wasb:///example/data/sqoop/curl`. This address stores the output of the job in the `example/data/sqoop/curl` directory on the default storage container used by your HDInsight cluster.
86103
87104
You can use the Azure portal to access stderr and stdout blobs.
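
To locate those blobs from a script, the `statusdir` value can be turned into blob names relative to the default container. This Python sketch assumes the stdout and stderr blobs sit directly under the status directory with those exact names; the function name `status_blob_names` is hypothetical.

```python
import posixpath
from urllib.parse import urlsplit


def status_blob_names(statusdir: str, names=("stdout", "stderr")):
    """Map a wasb:/// status directory to blob names in the default container."""
    parts = urlsplit(statusdir)
    if parts.scheme != "wasb":
        raise ValueError(f"expected a wasb:/// location, got: {statusdir}")
    # An empty authority (wasb:///...) refers to the cluster's default container,
    # so the blob name is the path without its leading slash.
    prefix = parts.path.lstrip("/")
    return [posixpath.join(prefix, name) for name in names]


print(status_blob_names("wasb:///example/data/sqoop/curl"))
```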
88105
89-
5. To verify that data was exported, use the following queries from your SQL client to view the exported data:
106+
1. To verify that data was exported, use the following queries from your SQL client to view the exported data:
90107
91108
```sql
92109
SELECT COUNT(*) FROM [dbo].[log4jlogs] WITH (NOLOCK);
93110
SELECT TOP(25) * FROM [dbo].[log4jlogs] WITH (NOLOCK);
94111
```
95112
96113
## Limitations
97-
* Bulk export - With Linux-based HDInsight, the Sqoop connector used to export data to Microsoft SQL Server or Azure SQL Database does not currently support bulk inserts.
114+
115+
* Bulk export - With Linux-based HDInsight, the Sqoop connector used to export data to Microsoft SQL Server or Azure SQL Database doesn't currently support bulk inserts.
98116
* Batching - With Linux-based HDInsight, when you use the `-batch` switch for inserts, Sqoop performs multiple inserts instead of batching the insert operations.
99117
100118
## Summary
119+
101120
As demonstrated in this document, you can use a raw HTTP request to run, monitor, and view the results of Sqoop jobs on your HDInsight cluster.
102121
103122
For more information on the REST interface used in this article, see the <a href="https://sqoop.apache.org/docs/1.99.3/RESTAPI.html" target="_blank">Apache Sqoop REST API guide</a>.
104123
105124
## Next steps
125+
106126
[Use Apache Sqoop with Apache Hadoop on HDInsight](hdinsight-use-sqoop.md)
107127
108128
For other HDInsight articles involving curl:
109-
129+
110130
* [Create Apache Hadoop clusters using the Azure REST API](../hdinsight-hadoop-create-linux-clusters-curl-rest.md)
111131
* [Run Apache Hive queries with Apache Hadoop in HDInsight using REST](apache-hadoop-use-hive-curl.md)
112-
* [Run MapReduce jobs with Apache Hadoop on HDInsight using REST](apache-hadoop-use-mapreduce-curl.md)
113-
132+
* [Run MapReduce jobs with Apache Hadoop on HDInsight using REST](apache-hadoop-use-mapreduce-curl.md)
