Commit 7ba8651

Merge pull request #98386 from dagiro/freshness107
freshness107
2 parents dbf10be + 442203a commit 7ba8651

articles/hdinsight/domain-joined/hdinsight-use-oozie-domain-joined-clusters.md

Lines changed: 104 additions & 87 deletions

---
title: Apache Oozie workflows & Enterprise Security - Azure HDInsight
description: Secure Apache Oozie workflows using the Azure HDInsight Enterprise Security Package. Learn how to define an Oozie workflow and submit an Oozie job.
author: omidm1
ms.author: omidm
ms.reviewer: jasonh
ms.service: hdinsight
ms.topic: conceptual
ms.custom: hdinsightactive,seodec18
ms.date: 12/09/2019
---

# Run Apache Oozie in HDInsight Hadoop clusters with Enterprise Security Package

Apache Oozie is a workflow and coordination system that manages Apache Hadoop jobs. Oozie is integrated with the Hadoop stack, and it supports the following jobs:

- Apache MapReduce
- Apache Pig
- Apache Hive

You can also use Oozie to schedule jobs that are specific to a system, like Java programs or shell scripts.

## Prerequisite

An Azure HDInsight Hadoop cluster with Enterprise Security Package (ESP). See [Configure HDInsight clusters with ESP](./apache-domain-joined-configure-using-azure-adds.md).

> [!NOTE]
> For detailed instructions on how to use Oozie on non-ESP clusters, see [Use Apache Oozie workflows in Linux-based Azure HDInsight](../hdinsight-use-oozie-linux-mac.md).

## Connect to an ESP cluster

For more information on Secure Shell (SSH), see [Connect to HDInsight (Hadoop) using SSH](../hdinsight-hadoop-linux-use-ssh-unix.md).

1. Connect to the HDInsight cluster by using SSH:

    ```bash
    ssh [DomainUserName]@<clustername>-ssh.azurehdinsight.net
    ```

1. To verify successful Kerberos authentication, use the `klist` command. If no valid ticket is listed, use `kinit` to start Kerberos authentication.
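
    As a minimal sketch of that check (the principal shown is a hypothetical placeholder; use your own domain account):

    ```bash
    # List cached Kerberos tickets for the current session.
    klist

    # If the cache is empty, request a new ticket-granting ticket for your domain account.
    kinit DomainUserName@DOMAIN.COM
    ```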

1. Sign in to the HDInsight gateway to register the OAuth token required to access Azure Data Lake Storage:

    ```bash
    curl -I -u [DomainUserName@Domain.com]:[DomainUserPassword] https://<clustername>.azurehdinsight.net
    ```

    A status response code of **200 OK** indicates successful registration. Check the username and password if an unauthorized response, such as **401**, is received.
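
    If you want to script this check, a minimal sketch (not part of the original steps; the credentials and cluster name are placeholders):

    ```bash
    # Capture only the HTTP status code from the gateway response.
    status=$(curl -s -o /dev/null -w "%{http_code}" \
        -u "DomainUserName@Domain.com:DomainUserPassword" \
        "https://<clustername>.azurehdinsight.net")

    # 200 means the OAuth token was registered; 401 usually means bad credentials.
    if [ "$status" -eq 200 ]; then
        echo "Gateway sign-in succeeded."
    else
        echo "Gateway sign-in failed with HTTP $status; check the username and password."
    fi
    ```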

## Define the workflow

Oozie workflow definitions are written in Apache Hadoop Process Definition Language (hPDL). hPDL is an XML process definition language. Take the following steps to define the workflow:

1. Set up a domain user's workspace:

    ```bash
    hdfs dfs -mkdir /user/<DomainUser>
    cd /home/<DomainUserPath>
    cp /usr/hdp/<ClusterVersion>/oozie/doc/oozie-examples.tar.gz .
    tar -xvf oozie-examples.tar.gz
    hdfs dfs -put examples /user/<DomainUser>/
    ```

    Replace `DomainUser` with the domain user name.
    Replace `DomainUserPath` with the home directory path for the domain user.
    Replace `ClusterVersion` with your cluster data platform version.
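
    If you're unsure of the data platform version, one way to discover it (assuming the standard HDP layout used on the cluster) is to list the install directory:

    ```bash
    # Each HDP release is installed under a versioned directory, for example 2.6.5.3006-29.
    ls /usr/hdp
    ```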

2. Use the following statement to create and edit a new file:

    ```bash
    nano workflow.xml
    ```

3. After the nano editor opens, enter the following XML as the file contents:

    ```xml
    <?xml version="1.0" encoding="UTF-8"?>
    <workflow-app xmlns="uri:oozie:workflow:0.4" name="map-reduce-wf">
    <credentials>
    <!-- ... remainder of the workflow definition not shown in this diff ... -->
    </kill>
    <end name="end" />
    </workflow-app>
    ```

4. Replace `clustername` with the name of the cluster.

5. To save the file, select **Ctrl+X**. Enter **Y**. Then select **Enter**.

    The workflow is divided into two parts:

    - **Credential.** This section takes in the credentials that are used for authenticating Oozie actions:

        This example uses authentication for Hive actions. To learn more, see [Action Authentication](https://oozie.apache.org/docs/4.2.0/DG_ActionAuthentication.html).

        The credential service allows Oozie actions to impersonate the user for accessing Hadoop services.

    - **Action.** This section has three actions: map-reduce, Hive server 2, and Hive server 1:

        - The map-reduce action runs an example from an Oozie package for map-reduce that outputs the aggregated word count.

    The Hive actions use the credentials defined in the credentials section for authentication by using the keyword `cred` in the action element.
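
    For illustration only, a hypothetical fragment showing how an action opts in to a named credential with the `cred` keyword (the names and property values here are placeholders, not the article's full workflow):

    ```xml
    <credentials>
        <!-- A named hive2 credential; values typically come from job.properties. -->
        <credential name="hs2-creds" type="hive2">
            <property>
                <name>hive2.server.principal</name>
                <value>${jdbcPrincipal}</value>
            </property>
            <property>
                <name>hive2.jdbc.url</name>
                <value>${jdbcURL}</value>
            </property>
        </credential>
    </credentials>
    <!-- ... -->
    <!-- The action references the credential by name. -->
    <action name="myHive2" cred="hs2-creds">
        <!-- Hive server 2 action body goes here. -->
    </action>
    ```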

6. Use the following command to copy the `workflow.xml` file to `/user/<domainuser>/examples/apps/map-reduce/workflow.xml`:

    ```bash
    hdfs dfs -put workflow.xml /user/<domainuser>/examples/apps/map-reduce/workflow.xml
    ```

7. Replace `domainuser` with your username for the domain.
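
    To confirm the upload, an optional check (not part of the original steps):

    ```bash
    # workflow.xml should now be listed alongside the sample job files.
    hdfs dfs -ls /user/<domainuser>/examples/apps/map-reduce/
    ```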

## Define the properties file for the Oozie job

1. Use the following statement to create and edit a new file for job properties:

    ```bash
    nano job.properties
    ```

2. After the nano editor opens, use the following settings as the contents of the file:

    ```properties
    nameNode=adl://home
    jobTracker=headnodehost:8050
    queueName=default
    examplesRoot=examples
    oozie.wf.application.path=${nameNode}/user/[domainuser]/examples/apps/map-reduce/workflow.xml
    hiveScript1=${nameNode}/user/${user.name}/countrowshive1.hql
    hiveScript2=${nameNode}/user/${user.name}/countrowshive2.hql
    oozie.use.system.libpath=true
    user.name=[domainuser]
    jdbcPrincipal=hive/hn0-<ClusterShortName>.<Domain>.com@<Domain>.COM
    jdbcURL=[jdbcurlvalue]
    hiveOutputDirectory1=${nameNode}/user/${user.name}/hiveresult1
    hiveOutputDirectory2=${nameNode}/user/${user.name}/hiveresult2
    ```

    - Use the `adl://home` URI for the `nameNode` property if you have Azure Data Lake Storage Gen1 as your primary cluster storage. If you're using Azure Blob Storage, then change this to `wasb://home`. If you're using Azure Data Lake Storage Gen2, then change this to `abfs://home`.
    - Replace `domainuser` with your username for the domain.
    - Replace `ClusterShortName` with the short name for the cluster. For example, if the cluster name is `https://sechadoopcontoso.azurehdinsight.net`, the `ClusterShortName` is the first six characters of the cluster name: **sechad**.
    - Replace `jdbcurlvalue` with the JDBC URL from the Hive configuration. An example is `jdbc:hive2://headnodehost:10001/;transportMode=http`.
    - To save the file, select **Ctrl+X**, enter `Y`, and then select **Enter**.

    This properties file needs to be present locally when running Oozie jobs.

You can create the two Hive scripts for Hive server 1 and Hive server 2 as shown in the following sections.

### Hive server 1 file

1. Create and edit a file for Hive server 1 actions:

    ```bash
    nano countrowshive1.hql
    ```

2. Create the script:

    ```sql
    INSERT OVERWRITE DIRECTORY '${hiveOutputDirectory1}'
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    select devicemake from hivesampletable limit 2;
    ```

3. Save the file to Apache Hadoop Distributed File System (HDFS):

    ```bash
    hdfs dfs -put countrowshive1.hql countrowshive1.hql
    ```

### Hive server 2 file

1. Create and edit a file for Hive server 2 actions:

    ```bash
    nano countrowshive2.hql
    ```

2. Create the script:

    ```sql
    INSERT OVERWRITE DIRECTORY '${hiveOutputDirectory2}'
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    select devicemodel from hivesampletable limit 2;
    ```

3. Save the file to HDFS:

    ```bash
    hdfs dfs -put countrowshive2.hql countrowshive2.hql
    ```
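
After the workflow runs, you can inspect what the scripts wrote. A usage sketch, assuming the output directories set in `job.properties` (the part-file name `000000_0` is typical Hive output and may vary):

```bash
# List and view the comma-delimited rows written by the Hive server 1 action.
hdfs dfs -ls /user/<domainuser>/hiveresult1
hdfs dfs -cat /user/<domainuser>/hiveresult1/000000_0
```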

Submitting Oozie jobs for ESP clusters is like submitting Oozie jobs in non-ESP clusters. For more information, see [Use Apache Oozie with Apache Hadoop to define and run a workflow on Linux-based Azure HDInsight](../hdinsight-use-oozie-linux-mac.md).
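
As a minimal sketch of a submission, assuming the standard Oozie command-line client on the cluster head node and the `job.properties` file created earlier:

```bash
# Point the Oozie client at the cluster's Oozie server.
export OOZIE_URL=http://headnodehost:11000/oozie

# Submit and start the workflow defined by the local properties file.
oozie job -config job.properties -run
```

The command prints the new job ID, which you can use to track the workflow.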

## Results from an Oozie job submission

Oozie jobs are run on behalf of the user, so both Apache Hadoop YARN and Apache Ranger audit logs show the jobs being run as the impersonated user. The command-line interface output of an Oozie job looks like the following code:

```output
Job ID : 0000015-180626011240801-oozie-oozi-W
------------------------------------------------------------------------------------------------
Workflow Name : map-reduce-wf
App Path      : adl://home/user/alicetest/examples/apps/map-reduce/wf.xml
Status        : SUCCEEDED
Run           : 0
User          : alicetest
Group         : -
Created       : 2018-06-26 19:25 GMT
Started       : 2018-06-26 19:25 GMT
Last Modified : 2018-06-26 19:30 GMT
Ended         : 2018-06-26 19:30 GMT
CoordAction ID: -

Actions
------------------------------------------------------------------------------------------------
ID                                            Status  Ext ID                  ExtStatus  ErrCode
------------------------------------------------------------------------------------------------
0000015-180626011240801-oozie-oozi-W@:start:  OK      -                       OK         -
------------------------------------------------------------------------------------------------
0000015-180626011240801-oozie-oozi-W@mr-test  OK      job_1529975666160_0051  SUCCEEDED  -
------------------------------------------------------------------------------------------------
0000015-180626011240801-oozie-oozi-W@myHive2  OK      job_1529975666160_0053  SUCCEEDED  -
------------------------------------------------------------------------------------------------
0000015-180626011240801-oozie-oozi-W@myHive   OK      job_1529975666160_0055  SUCCEEDED  -
------------------------------------------------------------------------------------------------
0000015-180626011240801-oozie-oozi-W@end      OK      -                       OK         -
-----------------------------------------------------------------------------------------------
```
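
This status view is the output of the standard Oozie status command, and you can rerun it at any time with the job ID printed at submission:

```bash
# Query the current status of a specific workflow job.
oozie job -info 0000015-180626011240801-oozie-oozi-W
```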

The Ranger audit logs for Hive server 2 actions show Oozie running the action for the user. The Ranger and YARN views are visible only to the cluster admin.

The Oozie web UI provides a web-based view into the status of Oozie jobs on the cluster.

2. Follow the [Oozie web UI](../hdinsight-use-oozie-linux-mac.md) steps to enable SSH tunneling to the edge node and access the web UI.

## Next steps

- [Use Apache Oozie with Apache Hadoop to define and run a workflow on Linux-based Azure HDInsight](../hdinsight-use-oozie-linux-mac.md).
- [Connect to HDInsight (Apache Hadoop) using SSH](../hdinsight-hadoop-linux-use-ssh-unix.md#domainjoined).
