---
ms.author: hrasheed
ms.reviewer: jasonh
ms.service: hdinsight
ms.topic: conceptual
ms.date: 12/26/2019
---
# Install Apache Giraph on HDInsight Hadoop clusters, and use Giraph to process large-scale graphs
Learn how to install Apache Giraph on an HDInsight cluster. The script action feature of HDInsight allows you to customize your cluster by running a bash script. Scripts can be used to customize clusters during and after cluster creation.

## What is Giraph

[Apache Giraph](https://giraph.apache.org/) allows you to perform graph processing by using Hadoop, and can be used with Azure HDInsight. Graphs model relationships between objects. For example, graphs can model the connections between routers on a large network like the Internet, or the relationships between people on social networks. Graph processing allows you to reason about the relationships between objects in a graph, such as:

> Custom components, such as Giraph, receive commercially reasonable support to help you further troubleshoot the issue. Microsoft Support may be able to resolve the issue. If not, you must consult open source communities where deep expertise for that technology is found. For example, there are many community sites that can be used, like the [MSDN forum for HDInsight](https://social.msdn.microsoft.com/Forums/azure/en-US/home?forum=hdinsight) and [https://stackoverflow.com](https://stackoverflow.com). Also, Apache projects have project sites on [https://apache.org](https://apache.org), for example: [Hadoop](https://hadoop.apache.org/).

## What the script does

This script performs the following actions:

* Installs Giraph to `/usr/hdp/current/giraph`.
* Copies the `giraph-examples.jar` file to default storage (WASB) for your cluster: `/example/jars/giraph-examples.jar`.

## Install Giraph using Script Actions

A sample script to install Giraph on an HDInsight cluster is available at `https://hdiconfigactions.blob.core.windows.net/linuxgiraphconfigactionv01/giraph-installer-v01.sh`.

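If you prefer to script the deployment against an existing cluster instead of using the portal, the installer can also be applied from the Azure CLI. The following is a sketch only: the `az hdinsight script-action execute` parameter names, the resource group `myResourceGroup`, and the cluster name `mycluster` are assumptions and placeholders to verify against the Azure CLI reference, not values from this article.

```shell
# Hypothetical example: apply the Giraph installer script to an existing
# HDInsight cluster with the Azure CLI. Parameter names and placeholder
# values are assumptions; verify them against the Azure CLI documentation.
az hdinsight script-action execute \
    --resource-group myResourceGroup \
    --cluster-name mycluster \
    --name install-giraph \
    --script-uri "https://hdiconfigactions.blob.core.windows.net/linuxgiraphconfigactionv01/giraph-installer-v01.sh" \
    --roles headnode \
    --persist-on-success
```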
This section provides instructions on how to use the sample script while creating the cluster by using the Azure portal.

> You can also apply script actions to already running clusters. For more information, see [Customize HDInsight clusters with Script Actions](hdinsight-hadoop-customize-cluster-linux.md).

1. Start creating a cluster by using the steps in [Create Linux-based HDInsight clusters](hdinsight-hadoop-create-linux-clusters-portal.md), but don't complete creation. You'll need to use the **classic create experience** and **Custom (size, settings, apps)**.

1. In the **Cluster size** section, ensure that **Number of Worker nodes** is at least 2 for this example.

1. In the **Script actions** section, provide the following information:

    For more information, see [Use a script action during cluster creation](./hdinsight-hadoop-customize-cluster-linux.md#use-a-script-action-during-cluster-creation).

1. Continue creating the cluster as described in [Create Linux-based HDInsight clusters](hdinsight-hadoop-create-linux-clusters-portal.md).

## How do I use Giraph in HDInsight?

Once the cluster has been created, use the following steps to run the SimpleShortestPathsComputation example included with Giraph. This example uses the basic [Pregel](https://people.apache.org/~edwardyoon/documents/pregel.pdf) implementation for finding the shortest path between objects in a graph.

1. Use the [ssh command](./hdinsight-hadoop-linux-use-ssh-unix.md) to connect to your cluster. Edit the following command by replacing CLUSTERNAME with the name of your cluster, and then enter the command:

    ```cmd
    ssh sshuser@CLUSTERNAME-ssh.azurehdinsight.net
    ```

2. Use the following command to create a file named **tiny_graph.txt**:
    ```bash
    cat > tiny_graph.txt <<'EOF'
    [0,0,[[1,1],[3,3]]]
    [1,0,[[0,1],[2,2],[3,1]]]
    [2,0,[[1,2],[4,4]]]
    [3,0,[[0,3],[1,1],[4,4]]]
    [4,0,[[3,4],[2,4]]]
    EOF
    ```
> The value passed to `-w` must be less than or equal to the actual number of worker nodes.
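The launch command itself isn't reproduced above. Based on the jar location copied by the install script, the output path read later in this article, and the parameter table that follows, it takes roughly this shape. This is a sketch, not the article's exact command: the `mapred.job.tracker` address and the `-vif`/`-vof` format class names are assumptions drawn from the Giraph quickstart.

```shell
# Sketch of running the SimpleShortestPathsComputation example
# (the job tracker address and format class names are assumptions).
# Upload the input data to the cluster's default storage first:
hdfs dfs -copyFromLocal tiny_graph.txt /example/data/tiny_graph.txt

# Then launch the example with two workers:
yarn jar /example/jars/giraph-examples.jar org.apache.giraph.GiraphRunner \
    org.apache.giraph.examples.SimpleShortestPathsComputation \
    -ca mapred.job.tracker=headnodehost:9010 \
    -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
    -vip /example/data/tiny_graph.txt \
    -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
    -op /example/output/shortestpaths \
    -w 2
```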

The parameters used with this command are described in the following table:

| Parameter | What it does |
| --- | --- |
| `jar` | The jar file containing the examples. |
| `org.apache.giraph.GiraphRunner` | The class used to start the examples. |
| `org.apache.giraph.examples.SimpleShortestPathsComputation` | The example that is used. In this example, it computes the shortest path between ID 1 and all other IDs in the graph. |
| `-ca mapred.job.tracker` | The headnode for the cluster. |
| `-vif` | The input format to use for the input data. |
| `-vip` | The input data file. |
| `-vof` | The output format. In this example, ID and value as plain text. |
| `-op` | The output location. |
| `-w 2` | The number of workers to use. In this example, 2. |

For more information on these and other parameters used with Giraph samples, see the [Giraph quickstart](https://giraph.apache.org/quick_start.html).

6. Once the job has finished, the results are stored in the **/example/output/shortestpaths** directory. The output file names begin with **part-m-** and end with a number indicating the file's position (first, second, and so on). Use the following command to view the output:

    ```bash
    hdfs dfs -text /example/output/shortestpaths/*
    ```

    The output appears similar to the following text:

    ```output
    0    1.0
    4    5.0
    2    2.0
    1    0.0
    3    1.0
    ```

The SimpleShortestPathsComputation example is hard-coded to start with object ID 1 and find the shortest path to other objects. The output is in the format of `destination_id` and `distance`. The `distance` is the value (or weight) of the edges traveled between object ID 1 and the target ID.

Visualizing this data, you can verify the results by traveling the shortest paths between ID 1 and all other objects. The shortest path between ID 1 and ID 4 is 5. This value is the total distance between ID 1 and 3, and then ID 3 and 4.


## Next steps

[Install and use Hue on HDInsight clusters](hdinsight-hadoop-hue-linux.md).