articles/hdinsight/spark/apache-spark-jupyter-spark-sql-use-portal.md
24 additions & 32 deletions
@@ -6,47 +6,47 @@ ms.author: hrasheed
6
6
ms.reviewer: jasonh
7
7
ms.service: hdinsight
8
8
ms.topic: quickstart
9
-
ms.date: 09/27/2019
10
9
ms.custom: mvc
10
+
ms.date: 02/25/2020
11
11
#Customer intent: As a developer new to Apache Spark on Azure, I need to see how to create a Spark cluster and query some data.
12
12
---
13
13
14
14
# Quickstart: Create Apache Spark cluster in Azure HDInsight using Azure portal
15
15
16
16
In this quickstart, you use the Azure portal to create an Apache Spark cluster in Azure HDInsight. You then create a Jupyter notebook, and use it to run Spark SQL queries against Apache Hive tables. Azure HDInsight is a managed, full-spectrum, open-source analytics service for enterprises. The Apache Spark framework for HDInsight enables fast data analytics and cluster computing using in-memory processing. Jupyter notebook lets you interact with your data, combine code with markdown text, and do simple visualizations.
For in-depth explanations of available configurations, see [Set up clusters in HDInsight](../hdinsight-hadoop-provision-linux-clusters.md). For more information regarding the use of the portal to create clusters, see [Create clusters in the portal](../hdinsight-hadoop-create-linux-clusters-portal.md).
19
+
20
+
> [!IMPORTANT]
21
+
> Billing for HDInsight clusters is prorated per minute, whether you are using them or not. Be sure to delete your cluster after you have finished using it. For more information, see the [Clean up resources](#clean-up-resources) section of this article.
19
22
20
23
## Prerequisites
21
24
22
-
-An Azure account with an active subscription. [Create an account for free](https://azure.microsoft.com/free/?ref=microsoft.com&utm_source=microsoft.com&utm_medium=docs&utm_campaign=visualstudio).
25
+
An Azure account with an active subscription. [Create an account for free](https://azure.microsoft.com/free/?ref=microsoft.com&utm_source=microsoft.com&utm_medium=docs&utm_campaign=visualstudio).
23
26
24
27
## Create an Apache Spark cluster in HDInsight
25
28
26
29
You use the Azure portal to create an HDInsight cluster that uses Azure Storage Blobs as the cluster storage. For more information on using Data Lake Storage Gen2, see [Quickstart: Set up clusters in HDInsight](../../storage/data-lake-storage/quickstart-create-connect-hdi-cluster.md).
27
30
28
-
> [!IMPORTANT]
29
-
> Billing for HDInsight clusters is prorated per minute, whether you are using them or not. Be sure to delete your cluster after you have finished using it. For more information, see the [Clean up resources](#clean-up-resources) section of this article.
30
-
31
-
1. In the Azure portal, select **Create a resource**.
31
+
1. Sign in to the [Azure portal](https://portal.azure.com/).
32
32
33
-

33
+
1. From the top menu, select **+ Create a resource**.
34
34
35
-
1. On the **New** page, select **Analytics** > **HDInsight**.
35
+

36
36
37
-

37
+
1. Select **Analytics** > **Azure HDInsight** to go to the **Create HDInsight cluster** page.
38
38
39
-
1. Under **Basics**, provide the following values:
39
+
1. From the **Basics** tab, provide the following information:
40
40
41
41
|Property |Description |
42
42
|---------|---------|
43
-
|Subscription | From the drop-down, select an Azure subscription used for this cluster. The subscription used for this quickstart is **Azure**. |
44
-
|Resource group |Specify whether you want to create a new resource group or use an existing one. A resource group is a container that holds related resources for an Azure solution. The resource group name used for this quickstart is **myResourceGroup**.|
45
-
|Cluster name |Give a name to your HDInsight cluster. The cluster name used for this quickstart is **myspark2019**.|
46
-
|Location|Select a location for the resource group. The template uses this location for creating the cluster as well as for the default cluster storage. The location used for this quickstart is **East US**. |
47
-
|Cluster type| Select **Spark** as the cluster type.|
43
+
|Subscription | From the drop-down list, select the Azure subscription that's used for the cluster. |
44
+
|Resource group |From the drop-down list, select your existing resource group, or select **Create new**.|
45
+
|Cluster name |Enter a globally unique name.|
46
+
|Region|From the drop-down list, select a region where the cluster is created. |
47
+
|Cluster type| Select **Select cluster type** to open a list. From the list, select **Spark**.|
48
48
|Cluster version|This field will auto-populate with the default version once the cluster type has been selected.|
49
-
|Cluster login username| Enter the cluster login username. The default name is *admin*. You use this account to log in to the Jupyter notebook later in the quickstart. |
49
+
|Cluster login username| Enter the cluster login username. The default name is **admin**. You use this account to log in to the Jupyter notebook later in the quickstart. |
50
50
|Cluster login password| Enter the cluster login password. |
51
51
|Secure Shell (SSH) username| Enter the SSH username. The SSH username used for this quickstart is **sshuser**. By default, this account shares the same password as the *Cluster Login username* account. |
52
52
@@ -69,25 +69,17 @@ You use the Azure portal to create an HDInsight cluster that uses Azure Storage
69
69
70
70
1. Under **Review + create**, select **Create**. It takes about 20 minutes to create the cluster. The cluster must be created before you can proceed to the next section.
71
71
72
-
If you run into an issue with creating HDInsight clusters, it could be that you don't have the right permissions to do so. For more information, see [Access control requirements](../hdinsight-hadoop-create-linux-clusters-portal.md).
72
+
If you run into an issue with creating HDInsight clusters, it could be that you don't have the right permissions to do so. For more information, see [Access control requirements](../hdinsight-hadoop-customize-cluster-linux.md#access-control).
73
73
74
74
## Create a Jupyter notebook
75
75
76
76
Jupyter Notebook is an interactive notebook environment that supports various programming languages. The notebook allows you to interact with your data, combine code with markdown text and perform simple visualizations.
77
77
78
-
1. Open the [Azure portal](https://portal.azure.com).
79
-
80
-
1. Select **HDInsight clusters**, and then select the cluster you created.
81
-
82
-

83
-
84
-
1. From the portal, select **Cluster dashboards**, and then select **Jupyter Notebook**. If prompted, enter the cluster login credentials for the cluster.
85
-
86
-

78
+
1. From a web browser, navigate to `https://CLUSTERNAME.azurehdinsight.net/jupyter`, where `CLUSTERNAME` is the name of your cluster. If prompted, enter the cluster login credentials for the cluster.
87
79
88
80
1. Select **New** > **PySpark** to create a notebook.
89
81
90
-

82
+

91
83
92
84
A new notebook is created and opened with the name Untitled (Untitled.ipynb).
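The notebook address in the first step follows a fixed pattern, so it can be built from the cluster name. The following is a minimal Python sketch, not an official tool: `jupyter_url` is a hypothetical helper name (it isn't part of any Azure SDK), and `myspark2019` is used only as an example cluster name.

```python
def jupyter_url(cluster_name: str) -> str:
    """Return the Jupyter address for an HDInsight cluster name.

    Hypothetical helper for illustration; it only formats a string and
    doesn't check that the cluster exists.
    """
    name = cluster_name.strip()
    if not name:
        raise ValueError("cluster name must not be empty")
    # Pattern from the step above: https://CLUSTERNAME.azurehdinsight.net/jupyter
    return f"https://{name}.azurehdinsight.net/jupyter"


print(jupyter_url("myspark2019"))
```

Opening the printed address in a browser should prompt for the cluster login credentials, as described in the step.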
93
85
@@ -110,7 +102,7 @@ SQL (Structured Query Language) is the most common and widely used language for
110
102
111
103
When you use a Jupyter Notebook with your HDInsight cluster, you get a preset `sqlContext` that you can use to run Hive queries using Spark SQL. `%%sql` tells Jupyter Notebook to use the preset `sqlContext` to run the Hive query. The query retrieves the top 10 rows from a Hive table (**hivesampletable**) that comes with all HDInsight clusters by default. It takes about 30 seconds to get the results. The output looks like:
112
104
113
-

105
+

114
106
115
107
Every time you run a query in Jupyter, your web browser window title shows a **(Busy)** status along with the notebook title. You also see a solid circle next to the **PySpark** text in the top-right corner.
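The query text that the `%%sql` cell runs can also be assembled in plain Python. The sketch below is illustrative only: `top_rows_query` is a hypothetical helper, and the exact statement (`SELECT * ... LIMIT 10`) is an assumption based on the description of retrieving the top 10 rows from `hivesampletable`.

```python
def top_rows_query(table: str, limit: int = 10) -> str:
    """Build query text that returns the first `limit` rows of `table`.

    Hypothetical helper; the statement shape is an assumption, not the
    article's verbatim query.
    """
    # Accept only simple identifiers so arbitrary text can't be spliced into the query.
    if not table.isidentifier():
        raise ValueError(f"not a simple table name: {table!r}")
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return f"SELECT * FROM {table} LIMIT {limit}"


query = top_rows_query("hivesampletable")
print(query)

# In a PySpark notebook cell on the cluster, the preset sqlContext could run it.
# This needs a live Spark session, so it is left as a comment here:
# sqlContext.sql(query).show()
```

Building the string separately keeps the example runnable without a cluster; inside the notebook, the `%%sql` magic remains the shorter way to run the same query.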
116
108
@@ -123,17 +115,17 @@ SQL (Structured Query Language) is the most common and widely used language for
123
115
124
116
The screen refreshes to show the query output.
125
117
126
-

118
+

127
119
128
120
1. From the **File** menu on the notebook, select **Close and Halt**. Shutting down the notebook releases the cluster resources.
129
121
130
122
## Clean up resources
131
123
132
-
HDInsight saves your data in Azure Storage or Azure Data Lake Storage, so you can safely delete a cluster when it is not in use. You are also charged for an HDInsight cluster, even when it is not in use. Since the charges for the cluster are many times more than the charges for storage, it makes economic sense to delete clusters when they are not in use. If you plan to work on the tutorial listed in [Next steps](#next-steps) immediately, you might want to keep the cluster.
124
+
HDInsight saves your data in Azure Storage or Azure Data Lake Storage, so you can safely delete a cluster when it isn't in use. You're also charged for an HDInsight cluster, even when it isn't in use. Since the charges for the cluster are many times more than the charges for storage, it makes economic sense to delete clusters when they aren't in use. If you plan to work on the tutorial listed in [Next steps](#next-steps) immediately, you might want to keep the cluster.
133
125
134
126
Switch back to the Azure portal, and select **Delete**.
135
127
136
-

128
+

137
129
138
130
You can also select the resource group name to open the resource group page, and then select **Delete resource group**. By deleting the resource group, you delete both the HDInsight cluster and the default storage account.