| 1 | +--- |
| 2 | +title: 'Quickstart: Create Apache Spark cluster using Bicep - Azure HDInsight' |
| 3 | +description: This quickstart shows how to use Bicep to create an Apache Spark cluster in Azure HDInsight, and run a Spark SQL query. |
| 4 | +author: schaffererin |
| 5 | +ms.author: v-eschaffer |
| 6 | +ms.date: 05/02/2022 |
| 7 | +ms.topic: quickstart |
| 8 | +ms.service: hdinsight |
| 9 | +ms.custom: subject-armqs, mode-arm |
| 10 | +#Customer intent: As a developer new to Apache Spark on Azure, I need to see how to create a Spark cluster and query some data. |
| 11 | +--- |
| 12 | + |
| 13 | +# Quickstart: Create Apache Spark cluster in Azure HDInsight using Bicep |
| 14 | + |
| 15 | +In this quickstart, you use Bicep to create an [Apache Spark](./apache-spark-overview.md) cluster in Azure HDInsight. You then create a Jupyter Notebook file, and use it to run Spark SQL queries against Apache Hive tables. Azure HDInsight is a managed, full-spectrum, open-source analytics service for enterprises. The Apache Spark framework for HDInsight enables fast data analytics and cluster computing using in-memory processing. Jupyter Notebook lets you interact with your data, combine code with markdown text, and do simple visualizations. |
| 16 | + |
| 17 | +If you're using multiple clusters together, you'll want to create a virtual network, and if you're using a Spark cluster you'll also want to use the Hive Warehouse Connector. For more information, see [Plan a virtual network for Azure HDInsight](../hdinsight-plan-virtual-network-deployment.md) and [Integrate Apache Spark and Apache Hive with the Hive Warehouse Connector](../interactive-query/apache-hive-warehouse-connector.md). |
| 18 | + |
| 19 | +[!INCLUDE [About Bicep](../../../includes/resource-manager-quickstart-bicep-introduction.md)] |
| 20 | + |
| 21 | +## Prerequisites |
| 22 | + |
| 23 | +If you don't have an Azure subscription, create a [free account](https://azure.microsoft.com/free/?WT.mc_id=A261C142F) before you begin. |
| 24 | + |
| 25 | +## Review the Bicep file |
| 26 | + |
| 27 | +The Bicep file used in this quickstart is from [Azure Quickstart Templates](https://azure.microsoft.com/resources/templates/hdinsight-spark-linux/). |
| 28 | + |
| 29 | +:::code language="bicep" source="~/quickstart-templates/quickstarts/microsoft.hdinsight/hdinsight-spark-linux/main.bicep"::: |
| 30 | + |
| 31 | +Two Azure resources are defined in the Bicep file: |
| 32 | + |
| 33 | +* [Microsoft.Storage/storageAccounts](/azure/templates/microsoft.storage/storageaccounts): create an Azure Storage Account. |
| 34 | +* [Microsoft.HDInsight/clusters](/azure/templates/microsoft.hdinsight/clusters): create an HDInsight cluster. |
| 35 | + |
| 36 | +## Deploy the Bicep file |
| 37 | + |
| 38 | +1. Save the Bicep file as **main.bicep** to your local computer. |
| 39 | +1. Deploy the Bicep file using either Azure CLI or Azure PowerShell. |
| 40 | + |
| 41 | + # [CLI](#tab/CLI) |
| 42 | + |
| 43 | + ```azurecli |
| 44 | + az group create --name exampleRG --location eastus |
| 45 | + az deployment group create --resource-group exampleRG --template-file main.bicep --parameters clusterName=<cluster-name> clusterLoginUserName=<cluster-username> sshUserName=<ssh-username> |
| 46 | + ``` |
| 47 | + |
| 48 | + # [PowerShell](#tab/PowerShell) |
| 49 | + |
| 50 | + ```azurepowershell |
| 51 | + New-AzResourceGroup -Name exampleRG -Location eastus |
| 52 | + New-AzResourceGroupDeployment -ResourceGroupName exampleRG -TemplateFile ./main.bicep -clusterName "<cluster-name>" -clusterLoginUserName "<cluster-username>" -sshUserName "<ssh-username>" |
| 53 | + ``` |
| 54 | + |
| 55 | + --- |
| 56 | + |
| 57 | + You need to provide values for the parameters: |
| 58 | + |
| 59 | + * Replace **\<cluster-name\>** with the name of the HDInsight cluster to create. |
| 60 | + * Replace **\<cluster-username\>** with the username used to submit jobs to the cluster and to log in to cluster dashboards. The username must be between two and 20 characters long and can contain only digits, uppercase or lowercase letters, and the following special characters: (!#$%&\'()-^_`{}~). |
| 61 | + * Replace **\<ssh-username\>** with the username used to remotely access the cluster. The username must be at least two characters long and can contain only digits, uppercase or lowercase letters, and the following special characters: (%&\'^_`{}~). It can't be the same as the cluster username. |
| 62 | + |
| 63 | + You'll be prompted to enter the following: |
| 64 | + |
| 65 | + * **clusterLoginPassword**, which must be at least 10 characters long and must contain at least one digit, one uppercase letter, one lowercase letter, and one non-alphanumeric character except single-quote, double-quote, backslash, right-bracket, full-stop. It also must not contain three consecutive characters from the cluster username or SSH username. |
| 66 | + * **sshPassword**, which must be 6-72 characters long and must contain at least one digit, one uppercase letter, and one lowercase letter. It must not contain any three consecutive characters from the cluster login name. |
| 67 | + |
| 68 | + > [!NOTE] |
| 69 | + > When the deployment finishes, you should see a message indicating the deployment succeeded. |
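
The username and password rules above can be checked locally before you start a deployment. The following is a minimal Python sketch, not part of the Bicep template or of any Azure tooling: the function and constant names are hypothetical, the comparison of consecutive characters is assumed to be case-sensitive, and the "non-alphanumeric character" rule is read as "at least one special character other than the excluded ones". The service itself remains the authority on what it accepts.

```python
import re

# Patterns transcribed from the username rules above (names are hypothetical).
CLUSTER_USER_RE = re.compile(r"^[A-Za-z0-9!#$%&'()\-^_`{}~]{2,20}$")
SSH_USER_RE = re.compile(r"^[A-Za-z0-9%&'^_`{}~]{2,}$")

# Special characters that do not satisfy the cluster login password rule:
# single-quote, double-quote, backslash, right-bracket, full-stop.
EXCLUDED_SPECIALS = set("'\"\\].")


def shares_three_chars(password, name):
    """Return True if any three consecutive characters of name occur in password.

    Assumption: the comparison is case-sensitive.
    """
    return any(name[i:i + 3] in password for i in range(len(name) - 2))


def _basic_problems(password):
    """Checks shared by both passwords: one digit, one uppercase, one lowercase."""
    problems = []
    if not any(c.isdigit() for c in password):
        problems.append("needs a digit")
    if not any(c.isupper() for c in password):
        problems.append("needs an uppercase letter")
    if not any(c.islower() for c in password):
        problems.append("needs a lowercase letter")
    return problems


def check_cluster_login_password(password, cluster_user, ssh_user):
    """Return a list of rule violations for clusterLoginPassword (empty if none)."""
    problems = _basic_problems(password)
    if len(password) < 10:
        problems.append("shorter than 10 characters")
    if not any(not c.isalnum() and c not in EXCLUDED_SPECIALS for c in password):
        problems.append("needs an allowed non-alphanumeric character")
    if shares_three_chars(password, cluster_user) or shares_three_chars(password, ssh_user):
        problems.append("contains three consecutive characters from a username")
    return problems


def check_ssh_password(password, cluster_user):
    """Return a list of rule violations for sshPassword (empty if none)."""
    problems = _basic_problems(password)
    if not 6 <= len(password) <= 72:
        problems.append("length is outside 6-72 characters")
    if shares_three_chars(password, cluster_user):
        problems.append("contains three consecutive characters from the cluster username")
    return problems


# Example values only; replace them with your own before deploying.
print(check_cluster_login_password("Str0ng!Pass9", "admin", "sshuser"))
print(check_ssh_password("Abc123", "admin"))
```

An empty list means the sketch found no violation of the rules as transcribed here. It doesn't guarantee that the deployment accepts the value.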
| 70 | + |
| 71 | +If you run into an issue with creating HDInsight clusters, it could be that you don't have the right permissions to do so. For more information, see [Access control requirements](../hdinsight-hadoop-customize-cluster-linux.md#access-control). |
| 72 | + |
| 73 | +## Review deployed resources |
| 74 | + |
| 75 | +Use the Azure portal, Azure CLI, or Azure PowerShell to list the deployed resources in the resource group. |
| 76 | + |
| 77 | +# [CLI](#tab/CLI) |
| 78 | + |
| 79 | +```azurecli-interactive |
| 80 | +az resource list --resource-group exampleRG |
| 81 | +``` |
| 82 | + |
| 83 | +# [PowerShell](#tab/PowerShell) |
| 84 | + |
| 85 | +```azurepowershell-interactive |
| 86 | +Get-AzResource -ResourceGroupName exampleRG |
| 87 | +``` |
| 88 | + |
| 89 | +--- |
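
To confirm that both resource types defined in the Bicep file were created, you can inspect the JSON that the resource listing returns. The following Python sketch assumes you saved or captured that JSON as text (for example, from `az resource list --resource-group exampleRG --output json`) and that each entry carries a `type` field. The helper name and the sample resource names are hypothetical.

```python
import json

# Resource types this quickstart's Bicep file is expected to create.
EXPECTED_TYPES = [
    "Microsoft.Storage/storageAccounts",
    "Microsoft.HDInsight/clusters",
]


def missing_types(resource_list_json):
    """Return the expected resource types that are absent from the listing.

    The comparison ignores case, as a precaution against casing differences.
    """
    resources = json.loads(resource_list_json)
    found = {resource["type"].lower() for resource in resources}
    return [t for t in EXPECTED_TYPES if t.lower() not in found]


# Illustrative sample only; real output contains many more fields per resource.
SAMPLE_OUTPUT = """
[
  {"name": "examplestorage", "type": "Microsoft.Storage/storageAccounts"},
  {"name": "examplecluster", "type": "Microsoft.HDInsight/clusters"}
]
"""

print(missing_types(SAMPLE_OUTPUT))
```

An empty result suggests that both the storage account and the cluster are present in the resource group.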
| 90 | + |
| 91 | +## Create a Jupyter Notebook file |
| 92 | + |
| 93 | +[Jupyter Notebook](https://jupyter.org/) is an interactive notebook environment that supports various programming languages. You can use a Jupyter Notebook file to interact with your data, combine code with markdown text, and perform simple visualizations. |
| 94 | + |
| 95 | +1. Open the [Azure portal](https://portal.azure.com). |
| 96 | + |
| 97 | +2. Select **HDInsight clusters**, and then select the cluster you created. |
| 98 | + |
| 99 | + :::image type="content" source="./media/apache-spark-jupyter-spark-sql/azure-portal-open-hdinsight-cluster.png" alt-text="Open HDInsight cluster in the Azure portal." border="true"::: |
| 100 | + |
| 101 | +3. From the portal, in the **Cluster dashboards** section, select **Jupyter Notebook**. If prompted, enter the cluster login credentials for the cluster. |
| 102 | + |
| 103 | + :::image type="content" source="./media/apache-spark-jupyter-spark-sql/hdinsight-spark-open-jupyter-interactive-spark-sql-query.png" alt-text="Open Jupyter Notebook to run interactive Spark SQL query." border="true"::: |
| 104 | + |
| 105 | +4. Select **New** > **PySpark** to create a notebook. |
| 106 | + |
| 107 | + :::image type="content" source="./media/apache-spark-jupyter-spark-sql/hdinsight-spark-create-jupyter-interactive-spark-sql-query.png" alt-text="Create a Jupyter Notebook file to run interactive Spark SQL query." border="true"::: |
| 108 | + |
| 109 | + A new notebook is created and opened with the name Untitled (Untitled.ipynb). |
| 110 | + |
| 111 | +## Run Apache Spark SQL statements |
| 112 | + |
| 113 | +SQL (Structured Query Language) is the most common and widely used language for querying and transforming data. Spark SQL functions as an extension to Apache Spark for processing structured data, using the familiar SQL syntax. |
| 114 | + |
| 115 | +1. Verify the kernel is ready. The kernel is ready when you see a hollow circle next to the kernel name in the notebook. A solid circle denotes that the kernel is busy. |
| 116 | + |
| 117 | + :::image type="content" source="./media/apache-spark-jupyter-spark-sql/jupyter-spark-kernel-status.png" alt-text="Screenshot showing that the kernel is ready." border="true"::: |
| 118 | + |
| 119 | + When you start the notebook for the first time, the kernel performs some tasks in the background. Wait for the kernel to be ready. |
| 120 | + |
| 121 | +1. Paste the following code in an empty cell, and then press **SHIFT + ENTER** to run the code. The command lists the Hive tables on the cluster: |
| 122 | + |
| 123 | + ```sql |
| 124 | + %%sql |
| 125 | + SHOW TABLES |
| 126 | + ``` |
| 127 | + |
| 128 | + When you use a Jupyter Notebook file with your HDInsight cluster, you get a preset `spark` session that you can use to run Hive queries using Spark SQL. `%%sql` tells Jupyter Notebook to use the preset `spark` session to run the Hive query. The query lists the Hive tables on the cluster, including **hivesampletable**, which comes with all HDInsight clusters by default. The first time you submit the query, Jupyter creates a Spark application for the notebook, which takes about 30 seconds. Once the Spark application is ready, the query runs in about a second and produces the results. The output looks like: |
| 129 | + |
| 130 | + :::image type="content" source="./media/apache-spark-jupyter-spark-sql/hdinsight-spark-get-started-hive-query.png" alt-text="Screenshot that shows an Apache Hive query in HDInsight." border="true"::: |
| 131 | + |
| 132 | + Every time you run a query in Jupyter, your web browser window title shows a **(Busy)** status along with the notebook title. You also see a solid circle next to the **PySpark** text in the top-right corner. |
| 133 | + |
| 134 | +1. Run another query to see the data in `hivesampletable`. |
| 135 | + |
| 136 | + ```sql |
| 137 | + %%sql |
| 138 | + SELECT * FROM hivesampletable LIMIT 10 |
| 139 | + ``` |
| 140 | + |
| 141 | + The screen should refresh to show the query output. |
| 142 | + |
| 143 | + :::image type="content" source="./media/apache-spark-jupyter-spark-sql/hdinsight-spark-get-started-hive-query-output.png" alt-text="Screenshot that shows Hive query output in HDInsight." border="true"::: |
| 144 | + |
| 145 | +1. From the **File** menu on the notebook, select **Close and Halt**. Shutting down the notebook releases the cluster resources, including the Spark application. |
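
The `%%sql` cells above can also be expressed as plain PySpark code in a notebook cell, using the preset `spark` session that the article describes. The following is a sketch under assumptions: `preview_query` is a hypothetical helper, and the `spark.sql(...).show()` call runs only where a `spark` session already exists, such as in the PySpark kernel.

```python
import re

# Conservative pattern for a simple, unqualified table name (an assumption of this sketch).
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def preview_query(table, limit=10):
    """Build a SELECT statement that returns the first `limit` rows of `table`."""
    if not _IDENTIFIER.match(table):
        raise ValueError(f"unexpected table name: {table!r}")
    if limit <= 0:
        raise ValueError("limit must be positive")
    return f"SELECT * FROM {table} LIMIT {int(limit)}"


query = preview_query("hivesampletable")
print(query)

# Run the query only when a preset `spark` session is available,
# as it is in the notebook's PySpark kernel; elsewhere this step is skipped.
if "spark" in globals():
    spark.sql(query).show()
```

Validating the table name before interpolating it into the statement keeps the helper from building unintended SQL when the name comes from user input.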
| 146 | + |
| 147 | +## Clean up resources |
| 148 | + |
| 149 | +When no longer needed, use the Azure portal, Azure CLI, or Azure PowerShell to delete the resource group and its resources. |
| 150 | + |
| 151 | +# [CLI](#tab/CLI) |
| 152 | + |
| 153 | +```azurecli-interactive |
| 154 | +az group delete --name exampleRG |
| 155 | +``` |
| 156 | + |
| 157 | +# [PowerShell](#tab/PowerShell) |
| 158 | + |
| 159 | +```azurepowershell-interactive |
| 160 | +Remove-AzResourceGroup -Name exampleRG |
| 161 | +``` |
| 162 | + |
| 163 | +--- |
| 164 | + |
| 165 | +## Next steps |
| 166 | + |
| 167 | +In this quickstart, you learned how to create an Apache Spark cluster in HDInsight and run a basic Spark SQL query. Advance to the next tutorial to learn how to use an HDInsight cluster to run interactive queries on sample data. |
| 168 | + |
| 169 | +> [!div class="nextstepaction"] |
| 170 | +> [Run interactive queries on Apache Spark](./apache-spark-load-data-run-query.md) |