
Commit a359f1e
Author: Rahul Ajmera
Update Spark Package Management Documentation
1 parent 80599d7 · commit a359f1e

File tree

1 file changed (+102, -53 lines)


samples/features/sql-big-data-cluster/spark/config-install/installpackage_Spark.ipynb

Lines changed: 102 additions & 53 deletions
@@ -16,76 +16,125 @@
  "cells": [
  {
  "cell_type": "markdown",
- "source": "# Packaging in Spark\r\n",
- "metadata": {}
+ "source": [
+ "<p align=\"center\">\n",
+ "<img src =\"https://raw.githubusercontent.com/microsoft/azuredatastudio/master/src/sql/media/microsoft_logo_gray.svg?sanitize=true\" width=\"250\" align=\"center\">\n",
+ "</p>\n",
+ "\n",
+ "# **Spark Package Management in SQL Server 2019 Big Data Clusters**\n",
+ "This guide covers installing packages and submitting jobs to a SQL Server 2019 Big Data Cluster using Spark.\n",
+ "* Built-In Tools\n",
+ "* Install Packages from a Maven Repository onto the Spark Cluster at Runtime\n",
+ "* Import .jar from HDFS for use at runtime\n",
+ "* Import .jar at runtime through Azure Data Studio notebook cell configuration\n",
+ "* Install Python Packages at Runtime for use with PySpark\n",
+ "* Submit local .jar or python file\n",
+ "<!-- <span style=\"color:red\"><font size=\"3\">Please press the \"Run Cells\" button to run the notebook</font></span> -->"
+ ],
+ "metadata": {
+ "azdata_cell_guid": "cbc8ced8-8931-4302-b252-7e7e478a16d4"
+ }
  },
  {
  "cell_type": "markdown",
- "source": "## Use Case 1: I can have key packages in boxed\r\n - All pacakges that come with spark and hadoop distribution\r\n - Python3.5 and Python 2.7\r\n - Pandas, Sklearn and several other supporting ml packages\r\n - R and supporting pacakges as part of MRO\r\n - sparklyr\r\n\r\n \r\n ",
- "metadata": {}
+ "source": [
+ "# Built-in Tools\n",
+ "* Spark and Hadoop base packages\n",
+ "* Python 3.5 and Python 2.7\n",
+ "* Pandas, Sklearn, Numpy, and other data processing packages.\n",
+ "* R and MRO packages\n",
+ "* Sparklyr\n",
+ ""
+ ],
+ "metadata": {
+ "azdata_cell_guid": "2fc8a069-115e-4d9b-bedc-5c55f79466b1"
+ }
  },
  {
  "cell_type": "markdown",
- "source": "## Use Case 2: I can install pacakges from maven repo to my spark cluster\r\nMaven central is a source of lot of packages. A lot of spark ecosystem pacakges are availble there. These pacakages can be installed to your spark cluster using notebook cell configuration at the start of your spark session.\r\n",
- "metadata": {}
- },
- {
- "cell_type": "code",
- "source": "%%configure -f\n{\"conf\": {\"spark.jars.packages\": \"com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.1\"}}",
- "metadata": {
- "language": "scala"
- },
- "outputs": [
- {
- "output_type": "display_data",
- "data": {
- "text/plain": "<IPython.core.display.HTML object>",
- "text/html": "Current session configs: <tt>{'conf': {'spark.jars.packages': 'com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.50'}, 'kind': 'spark'}</tt><br>"
- },
- "metadata": {}
- },
- {
- "output_type": "display_data",
- "data": {
- "text/plain": "<IPython.core.display.HTML object>",
- "text/html": "No active sessions."
- },
- "metadata": {}
- }
+ "source": [
+ "# Install Packages from a Maven Repository onto the Spark Cluster at Runtime\r\n",
+ "Maven packages can be installed onto your Spark cluster using notebook cell configuration at the start of your Spark session. Before starting a Spark session in Azure Data Studio, run the following code:\r\n",
+ "\r\n",
+ "```\r\n",
+ "%%configure -f\r\n",
+ "{\"conf\": {\"spark.jars.packages\": \"com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.1\"}}\r\n",
+ "```\r\n",
+ ""
  ],
- "execution_count": 3
+ "metadata": {
+ "azdata_cell_guid": "a0fecc05-f094-4dda-9afe-0de8ddad87eb"
+ }
  },
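The `%%configure -f` cell in the hunk above only records the session's `spark.jars.packages` setting; a minimal sketch of checking that the coordinate was picked up once a PySpark session is running (this check is an illustration added here, not part of the notebook):

```python
# Run in a PySpark cell *after* the %%configure cell, once the session has started.
# It only echoes the session configuration; it does not prove the Maven artifact resolved.
packages = spark.conf.get("spark.jars.packages", "not set")  # `spark` is the notebook's session
print("spark.jars.packages =", packages)
# Expected to include: com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.1
```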
  {
- "cell_type": "code",
- "source": "import com.microsoft.azure.eventhubs._",
- "metadata": {},
- "outputs": [
- {
- "output_type": "stream",
- "name": "stdout",
- "text": "import com.microsoft.azure.eventhubs._\n"
- }
+ "cell_type": "markdown",
+ "source": [
+ "# Import .jar from HDFS for use at runtime\n",
+ "\n",
+ "Import a .jar stored in HDFS at runtime through Azure Data Studio notebook cell configuration.\n",
+ "\n",
+ "```\n",
+ "%%configure -f\n",
+ "{\"conf\": {\"spark.jars\": \"/jar/mycodeJar.jar\"}}\n",
+ "```\n",
+ ""
  ],
- "execution_count": 5
+ "metadata": {
+ "azdata_cell_guid": "c5e65fa2-faf0-4e22-aac1-69d7ff8c9989"
+ }
  },
  {
  "cell_type": "markdown",
- "source": "## Use Case 3: I have a local jar that i want to run in the spark cluster\r\nAs a user you may build your own customer pacakges that want to run as part of your spark jobs. These pacakges can be uploaded as HDFS and using a notebook configuration spark can consume these pacakges in a jar.\r\n\r\n\r\n",
- "metadata": {}
+ "source": [
+ "# Import .jar at runtime through Azure Data Studio notebook cell configuration\n",
+ "\n",
+ "```\n",
+ "%%configure -f\n",
+ "{\"conf\": {\"spark.jars\": \"/jar/mycodeJar.jar\"}}\n",
+ "```\n",
+ ""
+ ],
+ "metadata": {
+ "azdata_cell_guid": "6fc4085f-e142-4355-b215-148dbf6c5b86"
+ }
  },
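Both of the `spark.jars` cells above only make the jar available to the Spark session; a minimal sketch of reaching code from that jar in PySpark, using a hypothetical class name that echoes the import removed by this commit:

```python
# Run after a session starts with spark.jars pointing at /jar/mycodeJar.jar (assumption for illustration).
jvm = spark.sparkContext._jvm            # py4j gateway into the driver JVM
MyCode = jvm.com.my.mycodeJar.Main       # hypothetical entry-point class inside mycodeJar.jar
# Calling MyCode.someMethod(...) would fail if the jar was not shipped with the session.
```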
  {
- "cell_type": "code",
- "source": "%%configure -f\r\n {\"conf\": {\"spark.jars\": \"/jar/mycodeJar.jar\"}}",
- "metadata": {},
- "outputs": [],
- "execution_count": 0
+ "cell_type": "markdown",
+ "source": [
+ "# Install Python Packages at Runtime for use with PySpark\n",
+ "\n",
+ "The following code can be used to install packages on each executor node at runtime. \\\n",
+ "**Note**: This installation is temporary, and must be performed each time a new Spark session is invoked.\n",
+ "\n",
+ "``` Python\n",
+ "import subprocess\n",
+ "\n",
+ "# Install TensorFlow\n",
+ "stdout = subprocess.check_output(\n",
+ " \"pip3 install tensorflow\",\n",
+ " stderr=subprocess.STDOUT,\n",
+ " shell=True).decode(\"utf-8\")\n",
+ "print(stdout)\n",
+ "```"
+ ],
+ "metadata": {
+ "azdata_cell_guid": "07944b55-7266-4fcd-8e9b-9fd6cb8cfef5"
+ }
  },
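The subprocess call in the cell above runs wherever the notebook cell executes; a minimal sketch, added here as an assumption rather than part of the notebook, of pushing the same `pip3 install` out to the executors through a throwaway RDD job:

```python
# Assumption: run in a PySpark cell; `spark` is the notebook's active session.
import subprocess

def install_on_partition(_):
    # Same pattern as the cell above, executed inside an executor task.
    out = subprocess.check_output(
        "pip3 install tensorflow",
        stderr=subprocess.STDOUT,
        shell=True).decode("utf-8")
    yield out[-200:]  # tail of pip's output, enough to spot failures

# Eight tasks is illustrative; Spark does not guarantee one task lands on every executor.
results = (spark.sparkContext
           .parallelize(range(8), 8)
           .mapPartitions(install_on_partition)
           .collect())
for tail in results:
    print(tail)
```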
  {
- "cell_type": "code",
- "source": "import com.my.mycodeJar._",
- "metadata": {},
- "outputs": [],
- "execution_count": 0
+ "cell_type": "markdown",
+ "source": [
+ "# Submit local .jar or python file\r\n",
+ "One of the key scenarios for SQL Server Big Data Clusters is the ability to submit Spark jobs. The Spark job submission feature allows you to submit local Jar or Py files with references to a SQL Server 2019 big data cluster. It also enables you to execute Jar or Py files that are already located in the HDFS file system.\r\n",
+ "\r\n",
+ "* [Submit Spark jobs on SQL Server Big Data Clusters in Azure Data Studio](https://docs.microsoft.com/en-us/sql/big-data-cluster/spark-submit-job?view=sqlallproducts-allversions)\r\n",
+ "* [Submit Spark jobs on SQL Server Big Data Clusters in IntelliJ](https://docs.microsoft.com/en-us/sql/big-data-cluster/spark-submit-job-intellij-tool-plugin?view=sqlallproducts-allversions)\r\n",
+ "* [Submit Spark jobs on SQL Server big data cluster in Visual Studio Code](https://docs.microsoft.com/en-us/sql/big-data-cluster/spark-hive-tools-vscode?view=sqlallproducts-allversions)\r\n",
+ ""
+ ],
+ "metadata": {
+ "azdata_cell_guid": "7d1b55c0-1961-45f7-8449-a24a913106e4"
+ }
  }
  ]
  }
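The linked pages describe submitting jobs from Azure Data Studio, IntelliJ, and Visual Studio Code; those tools hand the job to a Livy endpoint on the cluster. A minimal sketch of an equivalent REST call, with the gateway URL, credentials, and class name as hypothetical placeholders and the standard Livy batches API assumed rather than documented by this notebook:

```python
# Assumption: the cluster exposes Livy through its gateway; URL and auth below are placeholders.
import requests

livy_batches_url = "https://<gateway-address>:30443/gateway/default/livy/v1/batches"  # hypothetical
payload = {
    "file": "/jar/mycodeJar.jar",          # jar in HDFS, matching the cells above
    "className": "com.my.mycodeJar.Main",  # hypothetical entry-point class
}
resp = requests.post(
    livy_batches_url,
    json=payload,
    auth=("<username>", "<password>"),
    verify=False,  # the gateway often uses a self-signed certificate
)
print(resp.status_code, resp.json())
```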
