
Commit a359f1e
Author: Rahul Ajmera
Update Spark Package Management Documentation
1 parent 80599d7 · commit a359f1e

File tree

1 file changed (+102, -53 lines)


samples/features/sql-big-data-cluster/spark/config-install/installpackage_Spark.ipynb

Lines changed: 102 additions & 53 deletions
@@ -16,76 +16,125 @@
  "cells": [
  {
  "cell_type": "markdown",
- "source": "# Packaging in Spark\r\n",
- "metadata": {}
+ "source": [
+ "<p align=\"center\">\n",
+ "<img src =\"https://raw.githubusercontent.com/microsoft/azuredatastudio/master/src/sql/media/microsoft_logo_gray.svg?sanitize=true\" width=\"250\" align=\"center\">\n",
+ "</p>\n",
+ "\n",
+ "# **Spark Package Management in SQL Server 2019 Big Data Clusters**\n",
+ "This guide covers installing packages and submitting jobs to a SQL Server 2019 Big Data Cluster using Spark.\n",
+ "* Built-In Tools\n",
+ "* Install Packages from a Maven Repository onto the Spark Cluster at Runtime\n",
+ "* Import .jar from HDFS for use at runtime\n",
+ "* Import .jar at runtime through Azure Data Studio notebook cell configuration\n",
+ "* Install Python Packages at Runtime for use with PySpark\n",
+ "* Submit local .jar or python file\n",
+ "<!-- <span style=\"color:red\"><font size=\"3\">Please press the \"Run Cells\" button to run the notebook</font></span> -->"
+ ],
+ "metadata": {
+ "azdata_cell_guid": "cbc8ced8-8931-4302-b252-7e7e478a16d4"
+ }
  },
  {
  "cell_type": "markdown",
- "source": "## Use Case 1: I can have key packages in boxed\r\n - All pacakges that come with spark and hadoop distribution\r\n - Python3.5 and Python 2.7\r\n - Pandas, Sklearn and several other supporting ml packages\r\n - R and supporting pacakges as part of MRO\r\n - sparklyr\r\n\r\n \r\n ",
- "metadata": {}
+ "source": [
+ "# Built-in Tools\n",
+ "* Spark and Hadoop base packages\n",
+ "* Python 3.5 and Python 2.7\n",
+ "* Pandas, Sklearn, Numpy, and other data processing packages.\n",
+ "* R and MRO packages\n",
+ "* Sparklyr\n",
+ ""
+ ],
+ "metadata": {
+ "azdata_cell_guid": "2fc8a069-115e-4d9b-bedc-5c55f79466b1"
+ }
  },
  {
  "cell_type": "markdown",
- "source": "## Use Case 2: I can install pacakges from maven repo to my spark cluster\r\nMaven central is a source of lot of packages. A lot of spark ecosystem pacakges are availble there. These pacakages can be installed to your spark cluster using notebook cell configuration at the start of your spark session.\r\n",
- "metadata": {}
- },
- {
- "cell_type": "code",
- "source": "%%configure -f\n{\"conf\": {\"spark.jars.packages\": \"com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.1\"}}",
- "metadata": {
- "language": "scala"
- },
- "outputs": [
- {
- "output_type": "display_data",
- "data": {
- "text/plain": "<IPython.core.display.HTML object>",
- "text/html": "Current session configs: <tt>{'conf': {'spark.jars.packages': 'com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.50'}, 'kind': 'spark'}</tt><br>"
- },
- "metadata": {}
- },
- {
- "output_type": "display_data",
- "data": {
- "text/plain": "<IPython.core.display.HTML object>",
- "text/html": "No active sessions."
- },
- "metadata": {}
- }
+ "source": [
+ "# Install Packages from a Maven Repository onto the Spark Cluster at Runtime\r\n",
+ "Maven packages can be installed onto your Spark cluster using notebook cell configuration at the start of your Spark session. Before starting a Spark session in Azure Data Studio, run the following code:\r\n",
+ "\r\n",
+ "```\r\n",
+ "%%configure -f\r\n",
+ "{\"conf\": {\"spark.jars.packages\": \"com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.1\"}}\r\n",
+ "```\r\n",
+ ""
  ],
- "execution_count": 3
+ "metadata": {
+ "azdata_cell_guid": "a0fecc05-f094-4dda-9afe-0de8ddad87eb"
+ }
  },
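The `%%configure -f` cell in the hunk above only records the session's `spark.jars.packages` setting; a minimal sketch of checking that the coordinate was picked up once a PySpark session is running (this check is an illustration added here, not part of the notebook):

```python
# Run in a PySpark cell *after* the %%configure cell, once the session has started.
# It only echoes the session configuration; it does not prove the Maven artifact resolved.
packages = spark.conf.get("spark.jars.packages", "not set")  # `spark` is the notebook's session
print("spark.jars.packages =", packages)
# Expected to include: com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.1
```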
  {
- "cell_type": "code",
- "source": "import com.microsoft.azure.eventhubs._",
- "metadata": {},
- "outputs": [
- {
- "output_type": "stream",
- "name": "stdout",
- "text": "import com.microsoft.azure.eventhubs._\n"
- }
+ "cell_type": "markdown",
+ "source": [
+ "# Import .jar from HDFS for use at runtime\n",
+ "\n",
+ "Import a .jar stored in HDFS at runtime through Azure Data Studio notebook cell configuration.\n",
+ "\n",
+ "```\n",
+ "%%configure -f\n",
+ "{\"conf\": {\"spark.jars\": \"/jar/mycodeJar.jar\"}}\n",
+ "```\n",
+ ""
  ],
- "execution_count": 5
+ "metadata": {
+ "azdata_cell_guid": "c5e65fa2-faf0-4e22-aac1-69d7ff8c9989"
+ }
  },
  {
  "cell_type": "markdown",
- "source": "## Use Case 3: I have a local jar that i want to run in the spark cluster\r\nAs a user you may build your own customer pacakges that want to run as part of your spark jobs. These pacakges can be uploaded as HDFS and using a notebook configuration spark can consume these pacakges in a jar.\r\n\r\n\r\n",
- "metadata": {}
+ "source": [
+ "# Import .jar at runtime through Azure Data Studio notebook cell configuration\n",
+ "\n",
+ "```\n",
+ "%%configure -f\n",
+ "{\"conf\": {\"spark.jars\": \"/jar/mycodeJar.jar\"}}\n",
+ "```\n",
+ ""
+ ],
+ "metadata": {
+ "azdata_cell_guid": "6fc4085f-e142-4355-b215-148dbf6c5b86"
+ }
  },
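Both of the `spark.jars` cells above only make the jar available to the Spark session; a minimal sketch of reaching code from that jar in PySpark, using a hypothetical class name that echoes the import removed by this commit:

```python
# Run after a session starts with spark.jars pointing at /jar/mycodeJar.jar (assumption for illustration).
jvm = spark.sparkContext._jvm            # py4j gateway into the driver JVM
MyCode = jvm.com.my.mycodeJar.Main       # hypothetical entry-point class inside mycodeJar.jar
# Calling MyCode.someMethod(...) would fail if the jar was not shipped with the session.
```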
  {
- "cell_type": "code",
- "source": "%%configure -f\r\n {\"conf\": {\"spark.jars\": \"/jar/mycodeJar.jar\"}}",
- "metadata": {},
- "outputs": [],
- "execution_count": 0
+ "cell_type": "markdown",
+ "source": [
+ "# Install Python Packages at Runtime for use with PySpark\n",
+ "\n",
+ "The following code can be used to install packages on each executor node at runtime. \\\n",
+ "**Note**: This installation is temporary, and must be performed each time a new Spark session is invoked.\n",
+ "\n",
+ "``` Python\n",
+ "import subprocess\n",
+ "\n",
+ "# Install TensorFlow\n",
+ "stdout = subprocess.check_output(\n",
+ " \"pip3 install tensorflow\",\n",
+ " stderr=subprocess.STDOUT,\n",
+ " shell=True).decode(\"utf-8\")\n",
+ "print(stdout)\n",
+ "```"
+ ],
+ "metadata": {
+ "azdata_cell_guid": "07944b55-7266-4fcd-8e9b-9fd6cb8cfef5"
+ }
  },
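The subprocess call in the cell above runs wherever the notebook cell executes; a minimal sketch, added here as an assumption rather than part of the notebook, of pushing the same `pip3 install` out to the executors through a throwaway RDD job:

```python
# Assumption: run in a PySpark cell; `spark` is the notebook's active session.
import subprocess

def install_on_partition(_):
    # Same pattern as the cell above, executed inside an executor task.
    out = subprocess.check_output(
        "pip3 install tensorflow",
        stderr=subprocess.STDOUT,
        shell=True).decode("utf-8")
    yield out[-200:]  # tail of pip's output, enough to spot failures

# Eight tasks is illustrative; Spark does not guarantee one task lands on every executor.
results = (spark.sparkContext
           .parallelize(range(8), 8)
           .mapPartitions(install_on_partition)
           .collect())
for tail in results:
    print(tail)
```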
  {
- "cell_type": "code",
- "source": "import com.my.mycodeJar._",
- "metadata": {},
- "outputs": [],
- "execution_count": 0
+ "cell_type": "markdown",
+ "source": [
+ "# Submit local .jar or python file\r\n",
+ "One of the key scenarios for SQL Server Big Data Clusters is the ability to submit Spark jobs. The Spark job submission feature allows you to submit local Jar or Py files with references to a SQL Server 2019 big data cluster. It also enables you to execute Jar or Py files that are already located in the HDFS file system.\r\n",
+ "\r\n",
+ "* [Submit Spark jobs on SQL Server Big Data Clusters in Azure Data Studio](https://docs.microsoft.com/en-us/sql/big-data-cluster/spark-submit-job?view=sqlallproducts-allversions)\r\n",
+ "* [Submit Spark jobs on SQL Server Big Data Clusters in IntelliJ](https://docs.microsoft.com/en-us/sql/big-data-cluster/spark-submit-job-intellij-tool-plugin?view=sqlallproducts-allversions)\r\n",
+ "* [Submit Spark jobs on SQL Server big data cluster in Visual Studio Code](https://docs.microsoft.com/en-us/sql/big-data-cluster/spark-hive-tools-vscode?view=sqlallproducts-allversions)\r\n",
+ ""
+ ],
+ "metadata": {
+ "azdata_cell_guid": "7d1b55c0-1961-45f7-8449-a24a913106e4"
+ }
  }
  ]
  }
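The linked pages describe submitting jobs from Azure Data Studio, IntelliJ, and Visual Studio Code; those tools hand the job to a Livy endpoint on the cluster. A minimal sketch of an equivalent REST call, with the gateway URL, credentials, and class name as hypothetical placeholders and the standard Livy batches API assumed rather than documented by this notebook:

```python
# Assumption: the cluster exposes Livy through its gateway; URL and auth below are placeholders.
import requests

livy_batches_url = "https://<gateway-address>:30443/gateway/default/livy/v1/batches"  # hypothetical
payload = {
    "file": "/jar/mycodeJar.jar",          # jar in HDFS, matching the cells above
    "className": "com.my.mycodeJar.Main",  # hypothetical entry-point class
}
resp = requests.post(
    livy_batches_url,
    json=payload,
    auth=("<username>", "<password>"),
    verify=False,  # the gateway often uses a self-signed certificate
)
print(resp.status_code, resp.json())
```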
