---
title: Kernels for Jupyter notebook on Spark clusters in Azure HDInsight
description: Learn about the PySpark, PySpark3, and Spark kernels for Jupyter notebook available with Spark clusters on Azure HDInsight.
author: hrasheed-msft
ms.author: hrasheed
ms.reviewer: jasonh
ms.service: hdinsight
ms.topic: conceptual
ms.custom: hdinsightactive,hdiseo17may2017
ms.date: 04/24/2020
---

# Kernels for Jupyter notebook on Apache Spark clusters in Azure HDInsight

Here are a few benefits of using the new kernels with Jupyter notebook on Spark HDInsight clusters:

- **Preset contexts**. With the **PySpark**, **PySpark3**, or **Spark** kernels, you don't need to set the Spark or Hive contexts explicitly before you start working with your applications. These contexts are available by default:

  - **sc** - for Spark context
  - **sqlContext** - for Hive context
| Magic | Example | Description |
| --- | --- | --- |
| help |`%%help`|Generates a table of all the available magics with example and description |
| info |`%%info`|Outputs session information for the current Livy endpoint |
| configure |`%%configure -f`<br>`{"executorMemory": "1000M"`,<br>`"executorCores": 4`} |Configures the parameters for creating a session. The force flag (`-f`) is mandatory if a session has already been created, which ensures that the session is dropped and recreated. Look at [Livy's POST /sessions Request Body](https://github.com/cloudera/livy#request-body) for a list of valid parameters. Parameters must be passed in as a JSON string and must be on the next line after the magic, as shown in the example column. |
| sql |`%%sql -o <variable name>`<br> `SHOW TABLES`|Executes a Hive query against the sqlContext. If the `-o` parameter is passed, the result of the query is persisted in the `%%local` Python context as a [Pandas](https://pandas.pydata.org/) dataframe. |
| local |`%%local`<br>`a=1`|All the code in later lines is executed locally. Code must be valid Python2 code no matter which kernel you're using. So, even if you selected the **PySpark3** or **Spark** kernel while creating the notebook, if you use the `%%local` magic in a cell, that cell must only have valid Python2 code. |
| logs |`%%logs`|Outputs the logs for the current Livy session. |
| delete |`%%delete -f -s <session number>`|Deletes a specific session of the current Livy endpoint. You can't delete the session that is started for the kernel itself. |
| cleanup |`%%cleanup -f`|Deletes all the sessions for the current Livy endpoint, including this notebook's session. The force flag `-f` is mandatory. |
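The configuration body for `%%configure` must be a valid JSON string on the line after the magic. As an illustration only (this check isn't part of the kernels themselves), you can sanity-check a configuration locally with plain Python before pasting it into a cell:

```python
import json

# Session settings as they would appear on the line after `%%configure -f`.
config = '{"executorMemory": "1000M", "executorCores": 4}'

# json.loads raises an error on malformed input, so a successful parse
# means the string is valid JSON and safe to use with the magic.
settings = json.loads(config)
print(settings["executorMemory"])  # 1000M
print(settings["executorCores"])   # 4
```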

The `%%sql` magic supports different parameters that you can use to control the output.

| Parameter | Example | Description |
| --- | --- | --- |
| -o |`-o <VARIABLE NAME>`|Use this parameter to persist the result of the query, in the `%%local` Python context, as a [Pandas](https://pandas.pydata.org/) dataframe. The name of the dataframe variable is the variable name you specify. |
| -q |`-q`|Use this parameter to turn off visualizations for the cell. If you don't want to autovisualize the content of a cell and just want to capture it as a dataframe, then use `-q -o <VARIABLE>`. If you want to turn off visualizations without capturing the results (for example, for running a SQL query, like a `CREATE TABLE` statement), use `-q` without specifying a `-o` argument. |
| -m |`-m <METHOD>`|Here **METHOD** is either **take** or **sample** (default is **take**). If the method is **take**, the kernel picks elements from the top of the result data set specified by **MAXROWS** (described later in this table). If the method is **sample**, the kernel randomly samples elements of the data set according to the `-r` parameter, described next in this table. |
| -r |`-r <FRACTION>`|Here **FRACTION** is a floating-point number between 0.0 and 1.0. If the sample method for the SQL query is `sample`, then the kernel randomly samples the specified fraction of the elements of the result set for you. For example, if you run a SQL query with the arguments `-m sample -r 0.01`, then 1% of the result rows are randomly sampled. |
| -n |`-n <MAXROWS>`|**MAXROWS** is an integer value. The kernel limits the number of output rows to **MAXROWS**. If **MAXROWS** is a negative number such as **-1**, then the number of rows in the result set isn't limited. |

```sql
%%sql -q -m sample -r 0.1 -n 500 -o query2
SELECT * FROM hivesampletable
```

The statement above does the following actions:

- Selects all records from **hivesampletable**.
- Because we use `-q`, it turns off autovisualization.
- Because we use `-m sample -r 0.1 -n 500`, it randomly samples 10% of the rows in **hivesampletable** and limits the size of the result set to 500 rows.
- Finally, because we used `-o query2`, it also saves the output into a dataframe called **query2**.
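The `take` and `sample` semantics can be mimicked locally in plain Python. The sketch below only illustrates what the parameters mean; it isn't the kernel's actual implementation:

```python
import random

rows = list(range(1000))  # stand-in for the rows of a query result set

# -m take (the default) with -n 500: pick rows from the top, up to MAXROWS.
taken = rows[:500]

# -m sample -r 0.1: randomly sample 10% of the rows.
random.seed(42)  # seeded here only so the illustration is repeatable
sampled = random.sample(rows, int(len(rows) * 0.1))

# -n 500 caps the result; the 100 sampled rows are already under the cap.
limited = sampled[:500]

print(len(taken))    # 500
print(len(limited))  # 100
```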
## Considerations while using the new kernels

Whichever kernel you use, leaving the notebooks running consumes the cluster resources. With these kernels, because the contexts are preset, simply exiting the notebooks doesn't kill the context, so the cluster resources continue to be in use. A good practice is to use the **Close and Halt** option from the notebook's **File** menu when you're finished using the notebook. The closure kills the context and then exits the notebook.
## Where are the notebooks stored?

If your cluster uses Azure Storage as the default storage account, Jupyter notebooks are saved to the storage account under the **/HdiNotebooks** folder. Notebooks, text files, and folders that you create from within Jupyter are accessible from the storage account. For example, if you use Jupyter to create a folder **myfolder** and a notebook **myfolder/mynotebook.ipynb**, you can access that notebook at `/HdiNotebooks/myfolder/mynotebook.ipynb` within the storage account. The reverse is also true: if you upload a notebook directly to your storage account at `/HdiNotebooks/mynotebook1.ipynb`, the notebook is visible from Jupyter as well. Notebooks remain in the storage account even after the cluster is deleted.
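The Jupyter path and the storage-account path differ only by the **/HdiNotebooks** prefix. A small, hypothetical helper (not part of HDInsight) makes the mapping from the example above explicit:

```python
import posixpath

def storage_path(jupyter_path):
    """Map a path as seen in Jupyter to its location in the storage account."""
    return posixpath.join("/HdiNotebooks", jupyter_path)

print(storage_path("myfolder/mynotebook.ipynb"))  # /HdiNotebooks/myfolder/mynotebook.ipynb
```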
> [!NOTE]
> HDInsight clusters with Azure Data Lake Storage as the default storage do not store notebooks in associated storage.

The way notebooks are saved to the storage account is compatible with [Apache Hadoop HDFS](https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html). If you SSH into the cluster, you can use the following file management commands:

```bash
hdfs dfs -ls /HdiNotebooks                            # List everything at the root directory – everything in this directory is visible to Jupyter from the home page
hdfs dfs -copyToLocal /HdiNotebooks                   # Download the contents of the HdiNotebooks folder
hdfs dfs -copyFromLocal example.ipynb /HdiNotebooks   # Upload a notebook example.ipynb to the root folder so it's visible from Jupyter
```

Whether the cluster uses Azure Storage or Azure Data Lake Storage as the default storage account, the notebooks are also saved on the cluster headnode at `/var/lib/jupyter`.
## Supported browser
Jupyter notebooks on Spark HDInsight clusters are supported only on Google Chrome.
## Feedback

The new kernels are in an evolving stage and will mature over time, so the APIs could change as these kernels mature. We would appreciate any feedback that you have while using these new kernels. The feedback is useful in shaping the final release of these kernels. You can leave your comments/feedback under the **Feedback** section at the bottom of this article.

## Next steps

- [Overview: Apache Spark on Azure HDInsight](apache-spark-overview.md)
- [Use Apache Zeppelin notebooks with an Apache Spark cluster on HDInsight](apache-spark-zeppelin-notebook.md)
- [Use external packages with Jupyter notebooks](apache-spark-jupyter-notebook-use-external-packages.md)
- [Install Jupyter on your computer and connect to an HDInsight Spark cluster](apache-spark-jupyter-notebook-install-locally.md)