Commit b818686

Merge pull request #96452 from yanancai/master
Change HDInsight python managemet best practice
2 parents 385a4e1 + 599535b commit b818686

7 files changed: +66 −40 lines


articles/hdinsight/TOC.yml

Lines changed: 1 addition & 1 deletion
@@ -355,7 +355,7 @@
       href: ./spark/apache-spark-jupyter-notebook-kernels.md
     - name: Use external packages with Jupyter using cell magic
       href: ./spark/apache-spark-jupyter-notebook-use-external-packages.md
-    - name: Use external packages with Jupyter using script action
+    - name: Safely manage Python environment on Azure HDInsight using Script Action
       href: ./spark/apache-spark-python-package-installation.md
     - name: Use Apache Zeppelin notebooks
       href: ./spark/apache-spark-zeppelin-notebook.md

articles/hdinsight/spark/apache-spark-python-package-installation.md

Lines changed: 65 additions & 39 deletions
@@ -6,23 +6,16 @@ ms.author: hrasheed
 ms.reviewer: jasonh
 ms.service: hdinsight
 ms.topic: conceptual
-ms.date: 11/05/2019
+ms.date: 11/19/2019
 ---

-# Script Action to install external Python packages for Jupyter notebooks in Apache Spark on HDInsight
+# Safely manage Python environment on Azure HDInsight using Script Action

 > [!div class="op_single_selector"]
 > * [Using cell magic](apache-spark-jupyter-notebook-use-external-packages.md)
 > * [Using Script Action](apache-spark-python-package-installation.md)

-Learn how to use Script Actions to configure an [Apache Spark](https://spark.apache.org/) cluster on HDInsight to use external, community-contributed **python** packages that aren't included out-of-the-box in the cluster.
-
-> [!NOTE]
-> You can also configure a Jupyter notebook by using `%%configure` magic to use external packages. For instructions, see [Use external packages with Jupyter notebooks in Apache Spark clusters on HDInsight](apache-spark-jupyter-notebook-use-external-packages.md).
-
-You can search the [package index](https://pypi.python.org/pypi) for the complete list of packages that are available. You can also get a list of available packages from other sources. For example, you can install packages made available through [conda-forge](https://conda-forge.org/feedstocks/).
-
-In this article, you learn how to install the [TensorFlow](https://www.tensorflow.org/) package using Script Action on your cluster and use it via the Jupyter notebook as an example.
+HDInsight Spark clusters include two built-in Python installations: Anaconda Python 2.7 and Python 3.5. In some cases you may need to customize the Python environment, for example to install external Python packages or a different Python version. This article shows the best practice for safely managing Python environments on an [Apache Spark](https://spark.apache.org/) cluster on HDInsight.

 ## Prerequisites

@@ -39,61 +32,94 @@ The Microsoft Azure HDInsight service uses an ecosystem of open-source technolog

 There are two types of open-source components that are available in the HDInsight service:

-* **Built-in components** - These components are pre-installed on HDInsight clusters and provide core functionality of the cluster. For example, Apache Hadoop YARN ResourceManager, the Apache Hive query language (HiveQL), and the Mahout library belong to this category. A full list of cluster components is available in [What's new in the Apache Hadoop cluster versions provided by HDInsight](https://docs.microsoft.com/azure/hdinsight/hdinsight-component-versioning).
+* **Built-in components** - These components are pre-installed on HDInsight clusters and provide core functionality of the cluster. For example, Apache Hadoop YARN Resource Manager, the Apache Hive query language (HiveQL), and the Mahout library belong to this category. A full list of cluster components is available in [What's new in the Apache Hadoop cluster versions provided by HDInsight](https://docs.microsoft.com/azure/hdinsight/hdinsight-component-versioning).
 * **Custom components** - You, as a user of the cluster, can install or use in your workload any component available in the community or created by you.

 > [!IMPORTANT]
 > Components provided with the HDInsight cluster are fully supported. Microsoft Support helps to isolate and resolve issues related to these components.
 >
 > Custom components receive commercially reasonable support to help you to further troubleshoot the issue. Microsoft support may be able to resolve the issue OR they may ask you to engage available channels for the open source technologies where deep expertise for that technology is found. For example, there are many community sites that can be used, like: [MSDN forum for HDInsight](https://social.msdn.microsoft.com/Forums/azure/home?forum=hdinsight), [https://stackoverflow.com](https://stackoverflow.com). Also Apache projects have project sites on [https://apache.org](https://apache.org), for example: [Hadoop](https://hadoop.apache.org/).
-## Use external packages with Jupyter notebooks
+## Understand default Python installation
+
+HDInsight Spark clusters are created with an Anaconda installation. There are two Python installations in the cluster, Anaconda Python 2.7 and Python 3.5. The table below shows the default Python settings for Spark, Livy, and Jupyter.

-1. From the [Azure portal](https://portal.azure.com/), navigate to your cluster.
+| |Python 2.7|Python 3.5|
+|----|----|----|
+|Path|/usr/bin/anaconda/bin|/usr/bin/anaconda/envs/py35/bin|
+|Spark|Defaults to 2.7|N/A|
+|Livy|Defaults to 2.7|N/A|
+|Jupyter|PySpark kernel|PySpark3 kernel|

-2. With your cluster selected, from the left pane, under **Settings**, select **Script actions**.
+## Safely install external Python packages

-3. Select **+ Submit new**.
+The HDInsight cluster depends on the built-in Python environments, Python 2.7 and Python 3.5. Directly installing custom packages into those default built-in environments may cause unexpected library version changes and break the cluster further. To safely install custom external Python packages for your Spark applications, follow the steps below.
+
+1. Create a Python virtual environment by using conda. A virtual environment provides an isolated space for your projects without breaking others. When you create the virtual environment, you can specify the Python version that you want to use. You still need to create a virtual environment even if you want to use Python 2.7 or 3.5, to make sure the cluster's default environment doesn't get broken. Run script actions on your cluster for all nodes with the script below to create a Python virtual environment.
+
+- `--prefix` specifies the path where the conda virtual environment lives. Several configurations need to be changed further based on the path specified here. In this example, we use py35new, because the cluster already has an existing virtual environment called py35.
+- `python=` specifies the Python version for the virtual environment. In this example, we use version 3.5, the same as the cluster's built-in version. You can also use other Python versions to create the virtual environment.
+- `anaconda` specifies the package_spec as anaconda, to install Anaconda packages in the virtual environment.
+
+```bash
+sudo /usr/bin/anaconda/bin/conda create --prefix /usr/bin/anaconda/envs/py35new python=3.5 anaconda --yes
+```
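As an aside, the isolation the step above relies on is the same idea as Python's standard `venv` module. The sketch below is only a local illustration of that idea (on the cluster you would use the conda command above; the `/tmp/demo-env` path is just an example):

```shell
# Local illustration of environment isolation, not run on the cluster:
# create a throwaway environment and confirm its interpreter is isolated.
python3 -m venv --without-pip /tmp/demo-env
/tmp/demo-env/bin/python -c 'import sys; print(sys.prefix)'
# prints /tmp/demo-env -- packages installed there won't touch the system Python
```

Packages installed into such an environment never change the interpreter the rest of the system depends on, which is exactly why the article recommends a separate conda environment instead of the built-in ones.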

-4. Enter the following values for the **Submit script action** window:
+2. Install external Python packages in the created virtual environment if needed. Run script actions on your cluster for all nodes with the script below to install external Python packages. You need sudo privileges here in order to write files to the virtual environment folder.

-|Parameter | Value |
-|---|---|
-|Script type | Select **- Custom** from the drop-down list.|
-|Name |Enter `tensorflow` in the text box.|
-|Bash script URI |Enter `https://hdiconfigactions.blob.core.windows.net/linuxtensorflow/tensorflowinstall.sh` in the text box. |
-|Node type(s) | Select the **Head**, and **Worker** check boxes. |
+You can search the [package index](https://pypi.python.org/pypi) for the complete list of packages that are available. You can also get a list of available packages from other sources. For example, you can install packages made available through [conda-forge](https://conda-forge.org/feedstocks/).

-`tensorflowinstall.sh` contains the following commands:
+- `seaborn` is the package name that you would like to install.
+- `-n py35new` specifies the name of the virtual environment that was just created. Make sure to change the name to match your virtual environment.

 ```bash
-#!/usr/bin/env bash
-/usr/bin/anaconda/bin/conda install -c conda-forge tensorflow
+sudo /usr/bin/anaconda/bin/conda install seaborn -n py35new --yes
 ```

-5. Select **Create**. Visit the documentation on [how to use custom script actions](../hdinsight-hadoop-customize-cluster-linux.md).
+If you don't know the virtual environment name, you can SSH to the head node of the cluster and run `/usr/bin/anaconda/bin/conda info -e` to show all virtual environments.

-6. Wait for the script to complete. The **Script actions** pane will state **New script actions can be submitted after the current cluster operation finishes** while the script is executing. A progress bar can be viewed from the Ambari UI **Background Operations** window.
+3. Change the Spark and Livy configs to point to the created virtual environment.

-7. Open a PySpark Jupyter notebook. See [Create a Jupyter notebook on Spark HDInsight](./apache-spark-jupyter-notebook-kernels.md#create-a-jupyter-notebook-on-spark-hdinsight) for steps.
+1. Open the Ambari UI, and go to the Spark2 page, Configs tab.
+
+![Change Spark and Livy config through Ambari](./media/apache-spark-python-package-installation/ambari-spark-and-livy-config.png)
+
+2. Expand Advanced livy2-env, and add the statements below at the bottom. If you installed the virtual environment with a different prefix, change the path correspondingly.

-![Create a new Jupyter notebook](./media/apache-spark-python-package-installation/hdinsight-spark-create-notebook.png "Create a new Jupyter notebook")
+```
+export PYSPARK_PYTHON=/usr/bin/anaconda/envs/py35new/bin/python
+export PYSPARK_DRIVER_PYTHON=/usr/bin/anaconda/envs/py35new/bin/python
+```

-8. You will now `import tensorflow` and run a hello world example. Enter the following code:
+![Change Livy config through Ambari](./media/apache-spark-python-package-installation/ambari-livy-config.png)
+
+3. Expand Advanced spark2-env, and replace the existing export PYSPARK_PYTHON statement at the bottom. If you installed the virtual environment with a different prefix, change the path correspondingly.
+
+```
+export PYSPARK_PYTHON=${PYSPARK_PYTHON:-/usr/bin/anaconda/envs/py35new/bin/python}
+```
+
+![Change Spark config through Ambari](./media/apache-spark-python-package-installation/ambari-spark-config.png)
+
+4. Save the changes and restart the affected services. These changes need a restart of the Spark2 service. The Ambari UI will prompt a required restart reminder; click Restart to restart all affected services.
+
+![Restart services through Ambari](./media/apache-spark-python-package-installation/ambari-restart-services.png)
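The spark2-env line above uses the shell's default-value expansion, so a `PYSPARK_PYTHON` that is already set elsewhere wins over the virtual-environment path. A quick local illustration of how `${VAR:-default}` behaves (the `/opt/custom/python` value is hypothetical):

```shell
# ${VAR:-default} expands to $VAR when it is set and non-empty, else to default.
unset PYSPARK_PYTHON
echo "${PYSPARK_PYTHON:-/usr/bin/anaconda/envs/py35new/bin/python}"
# prints /usr/bin/anaconda/envs/py35new/bin/python

export PYSPARK_PYTHON=/opt/custom/python   # hypothetical pre-set value
echo "${PYSPARK_PYTHON:-/usr/bin/anaconda/envs/py35new/bin/python}"
# prints /opt/custom/python
```

This is why the replacement statement is safe: jobs that explicitly choose an interpreter keep it, and everything else falls back to the new virtual environment.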
+
+4. If you would like to use the newly created virtual environment on Jupyter, you need to change the Jupyter configs and restart Jupyter. Run script actions on all head nodes with the statement below to point Jupyter to the newly created virtual environment. Make sure to modify the path to the prefix you specified for your virtual environment. After running this script action, restart the Jupyter service through the Ambari UI to make the change available.

 ```
-import tensorflow as tf
-hello = tf.constant('Hello, TensorFlow!')
-sess = tf.Session()
-print(sess.run(hello))
+sudo sed -i '/python3_executable_path/c\ \"python3_executable_path\" : \"/usr/bin/anaconda/envs/py35new/bin/python3\"' /home/spark/.sparkmagic/config.json
 ```
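The sed command's `c\` replaces the entire line that mentions `python3_executable_path` with the virtual environment's interpreter path. A sketch of the same edit on a scratch file (the file contents here are a minimal assumption; on the cluster the real file is /home/spark/.sparkmagic/config.json):

```shell
# Write a minimal stand-in for sparkmagic's config.json (assumed structure).
cat > /tmp/sparkmagic-config.json <<'EOF'
{
 "python3_executable_path" : "/usr/bin/anaconda/envs/py35/bin/python3"
}
EOF

# 'c\' (change) replaces the whole matching line with the py35new path.
sed -i '/python3_executable_path/c\ \"python3_executable_path\" : \"/usr/bin/anaconda/envs/py35new/bin/python3\"' /tmp/sparkmagic-config.json
cat /tmp/sparkmagic-config.json
```

Because `c\` rewrites the whole line, the command works regardless of what path the entry held before.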

-The result looks like this:
-
-![TensorFlow code execution](./media/apache-spark-python-package-installation/tensorflow-execution.png "Execute TensorFlow code")
+You can double-check the Python environment in a Jupyter Notebook by running the code below:
+
+![Check Python version in Jupyter Notebook](./media/apache-spark-python-package-installation/check-python-version-in-jupyter.png)
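The check shown in the screenshot amounts to printing the kernel's interpreter details. A minimal sketch you could run in a notebook cell:

```python
import sys

# Show which interpreter and Python version the current kernel is running on;
# after the config changes above, the path should point into the new
# virtual environment (for example .../envs/py35new/bin/python3).
print(sys.executable)
print(sys.version.split()[0])
```

If the printed path still points at the built-in installation, the Jupyter config change has not taken effect and the service likely needs another restart.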
+## Known issue
+
+There is a known bug for Anaconda versions 4.7.11 and 4.7.12. If your script actions hang at `"Collecting package metadata (repodata.json): ...working..."` and fail with `"Python script has been killed due to timeout after waiting 3600 secs"`, you can download [this script](https://gregorysfixes.blob.core.windows.net/public/fix-conda.sh) and run it as script actions on all nodes to fix the issue.

-> [!NOTE]
-> There are two python installations in the cluster. Spark will use the Anaconda python installation located at `/usr/bin/anaconda/bin` and will default to the Python 2.7 environment. To use Python 3.x and install packages in the PySpark3 kernel, use the path to the `conda` executable for that environment and use the `-n` parameter to specify the environment. For example, the command `/usr/bin/anaconda/envs/py35/bin/conda install -c conda-forge ggplot -n py35`, installs the `ggplot` package to the Python 3.5 environment using the `conda-forge` channel.
+
+To check your Anaconda version, you can SSH to the cluster head node and run `/usr/bin/anaconda/bin/conda --version`.

 ## <a name="seealso"></a>See also
5 binary files (images) changed: 26.6 KB, 37.5 KB, 34.2 KB, 47.3 KB, 7.48 KB
