
Commit 65bbed2

Merge pull request #107861 from dagiro/freshness28 (freshness28)
2 parents 5ff63ad + 7a7a601

File tree

7 files changed: +17 −43 lines changed


articles/hdinsight/spark/apache-spark-python-package-installation.md

Lines changed: 17 additions & 43 deletions
@@ -6,7 +6,7 @@ ms.author: hrasheed
 ms.reviewer: jasonh
 ms.service: hdinsight
 ms.topic: conceptual
-ms.date: 11/19/2019
+ms.date: 03/16/2020
 ---
 
 # Safely manage Python environment on Azure HDInsight using Script Action
@@ -15,25 +15,22 @@ ms.date: 11/19/2019
 > * [Using cell magic](apache-spark-jupyter-notebook-use-external-packages.md)
 > * [Using Script Action](apache-spark-python-package-installation.md)
 
-HDInsight has two built-in Python installations in the Spark cluster, Anaconda Python 2.7 and Python 3.5. In some cases, customers need to customize the Python environment, like installing external Python packages or another Python version. In this article, we show the best practice of safely managing Python environments for an [Apache Spark](https://spark.apache.org/) cluster on HDInsight.
+HDInsight has two built-in Python installations in the Spark cluster, Anaconda Python 2.7 and Python 3.5. In some cases, customers need to customize the Python environment, like installing external Python packages or another Python version. In this article, we show the best practice of safely managing Python environments for an [Apache Spark](./apache-spark-overview.md) cluster on HDInsight.
 
 ## Prerequisites
 
-* An Azure subscription. See [Get Azure free trial](https://azure.microsoft.com/documentation/videos/get-azure-free-trial-for-testing-hadoop-in-hdinsight/).
-
-* An Apache Spark cluster on HDInsight. For instructions, see [Create Apache Spark clusters in Azure HDInsight](apache-spark-jupyter-spark-sql.md).
-
-> [!NOTE]
-> If you do not already have a Spark cluster on HDInsight Linux, you can run script actions during cluster creation. Visit the documentation on [how to use custom script actions](https://docs.microsoft.com/azure/hdinsight/hdinsight-hadoop-customize-cluster-linux).
+An Apache Spark cluster on HDInsight. For instructions, see [Create Apache Spark clusters in Azure HDInsight](apache-spark-jupyter-spark-sql.md). If you do not already have a Spark cluster on HDInsight, you can run script actions during cluster creation. Visit the documentation on [how to use custom script actions](../hdinsight-hadoop-customize-cluster-linux.md).
 
 ## Support for open-source software used on HDInsight clusters
 
 The Microsoft Azure HDInsight service uses an ecosystem of open-source technologies formed around Apache Hadoop. Microsoft Azure provides a general level of support for open-source technologies. For more information, see the [Azure Support FAQ website](https://azure.microsoft.com/support/faq/). The HDInsight service provides an additional level of support for built-in components.
 
 There are two types of open-source components that are available in the HDInsight service:
 
-* **Built-in components** - These components are pre-installed on HDInsight clusters and provide core functionality of the cluster. For example, Apache Hadoop YARN Resource Manager, the Apache Hive query language (HiveQL), and the Mahout library belong to this category. A full list of cluster components is available in [What's new in the Apache Hadoop cluster versions provided by HDInsight](https://docs.microsoft.com/azure/hdinsight/hdinsight-component-versioning).
-* **Custom components** - You, as a user of the cluster, can install or use in your workload any component available in the community or created by you.
+|Component |Description |
+|---|---|
+|Built-in|These components are pre-installed on HDInsight clusters and provide core functionality of the cluster. For example, Apache Hadoop YARN Resource Manager, the Apache Hive query language (HiveQL), and the Mahout library belong to this category. A full list of cluster components is available in [What's new in the Apache Hadoop cluster versions provided by HDInsight](../hdinsight-component-versioning.md).|
+|Custom|You, as a user of the cluster, can install or use in your workload any component available in the community or created by you.|
 
 > [!IMPORTANT]
 > Components provided with the HDInsight cluster are fully supported. Microsoft Support helps to isolate and resolve issues related to these components.
@@ -55,22 +52,22 @@ HDInsight Spark cluster is created with Anaconda installation. There are two Pyt
 
 The HDInsight cluster depends on the built-in Python environments, both Python 2.7 and Python 3.5. Directly installing custom packages in those default built-in environments may cause unexpected library version changes and break the cluster further. To safely install custom external Python packages for your Spark applications, follow the steps below.
 
-1. Create Python virtual environment using conda. A virtual environment provides an isolated space for your projects without breaking others. When creating the Python virtual environment, you can specify python version that you want to use. Note that you still need to create virtual environment even though you would like to use Python 2.7 and 3.5. This is to make sure the clusters default environment not getting broke. Run script actions on your cluster for all nodes with below script to create a Python virtual environment.
+1. Create a Python virtual environment using conda. A virtual environment provides an isolated space for your projects without breaking others. When creating the virtual environment, you can specify the Python version that you want to use. You still need to create a virtual environment even if you want to use Python 2.7 or 3.5, to make sure the cluster's default environment does not get broken. Run script actions on your cluster for all nodes with the script below to create a Python virtual environment.
 
    - `--prefix` specifies a path where the conda virtual environment lives. Several configs need to be changed further based on the path specified here. In this example, we use py35new, as the cluster already has an existing virtual environment called py35.
   - `python=` specifies the Python version for the virtual environment. In this example, we use version 3.5, the same version as the cluster's built-in one. You can also use other Python versions to create the virtual environment.
   - `anaconda` specifies the package_spec as anaconda to install Anaconda packages in the virtual environment.
 
   ```bash
-  sudo /usr/bin/anaconda/bin/conda create --prefix /usr/bin/anaconda/envs/py35new python=3.5 anaconda --yes
+  sudo /usr/bin/anaconda/bin/conda create --prefix /usr/bin/anaconda/envs/py35new python=3.5 anaconda --yes
  ```
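The `--prefix`-style isolation above can be tried locally without a cluster. This is only an analogy, not the article's conda command: Python's built-in `venv` module likewise creates a self-contained environment at a path you choose, leaving the system interpreter untouched. The path here is illustrative.

```shell
# Local sketch of prefix-style environment isolation (an analogy to
# `conda create --prefix`; does not touch any HDInsight cluster).
PREFIX="$(mktemp -d)/py-isolated"   # illustrative location for the environment
python3 -m venv "$PREFIX"           # create an isolated environment at that path
# The environment carries its own interpreter and site-packages:
"$PREFIX/bin/python" -c 'import sys; print(sys.prefix)'
```

Installing packages through the environment's own interpreter is what keeps the system-wide (or, on the cluster, the built-in Anaconda) environment unchanged.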
 
 2. Install external Python packages in the created virtual environment if needed. Run script actions on your cluster for all nodes with below script to install external Python packages. You need to have sudo privilege here in order to write files to the virtual environment folder.
 
    You can search the [package index](https://pypi.python.org/pypi) for the complete list of packages that are available. You can also get a list of available packages from other sources. For example, you can install packages made available through [conda-forge](https://conda-forge.org/feedstocks/).
 
    Use below command if you would like to install a library with its latest version:
-
+
    - Use conda channel:
 
      - `seaborn` is the package name that you would like to install.
@@ -83,7 +80,7 @@ HDInsight cluster depends on the built-in Python environment, both Python 2.7 an
    - Or use PyPi repo, change `seaborn` and `py35new` correspondingly:
      ```bash
      sudo /usr/bin/anaconda/env/py35new/bin/pip install seaborn
-     ```
+     ```
 
    Use below command if you would like to install a library with a specific version:

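The specific-version commands themselves fall outside this excerpt's hunks. As a hedged sketch of their shape, assembled as strings rather than executed (they only work on a cluster node), with `seaborn`, `0.9.0`, `py35new`, and the `envs` directory as illustrative assumptions: conda pins a version with `=`, pip with `==`.

```shell
# Sketch only: version-pinned variants of the install commands.
# Package name, version, and environment name below are placeholders.
PKG=seaborn; VER=0.9.0; ENV=py35new
CONDA_CMD="sudo /usr/bin/anaconda/bin/conda install --prefix /usr/bin/anaconda/envs/$ENV $PKG=$VER --yes"
PIP_CMD="sudo /usr/bin/anaconda/envs/$ENV/bin/pip install $PKG==$VER"
echo "$CONDA_CMD"
echo "$PIP_CMD"
```

Swap in the package, version, and environment prefix you actually used before running either command on the cluster.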
@@ -107,9 +104,9 @@ HDInsight cluster depends on the built-in Python environment, both Python 2.7 an
 3. Change Spark and Livy configs and point to the created virtual environment.
 
    1. Open Ambari UI, go to Spark2 page, Configs tab.
-
+
       ![Change Spark and Livy config through Ambari](./media/apache-spark-python-package-installation/ambari-spark-and-livy-config.png)
-
+
    2. Expand Advanced livy2-env, add below statements at bottom. If you installed the virtual environment with a different prefix, change the path correspondingly.
 
       ```
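The livy2-env statements themselves are elided by the hunk boundary above. In the published article they are environment exports of roughly this shape — a sketch, assuming the `/usr/bin/anaconda/envs/py35new` prefix created in step 1; verify against the live document before pasting into Ambari.

```shell
# Hedged sketch of the Advanced livy2-env additions (paths assume the
# py35new prefix from step 1; adjust if you used a different --prefix).
export PYSPARK_PYTHON=/usr/bin/anaconda/envs/py35new/bin/python
export PYSPARK_DRIVER_PYTHON=/usr/bin/anaconda/envs/py35new/bin/python
```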
@@ -130,10 +127,10 @@ HDInsight cluster depends on the built-in Python environment, both Python 2.7 an
 4. Save the changes and restart affected services. These changes need a restart of Spark2 service. Ambari UI will prompt a required restart reminder, click Restart to restart all affected services.
 
    ![Change Spark config through Ambari](./media/apache-spark-python-package-installation/ambari-restart-services.png)
-
-4. If you would like to use the new created virtual environment on Jupyter. You need to change Jupyter configs and restart Jupyter. Run script actions on all header nodes with below statement to point Jupyter to the new created virtual environment. Make sure to modify the path to the prefix you specified for your virtual environment. After running this script action, restart Jupyter service through Ambari UI to make this change available.
 
-   ```
+
+4. If you would like to use the newly created virtual environment in Jupyter, you need to change Jupyter configs and restart Jupyter. Run script actions on all header nodes with the statement below to point Jupyter to the new virtual environment. Make sure to modify the path to the prefix you specified for your virtual environment. After running this script action, restart the Jupyter service through the Ambari UI to make this change available.
+
+   ```bash
    sudo sed -i '/python3_executable_path/c\ \"python3_executable_path\" : \"/usr/bin/anaconda/envs/py35new/bin/python3\"' /home/spark/.sparkmagic/config.json
   ```
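To see what that `sed` script action does without touching a cluster, here is a local demonstration on a throwaway file (the JSON fragment is illustrative, not the real `/home/spark/.sparkmagic/config.json`): sed's `/pattern/c\` command replaces every line matching the pattern with the supplied text.

```shell
# Local demo of sed's c\ (change-line) command, as used in the Jupyter step,
# run against a scratch file rather than the real sparkmagic config.
CFG="$(mktemp)"
printf '%s\n' '{' '  "python3_executable_path" : "/usr/bin/python3",' '}' > "$CFG"
# Replace the matching line with one that points at the new environment:
sed -i '/python3_executable_path/c\"python3_executable_path" : "/usr/bin/anaconda/envs/py35new/bin/python3",' "$CFG"
cat "$CFG"
```

The matched line is replaced wholesale, which is why the script action works regardless of what path the config previously contained.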
@@ -147,32 +144,9 @@ There is a known bug for Anaconda version 4.7.11, 4.7.12 and 4.8.0. If you see y
 
 To check your Anaconda version, you can SSH to the cluster header node and run `/usr/bin/anaconda/bin/conda --v`.
 
-## <a name="seealso"></a>See also
+## Next steps
 
 * [Overview: Apache Spark on Azure HDInsight](apache-spark-overview.md)
-
-### Scenarios
-
 * [Apache Spark with BI: Perform interactive data analysis using Spark in HDInsight with BI tools](apache-spark-use-bi-tools.md)
-* [Apache Spark with Machine Learning: Use Spark in HDInsight for analyzing building temperature using HVAC data](apache-spark-ipython-notebook-machine-learning.md)
-* [Apache Spark with Machine Learning: Use Spark in HDInsight to predict food inspection results](apache-spark-machine-learning-mllib-ipython.md)
-* [Website log analysis using Apache Spark in HDInsight](apache-spark-custom-library-website-log-analysis.md)
-
-### Create and run applications
-
-* [Create a standalone application using Scala](apache-spark-create-standalone-application.md)
-* [Run jobs remotely on an Apache Spark cluster using Apache Livy](apache-spark-livy-rest-interface.md)
-
-### Tools and extensions
-
-* [Use external packages with Jupyter notebooks in Apache Spark clusters on HDInsight](apache-spark-jupyter-notebook-use-external-packages.md)
-* [Use HDInsight Tools Plugin for IntelliJ IDEA to create and submit Spark Scala applications](apache-spark-intellij-tool-plugin.md)
-* [Use HDInsight Tools Plugin for IntelliJ IDEA to debug Apache Spark applications remotely](apache-spark-intellij-tool-plugin-debug-jobs-remotely.md)
-* [Use Apache Zeppelin notebooks with an Apache Spark cluster on HDInsight](apache-spark-zeppelin-notebook.md)
-* [Kernels available for Jupyter notebook in Apache Spark cluster for HDInsight](apache-spark-jupyter-notebook-kernels.md)
-* [Install Jupyter on your computer and connect to an HDInsight Spark cluster](apache-spark-jupyter-notebook-install-locally.md)
-
-### Manage resources
-
 * [Manage resources for the Apache Spark cluster in Azure HDInsight](apache-spark-resource-manager.md)
 * [Track and debug jobs running on an Apache Spark cluster in HDInsight](apache-spark-job-debugging.md)
