
Commit c272e9e

edit pass: apache-spark-azure-portal-add-libraries
1 parent 8a7999c commit c272e9e


articles/synapse-analytics/spark/apache-spark-azure-portal-add-libraries.md

Lines changed: 55 additions & 55 deletions
@@ -1,6 +1,6 @@
 ---
 title: Manage Apache Spark packages
-description: Learn how to add and manage libraries used by Apache Spark in Azure Synapse Analytics. Libraries provide reusable code for use in your programs or projects.
+description: Learn how to add and manage libraries used by Apache Spark in Azure Synapse Analytics.
 author: shuaijunye
 ms.service: synapse-analytics
 ms.topic: how-to
@@ -12,89 +12,89 @@ ms.custom: kr2b-contr-experiment
 
 # Manage libraries for Apache Spark in Azure Synapse Analytics
 
-Libraries provide reusable code that you might want to include in your programs or projects.
+Libraries provide reusable code that you might want to include in your programs or projects for Apache Spark in Azure Synapse Analytics (Azure Synapse Spark).
 
 You might need to update your serverless Apache Spark pool environment for various reasons. For example, you might find that:
 
 - One of your core dependencies released a new version.
 - You need an extra package for training your machine learning model or preparing your data.
-- You have found a better package and no longer need the older package.
+- A better package is available, and you no longer need the older package.
 - Your team has built a custom package that you need available in your Apache Spark pool.
 
-To make third party or locally built code available to your applications, install a library onto one of your serverless Apache Spark pools or notebook session.
+To make third-party or locally built code available to your applications, install a library onto one of your serverless Apache Spark pools or a notebook session.
 
-> [!IMPORTANT]
->
-> - There are three levels of package installing on Synapse Analytics -- default level, Spark pool level and session level.
-> - Apache Spark in Azure Synapse Analytics has a full Anaconda install plus extra libraries served as the default level installation which is fully managed by Synapse. The Spark pool level packages can be used by all running Artifacts, e.g., Notebook and Spark job definition attaching the corresponding Spark pool. The session level installation will create an environment for the specific Notebook session, the change of session level libraries will not be persisted between sessions.
-> - You can upload custom libraries and a specific version of an open-source library that you would like to use in your Azure Synapse Analytics Workspace. The workspace packages can be installed in your Spark pools.
-> - To be noted, the pool level library management can take certain amount of time depending on the size of packages and the complexity of required dependencies. The session level installation is suggested with experimental and quick iterative scenarios.
-
-## Default Installation
+## Overview of package levels
+
+There are three levels of package installing on Azure Synapse Analytics:
 
-Default packages include a full Anaconda install plus extra commonly used libraries. The full libraries list can be found at [Apache Spark version support](apache-spark-version-support.md).
+- **Default**: Default packages include a full Anaconda installation, plus extra commonly used libraries. For a full list of libraries, see [Apache Spark version support](apache-spark-version-support.md).
 
-When a Spark instance starts, these libraries are included automatically. More packages can be added at the Spark pool level or session level.
+When a Spark instance starts, these libraries are included automatically. You can add more packages at the other levels.
+- **Spark pool**: All running artifacts can use packages at the Spark pool level. For example, you can attach notebook and Spark job definitions to corresponding Spark pools.
 
-## Workspace packages
+You can upload custom libraries and a specific version of an open-source library that you want to use in your Azure Synapse Analytics workspace. The workspace packages can be installed in your Spark pools.
+- **Session**: A session-level installation creates an environment for a specific notebook session. The change of session-level libraries isn't persisted between sessions.
+
+> [!NOTE]
+> The pool-level library management can take time, depending on the size of the packages and the complexity of required dependencies. We recommend the session-level installation for experimental and quick iterative scenarios.
+
+## Manage workspace packages
 
 When your team develops custom applications or models, you might develop various code artifacts like *.whl*, *.jar*, or *tar.gz* files to package your code.
 
-In Synapse, workspace packages can be custom or private *.whl* or *.jar* files. You can upload these packages to your workspace and later assign them to a specific serverless Apache Spark pool. Once assigned, these workspace packages are installed automatically on all Spark pool sessions.
+In Azure Synapse, workspace packages can be custom or private *.whl* or *.jar* files. You can upload these packages to your workspace and later assign them to a specific serverless Apache Spark pool. After you assign these workspace packages, they're installed automatically on all Spark pool sessions.
 
-To learn more about how to manage workspace libraries, see the following article:
+To learn more about how to manage workspace libraries, see [Manage workspace packages](./apache-spark-manage-workspace-packages.md).
 
-- [Manage workspace packages](./apache-spark-manage-workspace-packages.md)
+## Manage pool packages
 
-## Pool packages
+In some cases, you might want to standardize the packages that are used on an Apache Spark pool. This standardization can be useful if multiple people on your team commonly install the same packages.
 
-In some cases, you might want to standardize the packages that are used on an Apache Spark pool. This standardization can be useful if the same packages are commonly installed by multiple people on your team.
+By using the pool management capabilities of Azure Synapse Analytics, you can configure the default set of libraries to install on a serverless Apache Spark pool. These libraries are installed on top of the [base runtime](./apache-spark-version-support.md).
 
-Using the Azure Synapse Analytics pool management capabilities, you can configure the default set of libraries to install on a given serverless Apache Spark pool. These libraries are installed on top of the [base runtime](./apache-spark-version-support.md).
+Currently, pool management is supported only for Python. For Python, Synapse Spark pools use Conda to install and manage Python package dependencies.
 
-Currently, pool management is only supported for Python. For Python, Synapse Spark pools use Conda to install and manage Python package dependencies. When specifying your pool-level libraries, you can now provide a *requirements.txt* or an *environment.yml* file. This environment configuration file is used every time a Spark instance is created from that Spark pool.
+When you're specifying pool-level libraries, you can now provide a *requirements.txt* or an *environment.yml* file. This environment configuration file is used every time a Spark instance is created from that Spark pool.
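For illustration only (this sample isn't part of the commit): a pool-level *environment.yml* follows the standard Conda environment format, with conda packages pinned using `=` and pip packages pinned using `==`. Every package name and version below is a placeholder.

```yaml
name: my-pool-env                  # informational; placeholder name
channels:
  - conda-forge
dependencies:
  - python=3.8                     # placeholder; align with the pool's runtime
  - scikit-learn=1.0.2             # conda package pinned to an exact version
  - pip:
    - imbalanced-learn==0.10.1     # pip package pinned to an exact version
```

A *requirements.txt* is the simpler alternative: one `package==version` line per extra package.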
 
 To learn more about these capabilities, see [Manage Spark pool packages](./apache-spark-manage-pool-packages.md).
 
 > [!IMPORTANT]
->
-> - If the package you are installing is large or takes a long time to install, this fact affects the Spark instance start up time.
+> - If the package that you're installing is large or takes a long time to install, it might affect the Spark instance's startup time.
 > - Altering the PySpark, Python, Scala/Java, .NET, or Spark version is not supported.
 
-### Manage dependencies for DEP-enabled Synapse Spark pools
+## Manage dependencies for DEP-enabled Synapse Spark pools

6667
> [!NOTE]
->
-> - Installing packages from public repo is not supported within [DEP-enabled workspaces](../security/workspace-data-exfiltration-protection.md), you should upload all your dependencies as workspace libraries and install to your Spark pool.
->
-Please follow the steps below if you have trouble to identify the required dependencies:
-
-- **Step1: Run the following script to set up a local Python environment same with Synapse Spark environment**
-The setup script requires [Synapse-Python38-CPU.yml](https://github.com/Azure-Samples/Synapse/blob/main/Spark/Python/Synapse-Python38-CPU.yml) which is the list of libraries shipped in the default Python env in Synapse spark.
-
-```powershell
-# one-time synapse Python setup
-wget Synapse-Python38-CPU.yml
-sudo bash Miniforge3-Linux-x86_64.sh -b -p /usr/lib/miniforge3
-export PATH="/usr/lib/miniforge3/bin:$PATH"
-sudo apt-get -yq install gcc g++
-conda env create -n synapse-env -f Synapse-Python38-CPU.yml
-source activate synapse-env
-```
-
-- **Step2: Run the following script to identify the required dependencies**
-The below snippet can be used to pass your requirement.txt which has all the packages and version you intend to install in the spark 3.1/spark3.2 spark pool. It will print the names of the *new* wheel files/dependencies needed for your input library requirements. Note this will list out only the dependencies that are not already present in the spark pool by default.
-
-```python
-# command to list out wheels needed for your input libraries
-# this command will list out only *new* dependencies that are
-# not already part of the built-in synapse environment
-pip install -r <input-user-req.txt> > pip_output.txt
-cat pip_output.txt | grep "Using cached *"
-```
+> Installing packages from public repositories isn't supported within [DEP-enabled workspaces](../security/workspace-data-exfiltration-protection.md). Upload all your dependencies as workspace libraries and install them to your Spark pool.
 
+If you have trouble identifying the required dependencies, follow these steps:
 
-## Session-scoped packages
+1. Run the following script to set up a local Python environment that's the same as the Synapse Spark environment. The script requires [Synapse-Python38-CPU.yml](https://github.com/Azure-Samples/Synapse/blob/main/Spark/Python/Synapse-Python38-CPU.yml), which is the list of libraries shipped in the default Python environment in Synapse Spark.
+
+```powershell
+# One-time synapse Python setup
+wget Synapse-Python38-CPU.yml
+sudo bash Miniforge3-Linux-x86_64.sh -b -p /usr/lib/miniforge3
+export PATH="/usr/lib/miniforge3/bin:$PATH"
+sudo apt-get -yq install gcc g++
+conda env create -n synapse-env -f Synapse-Python38-CPU.yml
+source activate synapse-env
+```
+
+1. Run the following script to identify the required dependencies.
+The script can be used to pass your requirements.txt file, which has all the packages and versions that you intend to install in the Spark 3.1 or Spark 3.2 pool. It will print the names of the *new* wheel files/dependencies needed for your input library requirements.
+
+```python
+# Command to list out wheels needed for your input libraries.
+# This command will list out only new dependencies that are
+# not already part of the built-in Synapse environment.
+pip install -r <input-user-req.txt> > pip_output.txt
+cat pip_output.txt | grep "Using cached *"
+```
+> [!NOTE]
+> This script will list out only the dependencies that are not already present in the Spark pool by default.
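For reference only (not part of this commit): if you prefer to post-process *pip_output.txt* in Python instead of `grep`, a minimal sketch might look like the following. The file name and the "Using cached" output format are assumptions carried over from the step above.

```python
import re

# Read the pip output captured by the previous step.
with open("pip_output.txt") as f:
    pip_output = f.read()

# Pull out the wheel file names that pip reports as "Using cached <wheel>";
# these are the new dependencies that aren't already in the pool.
new_wheels = re.findall(r"Using cached\s+(\S+\.whl)", pip_output)

for wheel in new_wheels:
    print(wheel)
```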
+
+## Manage session-scoped packages
 
 Often, when doing interactive data analysis or machine learning, you might try newer packages, or you might need packages that are currently unavailable on your Apache Spark pool. Instead of updating the pool configuration, you can use session-scoped packages to add, manage, and update session dependencies.
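As a quick illustration (not part of this commit), session-scoped Python packages are typically added from a notebook cell with a `%pip` magic command; the package and version below are placeholders.

```python
# Run in a notebook cell: installs the package for the current session only.
# The change isn't persisted after the session ends.
%pip install altair==5.0.1
```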
 
@@ -108,7 +108,7 @@ To learn more about how to manage session-scoped packages, see the following art
 
 - [R session packages:](./apache-spark-manage-session-packages.md#session-scoped-r-packages-preview) Within your session, you can install packages across all nodes within your Spark pool using `install.packages` or `devtools`.
 
-## Manage your packages outside Synapse Analytics UI
+## Manage your packages outside the Synapse Analytics UI
 
 If your team wants to manage libraries without visiting the package management UIs, you can manage the workspace packages and pool-level package updates through Azure PowerShell cmdlets or REST APIs for Azure Synapse Analytics.
 