
Commit cad093a

committed: Update the description for Python inline installation
1 parent a919525 commit cad093a

File tree: 4 files changed (+188, -70 lines)

Lines changed: 73 additions & 41 deletions
@@ -1,94 +1,126 @@
---
title: Manage Apache Spark packages
-description: Learn how to add and manage libraries used by Apache Spark in Azure Synapse Analytics. Libraries provide reusable code for use in your programs or projects.
+description: Learn how to add and manage libraries used by Apache Spark in Azure Synapse Analytics.
author: shuaijunye
ms.service: synapse-analytics
ms.topic: how-to
-ms.date: 11/03/2022
+ms.date: 02/20/2023
ms.author: shuaijunye
ms.subservice: spark
ms.custom: kr2b-contr-experiment
---

# Manage libraries for Apache Spark in Azure Synapse Analytics

-Libraries provide reusable code that you might want to include in your programs or projects.
+Libraries provide reusable code that you might want to include in your programs or projects for Apache Spark in Azure Synapse Analytics (Azure Synapse Spark).

You might need to update your serverless Apache Spark pool environment for various reasons. For example, you might find that:

- One of your core dependencies released a new version.
- You need an extra package for training your machine learning model or preparing your data.
-- You have found a better package and no longer need the older package.
+- A better package is available, and you no longer need the older package.
- Your team has built a custom package that you need available in your Apache Spark pool.

-To make third party or locally built code available to your applications, install a library onto one of your serverless Apache Spark pools or notebook session.
+To make third-party or locally built code available to your applications, install a library onto one of your serverless Apache Spark pools or a notebook session.

-> [!IMPORTANT]
->
-> - There are three levels of package installing on Synapse Analytics -- default level, Spark pool level and session level.
-> - Apache Spark in Azure Synapse Analytics has a full Anaconda install plus extra libraries served as the default level installation which is fully managed by Synapse. The Spark pool level packages can be used by all running Artifacts, e.g., Notebook and Spark job definition attaching the corresponding Spark pool. The session level installation will create an environment for the specific Notebook session, the change of session level libraries will not be persisted between sessions.
-> - You can upload custom libraries and a specific version of an open-source library that you would like to use in your Azure Synapse Analytics Workspace. The workspace packages can be installed in your Spark pools.
-> - To be noted, the pool level library management can take certain amount of time depending on the size of packages and the complexity of required dependencies. The session level installation is suggested with experimental and quick iterative scenarios.

-## Default Installation
+## Overview of package levels

-Default packages include a full Anaconda install plus extra commonly used libraries. The full libraries list can be found at [Apache Spark version support](apache-spark-version-support.md).
+There are three levels of packages installed on Azure Synapse Analytics:

-When a Spark instance starts, these libraries are included automatically. More packages can be added at the Spark pool level or session level.
+- **Default**: Default packages include a full Anaconda installation, plus extra commonly used libraries. For a full list of libraries, see [Apache Spark version support](apache-spark-version-support.md).

-## Workspace packages
+When a Spark instance starts, these libraries are included automatically. You can add more packages at the other levels.
+- **Spark pool**: All running artifacts can use packages at the Spark pool level. For example, you can attach notebooks and Spark job definitions to corresponding Spark pools.

-When your team develops custom applications or models, you might develop various code artifacts like *.whl*, *.jar*, or *tar.gz* files to package your code.
+You can upload custom libraries and a specific version of an open-source library that you want to use in your Azure Synapse Analytics workspace. The workspace packages can be installed in your Spark pools.
+- **Session**: A session-level installation creates an environment for a specific notebook session. The change of session-level libraries isn't persisted between sessions.

-In Synapse, workspace packages can be custom or private *.whl* or *.jar* files. You can upload these packages to your workspace and later assign them to a specific serverless Apache Spark pool. Once assigned, these workspace packages are installed automatically on all Spark pool sessions.
+> [!NOTE]
+>
+> - Pool-level library management can take time, depending on the size of the packages and the complexity of required dependencies. We recommend the session-level installation for experimental and quick iterative scenarios.
+> - Pool-level library management provides stable dependencies for running your notebooks and Spark job definitions. Installing libraries to your Spark pool is highly recommended for pipeline runs.
+> - Session-level library management helps with fast iteration and frequent library changes. However, the stability of session-level installation isn't guaranteed. Also, inline commands like %pip and %conda are disabled in pipeline runs. Managing libraries in the notebook session is recommended during the development phase.

-To learn more about how to manage workspace libraries, see the following article:
+## Manage workspace packages

-- [Manage workspace packages](./apache-spark-manage-workspace-packages.md)
+When your team develops custom applications or models, you might develop various code artifacts like *.whl*, *.jar*, or *tar.gz* files to package your code.

-> [!NOTE]
-> If you enabled [Data exfiltration protection](../security/workspace-data-exfiltration-protection.md), you should upload all your dependencies as workspace libraries.
+In Azure Synapse, workspace packages can be custom or private *.whl* or *.jar* files. You can upload these packages to your workspace and later assign them to a specific serverless Apache Spark pool. After you assign these workspace packages, they're installed automatically on all Spark pool sessions.

-## Pool packages
+To learn more about how to manage workspace libraries, see [Manage workspace packages](./apache-spark-manage-workspace-packages.md).
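
The upload-and-assign workflow described above can also be scripted. Here's a minimal Azure PowerShell sketch, assuming the Az.Synapse module is installed and you're signed in; the workspace, pool, and file names are placeholders:

```powershell
# Upload a locally built wheel as a workspace package (placeholder names).
$pkg = New-AzSynapseWorkspacePackage -WorkspaceName "ContosoWorkspace" -Package "C:\libs\contoso_utils-1.0.0-py3-none-any.whl"

# Attach the uploaded package to a serverless Apache Spark pool so that
# new sessions on the pool pick it up automatically.
Update-AzSynapseSparkPool -WorkspaceName "ContosoWorkspace" -Name "ContosoSparkPool" -PackageAction Add -Package $pkg
```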

-In some cases, you might want to standardize the packages that are used on an Apache Spark pool. This standardization can be useful if the same packages are commonly installed by multiple people on your team.
+## Manage pool packages

-Using the Azure Synapse Analytics pool management capabilities, you can configure the default set of libraries to install on a given serverless Apache Spark pool. These libraries are installed on top of the [base runtime](./apache-spark-version-support.md).
+In some cases, you might want to standardize the packages that are used on an Apache Spark pool. This standardization can be useful if multiple people on your team commonly install the same packages.

-Currently, pool management is only supported for Python. For Python, Synapse Spark pools use Conda to install and manage Python package dependencies. When specifying your pool-level libraries, you can now provide a *requirements.txt* or an *environment.yml* file. This environment configuration file is used every time a Spark instance is created from that Spark pool.
+By using the pool management capabilities of Azure Synapse Analytics, you can configure the default set of libraries to install on a serverless Apache Spark pool. These libraries are installed on top of the [base runtime](./apache-spark-version-support.md).
+
+For Python libraries, Azure Synapse Spark pools use Conda to install and manage Python package dependencies. You can specify the pool-level Python libraries by providing a *requirements.txt* or *environment.yml* file. This environment configuration file is used every time a Spark instance is created from that Spark pool. You can also attach the workspace packages to your pools.

To learn more about these capabilities, see [Manage Spark pool packages](./apache-spark-manage-pool-packages.md).
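
As a rough illustration of applying a pool-level Python configuration from a script rather than the Studio UI, the following sketch points an existing pool at a local *requirements.txt*. The names are placeholders, and it assumes the `-LibraryRequirementsFilePath` parameter is available on `Update-AzSynapseSparkPool` in your installed Az.Synapse version:

```powershell
# Apply requirements.txt as the pool-level Python configuration (placeholder names).
# Spark instances created from this pool afterward install these packages on startup.
Update-AzSynapseSparkPool -WorkspaceName "ContosoWorkspace" -Name "ContosoSparkPool" `
    -LibraryRequirementsFilePath "C:\config\requirements.txt"
```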

> [!IMPORTANT]
>
-> - If the package you are installing is large or takes a long time to install, this fact affects the Spark instance start up time.
+> - If the package that you're installing is large or takes a long time to install, it might affect the Spark instance's startup time.
> - Altering the PySpark, Python, Scala/Java, .NET, or Spark version is not supported.
-> - Installing packages from PyPI is not supported within DEP-enabled workspaces.

-## Session-scoped packages
+## Manage dependencies for DEP-enabled Azure Synapse Spark pools
+
+> [!NOTE]
+> Installing packages from a public repo is not supported within [DEP-enabled workspaces](../security/workspace-data-exfiltration-protection.md). Instead, upload all your dependencies as workspace libraries and install them to your Spark pool.
+
+If you're having trouble identifying required dependencies, follow these steps:
+
+1. Run the following script to set up a local Python environment that's the same as the Azure Synapse Spark environment. The script requires [Synapse-Python38-CPU.yml](https://github.com/Azure-Samples/Synapse/blob/main/Spark/Python/Synapse-Python38-CPU.yml), which is the list of libraries shipped in the default Python environment in Azure Synapse Spark.
+
+```bash
+# One-time Azure Synapse Python setup
+# Download the environment file (raw copy of the Synapse-Python38-CPU.yml linked above).
+wget https://raw.githubusercontent.com/Azure-Samples/Synapse/main/Spark/Python/Synapse-Python38-CPU.yml
+# Download the Miniforge installer from its standard conda-forge release location, then install it.
+wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
+sudo bash Miniforge3-Linux-x86_64.sh -b -p /usr/lib/miniforge3
+export PATH="/usr/lib/miniforge3/bin:$PATH"
+sudo apt-get -yq install gcc g++
+conda env create -n synapse-env -f Synapse-Python38-CPU.yml
+source activate synapse-env
+```
+
+1. Run the following script to identify the required dependencies.
+Pass in your *requirements.txt* file, which has all the packages and versions that you intend to install in the Spark 3.1 or Spark 3.2 pool. The script prints the names of the *new* wheel files/dependencies for your input library requirements.
+
+```bash
+# Command to list wheels needed for your input libraries.
+# This command will list only new dependencies that are
+# not already part of the built-in Azure Synapse environment.
+pip install -r <input-user-req.txt> > pip_output.txt
+cat pip_output.txt | grep "Using cached *"
+```
+
+> [!NOTE]
+> This script will list only the dependencies that are not already present in the Spark pool by default.
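
One possible follow-up to the two scripts above, sketched here as an illustration rather than as part of the article: download the wheels that were identified and upload them as workspace packages, so a DEP-enabled pool can install them without reaching a public repo. Names and paths are placeholders; pip and the Az.Synapse module are assumed to be available locally.

```powershell
# Download the required wheels locally instead of resolving them from PyPI at run time.
pip download -r input-user-req.txt -d .\wheelhouse

# Upload each wheel as a workspace package (placeholder workspace name).
Get-ChildItem .\wheelhouse -Filter *.whl | ForEach-Object {
    New-AzSynapseWorkspacePackage -WorkspaceName "ContosoWorkspace" -Package $_.FullName
}
```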

+## Manage session-scoped packages

-Often, when doing interactive data analysis or machine learning, you might try newer packages or you might need packages that are currently unavailable on your Apache Spark pool. Instead of updating the pool configuration, users can now use session-scoped packages to add, manage, and update session dependencies.
+When you're doing interactive data analysis or machine learning, you might try newer packages, or you might need packages that are currently unavailable on your Apache Spark pool. Instead of updating the pool configuration, you can use session-scoped packages to add, manage, and update session dependencies.

-Session-scoped packages allow users to define package dependencies at the start of their session. When you install a session-scoped package, only the current session has access to the specified packages. As a result, these session-scoped packages don't affect other sessions or jobs using the same Apache Spark pool. In addition, these libraries are installed on top of the base runtime and pool level packages.
+Session-scoped packages allow users to define package dependencies at the start of their session. When you install a session-scoped package, only the current session has access to the specified packages. As a result, these session-scoped packages don't affect other sessions or jobs that use the same Apache Spark pool. In addition, these libraries are installed on top of the base runtime and pool-level packages.

To learn more about how to manage session-scoped packages, see the following articles:

-- [Python session packages:](./apache-spark-manage-session-packages.md#session-scoped-python-packages) At the start of a session, provide a Conda *environment.yml* to install more Python packages from popular repositories.
+- [Python session packages](./apache-spark-manage-session-packages.md#session-scoped-python-packages): At the start of a session, provide a Conda *environment.yml* file to install more Python packages from popular repositories. Or you can use %pip and %conda commands to manage libraries in the Notebook code cells.

-- [Scala/Java session packages:](./apache-spark-manage-session-packages.md#session-scoped-java-or-scala-packages) At the start of your session, provide a list of *.jar* files to install using `%%configure`.
+- [Scala/Java session packages](./apache-spark-manage-session-packages.md#session-scoped-java-or-scala-packages): At the start of your session, provide a list of *.jar* files to install by using `%%configure`.

-- [R session packages:](./apache-spark-manage-session-packages.md#session-scoped-r-packages-preview) Within your session, you can install packages across all nodes within your Spark pool using `install.packages` or `devtools`.
+- [R session packages](./apache-spark-manage-session-packages.md#session-scoped-r-packages-preview): Within your session, you can install packages across all nodes within your Spark pool by using `install.packages` or `devtools`.

-## Manage your packages outside Synapse Analytics UI
+## Automate the library management process through Azure PowerShell cmdlets and REST APIs

-If your team want to manage the libraries without visiting the package management UIs, you have the options to manage the workspace packages and pool level package updates through Azure PowerShell cmdlets or REST APIs for Synapse Analytics.
+If your team wants to manage libraries without visiting the package management UIs, you have the option to manage the workspace packages and pool-level package updates through Azure PowerShell cmdlets or REST APIs for Azure Synapse Analytics.

-To learn more about Azure PowerShell cmdlets and package management REST APIs, see the following articles:
+For more information, see the following articles:

-- Azure PowerShell cmdlets for Synapse Analytics: [Manage your Spark pool libraries through Azure PowerShell cmdlets](apache-spark-manage-packages-outside-ui.md#manage-packages-through-azure-powershell-cmdlets)
-- Package management REST APIs: [Manage your Spark pool libraries through REST APIs](apache-spark-manage-packages-outside-ui.md#manage-packages-through-rest-apis)
+- [Manage your Spark pool libraries through REST APIs](apache-spark-manage-packages-outside-ui.md#manage-packages-through-rest-apis)
+- [Manage your Spark pool libraries through Azure PowerShell cmdlets](apache-spark-manage-packages-outside-ui.md#manage-packages-through-azure-powershell-cmdlets)
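
Before automating updates, it can help to see what's already uploaded and attached. Here's a minimal sketch with placeholder names, using cmdlets that appear elsewhere in this commit:

```powershell
# List packages that are uploaded to the workspace.
Get-AzSynapseWorkspacePackage -WorkspaceName "ContosoWorkspace"

# Show which workspace packages are currently attached to a specific Spark pool.
$pool = Get-AzSynapseSparkPool -WorkspaceName "ContosoWorkspace" -Name "ContosoSparkPool"
$pool.WorkspacePackages
```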

## Next steps

-- View the default libraries: [Apache Spark version support](apache-spark-version-support.md)
-- Troubleshoot library installation errors: [Troubleshoot library errors](apache-spark-troubleshoot-library-errors.md)
+- [View the default libraries and supported Apache Spark versions](apache-spark-version-support.md)
+- [Troubleshoot library installation errors](apache-spark-troubleshoot-library-errors.md)

articles/synapse-analytics/spark/apache-spark-manage-packages-outside-UI.md

Lines changed: 9 additions & 5 deletions
@@ -4,12 +4,12 @@ description: Learn how to manage packages using Azure PowerShell cmdlets or REST
author: shuaijunye
ms.service: synapse-analytics
ms.topic: conceptual
-ms.date: 07/07/2022
+ms.date: 02/23/2023
ms.author: shuaijunye
ms.subservice: spark
---

-# Manage packages outside Synapse Analytics Studio UIs
+# Automate the library management process through Azure PowerShell cmdlets and REST APIs

You may want to manage your libraries for your serverless Apache Spark pools without going into the Synapse Analytics UI pages. For example, you may find that:

@@ -21,6 +21,7 @@ In this article, we'll provide a general guide to help you managing libraries th
## Manage packages through Azure PowerShell cmdlets

### Add new libraries

1. [New-AzSynapseWorkspacePackage](/powershell/module/az.synapse/new-azsynapseworkspacepackage) command can be used to **upload new libraries to workspace**.

```powershell
@@ -42,29 +43,31 @@ In this article, we'll provide a general guide to help you managing libraries th
```

### Remove libraries

1. In order to **remove an installed package** from your Spark pool, please refer to the command combination of [Get-AzSynapseWorkspacePackage](/powershell/module/az.synapse/get-azsynapseworkspacepackage) and [Update-AzSynapseSparkPool](/powershell/module/az.synapse/update-azsynapsesparkpool).

```powershell
$package = Get-AzSynapseWorkspacePackage -WorkspaceName ContosoWorkspace -Name ContosoPackage
Update-AzSynapseSparkPool -WorkspaceName ContosoWorkspace -Name ContosoSparkPool -PackageAction Remove -Package $package
```

-2. You can also retrieve a Spark pool and **remove all attached workspace libraries** from the pool by calling [Get-AzSynapseSparkPool](/powershell/module/az.synapse/get-azsynapsesparkpool) and [Update-AzSynapseSparkPool](/powershell/module/az.synapse/update-azsynapsesparkpool) commands.
+2. You can also retrieve a Spark pool and **remove all attached workspace libraries** from the pool by calling [Get-AzSynapseSparkPool](/powershell/module/az.synapse/get-azsynapsesparkpool) and [Update-AzSynapseSparkPool](/powershell/module/az.synapse/update-azsynapsesparkpool) commands.

```powershell
$pool = Get-AzSynapseSparkPool -ResourceGroupName ContosoResourceGroup -WorkspaceName ContosoWorkspace -Name ContosoSparkPool
$pool | Update-AzSynapseSparkPool -PackageAction Remove -Package $pool.WorkspacePackages
```

For more Azure PowerShell cmdlets capabilities, please refer to [Azure PowerShell cmdlets for Azure Synapse Analytics](/powershell/module/az.synapse).

## Manage packages through REST APIs

### Manage the workspace packages

-With the ability of REST APIs, you can add/delete packages or list all uploaded files of your workspace. See the full supported APIs, please refer to [Overview of workspace library APIs](/rest/api/synapse/data-plane/library).
+With REST APIs, you can add or delete packages and list all uploaded files of your workspace. For the full set of supported APIs, see [Overview of workspace library APIs](/rest/api/synapse/data-plane/library).
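
For illustration, listing the uploaded workspace libraries over REST might look roughly like the following PowerShell sketch. The workspace name is a placeholder and the `api-version` value is an assumption; confirm both against the linked API reference.

```powershell
# Get a token for the Synapse development endpoint and list workspace libraries (sketch).
$token = (Get-AzAccessToken -ResourceUrl "https://dev.azuresynapse.net").Token
$headers = @{ Authorization = "Bearer $token" }
Invoke-RestMethod -Method Get -Headers $headers `
    -Uri "https://contosoworkspace.dev.azuresynapse.net/libraries?api-version=2020-12-01"
```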

### Manage the Spark pool packages

You can leverage the [Spark pool REST API](/rest/api/synapse/big-data-pools/create-or-update) to attach or remove your custom or open-source libraries to your Spark pools.

1. For custom libraries, please specify the list of custom files as the **customLibraries** property in the request body.
@@ -91,5 +94,6 @@ You can leverage the [Spark pool REST API](/rest/api/synapse/big-data-pools/crea
```

## Next steps

- View the default libraries: [Apache Spark version support](apache-spark-version-support.md)
- Manage Spark pool level packages through Synapse Studio portal: [Python package management on Notebook Session](./apache-spark-manage-session-packages.md#session-scoped-python-packages)
