#Customer intent: As a developer for Apache Spark and Apache Spark in Azure HDInsight, I want to learn how to manage my Spark application dependencies and install packages on my HDInsight cluster.
---
When a Spark session starts in Jupyter Notebook on the Spark kernel for Scala, you can configure packages from:
* [Maven Repository](https://search.maven.org/), or community-contributed packages at [Spark Packages](https://spark-packages.org/).
* Jar files stored on your cluster's primary storage.
You can use the `%%configure` magic to configure the notebook to use an external package. In notebooks that use external packages, make sure you call the `%%configure` magic in the first code cell. This ensures that the kernel is configured to use the package before the session starts.
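For example, a minimal sketch of a first-cell `%%configure` call that pulls in a package by its Maven coordinate (the coordinate below is a placeholder; replace it with the group, artifact, and version of the package you need):

```
%%configure -f
{ "conf": { "spark.jars.packages": "groupId:artifactId:version" } }
```

The `-f` flag drops and re-creates the session if one has already started, so the configuration is applied before any other code runs.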
In some cases, you may want to configure the jar dependencies at the cluster level so that every application is set up with the same dependencies by default. The approach is to add your jar paths to the Spark driver and executor class paths (a sketch of the relevant settings follows the script action below).
1. Run the following sample script action to copy jar files from primary storage `wasb://[email protected]/libs/*` to the cluster's local file system `/usr/libs/sparklibs`. This step is needed because Linux uses `:` to separate the class path list, but HDInsight only supports storage paths with a scheme such as `wasb://`. The remote storage path won't work correctly if you add it directly to the class path.
```bash
sudo mkdir -p /usr/libs/sparklibs
sudo hadoop fs -copyToLocal wasb://[email protected]/libs/* /usr/libs/sparklibs
```
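The remaining cluster-level steps aren't shown above; the idea is to point the Spark driver and executor class paths at the local folder. A minimal sketch of the settings, assuming you add them as custom Spark configuration (the property names `spark.driver.extraClassPath` and `spark.executor.extraClassPath` are standard Spark configuration; where you set them, such as a custom `spark2-defaults` section in Ambari, depends on your cluster):

```
spark.driver.extraClassPath /usr/libs/sparklibs/*
spark.executor.extraClassPath /usr/libs/sparklibs/*
```

After changing these settings, restart the affected Spark services so the new class path is picked up.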
HDInsight clusters have built-in jar dependencies, and updates for these jar versions happen from time to time.
## Python packages for one Spark job
### Use Jupyter Notebook
The HDInsight Jupyter Notebook PySpark kernel doesn't support installing Python packages from the PyPI or Anaconda package repositories directly. If you have `.zip`, `.egg`, or `.py` dependencies and want to reference them for one Spark session, follow these steps:
1. Run a sample script action to copy `.zip`, `.egg`, or `.py` files from primary storage `wasb://[email protected]/libs/*` to the cluster's local file system `/usr/libs/pylibs`, as sketched below. This step is needed because Linux uses `:` to separate the search path list, but HDInsight only supports storage paths with a scheme such as `wasb://`. The remote storage path won't work correctly when you use `sys.path.insert`.
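A minimal sketch of such a script action, mirroring the jar example above and assuming the storage path referenced in the step:

```bash
# Copy the Python dependencies from primary storage (path as referenced in the step above) to a local folder
sudo mkdir -p /usr/libs/pylibs
sudo hadoop fs -copyToLocal wasb://[email protected]/libs/* /usr/libs/pylibs
```

In the notebook, you would then add the copied file to the Python search path before importing it, for example with `sys.path.insert(0, '/usr/libs/pylibs/yourpackage.zip')` (the file name here is hypothetical).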