You should arrive at your workspace:
image::jupyterhub-pyspark-hdfs-anomaly-detection-taxi-data/jupyter_hub_workspace.png[]

Now you can double-click on the `notebook` folder on the left, open and run the contained file.
Click on the double arrow (⏩️) to execute the Python scripts (click on the image below to go to the notebook file).

image::jupyterhub-pyspark-hdfs-anomaly-detection-taxi-data/jupyter_hub_run_notebook.png[link="../../../../stacks/jupyterhub-pyspark-hdfs/notebook.ipynb"]

You can also inspect the `hdfs` folder where the `core-site.xml` and `hdfs-site.xml` from the discovery ConfigMap of the HDFS cluster are located.
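For a quick check from a notebook cell that these discovery files are present, here is a minimal sketch (the relative path is an assumption; use the folder shown in the file browser):

[source,python]
----
import os

# The `hdfs` folder holds the HDFS client configuration from the discovery ConfigMap.
hdfs_conf_dir = "hdfs"  # adjust if the folder sits elsewhere relative to the notebook
print(sorted(os.listdir(hdfs_conf_dir)))  # expect core-site.xml and hdfs-site.xml
----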

[NOTE]
====
The image defined for the spark job must contain all dependencies needed for that job to run.
For pyspark jobs, this will mean that Python libraries either need to be baked into the image or {spark-pkg}[packaged in some other way].
This demo uses a custom image, built from a Dockerfile, that contains scikit-learn, pandas and their dependencies; this is described below.
====

=== Install the libraries into a product image

Libraries can be added to a custom *product* image launched by the notebook. Suppose a Spark job is prepared like this:

[source,python]
----
import os

from pyspark.sql import SparkSession

spark = (SparkSession
    .builder
    .master(f'k8s://https://{os.environ["KUBERNETES_SERVICE_HOST"]}:{os.environ["KUBERNETES_SERVICE_PORT"]}')
    .config("spark.kubernetes.container.image", "docker.stackable.tech/demos/spark-k8s-with-scikit-learn:3.5.0-stackable24.3.0")
    .config("spark.driver.port", "2222")
    .config("spark.driver.blockManager.port", "7777")
    .config("spark.driver.host", "driver-service.default.svc.cluster.local")
    .config("spark.driver.bindAddress", "0.0.0.0")
    .config("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
    .config("spark.kubernetes.authenticate.serviceAccountName", "spark")
    .config("spark.executor.instances", "4")
    .config("spark.kubernetes.container.image.pullPolicy", "IfNotPresent")
    .appName("taxi-data-anomaly-detection")
    .getOrCreate()
)
----

It requires a specific Spark image:

[source,python]
----
.config("spark.kubernetes.container.image",
"docker.stackable.tech/demos/spark-k8s-with-scikit-learn:3.5.0-stackable24.3.0"),
...
----

This image is built from the base Spark image, in this case `docker.stackable.tech/stackable/spark-k8s:3.5.0-stackable24.3.0`, by installing the required Python libraries into it and re-tagging the result:

[source,console]
----
FROM docker.stackable.tech/stackable/spark-k8s:3.5.0-stackable24.3.0

COPY demos/jupyterhub-pyspark-hdfs-anomaly-detection-taxi-data/requirements.txt .

RUN pip install --no-cache-dir --upgrade pip && \
    pip install --no-cache-dir -r ./requirements.txt
----

Where `requirements.txt` contains:

[source,console]
----
scikit-learn==1.3.1
pandas==2.0.3
----
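
With scikit-learn and pandas baked into the image, functions that Spark runs on the executors can import them directly.
The following is an illustrative sketch only (the HDFS path, column name and model choice are assumptions, not the demo notebook's actual code), reusing the `spark` session created above:

[source,python]
----
import pandas as pd
from sklearn.ensemble import IsolationForest

def score_batch(batches):
    # Runs on the executors, which is why scikit-learn must be in the Spark image.
    for pdf in batches:
        model = IsolationForest(contamination=0.01, random_state=42)
        pdf["anomaly"] = model.fit_predict(pdf[["total_amount"]])
        yield pdf

df = spark.read.parquet("/data/taxi")  # hypothetical path, resolved via the HDFS discovery config
scored = df.mapInPandas(score_batch, schema=df.schema.add("anomaly", "integer"))
scored.filter("anomaly == -1").show()
----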

NOTE: Using a custom image requires access to a repository where the image can be made available.

== Model details

docs/modules/demos/pages/signal-processing.adoc

image::signal-processing/notebook.png[]

The notebook reads the measurement data in windowed batches using a loop, computes some predictions for each batch and persists the scores in a separate timescale table.
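
For orientation, a minimal sketch of such a loop is shown below.
Table and column names, the connection settings and the scoring function are illustrative assumptions rather than the demo notebook's actual code; the libraries used here are installed as described in the next section:

[source,python]
----
import numpy as np
import pandas as pd
import psycopg2

# Hypothetical connection details for illustration only.
conn = psycopg2.connect(host="postgresql", dbname="demo", user="demo", password="demo")

def score(values: np.ndarray) -> np.ndarray:
    # Stand-in scoring function; the demo uses an alibi-detect model (see below).
    return np.abs((values - values.mean()) / (values.std() + 1e-9))

window = pd.Timedelta(minutes=5)
start = pd.Timestamp("2023-01-01 00:00:00")

for _ in range(10):  # ten windowed batches
    end = start + window
    batch = pd.read_sql(
        "SELECT ts, value FROM measurements WHERE ts >= %s AND ts < %s ORDER BY ts",
        conn, params=(start, end))
    if batch.empty:
        start = end
        continue
    scores = score(batch["value"].to_numpy())
    rows = [(t.to_pydatetime(), float(s)) for t, s in zip(batch["ts"], scores)]
    with conn.cursor() as cur:
        cur.executemany("INSERT INTO anomaly_scores (ts, score) VALUES (%s, %s)", rows)
    conn.commit()
    start = end
----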

=== Adding libraries

The notebook shows, in commented-out lines at the top of the first cell, one way of installing Python libraries: using an inline `pip` command.
In general there are two ways of adding libraries:

==== Install from within the notebook

This can be done by executing `!pip install` from within a notebook cell, as shown in the screenshot above:

[source,console]
----
!pip install psycopg2-binary
!pip install alibi-detect
----
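
Once that cell has run, the libraries can be imported in subsequent cells as usual, for example:

[source,python]
----
# Quick check that the inline installs above are importable
# (the exact imports the demo notebook uses may differ).
import psycopg2
import alibi_detect

print(psycopg2.__version__, alibi_detect.__version__)
----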

==== Install the libraries into a custom image

Alternatively, dependencies can be added to the base image used for JupyterHub.
This can make use of any Dockerfile mechanism (downloading via `curl`, using a package manager, etc.) and is not limited to Python libraries.
To make the same libraries available as in the previous section, build an image from a Dockerfile like this:

[source,console]
----
FROM jupyter/pyspark-notebook:python-3.9

COPY demos/signal-processing/requirements.txt .

RUN pip install --no-cache-dir --upgrade pip && \
    pip install --no-cache-dir -r ./requirements.txt
----

Where `requirements.txt` contains:

[source,console]
----
psycopg2-binary==2.9.9
alibi-detect==0.11.4
----

NOTE: Using a custom image requires access to a repository where the image can be made available.

== Model details

The enriched data is calculated using an online, unsupervised https://docs.seldon.io/projects/alibi-detect/en/stable/od/methods/sr.html[model] that uses a technique called http://www.houxiaodi.com/assets/papers/cvpr07.pdf[Spectral Residuals].
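
For orientation, here is a minimal, self-contained sketch of running this detector from `alibi-detect` on synthetic data (the parameter values are illustrative assumptions; see the linked documentation for details):

[source,python]
----
import numpy as np
from alibi_detect.od import SpectralResidual

od = SpectralResidual(
    threshold=1.0,     # instance scores above this are flagged as outliers
    window_amp=20,     # window for the average log amplitude
    window_local=20,   # window for the local average of the scores
    n_est_points=20,   # points used to extrapolate the time series
)

# Synthetic signal with one injected anomaly.
values = np.sin(np.linspace(0, 20, 500)) + np.random.normal(0, 0.05, 500)
values[250] += 3.0

preds = od.predict(values, return_instance_score=True)
print(preds["data"]["is_outlier"][245:255])
----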