feat: Jupyterhub with keycloak, spark and s3 #155
Merged
39 commits
8f41ef3
initial keycloak setup
adwk67 eaf555c
wip: jupyterhub + keycloak
adwk67 34fb370
wip
adwk67 54fc0dc
wip: certificates work but callback does not
adwk67 ae34b6a
wip: various tweaks
adwk67 34c138d
added some temp docs
adwk67 05ad6d6
add login info
adwk67 e16da8c
added some readme info
adwk67 0f5dce1
corrected ingress secret, set python cacert explicitly
adwk67 0e3a28c
Merge branch 'main' into feat/keycloak-jupyterhub
adwk67 ca0c492
wip: working version
adwk67 f046dd8
clean-up realm-config
adwk67 e8eb2f9
delegate user check to Keycloak
adwk67 c1274e6
use demo-specific keycloak
adwk67 c41f309
removed unnecessary settings
adwk67 f6d22a9
specify ports
adwk67 697a0a8
add jupyterhub.yaml to stack
adwk67 396705f
wip: working nb/spark combo
adwk67 bc94e33
read/write from s3
adwk67 53132a8
remove driver service resource in favour of the ones produced dynamic…
adwk67 803e520
use secret for minio credentials, add demo entry
adwk67 bcfa3ae
set endpoints via extra config
adwk67 0ff07da
mount notebook
adwk67 9d431b5
user-specific job name
adwk67 79fdb3b
add some notebook comments
adwk67 b021d35
typos and add password to stack
adwk67 d3added
first draft of demo docs
adwk67 9c7298e
typo, fixed title
adwk67 7c497ee
added hdfs write/read steps
adwk67 573f812
updated docs
adwk67 884f0bf
doc cleanup
adwk67 44fad51
Merge branch 'main' into feat/keycloak-jupyterhub
adwk67 3d4484c
Apply suggestions from code review
adwk67 80bd2c6
review suggestions: remove HDFS, improve docs and server options
adwk67 44b0ecf
Update docs/modules/demos/pages/jupyterhub-keycloak.adoc
adwk67 49d47e0
Update docs/modules/demos/pages/jupyterhub-keycloak.adoc
adwk67 5a2c6cf
Update docs/modules/demos/pages/jupyterhub-keycloak.adoc
adwk67 0eb1ac7
Update docs/modules/demos/pages/jupyterhub-keycloak.adoc
adwk67 7be5288
added a note about proxy reachability
adwk67
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,21 @@ | ||
| --- | ||
| apiVersion: batch/v1 | ||
| kind: Job | ||
| metadata: | ||
| name: load-gas-data | ||
| spec: | ||
| template: | ||
| spec: | ||
| containers: | ||
| - name: load-gas-data | ||
| image: "bitnami/minio:2022-debian-10" | ||
| command: ["bash", "-c", "cd /tmp; curl -O https://repo.stackable.tech/repository/misc/datasets/gas-sensor-data/20160930_203718.csv && mc --insecure alias set minio http://minio:9000/ $(cat /minio-s3-credentials/accessKey) $(cat /minio-s3-credentials/secretKey) && mc cp 20160930_203718.csv minio/demo/gas-sensor/raw/;"] | ||
| volumeMounts: | ||
| - name: minio-s3-credentials | ||
| mountPath: /minio-s3-credentials | ||
| volumes: | ||
| - name: minio-s3-credentials | ||
| secret: | ||
| secretName: minio-s3-credentials | ||
| restartPolicy: OnFailure | ||
| backoffLimit: 50 |
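
The Job's shell one-liner can be hard to read in a single `command` string. As a rough sketch only (the function name `build_load_commands` is hypothetical and not part of this PR), the same three steps could be spelled out as argument lists in Python. The URL, the MinIO endpoint, the mount path layout, and the target bucket path are copied from the manifest above. The Job runs them from `/tmp`.

```python
from pathlib import Path

# Dataset URL taken from the Job manifest above.
DATASET_URL = (
    "https://repo.stackable.tech/repository/misc/datasets/"
    "gas-sensor-data/20160930_203718.csv"
)


def build_load_commands(credentials_dir: str) -> list[list[str]]:
    """Return the commands the Job chains with `&&`, as argv lists, in order.

    Hypothetical helper for illustration; it only builds the commands and
    does not run them.
    """
    creds = Path(credentials_dir)
    # Shell `$(cat file)` drops trailing newlines, so do the same here.
    access_key = (creds / "accessKey").read_text().rstrip("\n")
    secret_key = (creds / "secretKey").read_text().rstrip("\n")
    filename = DATASET_URL.rsplit("/", 1)[-1]
    return [
        # 1. Download the CSV sample into the working directory.
        ["curl", "-O", DATASET_URL],
        # 2. Register the in-cluster MinIO endpoint under the alias "minio".
        ["mc", "--insecure", "alias", "set", "minio",
         "http://minio:9000/", access_key, secret_key],
        # 3. Copy the file into the demo bucket.
        ["mc", "cp", filename, "minio/demo/gas-sensor/raw/"],
    ]
```

Calling `build_load_commands("/minio-s3-credentials")` inside the Job's container would yield the three commands with the credentials read from the mounted Secret.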
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,193 @@ | ||
| = jupyterhub-keycloak | ||
|
|
||
| :k8s-cpu: https://kubernetes.io/docs/tasks/debug/debug-cluster/resource-metrics-pipeline/#cpu | ||
| :spark-pkg: https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html | ||
| :pyspark: https://spark.apache.org/docs/latest/api/python/getting_started/index.html | ||
| :jupyterhub-k8s: https://github.com/jupyterhub/zero-to-jupyterhub-k8s | ||
| :jupyterlab: https://jupyterlab.readthedocs.io/en/stable/ | ||
| :jupyter: https://jupyter.org | ||
| :keycloak: https://www.keycloak.org/ | ||
| :gas-sensor: https://archive.ics.uci.edu/dataset/487/gas+sensor+array+temperature+modulation | ||
|
|
||
| This demo showcases the integration between {jupyter}[JupyterHub] and {keycloak}[Keycloak] deployed on the Stackable Data Platform (SDP) onto a Kubernetes cluster. | ||
| {jupyterlab}[JupyterLab] is deployed using the {jupyterhub-k8s}[pyspark-notebook stack] provided by the Jupyter community. | ||
| A simple notebook is provided that shows how to start a distributed Spark cluster and how to read data from, and write data to, an S3 instance. | ||
|
|
||
| For this demo a small sample of {gas-sensor}[gas sensor measurements*] is provided. | ||
| Install this demo on an existing Kubernetes cluster: | ||
|
|
||
| [source,console] | ||
| ---- | ||
| $ stackablectl demo install jupyterhub-keycloak | ||
| ---- | ||
|
|
||
| WARNING: When running a distributed Spark cluster from within a JupyterHub notebook, the notebook acts as the driver and requests executor Pods from Kubernetes. | ||
| These Pods in turn can mount *all* volumes and Secrets in that namespace. | ||
| To prevent this from breaking user separation, it is planned to use an OPA gatekeeper to define OPA rules that restrict what the created executor Pods can mount. This is not yet implemented in this demo. | ||
|
|
||
| [#system-requirements] | ||
| == System requirements | ||
|
|
||
| To run this demo, your system needs at least: | ||
|
|
||
| * 8 {k8s-cpu}[cpu units] (core/hyperthread) | ||
| * 32GiB memory | ||
|
|
||
| You may need more resources depending on how many concurrent users are logged in, and which notebook profiles they are using. | ||
|
|
||
| == Aim / Context | ||
|
|
||
| This demo shows how to authenticate JupyterHub users against a Keycloak backend using JupyterHub's OAuthenticator. | ||
| The same users as in the xref:end-to-end-security.adoc[End-to-end-security] demo are configured in Keycloak and these will be used as examples. | ||
| The notebook offers a simple template for using Spark to interact with S3 as a storage backend. | ||
|
|
||
| == Overview | ||
|
|
||
| This demo will: | ||
|
|
||
| * Install the required Stackable Data Platform operators | ||
| * Spin up the following data products: | ||
| ** *JupyterHub*: A multi-user server for Jupyter notebooks | ||
| ** *Keycloak*: An identity and access management product | ||
| ** *S3*: A Minio instance for data storage | ||
| * Download a sample of the gas sensor dataset into S3 | ||
| * Install the Jupyter notebook | ||
| * Demonstrate some basic data operations against S3 | ||
| * Illustrate multi-user usage | ||
|
|
||
| == JupyterHub | ||
|
|
||
| Have a look at the available Pods before logging in: | ||
|
|
||
| [source,console] | ||
| ---- | ||
| $ kubectl get pods | ||
| NAME READY STATUS RESTARTS AGE | ||
| hub-84f49ccbd7-29h7j 1/1 Running 0 56m | ||
| keycloak-544d757f57-f55kr 2/2 Running 0 57m | ||
| load-gas-data-m6z5p 0/1 Completed 0 54m | ||
| minio-5486d7584f-x2jn8 1/1 Running 0 57m | ||
| proxy-648bf7f45b-62vqg 1/1 Running 0 56m | ||
|
|
||
| ---- | ||
|
|
||
| The `proxy` Pod has an associated `proxy-public` Service with a statically defined port (31095), exposed with type NodePort. The `keycloak` Pod has a Service called `keycloak`, also of type NodePort, with a fixed port (31093). | ||
| To reach the JupyterHub web interface, navigate to the `proxy-public` Service. | ||
| The node IP can be found in the ConfigMap `keycloak-address` (written by the Keycloak Deployment as it starts up). | ||
| On Kind this can be any node, not necessarily the one where the proxy Pod is running. | ||
| This is due to the way in which Docker networking is used within the cluster. | ||
| On other clusters it will be necessary to use the exact Node on which the proxy is running. | ||
|
|
||
| In the example below that would then be 172.19.0.5:31095: | ||
|
|
||
| [source,yaml] | ||
| ---- | ||
| apiVersion: v1 | ||
| data: | ||
| keycloakAddress: 172.19.0.5:31093 # Keycloak itself | ||
| keycloakNodeIp: 172.19.0.5 # can be used to access the proxy-public service | ||
|
||
| kind: ConfigMap | ||
| metadata: | ||
| name: keycloak-address | ||
| namespace: default | ||
| ---- | ||
|
|
||
| NOTE: The `hub` Pod may show a `CreateContainerConfigError` for a few moments on start-up as it requires the ConfigMap written by the Keycloak deployment. | ||
|
|
||
| You should see the JupyterHub login page, which will indicate a redirect to the OAuth service (Keycloak): | ||
|
|
||
| image::jupyterhub-keycloak/oauth-login.png[] | ||
|
|
||
| Click on the sign-in button. | ||
| You will be redirected to the Keycloak login, where you can enter one of the aforementioned users (e.g. `justin.martin` or `isla.williams`: the password is the same as the username): | ||
|
|
||
| image::jupyterhub-keycloak/keycloak-login.png[] | ||
|
|
||
| A successful login will redirect you back to JupyterHub where different profiles are listed (the drop-down options are visible when you click on the respective fields): | ||
|
|
||
| image::jupyterhub-keycloak/server-options.png[] | ||
|
|
||
| The explorer window on the left includes a notebook that is already mounted. | ||
|
|
||
| Double-click on the file `notebook/process-s3.ipynb`: | ||
|
|
||
| image::jupyterhub-keycloak/load-nb.png[] | ||
|
|
||
| Run the notebook by selecting "Run All Cells" from the menu: | ||
|
|
||
| image::jupyterhub-keycloak/run-nb.png[] | ||
|
|
||
| The notebook includes some comments regarding image compatibility and uses a custom image built off the official Spark image that matches the Spark version used in the notebook. | ||
| The Java versions also match exactly. | ||
| Python versions need to match at the `major:minor` level, which is why Python 3.11 is used in the custom image. | ||
|
|
||
| Once the Spark executor has been started (we have specified `spark.executor.instances` = 1), it runs as an extra Pod. | ||
| We have named the Spark job to incorporate the current user (justin-martin). | ||
| JupyterHub has started a Pod for the user's notebook instance (`jupyter-justin-martin---bdd3b4a1`), and the notebook, acting as the Spark driver, has requested another one for the Spark executor (`process-s3-jupyter-justin-martin-bdd3b4a1-9e9da995473f481f-exec-1`): | ||
|
|
||
| [source,console] | ||
| ---- | ||
| $ kubectl get pods | ||
| NAME READY STATUS RESTARTS AGE | ||
| ... | ||
| jupyter-justin-martin---bdd3b4a1 1/1 Running 0 17m | ||
| process-s3-jupyter-justin-martin-... 1/1 Running 0 2m9s | ||
| ... | ||
| ---- | ||
|
|
||
| Stop the kernel in the notebook (which will shut down the Spark session and thus the executor) and log out as the current user. | ||
| Log in now as `daniel.king` and then again as `isla.williams` (you may need to do this in a clean browser session so that existing login cookies are removed). | ||
| The latter user, `isla.williams`, has been defined as an admin user in the JupyterHub configuration: | ||
|
|
||
| [source,yaml] | ||
| ---- | ||
| ... | ||
| hub: | ||
| config: | ||
| Authenticator: | ||
| # don't filter here: delegate to Keycloak | ||
| allow_all: True | ||
| admin_users: | ||
| - isla.williams | ||
| ... | ||
| ---- | ||
|
|
||
| You should now see user-specific pods for all three users: | ||
|
|
||
|
|
||
| [source,console] | ||
| ---- | ||
| $ kubectl get pods | ||
| NAME READY STATUS RESTARTS AGE | ||
| ... | ||
| jupyter-daniel-king---181a80ce 1/1 Running 0 6m17s | ||
| jupyter-isla-williams---14730816 1/1 Running 0 4m50s | ||
| jupyter-justin-martin---bdd3b4a1 1/1 Running 0 3h47m | ||
| ... | ||
| ---- | ||
|
|
||
| The admin user (`isla.williams`) will also have an extra Admin tab in the JupyterHub console where current users can be managed. | ||
| You can find this in the JupyterHub UI at http://<ip>:31095/hub/admin, e.g. http://172.19.0.5:31095/hub/admin: | ||
|
|
||
| image::jupyterhub-keycloak/admin-tab.png[] | ||
|
|
||
| You can inspect the S3 buckets by running `stackablectl stacklet list` to obtain the Minio endpoint and logging in there with `admin/adminadmin`: | ||
|
|
||
| [source,console] | ||
| ---- | ||
| $ stackablectl stacklet list | ||
|
|
||
| ┌─────────┬───────────────┬───────────┬───────────────────────────────┬────────────┐ | ||
| │ PRODUCT ┆ NAME ┆ NAMESPACE ┆ ENDPOINTS ┆ CONDITIONS │ | ||
| ╞═════════╪═══════════════╪═══════════╪═══════════════════════════════╪════════════╡ | ||
| │ minio ┆ minio-console ┆ default ┆ http http://172.19.0.5:32470 ┆ │ | ||
| └─────────┴───────────────┴───────────┴───────────────────────────────┴────────────┘ | ||
| ---- | ||
|
|
||
| image::jupyterhub-keycloak/s3-buckets.png[] | ||
|
|
||
| NOTE: If you attempt to re-run the notebook, you will first need to remove the `_temporary` folders from the S3 buckets. | ||
| These are created by Spark jobs and are not removed from the bucket when the job has completed. | ||
|
|
||
| *See: Burgués, Javier, Juan Manuel Jiménez-Soto, and Santiago Marco. "Estimation of the limit of detection in semiconductor gas sensors through linearized calibration models." Analytica Chimica Acta 1013 (2018): 13-25. | ||
| Burgués, Javier, and Santiago Marco. "Multivariate estimation of the limit of detection by orthogonal partial least squares in temperature-modulated MOX sensors." Analytica Chimica Acta 1019 (2018): 49-64. | ||
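
The demo page above explains that the JupyterHub address is built from the node IP in the `keycloak-address` ConfigMap plus the fixed `proxy-public` NodePort (31095). A minimal sketch of that derivation follows. The function name `demo_urls` is hypothetical, and the `http` scheme for JupyterHub is taken from the admin URL shown in the docs. The Keycloak address is returned as stored, without a scheme, because the page does not state one.

```python
# NodePort of the proxy-public Service, as stated in the demo docs.
JUPYTERHUB_NODE_PORT = 31095


def demo_urls(configmap_data: dict[str, str]) -> dict[str, str]:
    """Derive the demo's entry points from the `data` of `keycloak-address`.

    Hypothetical helper for illustration only.
    """
    node_ip = configmap_data["keycloakNodeIp"]
    hub = f"http://{node_ip}:{JUPYTERHUB_NODE_PORT}"
    return {
        "jupyterhub": hub,
        # The admin tab is only visible to admin users such as isla.williams.
        "jupyterhub_admin": f"{hub}/hub/admin",
        # Host:port exactly as written by the Keycloak Deployment.
        "keycloak_address": configmap_data["keycloakAddress"],
    }


example = demo_urls(
    {"keycloakAddress": "172.19.0.5:31093", "keycloakNodeIp": "172.19.0.5"}
)
print(example["jupyterhub"])
```

The input dictionary corresponds to the `data` field returned by `kubectl get configmap keycloak-address -o json`.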
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,29 @@ | ||
| # docker build -t oci.stackable.tech/sandbox/spark:3.5.2-python311 -f Dockerfile . | ||
| # kind load docker-image oci.stackable.tech/sandbox/spark:3.5.2-python311 -n stackable-data-platform | ||
| # or: | ||
| # docker push oci.stackable.tech/sandbox/spark:3.5.2-python311 | ||
|
|
||
| FROM spark:3.5.2-scala2.12-java17-ubuntu | ||
|
|
||
| USER root | ||
|
|
||
| RUN set -ex; \ | ||
| apt-get update; \ | ||
| # Install dependencies for Python 3.11 | ||
| apt-get install -y \ | ||
| software-properties-common \ | ||
| && apt-get update && apt-get install -y \ | ||
| python3.11 \ | ||
| python3.11-venv \ | ||
| python3.11-dev \ | ||
| && rm -rf /var/lib/apt/lists/*; \ | ||
| # Install pip manually for Python 3.11 | ||
| curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py && \ | ||
| python3.11 get-pip.py && \ | ||
| rm get-pip.py | ||
|
|
||
| # Make Python 3.11 the default Python version | ||
| RUN update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 1 \ | ||
| && update-alternatives --install /usr/bin/pip pip /usr/local/bin/pip3 1 | ||
|
|
||
| USER spark |
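
The Dockerfile above installs Python 3.11 because, as the demo docs note, the Python versions of the notebook (driver) and the executor image need to match at the `major:minor` level. That rule can be sketched as a small check. The function name `python_versions_compatible` is hypothetical and the version strings are examples only.

```python
def python_versions_compatible(driver_version: str, executor_version: str) -> bool:
    """Return True if both versions share the same major and minor number.

    Hypothetical helper: patch levels may differ, e.g. 3.11.9 vs 3.11.4.
    """
    driver_major_minor = driver_version.split(".")[:2]
    executor_major_minor = executor_version.split(".")[:2]
    return driver_major_minor == executor_major_minor


# The notebook image tag in this PR is python-3.11.9; an executor image
# with any Python 3.11.x would pass this check.
print(python_versions_compatible("3.11.9", "3.11.4"))
print(python_versions_compatible("3.11.9", "3.10.12"))
```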
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,71 @@ | ||
| --- | ||
| releaseName: jupyterhub | ||
| name: jupyterhub | ||
| repo: | ||
| name: jupyterhub | ||
| url: https://jupyterhub.github.io/helm-chart/ | ||
| version: 4.0.0 | ||
| options: | ||
| hub: | ||
| config: | ||
| Authenticator: | ||
| allow_all: True | ||
| admin_users: | ||
| - admin | ||
| JupyterHub: | ||
| authenticator_class: nativeauthenticator.NativeAuthenticator | ||
| NativeAuthenticator: | ||
| open_signup: true | ||
| proxy: | ||
| service: | ||
| type: ClusterIP | ||
| rbac: | ||
| create: true | ||
| prePuller: | ||
| hook: | ||
| enabled: false | ||
| continuous: | ||
| enabled: false | ||
| scheduling: | ||
| userScheduler: | ||
| enabled: false | ||
| singleuser: | ||
| cmd: null | ||
| serviceAccountName: hub | ||
| networkPolicy: | ||
| enabled: false | ||
| extraLabels: | ||
| stackable.tech/vendor: Stackable | ||
| profileList: | ||
| - display_name: "Default" | ||
| description: "Default profile" | ||
| default: true | ||
| profile_options: | ||
| cpu: | ||
| display_name: CPU | ||
| choices: | ||
| "2": | ||
| display_name: "2 request, 2 limit" | ||
| kubespawner_override: | ||
| cpu_guarantee: 2 | ||
| cpu_limit: 2 | ||
| "1 request, 16 limit": | ||
| display_name: "1 request, 16 limit" | ||
| kubespawner_override: | ||
| cpu_guarantee: 1 | ||
| cpu_limit: 16 | ||
| memory: | ||
| display_name: Memory | ||
| choices: | ||
| "8 GB": | ||
| display_name: "8 GB" | ||
| kubespawner_override: | ||
| mem_guarantee: "8G" | ||
| mem_limit: "8G" | ||
| image: | ||
| display_name: Image | ||
| choices: | ||
| "quay.io/jupyter/pyspark-notebook:python-3.11.9": | ||
| display_name: "quay.io/jupyter/pyspark-notebook:python-3.11.9" | ||
| kubespawner_override: | ||
| image: "quay.io/jupyter/pyspark-notebook:python-3.11.9" |
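
Each entry under `profile_options` above pairs a user-visible choice with a `kubespawner_override` fragment. The sketch below illustrates the idea that the fragments of the selected choices are combined into one set of spawner settings. It is an illustration under that assumption, not KubeSpawner's actual implementation, and `resolve_overrides` is a hypothetical name. The option keys and values are copied from the values file.

```python
# Choices and their kubespawner_override fragments, copied from the values file.
PROFILE_OPTIONS = {
    "cpu": {
        "2": {"cpu_guarantee": 2, "cpu_limit": 2},
        "1 request, 16 limit": {"cpu_guarantee": 1, "cpu_limit": 16},
    },
    "memory": {
        "8 GB": {"mem_guarantee": "8G", "mem_limit": "8G"},
    },
    "image": {
        "quay.io/jupyter/pyspark-notebook:python-3.11.9": {
            "image": "quay.io/jupyter/pyspark-notebook:python-3.11.9",
        },
    },
}


def resolve_overrides(selection: dict[str, str]) -> dict[str, object]:
    """Merge the override fragments for the chosen option values.

    `selection` maps an option name (e.g. "cpu") to the key of the choice.
    """
    merged: dict[str, object] = {}
    for option, choice in selection.items():
        merged.update(PROFILE_OPTIONS[option][choice])
    return merged


print(resolve_overrides({"cpu": "1 request, 16 limit", "memory": "8 GB"}))
```

Note that the first CPU choice is keyed `"2"` while the second is keyed by its full display name; the keys only need to be unique within an option.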