Welcome to our Helm Chart Installer for Spark. It enables you to easily deploy Spark ecosystem components on a Kubernetes cluster.
The components below enable the following features:
- Running Spark notebooks using Spark and Spark SQL
- Creating Spark jobs using Python
- Tracking Spark jobs using a UI
Components:
- Hive Metastore
- Spark Thrift Server
- Spark History Server
- Lighter Server
- Jupyter Lab
- SparkMagic Kernel
- Spark Dashboard
- Zeppelin (not supported on ARM64)
We invite you to try this out and let us know about any issues or feedback via GitHub Issues. Do let us know what adaptations you have made for your setup via GitHub Discussions.
This installer is suitable for users with basic knowledge of Kubernetes and Helm. It can also be installed on MicroK8s.
Requirements:
- Ingress
- Storage that supports `ReadWriteMany`

1. Run the following install command, where `spark-bundle` is the name you prefer:

   ```shell
   helm install spark-bundle installer --namespace kapitanspark --create-namespace --atomic --timeout=30m
   ```

2. Run the command `kubectl get ingress --namespace kapitanspark` to get the IP address (KUBERNETES_NODE_IP). For default passwords, refer to the component section in this document. After that you can access:
   - Jupyter Lab at http://KUBERNETES_NODE_IP/jupyterlab
   - Spark History Server at http://KUBERNETES_NODE_IP/spark-history-server
   - Lighter UI at http://KUBERNETES_NODE_IP/lighter
   - Spark Dashboard at http://KUBERNETES_NODE_IP/grafana
   - Zeppelin at http://KUBERNETES_NODE_IP/zeppelin
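Once you have the ingress address, the URLs above all follow the same pattern. A minimal sketch (the IP below is a placeholder for the ADDRESS column reported by `kubectl get ingress`):

```shell
# Print the component URLs for a given ingress address.
# NODE_IP is a placeholder; substitute the ADDRESS column from
# `kubectl get ingress --namespace kapitanspark`.
NODE_IP="192.168.64.2"
for path in jupyterlab spark-history-server lighter grafana zeppelin; do
  echo "http://${NODE_IP}/${path}"
done
```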
| Software | Version |
|---|---|
| Kubernetes | 1.23.0 to 1.29.0 |
| Helm | 3 |
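Before installing, you can quickly confirm that the client tools are on your PATH (a minimal sketch; version checking against the supported range is left to you):

```shell
# Check that kubectl and helm are installed before attempting the install.
for tool in kubectl helm; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: missing - install it before proceeding"
  fi
done
```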
| Resource | Minimum | Remarks |
|---|---|---|
| CPU | 8 cores | |
| Memory | 12 GB | |
| Disk | 80 GB | Adjust this based on the size of your Spark Docker images |
Remarks

- Hive Metastore
  - You may rebuild the image using the Dockerfile at `hive-metastore/Dockerfile`.
  - After rebuilding, update the keys `image.repository` and `image.tag` in `values.yaml`.
- Spark Thrift Server
  - You may rebuild the image using the Dockerfile at `spark_docker_image/Dockerfile`.
  - After rebuilding, update the keys `image.repository` and `image.tag` in `values.yaml`.
  - The Spark UI has been intentionally disabled at `spark-thrift-server/templates/service.yaml`.
  - Dependency: the `hive-metastore` component.
- Jupyter Lab
  - Modify `jupyterlab/requirements.txt` according to your project before installation.
  - Default password: `spark ecosystem`
- Lighter
  - You may rebuild the image using the Dockerfile at `spark_docker_image/Dockerfile`.
  - After rebuilding, update the keys `image.spark.repository` and `image.spark.tag` in `values.yaml`.
  - If Spark History Server uses Persistent Volumes to save event logs instead of S3a blob storage, make sure the `spark-history-server` component is installed in the same Kubernetes namespace.
  - Dependencies: the `hive-metastore`, `spark-dashboard` and `spark-history-server` components. The latter can be turned off in `values.yaml`.
  - Default user: `dataOps`, password: `5Wmi95w4`
- Spark History Server
  - By default, a Persistent Volume is used to read event logs. You may change this by updating the `dir` key in `spark-history-server/values.yaml` and, for the `lighter` component, the `spark.history.eventLog.dir` key in `lighter/values.yaml`.
  - If using a Persistent Volume instead of S3a blob storage, ensure it is installed in the same namespace as the other components.
  - Default user: `dataOps`, password: `5Wmi95w4`
- Spark Dashboard
  - Default user: `dashboard`, password: `1K7rYwg655Zl`
- Zeppelin
  - You may rebuild the image using the Dockerfile at `zeppelin/Dockerfile`.
  - After rebuilding, update the keys `image.repository`, `image.tag` and `ZEPPELIN_K8S_CONTAINER_IMAGE` in `values.yaml`.
  - If Spark History Server uses Persistent Volumes to save event logs instead of S3a blob storage, make sure the `spark-history-server` component is installed in the same Kubernetes namespace.
  - Dependencies: the `hive-metastore`, `spark-dashboard` and `spark-history-server` components. The latter can be turned off in `values.yaml`.
  - Default user: `dataOps`, password: `Tz44828IX60O`
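As an alternative to editing `values.yaml` after rebuilding an image, the same keys can be overridden on the command line. This is only a sketch: the release name, namespace, subchart prefix (`hive-metastore.`) and image coordinates are assumptions to adapt to your setup.

```shell
# Compose a `helm upgrade` command that overrides the image keys from the CLI.
# REPO and TAG are placeholders for your rebuilt image; the `hive-metastore.`
# prefix assumes those keys live under that subchart in the umbrella chart.
REPO="registry.example.com/hive-metastore"
TAG="custom-1.0.0"
CMD="helm upgrade spark-bundle installer --namespace kapitanspark --reuse-values --set hive-metastore.image.repository=${REPO} --set hive-metastore.image.tag=${TAG}"
echo "$CMD"
```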
This method is ideal for advanced users who have some expertise in Kubernetes and Helm. This approach enables you to extend existing configurations efficiently for your needs, without modifying the existing source code.
This Helm chart supports several methods of customization:
- Modifying `values.yaml`
- Providing a new `values.yaml` file
- Using Kustomize
Show Details of Customization
You may customise your installation of the above components by editing the file at `installer/values.yaml`.
Alternatively, create a copy of the values file and run the following modified command:

```shell
helm install spark-bundle installer --values new_values.yaml --namespace kapitanspark --create-namespace --atomic --timeout=30m
```

This approach keeps the original source code untouched and lets you customize as per your needs.
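An override file only needs the keys you actually change; everything else falls back to the chart defaults. The keys below are illustrative assumptions — check `installer/values.yaml` for the real structure:

```shell
# Write a small override file instead of copying the whole values.yaml.
# The enabled/disabled keys shown here are assumptions about the chart layout.
cat > new_values.yaml <<'EOF'
jupyterlab:
  enabled: true
zeppelin:
  enabled: false
EOF
echo "wrote new_values.yaml ($(wc -l < new_values.yaml) lines)"
```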
You may also refer to the section Using Kustomize below.
If you want to install each component separately, you can also navigate to the individual chart folder and run helm install as needed.
You may create multiple instances of this Helm chart by specifying a different release name, for example for production, staging and testing environments.
You may need to adjust the Spark Thrift Server port number if you are installing two instances on the same cluster.
Show Sample Commands to Create Multiple Instances

```shell
helm install spark-production installer --namespace kapitanspark-prod --create-namespace --atomic --timeout=30m
helm install spark-testing installer --namespace kapitanspark-test --create-namespace --atomic --timeout=30m
```

Show Customised Install Instructions
Requirements:
- Ingress (Nginx)
- Storage that supports `ReadWriteMany`, e.g. NFS or Longhorn NFS
1. Customize your components by enabling or disabling them in `installer/values.yaml`.
2. Navigate to the directory `kcustomize/example/prod/` and modify the `google-secret.yaml` and `values.yaml` files.
3. Modify `jupyterlab/requirements.txt` according to your project before installation.
4. Execute the install command below in the folder `kcustomize/example/prod/`, replacing `spark-bundle` with your preferred name. You can add `--dry-run=server` to test for errors in the Helm files before installation:

   ```shell
   cd kcustomize/example/prod/
   helm install spark-bundle ../../../installer --namespace kapitanspark --post-renderer ./kustomize.sh --values ./values.yaml --create-namespace --atomic --timeout=30m
   ```

5. After successful installation, you should be able to access Jupyter Lab, Spark History Server, Lighter UI and the Spark Dashboard based on your configuration of the Ingress section in `values.yaml`.
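The `--post-renderer ./kustomize.sh` flag hands Helm's rendered manifests to Kustomize for patching. The repository ships its own `kustomize.sh`; the sketch below only illustrates the usual shape of such a script (it writes a demo file rather than running against a cluster):

```shell
# Write an illustrative Helm post-renderer script. A post-renderer receives the
# fully rendered manifests on stdin and must print the patched result on stdout.
cat > kustomize-demo.sh <<'EOF'
#!/bin/sh
cat > base/all.yaml     # capture Helm's rendered manifests from stdin
kubectl kustomize .     # apply the kustomization and print the result
EOF
chmod +x kustomize-demo.sh
echo "created kustomize-demo.sh"
```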
You may skip the local setup if you already have an existing Kubernetes cluster you would like to use.
See details of setup for microk8s
At the moment, we have only tested this locally using MicroK8s. Refer to the installation steps in the MicroK8s docs.
If you are using MicroK8s, below are the steps to install Nginx and a PV with RWX support.

Use the following command to install MicroK8s with the specified resource limits:

```shell
# the requirements stated below are the minimum, feel free to adjust upwards as needed
microk8s install --cpu 8 --mem 12 --disk 80
```

Alternatively, install MicroK8s using:

```shell
sudo snap install microk8s --classic --channel=1.28
```

Ensure you set the correct permissions for the kube configuration directory:

```shell
chmod 0700 ~/.kube
```

Enable the required add-ons:

```shell
microk8s enable hostpath-storage
microk8s enable ingress
microk8s enable metrics-server
```

Output your kubeconfig and add it to `~/.kube/config` to access this Kubernetes cluster via kubectl:

```shell
microk8s config
```
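Instead of hand-merging the output into `~/.kube/config`, you can keep the MicroK8s kubeconfig in its own file and point `KUBECONFIG` at it. A sketch (the `microk8s config` redirect is commented out so the snippet also runs where MicroK8s is absent):

```shell
# Keep the MicroK8s kubeconfig in a dedicated file and select it via KUBECONFIG.
mkdir -p "$HOME/.kube"
chmod 0700 "$HOME/.kube"
# microk8s config > "$HOME/.kube/microk8s-config"   # run this on the MicroK8s host
export KUBECONFIG="$HOME/.kube/microk8s-config"
echo "KUBECONFIG=$KUBECONFIG"
```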