Commit 1742d60

Add jupyter (#87)
Closes G-Research/spark#128

Signed-off-by: Sudipto Baral <sudiptobaral.me@gmail.com>

1 parent ba16e94 commit 1742d60

File tree

7 files changed: 402 additions & 0 deletions


.gitignore

Lines changed: 3 additions & 0 deletions
```diff
@@ -80,3 +80,6 @@ scripts/armadactl
 e2e-test.log
 extraJars/*.jar
 scripts/.tmp/
+
+# Jupyter
+example/jupyter/workspace/
```

README.md

Lines changed: 18 additions & 0 deletions
```diff
@@ -151,3 +151,21 @@ The project includes a ready-to-use Spark job to test your setup:
 This job leverages the same configuration parameters (`ARMADA_MASTER`, `ARMADA_QUEUE`, `ARMADA_LOOKOUT_URL`) as the `scripts/config.sh` script.
 
 Use the -h option to see what other options are available.
+
+### Jupyter Notebook
+
+The Docker image includes Jupyter support. Run Jupyter with the example notebooks:
+
+```bash
+./scripts/runJupyter.sh
+```
+
+**Note:** The Docker image must be built with `INCLUDE_PYTHON=true` for Jupyter to work.
+
+This will start a Jupyter notebook server at `http://localhost:8888` (or the port specified by `JUPYTER_PORT` in `scripts/config.sh`).
+The example notebooks from `example/jupyter/notebooks` are mounted in the container at `/home/spark/workspace/notebooks`.
+
+**Configuration:**
+- **Required:** `SPARK_DRIVER_HOST`
+- Override the Jupyter port if required by setting `JUPYTER_PORT` in `scripts/config.sh`
+- The script uses the same configuration (`ARMADA_MASTER`, `ARMADA_QUEUE`, `SPARK_DRIVER_HOST`, etc.) as other scripts
```
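
For reference, the Jupyter workflow above reads its settings from `scripts/config.sh`. The contents of that script are not part of this diff, so the following is only an illustrative sketch of the variables the README refers to:

```bash
# Illustrative scripts/config.sh entries (not from this commit) -- adjust to your cluster.
export ARMADA_MASTER="local://armada://host.docker.internal:30002"  # Armada master URL
export ARMADA_QUEUE="default"                                        # Armada queue to submit to
export SPARK_DRIVER_HOST="host.docker.internal"                      # required: address executors use to reach the driver
export JUPYTER_PORT=8888                                             # optional: port published for the notebook server
export INCLUDE_PYTHON=true                                           # build-time flag: include Python/Jupyter in the image
```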

docker/Dockerfile

Lines changed: 39 additions & 0 deletions
```diff
@@ -20,10 +20,13 @@ ARG spark_base_image_tag=3.3.3-scala2.12-java11-ubuntu
 FROM ${spark_base_image_prefix}:${spark_base_image_tag}
 
 ARG scala_binary_version=2.13
+ARG spark_version=3.3.3
+ARG include_python=false
 
 COPY target/armada-cluster-manager_${scala_binary_version}-*-all.jar /opt/spark/jars/
 COPY extraFiles /opt/spark/extraFiles
 COPY extraJars/* /opt/spark/jars
+COPY docker/jupyter-entrypoint.sh /opt/spark/bin/jupyter-entrypoint.sh
 
 
 USER 0
@@ -34,5 +37,41 @@ RUN mkdir -p /opt/spark/coreJars && \
 
 ENV SPARK_DIST_CLASSPATH=/opt/spark/coreJars/*
 
+# Install Jupyter, PySpark, and Python dependencies (only if include_python is true)
+RUN if [ "$include_python" = "true" ]; then \
+        apt-get update && \
+        apt-get install -y python3-pip && \
+        pip3 install --no-cache-dir \
+            jupyter \
+            notebook \
+            ipykernel \
+            pyspark==${spark_version} && \
+        apt-get clean && \
+        rm -rf /var/lib/apt/lists/*; \
+    fi
+
+
+RUN if [ "$include_python" = "true" ]; then \
+        mkdir -p /home/spark/workspace && \
+        mkdir -p /home/spark/.local/share/jupyter && \
+        mkdir -p /home/spark/.jupyter && \
+        chown -R 185:185 /home/spark/workspace && \
+        chown -R 185:185 /home/spark/.local && \
+        chown -R 185:185 /home/spark/.jupyter; \
+    fi && \
+    chmod +x /opt/spark/bin/jupyter-entrypoint.sh
+
 ARG spark_uid=185
 USER ${spark_uid}
+
+# Install ipykernel (only if include_python is true)
+RUN if [ "$include_python" = "true" ]; then \
+        HOME=/home/spark python3 -m ipykernel install --user --name python3 --display-name "Python 3"; \
+    fi
+
+ENV HOME=/home/spark
+ENV SPARK_HOME=/opt/spark
+ENV PYSPARK_PYTHON=python3
+ENV PYSPARK_DRIVER_PYTHON=python3
+ENV PYTHONPATH=${SPARK_HOME}/python:${SPARK_HOME}/python/lib/py4j-*src.zip
+ENV JUPYTER_RUNTIME_DIR=/home/spark/.local/share/jupyter/runtime
```
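
The image is normally built through `scripts/createImage.sh` (see the diff further below), which forwards these build arguments. As a rough, hand-rolled equivalent run from the repository root, assuming the cluster-manager JAR has already been built into `target/` (the versions and tag below are placeholders):

```bash
# Hypothetical manual build; scripts/createImage.sh is the supported path.
docker build \
  --build-arg scala_binary_version=2.13 \
  --build-arg spark_version=3.3.3 \
  --build-arg include_python=true \
  -f docker/Dockerfile \
  -t spark:armada \
  .
```

With `include_python=false` (the default) the pip, workspace, and ipykernel layers are skipped entirely, so the image stays close to its previous size.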

docker/jupyter-entrypoint.sh

Lines changed: 11 additions & 0 deletions
```diff
@@ -0,0 +1,11 @@
+#!/bin/bash
+
+cd /home/spark/workspace
+
+exec jupyter notebook \
+    --ip=0.0.0.0 \
+    --port=8888 \
+    --no-browser \
+    --NotebookApp.token='' \
+    --NotebookApp.password='' \
+    --NotebookApp.notebook_dir=/home/spark/workspace
```
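
The entrypoint binds on all interfaces with the token and password disabled, so the notebook server is only suitable for local, trusted use. `scripts/runJupyter.sh` (not shown in this diff) is the intended way to start it; a rough manual equivalent, where every flag is an assumption based on the README and Dockerfile rather than the actual script, might look like:

```bash
# Assumed manual invocation -- the real runJupyter.sh may differ.
docker run --rm \
  -p "${JUPYTER_PORT:-8888}:8888" \
  -v "$PWD/example/jupyter/notebooks:/home/spark/workspace/notebooks" \
  -e SPARK_DRIVER_HOST="$SPARK_DRIVER_HOST" \
  --entrypoint /opt/spark/bin/jupyter-entrypoint.sh \
  spark:armada
```
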
Lines changed: 236 additions & 0 deletions
```diff
@@ -0,0 +1,236 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "introduction",
+   "metadata": {},
+   "source": [
+    "# Armada Spark Example\n",
+    "\n",
+    "This notebook demonstrates how to run Spark jobs on Armada using PySpark in client mode."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "imports",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "import glob\n",
+    "import subprocess\n",
+    "import random\n",
+    "from pyspark.sql import SparkSession\n",
+    "from pyspark import SparkConf"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "setup-section",
+   "metadata": {},
+   "source": [
+    "## Setup\n",
+    "\n",
+    "Clean up any existing Spark context and configure the environment."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "stop-existing-context",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "try:\n",
+    "    from pyspark import SparkContext\n",
+    "    if SparkContext._active_spark_context:\n",
+    "        SparkContext._active_spark_context.stop()\n",
+    "except:\n",
+    "    pass"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "config-section",
+   "metadata": {},
+   "source": [
+    "## Configuration\n",
+    "\n",
+    "Set up connection parameters and locate the Armada Spark JAR file."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "configuration",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Configuration\n",
+    "auth_token = os.environ.get('ARMADA_AUTH_TOKEN')\n",
+    "auth_script_path = os.environ.get('ARMADA_AUTH_SCRIPT_PATH')\n",
+    "driver_host = os.environ.get('SPARK_DRIVER_HOST')\n",
+    "driver_port = os.environ.get('SPARK_DRIVER_PORT', '7078')\n",
+    "block_manager_port = os.environ.get('SPARK_BLOCK_MANAGER_PORT', '10061')\n",
+    "armada_master = os.environ.get('ARMADA_MASTER', 'local://armada://host.docker.internal:30002')\n",
+    "armada_queue = os.environ.get('ARMADA_QUEUE', 'default')\n",
+    "armada_namespace = os.environ.get('ARMADA_NAMESPACE', 'default')\n",
+    "image_name = os.environ.get('IMAGE_NAME', 'spark:armada')\n",
+    "event_watcher_use_tls = os.environ.get('ARMADA_EVENT_WATCHER_USE_TLS', 'false')\n",
+    "\n",
+    "# Find JAR - try common Scala versions (2.12, 2.13)\n",
+    "jar_paths = glob.glob('/opt/spark/jars/armada-cluster-manager_2.1*-*-all.jar')\n",
+    "if not jar_paths:\n",
+    "    raise FileNotFoundError(\"Armada Spark JAR not found!\")\n",
+    "armada_jar = jar_paths[0]\n",
+    "\n",
+    "# Generate app ID, required for client mode\n",
+    "app_id = f\"jupyter-spark-{subprocess.check_output(['openssl', 'rand', '-hex', '3']).decode().strip()}\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "spark-config-section",
+   "metadata": {},
+   "source": [
+    "## Spark Configuration\n",
+    "\n",
+    "Configure Spark to use Armada as the cluster manager in client mode."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "spark-config",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Spark Configuration\n",
+    "conf = SparkConf()\n",
+    "if auth_token:\n",
+    "    conf.set(\"spark.armada.auth.token\", auth_token)\n",
+    "if auth_script_path:\n",
+    "    conf.set(\"spark.armada.auth.script.path\", auth_script_path)\n",
+    "if not driver_host:\n",
+    "    raise ValueError(\n",
+    "        \"SPARK_DRIVER_HOST environment variable is required. \"\n",
+    "    )\n",
+    "conf.set(\"spark.master\", armada_master)\n",
+    "conf.set(\"spark.submit.deployMode\", \"client\")\n",
+    "conf.set(\"spark.app.id\", app_id)\n",
+    "conf.set(\"spark.app.name\", \"jupyter-spark-pi\")\n",
+    "conf.set(\"spark.driver.bindAddress\", \"0.0.0.0\")\n",
+    "conf.set(\"spark.driver.host\", driver_host)\n",
+    "conf.set(\"spark.driver.port\", driver_port)\n",
+    "conf.set(\"spark.driver.blockManager.port\", block_manager_port)\n",
+    "conf.set(\"spark.home\", \"/opt/spark\")\n",
+    "conf.set(\"spark.armada.container.image\", image_name)\n",
+    "conf.set(\"spark.armada.queue\", armada_queue)\n",
+    "conf.set(\"spark.armada.scheduling.namespace\", armada_namespace)\n",
+    "conf.set(\"spark.armada.eventWatcher.useTls\", event_watcher_use_tls)\n",
+    "conf.set(\"spark.kubernetes.file.upload.path\", \"/tmp\")\n",
+    "conf.set(\"spark.kubernetes.executor.disableConfigMap\", \"true\")\n",
+    "conf.set(\"spark.local.dir\", \"/tmp\")\n",
+    "conf.set(\"spark.jars\", armada_jar)\n",
+    "\n",
+    "# Network timeouts\n",
+    "conf.set(\"spark.network.timeout\", \"800s\")\n",
+    "conf.set(\"spark.executor.heartbeatInterval\", \"60s\")\n",
+    "\n",
+    "# Static mode - tune these values for your environment\n",
+    "conf.set(\"spark.executor.instances\", \"2\")\n",
+    "conf.set(\"spark.armada.driver.limit.memory\", \"1Gi\")\n",
+    "conf.set(\"spark.armada.driver.request.memory\", \"1Gi\")\n",
+    "conf.set(\"spark.armada.executor.limit.memory\", \"1Gi\")\n",
+    "conf.set(\"spark.armada.executor.request.memory\", \"1Gi\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "create-spark-session",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Create SparkSession\n",
+    "spark = SparkSession.builder.config(conf=conf).getOrCreate()\n",
+    "print(f\"SparkSession created\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "examples-section",
+   "metadata": {},
+   "source": [
+    "## Examples\n",
+    "\n",
+    "Run Spark computations on the Armada cluster."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "spark-pi-calculation",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Spark Pi calculation\n",
+    "print(f\"Running Spark Pi calculation...\")\n",
+    "n = 10000\n",
+    "\n",
+    "def inside(p):\n",
+    "    x, y = random.random(), random.random()\n",
+    "    return x*x + y*y < 1\n",
+    "\n",
+    "count = spark.sparkContext.parallelize(range(0, n)).filter(inside).count()\n",
+    "pi = 4.0 * count / n\n",
+    "print(f\" Pi is approximately: {pi}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cleanup-section",
+   "metadata": {},
+   "source": [
+    "## Cleanup\n",
+    "\n",
+    "Stop the Spark context to release resources. This will stop the executors in Armada."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "stop-spark-context",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Stop Spark context\n",
+    "print(\"Stopping Spark context...\")\n",
+    "spark.stop()\n",
+    "print(\"Spark context stopped successfully\")"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.12"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
```

scripts/createImage.sh

Lines changed: 1 addition & 0 deletions
```diff
@@ -59,6 +59,7 @@ docker build \
   --build-arg spark_base_image_prefix=$image_prefix \
   --build-arg spark_base_image_tag=$image_tag \
   --build-arg scala_binary_version=$SCALA_BIN_VERSION \
+  --build-arg spark_version=$SPARK_VERSION \
   --build-arg include_python=$INCLUDE_PYTHON \
   -f "$root/docker/Dockerfile" \
   "$root"
```
