This repository was archived by the owner on Jan 9, 2020. It is now read-only.

Commit 8c08189

foxishash211 authored and committed
Docs improvements (#176)

* Adding official alpha docker image to docs
* Reorder sections and create a specific one for "advanced"
* Provide limitations and instructions about running on GKE
* Fix title of advanced section: submission
* Improved section on running in the cloud
* Update versioning
* Address comments
* Address comments

(cherry picked from commit e5da90d)
1 parent b139b46 commit 8c08189

File tree

3 files changed, +78 -30 lines changed


docs/running-on-kubernetes-cloud.md

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
+---
+layout: global
+title: Running Spark in the cloud with Kubernetes
+---
+
+For general information about running Spark on Kubernetes, refer to [running Spark on Kubernetes](running-on-kubernetes.md).
+
+A Kubernetes cluster may be brought up on different cloud providers or on premise. It is commonly provisioned through [Google Container Engine](https://cloud.google.com/container-engine/), using [kops](https://github.com/kubernetes/kops) on AWS, or on premise using [kubeadm](https://kubernetes.io/docs/getting-started-guides/kubeadm/).
+
+## Running on Google Container Engine (GKE)
+
+* Create a GKE [container cluster](https://cloud.google.com/container-engine/docs/clusters/operations).
+* Obtain kubectl and [configure](https://cloud.google.com/container-engine/docs/clusters/operations#configuring_kubectl) it appropriately.
+* Find the identity of the master associated with this project:
+
+      > kubectl cluster-info
+      Kubernetes master is running at https://<master-ip>:443
+
+* Run spark-submit with the master option set to `k8s://https://<master-ip>:443`. The instructions for running spark-submit are provided in the [running on kubernetes](running-on-kubernetes.md) tutorial.
+* Check that your driver pod, and subsequently your executor pods, are launched using `kubectl get pods`.
+* Read the stdout and stderr of the driver pod using `kubectl logs <name-of-driver-pod>`, or stream the logs using `kubectl logs -f <name-of-driver-pod>`.
+
+Known issues:
+
+* If you see OAuth token expiry errors when running spark-submit, the token likely needs to be refreshed. The easiest fix is to run any `kubectl` command, say `kubectl version`, and then retry your submission.
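The submission step above amounts to prefixing the apiserver address reported by `kubectl cluster-info` with `k8s://`. A minimal sketch of that step, assuming a hypothetical helper name (this is not part of Spark):

```python
def k8s_master_arg(api_server_url: str) -> str:
    """Build the spark-submit --master value from the apiserver URL
    reported by `kubectl cluster-info` (illustrative helper only)."""
    return "k8s://" + api_server_url

# For the cluster-info output shown above (address is a placeholder):
print(k8s_master_arg("https://203.0.113.10:443"))  # → k8s://https://203.0.113.10:443
```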

docs/running-on-kubernetes.md

Lines changed: 46 additions & 30 deletions
@@ -12,15 +12,28 @@ currently limited and not well-tested. This should not be used in production env
 * You must have appropriate permissions to create and list [pods](https://kubernetes.io/docs/user-guide/pods/), [nodes](https://kubernetes.io/docs/admin/node/) and [services](https://kubernetes.io/docs/user-guide/services/) in your cluster. You can verify that you can list these resources by running `kubectl get nodes`, `kubectl get pods` and `kubectl get svc`, which should give you a list of nodes, pods and services (if any) respectively.
 * You must have an extracted spark distribution with Kubernetes support, or build one from [source](https://github.com/apache-spark-on-k8s/spark).
 
-## Setting Up Docker Images
+## Driver & Executor Images
 
 Kubernetes requires users to supply images that can be deployed into containers within pods. The images are built to
 be run in a container runtime environment that Kubernetes supports. Docker is a container runtime environment that is
 frequently used with Kubernetes, so Spark provides some support for working with Docker to get started quickly.
 
-To use Spark on Kubernetes with Docker, images for the driver and the executors need to be built and published to an
-accessible Docker registry. Spark distributions include the Docker files for the driver and the executor at
-`dockerfiles/driver/Dockerfile` and `docker/executor/Dockerfile`, respectively. Use these Docker files to build the
+If you wish to use pre-built Docker images, you may use the images published in [kubespark](https://hub.docker.com/u/kubespark/). The images are as follows:
+
+<table class="table">
+<tr><th>Component</th><th>Image</th></tr>
+<tr>
+  <td>Spark Driver Image</td>
+  <td><code>kubespark/spark-driver:v2.1.0-k8s-support-0.1.0-alpha.1</code></td>
+</tr>
+<tr>
+  <td>Spark Executor Image</td>
+  <td><code>kubespark/spark-executor:v2.1.0-k8s-support-0.1.0-alpha.1</code></td>
+</tr>
+</table>
+
+You may also build these Docker images from source, or customize them as required. Spark distributions include the Docker files for the driver and the executor at
+`dockerfiles/driver/Dockerfile` and `dockerfiles/executor/Dockerfile`, respectively. Use these Docker files to build the
 Docker images, and then tag them with the registry that the images should be sent to. Finally, push the images to the
 registry.

@@ -44,8 +57,8 @@ are set up as described above:
   --kubernetes-namespace default \
   --conf spark.executor.instances=5 \
   --conf spark.app.name=spark-pi \
-  --conf spark.kubernetes.driver.docker.image=registry-host:5000/spark-driver:latest \
-  --conf spark.kubernetes.executor.docker.image=registry-host:5000/spark-executor:latest \
+  --conf spark.kubernetes.driver.docker.image=kubespark/spark-driver:v2.1.0-k8s-support-0.1.0-alpha.1 \
+  --conf spark.kubernetes.executor.docker.image=kubespark/spark-executor:v2.1.0-k8s-support-0.1.0-alpha.1 \
   examples/jars/spark_examples_2.11-2.2.0.jar
 
 The Spark master, specified either via passing the `--master` command line argument to `spark-submit` or by setting
@@ -55,7 +68,6 @@ being contacted at `api_server_url`. If no HTTP protocol is specified in the URL
 setting the master to `k8s://example.com:443` is equivalent to setting it to `k8s://https://example.com:443`, but to
 connect without SSL on a different port, the master would be set to `k8s://http://example.com:8443`.
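The protocol-defaulting rule just described can be sketched in a few lines. This is an illustrative reimplementation of the documented behavior, not Spark's actual code:

```python
def resolve_k8s_master(master: str) -> str:
    """Strip the k8s:// prefix and default to https:// when no
    HTTP protocol is given, per the rule described above."""
    prefix = "k8s://"
    if not master.startswith(prefix):
        raise ValueError("expected a master URL of the form k8s://...")
    url = master[len(prefix):]
    if not url.startswith(("http://", "https://")):
        url = "https://" + url
    return url

print(resolve_k8s_master("k8s://example.com:443"))          # → https://example.com:443
print(resolve_k8s_master("k8s://http://example.com:8443"))  # → http://example.com:8443
```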
 
-
 If you have a Kubernetes cluster set up, one way to discover the apiserver URL is by executing `kubectl cluster-info`.
 
     > kubectl cluster-info
@@ -67,33 +79,17 @@ In the above example, the specific Kubernetes cluster can be used with spark sub
 Note that applications can currently only be executed in cluster mode, where the driver and its executors are running on
 the cluster.
 
-### Dependency Management and Docker Containers
+### Specifying input files
 
 Spark supports specifying JAR paths that are either on the submitting host's disk, or are located on the disk of the
 driver and executors. Refer to the [application submission](submitting-applications.html#advanced-dependency-management)
 section for details. Note that files specified with the `local://` scheme should be added to the container image of both
 the driver and the executors. Files without a scheme or with the scheme `file://` are treated as being on the disk of
 the submitting machine, and are uploaded to the driver running in Kubernetes before launching the application.
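The scheme rules in the paragraph above can be summarized in a short sketch. The function name and the "remote" fallback category are illustrative assumptions, not Spark API:

```python
from urllib.parse import urlparse

def dependency_location(uri: str) -> str:
    """Illustrative sketch (not Spark's actual code) of the scheme
    rules above: where a submitted dependency is expected to live."""
    scheme = urlparse(uri).scheme
    if scheme == "local":
        return "in-container"        # must already be in both container images
    if scheme in ("", "file"):
        return "submitting-machine"  # uploaded to the driver before launch
    return "remote"                  # assumed: any other scheme, e.g. hdfs://

print(dependency_location("local:///opt/spark/examples.jar"))  # → in-container
print(dependency_location("/home/user/app.jar"))               # → submitting-machine
```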
-
-### Setting Up SSL For Submitting the Driver
 
-When submitting to Kubernetes, a pod is started for the driver, and the pod starts an HTTP server. This HTTP server
-receives the driver's configuration, including uploaded driver jars, from the client before starting the application.
-Spark supports using SSL to encrypt the traffic in this bootstrapping process. It is recommended to configure this
-whenever possible.
+### Accessing Kubernetes Clusters
 
-See the [security page](security.html) and [configuration](configuration.html) sections for more information on
-configuring SSL; use the prefix `spark.ssl.kubernetes.submit` in configuring the SSL-related fields in the context
-of submitting to Kubernetes. For example, to set the trustStore used when the local machine communicates with the driver
-pod in starting the application, set `spark.ssl.kubernetes.submit.trustStore`.
-
-One note about the keyStore is that it can be specified as either a file on the client machine or a file in the
-container image's disk. Thus `spark.ssl.kubernetes.submit.keyStore` can be a URI with a scheme of either `file:`
-or `local:`. A scheme of `file:` corresponds to the keyStore being located on the client machine; it is mounted onto
-the driver container as a [secret volume](https://kubernetes.io/docs/user-guide/secrets/). When the URI has the scheme
-`local:`, the file is assumed to already be on the container's disk at the appropriate path.
-
-### Kubernetes Clusters and the authenticated proxy endpoint
+For details about running on public cloud environments, such as Google Container Engine (GKE), refer to [running Spark in the cloud with Kubernetes](running-on-kubernetes-cloud.md).
 
 Spark-submit also supports submission through the
 [local kubectl proxy](https://kubernetes.io/docs/user-guide/accessing-the-cluster/#using-kubectl-proxy). One can use the
@@ -112,16 +108,36 @@ If our local proxy were listening on port 8001, we would have our submission loo
   --kubernetes-namespace default \
   --conf spark.executor.instances=5 \
   --conf spark.app.name=spark-pi \
-  --conf spark.kubernetes.driver.docker.image=registry-host:5000/spark-driver:latest \
-  --conf spark.kubernetes.executor.docker.image=registry-host:5000/spark-executor:latest \
+  --conf spark.kubernetes.driver.docker.image=kubespark/spark-driver:v2.1.0-k8s-support-0.1.0-alpha.1 \
+  --conf spark.kubernetes.executor.docker.image=kubespark/spark-executor:v2.1.0-k8s-support-0.1.0-alpha.1 \
   examples/jars/spark_examples_2.11-2.2.0.jar
 
 Communication between Spark and Kubernetes clusters is performed using the fabric8 kubernetes-client library.
 The above mechanism using `kubectl proxy` can be used when we have authentication providers that the fabric8
-kubernetes-client library does not support. Authentication using X509 Client Certs and oauth tokens
+kubernetes-client library does not support. Authentication using X509 Client Certs and OAuth tokens
 is currently supported.
 
-### Determining the Driver Base URI
+## Advanced
+
+### Setting Up SSL For Submitting the Driver
+
+When submitting to Kubernetes, a pod is started for the driver, and the pod starts an HTTP server. This HTTP server
+receives the driver's configuration, including uploaded driver jars, from the client before starting the application.
+Spark supports using SSL to encrypt the traffic in this bootstrapping process. It is recommended to configure this
+whenever possible.
+
+See the [security page](security.html) and [configuration](configuration.html) sections for more information on
+configuring SSL; use the prefix `spark.ssl.kubernetes.submit` in configuring the SSL-related fields in the context
+of submitting to Kubernetes. For example, to set the trustStore used when the local machine communicates with the driver
+pod in starting the application, set `spark.ssl.kubernetes.submit.trustStore`.
+
+One note about the keyStore is that it can be specified as either a file on the client machine or a file in the
+container image's disk. Thus `spark.ssl.kubernetes.submit.keyStore` can be a URI with a scheme of either `file:`
+or `local:`. A scheme of `file:` corresponds to the keyStore being located on the client machine; it is mounted onto
+the driver container as a [secret volume](https://kubernetes.io/docs/user-guide/secrets/). When the URI has the scheme
+`local:`, the file is assumed to already be on the container's disk at the appropriate path.
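For example, the two properties named above might be set as follows. This is an illustrative fragment, not from the commit: the paths are placeholders, and only the use of a `local:` scheme for the keyStore follows the rule just described.

```
spark.ssl.kubernetes.submit.trustStore   /path/to/truststore.jks
spark.ssl.kubernetes.submit.keyStore     local:///opt/spark/keystore.jks
```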
+
+### Submission of Local Files through Ingress/External controller
 
 Kubernetes pods run with their own IP address space. If Spark is run in cluster mode, the driver pod may not be
 accessible to the submitter. However, the submitter needs to send local dependencies from its local disk to the driver

resource-managers/kubernetes/README.md

Lines changed: 8 additions & 0 deletions
@@ -53,6 +53,14 @@ Afterwards, the integration tests can be executed with Maven or your IDE. Note t
 `pre-integration-test` phase must be run every time the Spark main code changes. When running tests from the
 command line, the `pre-integration-test` phase should automatically be invoked if the `integration-test` phase is run.
 
+After the above step, the integration test can be run using the following command:
+
+```sh
+build/mvn integration-test \
+  -Pkubernetes -Pkubernetes-integration-tests \
+  -pl resource-managers/kubernetes/integration-tests -am
+```
+
 # Preserve the Minikube VM
 
 The integration tests make use of [Minikube](https://github.com/kubernetes/minikube), which fires up a virtual machine
