
Commit 38d4a25

Add descriptions (#463)
* Add descriptions
* Fix spelling
* fix whitespace
* fix whitespace
1 parent 9cd61dd commit 38d4a25

9 files changed (+66, -45 lines)

Lines changed: 23 additions & 17 deletions
@@ -1,14 +1,16 @@
 = First steps
+:description: Create and run your first Spark job with the Stackable Operator. Includes steps for job setup, verification, and inspecting driver logs.

-Once you have followed the steps in the xref:getting_started/installation.adoc[] section to install the operator and its dependencies, you will now create a Spark job. Afterwards you can <<_verify_that_it_works, verify that it works>> by looking at the logs from the driver pod.
+Once you have followed the steps in the xref:getting_started/installation.adoc[] section to install the operator and its dependencies, you can now create a Spark job.
+Afterwards you can <<_verify_that_it_works, verify that it works>> by looking at the logs from the driver pod.

 == Starting a Spark job

 A Spark application is made up of three components:

-- Job: this will build a `spark-submit` command from the resource, passing this to internal spark code together with templates for building the driver and executor pods
-- Driver: the driver starts the designated number of executors and removes them when the job is completed.
-- Executor(s): responsible for executing the job itself
+* Job: this builds a `spark-submit` command from the resource, passing it to internal Spark code together with templates for building the driver and executor pods
+* Driver: the driver starts the designated number of executors and removes them when the job is completed.
+* Executor(s): responsible for executing the job itself

 Create a `SparkApplication`:

@@ -19,34 +21,38 @@ include::example$getting_started/getting_started.sh[tag=install-sparkapp]

 Where:

-- `metadata.name` contains the name of the SparkApplication
-- `spec.version`: SparkApplication version (1.0). This can be freely set by the users and is added by the operator as label to all workload resources created by the application.
-- `spec.sparkImage`: the image used by the job, driver and executor pods. This can be a custom image built by the user or an official Stackable image. Available official images are listed in the Stackable https://repo.stackable.tech/#browse/browse:docker:v2%2Fstackable%spark-k8s%2Ftags[image registry].
-- `spec.mode`: only `cluster` is currently supported
-- `spec.mainApplicationFile`: the artifact (Java, Scala or Python) that forms the basis of the Spark job. This path is relative to the image, so in this case we are running an example python script (that calculates the value of pi): it is bundled with the Spark code and therefore already present in the job image
-- `spec.driver`: driver-specific settings.
-- `spec.executor`: executor-specific settings.
+* `metadata.name` contains the name of the SparkApplication
+* `spec.version`: SparkApplication version (1.0). This can be freely set by the user and is added by the operator as a label to all workload resources created by the application.
+* `spec.sparkImage`: the image used by the job, driver and executor pods. This can be a custom image built by the user or an official Stackable image. Available official images are listed in the Stackable https://repo.stackable.tech/#browse/browse:docker:v2%2Fstackable%spark-k8s%2Ftags[image registry].
+* `spec.mode`: only `cluster` is currently supported
+* `spec.mainApplicationFile`: the artifact (Java, Scala or Python) that forms the basis of the Spark job. This path is relative to the image, so in this case we are running an example Python script (that calculates the value of pi); it is bundled with the Spark code and therefore already present in the job image
+* `spec.driver`: driver-specific settings.
+* `spec.executor`: executor-specific settings.
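Putting these fields together, a minimal SparkApplication might look roughly like the sketch below. This is only an illustrative sketch, not the manifest installed by the tagged script above: the `apiVersion`, the image tag and the application file path are assumptions and may differ from the committed example.

[source,yaml]
----
# Illustrative sketch only: apiVersion, image tag and application file path are
# assumed; the tagged example script contains the authoritative manifest.
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: pyspark-pi
spec:
  version: "1.0"
  sparkImage: docker.stackable.tech/stackable/spark-k8s:3.5.1-stackable0.0.0-dev  # official or custom image (tag assumed)
  mode: cluster
  mainApplicationFile: local:///stackable/spark/examples/src/main/python/pi.py  # bundled pi example (path assumed)
  driver: {}       # driver-specific settings go here
  executor:
    instances: 3   # three executors, matching the pods shown in the next section
----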

 == Verify that it works

-As mentioned above, the `SparkApplication` that has just been created will build a `spark-submit` command and pass it to the driver pod, which in turn will create executor pods that run for the duration of the job before being clean up. A running process will look like this:
+As mentioned above, the SparkApplication that has just been created will build a `spark-submit` command and pass it to the driver Pod, which in turn will create executor Pods that run for the duration of the job before being cleaned up.
+A running process will look like this:

 image::getting_started/spark_running.png[Spark job]

-- `pyspark-pi-xxxx`: this is the initialising job that creates the spark-submit command (named as `metadata.name` with a unique suffix)
-- `pyspark-pi-xxxxxxx-driver`: the driver pod that drives the execution
-- `pythonpi-xxxxxxxxx-exec-x`: the set of executors started by the driver (in our example `spec.executor.instances` was set to 3 which is why we have 3 executors)
+* `pyspark-pi-xxxx`: this is the initializing job that creates the spark-submit command (named as `metadata.name` with a unique suffix)
+* `pyspark-pi-xxxxxxx-driver`: the driver pod that drives the execution
+* `pythonpi-xxxxxxxxx-exec-x`: the set of executors started by the driver (in our example `spec.executor.instances` was set to 3, which is why we have 3 executors)

 Job progress can be followed by issuing this command:

 ----
 include::example$getting_started/getting_started.sh[tag=wait-for-job]
 ----

-When the job completes the driver cleans up the executor. The initial job is persisted for several minutes before being removed. The completed state will look like this:
+When the job completes, the driver cleans up the executors.
+The initial job is persisted for several minutes before being removed.
+The completed state will look like this:

 image::getting_started/spark_complete.png[Completed job]

-The driver logs can be inspected for more information about the results of the job. In this case we expect to find the results of our (approximate!) pi calculation:
+The driver logs can be inspected for more information about the results of the job.
+In this case we expect to find the results of our (approximate!) pi calculation:

 image::getting_started/spark_log.png[Driver log]

docs/modules/spark-k8s/pages/getting_started/index.adoc

Lines changed: 2 additions & 1 deletion
@@ -1,6 +1,7 @@
 = Getting started

-This guide will get you started with Spark using the Stackable Operator for Apache Spark. It will guide you through the installation of the Operator and its dependencies, executing your first Spark job and reviewing its result.
+This guide will get you started with Spark using the Stackable Operator for Apache Spark.
+It will guide you through the installation of the Operator and its dependencies, executing your first Spark job and reviewing its result.

 == Prerequisites

docs/modules/spark-k8s/pages/getting_started/installation.adoc

Lines changed: 15 additions & 14 deletions
@@ -1,20 +1,20 @@
 = Installation
+:description: Learn how to set up Spark with the Stackable Operator, from installation to running your first job, including prerequisites and resource recommendations.

 On this page you will install the Stackable Spark-on-Kubernetes operator as well as the commons, secret and listener operators
 which are required by all Stackable operators.

 == Dependencies

-Spark applications almost always require dependencies like database drivers, REST api clients and many others. These
-dependencies must be available on the `classpath` of each executor (and in some cases of the driver, too). There are
-multiple ways to provision Spark jobs with such dependencies: some are built into Spark itself while others are
-implemented at the operator level. In this guide we are going to keep things simple and look at executing a Spark job
-that has a minimum of dependencies.
+Spark applications almost always require dependencies like database drivers, REST API clients and many others.
+These dependencies must be available on the `classpath` of each executor (and in some cases of the driver, too).
+There are multiple ways to provision Spark jobs with such dependencies: some are built into Spark itself while others are implemented at the operator level.
+In this guide we are going to keep things simple and look at executing a Spark job that has a minimum of dependencies.

 More information about the different ways to define Spark jobs and their dependencies is given on the following pages:

-- xref:usage-guide/index.adoc[]
-- xref:job_dependencies.adoc[]
+* xref:usage-guide/index.adoc[]
+* xref:job_dependencies.adoc[]

 == Stackable Operators

@@ -25,8 +25,8 @@ There are 2 ways to install Stackable operators

 === stackablectl

-`stackablectl` is the command line tool to interact with Stackable operators and our recommended way to install
-Operators. Follow the xref:management:stackablectl:installation.adoc[installation steps] for your platform.
+`stackablectl` is the command line tool to interact with Stackable operators and our recommended way to install Operators.
+Follow the xref:management:stackablectl:installation.adoc[installation steps] for your platform.

 After you have installed `stackablectl`, run the following command to install the Spark-k8s operator:

@@ -42,12 +42,13 @@ The tool will show
 include::example$getting_started/install_output.txt[]
 ----

-TIP: Consult the xref:management:stackablectl:quickstart.adoc[] to learn more about how to use stackablectl. For
-example, you can use the `--cluster kind` flag to create a Kubernetes cluster with link:https://kind.sigs.k8s.io/[kind].
+TIP: Consult the xref:management:stackablectl:quickstart.adoc[] to learn more about how to use stackablectl.
+For example, you can use the `--cluster kind` flag to create a Kubernetes cluster with link:https://kind.sigs.k8s.io/[kind].

 === Helm

-You can also use Helm to install the operator. Add the Stackable Helm repository:
+You can also use Helm to install the operator.
+Add the Stackable Helm repository:
 [source,bash]
 ----
 include::example$getting_started/getting_started.sh[tag=helm-add-repo]
@@ -59,8 +60,8 @@ Then install the Stackable Operators:
 include::example$getting_started/getting_started.sh[tag=helm-install-operators]
 ----

-Helm will deploy the operators in a Kubernetes Deployment and apply the CRDs for the `SparkApplication` (as well as the
-CRDs for the required operators). You are now ready to create a Spark job.
+Helm will deploy the operators in a Kubernetes Deployment and apply the CRDs for the SparkApplication (as well as the CRDs for the required operators).
+You are now ready to create a Spark job.

 == What's next

docs/modules/spark-k8s/pages/index.adoc

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 = Stackable Operator for Apache Spark
-:description: The Stackable operator for Apache Spark is a Kubernetes operator that can manage Apache Spark clusters. Learn about its features, resources, dependencies and demos, and see the list of supported Spark versions.
+:description: Manage Apache Spark clusters on Kubernetes with Stackable Operator, featuring SparkApplication CRDs, history server, S3 integration, and demos for big data tasks.
 :keywords: Stackable operator, Apache Spark, Kubernetes, operator, data science, engineer, big data, CRD, StatefulSet, ConfigMap, Service, S3, demo, version
 :spark: https://spark.apache.org/
 :github: https://github.com/stackabletech/spark-k8s-operator/

docs/modules/spark-k8s/pages/usage-guide/examples.adoc

Lines changed: 1 addition & 0 deletions
@@ -1,4 +1,5 @@
 = Examples
+:description: Explore Spark job examples with various setups for PySpark and Scala, including external datasets, PVC mounts, and S3 access configurations.

 The following examples have the following `spec` fields in common:

docs/modules/spark-k8s/pages/usage-guide/history-server.adoc

Lines changed: 16 additions & 6 deletions
@@ -1,13 +1,18 @@
 = Spark History Server
+:description: Set up Spark History Server on Kubernetes to access Spark logs via S3, with configuration for cleanups and web UI access details.
 :page-aliases: history_server.adoc

 == Overview

-The Stackable Spark-on-Kubernetes operator runs Apache Spark workloads in a Kubernetes cluster, whereby driver- and executor-pods are created for the duration of the job and then terminated. One or more Spark History Server instances can be deployed independently of `SparkApplication` jobs and used as an end-point for spark logging, so that job information can be viewed once the job pods are no longer available.
+The Stackable Spark-on-Kubernetes operator runs Apache Spark workloads in a Kubernetes cluster, whereby driver and executor pods are created for the duration of the job and then terminated.
+One or more Spark History Server instances can be deployed independently of SparkApplication jobs and used as an endpoint for Spark logging, so that job information can be viewed once the job pods are no longer available.

 == Deployment

-The example below demonstrates how to set up the history server running in one Pod with scheduled cleanups of the event logs. The event logs are loaded from an S3 bucket named `spark-logs` and the folder `eventlogs/`. The credentials for this bucket are provided by the secret class `s3-credentials-class`. For more details on how the Stackable Data Platform manages S3 resources see the xref:concepts:s3.adoc[S3 resources] page.
+The example below demonstrates how to set up the history server running in one Pod with scheduled cleanups of the event logs.
+The event logs are loaded from an S3 bucket named `spark-logs` and the folder `eventlogs/`.
+The credentials for this bucket are provided by the secret class `s3-credentials-class`.
+For more details on how the Stackable Data Platform manages S3 resources, see the xref:concepts:s3.adoc[S3 resources] page.

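The full manifest is pulled in from `example-history-app.yaml` by the block that follows. As a rough sketch of the values named above (bucket `spark-logs`, prefix `eventlogs/`, secret class `s3-credentials-class`), such a resource might look something like this; the `apiVersion`, field layout and product version are assumptions, not the committed example:

[source,yaml]
----
# Rough sketch only: apiVersion, field layout and versions are assumed; see the
# committed example-history-app.yaml for the authoritative manifest.
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkHistoryServer
metadata:
  name: spark-history
spec:
  image:
    productVersion: 3.5.1        # assumed version
  logFileDirectory:
    s3:
      prefix: eventlogs/         # folder named in the text above
      bucket:
        inline:
          bucketName: spark-logs # bucket named in the text above
          connection:
            inline:
              host: minio        # assumed S3 endpoint
              port: 9000
              credentials:
                secretClass: s3-credentials-class
  nodes:
    roleGroups:
      default:
        replicas: 1              # "running in one Pod"
----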
[source,yaml]
@@ -52,7 +57,8 @@ include::example$example-history-app.yaml[]

 == History Web UI

-To access the history server web UI, use one of the `NodePort` services created by the operator. For the example above, the operator created two services as shown:
+To access the history server web UI, use one of the `NodePort` services created by the operator.
+For the example above, the operator created two services as shown:

 [source,bash]
 ----
@@ -70,13 +76,17 @@ image::history-server-ui.png[History Server Console]

 For a role group of the Spark history server, you can specify `configOverrides` for the following files:

-- `security.properties`
+* `security.properties`

 === The security.properties file

-The `security.properties` file is used to configure JVM security properties. It is very seldom that users need to tweak any of these, but there is one use-case that stands out, and that users need to be aware of: the JVM DNS cache.
+The `security.properties` file is used to configure JVM security properties.
+It is very seldom that users need to tweak any of these, but there is one use case that stands out and that users need to be aware of: the JVM DNS cache.

-The JVM manages its own cache of successfully resolved host names as well as a cache of host names that cannot be resolved. Some products of the Stackable platform are very sensible to the contents of these caches and their performance is heavily affected by them. As of version 3.4.0, Apache Spark may perform poorly if the positive cache is disabled. To cache resolved host names, and thus speeding up queries you can configure the TTL of entries in the positive cache like this:
+The JVM manages its own cache of successfully resolved host names as well as a cache of host names that cannot be resolved.
+Some products of the Stackable platform are very sensitive to the contents of these caches and their performance is heavily affected by them.
+As of version 3.4.0, Apache Spark may perform poorly if the positive cache is disabled.
+To cache resolved host names, and thus speed up queries, you can configure the TTL of entries in the positive cache like this:

 [source,yaml]
 ----
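The include that belongs to this block is elided by the hunk. As a sketch, a positive-cache TTL override via `configOverrides` would typically take a shape like the one below; the role name and the 30-second TTL are assumptions, while `networkaddress.cache.ttl` and `networkaddress.cache.negative.ttl` are the standard JVM properties for the DNS caches:

[source,yaml]
----
# Sketch only: role name and values are assumed; the committed include shows
# the exact form used by the history server example.
nodes:
  configOverrides:
    security.properties:
      networkaddress.cache.ttl: "30"          # cache successful lookups for 30 seconds
      networkaddress.cache.negative.ttl: "0"  # do not cache failed lookups
----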

docs/modules/spark-k8s/pages/usage-guide/job-dependencies.adoc

Lines changed: 1 addition & 0 deletions
@@ -1,4 +1,5 @@
 = Job Dependencies
+:description: Learn how to provision dependencies for Spark jobs using custom images, volumes, Maven packages, or Python packages, and their trade-offs.
 :page-aliases: job_dependencies.adoc

 == Overview

docs/modules/spark-k8s/pages/usage-guide/s3.adoc

Lines changed: 5 additions & 4 deletions
@@ -1,10 +1,11 @@
 = S3 bucket specification
+:description: Learn how to configure S3 access in SparkApplications using inline credentials or external resources, including TLS for secure connections.

-You can specify S3 connection details directly inside the `SparkApplication` specification or by referring to an external `S3Bucket` custom resource.
+You can specify S3 connection details directly inside the SparkApplication specification or by referring to an external S3Bucket custom resource.

 == S3 access using credentials

-To specify S3 connection details directly as part of the `SparkApplication` resource you add an inline connection configuration as shown below.
+To specify S3 connection details directly as part of the SparkApplication resource, you add an inline connection configuration as shown below.

 [source,yaml]
 ----
@@ -21,7 +22,7 @@ s3connection: # <1>
 <3> Optional connection port.
 <4> Name of the `Secret` object expected to contain the following keys: `accessKey` and `secretKey`

-It is also possible to configure the connection details as a separate Kubernetes resource and only refer to that object from the `SparkApplication` like this:
+It is also possible to configure the connection details as a separate Kubernetes resource and only refer to that object from the SparkApplication like this:

 [source,yaml]
 ----
@@ -47,7 +48,7 @@ spec:
 secretClass: minio-credentials-class
 ----

-This has the advantage that one connection configuration can be shared across `SparkApplications` and reduces the cost of updating these details.
+This has the advantage that one connection configuration can be shared across SparkApplications and reduces the cost of updating these details.
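For reference, such a shared connection resource, referenced from the SparkApplication by name, might be sketched as follows; the `apiVersion`, kind and field names are assumptions here, and the Stackable S3 concepts documentation describes the authoritative schema:

[source,yaml]
----
# Sketch only: apiVersion, kind and field names are assumed; the S3 concepts
# documentation describes the authoritative schema.
apiVersion: s3.stackable.tech/v1alpha1
kind: S3Connection
metadata:
  name: minio-connection
spec:
  host: minio            # assumed endpoint
  port: 9000
  credentials:
    secretClass: minio-credentials-class
----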

 == S3 access with TLS

docs/modules/spark-k8s/pages/usage-guide/security.adoc

Lines changed: 2 additions & 2 deletions
@@ -1,4 +1,5 @@
 = Security
+:description: Learn how to configure Apache Spark applications with Kerberos authentication using Stackable Secret Operator for secure data access in HDFS.

 == Authentication

@@ -56,7 +57,7 @@ executor:
 volumes:
   - name: hdfs-config <4>
     configMap:
-      name: hdfs
+      name: hdfs
   - name: kerberos
     ephemeral:
       volumeClaimTemplate:
@@ -94,4 +95,3 @@ sparkConf:
 ----
 <1> Location of the keytab file.
 <2> Principal name. This needs to have the format `<SERVICE_NAME>.default.svc.cluster.local@<REALM>` where `SERVICE_NAME` matches the volume claim annotation `secrets.stackable.tech/kerberos.service.names` and `REALM` must be `CLUSTER.LOCAL` unless a different realm was used explicitly. In that case, the `KERBEROS_REALM` environment variable must also be set accordingly.
-
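The lines these two callouts annotate are elided from the hunk. If they use Spark's standard Kerberos settings, they would look roughly like the sketch below; the keytab path and the `spark` service name are assumptions:

[source,yaml]
----
# Sketch only: keytab path and service name are assumed.
sparkConf:
  spark.kerberos.keytab: /stackable/kerberos/keytab                        # <1> location of the keytab file
  spark.kerberos.principal: spark.default.svc.cluster.local@CLUSTER.LOCAL  # <2> principal in the required format
----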
