
Commit c8e849f

maltesander and razvan authored
chore: ensure metrics are correctly exposed (#619)
* fix examples
* adapted changelog
* add metrics service for spark connect
* fix comment
* precommit
* add metrics test for the history server
* add metrics test for connect server
* add test for connect executor metrics
* update docs

Co-authored-by: Razvan-Daniel Mihai <[email protected]>
1 parent f4b00b7 commit c8e849f

File tree

22 files changed, +535 -139 lines changed


CHANGELOG.md

Lines changed: 3 additions & 0 deletions
@@ -9,6 +9,8 @@ All notable changes to this project will be documented in this file.
 - Add experimental support for Spark 4 ([#589])
 - Helm: Allow Pod `priorityClassName` to be configured ([#608]).
 - Support for Spark 3.5.7 ([#610]).
+- Add metrics service with `prometheus.io/path|port|scheme` annotations for spark history server ([#619]).
+- Add metrics service with `prometheus.io/path|port|scheme` annotations for spark connect ([#619]).

 ### Fixed

@@ -35,6 +37,7 @@ All notable changes to this project will be documented in this file.
 [#610]: https://github.com/stackabletech/spark-k8s-operator/pull/610
 [#611]: https://github.com/stackabletech/spark-k8s-operator/pull/611
 [#617]: https://github.com/stackabletech/spark-k8s-operator/pull/617
+[#619]: https://github.com/stackabletech/spark-k8s-operator/pull/619

 ## [25.7.0] - 2025-07-23

apps/ny_tlc_report.py

Lines changed: 1 addition & 0 deletions
@@ -9,6 +9,7 @@
 need to be submitted along with the job.
 --output Path to write the report as a CSV file.
 """
+
 import argparse

 from argparse import Namespace

docs/modules/spark-k8s/pages/usage-guide/history-server.adoc

Lines changed: 5 additions & 6 deletions
@@ -157,10 +157,9 @@ By setting up port forwarding on 18080 the UI can be opened by pointing your bro

 image::history-server-ui.png[History Server Console]

-== Metrics
+== Monitoring

-[NOTE]
-====
-Starting with version 25.7, the built-in Prometheus servlet is enabled in addition to the existing JMX exporter.
-The JMX exporter is still available but it is deprecated and will be removed in a future release.
-====
+The operator creates a Kubernetes service dedicated to collecting metrics for Spark History instances with Prometheus.
+These metrics are exported via the JMX exporter, as the history server does not support the built-in Spark Prometheus servlet.
+The service name follows the convention `<stacklet name>-history-metrics`.
+Metrics can be scraped at the endpoint `<service name>:18081/metrics`.
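The naming and port conventions from the docs change above can be sketched as a small helper. The function name and the example stacklet name `spark-history` are illustrative only, not part of the operator:

```python
def history_metrics_endpoint(stacklet_name: str) -> str:
    """Build the in-cluster scrape URL for a Spark History Server stacklet.

    Per the documented convention, metrics are exposed by a service named
    `<stacklet name>-history-metrics` on port 18081 under `/metrics`.
    """
    service = f"{stacklet_name}-history-metrics"
    return f"http://{service}:18081/metrics"

# A stacklet named "spark-history" would be scraped at:
print(history_metrics_endpoint("spark-history"))
# http://spark-history-history-metrics:18081/metrics
```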

docs/modules/spark-k8s/pages/usage-guide/operations/applications.adoc

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ As the operator creates the necessary resources, the status of the application t
 NOTE: The operator never reconciles an application once it has been created.
 To resubmit an application, a new SparkApplication resource must be created.

-== Metrics
+== Monitoring

 [NOTE]
 ====

docs/modules/spark-k8s/pages/usage-guide/spark-connect.adoc

Lines changed: 8 additions & 6 deletions
@@ -26,12 +26,14 @@ include::example$example-spark-connect.yaml[]
 <7> Customize the driver properties in the `server` role. The number of cores here is not related to Kubernetes cores!
 <8> Customize `spark.executor.\*` and `spark.kubernetes.executor.*` in the `executor` role.

-== Metrics
+== Monitoring

-The server pod exposes Prometheus metrics at the following endpoints:
+The operator creates a Kubernetes service dedicated to collecting metrics for Spark Connect instances with Prometheus.
+The service name follows the convention `<stacklet name>-server-metrics`.
+This service exposes Prometheus metrics at the following endpoints:

-* `/metrics/prometheus` for driver instances.
-* `/metrics/executors/prometheus` for executor instances.
+* `<service name>:4040/metrics/prometheus` for driver instances.
+* `<service name>:4040/metrics/executors/prometheus` for executor instances.

 To customize the metrics configuration, use `spec.server.configOverrides` like this:

@@ -47,8 +49,8 @@ The example above adds a new endpoint for application metrics.

 == Spark History Server

-Unforunately integration with the Spark History Server is not supported yet.
-The connect server seems to ignore the `spark.eventLog` properties while also prohibiting clients to set them programatically.
+Unfortunately, integration with the Spark History Server is not supported yet.
+The connect server seems to ignore the `spark.eventLog` properties while also prohibiting clients from setting them programmatically.

 == Notable Omissions

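The endpoint convention in the docs change above can likewise be sketched as a small helper. The function and the example stacklet name `spark-connect` are illustrative, not operator API:

```python
def connect_metrics_endpoints(stacklet_name: str) -> dict:
    """Scrape URLs for a Spark Connect stacklet, following the documented
    `<stacklet name>-server-metrics` service convention on port 4040."""
    base = f"http://{stacklet_name}-server-metrics:4040"
    return {
        "driver": f"{base}/metrics/prometheus",
        "executors": f"{base}/metrics/executors/prometheus",
    }

# Both roles are served from the same metrics service, on different paths:
for role, url in connect_metrics_endpoints("spark-connect").items():
    print(role, url)
```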

examples/README-examples.md

Lines changed: 1 addition & 1 deletion
@@ -50,7 +50,7 @@ Several resources are needed in this store. These can be loaded like this:

 ```text
 kubectl exec minio-mc-0 -- sh -c 'mc alias set test-minio http://test-minio:9000/'
-kubectl cp examples/ny-tlc-report-1.1.0-3.5.7.jar minio-mc-0:/tmp
+kubectl cp tests/templates/kuttl/spark-ny-public-s3/ny-tlc-report-1.1.0-3.5.7.jar minio-mc-0:/tmp
 kubectl cp apps/ny_tlc_report.py minio-mc-0:/tmp
 kubectl cp examples/yellow_tripdata_2021-07.csv minio-mc-0:/tmp
 kubectl exec minio-mc-0 -- mc cp /tmp/ny-tlc-report-1.1.0-3.5.7.jar test-minio/my-bucket

kind/assert-pvc-jars.yaml

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ spec:
       claimName: pvc-ksv
   containers:
     - name: assert-pvc-jars
-      image: oci.stackable.tech/sdp/tools:0.2.0-stackable0.4.0
+      image: oci.stackable.tech/sdp/tools:1.0.0-stackable0.0.0-dev
       env:
         - name: DEST_DIR
           value: "/dependencies/jars"

kind/kind-pvc.yaml

Lines changed: 1 addition & 1 deletion
@@ -24,7 +24,7 @@ spec:
       claimName: pvc-ksv
   containers:
     - name: aws-deps
-      image: oci.stackable.tech/sdp/tools:0.2.0-stackable0.4.0
+      image: oci.stackable.tech/sdp/tools:1.0.0-stackable0.0.0-dev
       env:
         - name: DEST_DIR
           value: "/dependencies/jars"

kind/minio.yaml

Lines changed: 1 addition & 1 deletion
@@ -29,6 +29,6 @@ spec:
   spec:
     containers:
       - name: minio-mc
-        image: bitnamilegacy/minio:2022-debian-10
+        image: docker.io/bitnamilegacy/minio:2024-debian-12
         stdin: true
         tty: true

rust/operator-binary/src/connect/controller.rs

Lines changed: 17 additions & 10 deletions
@@ -21,7 +21,7 @@ use strum::{EnumDiscriminants, IntoStaticStr};
 use super::crd::{CONNECT_APP_NAME, CONNECT_CONTROLLER_NAME, v1alpha1};
 use crate::{
     Ctx,
-    connect::{common, crd::SparkConnectServerStatus, executor, server},
+    connect::{common, crd::SparkConnectServerStatus, executor, server, service},
     crd::constants::{OPERATOR_NAME, SPARK_IMAGE_BASE_NAME},
 };

@@ -47,7 +47,7 @@
     ServerProperties { source: server::Error },

     #[snafu(display("failed to build spark connect service"))]
-    BuildService { source: server::Error },
+    BuildService { source: service::Error },

     #[snafu(display("failed to build spark connect executor config map for {name}"))]
     BuildExecutorConfigMap {

@@ -67,9 +67,6 @@
         name: String,
     },

-    #[snafu(display("spark connect object has no namespace"))]
-    ObjectHasNoNamespace,
-
     #[snafu(display("failed to update the connect server stateful set"))]
     ApplyStatefulSet {
         source: stackable_operator::cluster_resources::Error,

@@ -208,12 +205,22 @@
         .context(ApplyRoleBindingSnafu)?;

     // Headless service used by executors to connect back to the driver
-    let service =
-        server::build_internal_service(scs, &resolved_product_image.app_version_label_value)
+    let headless_service =
+        service::build_headless_service(scs, &resolved_product_image.app_version_label_value)
             .context(BuildServiceSnafu)?;

-    let applied_internal_service = cluster_resources
-        .add(client, service.clone())
+    let applied_headless_service = cluster_resources
+        .add(client, headless_service.clone())
+        .await
+        .context(ApplyServiceSnafu)?;
+
+    // Metrics service used for scraping
+    let metrics_service =
+        service::build_metrics_service(scs, &resolved_product_image.app_version_label_value)
+            .context(BuildServiceSnafu)?;
+
+    cluster_resources
+        .add(client, metrics_service.clone())
         .await
         .context(ApplyServiceSnafu)?;

@@ -224,7 +231,7 @@
     server::server_properties(
         scs,
         &server_config,
-        &applied_internal_service,
+        &applied_headless_service,
         &service_account,
         &resolved_product_image,
     )
