[SPARK-53325] Support Prometheus 2.0 text-based-format and best practices for metrics naming #298

jiangzho · 2025-08-19T00:27:20Z

What changes were proposed in this pull request?

This PR adds support for the Prometheus text-based format and for Prometheus best practices for metrics naming.

Existing format

metrics_jvm_bufferPool_direct_capacity_Number{type="gauges"} 98348
metrics_jvm_bufferPool_direct_capacity_Value{type="gauges"} 98348
metrics_jvm_bufferPool_direct_count_Number{type="gauges"} 41
metrics_jvm_bufferPool_direct_count_Value{type="gauges"} 41
metrics_kubernetes_client_http_response_latency_nanos_Count{type="histograms"} 26910
metrics_kubernetes_client_http_response_latency_nanos_Max{type="histograms"} 232417143
metrics_kubernetes_client_http_response_latency_nanos_Mean{type="histograms"} 1.1410164260725182E7
metrics_kubernetes_client_http_response_latency_nanos_Min{type="histograms"} 2931711
metrics_kubernetes_client_http_response_latency_nanos_50thPercentile{type="histograms"} 7559152.0
metrics_kubernetes_client_http_response_latency_nanos_75thPercentile{type="histograms"} 9440850.0
metrics_kubernetes_client_http_response_latency_nanos_95thPercentile{type="histograms"} 1.2576766E7
metrics_kubernetes_client_http_response_latency_nanos_98thPercentile{type="histograms"} 1.34034482E8
metrics_kubernetes_client_http_response_latency_nanos_99thPercentile{type="histograms"} 1.34034482E8
metrics_kubernetes_client_http_response_latency_nanos_999thPercentile{type="histograms"} 1.34034482E8
metrics_kubernetes_client_http_response_latency_nanos_StdDev{type="histograms"} 2.177784612259799E7
metrics_kubernetes_client_pods_get_Count{type="counters"} 8967
metrics_kubernetes_client_pods_get_MeanRate{type="counters"} 0.02678169644780033
metrics_kubernetes_client_pods_get_OneMinuteRate{type="counters"} 0.049758750361204154
metrics_kubernetes_client_pods_get_FiveMinuteRate{type="counters"} 0.035255140329213855
metrics_kubernetes_client_pods_get_FifteenMinuteRate{type="counters"} 0.02931221844089468

With this patch, the operator can export metrics in a format that follows Prometheus 2.0 recommended practice, for example:

# HELP jvm_bufferpool_direct_capacity Gauge metric
# TYPE jvm_bufferpool_direct_capacity gauge
jvm_bufferpool_direct_capacity 256092

# HELP jvm_bufferpool_direct_count Gauge metric
# TYPE jvm_bufferpool_direct_count gauge
jvm_bufferpool_direct_count 44

# HELP kubernetes_client_2xx_total Meter count
# TYPE kubernetes_client_2xx_total counter
kubernetes_client_2xx_total 130
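
For illustration only, here is a minimal sketch of how a sample with its `# HELP` and `# TYPE` metadata lines could be rendered. The class `TextFormatSketch` and its `render` method are hypothetical names and are not the code in `PrometheusPullModelHandler`; the metric names and values are taken from the example above.

```java
/** Hypothetical helper for illustration; not the operator's actual handler. */
final class TextFormatSketch {

  /** Renders one sample preceded by its # HELP and # TYPE metadata lines. */
  static String render(String name, String type, String help, long value) {
    return "# HELP " + name + " " + help + "\n"
        + "# TYPE " + name + " " + type + "\n"
        + name + " " + value + "\n";
  }

  public static void main(String[] args) {
    // Names and values mirror the sample output in the PR description.
    System.out.print(render("jvm_bufferpool_direct_count", "gauge", "Gauge metric", 44));
    System.out.println();
    System.out.print(render("kubernetes_client_2xx_total", "counter", "Meter count", 130));
  }
}
```

The point of the sketch is only that each metric family carries its metadata immediately before its samples, which is what scrapers relying on `# TYPE` expect.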

Why are the changes needed?

Prometheus 2.0 best practice is to use the text-based format with the necessary comments. Also, some common scrapers (such as Datadog) rely on this metadata (e.g. # HELP and # TYPE) to parse metrics correctly, and they may skip metrics when it is missing.

Ideally, we can introduce a similar patch to Spark core as well for better Prometheus 2.0 compatibility.

Does this PR introduce any user-facing change?

New functionality becomes available (for the metrics format).

How was this patch tested?

CIs / curl on :19090/prometheus to validate the format

Was this patch authored or co-authored using generative AI tooling?

No

jiangzho · 2025-08-19T00:28:33Z

cc @peter-toth for review - thanks !

peter-toth

LGTM, nits only.

spark-operator/src/main/java/org/apache/spark/k8s/operator/config/SparkOperatorConf.java

...operator/src/main/java/org/apache/spark/k8s/operator/metrics/PrometheusPullModelHandler.java

peter-toth

Thanks for the fixes @jiangzho, LGTM from my side.

@dongjoon-hyun , would you like to take a look at this PR?

dongjoon-hyun · 2025-08-28T03:07:23Z

Sure, @peter-toth . Sorry for being late for the party.

dongjoon-hyun · 2025-08-28T03:09:17Z

spark-operator/src/main/java/org/apache/spark/k8s/operator/config/SparkOperatorConf.java


+  public static final ConfigOption<Boolean> ENABLE_PROMETHEUS_TEXT_BASED_FORMAT =
+      ConfigOption.<Boolean>builder()
+          .key("spark.kubernetes.operator.metrics.enablePrometheusTextBasedFormat")


For new configuration, we had better rename enableXXX to XXXEnabled for consistency.

- spark.kubernetes.operator.metrics.enablePrometheusTextBasedFormat
+ spark.kubernetes.operator.metrics.PrometheusTextBasedFormatEnabled

Please see the existing code you added, @jiangzho .

spark-kubernetes-operator/spark-operator/src/main/java/org/apache/spark/k8s/operator/config/SparkOperatorConf.java

Line 67 in 2ef2bf5

.key("spark.kubernetes.operator.terminateOnInformerFailureEnabled")

spark-kubernetes-operator/spark-operator/src/main/java/org/apache/spark/k8s/operator/config/SparkOperatorConf.java

Line 128 in 2ef2bf5

.key("spark.kubernetes.operator.reconciler.trimStateTransitionHistoryEnabled")

spark-kubernetes-operator/spark-operator/src/main/java/org/apache/spark/k8s/operator/config/SparkOperatorConf.java

Line 303 in 2ef2bf5

.key("spark.kubernetes.operator.metrics.clientMetricsEnabled")

Actually, it seems we are a bit inconsistent with config naming. Sometimes we use the Enabled postfix, like in the above cases, but other times .enabled, like in spark.kubernetes.operator.leaderElection.enabled. The Apache Spark repo seems to prefer .enabled, so shall we stick to it here as well?

Actually, it seems we are a bit inconsistent with config namings. Sometimes we use Enabled postfix, like in the above cases, but other times .enabled like in spark.kubernetes.operator.leaderElection.enabled.

😄 I understand why you are saying that. Since Apache Spark is an open source community, contributions have followed many different styles. IIUC, there was a proposal from Wenchen (cloud-fan) a long time ago to make new configurations more consistent. Although my memory is vague, here is my understanding.

.enabled is recommended only for a new configuration namespace itself. For example, spark.sql.ansi.enabled is used to turn all of spark.sql.ansi.* on and off. For a single configuration, we don't use it because X.Y.Z.enabled declares that X.Y.Z is a new namespace. For example, spark.kubernetes.operator.leaderElection.enabled controls the whole spark.kubernetes.operator.leaderElection.* configuration. There is nothing misleading here.

spark.kubernetes.operator.leaderElection.enabled
spark.kubernetes.operator.leaderElection.leaseName
spark.kubernetes.operator.leaderElection.leaseDurationSeconds
spark.kubernetes.operator.leaderElection.renewDeadlineSeconds
spark.kubernetes.operator.leaderElection.retryPeriodSeconds

.enabled vs .enable. We decided to use .enabled while leaving our existing mistakes (like the following) in place. So, there is nothing to argue about here. The rule is straightforward.

spark.acls.enable
spark.history.ui.acls.enable
spark.appStateStore.asyncTracking.enable
spark.sql.inMemoryTableScanStatistics.enable
spark.kafka.consumer.cache.jmx.enable

So, the following is wrong because we have a clear rule for .enabled.

Apache Spark repo seem to prefer .enabled so shall we stick to it here as well?

Lastly, I want to clarify that you are confusing .enabled with .enableXXX. We need to compare .enableXXX and .XXXEnabled because neither introduces side effects like adding a new namespace. So, the choice between .enableXXX and .XXXEnabled is still debatable. I get it. However, look around this K8s operator repository. Did you find any instance of .enableXXX? Technically, there are none, because I have not allowed them from the beginning, to prevent that kind of mess in this new repository.

$ git grep spark.kubernetes.operator.enable | wc -l
0

To sum up, I'm -1 on these wrong configuration names, to protect the existing consistency of the K8s Operator repository, and I expect you folks (@peter-toth and @jiangzho) to be aware of this naming rule as protectors of the K8s Operator repository while reviewing other community PRs.
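
As a rough illustration of the rule described in this comment, a key check could look like the sketch below. `ConfigKeyNamingCheck` and `followsConvention` are hypothetical names, not part of the operator; the check only flags keys whose last segment is `enable` or `enableXXX`, and accepts both `XXXEnabled` and a namespace-level `.enabled`.

```java
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical lint sketch for the naming rule discussed in the review. */
final class ConfigKeyNamingCheck {

  // Matches a last segment that is exactly "enable" or "enable" + UpperCamel suffix.
  private static final Pattern DISCOURAGED = Pattern.compile("^enable([A-Z].*)?$");

  /** Returns false when the last dot-separated segment is "enable" or "enableXXX". */
  static boolean followsConvention(String key) {
    String lastSegment = key.substring(key.lastIndexOf('.') + 1);
    return !DISCOURAGED.matcher(lastSegment).matches();
  }

  public static void main(String[] args) {
    List<String> keys = List.of(
        "spark.kubernetes.operator.metrics.enablePrometheusTextBasedFormat",
        "spark.kubernetes.operator.metrics.clientMetricsEnabled",
        "spark.kubernetes.operator.leaderElection.enabled");
    for (String key : keys) {
      System.out.println(key + " -> " + followsConvention(key));
    }
  }
}
```

Note that the sketch cannot tell whether `.enabled` really guards a whole namespace; that part of the rule still needs a human reviewer.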

Thanks @dongjoon-hyun for the detailed reasoning!

I have updated the conf key and will keep that in mind for future configurations.

Thanks @dongjoon-hyun for the details, I'm ok with the config names suggested by you.

dongjoon-hyun · 2025-08-28T03:09:35Z

spark-operator/src/main/java/org/apache/spark/k8s/operator/config/SparkOperatorConf.java

+
+  public static final ConfigOption<Boolean> ENABLE_SANITIZED_PROMETHEUS_METRICS_NAME =
+      ConfigOption.<Boolean>builder()
+          .key("spark.kubernetes.operator.metrics.enableSanitizePrometheusMetricsName")


dongjoon-hyun · 2025-08-28T03:12:13Z

docs/config_properties.md

 | spark.kubernetes.operator.metrics.clientMetricsGroupByResponseCodeEnabled | Boolean | true | false | When enabled, additional metrics group by http response code group(1xx, 2xx, 3xx, 4xx, 5xx) received from API server will be added. Users can disable it when their monitoring system can combine lower level kubernetes.client.http.response.<3-digit-response-code> metrics. | 
 | spark.kubernetes.operator.metrics.port | Integer | 19090 | false | The port used for checking metrics | 
+ | spark.kubernetes.operator.metrics.enablePrometheusTextBasedFormat | Boolean | true | false | Whether or not to enable text-based format for Prometheus 2.0, as recommended by https://prometheus.io/docs/instrumenting/exposition_formats/#text-based-format | 
+ | spark.kubernetes.operator.metrics.enableSanitizePrometheusMetricsName | Boolean | true | false | Whether or not to enable automatic name sanitizing for all metrics based on best-practice guide from Prometheus https://prometheus.io/docs/practices/naming/ | 


Both configuration names should be revised.
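
For context on the second option, here is a minimal sketch of what name sanitizing could look like. The exact rules applied by this PR are not shown in this thread, so the rules below (lower-casing, replacing characters outside `[a-z0-9_]` with `_`, collapsing repeated underscores, and prefixing `_` when the name starts with a digit) are assumptions, and `MetricNameSanitizerSketch` is a hypothetical name.

```java
import java.util.Locale;

/** Hypothetical sanitizer sketch; the rules are assumptions, not the PR's implementation. */
final class MetricNameSanitizerSketch {

  /** Maps a Dropwizard-style dotted name to a lower-case, underscore-separated name. */
  static String sanitize(String rawName) {
    String cleaned = rawName.toLowerCase(Locale.ROOT)
        .replaceAll("[^a-z0-9_]", "_") // anything outside the safe set becomes "_"
        .replaceAll("_+", "_");        // collapse runs of underscores
    // Prometheus metric names must not start with a digit.
    boolean needsPrefix = cleaned.isEmpty() || Character.isDigit(cleaned.charAt(0));
    return needsPrefix ? "_" + cleaned : cleaned;
  }

  public static void main(String[] args) {
    System.out.println(sanitize("jvm.bufferPool.direct.capacity"));
  }
}
```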

dongjoon-hyun · 2025-08-28T03:18:12Z

...operator/src/main/java/org/apache/spark/k8s/operator/metrics/PrometheusPullModelHandler.java

+          .append("_rate{interval=\"15m\"} ")
+          .append(meter.getFifteenMinuteRate())
+          .append("\n\n");
+      return stringBuilder.toString();


Please consider using string concatenation, like lines 177 to 186.

FYI, Java 9+ improved string concatenation via JEP 280: Indify String Concatenation (https://openjdk.org/jeps/280).

For the technical details, please see SPARK-52880 , @jiangzho .

[SPARK-52880][CORE] Improve toString by JEP-280 instead of ToStringBuilder spark#51572
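
To make the suggestion concrete, the sketch below builds meter rate lines with plain `+` concatenation, which javac compiles to an `invokedynamic` call on Java 9+ (JEP 280). `MeterLinesSketch` and `rateLines` are hypothetical names; only the `_rate{interval="15m"}` label and the trailing blank line come from the diff above, while the `1m` and `5m` labels are assumed by analogy.

```java
/** Hypothetical sketch of the reviewer's suggestion; not the merged code. */
final class MeterLinesSketch {

  /** Builds the rate samples for one meter using plain string concatenation. */
  static String rateLines(String name, double oneMin, double fiveMin, double fifteenMin) {
    // A single concatenation expression replaces the chained StringBuilder.append calls.
    return name + "_rate{interval=\"1m\"} " + oneMin + "\n"
        + name + "_rate{interval=\"5m\"} " + fiveMin + "\n"
        + name + "_rate{interval=\"15m\"} " + fifteenMin + "\n\n";
  }

  public static void main(String[] args) {
    System.out.print(rateLines("kubernetes_client_pods_get", 0.5, 0.25, 0.125));
  }
}
```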

dongjoon-hyun

With the above mentioned reason, I inevitably need to block accidental merging of this PR.

#298 (comment)

…ices for metrics naming

### What changes were proposed in this pull request?

This PR adds support for Prometheus text-based-format and best practices for metrics naming

Existing format

```
metrics_jvm_bufferPool_direct_capacity_Number{type="gauges"} 98348
metrics_jvm_bufferPool_direct_capacity_Value{type="gauges"} 98348
metrics_jvm_bufferPool_direct_count_Number{type="gauges"} 41
metrics_jvm_bufferPool_direct_count_Value{type="gauges"} 41
metrics_kubernetes_client_http_response_latency_nanos_Count{type="histograms"} 26910
metrics_kubernetes_client_http_response_latency_nanos_Max{type="histograms"} 232417143
metrics_kubernetes_client_http_response_latency_nanos_Mean{type="histograms"} 1.1410164260725182E7
metrics_kubernetes_client_http_response_latency_nanos_Min{type="histograms"} 2931711
metrics_kubernetes_client_http_response_latency_nanos_50thPercentile{type="histograms"} 7559152.0
metrics_kubernetes_client_http_response_latency_nanos_75thPercentile{type="histograms"} 9440850.0
metrics_kubernetes_client_http_response_latency_nanos_95thPercentile{type="histograms"} 1.2576766E7
metrics_kubernetes_client_http_response_latency_nanos_98thPercentile{type="histograms"} 1.34034482E8
metrics_kubernetes_client_http_response_latency_nanos_99thPercentile{type="histograms"} 1.34034482E8
metrics_kubernetes_client_http_response_latency_nanos_999thPercentile{type="histograms"} 1.34034482E8
metrics_kubernetes_client_http_response_latency_nanos_StdDev{type="histograms"} 2.177784612259799E7
metrics_kubernetes_client_pods_get_Count{type="counters"} 8967
metrics_kubernetes_client_pods_get_MeanRate{type="counters"} 0.02678169644780033
metrics_kubernetes_client_pods_get_OneMinuteRate{type="counters"} 0.049758750361204154
metrics_kubernetes_client_pods_get_FiveMinuteRate{type="counters"} 0.035255140329213855
metrics_kubernetes_client_pods_get_FifteenMinuteRate{type="counters"} 0.02931221844089468
```

with this patch, operator would be able to export format matching Prometheus 2.0 recommended practice like

```
# HELP jvm_bufferpool_direct_capacity Gauge metric
# TYPE jvm_bufferpool_direct_capacity gauge
jvm_bufferpool_direct_capacity 256092

# HELP jvm_bufferpool_direct_count Gauge metric
# TYPE jvm_bufferpool_direct_count gauge
jvm_bufferpool_direct_count 44

# HELP kubernetes_client_2xx_total Meter count
# TYPE kubernetes_client_2xx_total counter
kubernetes_client_2xx_total 130

# HELP kubernetes_client_http_response_latency Histogram metric
# TYPE kubernetes_client_http_response_latency histogram
kubernetes_client_http_response_latency_seconds_bucket{le="0.5"} 0.000104422
kubernetes_client_http_response_latency_seconds_bucket{le="0.75"} 0.000128422
kubernetes_client_http_response_latency_seconds_bucket{le="0.95"} 0.000139544
kubernetes_client_http_response_latency_seconds_bucket{le="0.98"} 0.000169124
kubernetes_client_http_response_latency_seconds_bucket{le="0.99"} 0.066452639
kubernetes_client_http_response_latency_seconds_count 2000
kubernetes_client_http_response_latency_seconds_sum 0.456670434
```

### Why are the changes needed?

It's Prometheus 2.0 best practice for using the next format with necessary comments. Also, some common scrapers (like Datadog) rely on these metadata (e.g. # HELP and # TYPE) to parse metrics correctly. They may skip metrics if these are missing.

### Does this PR introduce _any_ user-facing change?

New functionalities becomes available (for metrics format)

### How was this patch tested?

CIs / curl on :19090/prometheus to validate the format

### Was this patch authored or co-authored using generative AI tooling?

No

jiangzho · 2025-08-28T21:52:33Z

Thanks @dongjoon-hyun ! Your review is truly appreciated. I've updated the PR according to your review.

I also added one commit to fix the previous handling of histograms and timers, so that they now publish quantiles instead of misinterpreting them as le buckets, as the previous commit did.
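
To illustrate the distinction, the sketch below renders precomputed quantiles with a `quantile` label, which is how the Prometheus text format represents a summary, rather than as cumulative `le` bucket counts. `SummarySketch` is a hypothetical name, and whether the merged code declares the family as `summary` is an assumption; the comment above only states that quantiles are now published.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical sketch of quantile-style output; not the merged implementation. */
final class SummarySketch {

  /** Renders quantile samples plus the _sum and _count series for one metric family. */
  static String render(String name, Map<String, Double> quantiles, double sum, long count) {
    String samples = quantiles.entrySet().stream()
        .map(e -> name + "{quantile=\"" + e.getKey() + "\"} " + e.getValue() + "\n")
        .collect(Collectors.joining());
    return "# TYPE " + name + " summary\n"
        + samples
        + name + "_sum " + sum + "\n"
        + name + "_count " + count + "\n";
  }

  public static void main(String[] args) {
    // LinkedHashMap keeps the quantiles in insertion order.
    Map<String, Double> quantiles = new LinkedHashMap<>();
    quantiles.put("0.5", 0.000104422);
    quantiles.put("0.99", 0.066452639);
    System.out.print(
        render("kubernetes_client_http_response_latency_seconds", quantiles, 0.456670434, 2000));
  }
}
```

An `le` bucket value is a count of observations at or below a bound, so placing a latency value there, as in the earlier sample output, would be read incorrectly by a scraper.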

dongjoon-hyun · 2025-08-29T18:10:47Z

Let's wait until Peter comes back so that we build a consensus on config name spaces because this is essentially important for project management (at least among us).

peter-toth · 2025-08-29T18:27:26Z

Let's wait until Peter comes back so that we build a consensus on config name spaces because this is essentially important for project management (at least among us).

Thanks @dongjoon-hyun! Our conclusion on names in #298 (comment) is ok with me.

peter-toth

LGTM from my end, @dongjoon-hyun do you have any other suggestions?

dongjoon-hyun

+1, LGTM. Thank you, @jiangzho and @peter-toth .

peter-toth · 2025-09-03T08:25:35Z

Thank you @jiangzho , @dongjoon-hyun.

Merged to main (0.5.0).

github-actions bot added the OPERATOR label Aug 19, 2025

jiangzho force-pushed the metrics_header branch 4 times, most recently from d673f89 to 935c07e August 22, 2025 01:22

peter-toth approved these changes Aug 25, 2025

peter-toth approved these changes Aug 26, 2025

dongjoon-hyun reviewed Aug 28, 2025

dongjoon-hyun requested changes Aug 28, 2025

jiangzho added 5 commits August 28, 2025 13:20

Fix style, invalid regex in name formatting, and remove duplicates in…

35e9f6b

… reference.

Group the new props under 'metrics'

ec0f9d7

Fix the naming convention for new properties, and use string concate …

904ba26

…as possible

Fix histogram and timer handling to use quantile instead of le bucket

a4ddc93

jiangzho force-pushed the metrics_header branch from fe2f023 to a4ddc93 August 28, 2025 21:48

peter-toth approved these changes Sep 2, 2025

dongjoon-hyun approved these changes Sep 2, 2025

peter-toth closed this in 5f476c3 Sep 3, 2025

peter-toth mentioned this pull request Sep 5, 2025

[SPARK-53442]Make PrometheusServlet compatible with OpenMetrics apache/spark#52183

Open
