-
Notifications
You must be signed in to change notification settings - Fork 48
[SPARK-53325] Support Prometheus 2.0 text-based-format and best practices for metrics naming #298
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
cc @peter-toth for review - thanks ! |
d673f89 to
935c07e
Compare
peter-toth
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, nits only.
spark-operator/src/main/java/org/apache/spark/k8s/operator/config/SparkOperatorConf.java
Outdated
Show resolved
Hide resolved
...operator/src/main/java/org/apache/spark/k8s/operator/metrics/PrometheusPullModelHandler.java
Outdated
Show resolved
Hide resolved
peter-toth
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fixes @jiangzho, LGTM form my side.
@dongjoon-hyun , would you like to take a look at this PR?
|
Sure, @peter-toth . Sorry for being late for the party. |
|
|
||
| public static final ConfigOption<Boolean> ENABLE_PROMETHEUS_TEXT_BASED_FORMAT = | ||
| ConfigOption.<Boolean>builder() | ||
| .key("spark.kubernetes.operator.metrics.enablePrometheusTextBasedFormat") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For new configuration, we had better rename enableXXX to XXXEnabled for consistency.
- spark.kubernetes.operator.metrics.enablePrometheusTextBasedFormat
+ spark.kubernetes.operator.metrics.PrometheusTextBasedFormatEnabled
Please see the existing code you added, @jiangzho .
Line 67 in 2ef2bf5
| .key("spark.kubernetes.operator.terminateOnInformerFailureEnabled") |
Line 128 in 2ef2bf5
| .key("spark.kubernetes.operator.reconciler.trimStateTransitionHistoryEnabled") |
Line 303 in 2ef2bf5
| .key("spark.kubernetes.operator.metrics.clientMetricsEnabled") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, it seems we are a bit inconsistent with config namings. Sometimes we use Enabled postfix, like in the above cases, but other times .enabled like in spark.kubernetes.operator.leaderElection.enabled. Apache Spark repo seem to prefer .enabled so shall we stick to it here as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, it seems we are a bit inconsistent with config namings. Sometimes we use
Enabledpostfix, like in the above cases, but other times.enabledlike inspark.kubernetes.operator.leaderElection.enabled.
😄 I understand why you saying like that. Since Apache Spark was an open source community, there were many ways in the community contribution. IIUC, there was a proposal from Wenchen (cloud-fan) long time ago to make it more consistent for the new configuration. Although my memory is vague, here is my understanding.
.enabledis recommended only for new configuration namespace itself. For example,spark.sql.ansi.enabledis used to turn on and off for allspark.sql.ansi.*. For a single configuration, we don't use it becauseX.Y.Z.enableddeclaresX.Y.Zis a new namespace. For example,spark.kubernetes.operator.leaderElection.enabledcontrols the wholespark.kubernetes.operator.leaderElection.*configuration. There is no misleading part here.
spark.kubernetes.operator.leaderElection.enabled
spark.kubernetes.operator.leaderElection.leaseName
spark.kubernetes.operator.leaderElection.leaseDurationSeconds
spark.kubernetes.operator.leaderElection.renewDeadlineSeconds
spark.kubernetes.operator.leaderElection.retryPeriodSeconds
.enabledvs.enable. We decided to use.enabledwhile leaving our existing mistakes (like the following). So, there is no arguable point here. The rule is straight forward.
spark.acls.enable
spark.history.ui.acls.enable
spark.appStateStore.asyncTracking.enable
spark.sql.inMemoryTableScanStatistics.enable
spark.kafka.consumer.cache.jmx.enable
So, the following is wrong because we have a clear rule for .enabled.
Apache Spark repo seem to prefer
.enabledso shall we stick to it here as well?
Lastly, I want to clarify that you are confused with .enabled and .enableXXX. We need to compare .enableXXX and .XXXEnabled because they don't introduce any side effects like adding a new namespace. So, these .enableXXX and .XXXEnabled are still controversial. I get it. However, look around this K8s operator repository. Did you find any instance of .enableXXX? Technically, it's none because I didn't allow from the beginning to prevent those kind mess-up in this new repository.
$ git grep spark.kubernetes.operator.enable | wc -l
0
To sum up, I'm -1 with this wrong configuration names to protect the existing consistency of K8s Operator repository and I expect you folks (@peter-toth and @jiangzho) are aware of this naming rule as the protector of K8s Operator repository while reviewing other community PRs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @dongjoon-hyun for the detailed reason through!
I have updated the conf key and would keep that in mind for future configuration
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @dongjoon-hyun for the details, I'm ok with the config names suggested by you.
|
|
||
| public static final ConfigOption<Boolean> ENABLE_SANITIZED_PROMETHEUS_METRICS_NAME = | ||
| ConfigOption.<Boolean>builder() | ||
| .key("spark.kubernetes.operator.metrics.enableSanitizePrometheusMetricsName") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto.
docs/config_properties.md
Outdated
| | spark.kubernetes.operator.metrics.clientMetricsGroupByResponseCodeEnabled | Boolean | true | false | When enabled, additional metrics group by http response code group(1xx, 2xx, 3xx, 4xx, 5xx) received from API server will be added. Users can disable it when their monitoring system can combine lower level kubernetes.client.http.response.<3-digit-response-code> metrics. | | ||
| | spark.kubernetes.operator.metrics.port | Integer | 19090 | false | The port used for checking metrics | | ||
| | spark.kubernetes.operator.metrics.enablePrometheusTextBasedFormat | Boolean | true | false | Whether or not to enable text-based format for Prometheus 2.0, as recommended by https://prometheus.io/docs/instrumenting/exposition_formats/#text-based-format | | ||
| | spark.kubernetes.operator.metrics.enableSanitizePrometheusMetricsName | Boolean | true | false | Whether or not to enable automatic name sanitizing for all metrics based on best-practice guide from Prometheus https://prometheus.io/docs/practices/naming/ | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both configuration names should be revised.
| .append("_rate{interval=\"15m\"} ") | ||
| .append(meter.getFifteenMinuteRate()) | ||
| .append("\n\n"); | ||
| return stringBuilder.toString(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please consider to use string concatenation like line 177 to 186.
FYI, Java 9+ improved string contentenation via JEP 280: Indify String Concatenation (https://openjdk.org/jeps/280).
For the technical details, please see SPARK-52880 , @jiangzho .
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the above mentioned reason, I inevitably need to block accidental merging of this PR.
…ices for metrics naming
### What changes were proposed in this pull request?
This PR adds support for Prometheus text-based-format and best practices for metrics naming
Existing format
```
metrics_jvm_bufferPool_direct_capacity_Number{type="gauges"} 98348
metrics_jvm_bufferPool_direct_capacity_Value{type="gauges"} 98348
metrics_jvm_bufferPool_direct_count_Number{type="gauges"} 41
metrics_jvm_bufferPool_direct_count_Value{type="gauges"} 41
metrics_kubernetes_client_http_response_latency_nanos_Count{type="histograms"} 26910
metrics_kubernetes_client_http_response_latency_nanos_Max{type="histograms"} 232417143
metrics_kubernetes_client_http_response_latency_nanos_Mean{type="histograms"} 1.1410164260725182E7
metrics_kubernetes_client_http_response_latency_nanos_Min{type="histograms"} 2931711
metrics_kubernetes_client_http_response_latency_nanos_50thPercentile{type="histograms"} 7559152.0
metrics_kubernetes_client_http_response_latency_nanos_75thPercentile{type="histograms"} 9440850.0
metrics_kubernetes_client_http_response_latency_nanos_95thPercentile{type="histograms"} 1.2576766E7
metrics_kubernetes_client_http_response_latency_nanos_98thPercentile{type="histograms"} 1.34034482E8
metrics_kubernetes_client_http_response_latency_nanos_99thPercentile{type="histograms"} 1.34034482E8
metrics_kubernetes_client_http_response_latency_nanos_999thPercentile{type="histograms"} 1.34034482E8
metrics_kubernetes_client_http_response_latency_nanos_StdDev{type="histograms"} 2.177784612259799E7
metrics_kubernetes_client_pods_get_Count{type="counters"} 8967
metrics_kubernetes_client_pods_get_MeanRate{type="counters"} 0.02678169644780033
metrics_kubernetes_client_pods_get_OneMinuteRate{type="counters"} 0.049758750361204154
metrics_kubernetes_client_pods_get_FiveMinuteRate{type="counters"} 0.035255140329213855
metrics_kubernetes_client_pods_get_FifteenMinuteRate{type="counters"} 0.02931221844089468
```
with this patch, operator would be able to export format matching Prometheus 2.0 recommended practice like
```
# HELP jvm_bufferpool_direct_capacity Gauge metric
# TYPE jvm_bufferpool_direct_capacity gauge
jvm_bufferpool_direct_capacity 256092
# HELP jvm_bufferpool_direct_count Gauge metric
# TYPE jvm_bufferpool_direct_count gauge
jvm_bufferpool_direct_count 44
# HELP kubernetes_client_2xx_total Meter count
# TYPE kubernetes_client_2xx_total counter
kubernetes_client_2xx_total 130
# HELP kubernetes_client_http_response_latency Histogram metric
# TYPE kubernetes_client_http_response_latency histogram
kubernetes_client_http_response_latency_seconds_bucket{le="0.5"} 0.000104422
kubernetes_client_http_response_latency_seconds_bucket{le="0.75"} 0.000128422
kubernetes_client_http_response_latency_seconds_bucket{le="0.95"} 0.000139544
kubernetes_client_http_response_latency_seconds_bucket{le="0.98"} 0.000169124
kubernetes_client_http_response_latency_seconds_bucket{le="0.99"} 0.066452639
kubernetes_client_http_response_latency_seconds_count 2000
kubernetes_client_http_response_latency_seconds_sum 0.456670434
```
### Why are the changes needed?
It's Prometheus 2.0 best practice for using the next format with necessary comments.
Also, some common scrapers (like Datadog) rely on these metadata (e.g. # HELP and # TYPE) to parse metrics correctly. They may skip metrics if these are missing.
### Does this PR introduce _any_ user-facing change?
New functionalities becomes available (for metrics format)
### How was this patch tested?
CIs / curl on :19090/prometheus to validate the format
### Was this patch authored or co-authored using generative AI tooling?
No
fe2f023 to
a4ddc93
Compare
|
Thanks @dongjoon-hyun ! Your review is truly appreciated. I've updated the PR according to your review Also added one commit to fix the previous handling of histogram & timers, making them publishing quantiles instead of the mis-interpretation of le buckets in previous commit |
|
Let's wait until Peter comes back so that we build a consensus on config name spaces because this is essentially important for project management (at least among us). |
Thanks @dongjoon-hyun! Our concluson on names in #298 (comment) is ok with me. |
peter-toth
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM from my end, @dongjoon-hyun do you have any other suggestions?
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM. Thank you, @jiangzho and @peter-toth .
|
Thank you @jiangzho , @dongjoon-hyun. Merged to |
What changes were proposed in this pull request?
This PR adds support for Prometheus text-based-format and best practices for metrics naming
Existing format
with this patch, operator would be able to export format matching Prometheus 2.0 recommended practice like
Why are the changes needed?
It's Prometheus 2.0 best practice for using the next format with necessary comments. Also, some common scrapers (like Datadog) rely on these metadata (e.g. # HELP and # TYPE) to parse metrics correctly. They may skip metrics if these are missing.
Ideally we can introduce similar patch to Spark core as well for better Prometheus 2.0 compatibility.
Does this PR introduce any user-facing change?
New functionalities becomes available (for metrics format)
How was this patch tested?
CIs / curl on :19090/prometheus to validate the format
Was this patch authored or co-authored using generative AI tooling?
No