
Commit 04a40ab

ForVic authored and Mridul Muralidharan committed
[SPARK-53157][CORE] Decouple driver and executor polling intervals
### What changes were proposed in this pull request?

Add a config `spark.driver.metrics.pollingInterval`, and schedule the driver heartbeat / metrics polling at that interval.

### Why are the changes needed?

Decouple driver and executor heartbeat intervals. Because memory metrics are sampled at the reporting interval, we do not have a 100% accurate view of stats at drivers and executors. This is particularly pronounced at the driver, where we don't have the benefit of the larger sample size that metrics from N executors provide. This change gives users a way to increase (or change in general) the rate of metric collection at the driver, to help overcome the sampling problem, without requiring them to also increase the executor heartbeat frequency.

### Does this PR introduce _any_ user-facing change?

Yes, it introduces a Spark config.

### How was this patch tested?

Verified that metric collection improved when the sampling rate was increased, and verified that the number of events was as expected when the rate was changed.

Methodology for validating that a higher driver heartbeat frequency improves memory metric collection:
1. Using a 6gb driver heap, wrote a job to broadcast a table, gradually increasing the size of the table until OOM.
2. Increased driver memory to 10gb, large enough for the same broadcast to succeed.
3. Repeated this job and tracked the peak memory usage that was written to the event log.
4. After repeated experiments, observed that the median peak heap usage tracked was <=5GiB.
5. Added my change, and decreased the heartbeat interval.
6. Re-ran the same jobs with a 10gb heap, and saw that the typical peak memory usage tracked was ~8GiB, more accurately reflecting the increased memory needs.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #51885 from ForVic/vsunderl/driver_polling_interval.

Authored-by: ForVic <[email protected]>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
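As an illustrative usage sketch (not part of this commit): with the new config, the driver's metric polling rate can be raised independently of the executor heartbeat. The app name, `local[*]` master, and the `1s` value below are placeholder choices, not recommendations.

```scala
import org.apache.spark.sql.SparkSession

// Poll driver metrics every second while leaving the executor heartbeat
// at its 10s default; only the driver-side collection rate changes.
val spark = SparkSession.builder()
  .appName("driver-polling-example") // hypothetical app name
  .master("local[*]")
  .config("spark.driver.metrics.pollingInterval", "1s")
  .getOrCreate()
```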
1 parent 4441fa1 · commit 04a40ab

2 files changed: +9 additions, −1 deletion

core/src/main/scala/org/apache/spark/SparkContext.scala

Lines changed: 1 addition & 1 deletion
```diff
@@ -614,7 +614,7 @@ class SparkContext(config: SparkConf) extends Logging {
     _heartbeater = new Heartbeater(
       () => SparkContext.this.reportHeartBeat(_executorMetricsSource),
       "driver-heartbeater",
-      conf.get(EXECUTOR_HEARTBEAT_INTERVAL))
+      conf.get(DRIVER_METRICS_POLLING_INTERVAL))
     _heartbeater.start()

     // start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's
```
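For context, here is a minimal sketch of the kind of periodic poller the `Heartbeater` call site above implies (a report function, a thread name, and an interval in milliseconds). This is an illustration only, not Spark's actual `Heartbeater` implementation:

```scala
import java.util.concurrent.{Executors, TimeUnit}

// Illustrative stand-in: runs `report` every `intervalMs` milliseconds on a
// dedicated daemon thread, like the driver heartbeater loop above.
class PollingHeartbeater(report: () => Unit, name: String, intervalMs: Long) {
  private val scheduler = Executors.newSingleThreadScheduledExecutor((r: Runnable) => {
    val t = new Thread(r, name)
    t.setDaemon(true)
    t
  })

  def start(): Unit =
    scheduler.scheduleAtFixedRate(() => report(), intervalMs, intervalMs, TimeUnit.MILLISECONDS)

  def stop(): Unit = scheduler.shutdown()
}
```

With this commit, the interval handed to the driver-side heartbeater comes from `DRIVER_METRICS_POLLING_INTERVAL` rather than `EXECUTOR_HEARTBEAT_INTERVAL`, so the two rates can diverge.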

core/src/main/scala/org/apache/spark/internal/config/package.scala

Lines changed: 8 additions & 0 deletions
```diff
@@ -1201,6 +1201,14 @@ package object config {
       .checkValue(v => v >= 0, "The value should be a non-negative time value.")
       .createWithDefaultString("0min")

+  private[spark] val DRIVER_METRICS_POLLING_INTERVAL =
+    ConfigBuilder("spark.driver.metrics.pollingInterval")
+      .doc("How often to collect driver metrics (in milliseconds). " +
+        "If unset, the polling is done at the executor heartbeat interval. " +
+        "If set, the polling is done at this interval.")
+      .version("4.1.0")
+      .fallbackConf(EXECUTOR_HEARTBEAT_INTERVAL)
+
   private[spark] val DRIVER_BIND_ADDRESS = ConfigBuilder("spark.driver.bindAddress")
     .doc("Address where to bind network listen sockets on the driver.")
     .version("2.1.0")
```
