
Commit 776ffd5

[SPARK-53735][SDP] Hide server-side JVM stack traces by default in spark-pipelines output
### What changes were proposed in this pull request?

Hide server-side JVM stack traces by default in spark-pipelines output.

### Why are the changes needed?

Error output for failing pipeline runs can be very verbose and show a bunch of info that is not relevant to the user.

### Does this PR introduce _any_ user-facing change?

Changes unreleased feature.

### How was this patch tested?

- Ran `spark-pipelines run` and verified the output.
- Observed that explicitly setting the `spark.sql.connect.serverStacktrace.enabled` config brings the server-side stack traces back.

Before:

```
2025-09-26 15:29:54: Failed to resolve flow: 'spark_catalog.default.rental_bike_trips'.
Error: [TABLE_OR_VIEW_NOT_FOUND] The table or view `spark_catalog`.`default`.`rental_bike_trips_raws` cannot be found. Verify the spelling and correctness of the schema and catalog. If you did not qualify the name with a schema, verify the current_schema() output, or qualify the name with the correct schema and catalog. To tolerate the error on drop use DROP VIEW IF EXISTS or DROP TABLE IF EXISTS. SQLSTATE: 42P01;
'UnresolvedRelation [spark_catalog, default, rental_bike_trips_raws], [], true
Traceback (most recent call last):
  File "/Users/sandy.ryza/oss/python/pyspark/pipelines/cli.py", line 358, in <module>
    run(
  File "/Users/sandy.ryza/oss/python/pyspark/pipelines/cli.py", line 285, in run
    handle_pipeline_events(result_iter)
  File "/Users/sandy.ryza/oss/python/pyspark/pipelines/spark_connect_pipeline.py", line 53, in handle_pipeline_events
    for result in iter:
  File "/Users/sandy.ryza/oss/python/pyspark/sql/connect/client/core.py", line 1169, in execute_command_as_iterator
    for response in self._execute_and_fetch_as_iterator(req, observations or {}):
  File "/Users/sandy.ryza/oss/python/pyspark/sql/connect/client/core.py", line 1559, in _execute_and_fetch_as_iterator
    self._handle_error(error)
  File "/Users/sandy.ryza/oss/python/pyspark/sql/connect/client/core.py", line 1833, in _handle_error
    self._handle_rpc_error(error)
  File "/Users/sandy.ryza/oss/python/pyspark/sql/connect/client/core.py", line 1904, in _handle_rpc_error
    raise convert_exception(
pyspark.errors.exceptions.connect.AnalysisException: Failed to resolve flows in the pipeline.
A flow can fail to resolve because the flow itself contains errors or because it reads from an upstream flow which failed to resolve.
Flows with errors: spark_catalog.default.rental_bike_trips
Flows that failed due to upstream errors:
To view the exceptions that were raised while resolving these flows, look for flow failures that precede this log.

JVM stacktrace:
org.apache.spark.sql.pipelines.graph.UnresolvedPipelineException
    at org.apache.spark.sql.pipelines.graph.GraphValidations.validateSuccessfulFlowAnalysis(GraphValidations.scala:284)
    at org.apache.spark.sql.pipelines.graph.GraphValidations.validateSuccessfulFlowAnalysis$(GraphValidations.scala:247)
    at org.apache.spark.sql.pipelines.graph.DataflowGraph.validateSuccessfulFlowAnalysis(DataflowGraph.scala:33)
    at org.apache.spark.sql.pipelines.graph.DataflowGraph.$anonfun$validationFailure$1(DataflowGraph.scala:186)
    at scala.util.Try$.apply(Try.scala:217)
    at org.apache.spark.sql.pipelines.graph.DataflowGraph.validationFailure$lzycompute(DataflowGraph.scala:185)
    at org.apache.spark.sql.pipelines.graph.DataflowGraph.validationFailure(DataflowGraph.scala:185)
    at org.apache.spark.sql.pipelines.graph.DataflowGraph.validate(DataflowGraph.scala:173)
    at org.apache.spark.sql.pipelines.graph.PipelineExecution.resolveGraph(PipelineExecution.scala:109)
    at org.apache.spark.sql.pipelines.graph.PipelineExecution.startPipeline(PipelineExecution.scala:48)
    at org.apache.spark.sql.pipelines.graph.PipelineExecution.runPipeline(PipelineExecution.scala:63)
    at org.apache.spark.sql.connect.pipelines.PipelinesHandler$.startRun(PipelinesHandler.scala:294)
    at org.apache.spark.sql.connect.pipelines.PipelinesHandler$.handlePipelinesCommand(PipelinesHandler.scala:93)
    at org.apache.spark.sql.connect.planner.SparkConnectPlanner.handlePipelineCommand(SparkConnectPlanner.scala:2727)
    at org.apache.spark.sql.connect.planner.SparkConnectPlanner.process(SparkConnectPlanner.scala:2697)
    at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.handleCommand(ExecuteThreadRunner.scala:322)
    at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.$anonfun$executeInternal$1(ExecuteThreadRunner.scala:224)
    at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.$anonfun$executeInternal$1$adapted(ExecuteThreadRunner.scala:196)
    at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$withSession$2(SessionHolder.scala:349)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:804)
    at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$withSession$1(SessionHolder.scala:349)
    at org.apache.spark.JobArtifactSet$.withActiveJobArtifactState(JobArtifactSet.scala:94)
    at org.apache.spark.sql.artifact.ArtifactManager.$anonfun$withResources$1(ArtifactManager.scala:112)
    at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:187)
    at org.apache.spark.sql.artifact.ArtifactManager.withClassLoaderIfNeeded(ArtifactManager.scala:102)
    at org.apache.spark.sql.artifact.ArtifactManager.withResources(ArtifactManager.scala:111)
    at org.apache.spark.sql.connect.service.SessionHolder.withSession(SessionHolder.scala:348)
    at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.executeInternal(ExecuteThreadRunner.scala:196)
    at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.org$apache$spark$sql$connect$execution$ExecuteThreadRunner$$execute(ExecuteThreadRunner.scala:125)
    at org.apache.spark.sql.connect.execution.ExecuteThreadRunner$ExecutionThread.run(ExecuteThreadRunner.scala:347)
25/09/26 08:29:54 INFO ShutdownHookManager: Shutdown hook called
```

After:

```
2025-09-26 15:27:33: Failed to resolve flow: 'spark_catalog.default.rental_bike_trips'.
Error: [TABLE_OR_VIEW_NOT_FOUND] The table or view `spark_catalog`.`default`.`rental_bike_trips_raws` cannot be found. Verify the spelling and correctness of the schema and catalog. If you did not qualify the name with a schema, verify the current_schema() output, or qualify the name with the correct schema and catalog. To tolerate the error on drop use DROP VIEW IF EXISTS or DROP TABLE IF EXISTS. SQLSTATE: 42P01;
'UnresolvedRelation [spark_catalog, default, rental_bike_trips_raws], [], true
Traceback (most recent call last):
  File "/Users/sandy.ryza/oss/python/pyspark/pipelines/cli.py", line 360, in <module>
    run(
  File "/Users/sandy.ryza/oss/python/pyspark/pipelines/cli.py", line 287, in run
    handle_pipeline_events(result_iter)
  File "/Users/sandy.ryza/oss/python/pyspark/pipelines/spark_connect_pipeline.py", line 53, in handle_pipeline_events
    for result in iter:
  File "/Users/sandy.ryza/oss/python/pyspark/sql/connect/client/core.py", line 1169, in execute_command_as_iterator
    for response in self._execute_and_fetch_as_iterator(req, observations or {}):
  File "/Users/sandy.ryza/oss/python/pyspark/sql/connect/client/core.py", line 1559, in _execute_and_fetch_as_iterator
    self._handle_error(error)
  File "/Users/sandy.ryza/oss/python/pyspark/sql/connect/client/core.py", line 1833, in _handle_error
    self._handle_rpc_error(error)
  File "/Users/sandy.ryza/oss/python/pyspark/sql/connect/client/core.py", line 1904, in _handle_rpc_error
    raise convert_exception(
pyspark.errors.exceptions.connect.AnalysisException: Failed to resolve flows in the pipeline.
A flow can fail to resolve because the flow itself contains errors or because it reads from an upstream flow which failed to resolve.
Flows with errors: spark_catalog.default.rental_bike_trips
Flows that failed due to upstream errors:
To view the exceptions that were raised while resolving these flows, look for flow failures that precede this log.
25/09/26 08:27:34 INFO ShutdownHookManager: Shutdown hook called
25/09/26 08:27:34 INFO ShutdownHookManager: Deleting directory /private/var/folders/1v/dqhbgmt10vl6v3tdlwvvx90r0000gp/T/localPyFiles-039afc43-9f5c-4a6f-ac7b-2437496ac7de
25/09/26 08:27:34 INFO ShutdownHookManager: Deleting directory /private/var/folders/1v/dqhbgmt10vl6v3tdlwvvx90r0000gp/T/spark-c67d94d5-4110-4268-af67-430b3ae82133
```

### Was this patch authored or co-authored using generative AI tooling?

Closes #52470 from sryza/hide-jvm-stack-trace.

Lead-authored-by: Sandy Ryza <[email protected]>
Co-authored-by: Sandy Ryza <[email protected]>
Signed-off-by: Sandy Ryza <[email protected]>
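For reference, a minimal sketch of how a user could bring the server-side stack traces back explicitly, as described under "How was this patch tested". The config key is the one named in this commit; the Spark Connect endpoint URL and the direct session construction (rather than going through `spark-pipelines run`) are illustrative assumptions, not part of this change.

```python
# Hedged sketch: explicitly re-enable server-side JVM stack traces on a
# Spark Connect session. The config key comes from this commit; the remote
# endpoint below is a placeholder.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.remote("sc://localhost:15002")  # placeholder endpoint
    .config("spark.sql.connect.serverStacktrace.enabled", "true")
    .getOrCreate()
)
```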
Parent: 922adad

File tree

1 file changed: +3 additions, −1 deletion

  • python/pyspark/pipelines/cli.py

python/pyspark/pipelines/cli.py

Lines changed: 3 additions & 1 deletion

```
@@ -295,7 +295,9 @@ def run(
     spec = load_pipeline_spec(spec_path)

     log_with_curr_timestamp("Creating Spark session...")
-    spark_builder = SparkSession.builder
+    spark_builder = SparkSession.builder.config(
+        "spark.sql.connect.serverStacktrace.enabled", "false"
+    )
     for key, value in spec.configuration.items():
         spark_builder = spark_builder.config(key, value)
```
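As a reading aid, here is a minimal sketch (not the actual `cli.py` code) of how the new default interacts with the spec-provided configuration in `run()`: the spec's key/value pairs are applied after the default, so a pipeline spec that sets the same key overrides it. The `spec_configuration` dict below is a hypothetical stand-in for `spec.configuration`.

```python
from pyspark.sql import SparkSession

# Hypothetical stand-in for spec.configuration loaded from a pipeline spec.
spec_configuration = {
    # Setting the key in the spec overrides the new default applied below.
    "spark.sql.connect.serverStacktrace.enabled": "true",
}

# New default introduced by this commit: hide server-side stack traces.
spark_builder = SparkSession.builder.config(
    "spark.sql.connect.serverStacktrace.enabled", "false"
)

# Spec-provided configs are applied afterwards, so they take precedence.
for key, value in spec_configuration.items():
    spark_builder = spark_builder.config(key, value)
```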