
Commit ba7849e

[SPARK-51130][YARN][TESTS] Run the test cases related to connect in the YarnClusterSuite on Github Actions only
### What changes were proposed in this pull request?

The main change in this PR is the addition of two `assume` conditions to ensure that the test cases related to `connect` in `YarnClusterSuite` are only executed on GitHub Actions.
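For context, the added guard relies on ScalaTest's `assume`, which throws `TestCanceledException` when its condition is false, so a guarded test is reported as canceled rather than failed. A minimal, self-contained sketch of the same pattern (the suite name below is illustrative and not part of Spark):

```scala
import org.scalatest.funsuite.AnyFunSuite

// Minimal sketch of the guard pattern used in this commit (illustrative suite name).
class EnvGuardedSuite extends AnyFunSuite {

  test("connect-style test that only runs on GitHub Actions") {
    // When GITHUB_ACTIONS is absent, `assume` cancels the test instead of failing it,
    // so the run still finishes with this test counted as canceled.
    assume(sys.env.contains("GITHUB_ACTIONS"))

    // The real work would go here (e.g. submitting the Python Connect app to YARN).
    assert(1 + 1 == 2)
  }
}
```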
### Why are the changes needed?

Running these two test cases successfully locally is overly complicated.

Firstly, it is necessary to install the required Python packages:
https://github.com/apache/spark/blob/f5f7c365d519c4f9d4b7a5dce2c8a047cf051899/.github/workflows/build_and_test.yml#L363
Otherwise, local test execution will fail due to missing Python modules.

Secondly, before running tests locally, a packaging operation must be performed to ensure that all dependencies are collected in the `assembly/target/scala-2.13/jars` directory, for example by executing `build/sbt package -Pyarn`. Failing to do so will result in the following error during local test execution:

```
Traceback (most recent call last):
  File "/Users/yangjie01/SourceCode/git/spark-mine-sbt/resource-managers/yarn/target/test/data/org.apache.spark.deploy.yarn.YarnClusterSuite/yarn-264663/org.apache.spark.deploy.yarn.YarnClusterSuite-localDir-nm-0_0/usercache/yangjie01/appcache/application_1738914482522_0019/container_1738914482522_0019_01_000001/test.py", line 13, in <module>
    "spark.api.mode", "connect").master("yarn").getOrCreate()
    ^^^^^^^^^^^^^
  File "/Users/yangjie01/SourceCode/git/spark-mine-sbt/python/pyspark/sql/session.py", line 511, in getOrCreate
    RemoteSparkSession._start_connect_server(url, opts)
  File "/Users/yangjie01/SourceCode/git/spark-mine-sbt/python/pyspark/sql/connect/session.py", line 1073, in _start_connect_server
    PySparkSession(SparkContext.getOrCreate(conf))
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/yangjie01/SourceCode/git/spark-mine-sbt/python/pyspark/core/context.py", line 523, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "/Users/yangjie01/SourceCode/git/spark-mine-sbt/python/pyspark/core/context.py", line 207, in __init__
    self._do_init(
  File "/Users/yangjie01/SourceCode/git/spark-mine-sbt/python/pyspark/core/context.py", line 300, in _do_init
    self._jsc = jsc or self._initialize_context(self._conf._jconf)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/yangjie01/SourceCode/git/spark-mine-sbt/python/pyspark/core/context.py", line 429, in _initialize_context
    return self._jvm.JavaSparkContext(jconf)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/yangjie01/SourceCode/git/spark-mine-sbt/python/lib/py4j-0.10.9.9-src.zip/py4j/java_gateway.py", line 1627, in __call__
  File "/Users/yangjie01/SourceCode/git/spark-mine-sbt/python/lib/py4j-0.10.9.9-src.zip/py4j/protocol.py", line 327, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.ClassNotFoundException: org.apache.spark.sql.connect.SparkConnectPlugin
	at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:445)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:592)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)
	at java.base/java.lang.Class.forName0(Native Method)
	at java.base/java.lang.Class.forName(Class.java:467)
	at org.apache.spark.util.SparkClassUtils.classForName(SparkClassUtils.scala:41)
	at org.apache.spark.util.SparkClassUtils.classForName$(SparkClassUtils.scala:36)
	at org.apache.spark.util.Utils$.classForName(Utils.scala:99)
	at org.apache.spark.util.Utils$.$anonfun$loadExtensions$1(Utils.scala:2828)
	at scala.collection.StrictOptimizedIterableOps.flatMap(StrictOptimizedIterableOps.scala:118)
	at scala.collection.StrictOptimizedIterableOps.flatMap$(StrictOptimizedIterableOps.scala:105)
	at scala.collection.immutable.ArraySeq.flatMap(ArraySeq.scala:35)
	at org.apache.spark.util.Utils$.loadExtensions(Utils.scala:2826)
	at org.apache.spark.internal.plugin.PluginContainer$.apply(PluginContainer.scala:210)
	at org.apache.spark.internal.plugin.PluginContainer$.apply(PluginContainer.scala:196)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:588)
	at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:59)
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:500)
	at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:481)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
	at py4j.Gateway.invoke(Gateway.java:238)
	at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
	at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:184)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:108)
	at java.base/java.lang.Thread.run(Thread.java:840)
```

Lastly, when running tests locally, the `clean` command should not be added. For instance, executing the following command

```
build/sbt "yarn/testOnly org.apache.spark.deploy.yarn.YarnClusterSuite" -Pyarn
```

will result in successful tests. However, if the `clean` command is included, as in

```
build/sbt clean "yarn/testOnly org.apache.spark.deploy.yarn.YarnClusterSuite" -Pyarn
```

the same test failure will occur.

Additionally, adding `assume` conditions that check the local environment would also be relatively complex (see the sketch below):

1. It is necessary to check that at least five essential Python modules are installed: pandas, pyarrow, grpc, grpcio, googleapis_common_protos.
2. It must be confirmed that the contents of `assembly/target/scala-2.13/jars` are fresh and usable.

Given these circumstances, this PR proposes that the test cases related to `connect` in `YarnClusterSuite` should only be executed in the GitHub pipeline.
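For reference, a purely hypothetical sketch of the kind of local-environment probing such `assume` conditions would require; the object and helper names below do not exist in the suite, and the checks are deliberately simplified:

```scala
import java.nio.file.{Files, Paths}
import scala.sys.process._

// Hypothetical helper (not part of YarnClusterSuite): what an environment-aware
// `assume` guard would have to verify before running the connect tests locally.
object LocalConnectEnvCheck {

  // Module names taken from the PR description; the importable module name can
  // differ from the pip package name, so a real check would need to map them.
  private val requiredModules =
    Seq("pandas", "pyarrow", "grpc", "grpcio", "googleapis_common_protos")

  // A module counts as available if `python3 -c "import <module>"` exits with 0.
  def pythonModulesAvailable(): Boolean =
    requiredModules.forall(m => Seq("python3", "-c", s"import $m").! == 0)

  // A crude freshness check: the assembly jars directory exists and is non-empty.
  // Confirming that the jars are actually up to date would be harder still.
  def assemblyJarsPresent(sparkHome: String): Boolean = {
    val jarsDir = Paths.get(sparkHome, "assembly", "target", "scala-2.13", "jars")
    Files.isDirectory(jarsDir) && Files.list(jarsDir).findAny().isPresent
  }
}
```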
### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- Pass GitHub Actions: https://github.com/LuciferYang/spark/actions/runs/13196611264/job/36839274825
  ![image](https://github.com/user-attachments/assets/6159b7b5-ab67-4698-a26c-9b4adfd10665)
- Locally check:

```
build/sbt clean "yarn/testOnly org.apache.spark.deploy.yarn.YarnClusterSuite" -Pyarn
```

we can see:

```
[info] YarnClusterSuite:
...
[info] - run Python application with Spark Connect in yarn-client mode !!! CANCELED !!! (9 milliseconds)
[info]   Map("JIRA_PASSWORD" -> "JackBaidu2020", "RUBYOPT" -> "", "HOME" -> "/Users/yangjie01", "JAVA_MAIN_CLASS_33082" -> "xsbt.boot.Boot", "HOMEBREW_BOTTLE_DOMAIN" -> ... did not contain key "GITHUB_ACTIONS" (YarnClusterSuite.scala:269)
...
[info] - run Python application with Spark Connect in yarn-cluster mode !!! CANCELED !!! (1 millisecond)
[info]   Map("JIRA_PASSWORD" -> "JackBaidu2020", "RUBYOPT" -> "", "HOME" -> "/Users/yangjie01", "JAVA_MAIN_CLASS_33082" -> "xsbt.boot.Boot", "HOMEBREW_BOTTLE_DOMAIN" -> ... did not contain key "GITHUB_ACTIONS" (YarnClusterSuite.scala:275)
...
[info] Run completed in 4 minutes, 33 seconds.
[info] Total number of tests run: 28
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 28, failed 0, canceled 2, ignored 0, pending 0
[info] All tests passed.
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #49848 from LuciferYang/SPARK-51130.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
1 parent af92420 commit ba7849e

1 file changed: +2 -0 lines changed

resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnClusterSuite.scala

Lines changed: 2 additions & 0 deletions
```diff
@@ -266,11 +266,13 @@ class YarnClusterSuite extends BaseYarnClusterSuite {
   }
 
   test("run Python application with Spark Connect in yarn-client mode") {
+    assume(sys.env.contains("GITHUB_ACTIONS"))
     testPySpark(
       true, extraConf = Map(SPARK_API_MODE.key -> "connect"), script = TEST_CONNECT_PYFILE)
   }
 
   test("run Python application with Spark Connect in yarn-cluster mode") {
+    assume(sys.env.contains("GITHUB_ACTIONS"))
     testPySpark(
       false, extraConf = Map(SPARK_API_MODE.key -> "connect"), script = TEST_CONNECT_PYFILE)
   }
```
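Note that the added guard only checks for the presence of the `GITHUB_ACTIONS` key, not its value, so these two tests are reported as canceled in any environment where that key is absent, matching the local run shown above. Conversely, a developer who has prepared the local environment as described in this PR could still opt in by exporting that variable before invoking sbt.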
