
Commit 6be3cce

zsxwing authored and gatorsmile committed
[SPARK-25899][TESTS] Fix flaky CoarseGrainedSchedulerBackendSuite
## What changes were proposed in this pull request?

I saw CoarseGrainedSchedulerBackendSuite fail in my PR and finally reproduced the following error on a very busy machine:

```
sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to eventually never returned normally. Attempted 400 times over 10.009828643999999 seconds. Last failure message: ArrayBuffer("2", "0", "3") had length 3 instead of expected length 4.
```

The logs in this test show that executor 1 was not up when the test failed:

```
18/10/30 11:34:03.563 dispatcher-event-loop-12 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (172.17.0.2:43656) with ID 2
18/10/30 11:34:03.593 dispatcher-event-loop-3 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (172.17.0.2:43658) with ID 3
18/10/30 11:34:03.629 dispatcher-event-loop-6 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (172.17.0.2:43654) with ID 0
18/10/30 11:34:03.885 pool-1-thread-1-ScalaTest-running-CoarseGrainedSchedulerBackendSuite INFO CoarseGrainedSchedulerBackendSuite: ===== FINISHED o.a.s.scheduler.CoarseGrainedSchedulerBackendSuite: 'compute max number of concurrent tasks can be launched' =====
```

And the following logs in executor 1 show it was still doing its initialization when the timeout happened (at 18/10/30 11:34:03.885):

```
18/10/30 11:34:03.463 netty-rpc-connection-0 INFO TransportClientFactory: Successfully created connection to 54b6b6217301/172.17.0.2:33741 after 37 ms (0 ms spent in bootstraps)
18/10/30 11:34:03.959 main INFO DiskBlockManager: Created local directory at /home/jenkins/workspace/core/target/tmp/spark-383518bc-53bd-4d9c-885b-d881f03875bf/executor-61c406e4-178f-40a6-ac2c-7314ee6fb142/blockmgr-03fb84a1-eedc-4055-8743-682eb3ac5c67
18/10/30 11:34:03.993 main INFO MemoryStore: MemoryStore started with capacity 546.3 MB
```

Hence, the current 10-second timeout is not enough on a slow Jenkins machine. This PR increases the timeout from 10 seconds to 60 seconds to make the test more stable.

## How was this patch tested?

Jenkins

Closes apache#22910 from zsxwing/fix-flaky-test.

Authored-by: Shixiong Zhu <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
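For context, the `eventually` helper the patch adjusts comes from ScalaTest's `Eventually` trait: it re-runs its block until the assertion passes or the timeout elapses, then fails with `TestFailedDueToTimeoutException`. A minimal standalone sketch of the pattern (simplified, not the suite itself; the real test asserts on `sc.getExecutorIds()`):

```scala
import org.scalatest.concurrent.Eventually
import org.scalatest.funsuite.AnyFunSuite
import org.scalatest.time.SpanSugar._

class ExampleSuite extends AnyFunSuite with Eventually {
  // Shared timeout constant, in the spirit of the patch's executorUpTimeout.
  private val executorUpTimeout = 60.seconds

  test("slowly registered items eventually appear") {
    val registered = new java.util.concurrent.CopyOnWriteArrayList[String]()
    // Stand-in for executors registering asynchronously.
    new Thread(() => { Thread.sleep(100); (0 until 4).foreach(i => registered.add(i.toString)) }).start()

    // Retries the block until it passes, or times out after 60 seconds.
    eventually(timeout(executorUpTimeout)) {
      assert(registered.size == 4)
    }
  }
}
```

The point of the fix is that the timeout bounds only the worst case: on a fast machine the assertion passes on an early retry, so raising 10s to 60s costs nothing when the test is healthy.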
1 parent bc9f9b4 commit 6be3cce

File tree

1 file changed: +5 −3 lines changed


core/src/test/scala/org/apache/spark/scheduler/CoarseGrainedSchedulerBackendSuite.scala

Lines changed: 5 additions & 3 deletions

```diff
@@ -30,6 +30,8 @@ import org.apache.spark.util.{RpcUtils, SerializableBuffer}
 class CoarseGrainedSchedulerBackendSuite extends SparkFunSuite with LocalSparkContext
     with Eventually {
 
+  private val executorUpTimeout = 60.seconds
+
   test("serialized task larger than max RPC message size") {
     val conf = new SparkConf
     conf.set("spark.rpc.message.maxSize", "1")
@@ -51,7 +53,7 @@ class CoarseGrainedSchedulerBackendSuite extends SparkFunSuite with LocalSparkCo
       .setMaster("local-cluster[4, 3, 1024]")
       .setAppName("test")
     sc = new SparkContext(conf)
-    eventually(timeout(10.seconds)) {
+    eventually(timeout(executorUpTimeout)) {
       // Ensure all executors have been launched.
       assert(sc.getExecutorIds().length == 4)
     }
@@ -64,7 +66,7 @@ class CoarseGrainedSchedulerBackendSuite extends SparkFunSuite with LocalSparkCo
       .setMaster("local-cluster[4, 3, 1024]")
       .setAppName("test")
     sc = new SparkContext(conf)
-    eventually(timeout(10.seconds)) {
+    eventually(timeout(executorUpTimeout)) {
       // Ensure all executors have been launched.
       assert(sc.getExecutorIds().length == 4)
     }
@@ -96,7 +98,7 @@ class CoarseGrainedSchedulerBackendSuite extends SparkFunSuite with LocalSparkCo
 
     try {
       sc.addSparkListener(listener)
-      eventually(timeout(10.seconds)) {
+      eventually(timeout(executorUpTimeout)) {
         // Ensure all executors have been launched.
         assert(sc.getExecutorIds().length == 4)
       }
```
