Commit c851ecb

testsuite: workaround job start signal race
Problem: In several tests in t2800-jobs-cmd.t, a job is sent a signal immediately after the job starts. There is a rare race that can occur when signaling a job very quickly after it has started. This race can lead to unexpected results.

Solution: Instead of calling "sleep inf", run a script that will echo some output then call "sleep inf". Ensure that data has been output from the job before signaling it. This will ensure that the job has started and the process is fully running before being sent a signal.

Fixes #5210
1 parent e0f2211 commit c851ecb
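
The approach described in the commit message above can be illustrated outside of Flux. The following is a minimal sketch in plain shell, not part of the commit: it starts a background process, waits until that process has written its first line of output, and only then signals it. The names `job.sh` and `job.out` are hypothetical, and `sleep 1000` stands in for `sleep inf` so the sketch does not depend on GNU `sleep`.

```shell
#!/bin/sh
# Illustrative sketch only: plain shell, no Flux commands involved.
# job.sh and job.out are hypothetical names used for this example.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Helper script: announce startup, then block for a long time.
cat >job.sh <<'EOT'
#!/bin/sh
echo "job started"
exec sleep 1000
EOT
chmod +x job.sh

# Start the job in the background and capture its output.
./job.sh >job.out 2>&1 &
pid=$!

# Wait until the job has produced output, so we know it is running
# before it is signaled. Give up after roughly 10 seconds.
tries=0
while ! grep -q "job started" job.out 2>/dev/null; do
    tries=$((tries + 1))
    if [ "$tries" -ge 100 ]; then
        echo "timed out waiting for job output"
        kill "$pid"
        exit 1
    fi
    sleep 0.1
done

# The job is known to be running; now it is safe to signal it.
kill -TERM "$pid"
wait "$pid"
status=$?
echo "exit status: $status"
```

A process terminated by SIGTERM is reported by `wait` as 128 + 15, so the final line prints `exit status: 143`. In the actual test the same idea is expressed with `flux job wait-event -p guest.output $jobid data` rather than polling a file.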

1 file changed: +23 -4 lines changed

t/t2800-jobs-cmd.t

Lines changed: 23 additions & 4 deletions
@@ -47,6 +47,15 @@ test_expect_success 'configure testing queues' '
 	flux queue start --all
 '
 
+test_expect_success 'create helper job submission script' '
+	cat >sleepinf.sh <<-EOT &&
+	#!/bin/sh
+	echo "job started"
+	sleep inf
+	EOT
+	chmod +x sleepinf.sh
+'
+
 test_expect_success 'submit jobs for job list testing' '
 	# Create `hostname` and `sleep` jobspec
 	# N.B. Used w/ `flux job submit` for serial job submission
@@ -82,7 +91,12 @@ test_expect_success 'submit jobs for job list testing' '
 	# Run a job that we will end with a signal, copy its JOBID to both inactive and
 	# failed and terminated lists.
 	#
-	jobid=`flux submit --wait-event=start sleep inf` &&
+	# N.B. sleepinf.sh and wait-event on job data to workaround
+	# rare job startup race. See #5210
+	#
+	jobid=`flux submit ./sleepinf.sh` &&
+	flux job wait-event -p guest.exec.eventlog $jobid shell.init &&
+	flux job wait-event -p guest.output $jobid data &&
 	flux job kill $jobid &&
 	fj_wait_event $jobid clean &&
 	echo $jobid >> inactiveids &&
@@ -92,7 +106,12 @@ test_expect_success 'submit jobs for job list testing' '
 	# Run a job that we will end with a user exception, copy its JOBID to both
 	# inactive and failed and exception lists.
 	#
-	jobid=`flux submit --wait-event=start sleep inf` &&
+	# N.B. sleepinf.sh and wait-event on job data to workaround
+	# rare job startup race. See #5210
+	#
+	jobid=`flux submit ./sleepinf.sh` &&
+	flux job wait-event -p guest.exec.eventlog $jobid shell.init &&
+	flux job wait-event -p guest.output $jobid data &&
 	flux job raise --type=myexception --severity=0 -m "myexception" $jobid &&
 	fj_wait_event $jobid clean &&
 	echo $jobid >> inactiveids &&
@@ -507,8 +526,8 @@ test_expect_success 'flux-jobs --format={name} works' '
 	flux jobs --filter=inactive -no "{name}" > jobnameI.out &&
 	echo "canceledjob" >> jobnameI.exp &&
 	echo "sleep" >> jobnameI.exp &&
-	echo "sleep" >> jobnameI.exp &&
-	echo "sleep" >> jobnameI.exp &&
+	echo "sleepinf.sh" >> jobnameI.exp &&
+	echo "sleepinf.sh" >> jobnameI.exp &&
 	echo "nosuchcommand" >> jobnameI.exp &&
 	count=$(($(job_list_state_count inactive) - 5)) &&
 	for i in `seq 1 $count`; do
