hack/ginkgo-e2e.sh: forward TERM/INT to Ginkgo

pohly · pohly · commit ce9e398641bf · 2025-01-17T16:28:41.000+01:00
Currently, when a job such as pull-kubernetes-e2e-kind times out,
ginkgo-e2e.sh gets killed with SIGTERM. The signal is not propagated to the
E2E test suite processes. Depending on timing during process shutdown, this
can mean that there is no "Interrupted by User" report and no JUnit file.

Running the Ginkgo CLI with job control enabled creates a new process group,
which then can be used to kill the Ginkgo CLI and the E2E test suite
processes. With these changes, more information is produced. Some of it seems
a bit redundant, but it's better than none:

*** hack/ginkgo-e2e.sh: received termination signal -> asking Ginkgo to stop.
***
*** Beware that a timeout may have been caused by some earlier test,
*** not necessarily the one which gets interrupted now.
*** See the "Spec runtime" for information about how long the
*** interrupted test was running.

  ------------------------------
  Interrupted by User
  First interrupt received; Ginkgo will run any cleanup and reporting nodes but will skip all remaining specs.  Interrupt again to skip cleanup.
  Here's a current progress report:
    [sig-node] DRA [Feature:DynamicResourceAllocation] [FeatureGate:DynamicResourceAllocation] [Beta] ResourceSlice Controller creates slices (Spec Runtime: 9.065s)
      k8s.io/kubernetes/test/e2e/dra/dra.go:812
      In [It] (Node Runtime: 9.044s)
        k8s.io/kubernetes/test/e2e/dra/dra.go:812
        At [By Step] Creating slices (Step Runtime: 8.884s)
          k8s.io/kubernetes/test/e2e/dra/dra.go:847
...
        Begin Additional Progress Reports >>
          There is no failure as the matcher passed to Consistently has not yet failed
        << End Additional Progress Reports
  ------------------------------
• [INTERRUPTED] [11.955 seconds]
[sig-node] DRA [Feature:DynamicResourceAllocation] [FeatureGate:DynamicResourceAllocation] [Beta] ResourceSlice Controller [It] creates slices [sig-node, Feature:DynamicResourceAllocation, FeatureGate:DynamicResourceAllocation, Feature:Beta]
k8s.io/kubernetes/test/e2e/dra/dra.go:812

  Timeline >>
  STEP: Creating a kubernetes client @ 01/09/25 17:18:59.769
...
  [FAILED] in [It] - k8s.io/kubernetes/test/e2e/dra/dra.go:881 @ 01/09/25 17:19:08.835
  I0109 17:19:11.703212 302727 helper.go:125] Waiting up to 7m0s for all (but 0) nodes to be ready
  STEP: dump namespace information after failure @ 01/09/25 17:19:11.706
  STEP: Collecting events from namespace "dra-7998". @ 01/09/25 17:19:11.706
  STEP: Found 0 events. @ 01/09/25 17:19:11.708
...
  STEP: Destroying namespace "dra-7998" for this suite. @ 01/09/25 17:19:11.72
  << Timeline

  [INTERRUPTED] Interrupted by User
  In [It] at: k8s.io/kubernetes/test/e2e/dra/dra.go:812 @ 01/09/25 17:19:08.833

  This is the Progress Report generated when the interrupt was received:
    [sig-node] DRA [Feature:DynamicResourceAllocation] [FeatureGate:DynamicResourceAllocation] [Beta] ResourceSlice Controller creates slices (Spec Runtime: 9.065s)
...

  [FAILED] An interrupt occurred and then the following failure was recorded in the interrupted node before it exited:
  Context was cancelled (cause: Interrupted by User) after 0.329s.
  There is no failure as the matcher passed to Consistently has not yet failed
  In [It] at: k8s.io/kubernetes/test/e2e/dra/dra.go:881 @ 01/09/25 17:19:08.835
------------------------------
Checking for custom logdump instances, if any
----------------------------------------------------------------------------------------------------
k/k version of the log-dump.sh script is deprecated!
Please migrate your test job to use test-infra's repo version of log-dump.sh!
Migration steps can be found in the readme file.
----------------------------------------------------------------------------------------------------
Sourcing kube-util.sh
Detecting project
Skeleton Provider: detect-project not implemented
Dumping logs from master locally to '/tmp/test'
Master SSH not supported for local
Dumping logs from nodes locally to '/tmp/test'
Node SSH not supported for local

Summarizing 1 Failure:
  [INTERRUPTED] [sig-node] DRA [Feature:DynamicResourceAllocation] [FeatureGate:DynamicResourceAllocation] [Beta] ResourceSlice Controller [It] creates slices [sig-node, Feature:DynamicResourceAllocation, FeatureGate:DynamicResourceAllocation, Feature:Beta]
  k8s.io/kubernetes/test/e2e/dra/dra.go:812

Ran 1 of 6644 Specs in 12.208 seconds
FAIL! - Interrupted by User -- 0 Passed | 1 Failed | 0 Pending | 6643 Skipped
--- FAIL: TestE2E (12.74s)
FAIL

Ginkgo ran 1 suite in 13.379078611s
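
The process group mechanism can be tried in isolation. The following is a
minimal sketch, not part of the patch: a hypothetical "leader" script stands
in for the Ginkgo CLI and forks one worker that stands in for e2e.test. The
names leader_script, marker and worker_result are made up for illustration.
It assumes bash and a sleep that accepts fractional seconds.

```shell
#!/usr/bin/env bash
# Job control: each background job is started in its own process group.
set -m

marker=$(mktemp)

# Hypothetical stand-in for the Ginkgo CLI: a leader that forks one worker.
# The worker writes "ready" once its trap is installed and later records
# in the marker file that SIGTERM reached it.
leader_script='
  marker=$1
  (
    trap "echo worker-got-TERM >\"${marker}\"; exit 0" TERM
    echo ready >"${marker}"
    sleep 300 &
    wait
  ) &
  wait
'
bash -c "${leader_script}" leader "${marker}" &
leader_pid=$!

# Wait until the worker has installed its trap.
until [ -s "${marker}" ]; do sleep 0.1; done

# A negative PID addresses the whole process group, not just the leader.
kill -TERM -- "-${leader_pid}"
wait "${leader_pid}"
status=$?

# Give the worker a moment to finish writing its marker.
for _ in 1 2 3 4 5 6 7 8 9 10; do
  grep -q worker-got-TERM "${marker}" && break
  sleep 0.1
done
worker_result=$(cat "${marker}")
rm -f "${marker}"

echo "leader exit status: ${status}, worker: ${worker_result}"
```

The leader has no TERM handler, so it should exit with status 143 (128 plus
the signal number of SIGTERM), while the marker shows that the worker, which
the outer script never started directly, also received the signal. Without
"set -m" the leader would share the process group of the calling script, and
the negative PID would not identify a group.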
diff --git a/hack/ginkgo-e2e.sh b/hack/ginkgo-e2e.sh
@@ -204,6 +204,49 @@ fi
 # is not used.
 suite_args+=(--report-complete-ginkgo --report-complete-junit)
 
+# When SIGTERM doesn't reach the E2E test suite binaries, ginkgo will exit
+# without collecting information about the currently running and
+# potentially stuck tests. This seems to happen when Prow shuts down a test
+# job because of a timeout.
+#
+# It's useful to print one final progress report in that case,
+# so GINKGO_PROGRESS_REPORT_ON_SIGTERM (enabled by default when CI=true)
+# catches SIGTERM and forwards it to all processes spawned by ginkgo.
+#
+# Manual invocations can trigger a similar report with `killall -USR1 e2e.test`
+# without having to kill the test run.
+GINKGO_CLI_PID=
+signal_handler() {
+  if [ -n "${GINKGO_CLI_PID}" ]; then
+    cat <<EOF
+
+*** $0: received $1 signal -> asking Ginkgo to stop.
+***
+*** Beware that a timeout may have been caused by some earlier test,
+*** not necessarily the one which gets interrupted now.
+*** See the "Spec runtime" for information about how long the
+*** interrupted test was running.
+
+EOF
+    # This goes to the process group, which is important because we
+    # need to reach the e2e.test processes forked by the Ginkgo CLI.
+    kill -TERM "-${GINKGO_CLI_PID}" || true
+
+    echo "Waiting for Ginkgo with pid ${GINKGO_CLI_PID}..."
+    wait "${GINKGO_CLI_PID}"
+    echo "Ginkgo terminated."
+  fi
+}
+case "${GINKGO_PROGRESS_REPORT_ON_SIGTERM:-${CI:-no}}" in
+  y|yes|true)
+    kube::util::trap_add "signal_handler INT" INT
+    kube::util::trap_add "signal_handler TERM" TERM
+    # Job control is needed to make the Ginkgo CLI and all workers run
+    # in their own process group.
+    set -m
+    ;;
+esac
+
 # The following invocation is fairly complex. Let's dump it to simplify
 # determining what the final options are. Enabled by default in CI
 # environments like Prow.
@@ -236,4 +279,8 @@ case "${GINKGO_SHOW_COMMAND:-${CI:-no}}" in y|yes|true) set -x ;; esac
   ${E2E_REPORT_DIR:+"--report-dir=${E2E_REPORT_DIR}"} \
   ${E2E_REPORT_PREFIX:+"--report-prefix=${E2E_REPORT_PREFIX}"} \
   "${suite_args[@]:+${suite_args[@]}}" \
-  "${@}"
+  "${@}" &
+
+set +x
+GINKGO_CLI_PID=$!
+wait "${GINKGO_CLI_PID}"