
Commit 70db253

OCPBUGS-22442: Fix TestRunGraph/mid-task_cancellation_with_work_in_queue_does_not_deadlock flake
Occasionally, the test flaked with the following:

```
--- FAIL: TestRunGraph (1.04s)
    --- FAIL: TestRunGraph/mid-task_cancellation_with_work_in_queue_does_not_deadlock (0.01s)
        task_graph_test.go:943: unexpected error: [context canceled context canceled]
```

The failure happened because the test saw two `context canceled` errors, but expected only one such error:

```
errors: []string{"context canceled"}
```

The test processes a graph with two independent nodes, without parallelism:

```
nodes: []*TaskNode{
	{Tasks: tasks("a1", "a2", "a3")},
	{Tasks: tasks("b")},
},
maxParallelism: 1,
```

The test is configured to signal the cancellation in the middle of processing task `a2`, and expects:

- the `a2` task to return successfully
- the `a3` task to start, but return the `context canceled` error
- the `b` task to never start

The problem is the non-determinism of the Go `select` statement: when multiple `case` branches are satisfied, one of them is chosen nondeterministically. The producer/consumer structure in the graph processing looks like the following:

```go
// consumers
for i := 0; i < maxParallelism; i++ {
	go func(ctx context.Context, worker int) {
		for {
			select {
			case <-ctx.Done():
				return
			case runTask := <-workCh:
				err := fn(ctx, runTask.tasks)
				resultCh <- taskStatus{index: runTask.index, error: err}
			}
		}
	}(ctx, i)
}

// producer
for !done {
	nextNode := getNextNode()
	switch {
	case ctx.Err() == nil && nextNode >= 0:
		// push a task or collect a result
		select {
		case workCh <- runTasks{index: nextNode, tasks: graph.Nodes[nextNode].Tasks}:
			submitted[nextNode] = true
			inflight++
		case result := <-resultCh:
			results[result.index] = &result
			inflight--
		case <-ctx.Done():
		}
	case inflight > 0:
		// no work available to push; collect results
		select {
		case result := <-resultCh:
			results[result.index] = &result
			inflight--
		case <-ctx.Done():
			select {
			case runTask := <-workCh:
				// workers canceled, so remove any work from the queue ourselves
				inflight--
				submitted[runTask.index] = false // TODO: This does not seem needed
			default:
			}
		}
	default:
		// no work to push and nothing in flight. We're done
		done = true
	}
}
```

Because of the nondeterminism, the following trace was possible:

1. The producer creates a job for the first node `[a1, a2, a3]` and puts it to `workCh`.
2. The producer creates a job for the second node `[b]` and waits (`workCh` is full, `resultCh` is empty, ctx is not canceled).
3. A worker consumes the first job, and while `a2` is processed, ctx gets canceled.
4. The producer puts the job for the second node to `workCh`.
5. The worker completes processing the first job, returning an error on trying to process task `a3`.
6. The worker starts another loop iteration and hits the `select`, where both branches are ready: `ctx` is canceled and there is an item in `workCh`.
7. The worker consumes the second job, processes task `b`, and returns a second error.
8. Both errors are consumed by the producer, but the test expected just one error, so the test fails.

There are two possible paths that previously prevented us from hitting the problem:

1. Either the consumer selected the `<-ctx.Done()` branch in (6) above: `b` is never processed and `workCh` is drained by the producer.
2. Or the producer managed to drain `workCh` before the worker got to it.

Note that there is a similar non-determinism on the producer side. When the context is canceled, the collection half of the cycle (`case inflight > 0`) can select between its two branches while there are results in `resultCh`, which means it can take the "draining" branch even when there is nothing to drain, even multiple times, until it is lucky enough times to collect all results in `resultCh`.

The fix for the flake is to detect the cancellation at the start of the worker loop. This narrows the race window to a really short time between the check and the `select`.
1 parent 90da0da commit 70db253

File tree

2 files changed (+11, -1 lines)


pkg/payload/task_graph.go

Lines changed: 11 additions & 0 deletions
```diff
@@ -478,6 +478,17 @@ func RunGraph(ctx context.Context, graph *TaskGraph, maxParallelism int, fn func
 			defer utilruntime.HandleCrash()
 			defer wg.Done()
 			for {
+				// First, make sure the worker was not signalled to cancel. This may seem redundant with the <-ctx.Done() below,
+				// but it is necessary to properly handle the case where cancellation occurs while the worker is processing a
+				// task. Go `select` is nondeterministic when multiple cases are ready, so when the worker finishes a task,
+				// starts another loop and both the ctx.Done() and workCh cases are ready, Go could choose either of them,
+				// potentially starting a new task even though the worker was supposed to stop. Checking cancellation here makes
+				// the race window much smaller (cancellation would need to happen between this check and the select).
+				if ctx.Err() != nil {
+					klog.V(2).Infof("Worker %d: Received cancel signal", worker)
+					return
+				}
+
 				select {
 				case <-ctx.Done():
 					klog.V(2).Infof("Worker %d: Received cancel signal while waiting for work", worker)
```

pkg/payload/task_graph_test.go

Lines changed: 0 additions & 1 deletion
```diff
@@ -898,7 +898,6 @@ func TestRunGraph(t *testing.T) {
 					return err
 				}
 				cancelFn()
-				// time.Sleep(time.Second)
 				return nil
 			},
 			"*": func(t *testing.T, name string, ctx context.Context, cancelFn func()) error {
```
