Description
When attempting to run the sqllogic-ztests on BSUP input data, I've found they consistently hang on certain query sets. I captured a stack trace from a hanging super process by sending it `kill -QUIT`.
Details
Repro is with super commit 550cb59.
I bumped into this while testing the changes merged in #6434. Previously the sqllogictests used Parquet exclusively for input since it was the only way to get schema info and type checking from external inputs. But since #6434 makes it possible to get this from all input sources, I ran a set of tests where I changed all the input files in sqllogic-ztests from Parquet to SUP/BSUP/CSUP, e.g., by running this on my checkout before starting tests:
```shell
for file in $(find sqllogic-ztests -name \*.parquet); do
  super/dist/super -f bsup -o "$(dirname $file)/$(basename -s .parquet $file).bsup" "$file"
  ln -sf "$(basename -s .parquet $file).bsup" "$(dirname $file)/$(basename -s .parquet $file)"
  rm "$file"
done
```
While the SUP and CSUP ones finished OK, for some reason the BSUP ones hung on the random "aggregates" and "groupby" query sets, as can be seen in this Actions run. I've since reproduced the effect several times interactively on Ubuntu 24.04 Actions runners. It doesn't always hang on any one particular query, but given the high number of queries in each set, it's possible that many of them share some characteristic that triggers the issue. What consistently happens is that during the tests CPU consumption goes from 100% busy to 100% idle, and a super process that was running one of the queries continues to hang around.
Suspecting some kind of deadlock, @nwt suggested sending the super process a `kill -QUIT`, which produced the attached stacktrace.txt.
In the event its analysis saves us time, I threw the stack trace at Claude AI and here's what it had to say:
Classic WaitGroup Deadlock

Goroutine 1 (the main goroutine) is stuck in `sync.WaitGroup.Wait()`:

```
goroutine 1 [sync.WaitGroup.Wait]:
...
github.com/brimdata/super/runtime.(*Context).Cancel(0xc00047e000)
	/home/runner/super/runtime/context.go:38 +0x2a
```

Goroutine 53 is the worker that should be calling `Done()`, but it's stuck trying to receive from a nil channel:

```
goroutine 53 [chan receive (nil chan)]:
...
github.com/brimdata/super/runtime/sam/op/aggregate.(*Op).run(0xc00003fb80)
	/home/runner/super/runtime/sam/op/aggregate/aggregate.go:195 +0xa2
```

The Problem

- Goroutine 53 is blocked receiving from a nil channel (likely from the scanner)
- Because it's blocked, it never completes and never calls `wg.Done()`
- Goroutine 1 is waiting for the WaitGroup counter to reach zero in `Context.Cancel()`
- Deadlock: goroutine 1 waits forever for goroutine 53, which is permanently blocked
Root Cause

Looking at the stack trace, the issue is in the file scanner pipeline:

```
github.com/brimdata/super/sio/bsupio.(*scanner).Pull → chan receive (nil chan)
```

The scanner's output channel appears to be nil, which means receiving from it will block forever.
What to Check

- `aggregate.go:163-195` - Check the goroutine spawn and channel initialization
- `scanner.go:82` - The `Pull` method is receiving from a nil channel
- Look for a race condition where the channel gets set to nil during cancellation
- Check if there's a missing context cancellation signal to the worker goroutine
Likely Fix

You probably need to add a `select` statement with a context cancellation case:

```go
select {
case val := <-ch:
	// process
case <-ctx.Done():
	return ctx.Err()
}
```

This would allow the blocked goroutine to exit when `Context.Cancel()` is called, letting the WaitGroup complete properly.