Deadlock symptom when running sqllogic-ztests on BSUP input #6458

@philrz

When attempting to run the sqllogic-ztests on BSUP input data, I've found they consistently hang on certain query sets. I captured a stack trace from a hung super process by sending it a kill -QUIT.

Details

Repro is with super commit 550cb59.

I bumped into this while testing the changes merged in #6434. Previously the sqllogictests exclusively used Parquet for input since it was the only way to get schema info and type checking from external inputs. But since #6434 makes it possible to get this with all input sources, I ran a set of tests where I changed all the input files in sqllogic-ztests from Parquet to SUP/BSUP/CSUP, e.g., by running this on my checkout before starting tests:

for file in $(find sqllogic-ztests -name \*.parquet); do
  super/dist/super -f bsup -o "$(dirname $file)/$(basename -s .parquet $file).bsup" "$file"
  ln -sf "$(basename -s .parquet $file).bsup" "$(dirname $file)/$(basename -s .parquet $file)"
  rm "$file"
done

While the SUP and CSUP ones finished ok, for some reason the BSUP ones hung on the random "aggregates" and "groupby" query sets, as can be seen in this Actions run. I've since reproduced the effect several times interactively on Ubuntu 24.04 Actions runners. It doesn't always hang on the same query, but given the high number of queries in each set, it's possible that many queries share a characteristic that triggers the issue. What consistently happens is that during the tests CPU consumption goes from 100% busy to 100% idle, and a super process that was running one of the queries never exits.

Suspecting some kind of deadlock, @nwt suggested sending the super process a kill -QUIT, which produced the attached stacktrace.txt.

In the event its analysis saves us time, I threw the stack trace at Claude AI and here's what it had to say:

Classic WaitGroup Deadlock

Goroutine 1 (main goroutine) is stuck in sync.WaitGroup.Wait():

goroutine 1 [sync.WaitGroup.Wait]:
...
github.com/brimdata/super/runtime.(*Context).Cancel(0xc00047e000)
   /home/runner/super/runtime/context.go:38 +0x2a

Goroutine 53 is the worker that should be calling Done(), but it's stuck trying to receive from a nil channel:

goroutine 53 [chan receive (nil chan)]:
...
github.com/brimdata/super/runtime/sam/op/aggregate.(*Op).run(0xc00003fb80)
  /home/runner/super/runtime/sam/op/aggregate/aggregate.go:195 +0xa2

The Problem

  1. Goroutine 53 is blocked receiving from a nil channel (likely from the scanner)
  2. Because it's blocked, it never completes and never calls wg.Done()
  3. Goroutine 1 is waiting for the WaitGroup counter to reach zero in Context.Cancel()
  4. Deadlock: goroutine 1 waits forever for goroutine 53, which is permanently blocked

Root Cause
Looking at the stack trace, the issue is in the file scanner pipeline:

github.com/brimdata/super/sio/bsupio.(*scanner).Pull
   → chan receive (nil chan)

The scanner's output channel appears to be nil, which means receiving from it will block forever.

What to Check

  1. aggregate.go:163-195 - Check the goroutine spawn and channel initialization
  2. scanner.go:82 - The Pull method is receiving from a nil channel
  3. Look for a race condition where the channel gets set to nil during cancellation
  4. Check if there's a missing context cancellation signal to the worker goroutine

Likely Fix
You probably need to add a select statement with a context cancellation case:

select {
case val := <-ch:
   // process
case <-ctx.Done():
   return ctx.Err()
}

This would allow the blocked goroutine to exit when Context.Cancel() is called, letting the WaitGroup complete properly.

Labels: bug
