Description
When attempting to run the sqllogic-ztests on BSUP input data, I've found they consistently hang on certain query sets. I captured a stack trace from a hanging super process by sending it `kill -QUIT`.
Details
Repro is with super commit 550cb59.
I bumped into this while testing the changes merged in #6434. Previously the sqllogictests used Parquet exclusively for input since it was the only way to get schema info and type checking from external inputs. But since #6434 makes it possible to get this from all input sources, I ran a set of tests where I changed all the input files in sqllogic-ztests from Parquet to SUP/BSUP/CSUP, e.g., by running this on my checkout before starting tests:
```shell
for file in $(find sqllogic-ztests -name \*.parquet); do
  super/dist/super -f bsup -o "$(dirname $file)/$(basename -s .parquet $file).bsup" "$file"
  ln -sf "$(basename -s .parquet $file).bsup" "$(dirname $file)/$(basename -s .parquet $file)"
  rm "$file"
done
```
While the SUP and CSUP ones finished OK, for some reason the BSUP ones hung on the random "aggregates" and "groupby" query sets, as can be seen in this Actions run. I've since reproduced the effect several times interactively on Ubuntu 24.04 Actions runners. It doesn't always hang on any one particular query, but given the high number of queries in each set, it's possible that many of them share some characteristic that triggers the issue. What consistently happens is that during the tests CPU consumption goes from 100% busy to 100% idle, and a super process that was running one of the queries continues to hang around.
Suspecting some kind of deadlock, @nwt suggested sending the super process a `kill -QUIT`, which produced the attached stacktrace.txt.
In the event its analysis saves us time, I threw the stack trace at Claude AI and here's what it had to say:
Classic WaitGroup Deadlock

Goroutine 1 (the main goroutine) is stuck in `sync.WaitGroup.Wait()`:

```
goroutine 1 [sync.WaitGroup.Wait]:
...
github.com/brimdata/super/runtime.(*Context).Cancel(0xc00047e000)
	/home/runner/super/runtime/context.go:38 +0x2a
```

Goroutine 53 is the worker that should be calling `Done()`, but it's stuck trying to receive from a nil channel:

```
goroutine 53 [chan receive (nil chan)]:
...
github.com/brimdata/super/runtime/sam/op/aggregate.(*Op).run(0xc00003fb80)
	/home/runner/super/runtime/sam/op/aggregate/aggregate.go:195 +0xa2
```

The Problem

- Goroutine 53 is blocked receiving from a nil channel (likely from the scanner)
- Because it's blocked, it never completes and never calls `wg.Done()`
- Goroutine 1 is waiting for the WaitGroup counter to reach zero in `Context.Cancel()`
- Deadlock: goroutine 1 waits forever for goroutine 53, which is permanently blocked
Root Cause

Looking at the stack trace, the issue is in the file scanner pipeline:

```
github.com/brimdata/super/sio/bsupio.(*scanner).Pull → chan receive (nil chan)
```

The scanner's output channel appears to be nil, which means receiving from it will block forever.
What to Check

- `aggregate.go:163-195` - Check the goroutine spawn and channel initialization
- `scanner.go:82` - The `Pull` method is receiving from a nil channel
- Look for a race condition where the channel gets set to nil during cancellation
- Check if there's a missing context cancellation signal to the worker goroutine
Likely Fix

You probably need to add a `select` statement with a context cancellation case:

```go
select {
case val := <-ch:
	// process
case <-ctx.Done():
	return ctx.Err()
}
```

This would allow the blocked goroutine to exit when `Context.Cancel()` is called, letting the WaitGroup complete properly.