Description
We use this crate in lancedb's Python bindings with a tokio runtime. Users sometimes report a crash on exit when running small subprocess tasks. They use spawn-based multiprocessing, so each task launches a subprocess, runs a small piece of work, and exits. Sometimes that exit crashes with the following error:
Fatal Python error: PyGILState_Release: thread state 0x7fec9803b600 must be current when releasing
Python runtime state: finalizing (tstate=0x0000000000ba5048)
Thread 0x00007fed47523080 (most recent call first):
<no Python frame>
Extension modules: numpy._core._multiarray_umath, numpy.linalg._umath_linalg, pyarrow.lib, pyarrow._compute, pyarrow._acero, pyarrow._fs, pyarrow._csv, pyarrow._json, pyarrow._substrait, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, pyarrow._dataset, pyarrow._dataset_orc, pyarrow._parquet, pyarrow._parquet_encryption, pyarrow._dataset_parquet_encryption, pyarrow._dataset_parquet, pyarrow._azurefs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs (total: 68)
The task looks something like...
def my_task():
    lancedb.do_async_thing()
Here do_async_thing is a function that does loop.run(async_thing()), where async_thing is a function that awaits the result of future_into_py. The loop here is a global event loop running on a daemon thread that is shut down on exit with an atexit hook.
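To make the setup concrete, here is a minimal pure-Python sketch of that pattern. It is not the lancedb code: BackgroundLoop, LOOP, and the body of async_thing are hypothetical stand-ins, and the asyncio.sleep call only takes the place of awaiting a future returned by future_into_py. As written it will not reproduce the crash, since no tokio runtime is involved. It only shows the loop, daemon thread, and atexit arrangement.

```python
import asyncio
import atexit
import threading


class BackgroundLoop:
    """Hypothetical stand-in for the global event loop described above."""

    def __init__(self):
        self.loop = asyncio.new_event_loop()
        # The loop runs forever on a daemon thread.
        self.thread = threading.Thread(target=self.loop.run_forever, daemon=True)
        self.thread.start()

    def run(self, coro):
        # Submit the coroutine to the loop thread and block until it finishes.
        return asyncio.run_coroutine_threadsafe(coro, self.loop).result()

    def close(self):
        # Ask the loop to stop, then wait for the loop thread to finish.
        self.loop.call_soon_threadsafe(self.loop.stop)
        self.thread.join()


LOOP = BackgroundLoop()
# Shut the loop down at interpreter exit, as in the description.
atexit.register(LOOP.close)


async def async_thing():
    # Assumption: the real function awaits a future created by future_into_py.
    await asyncio.sleep(0.01)
    return "done"


def do_async_thing():
    return LOOP.run(async_thing())


def my_task():
    return do_async_thing()
```

Note that the atexit hook here only stops the Python event loop. Nothing in this arrangement tells the tokio runtime to stop its own worker threads.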
I'm able to debug into the core dump and get the following stack trace:
#0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
#1 __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2 __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3 0x00007e525b04527e in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4 0x00007e525b0288ff in __GI_abort () at ./stdlib/abort.c:79
#5 0x00000000004b0fa7 in fatal_error_exit (status=-1) at ../Python/pylifecycle.c:2735
#6 fatal_error (fd=fd@entry=2, header=header@entry=0, prefix=prefix@entry=0x0, msg=msg@entry=0x0, status=status@entry=-1) at ../Python/pylifecycle.c:2846
#7 0x00000000004b278e in _Py_FatalErrorFormat (func=func@entry=0x78cb70 <__func__.2> "PyGILState_Release", format=format@entry=0x730350 "thread state %p must be current when releasing")
at ../Python/pylifecycle.c:2962
#8 0x00000000004b2b74 in PyGILState_Release (oldstate=PyGILState_UNLOCKED) at ../Python/pystate.c:2265
#9 0x00007e52562bd3ca in <pyo3_async_runtimes::tokio::TokioRuntime as pyo3_async_runtimes::generic::Runtime>::spawn::{{closure}} ()
from /home/pace/dev/lancedb/python/python/lancedb/_lancedb.abi3.so
#10 0x00007e5256225f9d in tokio::runtime::task::raw::poll () from /home/pace/dev/lancedb/python/python/lancedb/_lancedb.abi3.so
#11 0x00007e5259c72520 in tokio::runtime::scheduler::multi_thread::worker::Context::run_task () from /home/pace/dev/lancedb/python/python/lancedb/_lancedb.abi3.so
#12 0x00007e5259c7aa2f in tokio::runtime::task::raw::poll () from /home/pace/dev/lancedb/python/python/lancedb/_lancedb.abi3.so
#13 0x00007e5259c63f68 in std::sys::backtrace::__rust_begin_short_backtrace () from /home/pace/dev/lancedb/python/python/lancedb/_lancedb.abi3.so
#14 0x00007e5259c63bdc in core::ops::function::FnOnce::call_once{{vtable.shim}} () from /home/pace/dev/lancedb/python/python/lancedb/_lancedb.abi3.so
#15 0x00007e5259c5abbb in std::sys::pal::unix::thread::Thread::new::thread_start () from /home/pace/dev/lancedb/python/python/lancedb/_lancedb.abi3.so
#16 0x00007e525b09caa4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#17 0x00007e525b129c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
It seems that some tokio task is still in the queue when Python finalization begins. This task attempts to call PyGILState_Release, but since finalization has already begun, this turns into an abort.
I think one potential solution might be to have some way to shut down the pyo3 tokio runtime. I don't think I can do that today because I can only get a reference to the runtime, and shutting it down requires ownership.