
Do we need some kind of shutdown method? #40

@westonpace

We use this crate in lancedb's Python bindings with a tokio runtime. Users sometimes report a crash on exit when running small tasks in subprocesses. They use spawn-based multiprocessing, so each task launches a subprocess, runs briefly, and exits. Sometimes that exit crashes with the following error:

Fatal Python error: PyGILState_Release: thread state 0x7fec9803b600 must be current when releasing
Python runtime state: finalizing (tstate=0x0000000000ba5048)

Thread 0x00007fed47523080 (most recent call first):
  <no Python frame>

Extension modules: numpy._core._multiarray_umath, numpy.linalg._umath_linalg, pyarrow.lib, pyarrow._compute, pyarrow._acero, pyarrow._fs, pyarrow._csv, pyarrow._json, pyarrow._substrait, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, pyarrow._dataset, pyarrow._dataset_orc, pyarrow._parquet, pyarrow._parquet_encryption, pyarrow._dataset_parquet_encryption, pyarrow._dataset_parquet, pyarrow._azurefs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs (total: 68)

The task looks something like...

def my_task():
    lancedb.do_async_thing()

Here do_async_thing is a function that does loop.run(async_thing()) where async_thing is a function that awaits the result of future_into_py. The loop here is a global event loop running on a daemon thread that is shut down on exit with an atexit hook.
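
For anyone trying to picture the setup, here is a minimal sketch of a global event loop on a daemon thread with an atexit hook. It is not lancedb's actual code: `BackgroundLoop`, `LOOP`, and the body of `async_thing` are hypothetical, and `asyncio.sleep(0)` only stands in for awaiting the result of future_into_py.

```python
import asyncio
import atexit
import threading


class BackgroundLoop:
    """Hypothetical helper: an asyncio loop running on a daemon thread."""

    def __init__(self):
        self.loop = asyncio.new_event_loop()
        # Daemon thread, so it never blocks interpreter exit on its own.
        self.thread = threading.Thread(target=self.loop.run_forever, daemon=True)
        self.thread.start()

    def run(self, coro):
        # Submit the coroutine to the background loop and block for its result.
        return asyncio.run_coroutine_threadsafe(coro, self.loop).result()

    def stop(self):
        # Ask the loop to stop from outside its thread, then wait for the thread.
        self.loop.call_soon_threadsafe(self.loop.stop)
        self.thread.join()


LOOP = BackgroundLoop()
atexit.register(LOOP.stop)  # the loop is shut down on exit


async def async_thing():
    # Placeholder (assumption): the real code awaits a future_into_py result.
    await asyncio.sleep(0)
    return 42


def do_async_thing():
    return LOOP.run(async_thing())
```

Note that the atexit hook here only stops the Python event loop. Nothing in this sketch touches the tokio runtime on the Rust side, so any task still queued there keeps running independently.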

I'm able to debug into the core dump and get the following stack trace:

#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007e525b04527e in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007e525b0288ff in __GI_abort () at ./stdlib/abort.c:79
#5  0x00000000004b0fa7 in fatal_error_exit (status=-1) at ../Python/pylifecycle.c:2735
#6  fatal_error (fd=fd@entry=2, header=header@entry=0, prefix=prefix@entry=0x0, msg=msg@entry=0x0, status=status@entry=-1) at ../Python/pylifecycle.c:2846
#7  0x00000000004b278e in _Py_FatalErrorFormat (func=func@entry=0x78cb70 <__func__.2> "PyGILState_Release", format=format@entry=0x730350 "thread state %p must be current when releasing")
    at ../Python/pylifecycle.c:2962
#8  0x00000000004b2b74 in PyGILState_Release (oldstate=PyGILState_UNLOCKED) at ../Python/pystate.c:2265
#9  0x00007e52562bd3ca in <pyo3_async_runtimes::tokio::TokioRuntime as pyo3_async_runtimes::generic::Runtime>::spawn::{{closure}} ()
   from /home/pace/dev/lancedb/python/python/lancedb/_lancedb.abi3.so
#10 0x00007e5256225f9d in tokio::runtime::task::raw::poll () from /home/pace/dev/lancedb/python/python/lancedb/_lancedb.abi3.so
#11 0x00007e5259c72520 in tokio::runtime::scheduler::multi_thread::worker::Context::run_task () from /home/pace/dev/lancedb/python/python/lancedb/_lancedb.abi3.so
#12 0x00007e5259c7aa2f in tokio::runtime::task::raw::poll () from /home/pace/dev/lancedb/python/python/lancedb/_lancedb.abi3.so
#13 0x00007e5259c63f68 in std::sys::backtrace::__rust_begin_short_backtrace () from /home/pace/dev/lancedb/python/python/lancedb/_lancedb.abi3.so
#14 0x00007e5259c63bdc in core::ops::function::FnOnce::call_once{{vtable.shim}} () from /home/pace/dev/lancedb/python/python/lancedb/_lancedb.abi3.so
#15 0x00007e5259c5abbb in std::sys::pal::unix::thread::Thread::new::thread_start () from /home/pace/dev/lancedb/python/python/lancedb/_lancedb.abi3.so
#16 0x00007e525b09caa4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#17 0x00007e525b129c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

It seems that some tokio task is still in the queue when Python finalization begins. The task attempts to call PyGILState_Release, but since finalization has already begun, the call turns into an abort.

I think one potential solution might be to have some way to shut down the pyo3 tokio runtime. I don't think I can do that today because I can only get a reference to the runtime, and shutting it down requires ownership.
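
To illustrate the kind of shutdown I have in mind, here is a rough pure-Python analogy. This is not the pyo3-async-runtimes or tokio API; `WorkerPool`, `spawn`, and `shutdown` are hypothetical names. The idea is that shutdown stops accepting new work, lets queued tasks finish, and joins the worker threads, so nothing is left running by the time finalization starts.

```python
import queue
import threading


class WorkerPool:
    """Hypothetical stand-in for a runtime that owns worker threads."""

    def __init__(self, workers=2):
        self._tasks = queue.Queue()
        self._closed = False
        self._threads = [
            threading.Thread(target=self._run, daemon=True) for _ in range(workers)
        ]
        for t in self._threads:
            t.start()

    def _run(self):
        while True:
            task = self._tasks.get()
            if task is None:  # sentinel: this worker should exit
                return
            task()

    def spawn(self, fn):
        if self._closed:
            raise RuntimeError("pool is shut down")
        self._tasks.put(fn)

    def shutdown(self):
        # Idempotent: safe to call from an atexit hook more than once.
        if self._closed:
            return
        self._closed = True
        # Sentinels are queued behind pending tasks, so those tasks run first.
        for _ in self._threads:
            self._tasks.put(None)
        for t in self._threads:
            t.join()
```

With something like this, the atexit hook could call `shutdown()` before stopping the event loop. Whether draining or dropping pending tasks is the right behavior for the real runtime is an open question.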
