Do we need some kind of shutdown method?

We use this crate in lancedb's Python bindings with a tokio runtime. Some users occasionally report a crash on exit when running small subprocess tasks. They use `spawn`-based multiprocessing, so each task launches a subprocess, runs a small task, and exits. Sometimes that exit crashes with the following error:

```
Fatal Python error: PyGILState_Release: thread state 0x7fec9803b600 must be current when releasing
Python runtime state: finalizing (tstate=0x0000000000ba5048)

Thread 0x00007fed47523080 (most recent call first):
  <no Python frame>

Extension modules: numpy._core._multiarray_umath, numpy.linalg._umath_linalg, pyarrow.lib, pyarrow._compute, pyarrow._acero, pyarrow._fs, pyarrow._csv, pyarrow._json, pyarrow._substrait, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, pyarrow._dataset, pyarrow._dataset_orc, pyarrow._parquet, pyarrow._parquet_encryption, pyarrow._dataset_parquet_encryption, pyarrow._dataset_parquet, pyarrow._azurefs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs (total: 68)
```

The task looks something like this:

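```python
import lancedb

def my_task():
    # Placeholder for any lancedb call that runs async work on the background loop
    lancedb.do_async_thing()
```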

Here, `do_async_thing` calls `loop.run(async_thing())`, where `async_thing` is a coroutine that awaits the result of `future_into_py`. `loop` is a global event loop running on a daemon thread, which is shut down on exit by an `atexit` hook.
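For readers unfamiliar with this setup, here is a minimal pure-Python sketch of the pattern described above. `BackgroundEventLoop` and its methods are hypothetical names, and `asyncio.sleep` stands in for the Rust future returned by `future_into_py`; the real bindings may be structured differently. Note what it does not cover: stopping this asyncio loop in the `atexit` hook presumably does nothing to the tokio worker threads, which may still hold queued tasks that call back into Python.

```python
import asyncio
import atexit
import threading


class BackgroundEventLoop:
    """Hypothetical model of the global loop: an asyncio loop on a daemon thread."""

    def __init__(self) -> None:
        self.loop = asyncio.new_event_loop()
        self._closed = False
        # daemon=True: the interpreter does not wait for this thread on exit.
        self.thread = threading.Thread(target=self._run_forever, daemon=True)
        self.thread.start()

    def _run_forever(self) -> None:
        asyncio.set_event_loop(self.loop)
        self.loop.run_forever()

    def run(self, coro):
        # Hand the coroutine to the background loop and block for its result.
        return asyncio.run_coroutine_threadsafe(coro, self.loop).result()

    def shutdown(self) -> None:
        # Idempotent, so a manual call plus the atexit hook is harmless.
        if self._closed:
            return
        self._closed = True
        # stop() must run on the loop's own thread; if the loop has not started
        # yet, the queued callback stops it on its first iteration.
        self.loop.call_soon_threadsafe(self.loop.stop)
        self.thread.join()
        self.loop.close()


LOOP = BackgroundEventLoop()
atexit.register(LOOP.shutdown)


async def async_thing() -> int:
    # Stand-in for awaiting a future created by pyo3's future_into_py.
    await asyncio.sleep(0.01)
    return 42


def do_async_thing() -> int:
    return LOOP.run(async_thing())
```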

I was able to load the core dump in a debugger and get the following stack trace:

```
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007e525b04527e in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007e525b0288ff in __GI_abort () at ./stdlib/abort.c:79
#5  0x00000000004b0fa7 in fatal_error_exit (status=-1) at ../Python/pylifecycle.c:2735
#6  fatal_error (fd=fd@entry=2, header=header@entry=0, prefix=prefix@entry=0x0, msg=msg@entry=0x0, status=status@entry=-1) at ../Python/pylifecycle.c:2846
#7  0x00000000004b278e in _Py_FatalErrorFormat (func=func@entry=0x78cb70 <__func__.2> "PyGILState_Release", format=format@entry=0x730350 "thread state %p must be current when releasing")
    at ../Python/pylifecycle.c:2962
#8  0x00000000004b2b74 in PyGILState_Release (oldstate=PyGILState_UNLOCKED) at ../Python/pystate.c:2265
#9  0x00007e52562bd3ca in <pyo3_async_runtimes::tokio::TokioRuntime as pyo3_async_runtimes::generic::Runtime>::spawn::{{closure}} ()
   from /home/pace/dev/lancedb/python/python/lancedb/_lancedb.abi3.so
#10 0x00007e5256225f9d in tokio::runtime::task::raw::poll () from /home/pace/dev/lancedb/python/python/lancedb/_lancedb.abi3.so
#11 0x00007e5259c72520 in tokio::runtime::scheduler::multi_thread::worker::Context::run_task () from /home/pace/dev/lancedb/python/python/lancedb/_lancedb.abi3.so
#12 0x00007e5259c7aa2f in tokio::runtime::task::raw::poll () from /home/pace/dev/lancedb/python/python/lancedb/_lancedb.abi3.so
#13 0x00007e5259c63f68 in std::sys::backtrace::__rust_begin_short_backtrace () from /home/pace/dev/lancedb/python/python/lancedb/_lancedb.abi3.so
#14 0x00007e5259c63bdc in core::ops::function::FnOnce::call_once{{vtable.shim}} () from /home/pace/dev/lancedb/python/python/lancedb/_lancedb.abi3.so
#15 0x00007e5259c5abbb in std::sys::pal::unix::thread::Thread::new::thread_start () from /home/pace/dev/lancedb/python/python/lancedb/_lancedb.abi3.so
#16 0x00007e525b09caa4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#17 0x00007e525b129c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
```

It seems that some tokio task is still queued when Python finalization begins. That task (frame #9, inside the closure passed to `TokioRuntime::spawn`) then attempts to call `PyGILState_Release`, but because finalization has already begun, the call turns into an abort.

One potential solution might be a way to shut down the pyo3 tokio runtime explicitly. I don't think that is possible today: I can only get a reference to the runtime, and shutting it down (e.g. with `Runtime::shutdown_timeout` or `Runtime::shutdown_background`) requires ownership.
