Destroy all Edenfs_watcher instances on server exit
Summary:
This diff addresses a problem affecting the interaction of the OCaml runtime, Tokio, and Folly:
- Our own `ocamlrep` crate allows us to register finalizers for the `Custom<>` types owned by OCaml. Usually, these just call `drop` on the Rust side.
- When an OCaml program terminates, it does not call the finalizers of the objects that are alive at that point (see https://ocaml.org/docs/garbage-collection#finalisation-and-the-weak-module)
- When running Hack with `edenfs_file_watcher_enabled`, the server env contains an (opaque) `Custom<EdenfsWatcherInstanceHandle>`, which owns the Tokio runtime executing a worker thread. Due to the points above, when calling `Exit.exit` in the server, the Tokio runtime and worker thread continue running.
- Folly registers a low-level hook that is run when the program actually terminates (after the OCaml runtime has finished its own shutdown logic)
- This low-level hook can deadlock when the `EdenfsWatcherInstance` is still running. I'm not sure what exactly is causing the deadlock (the fact that the worker thread is still running or the mere existence of the Tokio runtime, ...)
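
The first three points can be illustrated with a minimal Rust sketch that uses only the standard library rather than Tokio and `ocamlrep`. `WatcherHandle`, `start`, and `shutdown` are hypothetical names, not the real `EdenfsWatcherInstanceHandle` API. The point is only that a worker stops when `drop` runs, and keeps running when the owner never drops the handle (simulated here with `std::mem::forget`, standing in for a finalizer that is never called):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical stand-in for a watcher instance that owns a background worker.
struct WatcherHandle {
    stop: Arc<AtomicBool>,
    worker: Option<JoinHandle<()>>,
}

impl WatcherHandle {
    /// Starts the worker and returns the handle plus a flag that stays
    /// `true` for as long as the worker thread is running.
    fn start() -> (Self, Arc<AtomicBool>) {
        let stop = Arc::new(AtomicBool::new(false));
        let running = Arc::new(AtomicBool::new(true));
        let (stop_w, running_w) = (Arc::clone(&stop), Arc::clone(&running));
        let worker = thread::spawn(move || {
            // Poll the stop flag; a real worker would process file events here.
            while !stop_w.load(Ordering::Acquire) {
                thread::sleep(Duration::from_millis(1));
            }
            running_w.store(false, Ordering::Release);
        });
        (WatcherHandle { stop, worker: Some(worker) }, running)
    }

    /// Signals the worker to stop and waits for it. Safe to call twice.
    fn shutdown(&mut self) {
        self.stop.store(true, Ordering::Release);
        if let Some(worker) = self.worker.take() {
            worker.join().expect("worker thread panicked");
        }
    }
}

impl Drop for WatcherHandle {
    // This is what a finalizer that "just calls drop" would trigger.
    fn drop(&mut self) {
        self.shutdown();
    }
}
```

If the handle is passed to `std::mem::forget`, `drop` never runs and the worker thread stays alive until the process itself terminates, which is the situation any later exit handler then has to cope with.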
Concretely, this problem manifests itself as the server sometimes getting stuck when calling `Exit.exit`.
This diff changes the initialization code for `Edenfs_watcher` instances such that whenever we create an instance, we call `Exit.add_hook_upon_clean_exit` to register a hook that will properly shut down the instance. Note that this hook mechanism is part of our own code, and is run before the lower-level exit handler installed by Folly.
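
The general shape of such a hook mechanism can be sketched as follows. This is an illustrative Rust analogue under stated assumptions, not the actual OCaml `Exit` module: `add_exit_hook`, `run_exit_hooks`, and `clean_exit` are hypothetical names. Application-level hooks are collected in a registry and run explicitly before the process asks the C runtime to exit:

```rust
use std::sync::Mutex;

/// A one-shot cleanup action registered by application code.
type Hook = Box<dyn FnOnce() + Send>;

/// Hypothetical global registry of hooks to run on a clean exit.
static EXIT_HOOKS: Mutex<Vec<Hook>> = Mutex::new(Vec::new());

/// Registers a hook, e.g. one that shuts down a watcher instance.
fn add_exit_hook(hook: impl FnOnce() + Send + 'static) {
    EXIT_HOOKS.lock().unwrap().push(Box::new(hook));
}

/// Runs every registered hook once, in registration order.
fn run_exit_hooks() {
    // Take the hooks out first so the lock is not held while they run.
    let hooks: Vec<Hook> = std::mem::take(&mut *EXIT_HOOKS.lock().unwrap());
    for hook in hooks {
        hook();
    }
}

/// Runs the application-level hooks first, then terminates the process.
fn clean_exit(code: i32) -> ! {
    run_exit_hooks();
    std::process::exit(code)
}
```

Because the hooks run before the process begins terminating, any worker they shut down is already gone by the time lower-level exit handlers execute.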
## Alternatives considered:
OCaml has a "cleanup_on_exit" mode, which among other things should call all finalizers (see https://ocaml.org/manual/5.3/runtime.html#s:ocamlrun-options). However, it seems to be buggy in OCaml 5.x (see ocaml/ocaml#10865 (comment)), and running hh_server with `OCAMLRUNPARAM=c` doesn't fix our problem.
I'm not sure if we would want to use it anyway, as it may slow down server restarts.
# Facebook
Here's a stack trace where we get stuck. I've obtained it by attaching gdb to the server process:
```
#0 __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x0, op=265, expected=3432193, futex_word=0x7fffb4000910) at futex-internal.c:57
#1 __futex_abstimed_wait_common (cancel=true, private=<optimized out>, abstime=0x0, clockid=<optimized out>, expected=3432193, futex_word=0x7fffb4000910) at futex-internal.c:87
#2 __GI___futex_abstimed_wait_cancelable64 (futex_word=0x7fffb4000910, expected=3432193, clockid=<optimized out>, abstime=0x0, private=<optimized out>) at futex-internal.c:139
#3 0x00007ffff729c793 in __pthread_clockjoin_ex (threadid=140736213288512, thread_return=0x0, clockid=0, abstime=0x0, block=<optimized out>) at pthread_join_common.c:105
#4 0x00007ffff7cdf84f in __gthread_join (__value_ptr=0x0, __threadid=<optimized out>) at /home/engshare/third-party2/libgcc/11.x/src/gcc-11.x/x86_64-facebook-linux/libstdc++-v3/include/x86_64-facebook-linux/bits/gthr-default.h:669
#5 std::thread::join (this=0x7fffbe209110) at ../../../.././libstdc++-v3/src/c++11/thread.cc:112
#6 0x00000000062a810d in folly::ThreadPoolExecutor::joinStoppedThreads(unsigned long) ()
#7 0x00000000062a89ca in folly::ThreadPoolExecutor::stopAndJoinAllThreads(bool) ()
#8 0x00000000062a395b in folly::IOThreadPoolExecutor::~IOThreadPoolExecutor() ()
#9 0x00000000062a1975 in folly::IOThreadPoolExecutor::~IOThreadPoolExecutor() ()
#10 0x0000000006452c09 in folly::detail::SingletonHolder<folly::IOThreadPoolExecutor>::destroyInstance() ()
#11 0x0000000006fa3c0c in folly::SingletonVault::destroyInstances() ()
#12 0x00007ffff72478b8 in __run_exit_handlers (status=4, listp=0x7ffff7412658 <__exit_funcs>, run_list_atexit=<optimized out>, run_dtors=<optimized out>) at exit.c:113
#13 0x00007ffff72479ca in __GI_exit (status=<optimized out>) at exit.c:143
#14 0x000000001f401e62 in caml_do_exit (retcode=4) at runtime/sys.c:200
#15 0x000000001f4020dc in caml_sys_exit (retcode=<optimized out>) at runtime/sys.c:205
#16 <signal handler called>
#17 0x000000001f33c3f8 in camlStdlib.exit_1534 () at stdlib.ml:580
#18 0x000000001f030ec0 in camlExit.exit_24 () at fbcode/hphp/hack/src/utils/exit.ml:66
#19 0x000000001c2bb13b in camlServerMain.exit_if_critical_update_249 () at fbcode/hphp/hack/src/server/serverMain.ml:108
#20 0x000000001c2bb885 in camlServerMain.query_notifier_549 () at fbcode/hphp/hack/src/server/serverMain.ml:225
#21 0x000000001c2bbe32 in camlServerMain.recheck_until_no_changes_left_916 () at fbcode/hphp/hack/src/server/serverMain.ml:331
```
So we are shutting down the `folly::IOThreadPoolExecutor`, whose destructor then gets stuck in `std::thread::join`, waiting for one of its pool threads to terminate.
A comment inside `__pthread_clockjoin_ex` at the point where we get stuck reads:
```
/* The kernel notifies a process which uses CLONE_CHILD_CLEARTID via
futex wake-up when the clone terminates. The memory location
contains the thread ID while the clone is running and is reset to
zero by the kernel afterwards. The kernel up to version 3.16.3
does not use the private futex operations for futex wake-up when
the clone terminates. */
```
Differential Revision: D76737597
fbshipit-source-id: 979b4e9b3ae88a07fcf62f0958fe41372624d00b