Commit 3fec44a

frank-emrich authored and facebook-github-bot committed
Destroy all Edenfs_watcher instances on server exit
Summary:
This diff addresses a problem affecting the interaction of the OCaml runtime, Tokio, and Folly:

- Our own `ocamlrep` crate allows us to register finalizers for the `Custom<>` types owned by OCaml. Usually, these just call `drop` on the Rust side.
- When an OCaml program terminates, it does not call the finalizers of objects that are still alive at that point (see https://ocaml.org/docs/garbage-collection#finalisation-and-the-weak-module).
- When running Hack with `edenfs_file_watcher_enabled`, the server env contains an (opaque) `Custom<EdenfsWatcherInstanceHandle>`, which owns the Tokio runtime executing a worker thread. Due to the points above, when `Exit.exit` is called in the server, the Tokio runtime and worker thread continue running.
- Folly registers a low-level hook that runs when the program actually terminates (after the OCaml runtime has finished its own shutdown logic).
- This low-level hook can deadlock while the `EdenfsWatcherInstance` is still running. I'm not sure what exactly causes the deadlock (the fact that the worker thread is still running, the mere existence of the Tokio runtime, ...).

Concretely, this problem manifests itself as the server sometimes getting stuck when calling `Exit.exit`.

This diff changes the initialization code for `Edenfs_watcher` instances such that whenever we create an instance, we call `Exit.add_hook_upon_clean_exit` to register a hook that will properly shut down the instance. Note that this hook mechanism is part of our own code and runs before the lower-level exit handler installed by Folly.
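To make the shape of the fix concrete, here is a minimal OCaml sketch (editorial illustration, not part of the commit): `Edenfs_watcher.init` and `Edenfs_watcher.shutdown` are hypothetical stand-ins for the real initialization code, and the `(unit -> unit) -> unit` signature assumed for `Exit.add_hook_upon_clean_exit` (the hook mechanism named above) is an assumption.

```ocaml
(* Illustrative sketch, not the actual Hack code: whenever an Edenfs_watcher
   instance is created, immediately register a clean-exit hook that shuts it
   down, so the Tokio runtime and its worker thread are gone before Folly's
   atexit handlers run. *)
let create_watcher settings =
  (* Hypothetical constructor returning the opaque handle
     (the Custom<EdenfsWatcherInstanceHandle> mentioned above). *)
  let handle = Edenfs_watcher.init settings in
  (* Assumed signature: Exit.add_hook_upon_clean_exit : (unit -> unit) -> unit.
     The hook runs during Exit.exit, before process-level exit handlers. *)
  Exit.add_hook_upon_clean_exit (fun () -> Edenfs_watcher.shutdown handle);
  handle
```

The point of this design, per the summary, is that shutdown happens inside Hack's own exit path rather than relying on OCaml finalizers (which do not run for values still alive at exit) or on Folly's atexit handler (which is where the deadlock occurs).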
## Alternatives considered

OCaml has a "cleanup_on_exit" mode, which among other things should call all finalizers (see https://ocaml.org/manual/5.3/runtime.html#s:ocamlrun-options). However, it seems to be buggy in OCaml 5.x (see ocaml/ocaml#10865 (comment)), and running hh_server with `OCAMLRUNPARAM=c` doesn't fix our problem. I'm not sure we would want to use it anyway, as it may slow down server restarts.

# Facebook

Here's a stack trace where we get stuck; I obtained it by attaching gdb to the server process:

```
#0  __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x0, op=265, expected=3432193, futex_word=0x7fffb4000910) at futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=<optimized out>, abstime=0x0, clockid=<optimized out>, expected=3432193, futex_word=0x7fffb4000910) at futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=0x7fffb4000910, expected=3432193, clockid=<optimized out>, abstime=0x0, private=<optimized out>) at futex-internal.c:139
#3  0x00007ffff729c793 in __pthread_clockjoin_ex (threadid=140736213288512, thread_return=0x0, clockid=0, abstime=0x0, block=<optimized out>) at pthread_join_common.c:105
#4  0x00007ffff7cdf84f in __gthread_join (__value_ptr=0x0, __threadid=<optimized out>) at /home/engshare/third-party2/libgcc/11.x/src/gcc-11.x/x86_64-facebook-linux/libstdc++-v3/include/x86_64-facebook-linux/bits/gthr-default.h:669
#5  std::thread::join (this=0x7fffbe209110) at ../../../.././libstdc++-v3/src/c++11/thread.cc:112
#6  0x00000000062a810d in folly::ThreadPoolExecutor::joinStoppedThreads(unsigned long) ()
#7  0x00000000062a89ca in folly::ThreadPoolExecutor::stopAndJoinAllThreads(bool) ()
#8  0x00000000062a395b in folly::IOThreadPoolExecutor::~IOThreadPoolExecutor() ()
#9  0x00000000062a1975 in folly::IOThreadPoolExecutor::~IOThreadPoolExecutor() ()
#10 0x0000000006452c09 in folly::detail::SingletonHolder<folly::IOThreadPoolExecutor>::destroyInstance() ()
#11 0x0000000006fa3c0c in folly::SingletonVault::destroyInstances() ()
#12 0x00007ffff72478b8 in __run_exit_handlers (status=4, listp=0x7ffff7412658 <__exit_funcs>, run_list_atexit=<optimized out>, run_dtors=<optimized out>) at exit.c:113
#13 0x00007ffff72479ca in __GI_exit (status=<optimized out>) at exit.c:143
#14 0x000000001f401e62 in caml_do_exit (retcode=4) at runtime/sys.c:200
#15 0x000000001f4020dc in caml_sys_exit (retcode=<optimized out>) at runtime/sys.c:205
#16 <signal handler called>
#17 0x000000001f33c3f8 in camlStdlib.exit_1534 () at stdlib.ml:580
#18 0x000000001f030ec0 in camlExit.exit_24 () at fbcode/hphp/hack/src/utils/exit.ml:66
#19 0x000000001c2bb13b in camlServerMain.exit_if_critical_update_249 () at fbcode/hphp/hack/src/server/serverMain.ml:108
#20 0x000000001c2bb885 in camlServerMain.query_notifier_549 () at fbcode/hphp/hack/src/server/serverMain.ml:225
#21 0x000000001c2bbe32 in camlServerMain.recheck_until_no_changes_left_916 () at fbcode/hphp/hack/src/server/serverMain.ml:331
```

So we are shutting down the `folly::IOThreadPoolExecutor`, which then gets stuck waiting for something. A comment inside `__pthread_clockjoin_ex`, at the point where we get stuck, reads:

```
/* The kernel notifies a process which uses CLONE_CHILD_CLEARTID via
   futex wake-up when the clone terminates. The memory location contains
   the thread ID while the clone is running and is reset to zero by the
   kernel afterwards. The kernel up to version 3.16.3 does not use the
   private futex operations for futex wake-up when the clone terminates. */
```

Differential Revision: D76737597

fbshipit-source-id: 979b4e9b3ae88a07fcf62f0958fe41372624d00b
1 parent d3f1c8d · commit 3fec44a


hphp/hack/test/integration/test_edenfs_file_watcher.py

Lines changed: 23 additions & 1 deletion
```diff
@@ -97,6 +97,12 @@ def assertServerLogContains(driver: CommonTestDriver, needle: str) -> None:
     driver.assertTrue(contains)
 
 
+def assertMonitorLogContains(driver: CommonTestDriver, needle: str) -> None:
+    monitor_log = driver.get_all_logs(driver.repo_dir).all_monitor_logs
+
+    driver.assertTrue(needle in monitor_log)
+
+
 def assertEdenFsWatcherInitialized(driver: CommonTestDriver) -> None:
     """Checks that an Edenfs_watcher instance was initialized
@@ -111,7 +117,6 @@ def assertServerNotCrashed(driver: CommonTestDriver) -> None:
     monitor_log = driver.get_all_logs(driver.repo_dir).all_monitor_logs
 
     driver.assertFalse("Exit_status.Edenfs_watcher_failed" in monitor_log)
-    driver.assertFalse("writing crash log" in monitor_log)
 
 
 class EdenfsWatcherTestDriver(common_tests.CommonTestDriver):
@@ -498,6 +503,23 @@ def test_folder_rename(self) -> None:
             ]
         )
 
+    def test_hhconfig_change(self) -> None:
+        self.test_driver.start_hh_server()
+
+        hhconfig_file = os.path.join(self.test_driver.repo_dir, ".hhconfig")
+        with open(hhconfig_file, "a") as f:
+            f.write("# a comment to change hhconfig")
+
+        # This will make us wait until the new server has come up.
+        # Error message should be the same as before.
+        self.test_driver.check_cmd(["No errors!"])
+
+        # Let's make sure that we did indeed restart due to the hhconfig change
+        assertMonitorLogContains(
+            self.test_driver,
+            "Exit_status.Hhconfig_changed",
+        )
+
 # Note that we inherit all tests from CommonTests and run them,
 # but using EdenfsWatcherTestDriver
```