Destroy all Edenfs_watcher instances on server exit

frank-emrich · facebook-github-bot · commit 3fec44a7c031 · 2025-06-17T11:55:43.000-07:00
Summary: This diff addresses a problem affecting the interaction of the OCaml runtime, Tokio, and Folly: - Our own `ocamlrep` crate allows us to register finalizers for the `Custom<>` types owned by OCaml. Usually, these just call `drop` on the Rust side. - When an OCaml program terminates, it does not call the finalizers of the objects that are alive at that point (see https://ocaml.org/docs/garbage-collection#finalisation-and-the-weak-module) - When running Hack with `edenfs_file_watcher_enabled`, the server env contains an (opaque) `Custom<EdenfsWatcherInstanceHandle>`, which owns the Tokio runtime executing a worker thread. Due to the points above, when calling `Exit.exit` in the server, the Tokio runtime and worker thread continue running. - Folly registers a low-level hook that is run when the program actually terminates (after the OCaml runtime has finished its own shutdown logic) - This low-level hook can deadlock when the `EdenfsWatcherInstance` is still running. I'm not sure what exactly is causing the deadlock (the fact that the worker thread is still running or the mere existence of the Tokio runtime, ...) Concretely, this problems manifests itself by the server sometimes getting stuck when calling `Exit.exit`. This diff changes the initialization code for `Edenfs_watcher` instances such that whenever we create an instance, we call `Exit.add_hook_upon_clean_exit` to register a hook that will properly shut down the instance. Note that this hook mechanism is part of our own code, and is run before the lower-level exit handler installed by Folly. ## Alternatives considered: OCaml has a "cleanup_on_exit mode, which among other things should call all finalizers (see https://ocaml.org/manual/5.3/runtime.html#s:ocamlrun-options). However, it seems to be buggy in OCaml 5.x (see ocaml/ocaml#10865 (comment)) and running hh_server with `OCAMLRUNPARAM=c` doesn't fix our problem. I'm not sure if we would want to use it anyway, as it may slow down server restarts. # Facebook Here's a stack trace where we get stuck. I've obtained it by attaching gdb to the server process: ``` #0 __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x0, op=265, expected=3432193, futex_word=0x7fffb4000910) at futex-internal.c:57 #1 __futex_abstimed_wait_common (cancel=true, private=<optimized out>, abstime=0x0, clockid=<optimized out>, expected=3432193, futex_word=0x7fffb4000910) at futex-internal.c:87 facebook#2 __GI___futex_abstimed_wait_cancelable64 (futex_word=0x7fffb4000910, expected=3432193, clockid=<optimized out>, abstime=0x0, private=<optimized out>) at futex-internal.c:139 facebook#3 0x00007ffff729c793 in __pthread_clockjoin_ex (threadid=140736213288512, thread_return=0x0, clockid=0, abstime=0x0, block=<optimized out>) at pthread_join_common.c:105 facebook#4 0x00007ffff7cdf84f in __gthread_join (__value_ptr=0x0, __threadid=<optimized out>) at /home/engshare/third-party2/libgcc/11.x/src/gcc-11.x/x86_64-facebook-linux/libstdc++-v3/include/x86_64-facebook-linux/bits/gthr-default.h:669 facebook#5 std::thread::join (this=0x7fffbe209110) at ../../../.././libstdc++-v3/src/c++11/thread.cc:112 facebook#6 0x00000000062a810d in folly::ThreadPoolExecutor::joinStoppedThreads(unsigned long) () facebook#7 0x00000000062a89ca in folly::ThreadPoolExecutor::stopAndJoinAllThreads(bool) () facebook#8 0x00000000062a395b in folly::IOThreadPoolExecutor::~IOThreadPoolExecutor() () facebook#9 0x00000000062a1975 in folly::IOThreadPoolExecutor::~IOThreadPoolExecutor() () facebook#10 0x0000000006452c09 in folly::detail::SingletonHolder<folly::IOThreadPoolExecutor>::destroyInstance() () facebook#11 0x0000000006fa3c0c in folly::SingletonVault::destroyInstances() () facebook#12 0x00007ffff72478b8 in __run_exit_handlers (status=4, listp=0x7ffff7412658 <__exit_funcs>, run_list_atexit=<optimized out>, run_dtors=<optimized out>) at exit.c:113 facebook#13 0x00007ffff72479ca in __GI_exit (status=<optimized out>) at exit.c:143 facebook#14 0x000000001f401e62 in caml_do_exit (retcode=4) at runtime/sys.c:200 facebook#15 0x000000001f4020dc in caml_sys_exit (retcode=<optimized out>) at runtime/sys.c:205 facebook#16 <signal handler called> facebook#17 0x000000001f33c3f8 in camlStdlib.exit_1534 () at stdlib.ml:580 facebook#18 0x000000001f030ec0 in camlExit.exit_24 () at fbcode/hphp/hack/src/utils/exit.ml:66 facebook#19 0x000000001c2bb13b in camlServerMain.exit_if_critical_update_249 () at fbcode/hphp/hack/src/server/serverMain.ml:108 facebook#20 0x000000001c2bb885 in camlServerMain.query_notifier_549 () at fbcode/hphp/hack/src/server/serverMain.ml:225 facebook#21 0x000000001c2bbe32 in camlServerMain.recheck_until_no_changes_left_916 () at fbcode/hphp/hack/src/server/serverMain.ml:331 ``` So we are shutting down the `folly::IOThreadPoolExecutor`, which then gets stuck waiting for something. A comment inside `__pthread_clockjoin_ex` at the point where we get stucks reads says ``` /* The kernel notifies a process which uses CLONE_CHILD_CLEARTID via futex wake-up when the clone terminates. The memory location contains the thread ID while the clone is running and is reset to zero by the kernel afterwards. The kernel up to version 3.16.3 does not use the private futex operations for futex wake-up when the clone terminates. */ ``` Differential Revision: D76737597 fbshipit-source-id: 979b4e9b3ae88a07fcf62f0958fe41372624d00b
diff --git a/hphp/hack/test/integration/test_edenfs_file_watcher.py b/hphp/hack/test/integration/test_edenfs_file_watcher.py
@@ -97,6 +97,12 @@ def assertServerLogContains(driver: CommonTestDriver, needle: str) -> None:
     driver.assertTrue(contains)
 
 
+def assertMonitorLogContains(driver: CommonTestDriver, needle: str) -> None:
+    monitor_log = driver.get_all_logs(driver.repo_dir).all_monitor_logs
+
+    driver.assertTrue(needle in monitor_log)
+
+
 def assertEdenFsWatcherInitialized(driver: CommonTestDriver) -> None:
     """Checks that an Edenfs_watcher instance was initialized
 
@@ -111,7 +117,6 @@ def assertServerNotCrashed(driver: CommonTestDriver) -> None:
     monitor_log = driver.get_all_logs(driver.repo_dir).all_monitor_logs
 
     driver.assertFalse("Exit_status.Edenfs_watcher_failed" in monitor_log)
-    driver.assertFalse("writing crash log" in monitor_log)
 
 
 class EdenfsWatcherTestDriver(common_tests.CommonTestDriver):
@@ -498,6 +503,23 @@ def test_folder_rename(self) -> None:
             ]
         )
 
+    def test_hhconfig_change(self) -> None:
+        self.test_driver.start_hh_server()
+
+        hhconfig_file = os.path.join(self.test_driver.repo_dir, ".hhconfig")
+        with open(hhconfig_file, "a") as f:
+            f.write("# a comment to change hhconfig")
+
+        # This will make us wait until the new server has come up.
+        # Error message should be the same as before.
+        self.test_driver.check_cmd(["No errors!"])
+
+        # Let's make sure that we did indeed restart due to the hhconfig change
+        assertMonitorLogContains(
+            self.test_driver,
+            "Exit_status.Hhconfig_changed",
+        )
+
     # Note that we inherit all tests from CommonTests and run them,
     # but using EdenfsWatcherTestDriver