| 1 | +--- |
| 2 | +title: The File Descriptor Store |
| 3 | +category: Interfaces |
| 4 | +layout: default |
| 5 | +SPDX-License-Identifier: LGPL-2.1-or-later |
| 6 | +--- |
| 7 | + |
| 8 | +# The File Descriptor Store |
| 9 | + |
| 10 | +*TL;DR: The systemd service manager may optionally maintain a set of file
| 11 | +descriptors for each service. These fds remain under the control of the
| 12 | +service and make it easier to implement service restarts that do not lose
| 13 | +connectivity or context.*
| 14 | + |
| 15 | +Since its inception `systemd` has supported the *socket* *activation* |
| 16 | +mechanism: the service manager creates and listens on some sockets (and similar |
| 17 | +UNIX file descriptors) on behalf of a service, and then passes them to the |
| 18 | +service during activation of the service via UNIX file descriptor (short: *fd*) |
| 19 | +passing over `execve()`. This is primarily exposed in the |
| 20 | +[.socket](https://www.freedesktop.org/software/systemd/man/systemd.socket.html) |
| 21 | +unit type. |
| 22 | + |
| 23 | +The *file* *descriptor* *store* (short: *fdstore*) extends this concept, and |
| 24 | +allows services to *upload* during runtime additional fds to the service |
| 25 | +manager that it shall keep on its behalf. File descriptors are passed back to |
| 26 | +the service on subsequent activations, the same way as any socket activation |
| 27 | +fds are passed. |
| 28 | + |
| 29 | +When a service passes an fd to the fdstore logic of the service manager, the
| 30 | +manager only maintains a duplicate of it (in the sense of UNIX
| 31 | +[`dup(2)`](https://man7.org/linux/man-pages/man2/dup.2.html)). The fd also
| 32 | +remains in possession of the service itself, which may (and is expected to)
| 33 | +invoke any operations on it that it likes.
| 34 | + |
| 35 | +The primary use case of this logic is to permit services to restart seamlessly
| 36 | +(for example to update them to a newer version), without losing execution |
| 37 | +context, dropping pinned resources, terminating established connections or even |
| 38 | +just momentarily losing connectivity. In fact, as the file descriptors can be |
| 39 | +uploaded freely at any time during the service runtime, this can even be used to |
| 40 | +implement services that robustly handle abnormal termination and can recover |
| 41 | +from that without losing pinned resources. |
| 42 | + |
| 43 | +Note that Linux supports the |
| 44 | +[`memfd`](https://man7.org/linux/man-pages/man2/memfd_create.2.html) concept |
| 45 | +that allows associating a memory-backed fd with arbitrary data. A service may
| 46 | +conveniently serialize its state into such a memfd and then place it in the
| 47 | +fdstore, in order to implement service restarts with the full service state
| 48 | +being passed over.
| 49 | + |
| 50 | +# Basic Mechanism |
| 51 | + |
| 52 | +The fdstore is enabled per-service via the |
| 53 | +[`FileDescriptorStoreMax=`](https://www.freedesktop.org/software/systemd/man/systemd.service.html#FileDescriptorStoreMax=) |
| 54 | +service setting. It defaults to zero (which means the fdstore logic is turned |
| 55 | +off), but can take an unsigned integer value that controls how many fds to |
| 56 | +permit the service to upload to the service manager to keep simultaneously. |
| 57 | + |
| 58 | +If set to values > 0, the fdstore is enabled. When invoked the service may now |
| 59 | +(asynchronously) upload file descriptors to the fdstore via the |
| 60 | +[`sd_pid_notify_with_fds()`](https://www.freedesktop.org/software/systemd/man/sd_pid_notify_with_fds.html) |
| 61 | +API call (or an equivalent reimplementation). When uploading the fds it is |
| 62 | +necessary to set the `FDSTORE=1` field in the message, to indicate what the fd |
| 63 | +is intended for. It's recommended to also set the `FDNAME=…` field to any |
| 64 | +string of choice, which may be used to identify the fd later. |
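The upload can be sketched in Python as an "equivalent reimplementation" of the notification protocol, rather than as a call into `sd_pid_notify_with_fds()` itself. The helper name `store_fd()` and its `path` parameter are hypothetical. The sketch assumes that the notification socket is the `AF_UNIX` datagram socket named by `$NOTIFY_SOCKET`, that a leading `@` denotes an abstract socket, and that message fields are separated by newlines, as in the `sd_notify()` protocol.

```python
import array
import os
import socket


def store_fd(fd, name, path=None):
    """Upload a duplicate of `fd` to the fdstore under the name `name`.

    Returns False if no notification socket is configured.
    """
    path = path or os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False  # apparently not started by a service manager
    if path.startswith("@"):
        path = "\0" + path[1:]  # abstract namespace socket
    message = f"FDSTORE=1\nFDNAME={name}".encode()
    # The fd travels as SCM_RIGHTS ancillary data next to the message.
    ancillary = [(socket.SOL_SOCKET, socket.SCM_RIGHTS,
                  array.array("i", [fd]))]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendmsg([message], ancillary, 0, path)
    return True
```

The service keeps using its own copy of the fd after the call. The receiving side only ever holds a duplicate.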
| 65 | + |
| 66 | +Whenever the service is restarted the fds in its fdstore will be passed to the |
| 67 | +new instance following the same protocol as for socket activation fds, i.e. the
| 68 | +`$LISTEN_FDS`, `$LISTEN_PID`, `$LISTEN_FDNAMES` environment variables will be
| 69 | +set (the latter will be populated from the `FDNAME=…` field mentioned |
| 70 | +above). See |
| 71 | +[`sd_listen_fds()`](https://www.freedesktop.org/software/systemd/man/sd_listen_fds.html) |
| 72 | +for details on receiving such fds in a service. (Note that the name set in |
| 73 | +`FDNAME=…` does not need to be unique, which is useful when operating with |
| 74 | +multiple fully equivalent sockets or similar, for example for a service that |
| 75 | +both operates on IPv4 and IPv6 and treats both more or less the same.)
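On the receiving side, a service without access to `sd_listen_fds()` could parse the environment itself. The Python sketch below is one way to do so. The function name `received_fds()` is hypothetical. It assumes that passed fds are numbered consecutively starting at 3 and that `$LISTEN_FDNAMES` is a colon-separated list, as described in the `sd_listen_fds()` documentation.

```python
import os

SD_LISTEN_FDS_START = 3  # first passed fd, per the sd_listen_fds() docs


def received_fds(environ=None, pid=None):
    """Return a list of (fd, name) pairs passed in by the service manager."""
    environ = os.environ if environ is None else environ
    pid = os.getpid() if pid is None else pid
    # The variables only apply to the process they were set for.
    if environ.get("LISTEN_PID") != str(pid):
        return []
    count = int(environ.get("LISTEN_FDS", "0"))
    raw_names = environ.get("LISTEN_FDNAMES")
    names = raw_names.split(":") if raw_names else []
    return [
        (SD_LISTEN_FDS_START + i, names[i] if i < len(names) else "unknown")
        for i in range(count)
    ]
```

Since names need not be unique, a caller would typically group the result by name rather than build a plain name-to-fd dictionary.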
| 76 | + |
| 77 | +And that's already the gist of it. |
| 78 | + |
| 79 | +# Seamless Service Restarts |
| 80 | + |
| 81 | +A system service that provides a client-facing interface that shall be able to |
| 82 | +seamlessly restart can make use of this in a scheme like the following: |
| 83 | +whenever a new connection comes in it uploads its fd immediately into its |
| 84 | +fdstore. At appropriate times it also serializes its state into a memfd it
| 85 | +uploads to the service manager — either whenever the state changed |
| 86 | +sufficiently, or simply right before it terminates. (The latter of course means |
| 87 | +that state only survives on *clean* restarts and abnormal termination implies the |
| 88 | +state is lost completely — while the former would mean there's a good chance the |
| 89 | +next restart after an abnormal termination could continue where it left off |
| 90 | +with only some context lost.) |
| 91 | + |
| 92 | +Using the fdstore for such seamless service restarts is generally recommended |
| 93 | +over implementations that attempt to leave a process from the old service |
| 94 | +instance around until after the new instance already started, so that the old |
| 95 | +then communicates with the new service instance, and passes the fds over |
| 96 | +directly. Typically service restarts are a mechanism for implementing *code* |
| 97 | +updates, hence leaving two versions of the service running at the same time is
| 98 | +generally problematic. It also collides with the systemd service manager's |
| 99 | +general principle of guaranteeing a pristine execution environment, a pristine |
| 100 | +security context, and a pristine resource management context for freshly |
| 101 | +started services, without uncontrolled "left-overs" from previous runs. For |
| 102 | +example: leaving processes from previous runs generally negatively affects |
| 103 | +lifecycle management (i.e. `KillMode=none` must be set, which disables large
| 104 | +parts of the service manager's state tracking), resource management (as resource
| 105 | +counters cannot start at zero during service activation anymore, since the old
| 106 | +processes remaining skew them), security policies (as processes with possibly
| 107 | +out-of-date security policies, e.g. SELinux, AppArmor, any LSM, seccomp or BPF,
| 108 | +in effect remain), and similar.
| 109 | + |
| 110 | +# File Descriptor Store Lifecycle |
| 111 | + |
| 112 | +By default any file descriptor stored in the fdstore for which a `POLLHUP` or |
| 113 | +`POLLERR` is seen is automatically closed and removed from the fdstore. This |
| 114 | +behaviour can be turned off, by setting the `FDPOLL=0` field when uploading the |
| 115 | +fd via `sd_pid_notify_with_fds()`.
| 116 | + |
| 117 | +The fdstore is automatically closed whenever the service is fully deactivated |
| 118 | +and no jobs are queued for it anymore. This means that a restart job for a |
| 119 | +service will leave the fdstore intact, but a separate stop and start job for |
| 120 | +it — executed synchronously one after the other — will likely not. |
| 121 | + |
| 122 | +This behaviour can be modified via the |
| 123 | +[`FileDescriptorStorePreserve=`](https://www.freedesktop.org/software/systemd/man/systemd.service.html#FileDescriptorStorePreserve=) |
| 124 | +setting in service unit files. If set to `yes` the fdstore will be kept as long |
| 125 | +as the service definition is loaded into memory by the service manager, i.e. as |
| 126 | +long as at least one other loaded unit has a reference to it. |
| 127 | + |
| 128 | +The `systemctl clean --what=fdstore …` command may be used to explicitly clear |
| 129 | +the fdstore of a service. This is only allowed when the service is fully |
| 130 | +deactivated, and is hence primarily useful in case |
| 131 | +`FileDescriptorStorePreserve=yes` is set (because the fdstore is otherwise |
| 132 | +fully closed anyway in this state). |
| 133 | + |
| 134 | +Individual file descriptors may be removed from the fdstore via the |
| 135 | +`sd_notify()` mechanism, by sending an `FDSTOREREMOVE=1` message, accompanied |
| 136 | +by an `FDNAME=…` string identifying the fds to remove. (The name does not have |
| 137 | +to be unique, as mentioned, in which case *all* matching fds are |
| 138 | +closed). Generally it's a good idea to send such messages to the service |
| 139 | +manager during initialization of the service whenever an unrecognized fd is |
| 140 | +received, to make the service robust for code updates: if an old version |
| 141 | +uploaded an fd that the new version doesn't recognize anymore it's a good idea to
| 142 | +close it both in the service and in the fdstore. |
| 143 | + |
| 144 | +Note that storing a duplicate of an fd in the fdstore means the fd remains |
| 145 | +pinned even if the service closes it. This in particular means that peers on a |
| 146 | +connection socket uploaded this way will not receive an automatic `POLLHUP` |
| 147 | +event anymore if the service code issues `close()` on the socket. The service
| 148 | +must accompany the `close()` with an `FDSTOREREMOVE=1` notification to the
| 149 | +service manager, so that the fd is comprehensively closed.
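A minimal Python sketch of closing an fd both locally and in the fdstore follows. The helper name `close_and_remove()` and its `path` parameter are hypothetical. As with any hand-written notification code, it assumes `$NOTIFY_SOCKET` names an `AF_UNIX` datagram socket and that fields are newline-separated. Because names need not be unique, the removal applies to every stored fd carrying the given name.

```python
import os
import socket


def close_and_remove(fd, name, path=None):
    """Close `fd` locally and ask the service manager to drop its duplicates.

    Returns False if no notification socket is configured; the local
    close() happens either way.
    """
    os.close(fd)
    path = path or os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        path = "\0" + path[1:]  # abstract namespace socket
    message = f"FDSTOREREMOVE=1\nFDNAME={name}".encode()
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message, path)
    return True
```

Only once both copies are gone can a peer on a connection socket observe the hang-up.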
| 150 | + |
| 151 | +# Access Control |
| 152 | + |
| 153 | +Access to the fds in the file descriptor store is generally restricted to the |
| 154 | +service code itself. Pushing fds into or removing fds from the fdstore is |
| 155 | +subject to the access control restrictions of any other `sd_notify()` message, |
| 156 | +which is controlled via |
| 157 | +[`NotifyAccess=`](https://www.freedesktop.org/software/systemd/man/systemd.service.html#NotifyAccess=). |
| 158 | + |
| 159 | +By default only the main service process can hence push/remove fds, but by
| 160 | +setting `NotifyAccess=all` this may be relaxed to allow arbitrary service
| 161 | +child processes to do the same. |
| 162 | + |
| 163 | +# Soft Reboot |
| 164 | + |
| 165 | +The fdstore is particularly interesting in [soft |
| 166 | +reboot](https://www.freedesktop.org/software/systemd/man/systemd-soft-reboot.service.html) |
| 167 | +scenarios, as per `systemctl soft-reboot` (which restarts userspace like in a |
| 168 | +real reboot, but leaves the kernel running). File descriptor stores that remain |
| 169 | +loaded at the very end of the system cycle (just before the soft-reboot) are
| 170 | +passed over to the next system cycle, and propagated there to the services they
| 171 | +originate from. This enables updating the full userspace of a system during
| 172 | +runtime, fully replacing all processes without losing pinned resources,
| 173 | +interrupting connectivity or established connections and similar.
| 174 | + |
| 175 | +This mechanism can be enabled either by making sure the service survives until |
| 176 | +the very end (i.e. by setting `DefaultDependencies=no` so that it keeps running |
| 177 | +for the whole system lifetime without being regularly deactivated at shutdown) |
| 178 | +or by setting `FileDescriptorStorePreserve=yes` (and referencing the unit
| 179 | +continuously).
| 180 | + |
| 181 | +# Debugging |
| 182 | + |
| 183 | +The |
| 184 | +[`systemd-analyze`](https://www.freedesktop.org/software/systemd/man/systemd-analyze.html#systemd-analyze%20fdstore%20%5BUNIT...%5D) |
| 185 | +tool may be used to list the current contents of the fdstore of any running |
| 186 | +service. |
| 187 | + |
| 188 | +The |
| 189 | +[`systemd-run`](https://www.freedesktop.org/software/systemd/man/systemd-run.html) |
| 190 | +tool may be used to quickly start a testing binary or similar as a service. Use |
| 191 | +`-p FileDescriptorStoreMax=4711` to enable the fdstore from `systemd-run`'s
| 192 | +command line. By using the `-t` switch you can even interactively communicate
| 193 | +with processes spawned that way, via the TTY.