|
| 1 | +.. SPDX-License-Identifier: GPL-2.0 |
| 2 | + |
| 3 | +=============== |
| 4 | +fwctl subsystem |
| 5 | +=============== |
| 6 | + |
| 7 | +:Author: Jason Gunthorpe |
| 8 | + |
| 9 | +Overview |
| 10 | +======== |
| 11 | + |
| 12 | +Modern devices contain extensive amounts of FW, and in many cases, are largely |
| 13 | +software-defined pieces of hardware. The evolution of this approach is largely a |
| 14 | +reaction to Moore's Law where a chip tape out is now highly expensive, and the |
| 15 | +chip design is extremely large. Replacing fixed HW logic with a flexible and |
| 16 | +tightly coupled FW/HW combination is an effective risk mitigation against chip |
| 17 | +respin. Problems in the HW design can be counteracted in device FW. This is |
| 18 | +especially true for devices which present a stable and backwards compatible |
| 19 | +interface to the operating system driver (such as NVMe). |
| 20 | + |
| 21 | +The FW layer in devices has grown to incredible size and devices frequently |
| 22 | +integrate clusters of fast processors to run it. For example, mlx5 devices have |
| 23 | +over 30MB of FW code, and big configurations operate with over 1GB of FW managed |
| 24 | +runtime state. |
| 25 | + |
| 26 | +The availability of such a flexible layer has created quite a variety in the |
| 27 | +industry where single pieces of silicon are now configurable software-defined |
| 28 | +devices and can operate in substantially different ways depending on the need. |
| 29 | +Further, we often see cases where specific sites wish to operate devices in ways |
| 30 | +that are highly specialized and require applications that have been tailored to |
| 31 | +their unique configuration. |
| 32 | + |
| 33 | +Further, devices have become multi-functional and integrated to the point they |
| 34 | +no longer fit neatly into the kernel's division of subsystems. Modern |
| 35 | +multi-functional devices have drivers, such as bnxt/ice/mlx5/pds, that span many |
| 36 | +subsystems while sharing the underlying hardware using the auxiliary device |
| 37 | +system. |
| 38 | + |
| 39 | +All together this creates a challenge for the operating system, where devices |
| 40 | +have an expansive FW environment that needs robust device-specific debugging |
| 41 | +support, and FW-driven functionality that is not well suited to “generic” |
| 42 | +interfaces. fwctl seeks to allow access to the full device functionality from |
| 43 | +user space in the areas of debuggability, management, and first-boot/nth-boot |
| 44 | +provisioning. |
| 45 | + |
| 46 | +fwctl is aimed at the common device design pattern where the OS and FW |
| 47 | +communicate via an RPC message layer constructed with a queue or mailbox scheme. |
| 48 | +In this case the driver will typically have some layer to deliver RPC messages |
| 49 | +and collect RPC responses from device FW. The in-kernel subsystem drivers that |
| 50 | +operate the device for its primary purposes will use these RPCs to build their |
| 51 | +drivers, but devices also usually have a set of ancillary RPCs that don't really |
| 52 | +fit into any specific subsystem. For example, a HW RAID controller is primarily |
| 53 | +operated by the block layer but also comes with a set of RPCs to administer the |
| 54 | +construction of drives within the HW RAID. |
| 55 | + |
| 56 | +In the past when devices were more single function, individual subsystems would |
| 57 | +grow different approaches to solving some of these common problems. For |
| 58 | +instance, monitoring device health, manipulating its FLASH, debugging the FW, |
| 59 | +and provisioning all have various unique interfaces across the kernel. |
| 60 | + |
| 61 | +fwctl's purpose is to define a common set of limited rules, described below, |
| 62 | +that allow user space to securely construct and execute RPCs inside device FW. |
| 63 | +The rules serve as an agreement between the operating system and FW on how to |
| 64 | +correctly design the RPC interface. As a uAPI the subsystem provides a thin |
| 65 | +layer of discovery and a generic uAPI to deliver the RPCs and collect the |
| 66 | +response. It supports a system of user space libraries and tools which will |
| 67 | +use this interface to control the device using the device native protocols. |
| 68 | + |
| 69 | +Scope of Action |
| 70 | +--------------- |
| 71 | + |
| 72 | +fwctl drivers are strictly restricted to being a way to operate the device FW. |
| 73 | +fwctl is not an avenue to access random kernel internals, or other operating |
| 74 | +system SW state. |
| 75 | + |
| 76 | +fwctl instances must operate on a well-defined device function, and the device |
| 77 | +should have a well-defined security model for what scope within the physical |
| 78 | +device the function is permitted to access. For instance, the most complex PCIe |
| 79 | +device today may broadly have several function-level scopes: |
| 80 | + |
| 81 | + 1. A privileged function with full access to the on-device global state and |
| 82 | + configuration |
| 83 | + |
| 84 | + 2. Multiple hypervisor functions, each with control over itself and its child |
| 85 | + functions used with VMs |
| 86 | + |
| 87 | + 3. Multiple VM functions tightly scoped within the VM |
| 88 | + |
| 89 | +The device may create a logical parent/child relationship between these scopes. |
| 90 | +For instance a child VM's FW may be within the scope of the hypervisor FW. It is |
| 91 | +quite common in the VFIO world that the hypervisor environment has a complex |
| 92 | +provisioning/profiling/configuration responsibility for the function VFIO |
| 93 | +assigns to the VM. |
| 94 | + |
| 95 | +Further, within the function, devices often have RPC commands that fall within |
| 96 | +some general scopes of action (see enum fwctl_rpc_scope): |
| 97 | + |
| 98 | + 1. Access to function & child configuration, FLASH, etc. that becomes live at a |
| 99 | + function reset. Access to function & child runtime configuration that is |
| 100 | + transparent or non-disruptive to any driver or VM. |
| 101 | + |
| 102 | + 2. Read-only access to function debug information that may report on FW objects |
| 103 | + in the function & child, including FW objects owned by other kernel |
| 104 | + subsystems. |
| 105 | + |
| 106 | + 3. Write access to function & child debug information strictly compatible with |
| 107 | + the principles of kernel lockdown and kernel integrity protection. Triggers |
| 108 | + a kernel Taint. |
| 109 | + |
| 110 | + 4. Full debug device access. Triggers a kernel Taint, requires CAP_SYS_RAWIO. |
| 111 | + |
| 112 | +User space will provide a scope label on each RPC and the kernel must enforce the |
| 113 | +above CAPs and taints based on that scope. A combination of kernel and FW can |
| 114 | +enforce that RPCs are placed in the correct scope by user space. |
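
The enforcement rule above can be sketched in plain C. This is a user space model under stated assumptions, not the kernel implementation: the enumerator names, `struct policy_result` and `check_scope()` are hypothetical stand-ins, and the authoritative values are whatever enum fwctl_rpc_scope defines in the uAPI header.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical labels for the four scopes listed above, in list order. */
enum rpc_scope {
	SCOPE_CONFIGURATION,
	SCOPE_DEBUG_READ_ONLY,
	SCOPE_DEBUG_WRITE,
	SCOPE_DEBUG_WRITE_FULL,
};

/* Outcome of the policy check for one RPC. */
struct policy_result {
	int err;	/* 0 if the RPC may proceed, negative errno otherwise */
	bool taint;	/* true if the kernel would be tainted */
};

/*
 * Model of the per-scope rules: scopes 1 and 2 need nothing extra, scope 3
 * taints, scope 4 taints and also requires CAP_SYS_RAWIO (has_rawio).
 */
static struct policy_result check_scope(enum rpc_scope scope, bool has_rawio)
{
	struct policy_result res = { .err = 0, .taint = false };

	switch (scope) {
	case SCOPE_CONFIGURATION:
	case SCOPE_DEBUG_READ_ONLY:
		break;
	case SCOPE_DEBUG_WRITE:
		res.taint = true;
		break;
	case SCOPE_DEBUG_WRITE_FULL:
		if (has_rawio)
			res.taint = true;
		else
			res.err = -EPERM;	/* rejected before any taint */
		break;
	default:
		res.err = -EOPNOTSUPP;	/* unknown scope label */
		break;
	}
	return res;
}
```

In this model a rejected RPC never taints, since the capability check runs before the taint is recorded.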
| 115 | + |
| 116 | +Denied behavior |
| 117 | +--------------- |
| 118 | + |
| 119 | +There are many things this interface must not allow user space to do (without a |
| 120 | +Taint or CAP), broadly derived from the principles of kernel lockdown. Some |
| 121 | +examples: |
| 122 | + |
| 123 | + 1. DMA to/from arbitrary memory, hang the system, compromise FW integrity with |
| 124 | + untrusted code, or otherwise compromise device or system security and |
| 125 | + integrity. |
| 126 | + |
| 127 | + 2. Provide an abnormal “back door” to kernel drivers. No manipulation of kernel |
| 128 | + objects owned by kernel drivers. |
| 129 | + |
| 130 | + 3. Directly configure or otherwise control kernel drivers. A subsystem kernel |
| 131 | + driver can react to the device configuration at function reset/driver load |
| 132 | + time, but otherwise must not be coupled to fwctl. |
| 133 | + |
| 134 | + 4. Operate the HW in a way that overlaps with the core purpose of another |
| 135 | + primary kernel subsystem, such as read/write to LBAs, send/receive of |
| 136 | + network packets, or operate an accelerator's data plane. |
| 137 | + |
| 138 | +fwctl is not a replacement for device direct access subsystems like uacce or |
| 139 | +VFIO. |
| 140 | + |
| 141 | +Operations exposed through fwctl's non-tainting interfaces should be fully |
| 142 | +sharable with other users of the device. For instance, exposing an RPC through |
| 143 | +fwctl should never prevent a kernel subsystem from also concurrently using that |
| 144 | +same RPC or hardware unit down the road. In such cases fwctl will be less |
| 145 | +important than proper kernel subsystems that eventually emerge. Mistakes in this |
| 146 | +area resulting in clashes will be resolved in favour of a kernel implementation. |
| 147 | + |
| 148 | +fwctl User API |
| 149 | +============== |
| 150 | + |
| 151 | +.. kernel-doc:: include/uapi/fwctl/fwctl.h |
| 152 | + |
| 153 | +sysfs Class |
| 154 | +----------- |
| 155 | + |
| 156 | +fwctl has a sysfs class (/sys/class/fwctl/fwctlNN/) and character devices |
| 157 | +(/dev/fwctl/fwctlNN) with a simple numbered scheme. The character device |
| 158 | +operates the ioctl uAPI described above. |
| 159 | + |
| 160 | +fwctl devices can be related to driver components in other subsystems through |
| 161 | +sysfs:: |
| 162 | + |
| 163 | + $ ls /sys/class/fwctl/fwctl0/device/infiniband/ |
| 164 | + ibp0s10f0 |
| 165 | + |
| 166 | + $ ls /sys/class/infiniband/ibp0s10f0/device/fwctl/ |
| 167 | + fwctl0/ |
| 168 | + |
| 169 | + $ ls /sys/devices/pci0000:00/0000:00:0a.0/fwctl/fwctl0 |
| 170 | + dev device power subsystem uevent |
| 171 | + |
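As a minimal sketch of how a tool might discover instances, the following C code walks the sysfs class directory and derives the matching character device path. It assumes only the /sys/class/fwctl/fwctlNN and /dev/fwctl/fwctlNN layout described above; the helper names `fwctl_dev_path()` and `fwctl_list_devices()` are hypothetical and not part of any fwctl library.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the character device path for a sysfs class entry name such as
 * "fwctl0". Returns 0 on success, -1 if the name does not look like a
 * fwctl instance or the buffer is too small.
 */
static int fwctl_dev_path(const char *name, char *buf, size_t len)
{
	int n;

	if (strncmp(name, "fwctl", 5) != 0 || name[5] == '\0')
		return -1;
	n = snprintf(buf, len, "/dev/fwctl/%s", name);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Print the device node of every instance found in the sysfs class.
 * Returns the number of instances, or -1 if the class directory is absent.
 */
static int fwctl_list_devices(void)
{
	DIR *dir = opendir("/sys/class/fwctl");
	struct dirent *ent;
	char path[280];
	int count = 0;

	if (!dir)
		return -1;
	while ((ent = readdir(dir)) != NULL) {
		/* "." and ".." fail the prefix check and are skipped. */
		if (fwctl_dev_path(ent->d_name, path, sizeof(path)) == 0) {
			printf("%s\n", path);
			count++;
		}
	}
	closedir(dir);
	return count;
}
```

A tool would then open() the printed path and issue the ioctls documented in the uAPI header; that step is left out here because it depends on the header being installed.
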
| 172 | +User space Community |
| 173 | +-------------------- |
| 174 | + |
| 175 | +Drawing inspiration from nvme-cli, participating in the kernel side must come |
| 176 | +with a user space in a common TBD git tree, at a minimum to usefully operate the |
| 177 | +kernel driver. Providing such an implementation is a pre-condition to merging a |
| 178 | +kernel driver. |
| 179 | + |
| 180 | +The goal is to build a user space community around some of the shared problems |
| 181 | +we all have, and ideally develop some common user space programs with some |
| 182 | +starting themes of: |
| 183 | + |
| 184 | + - Device in-field debugging |
| 185 | + |
| 186 | + - HW provisioning |
| 187 | + |
| 188 | + - VFIO child device profiling before VM boot |
| 189 | + |
| 190 | + - Confidential Compute topics (attestation, secure provisioning) |
| 191 | + |
| 192 | +that stretch across all subsystems in the kernel. fwupd is a great example of |
| 193 | +how an excellent user space experience can emerge out of kernel-side diversity. |
| 194 | + |
| 195 | +fwctl Kernel API |
| 196 | +================ |
| 197 | + |
| 198 | +.. kernel-doc:: drivers/fwctl/main.c |
| 199 | + :export: |
| 200 | +.. kernel-doc:: include/linux/fwctl.h |
| 201 | + |
| 202 | +fwctl Driver design |
| 203 | +------------------- |
| 204 | + |
| 205 | +In many cases a fwctl driver is going to be part of a larger cross-subsystem |
| 206 | +device possibly using the auxiliary_device mechanism. In that case several |
| 207 | +subsystems are going to be sharing the same device and FW interface layer so the |
| 208 | +device design must already provide for isolation and cooperation between kernel |
| 209 | +subsystems. fwctl should fit into that same model. |
| 210 | + |
| 211 | +Part of the driver should include a description of how its scope restrictions |
| 212 | +and security model work. The driver and FW together must ensure that RPCs |
| 213 | +provided by user space are mapped to the appropriate scope. If the validation is |
| 214 | +done in the driver then the validation can read a 'command effects' report from |
| 215 | +the device, or hardwire the enforcement. If the validation is done in the FW, |
| 216 | +then the driver should pass the fwctl_rpc_scope to the FW along with the command. |
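
The "hardwire the enforcement" option could look roughly like the table-driven check below. This is a sketch, not any driver's actual code: the opcodes, the scope enumerator names and `validate_rpc()` are hypothetical, and it assumes the scopes are ordered from least to most permissive so that a plain comparison is enough.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scope labels, ordered from least to most permissive. */
enum rpc_scope {
	SCOPE_CONFIGURATION,
	SCOPE_DEBUG_READ_ONLY,
	SCOPE_DEBUG_WRITE,
	SCOPE_DEBUG_WRITE_FULL,
};

/* One hardwired rule: the least permissive scope that may issue opcode. */
struct rpc_rule {
	uint16_t opcode;
	enum rpc_scope min_scope;
};

/* Made-up opcodes; a real driver would list its device's command set. */
static const struct rpc_rule rpc_rules[] = {
	{ 0x0100, SCOPE_CONFIGURATION },	/* set a boot-time option */
	{ 0x0200, SCOPE_DEBUG_READ_ONLY },	/* dump a FW object */
	{ 0x0300, SCOPE_DEBUG_WRITE },		/* change a debug setting */
	{ 0x0400, SCOPE_DEBUG_WRITE_FULL },	/* raw register access */
};

/*
 * Accept the RPC only if its opcode is known and the scope label supplied
 * by user space is at least as permissive as the rule requires.
 */
static int validate_rpc(uint16_t opcode, enum rpc_scope scope)
{
	size_t i;

	for (i = 0; i < sizeof(rpc_rules) / sizeof(rpc_rules[0]); i++) {
		if (rpc_rules[i].opcode != opcode)
			continue;
		return scope >= rpc_rules[i].min_scope ? 0 : -EPERM;
	}
	return -EINVAL;	/* unknown commands are rejected by default */
}
```

Rejecting unknown opcodes by default keeps new FW commands out of reach until the table is updated to classify them.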
| 217 | + |
| 218 | +The driver and FW must cooperate to ensure that either fwctl cannot allocate |
| 219 | +any FW resources, or any resources it does allocate are freed on FD closure. A |
| 220 | +driver primarily constructed around FW RPCs may find that its core PCI function |
| 221 | +and RPC layer belongs under fwctl with auxiliary devices connecting to other |
| 222 | +subsystems. |
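
One way to meet the "freed on FD closure" requirement is to track every FW resource against the open file that created it. The sketch below models that idea in user space C; `struct uctx`, `uctx_track()`, `uctx_close()` and `fw_free_handle()` are hypothetical names, and the FW release RPC is replaced by a counter so the example stays self-contained.

```c
#include <stdint.h>
#include <stdlib.h>

/* One FW resource handle allocated on behalf of an open file. */
struct fw_res {
	uint32_t handle;
	struct fw_res *next;
};

/* Hypothetical per-FD context. */
struct uctx {
	struct fw_res *resources;	/* handles this FD still owns */
	unsigned int freed;		/* handles released, for illustration */
};

/* Stand-in for the RPC that would release a handle inside the FW. */
static void fw_free_handle(struct uctx *ctx, uint32_t handle)
{
	(void)handle;
	ctx->freed++;
}

/* Record a handle so it can be released when the FD goes away. */
static int uctx_track(struct uctx *ctx, uint32_t handle)
{
	struct fw_res *res = malloc(sizeof(*res));

	if (!res)
		return -1;
	res->handle = handle;
	res->next = ctx->resources;
	ctx->resources = res;
	return 0;
}

/* Called on FD closure: release everything this FD still owns. */
static void uctx_close(struct uctx *ctx)
{
	while (ctx->resources) {
		struct fw_res *res = ctx->resources;

		ctx->resources = res->next;
		fw_free_handle(ctx, res->handle);
		free(res);
	}
}
```

The simpler alternative named above, never letting fwctl allocate FW resources at all, needs no tracking and is preferable where the device allows it.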
| 223 | + |
| 224 | +Each device type must be mindful of Linux's philosophy for stable ABI. The FW |
| 225 | +RPC interface does not have to meet a strictly stable ABI, but it does need to |
| 226 | +meet an expectation that user space tools that are deployed and in significant |
| 227 | +use don't needlessly break. FW upgrade and kernel upgrade should keep widely |
| 228 | +deployed tooling working. |
| 229 | + |
| 230 | +Development and debugging focused RPCs under more permissive scopes can have |
| 231 | +less stability if the tools using them are only run under exceptional |
| 232 | +circumstances and not for everyday use of the device. Debugging tools may even |
| 233 | +require exact version matching as they may require something similar to DWARF |
| 234 | +debug information from the FW binary. |
| 235 | + |
| 236 | +Security Response |
| 237 | +================= |
| 238 | + |
| 239 | +The kernel remains the gatekeeper for this interface. If violations of the |
| 240 | +scopes, security or isolation principles are found, we have options to let |
| 241 | +devices fix them with a FW update, push a kernel patch to parse and block RPC |
| 242 | +commands or push a kernel patch to block entire firmware versions/devices. |
| 243 | + |
| 244 | +While the kernel can always directly parse and restrict RPCs, it is expected |
| 245 | +that the existing kernel pattern of allowing drivers to delegate validation to |
| 246 | +FW will be a useful design. |
| 247 | + |
| 248 | +Existing Similar Examples |
| 249 | +========================= |
| 250 | + |
| 251 | +The approach described in this document is not a new idea. Direct, or near |
| 252 | +direct device access has been offered by the kernel in different areas for |
| 253 | +decades. With more devices wanting to follow this design pattern it is becoming |
| 254 | +clear that it is not entirely well understood and, more importantly, the |
| 255 | +security considerations are not well defined or agreed upon. |
| 256 | + |
| 257 | +Some examples: |
| 258 | + |
| 259 | + - HW RAID controllers. This includes RPCs to do things like compose drives into |
| 260 | + a RAID volume, configure RAID parameters, monitor the HW and more. |
| 261 | + |
| 262 | + - Baseboard managers. RPCs for configuring settings in the device and more |
| 263 | + |
| 264 | + - NVMe vendor command capsules. nvme-cli provides access to some monitoring |
| 265 | + functions that different products have defined, but more exist. |
| 266 | + |
| 267 | + - CXL also has a NVMe-like vendor command system. |
| 268 | + |
| 269 | + - DRM allows user space drivers to send commands to the device via kernel |
| 270 | + mediation |
| 271 | + |
| 272 | + - RDMA allows user space drivers to directly push commands to the device |
| 273 | + without kernel involvement |
| 274 | + |
| 275 | + - Various “raw” APIs, raw HID (SDL2), raw USB, NVMe Generic Interface, etc. |
| 276 | + |
| 277 | +The first four are examples of areas that fwctl intends to cover. The latter three |
| 278 | +are examples of denied behavior as they fully overlap with the primary purpose |
| 279 | +of a kernel subsystem. |
| 280 | + |
| 281 | +A key lesson learned from these past efforts is the importance of having a |
| 282 | +common user space project to use as a pre-condition for obtaining a kernel |
| 283 | +driver. Developing good community around useful software in user space is key to |
| 284 | +getting companies to fund participation to enable their products. |