.. SPDX-License-Identifier: GPL-2.0

===============
DMA and swiotlb
===============

swiotlb is a memory buffer allocator used by the Linux kernel DMA layer. It is
typically used when a device doing DMA can't directly access the target memory
buffer because of hardware limitations or other requirements. In such a case,
the DMA layer calls swiotlb to allocate a temporary memory buffer that conforms
to the limitations. The DMA is done to/from this temporary memory buffer, and
the CPU copies the data between the temporary buffer and the original target
memory buffer. This approach is generically called "bounce buffering", and the
temporary memory buffer is called a "bounce buffer".

Device drivers don't interact directly with swiotlb. Instead, drivers inform
the DMA layer of the DMA attributes of the devices they are managing, and use
the normal DMA map, unmap, and sync APIs when programming a device to do DMA.
These APIs use the device DMA attributes and kernel-wide settings to determine
if bounce buffering is necessary. If so, the DMA layer manages the allocation,
freeing, and sync'ing of bounce buffers. Since the DMA attributes are per
device, some devices in a system may use bounce buffering while others do not.
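
For example, a driver might map and unmap a streaming DMA buffer as in the
following sketch. The helper names are hypothetical; only the dma_map/unmap
calls are real APIs. Whether swiotlb bounce buffering happens underneath is
invisible at this level::

  #include <linux/dma-mapping.h>

  /* Hypothetical helper: map a driver-owned buffer for a device-to-memory
   * transfer. If the device needs bounce buffering, the DMA layer uses
   * swiotlb transparently; the driver only sees a dma_addr_t.
   */
  static int example_map_for_read(struct device *dev, void *buf, size_t len,
                                  dma_addr_t *dma_handle)
  {
          *dma_handle = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
          if (dma_mapping_error(dev, *dma_handle))
                  return -ENOMEM;

          /* Program *dma_handle into the device and start the transfer. */
          return 0;
  }

  /* Hypothetical helper: tear down the mapping after the transfer. If a
   * bounce buffer was used, this copies the data back and frees it.
   */
  static void example_unmap_after_read(struct device *dev,
                                       dma_addr_t dma_handle, size_t len)
  {
          dma_unmap_single(dev, dma_handle, len, DMA_FROM_DEVICE);
  }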

Because the CPU copies data between the bounce buffer and the original target
memory buffer, doing bounce buffering is slower than doing DMA directly to the
original memory buffer, and it consumes more CPU resources. So it is used only
when necessary for providing DMA functionality.

Usage Scenarios
---------------
swiotlb was originally created to handle DMA for devices with addressing
limitations. As physical memory sizes grew beyond 4 GiB, some devices could
only provide 32-bit DMA addresses. By allocating bounce buffer memory below
the 4 GiB line, these devices with addressing limitations could still work and
do DMA.

More recently, Confidential Computing (CoCo) VMs have the guest VM's memory
encrypted by default, and the memory is not accessible by the host hypervisor
and VMM. For the host to do I/O on behalf of the guest, the I/O must be
directed to guest memory that is unencrypted. CoCo VMs set a kernel-wide option
to force all DMA I/O to use bounce buffers, and the bounce buffer memory is set
up as unencrypted. The host does DMA I/O to/from the bounce buffer memory, and
the Linux kernel DMA layer does "sync" operations to cause the CPU to copy the
data to/from the original target memory buffer. The CPU copying bridges between
the unencrypted and the encrypted memory. This use of bounce buffers allows
device drivers to "just work" in a CoCo VM, with no modifications needed to
handle the memory encryption complexity.

Other edge case scenarios arise for bounce buffers. For example, when IOMMU
mappings are set up for a DMA operation to/from a device that is considered
"untrusted", the device should be given access only to the memory containing
the data being transferred. But if that memory occupies only part of an IOMMU
granule, other parts of the granule may contain unrelated kernel data. Since
IOMMU access control is per-granule, the untrusted device can gain access to
the unrelated kernel data. This problem is solved by bounce buffering the DMA
operation and ensuring that unused portions of the bounce buffers do not
contain any unrelated kernel data.

Core Functionality
------------------
The primary swiotlb APIs are swiotlb_tbl_map_single() and
swiotlb_tbl_unmap_single(). The "map" API allocates a bounce buffer of a
specified size in bytes and returns the physical address of the buffer. The
buffer memory is physically contiguous. The expectation is that the DMA layer
maps the physical memory address to a DMA address, and returns the DMA address
to the driver for programming into the device. If a DMA operation specifies
multiple memory buffer segments, a separate bounce buffer must be allocated for
each segment. swiotlb_tbl_map_single() always does a "sync" operation (i.e., a
CPU copy) to initialize the bounce buffer to match the contents of the original
buffer.

swiotlb_tbl_unmap_single() does the reverse. If the DMA operation might have
updated the bounce buffer memory and DMA_ATTR_SKIP_CPU_SYNC is not set, the
unmap does a "sync" operation to cause a CPU copy of the data from the bounce
buffer back to the original buffer. Then the bounce buffer memory is freed.
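
The following is a conceptual sketch of the map/unmap semantics just described.
It is not the in-tree code: the real prototypes live in include/linux/swiotlb.h
and take additional arguments (DMA direction, attributes, alignment masks), and
bounce_alloc()/bounce_free() are hypothetical stand-ins for swiotlb's internal
slot management::

  #include <linux/device.h>
  #include <linux/io.h>
  #include <linux/string.h>
  #include <linux/types.h>

  /* Hypothetical internal helpers, for illustration only. */
  phys_addr_t bounce_alloc(struct device *dev, size_t size);
  void bounce_free(struct device *dev, phys_addr_t tlb_addr, size_t size);

  /* Map: allocate a bounce buffer and initialize it from the original. */
  static phys_addr_t sketch_map(struct device *dev, phys_addr_t orig_addr,
                                size_t size)
  {
          phys_addr_t tlb_addr = bounce_alloc(dev, size);

          /* Initial "sync": the bounce buffer starts as a copy of the original. */
          memcpy(phys_to_virt(tlb_addr), phys_to_virt(orig_addr), size);
          return tlb_addr;
  }

  /* Unmap: copy any device-written data back, then free the bounce buffer.
   * In the real code the original address is looked up in swiotlb's data
   * structures rather than passed in (see "Data Structure Details" below).
   */
  static void sketch_unmap(struct device *dev, phys_addr_t tlb_addr,
                           phys_addr_t orig_addr, size_t size)
  {
          memcpy(phys_to_virt(orig_addr), phys_to_virt(tlb_addr), size);
          bounce_free(dev, tlb_addr, size);
  }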

swiotlb also provides "sync" APIs that correspond to the dma_sync_*() APIs that
a driver may use when control of a buffer transitions between the CPU and the
device. The swiotlb "sync" APIs cause a CPU copy of the data between the
original buffer and the bounce buffer. Like the dma_sync_*() APIs, the swiotlb
"sync" APIs support doing a partial sync, where only a subset of the bounce
buffer is copied to/from the original buffer.
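
For instance, a driver that keeps a long-lived streaming mapping and reuses it
across transfers might do the following (an illustrative sketch; process_data()
is a hypothetical driver function). Any swiotlb "sync" work happens inside the
dma_sync_*() calls::

  #include <linux/dma-mapping.h>

  void process_data(void *buf, size_t len);       /* hypothetical */

  static void example_handle_completion(struct device *dev, void *buf,
                                        dma_addr_t dma_handle, size_t len)
  {
          /* Device finished writing: copy the bounce buffer (if any) back. */
          dma_sync_single_for_cpu(dev, dma_handle, len, DMA_FROM_DEVICE);

          process_data(buf, len);

          /* Hand the buffer back to the device for the next transfer. */
          dma_sync_single_for_device(dev, dma_handle, len, DMA_FROM_DEVICE);
  }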

Core Functionality Constraints
------------------------------
The swiotlb map/unmap/sync APIs must operate without blocking, as they are
called by the corresponding DMA APIs which may run in contexts that cannot
block. Hence the default memory pool for swiotlb allocations must be
pre-allocated at boot time (but see Dynamic swiotlb below). Because swiotlb
allocations must be physically contiguous, the entire default memory pool is
allocated as a single contiguous block.

The need to pre-allocate the default swiotlb pool creates a boot-time tradeoff.
The pool should be large enough to ensure that bounce buffer requests can
always be satisfied, as the non-blocking requirement means requests can't wait
for space to become available. But a large pool potentially wastes memory, as
this pre-allocated memory is not available for other uses in the system. The
tradeoff is particularly acute in CoCo VMs that use bounce buffers for all DMA
I/O. These VMs use a heuristic to set the default pool size to ~6% of memory,
with a max of 1 GiB, which has the potential to be very wasteful of memory.
Conversely, the heuristic might produce a size that is insufficient, depending
on the I/O patterns of the workload in the VM. The dynamic swiotlb feature
described below can help, but has limitations. Better management of the swiotlb
default memory pool size remains an open issue.

A single allocation from swiotlb is limited to IO_TLB_SIZE * IO_TLB_SEGSIZE
bytes, which is 256 KiB with current definitions. When a device's DMA settings
are such that the device might use swiotlb, the maximum size of a DMA segment
must be limited to that 256 KiB. This value is communicated to higher-level
kernel code via dma_max_mapping_size() and swiotlb_max_mapping_size(). If the
higher-level code fails to account for this limit, it may make requests that
are too large for swiotlb, and get a "swiotlb full" error.
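
As a sketch of how a subsystem might respect this limit (the exact queue-limit
plumbing differs per subsystem and is not shown), the cap can be read with
dma_max_mapping_size()::

  #include <linux/dma-mapping.h>

  /* For a device that might use swiotlb, this returns at most 256 KiB
   * (slightly less when min_align_mask is non-zero, as described below).
   * The caller would feed the result into its request-size limits.
   */
  static size_t example_max_request_bytes(struct device *dev)
  {
          return dma_max_mapping_size(dev);
  }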

A key device DMA setting is "min_align_mask", which is a power of 2 minus 1
so that some number of low order bits are set, or it may be zero. swiotlb
allocations ensure these min_align_mask bits of the physical address of the
bounce buffer match the same bits in the address of the original buffer. When
min_align_mask is non-zero, it may produce an "alignment offset" in the address
of the bounce buffer that slightly reduces the maximum size of an allocation.
This potential alignment offset is reflected in the value returned by
swiotlb_max_mapping_size(), which can show up in places like
/sys/block/<device>/queue/max_sectors_kb. For example, if a device does not use
swiotlb, max_sectors_kb might be 512 KiB or larger. If a device might use
swiotlb, max_sectors_kb will be 256 KiB. When min_align_mask is non-zero,
max_sectors_kb might be even smaller, such as 252 KiB.
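
A driver declares this requirement via dma_set_min_align_mask(). The sketch
below assumes a hypothetical device that needs DMA addresses to preserve the
offset within a 4 KiB device page::

  #include <linux/dma-mapping.h>
  #include <linux/sizes.h>

  /* Ask the DMA layer to preserve the low 12 bits of the original buffer
   * address in any DMA (and hence bounce buffer) address for this device.
   */
  static int example_set_alignment(struct device *dev)
  {
          return dma_set_min_align_mask(dev, SZ_4K - 1);
  }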

swiotlb_tbl_map_single() also takes an "alloc_align_mask" parameter. This
parameter specifies that the allocation of bounce buffer space must start at a
physical address with the alloc_align_mask bits set to zero. But the actual
bounce buffer might start at a larger address if min_align_mask is non-zero.
Hence there may be pre-padding space that is allocated prior to the start of
the bounce buffer. Similarly, the end of the bounce buffer is rounded up to an
alloc_align_mask boundary, potentially resulting in post-padding space. Any
pre-padding or post-padding space is not initialized by swiotlb code. The
"alloc_align_mask" parameter is used by IOMMU code when mapping for untrusted
devices. It is set to the granule size - 1 so that the bounce buffer is
allocated entirely from granules that are not used for any other purpose.
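
The arithmetic below is illustrative only (it is not the in-tree code). With a
4 KiB IOMMU granule, alloc_align_mask is 0xFFF, so the allocation starts on a
4 KiB boundary; if min_align_mask is also 0xFFF and the original buffer starts
0x800 bytes into a page, the bounce buffer starts 0x800 bytes into the
allocation, consuming one 2 KiB slot of pre-padding::

  #include <linux/swiotlb.h>      /* IO_TLB_SHIFT */
  #include <linux/types.h>

  /* Number of whole slots consumed as pre-padding for a given original
   * address and min_align_mask (illustrative computation only).
   */
  static unsigned int example_pad_slots(phys_addr_t orig_addr,
                                        unsigned int min_align_mask)
  {
          return (orig_addr & min_align_mask) >> IO_TLB_SHIFT;
  }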

Data structures concepts
------------------------
Memory used for swiotlb bounce buffers is allocated from overall system memory
as one or more "pools". The default pool is allocated during system boot with a
default size of 64 MiB. The default pool size may be modified with the
"swiotlb=" kernel boot line parameter. The default size may also be adjusted
due to other conditions, such as running in a CoCo VM, as described above. If
CONFIG_SWIOTLB_DYNAMIC is enabled, additional pools may be allocated later in
the life of the system. Each pool must be a contiguous range of physical
memory. The default pool is allocated below the 4 GiB physical address line so
it works for devices that can only address 32 bits of physical memory (unless
architecture-specific code provides the SWIOTLB_ANY flag). In a CoCo VM, the
pool memory must be decrypted before swiotlb is used.

Each pool is divided into "slots" of size IO_TLB_SIZE, which is 2 KiB with
current definitions. IO_TLB_SEGSIZE contiguous slots (128 slots) constitute
what might be called a "slot set". When a bounce buffer is allocated, it
occupies one or more contiguous slots. A slot is never shared by multiple
bounce buffers. Furthermore, a bounce buffer must be allocated from a single
slot set, which leads to the maximum bounce buffer size being IO_TLB_SIZE *
IO_TLB_SEGSIZE. Multiple smaller bounce buffers may co-exist in a single slot
set if the alignment and size constraints can be met.
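
These sizes come from constants in include/linux/swiotlb.h; the values shown
here are the current ones and could change::

  #define IO_TLB_SHIFT   11
  #define IO_TLB_SIZE    (1 << IO_TLB_SHIFT)   /* 2 KiB per slot */
  #define IO_TLB_SEGSIZE 128                   /* slots per slot set */

  /* Maximum bounce buffer size: 128 slots * 2 KiB = 256 KiB. */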

Slots are also grouped into "areas", with the constraint that a slot set exists
entirely in a single area. Each area has its own spin lock that must be held to
manipulate the slots in that area. The division into areas avoids contending
for a single global spin lock when swiotlb is heavily used, such as in a CoCo
VM. The number of areas defaults to the number of CPUs in the system for
maximum parallelism, but since an area can't be smaller than IO_TLB_SEGSIZE
slots, it might be necessary to assign multiple CPUs to the same area. The
number of areas can also be set via the "swiotlb=" kernel boot parameter.

When allocating a bounce buffer, if the area associated with the calling CPU
does not have enough free space, areas associated with other CPUs are tried
sequentially. For each area tried, the area's spin lock must be obtained before
trying an allocation, so contention may occur if swiotlb is relatively busy
overall. But an allocation request fails only if no area has enough free
space.
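
A simplified sketch of that search order follows (it is not the in-tree code;
try_alloc_from_area() is hypothetical). It relies on the number of areas being
a power of 2, as explained in the next paragraph::

  #include <linux/smp.h>

  bool try_alloc_from_area(unsigned int aindex);  /* hypothetical */

  static int example_find_area(unsigned int nareas)
  {
          unsigned int start = raw_smp_processor_id() & (nareas - 1);
          unsigned int i;

          for (i = 0; i < nareas; i++) {
                  unsigned int aindex = (start + i) & (nareas - 1);

                  if (try_alloc_from_area(aindex))
                          return aindex;
          }
          return -1;      /* every area is full: "swiotlb full" */
  }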

IO_TLB_SIZE, IO_TLB_SEGSIZE, and the number of areas must all be powers of 2 as
the code uses shifting and bit masking to do many of the calculations. The
number of areas is rounded up to a power of 2 if necessary to meet this
requirement.

The default pool is allocated with PAGE_SIZE alignment. If an alloc_align_mask
argument to swiotlb_tbl_map_single() specifies a larger alignment, one or more
initial slots in each slot set might not meet the alloc_align_mask criterion.
Because a bounce buffer allocation can't cross a slot set boundary, eliminating
those initial slots effectively reduces the max size of a bounce buffer.
Currently, there's no problem because alloc_align_mask is set based on IOMMU
granule size, and granules cannot be larger than PAGE_SIZE. But if that were to
change in the future, the initial pool allocation might need to be done with
alignment larger than PAGE_SIZE.

Dynamic swiotlb
---------------
When CONFIG_SWIOTLB_DYNAMIC is enabled, swiotlb can do on-demand expansion of
the amount of memory available for allocation as bounce buffers. If a bounce
buffer request fails due to lack of available space, an asynchronous background
task is kicked off to allocate memory from general system memory and turn it
into a swiotlb pool. Creating an additional pool must be done asynchronously
because the memory allocation may block, and as noted above, swiotlb requests
are not allowed to block. Once the background task is kicked off, the bounce
buffer request creates a "transient pool" to avoid returning a "swiotlb full"
error. A transient pool has the size of the bounce buffer request, and is
deleted when the bounce buffer is freed. Memory for this transient pool comes
from the general system memory atomic pool so that creation does not block.
Creating a transient pool has relatively high cost, particularly in a CoCo VM
where the memory must be decrypted, so it is done only as a stopgap until the
background task can add another non-transient pool.

Adding a dynamic pool has limitations. Like with the default pool, the memory
must be physically contiguous, so the size is limited to MAX_PAGE_ORDER pages
(e.g., 4 MiB on a typical x86 system). Due to memory fragmentation, a max size
allocation may not be available. The dynamic pool allocator tries smaller sizes
until it succeeds, but with a minimum size of 1 MiB. Given sufficient system
memory fragmentation, dynamically adding a pool might not succeed at all.

The number of areas in a dynamic pool may be different from the number of areas
in the default pool. Because the new pool size is typically a few MiB at most,
the number of areas will likely be smaller. For example, with a new pool size
of 4 MiB and the 256 KiB minimum area size, only 16 areas can be created. If
the system has more than 16 CPUs, multiple CPUs must share an area, creating
more lock contention.

New pools added via dynamic swiotlb are linked together in a linear list.
swiotlb code frequently must search for the pool containing a particular
swiotlb physical address, so that search is linear and not performant with a
large number of dynamic pools. The data structures could be improved for
faster searches.

Overall, dynamic swiotlb works best for small configurations with relatively
few CPUs. It allows the default swiotlb pool to be smaller so that memory is
not wasted, with dynamic pools making more space available if needed (as long
as fragmentation isn't an obstacle). It is less useful for large CoCo VMs.

Data Structure Details
----------------------
swiotlb is managed with four primary data structures: io_tlb_mem, io_tlb_pool,
io_tlb_area, and io_tlb_slot. io_tlb_mem describes a swiotlb memory allocator,
which includes the default memory pool and any dynamic or transient pools
linked to it. Limited statistics on swiotlb usage are kept per memory allocator
and are stored in this data structure. These statistics are available under
/sys/kernel/debug/swiotlb when CONFIG_DEBUG_FS is set.

io_tlb_pool describes a memory pool, either the default pool, a dynamic pool,
or a transient pool. The description includes the start and end addresses of
the memory in the pool, a pointer to an array of io_tlb_area structures, and a
pointer to an array of io_tlb_slot structures that are associated with the pool.

io_tlb_area describes an area. The primary field is the spin lock used to
serialize access to slots in the area. The io_tlb_area array for a pool has an
entry for each area, and is accessed using a 0-based area index derived from the
calling processor ID. Areas exist solely to allow parallel access to swiotlb
from multiple CPUs.

io_tlb_slot describes an individual memory slot in the pool, with size
IO_TLB_SIZE (2 KiB currently). The io_tlb_slot array is indexed by the slot
index computed from the bounce buffer address relative to the starting memory
address of the pool. The size of struct io_tlb_slot is 24 bytes, so the
overhead is about 1% of the slot size.
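
For reference, struct io_tlb_slot is defined in kernel/dma/swiotlb.c and, at
the time of writing, looks roughly like this (field order and types may change;
check the source)::

  struct io_tlb_slot {
          phys_addr_t orig_addr;          /* original buffer address */
          size_t alloc_size;              /* adjusted size, for sanity checks */
          unsigned short list;            /* free-slot run length, 0 if in use */
          unsigned short pad_slots;       /* pre-padding slots, first slot only */
  };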

The io_tlb_slot array is designed to meet several requirements. First, the DMA
APIs and the corresponding swiotlb APIs use the bounce buffer address as the
identifier for a bounce buffer. This address is returned by
swiotlb_tbl_map_single(), and then passed as an argument to
swiotlb_tbl_unmap_single() and the swiotlb_sync_*() functions. The original
memory buffer address obviously must be passed as an argument to
swiotlb_tbl_map_single(), but it is not passed to the other APIs. Consequently,
swiotlb data structures must save the original memory buffer address so that it
can be used when doing sync operations. This original address is saved in the
io_tlb_slot array.

Second, the io_tlb_slot array must handle partial sync requests. In such cases,
the argument to swiotlb_sync_*() is not the address of the start of the bounce
buffer but an address somewhere in the middle of the bounce buffer, and the
address of the start of the bounce buffer isn't known to swiotlb code. But
swiotlb code must be able to calculate the corresponding original memory buffer
address to do the CPU copy dictated by the "sync". So an adjusted original
memory buffer address is populated into the struct io_tlb_slot for each slot
occupied by the bounce buffer. An adjusted "alloc_size" of the bounce buffer is
also recorded in each struct io_tlb_slot so a sanity check can be performed on
the size of the "sync" operation. The "alloc_size" field is not used except for
the sanity check.

Third, the io_tlb_slot array is used to track available slots. The "list" field
in struct io_tlb_slot records how many contiguous available slots exist starting
at that slot. A "0" indicates that the slot is occupied. A value of "1"
indicates only the current slot is available. A value of "2" indicates the
current slot and the next slot are available, etc. The maximum value is
IO_TLB_SEGSIZE, which can appear in the first slot in a slot set, and indicates
that the entire slot set is available. These values are used when searching for
available slots to use for a new bounce buffer. They are updated when allocating
a new bounce buffer and when freeing a bounce buffer. At pool creation time, the
"list" field is initialized to IO_TLB_SEGSIZE down to 1 for the slots in every
slot set.
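
A simplified sketch of that initialization, assuming the struct layout shown
above (the in-tree code expresses the same idea with its own helpers)::

  /* Within each slot set of IO_TLB_SEGSIZE slots, the first slot's "list"
   * value is IO_TLB_SEGSIZE, the next is IO_TLB_SEGSIZE - 1, and so on
   * down to 1 for the last slot in the set.
   */
  static void example_init_free_list(struct io_tlb_slot *slots,
                                     unsigned long nslabs)
  {
          unsigned long i;

          for (i = 0; i < nslabs; i++)
                  slots[i].list = IO_TLB_SEGSIZE - (i % IO_TLB_SEGSIZE);
  }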

Fourth, the io_tlb_slot array keeps track of any "padding slots" allocated to
meet alloc_align_mask requirements described above. When
swiotlb_tbl_map_single() allocates bounce buffer space to meet alloc_align_mask
requirements, it may allocate pre-padding space across zero or more slots. But
when swiotlb_tbl_unmap_single() is called with the bounce buffer address, the
alloc_align_mask value that governed the allocation, and therefore the
allocation of any padding slots, is not known. The "pad_slots" field records
the number of padding slots so that swiotlb_tbl_unmap_single() can free them.
The "pad_slots" value is recorded only in the first non-padding slot allocated
to the bounce buffer.

Restricted pools
----------------
The swiotlb machinery is also used for "restricted pools", which are pools of
memory separate from the default swiotlb pool, and that are dedicated for DMA
use by a particular device. Restricted pools provide a level of DMA memory
protection on systems with limited hardware protection capabilities, such as
those lacking an IOMMU. Such usage is specified by DeviceTree entries and
requires that CONFIG_DMA_RESTRICTED_POOL is set. Each restricted pool is based
on its own io_tlb_mem data structure that is independent of the main swiotlb
io_tlb_mem.

Restricted pools add swiotlb_alloc() and swiotlb_free() APIs, which are called
from the dma_alloc_*() and dma_free_*() APIs. The swiotlb_alloc/free() APIs
allocate/free slots from/to the restricted pool directly and do not go through
swiotlb_tbl_map/unmap_single().
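
For reference, these entry points are declared in include/linux/swiotlb.h and,
at the time of writing, have roughly the following prototypes (check the header
for the authoritative versions); drivers normally reach them indirectly via
dma_alloc_coherent() and dma_free_coherent()::

  struct page *swiotlb_alloc(struct device *dev, size_t size);
  bool swiotlb_free(struct device *dev, struct page *page, size_t size);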