|
| 1 | +# Using the Firecracker `virtio-pmem` device |
| 2 | + |
| 3 | +## What is a persistent memory device |
| 4 | + |
| 5 | +Persistent memory is non-volatile memory that the CPU accesses with ordinary |
| 6 | +load/store instructions and that does not lose its contents on power loss. In |
| 7 | +other words, all writes to the memory persist across a power cycle. In hardware |
| 8 | +this is known as NVDIMM (Non-Volatile Dual In-line Memory Module). |
| 9 | + |
| 10 | +## What is a `virtio-pmem` device |
| 11 | + |
| 12 | +[`virtio-pmem`](https://docs.oasis-open.org/virtio/virtio/v1.3/csd01/virtio-v1.3-csd01.html#x1-68900019) |
| 13 | +is a device that emulates a persistent memory device without requiring a |
| 14 | +physical NVDIMM device to be present on the host system. `virtio-pmem` is backed by |
| 15 | +a memory-mapped file on the host side and is exposed to the guest kernel as a |
| 16 | +region in guest physical memory. This allows the guest to access the host |
| 17 | +memory pages directly, without going through a guest driver or interacting with the VMM. |
| 18 | +From the guest user-space perspective, `virtio-pmem` devices are presented as normal |
| 19 | +block devices such as `/dev/pmem0`. This allows `virtio-pmem` to be used as a rootfs |
| 20 | +device that the VM boots from. |
| 21 | + |
| 22 | +> [!NOTE] |
| 23 | +> |
| 24 | +> Since `virtio-pmem` is located fully in memory, when used as a block device |
| 25 | +> there is no need to use the guest page cache for its operations. This behaviour |
| 26 | +> can be configured by using the `DAX` feature of the kernel. |
| 27 | +> |
| 28 | +> - To mount a device with `DAX`, add `-o dax` to the `mount` command. |
| 29 | +> - To configure a root device with `DAX`, append `rootflags=dax` to the kernel |
| 30 | +> arguments. |
| 31 | +> |
| 32 | +> `DAX` support is not uniform for all file systems. Check the kernel |
| 33 | +> [documentation](https://github.com/torvalds/linux/blob/master/Documentation/filesystems/dax.rst) |
| 34 | +> for more information. |
| 35 | +
|
| 36 | +## Prerequisites |
| 37 | + |
| 38 | +In order to use a `virtio-pmem` device, the guest kernel needs to be built with support |
| 39 | +for it. The full list of configuration options needed for `virtio-pmem` and |
| 40 | +`DAX`: |
| 41 | + |
| 42 | +``` |
| 43 | +# Needed for DAX on aarch64. Will be ignored on x86_64 |
| 44 | +CONFIG_ARM64_PMEM=y |
| 45 | +
|
| 46 | +CONFIG_DEVICE_MIGRATION=y |
| 47 | +CONFIG_ZONE_DEVICE=y |
| 48 | +CONFIG_VIRTIO_PMEM=y |
| 49 | +CONFIG_LIBNVDIMM=y |
| 50 | +CONFIG_BLK_DEV_PMEM=y |
| 51 | +CONFIG_ND_CLAIM=y |
| 52 | +CONFIG_ND_BTT=y |
| 53 | +CONFIG_BTT=y |
| 54 | +CONFIG_ND_PFN=y |
| 55 | +CONFIG_NVDIMM_PFN=y |
| 56 | +CONFIG_NVDIMM_DAX=y |
| 57 | +CONFIG_OF_PMEM=y |
| 58 | +CONFIG_NVDIMM_KEYS=y |
| 59 | +CONFIG_DAX=y |
| 60 | +CONFIG_DEV_DAX=y |
| 61 | +CONFIG_DEV_DAX_PMEM=y |
| 62 | +CONFIG_DEV_DAX_KMEM=y |
| 63 | +CONFIG_FS_DAX=y |
| 64 | +CONFIG_FS_DAX_PMD=y |
| 65 | +``` |
| 66 | + |
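
A guest kernel config can be checked against this list before booting. The sketch below is one possible way to do that in shell. `missing_pmem_options` is a hypothetical helper name, it assumes the kernel config is available as a plain-text file such as `.config`, and it checks only a representative subset of the options above, so extend the list as needed.

```shell
# Hypothetical helper, not part of Firecracker: print every option from a
# subset of the list above that is not set to "y" in the given config file.
missing_pmem_options() {
    for opt in CONFIG_VIRTIO_PMEM CONFIG_LIBNVDIMM CONFIG_BLK_DEV_PMEM \
               CONFIG_DAX CONFIG_FS_DAX; do
        # -x matches the whole line, so CONFIG_FS_DAX=y does not satisfy CONFIG_DAX
        grep -qx "${opt}=y" "$1" || echo "$opt"
    done
}
```

Calling `missing_pmem_options path/to/.config` prints nothing when all checked options are enabled.
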
| 67 | +## Configuration |
| 68 | + |
| 69 | +The Firecracker implementation exposes these config options for the `virtio-pmem` |
| 70 | +device: |
| 71 | + |
| 72 | +- `id` - id of the device for internal use |
| 73 | +- `path_on_host` - path to the backing file |
| 74 | +- `root_device` - toggle to use this device as root device. Device will be |
| 75 | + marked as `rw` in the kernel arguments |
| 76 | +- `read_only` - tells Firecracker to `mmap` the backing file in read-only mode. |
| 77 | + If this device is also configured as `root_device`, it will be marked as `ro` |
| 78 | + in the kernel arguments |
| 79 | + |
| 80 | +> [!NOTE] |
| 81 | +> |
| 82 | +> Devices will be exposed to the guest in the order in which they are configured, |
| 83 | +> with sequential names of the form `/dev/pmem{N}`, for example `/dev/pmem0`, |
| 84 | +> `/dev/pmem1` ... |
| 85 | +
|
| 86 | +> [!WARNING] |
| 87 | +> |
| 88 | +> Setting a `virtio-pmem` device to `read-only` mode can lead to the VM shutting down |
| 89 | +> on any attempt to write to the device. This is because, from the guest kernel's |
| 90 | +> perspective, `virtio-pmem` is always `read-write` capable. Use `read-only` mode |
| 91 | +> only if you want to ensure the underlying file is never written to. |
| 92 | +> |
| 93 | +> To mount the `pmem` device read-only in the guest, add `-o ro` to the `mount` |
| 94 | +> command. |
| 95 | +> |
| 96 | +> The exact behaviour differs per platform: |
| 97 | +> |
| 98 | +> - x86_64 - if KVM is able to decode the write instruction used by the guest, |
| 99 | +> it will return an MMIO_WRITE to Firecracker, where the write will be discarded |
| 100 | +> and a warning log will be printed. |
| 101 | +> - aarch64 - the instruction emulation is much stricter. Writes will result in |
| 102 | +> an internal KVM error, which will be returned to Firecracker in the form of an |
| 103 | +> `ENOSYS` error. This will make Firecracker stop the VM with an appropriate log |
| 104 | +> message. |
| 105 | +
|
| 106 | +> [!WARNING] |
| 107 | +> |
| 108 | +> `virtio-pmem` requires the guest-exposed memory region to be 2MB aligned. |
| 109 | +> This requirement carries over to the backing file of the |
| 110 | +> `virtio-pmem` device. Firecracker allows users to configure `virtio-pmem` with a |
| 111 | +> backing file of any size and fills the memory gap between the end of the file |
| 112 | +> and the 2MB boundary with empty `PRIVATE | ANONYMOUS` memory pages. Users must |
| 113 | +> be careful not to write to this memory gap, since it will not be synchronized |
| 114 | +> with the backing file. This is not an issue if `virtio-pmem` is configured in |
| 115 | +> `read-only` mode. |
| 116 | +
|
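
One way to avoid the unsynchronized gap altogether is presumably to size the backing file to a multiple of 2MB before attaching it. The following is a minimal sketch under that assumption. `align_up_2mib` and `pad_backing_file` are hypothetical helper names, not Firecracker tooling, and the sketch relies on `truncate` from GNU coreutils being available on the host.

```shell
# Hypothetical helper: round a byte count up to the next multiple of 2 MiB.
align_up_2mib() {
    align=$((2 * 1024 * 1024))
    echo $(( ($1 + align - 1) / align * align ))
}

# Hypothetical helper: grow a backing file in place (zero-filled) so that its
# size is a multiple of 2 MiB. A file that is already aligned is left unchanged.
pad_backing_file() {
    size=$(wc -c < "$1")
    truncate -s "$(align_up_2mib "$size")" "$1"
}
```

The padding is appended after the existing contents, so data already in the file is not moved. Whether trailing zero bytes are acceptable depends on what the file contains, so check this for your image format.
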
| 117 | +### Config file |
| 118 | + |
| 119 | +Configuration of the `virtio-pmem` device from a config file follows a pattern similar |
| 120 | +to the `virtio-block` section. Here is an example configuration for a single |
| 121 | +`virtio-pmem` device: |
| 122 | + |
| 123 | +```json |
| 124 | +"pmem": [ |
| 125 | + { |
| 126 | + "id": "pmem0", |
| 127 | + "path_on_host": "./some_file", |
| 128 | + "root_device": true, |
| 129 | + "read_only": false |
| 130 | + } |
| 131 | +] |
| 132 | +``` |
| 133 | + |
| 134 | +### API |
| 135 | + |
| 136 | +Similar to other devices, `virtio-pmem` can be configured with API calls. An |
| 137 | +example of a configuration request: |
| 138 | + |
| 139 | +```console |
| 140 | +curl --unix-socket $socket_location -i \ |
| 141 | + -X PUT 'http://localhost/pmem/pmem0' \ |
| 142 | + -H 'Accept: application/json' \ |
| 143 | + -H 'Content-Type: application/json' \ |
| 144 | + -d "{ |
| 145 | + \"id\": \"pmem0\", |
| 146 | + \"path_on_host\": \"./some_file\", |
| 147 | + \"root_device\": true, |
| 148 | + \"read_only\": false |
| 149 | + }" |
| 150 | +``` |
| 151 | + |
| 152 | +## Security |
| 153 | + |
| 154 | +`virtio-pmem` can be used to share an underlying backing file between multiple |
| 155 | +VMs by providing the same backing file to the `virtio-pmem` devices of the corresponding |
| 156 | +VMs. This scenario poses a security risk of side-channel attacks between VMs. |
| 157 | +Users are encouraged to evaluate the risks before using `virtio-pmem` for such |
| 158 | +scenarios. |
| 159 | + |
| 160 | +## Snapshot support |
| 161 | + |
| 162 | +`virtio-pmem` works with the snapshot functionality of Firecracker. A snapshot will |
| 163 | +contain the configuration options provided by the user. During the restoration |
| 164 | +process, Firecracker will attempt to restore each `virtio-pmem` device by opening the |
| 165 | +same backing file it was originally configured with. This means all |
| 166 | +`virtio-pmem` backing files should be present in the same locations during |
| 167 | +restore as they were during the initial `virtio-pmem` configuration. |
| 168 | + |
| 169 | +## Performance |
| 170 | + |
| 171 | +Even though `virtio-pmem` allows direct access to host pages from the |
| 172 | +guest, the first access to each page incurs an |
| 173 | +internal KVM page fault, which has to set up the guest physical address to host |
| 174 | +virtual address translation. Subsequent accesses do not need to go through |
| 175 | +this process again. |
| 176 | + |
| 177 | +Since the number of page faults depends on the size of the pages used to back |
| 178 | +`virtio-pmem` memory, it is possible to use huge pages to reduce the number of |
| 179 | +required page faults. This can be done by using |
| 180 | +[`tmpfs`](https://www.kernel.org/doc/html/latest/filesystems/tmpfs.html) with |
| 181 | +transparent huge pages enabled or by using |
| 182 | +[`hugetlbfs`](https://www.kernel.org/doc/html/latest/admin-guide/mm/hugetlbpage.html) |
| 183 | +if `virtio-pmem` is used for memory sharing. |
| 184 | + |
| 185 | +## Memory usage |
| 186 | + |
| 187 | +Since `virtio-pmem` resides in host memory, it increases the maximum possible |
| 188 | +memory usage of a VM: the VM can now use all of its RAM and also access all of the |
| 189 | +`virtio-pmem` memory. In order to minimize the overhead, it is highly |
| 190 | +recommended to use `DAX` mode to avoid unnecessary duplication of data in the guest |
| 191 | +page cache. |
| 192 | + |
| 193 | +As an example, a single VM with 128MB of memory booted from a `virtio-pmem` device |
| 194 | +without `DAX` has an `RSS` value of ~120MB, while with `DAX` it is ~96MB. The ~96MB |
| 195 | +is similar to the memory usage of a VM booted using `virtio-block` as a root device. |
| 196 | + |
| 197 | +In the case where multiple VMs have `virtio-pmem` devices that point to the same |
| 198 | +underlying file, the memory overhead can be amortized, since the total maximum memory |
| 199 | +usage will only include a single instance of the `virtio-pmem` memory. |
| 200 | + |
| 201 | +As an example, 2 VMs configured with 128MB of RAM each and no `virtio-pmem` devices |
| 202 | +can consume a maximum of 128 + 128 = 256MB of host memory. If each VM |
| 203 | +has a 100MB `virtio-pmem` device attached with a shared backing file, the maximum |
| 204 | +memory consumption will be 128 + 128 + 100 = 356MB, because the 100MB of |
| 205 | +`virtio-pmem` memory is shared between the VMs. |
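
The same arithmetic generalizes to any number of VMs that share one backing file. As a rough sketch, with `max_host_memory_mb` being a hypothetical helper name and all sizes given in MB, the upper bound is the per-VM RAM multiplied by the VM count, plus a single copy of the shared `virtio-pmem` memory:

```shell
# Hypothetical helper: upper bound on host memory (MB) for a number of VMs
# that all map the same virtio-pmem backing file.
#   $1 - number of VMs
#   $2 - RAM per VM in MB
#   $3 - size of the shared virtio-pmem backing file in MB
max_host_memory_mb() {
    echo $(( $1 * $2 + $3 ))
}

max_host_memory_mb 2 128 100   # prints 356
```

This is only an upper bound on guest RAM plus `virtio-pmem` pages. It does not account for Firecracker's own overhead or for backing files that are not shared.
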