[virtio-pmem] Implementation #5463
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,174 @@ | ||
# Using the Firecracker `virtio-pmem` device | ||
|
||
## What is a persistent memory device | ||
|
||
Persistent memory is a type of non-volatile memory that the CPU accesses with | ||
the usual load/store instructions and that does not lose its content on power | ||
loss. In other words, all writes to the memory persist across a power cycle. In | ||
hardware this is known as NVDIMM (Non-Volatile Dual In-line Memory Module). | ||
|
||
## What is a `virtio-pmem` device | ||
|
||
[`virtio-pmem`](https://docs.oasis-open.org/virtio/virtio/v1.3/csd01/virtio-v1.3-csd01.html#x1-68900019) | ||
is a device which emulates a persistent memory device without requiring a | ||
physical NVDIMM device to be present on the host system. `virtio-pmem` is backed | ||
by a memory-mapped file on the host side and is exposed to the guest kernel as a | ||
region in the guest physical memory. This allows the guest to access the host | ||
memory pages directly, without needing to use a guest driver or interact with | ||
the VMM. From the guest user-space perspective, `virtio-pmem` devices are | ||
presented as normal block devices such as `/dev/pmem0`. This allows | ||
`virtio-pmem` to be used as the rootfs device, so the VM can boot from it. | ||
|
||
> [!NOTE] | ||
> | ||
> Since `virtio-pmem` is located fully in memory, when used as a block device | ||
> there is no need to use the guest page cache for its operations. This | ||
> behaviour can be configured by using the `DAX` feature of the kernel. | ||
> | ||
> - To mount a device with `DAX` add `-o dax` to the `mount` command. | ||
> - To configure a root device with `DAX` append `rootflags=dax` to the kernel | ||
> arguments. | ||
> | ||
> `DAX` support is not uniform for all file systems. Check the documentation for | ||
> the file system you want to use before enabling `DAX`. | ||
## Prerequisites | ||
|
||
In order to use a `virtio-pmem` device, the guest kernel needs to be built with | ||
support for it. The full list of configuration options needed for `virtio-pmem` | ||
and `DAX`: | ||
|
||
``` | ||
# Needed for DAX on aarch64. Will be ignored on x86_64 | ||
CONFIG_ARM64_PMEM=y | ||
CONFIG_DEVICE_MIGRATION=y | ||
CONFIG_ZONE_DEVICE=y | ||
CONFIG_VIRTIO_PMEM=y | ||
CONFIG_LIBNVDIMM=y | ||
CONFIG_BLK_DEV_PMEM=y | ||
CONFIG_ND_CLAIM=y | ||
CONFIG_ND_BTT=y | ||
CONFIG_BTT=y | ||
CONFIG_ND_PFN=y | ||
CONFIG_NVDIMM_PFN=y | ||
CONFIG_NVDIMM_DAX=y | ||
CONFIG_OF_PMEM=y | ||
CONFIG_NVDIMM_KEYS=y | ||
CONFIG_DAX=y | ||
CONFIG_DEV_DAX=y | ||
CONFIG_DEV_DAX_PMEM=y | ||
CONFIG_DEV_DAX_KMEM=y | ||
CONFIG_FS_DAX=y | ||
CONFIG_FS_DAX_PMD=y | ||
``` | ||
|
||
## Configuration | ||
|
||
The Firecracker implementation exposes these config options for the | ||
`virtio-pmem` device: | ||
|
||
- `id` - id of the device for internal use | ||
- `path_on_host` - path to the backing file | ||
- `root_device` - toggle to use this device as root device. Device will be | ||
marked as `rw` in the kernel arguments | ||
- `read_only` - tells Firecracker to `mmap` the backing file in read-only mode. | ||
If this device is also configured as `root_device`, it will be marked as `ro` | ||
in the kernel arguments | ||
|
||
> [!NOTE] | ||
> | ||
> Devices will be exposed to the guest in the order in which they are configured, | ||
> with sequential names of the form `/dev/pmem{N}`, for example `/dev/pmem0`, | ||
> `/dev/pmem1` ... | ||
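The interaction between `root_device`, `read_only` and the `/dev/pmem{N}` naming can be sketched in a few lines of Rust. This is only an illustration: `PmemOptions` and `root_kernel_args` are hypothetical names, and the exact `root=... rw|ro` spelling of the kernel arguments is an assumption rather than something taken from the Firecracker sources.

```rust
/// Hypothetical mirror of the four documented options.
/// This is NOT Firecracker's real `PmemConfig` type.
#[derive(Debug, Clone)]
pub struct PmemOptions {
    pub id: String,
    pub path_on_host: String,
    pub root_device: bool,
    pub read_only: bool,
}

/// Returns the root-related kernel arguments implied by the configured devices,
/// or `None` if no device is marked as the root device.
///
/// `devices` is expected in configuration order, so the device at index N is
/// assumed to appear in the guest as `/dev/pmemN`.
pub fn root_kernel_args(devices: &[PmemOptions]) -> Option<String> {
    devices
        .iter()
        .enumerate()
        .find(|(_, dev)| dev.root_device)
        .map(|(n, dev)| {
            // `read_only` root devices are marked `ro`, all others `rw`.
            let mode = if dev.read_only { "ro" } else { "rw" };
            // Assumed spelling of the arguments, for illustration only.
            format!("root=/dev/pmem{} {}", n, mode)
        })
}
```

With two devices configured, where only the second one has `root_device` and `read_only` set, this sketch yields `root=/dev/pmem1 ro`.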
> [!WARNING] | ||
> | ||
> Setting a `virtio-pmem` device to `read-only` mode can lead to the VM shutting | ||
> down on any attempt to write to the device. This is because, from the guest | ||
> kernel's perspective, `virtio-pmem` is always `read-write` capable. Use | ||
> `read-only` mode only if you want to ensure the underlying file is never written to. | ||
Review comment: we should add instructions to mount as readonly
Reply: added |
||
> | ||
> To mount the `pmem` device in `read-only` mode, add `-o ro` to the `mount` | ||
> command. | ||
> | ||
> The exact behaviour differs per platform: | ||
> | ||
> - x86_64 - if KVM is able to decode the write instruction used by the guest, | ||
> it will return an MMIO_WRITE to Firecracker, where the write will be discarded | ||
> and a warning log will be printed. | ||
> - aarch64 - the instruction emulation is much stricter. Writes will result in | ||
> an internal KVM error, which will be returned to Firecracker in the form of an | ||
> `ENOSYS` error. This will make Firecracker stop the VM with an appropriate log | ||
> message. | ||
> [!WARNING] | ||
> | ||
> `virtio-pmem` requires the memory region exposed to the guest to be 2MB aligned. | ||
> This requirement carries over to the backing file of the `virtio-pmem` | ||
> device. Firecracker allows users to configure `virtio-pmem` with a | ||
> backing file of any size and fills the memory gap between the end of the file | ||
> and the 2MB boundary with empty `PRIVATE | ANONYMOUS` memory pages. Users must | ||
> be careful not to write to this memory gap, since it will not be synchronized | ||
> with the backing file. This is not an issue if `virtio-pmem` is configured in | ||
> `read-only` mode. | ||
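To make the alignment rule concrete, here is a minimal Rust sketch of the arithmetic. The names `PMEM_ALIGNMENT`, `align_up` and `region_layout` are hypothetical, and the sketch assumes that "2MB" means 2 MiB (2 * 1024 * 1024 bytes). It is not the code Firecracker uses.

```rust
/// Assumed alignment of the guest-visible region: 2 MiB.
pub const PMEM_ALIGNMENT: u64 = 2 * 1024 * 1024;

/// Rounds `len` up to the next multiple of `align`.
/// `align` must be a power of two; very large `len` values near `u64::MAX`
/// would overflow and are not handled in this sketch.
pub fn align_up(len: u64, align: u64) -> u64 {
    debug_assert!(align.is_power_of_two());
    (len + align - 1) & !(align - 1)
}

/// For a backing file of `file_len` bytes, returns a pair of
/// (size of the guest-visible region, size of the gap after the file).
/// The gap is the part that would be filled with anonymous pages and
/// is never synchronized with the backing file.
pub fn region_layout(file_len: u64) -> (u64, u64) {
    let region_len = align_up(file_len, PMEM_ALIGNMENT);
    (region_len, region_len - file_len)
}
```

Under these assumptions, a 3 MiB backing file produces a 4 MiB region with a 1 MiB gap, while a file whose size is already a multiple of 2 MiB produces no gap at all.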
### Config file | ||
|
||
Configuration of the `virtio-pmem` device from the config file follows a pattern | ||
similar to the `virtio-block` section. Here is an example configuration for a | ||
single `virtio-pmem` device: | ||
|
||
```json | ||
"pmem": [ | ||
  { | ||
    "id": "pmem0", | ||
    "path_on_host": "./some_file", | ||
    "root_device": true, | ||
    "read_only": false | ||
  } | ||
] | ||
``` | ||
|
||
### API | ||
|
||
Similar to other devices, `virtio-pmem` can be configured with API calls. An | ||
example of a configuration request: | ||
|
||
```console | ||
curl --unix-socket $socket_location -i \ | ||
    -X PUT 'http://localhost/pmem/pmem0' \ | ||
    -H 'Accept: application/json' \ | ||
    -H 'Content-Type: application/json' \ | ||
    -d "{ | ||
        \"id\": \"pmem0\", | ||
        \"path_on_host\": \"./some_file\", | ||
        \"root_device\": true, | ||
        \"read_only\": false | ||
    }" | ||
``` | ||
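The same request can also be issued programmatically. The following is a rough, standard-library-only Rust sketch that builds the raw HTTP request and writes it to the API socket. The helper names (`json_escape`, `pmem_put_request`, `send_request`) are hypothetical, the JSON escaping is deliberately minimal, and the single `read` call is a simplification that may not capture a full response.

```rust
use std::io::{Read, Write};
use std::os::unix::net::UnixStream;

/// Escapes backslashes and double quotes so the value can be embedded in a
/// JSON string literal. Not a complete JSON encoder (control characters are
/// not handled); a real client would use a JSON library.
fn json_escape(s: &str) -> String {
    s.replace('\\', "\\\\").replace('"', "\\\"")
}

/// Builds the raw HTTP/1.1 `PUT /pmem/{id}` request shown in the curl example.
/// The `id` is placed in the path as-is, so it is assumed to be URL-safe.
pub fn pmem_put_request(id: &str, path_on_host: &str, root_device: bool, read_only: bool) -> String {
    let body = format!(
        "{{\"id\":\"{}\",\"path_on_host\":\"{}\",\"root_device\":{},\"read_only\":{}}}",
        json_escape(id),
        json_escape(path_on_host),
        root_device,
        read_only
    );
    format!(
        "PUT /pmem/{} HTTP/1.1\r\nHost: localhost\r\nAccept: application/json\r\n\
         Content-Type: application/json\r\nContent-Length: {}\r\n\r\n{}",
        id,
        body.len(),
        body
    )
}

/// Sends a prepared request over the API Unix socket and returns whatever the
/// first `read` yields, decoded lossily as UTF-8.
pub fn send_request(socket_path: &str, request: &str) -> std::io::Result<String> {
    let mut stream = UnixStream::connect(socket_path)?;
    stream.write_all(request.as_bytes())?;
    let mut buf = [0u8; 4096];
    let n = stream.read(&mut buf)?;
    Ok(String::from_utf8_lossy(&buf[..n]).into_owned())
}
```

A caller would combine the two helpers, for example `send_request(socket_location, &pmem_put_request("pmem0", "./some_file", true, false))`, where `socket_location` is the same socket path passed to `curl` above.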
Review comment: we should probably mention snapshot/restore behaviour as well and also security considerations about sharing memory (which we do not recommend). We can also mention performance considerations: i.e. that even though pages are in memory, the guest still needs to exit to the kernel to set up the pagetable mappings. Using hugetlbfs to back the file would be faster (but will consume memory).
Reply: added snapshot, security and performance sections
Reply: Regarding hugetlbfs: its main usage is to make a sharable memory region and it does not support writes (e.g. you cannot copy a file to hugetlbfs, only create and resize files). So I don't think we need to explicitly mention it as a backing for pmem, since the main use we expect is to use actual files as backing storage. |
||
|
||
## Security | ||
|
||
`virtio-pmem` can be used to share an underlying backing file between multiple | ||
VMs by providing the same backing file to the `virtio-pmem` devices of the | ||
corresponding VMs. This scenario poses a security risk of side channel attacks | ||
between VMs. Users are encouraged to evaluate the risks before using | ||
`virtio-pmem` for such scenarios. | ||
|
||
## Snapshot support | ||
|
||
`virtio-pmem` works with the snapshot functionality of Firecracker. The snapshot | ||
will contain the configuration options provided by the user. During the | ||
restoration process, Firecracker will attempt to restore each `virtio-pmem` | ||
device by opening the same backing file it was originally configured with. This | ||
means all `virtio-pmem` backing files should be present in the same locations | ||
during restore as they were during the initial `virtio-pmem` configuration. | ||
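Because restore depends on the backing files being at their original paths, it can be useful to check for them before attempting to load a snapshot. The helper below is a hypothetical pre-flight check written for illustration; it is not part of Firecracker, and the caller is assumed to know the `path_on_host` values that were used when the devices were configured.

```rust
use std::path::{Path, PathBuf};

/// Returns the subset of `paths` that do not currently exist as regular files.
/// An empty result means every expected backing file is in place.
pub fn missing_backing_files<P: AsRef<Path>>(paths: &[P]) -> Vec<PathBuf> {
    paths
        .iter()
        .map(|p| p.as_ref())
        // `is_file` is false for missing paths and for directories.
        .filter(|p| !p.is_file())
        .map(Path::to_path_buf)
        .collect()
}
```

If the returned list is non-empty, the missing files would need to be put back at those locations before restoring. Note that the check is advisory only, since a file can still disappear between the check and the restore.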
|
||
## Performance | ||
|
||
Even though `virtio-pmem` allows direct access to host pages from the guest, | ||
the first access to each page will be slower because it triggers an internal | ||
KVM page fault, which has to set up the guest physical address to host virtual | ||
address translation. Subsequent accesses to the same page will not need to go | ||
through this process again. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,75 @@ | ||
// Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved. | ||
// SPDX-License-Identifier: Apache-2.0 | ||
|
||
use vmm::logger::{IncMetric, METRICS}; | ||
use vmm::rpc_interface::VmmAction; | ||
use vmm::vmm_config::pmem::PmemConfig; | ||
|
||
use super::super::parsed_request::{ParsedRequest, RequestError, checked_id}; | ||
use super::{Body, StatusCode}; | ||
|
||
pub(crate) fn parse_put_pmem( | ||
    body: &Body, | ||
    id_from_path: Option<&str>, | ||
) -> Result<ParsedRequest, RequestError> { | ||
    METRICS.put_api_requests.pmem_count.inc(); | ||
    let id = if let Some(id) = id_from_path { | ||
        checked_id(id)? | ||
    } else { | ||
        METRICS.put_api_requests.pmem_fails.inc(); | ||
        return Err(RequestError::EmptyID); | ||
    }; | ||
|
||
    let device_cfg = serde_json::from_slice::<PmemConfig>(body.raw()).inspect_err(|_| { | ||
        METRICS.put_api_requests.pmem_fails.inc(); | ||
    })?; | ||
|
||
    if id != device_cfg.id { | ||
        METRICS.put_api_requests.pmem_fails.inc(); | ||
        Err(RequestError::Generic( | ||
            StatusCode::BadRequest, | ||
            "The id from the path does not match the id from the body!".to_string(), | ||
        )) | ||
    } else { | ||
        Ok(ParsedRequest::new_sync(VmmAction::InsertPmemDevice( | ||
            device_cfg, | ||
        ))) | ||
    } | ||
} | ||
|
||
#[cfg(test)] | ||
mod tests { | ||
    use super::*; | ||
    use crate::api_server::parsed_request::tests::vmm_action_from_request; | ||
|
||
    #[test] | ||
    fn test_parse_put_pmem_request() { | ||
        parse_put_pmem(&Body::new("invalid_payload"), None).unwrap_err(); | ||
        parse_put_pmem(&Body::new("invalid_payload"), Some("id")).unwrap_err(); | ||
|
||
        let body = r#"{ | ||
            "id": "bar", | ||
        }"#; | ||
        parse_put_pmem(&Body::new(body), Some("1")).unwrap_err(); | ||
        let body = r#"{ | ||
            "foo": "1", | ||
        }"#; | ||
        parse_put_pmem(&Body::new(body), Some("1")).unwrap_err(); | ||
|
||
        let body = r#"{ | ||
            "id": "1000", | ||
            "path_on_host": "dummy", | ||
            "root_device": true, | ||
            "read_only": true | ||
        }"#; | ||
        let r = vmm_action_from_request(parse_put_pmem(&Body::new(body), Some("1000")).unwrap()); | ||
|
||
        let expected_config = PmemConfig { | ||
            id: "1000".to_string(), | ||
            path_on_host: "dummy".to_string(), | ||
            root_device: true, | ||
            read_only: true, | ||
        }; | ||
        assert_eq!(r, VmmAction::InsertPmemDevice(expected_config)); | ||
    } | ||
} |
Review comment: it works on ext4, right? Does it need any specific options (e.g. 4096 block size) or does it just work?