4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -10,6 +10,10 @@ and this project adheres to

### Added

- [#5463](https://github.com/firecracker-microvm/firecracker/pull/5463): Added
support for `virtio-pmem` devices. See [documentation](docs/pmem.md) for more
information.

### Changed

### Deprecated
174 changes: 174 additions & 0 deletions docs/pmem.md
@@ -0,0 +1,174 @@
# Using the Firecracker `virtio-pmem` device

## What is a persistent memory device

Persistent memory is non-volatile memory that the CPU accesses with ordinary
load/store instructions and that does not lose its content on power loss. In
other words, all writes to the memory persist across power cycles. In hardware
this is known as NVDIMM (Non-Volatile Dual In-line Memory Module).

## What is a `virtio-pmem` device

[`virtio-pmem`](https://docs.oasis-open.org/virtio/virtio/v1.3/csd01/virtio-v1.3-csd01.html#x1-68900019)
is a device which emulates a persistent memory device without requiring a
physical NVDIMM device to be present on the host system. `virtio-pmem` is backed
by a memory-mapped file on the host side and is exposed to the guest kernel as a
region in the guest physical memory. This allows the guest to access the host
memory pages directly, without the need to use a guest driver or interact with
the VMM. From the guest user-space perspective, `virtio-pmem` devices are
presented as normal block devices like `/dev/pmem0`. This allows `virtio-pmem`
to be used as a rootfs device that the VM boots from.

> [!NOTE]
>
> Since `virtio-pmem` is located fully in memory, when used as a block device
> there is no need to use the guest page cache for its operations. This
> behaviour can be configured by using the `DAX` feature of the kernel.
>
> - To mount a device with `DAX`, add `-o dax` to the `mount` command.
> - To configure a root device with `DAX`, append `rootflags=dax` to the kernel
>   arguments.
>
> `DAX` support is not uniform for all file systems. Check the documentation for
> the file system you want to use before enabling `DAX`.
Comment on lines +32 to +33
Contributor

it works on ext4, right? does it need any specific options (ie 4096 block size) or just works?

## Prerequisites

In order to use a `virtio-pmem` device, the guest kernel needs to be built with
support for it. The full list of configuration options needed for `virtio-pmem`
and `DAX`:

```
# Needed for DAX on aarch64. Will be ignored on x86_64
CONFIG_ARM64_PMEM=y
CONFIG_DEVICE_MIGRATION=y
CONFIG_ZONE_DEVICE=y
CONFIG_VIRTIO_PMEM=y
CONFIG_LIBNVDIMM=y
CONFIG_BLK_DEV_PMEM=y
CONFIG_ND_CLAIM=y
CONFIG_ND_BTT=y
CONFIG_BTT=y
CONFIG_ND_PFN=y
CONFIG_NVDIMM_PFN=y
CONFIG_NVDIMM_DAX=y
CONFIG_OF_PMEM=y
CONFIG_NVDIMM_KEYS=y
CONFIG_DAX=y
CONFIG_DEV_DAX=y
CONFIG_DEV_DAX_PMEM=y
CONFIG_DEV_DAX_KMEM=y
CONFIG_FS_DAX=y
CONFIG_FS_DAX_PMD=y
```
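
One way to check a guest kernel configuration against the list above is to scan
the kernel's `.config` text for each option. The sketch below is a hypothetical
helper and not part of Firecracker: the name `missing_options` and the shortened
`REQUIRED` list are assumptions made for illustration, and it only accepts
options that are built in (`=y`), as the list above does.

```rust
/// Options a guest kernel needs for `virtio-pmem` with DAX.
/// Illustrative subset of the list above; extend it as needed.
const REQUIRED: &[&str] = &[
    "CONFIG_VIRTIO_PMEM",
    "CONFIG_LIBNVDIMM",
    "CONFIG_BLK_DEV_PMEM",
    "CONFIG_DAX",
    "CONFIG_FS_DAX",
];

/// Returns the options from `required` that are not set to `y`
/// in the given kernel `.config` text, in the order they were requested.
fn missing_options(config_text: &str, required: &[&str]) -> Vec<String> {
    required
        .iter()
        .filter(|opt| {
            // A built-in option appears as an exact `NAME=y` line.
            let wanted = format!("{opt}=y");
            !config_text.lines().any(|line| line.trim() == wanted)
        })
        .map(|opt| opt.to_string())
        .collect()
}
```

Calling `missing_options(&config_text, REQUIRED)` on the contents of the guest
kernel's `.config` returns an empty vector when every listed option is built in.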

## Configuration

The Firecracker implementation exposes these config options for the
`virtio-pmem` device:

- `id` - id of the device for internal use
- `path_on_host` - path to the backing file
- `root_device` - toggle to use this device as the root device. The device will
  be marked as `rw` in the kernel arguments
- `read_only` - tells Firecracker to `mmap` the backing file in read-only mode.
  If this device is also configured as `root_device`, it will be marked as `ro`
  in the kernel arguments

> [!NOTE]
>
> Devices will be exposed to the guest in the order in which they are configured,
> with sequential names of the form `/dev/pmem{N}`, like `/dev/pmem0`,
> `/dev/pmem1` ...

> [!WARNING]
>
> Setting a `virtio-pmem` device to `read-only` mode can lead to the VM shutting
> down on any attempt to write to the device. This is because, from the guest
> kernel's perspective, `virtio-pmem` is always `read-write` capable. Use
> `read-only` mode only if you want to ensure the underlying file is never
> written to.
Contributor

we should add instructions to mount as readonly

Contributor Author

added

>
> To mount the `pmem` device read-only, add `-o ro` to the `mount` command.
>
> The exact behaviour differs per platform:
>
> - x86_64 - if KVM is able to decode the write instruction used by the guest,
>   it will return an MMIO_WRITE to Firecracker, where the write will be
>   discarded and a warning will be logged.
> - aarch64 - the instruction emulation is much stricter. Writes will result in
>   an internal KVM error which will be returned to Firecracker in the form of
>   an `ENOSYS` error. This will make Firecracker stop the VM with an
>   appropriate log message.

> [!WARNING]
>
> `virtio-pmem` requires the guest-exposed memory region to be 2MB aligned.
> This requirement is transitively carried over to the backing file of the
> `virtio-pmem` device. Firecracker allows users to configure `virtio-pmem` with
> a backing file of any size and fills the memory gap between the end of the
> file and the 2MB boundary with empty `PRIVATE | ANONYMOUS` memory pages. Users
> must be careful not to write to this memory gap, since it will not be
> synchronized with the backing file. This is not an issue if `virtio-pmem` is
> configured in `read-only` mode.
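
The alignment rule above comes down to rounding the backing file length up to
the next 2MB boundary. The following is a minimal sketch of that arithmetic,
assuming "2MB" means 2 MiB (2 * 1024 * 1024 bytes); the names `PMEM_ALIGN`,
`region_len` and `gap_len` are hypothetical and are not taken from the
Firecracker sources.

```rust
/// Alignment required for the guest-visible region.
/// Assumption: the "2MB" in the text means 2 MiB.
const PMEM_ALIGN: u64 = 2 << 20;

/// Size of the guest-visible region for a backing file of `file_len` bytes:
/// `file_len` rounded up to the next multiple of `PMEM_ALIGN`.
fn region_len(file_len: u64) -> u64 {
    let rem = file_len % PMEM_ALIGN;
    if rem == 0 {
        file_len
    } else {
        file_len + (PMEM_ALIGN - rem)
    }
}

/// Number of bytes at the end of the region that are not backed by the file
/// (the anonymous gap that is never synchronized).
fn gap_len(file_len: u64) -> u64 {
    region_len(file_len) - file_len
}
```

For a 3 MiB backing file this gives a 4 MiB region whose last 1 MiB is the
unsynchronized gap. Sizing the backing file to a multiple of 2 MiB makes
`gap_len` return 0, so there is no gap to write into by accident.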
### Config file

Configuration of the `virtio-pmem` device from a config file follows a pattern
similar to the `virtio-block` section. Here is an example configuration for a
single `virtio-pmem` device:

```json
"pmem": [
  {
    "id": "pmem0",
    "path_on_host": "./some_file",
    "root_device": true,
    "read_only": false
  }
]
```

### API

Similar to other devices, `virtio-pmem` can be configured with API calls. An
example configuration request:

```console
curl --unix-socket $socket_location -i \
    -X PUT 'http://localhost/pmem/pmem0' \
    -H 'Accept: application/json' \
    -H 'Content-Type: application/json' \
    -d "{
        \"id\": \"pmem0\",
        \"path_on_host\": \"./some_file\",
        \"root_device\": true,
        \"read_only\": false
    }"
```
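
The request handler in this change rejects a `PUT` whose path id differs from
the `id` in the body, so a client may want to derive both from a single value.
The sketch below shows one way to do that with the standard library only. The
helpers `pmem_put_request` and `escape` are hypothetical, and the escaping
covers only backslashes and double quotes, not control characters.

```rust
/// Escapes backslashes and double quotes for use inside a JSON string.
/// Minimal on purpose: control characters are not handled.
fn escape(s: &str) -> String {
    s.replace('\\', "\\\\").replace('"', "\\\"")
}

/// Builds the request path and JSON body for `PUT /pmem/{id}`.
/// Using one `id` for both keeps the path and the body consistent.
fn pmem_put_request(
    id: &str,
    path_on_host: &str,
    root_device: bool,
    read_only: bool,
) -> (String, String) {
    let path = format!("/pmem/{id}");
    let body = format!(
        r#"{{"id": "{}", "path_on_host": "{}", "root_device": {}, "read_only": {}}}"#,
        escape(id),
        escape(path_on_host),
        root_device,
        read_only
    );
    (path, body)
}
```

The returned pair can be passed to any HTTP client that speaks over the
Firecracker API Unix socket, in the same way as the `curl` call above.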
Contributor

we should probably mention snapshot/restore behaviour as well

and also security considerations about sharing memory (which we do not recommend).

We can also mention performance considerations: ie that even though pages are in memory, the guest still needs to exit to the kernel to set up the pagetable mappings. Using hugetlbfs to back the file would be faster (but will consume memory).

Contributor Author

added snapshot, security and performance sections

Contributor Author

Regarding hugetlbfs: its main usage is to make a sharable memory region and it does not support writes (e.g. you cannot copy a file to hugetlbfs, only create and resize files there). So I don't think we need to explicitly mention it as a backing for pmem, since the main use we expect is actual files as backing storage.


## Security

`virtio-pmem` can be used to share an underlying backing file between multiple
VMs by providing the same backing file to the `virtio-pmem` devices of the
corresponding VMs. This scenario poses a security risk of side-channel attacks
between VMs. Users are encouraged to evaluate the risks before using
`virtio-pmem` for such scenarios.

## Snapshot support

`virtio-pmem` works with the snapshot functionality of Firecracker. A snapshot
contains the configuration options provided by the user. During restoration,
Firecracker will attempt to restore each `virtio-pmem` device by opening the
same backing file it was originally configured with. This means all
`virtio-pmem` backing files must be present in the same locations during
restore as they were during the initial `virtio-pmem` configuration.
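
Because restore reopens the backing files by their configured paths, it can be
useful to verify that they exist before loading a snapshot. The following is a
minimal sketch of such a pre-restore check. `missing_backing_files` is a
hypothetical helper, not a Firecracker API, and it assumes the check runs with
the same working directory (or chroot) that Firecracker will use, since relative
paths are resolved against it.

```rust
use std::path::Path;

/// Returns the configured backing file paths that are not regular files
/// on this host, in the order they were given.
fn missing_backing_files(paths: &[&str]) -> Vec<String> {
    paths
        .iter()
        .filter(|p| !Path::new(**p).is_file())
        .map(|p| p.to_string())
        .collect()
}
```

An empty result means every listed path currently points at a regular file. It
does not guarantee that the file content matches what the guest saw when the
snapshot was taken.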

## Performance

Even though `virtio-pmem` allows direct access to host pages from the guest,
the first access to each page will be slower because it triggers an internal
KVM page fault, which has to set up the guest physical address to host virtual
address translation. Subsequent accesses will not need to go through this
process again.
13 changes: 13 additions & 0 deletions resources/seccomp/aarch64-unknown-linux-musl.json
@@ -215,6 +215,19 @@
                "syscall": "madvise",
                "comment": "Used by the VirtIO balloon device and by musl for some customer workloads. It is also used by aws-lc during random number generation. They setup a memory page that mark with MADV_WIPEONFORK to be able to detect forks. They also call it with -1 to see if madvise is supported in certain platforms."
            },
            {
                "syscall": "msync",
                "comment": "Used by the VirtIO pmem device to sync the file content with the backing file.",
                "args": [
                    {
                        "index": 2,
                        "type": "dword",
                        "op": "eq",
                        "val": 4,
                        "comment": "libc::MS_SYNC"
                    }
                ]
            },
            {
                "syscall": "mmap",
                "comment": "Used by the VirtIO balloon device",
13 changes: 13 additions & 0 deletions resources/seccomp/x86_64-unknown-linux-musl.json
@@ -215,6 +215,19 @@
                "syscall": "madvise",
                "comment": "Used by the VirtIO balloon device and by musl for some customer workloads. It is also used by aws-lc during random number generation. They setup a memory page that mark with MADV_WIPEONFORK to be able to detect forks. They also call it with -1 to see if madvise is supported in certain platforms."
            },
            {
                "syscall": "msync",
                "comment": "Used by the VirtIO pmem device to sync the file content with the backing file.",
                "args": [
                    {
                        "index": 2,
                        "type": "dword",
                        "op": "eq",
                        "val": 4,
                        "comment": "libc::MS_SYNC"
                    }
                ]
            },
            {
                "syscall": "mmap",
                "comment": "Used by the VirtIO balloon device",
2 changes: 2 additions & 0 deletions src/firecracker/src/api_server/parsed_request.rs
@@ -24,6 +24,7 @@ use super::request::machine_configuration::{
use super::request::metrics::parse_put_metrics;
use super::request::mmds::{parse_get_mmds, parse_patch_mmds, parse_put_mmds};
use super::request::net::{parse_patch_net, parse_put_net};
use super::request::pmem::parse_put_pmem;
use super::request::snapshot::{parse_patch_vm_state, parse_put_snapshot};
use super::request::version::parse_get_version;
use super::request::vsock::parse_put_vsock;
@@ -90,6 +91,7 @@ impl TryFrom<&Request> for ParsedRequest {
            (Method::Put, "boot-source", Some(body)) => parse_put_boot_source(body),
            (Method::Put, "cpu-config", Some(body)) => parse_put_cpu_config(body),
            (Method::Put, "drives", Some(body)) => parse_put_drive(body, path_tokens.next()),
            (Method::Put, "pmem", Some(body)) => parse_put_pmem(body, path_tokens.next()),
            (Method::Put, "logger", Some(body)) => parse_put_logger(body),
            (Method::Put, "serial", Some(body)) => parse_put_serial(body),
            (Method::Put, "machine-config", Some(body)) => parse_put_machine_config(body),
1 change: 1 addition & 0 deletions src/firecracker/src/api_server/request/mod.rs
@@ -13,6 +13,7 @@ pub mod machine_configuration;
pub mod metrics;
pub mod mmds;
pub mod net;
pub mod pmem;
pub mod serial;
pub mod snapshot;
pub mod version;
75 changes: 75 additions & 0 deletions src/firecracker/src/api_server/request/pmem.rs
@@ -0,0 +1,75 @@
// Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
// SPDX-License-Identifier: Apache-2.0

use vmm::logger::{IncMetric, METRICS};
use vmm::rpc_interface::VmmAction;
use vmm::vmm_config::pmem::PmemConfig;

use super::super::parsed_request::{ParsedRequest, RequestError, checked_id};
use super::{Body, StatusCode};

pub(crate) fn parse_put_pmem(
    body: &Body,
    id_from_path: Option<&str>,
) -> Result<ParsedRequest, RequestError> {
    METRICS.put_api_requests.pmem_count.inc();
    let id = if let Some(id) = id_from_path {
        checked_id(id)?
    } else {
        METRICS.put_api_requests.pmem_fails.inc();
        return Err(RequestError::EmptyID);
    };

    let device_cfg = serde_json::from_slice::<PmemConfig>(body.raw()).inspect_err(|_| {
        METRICS.put_api_requests.pmem_fails.inc();
    })?;

    if id != device_cfg.id {
        METRICS.put_api_requests.pmem_fails.inc();
        Err(RequestError::Generic(
            StatusCode::BadRequest,
            "The id from the path does not match the id from the body!".to_string(),
        ))
    } else {
        Ok(ParsedRequest::new_sync(VmmAction::InsertPmemDevice(
            device_cfg,
        )))
    }
}

#[cfg(test)]
mod tests {
    use super::*;
    use crate::api_server::parsed_request::tests::vmm_action_from_request;

    #[test]
    fn test_parse_put_pmem_request() {
        parse_put_pmem(&Body::new("invalid_payload"), None).unwrap_err();
        parse_put_pmem(&Body::new("invalid_payload"), Some("id")).unwrap_err();

        let body = r#"{
            "id": "bar",
        }"#;
        parse_put_pmem(&Body::new(body), Some("1")).unwrap_err();
        let body = r#"{
            "foo": "1",
        }"#;
        parse_put_pmem(&Body::new(body), Some("1")).unwrap_err();

        let body = r#"{
            "id": "1000",
            "path_on_host": "dummy",
            "root_device": true,
            "read_only": true
        }"#;
        let r = vmm_action_from_request(parse_put_pmem(&Body::new(body), Some("1000")).unwrap());

        let expected_config = PmemConfig {
            id: "1000".to_string(),
            path_on_host: "dummy".to_string(),
            root_device: true,
            read_only: true,
        };
        assert_eq!(r, VmmAction::InsertPmemDevice(expected_config));
    }
}
46 changes: 46 additions & 0 deletions src/firecracker/swagger/firecracker.yaml
@@ -282,6 +282,38 @@ paths:
          schema:
            $ref: "#/definitions/Error"

  /pmem/{id}:
    put:
      summary: Creates or updates a pmem device. Pre-boot only.
      description:
        Creates new pmem device with ID specified by id parameter.
        If a pmem device with the specified ID already exists, updates its state based on new input.
        Will fail if update is not possible.
      operationId: putGuestPmemByID
      parameters:
        - name: id
          in: path
          description: The id of the guest pmem device
          required: true
          type: string
        - name: body
          in: body
          description: Guest pmem device properties
          required: true
          schema:
            $ref: "#/definitions/Pmem"
      responses:
        204:
          description: Pmem device is created/updated
        400:
          description: Pmem device cannot be created/updated due to bad input
          schema:
            $ref: "#/definitions/Error"
        default:
          description: Internal server error.
          schema:
            $ref: "#/definitions/Error"

  /logger:
    put:
      summary: Initializes the logger by specifying a named pipe or a file for the logs output.
@@ -934,6 +966,20 @@ definitions:
          Path to the socket of vhost-user-block backend.
          This field is required for vhost-user-block config should be omitted for virtio-block configuration.

  Pmem:
    type: object
    required:
      - id
      - is_root_device
      - shared
    properties:
      id:
        type: string
      is_root_device:
        type: boolean
      shared:
        type: boolean

  Error:
    type: object
    properties:
2 changes: 2 additions & 0 deletions src/vmm/src/arch/aarch64/layout.rs
@@ -139,3 +139,5 @@ pub const MEM_64BIT_DEVICES_START: u64 = MMIO64_MEM_START;
pub const MEM_64BIT_DEVICES_SIZE: u64 = MMIO64_MEM_SIZE;
/// First address past the 64-bit MMIO gap
pub const FIRST_ADDR_PAST_64BITS_MMIO: u64 = MMIO64_MEM_START + MMIO64_MEM_SIZE;
/// Size of the memory past 64-bit MMIO gap
pub const PAST_64BITS_MMIO_SIZE: u64 = 512 << 30;