2 changes: 1 addition & 1 deletion net_util/src/lib.rs
@@ -101,7 +101,7 @@ fn create_unix_socket() -> Result<net::UdpSocket> {
Ok(unsafe { net::UdpSocket::from_raw_fd(sock) })
}

-fn vnet_hdr_len() -> usize {
+pub fn vnet_hdr_len() -> usize {
std::mem::size_of::<virtio_net_hdr_v1>()
}

14 changes: 14 additions & 0 deletions virtio-devices/src/device.rs
@@ -166,6 +166,12 @@ pub trait VirtioDevice: Send {
/// Set the access platform trait to let the device perform address
/// translations if needed.
fn set_access_platform(&mut self, _access_platform: Arc<dyn AccessPlatform>) {}

/// Some devices can announce their location after a live migration to
/// speed up normal execution.
fn post_migration_announcer(&mut self) -> std::option::Option<Box<dyn PostMigrationAnnouncer>> {
Member

Please use `Option` rather than the fully qualified path `std::option::Option`.

None
}
}

/// Trait to define address translation for devices managed by virtio-iommu
@@ -338,3 +344,11 @@ impl Pausable for VirtioCommon {
Ok(())
}
}

/// A PostMigrationAnnouncer is used to inform other devices about the new
/// location of a VM after a live migration.
pub trait PostMigrationAnnouncer: Send + Sync {
// Sending the announces is done on a best-effort basis, so we ignore
// errors.
fn announce_once(&self);
}
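
A minimal, self-contained sketch of how this trait could be consumed. `Device`, `FakeNic`, `FakeBlock`, `CountingAnnouncer`, and `collect_announcers` are hypothetical stand-ins and not Cloud Hypervisor types. The sketch mirrors the default-`None` method from the diff and collects the announcers with `filter_map`, which yields the same result as the `.map(...).flatten()` chain in `DeviceManager::post_migration_announce`.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

pub trait PostMigrationAnnouncer: Send + Sync {
    fn announce_once(&self);
}

/// Hypothetical, stripped-down stand-in for `VirtioDevice`.
pub trait Device {
    /// Devices that have nothing to announce keep this default.
    fn post_migration_announcer(&mut self) -> Option<Box<dyn PostMigrationAnnouncer>> {
        None
    }
}

/// Hypothetical announcer that only counts how often it was invoked.
pub struct CountingAnnouncer(pub Arc<AtomicUsize>);

impl PostMigrationAnnouncer for CountingAnnouncer {
    fn announce_once(&self) {
        self.0.fetch_add(1, Ordering::SeqCst);
    }
}

/// Hypothetical NIC-like device that hands out an announcer.
pub struct FakeNic {
    pub sent: Arc<AtomicUsize>,
}

impl Device for FakeNic {
    fn post_migration_announcer(&mut self) -> Option<Box<dyn PostMigrationAnnouncer>> {
        Some(Box::new(CountingAnnouncer(Arc::clone(&self.sent))))
    }
}

/// Hypothetical device without an announcer; it relies on the default.
pub struct FakeBlock;

impl Device for FakeBlock {}

/// Keeps only the devices that actually provide an announcer.
pub fn collect_announcers(
    devices: &mut [Box<dyn Device>],
) -> Vec<Box<dyn PostMigrationAnnouncer>> {
    devices
        .iter_mut()
        .filter_map(|dev| dev.post_migration_announcer())
        .collect()
}
```

With one `FakeNic` and one `FakeBlock`, only the NIC contributes an announcer, so a single `announce_once` round increments the counter once.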
2 changes: 1 addition & 1 deletion virtio-devices/src/lib.rs
@@ -43,7 +43,7 @@ pub use self::block::{Block, BlockState};
pub use self::console::{Console, ConsoleResizer, Endpoint};
pub use self::device::{
DmaRemapping, VirtioCommon, VirtioDevice, VirtioInterrupt, VirtioInterruptType,
-VirtioSharedMemoryList,
+VirtioSharedMemoryList, PostMigrationAnnouncer
};
pub use self::epoll_helper::{
EPOLL_HELPER_EVENT_LAST, EpollHelper, EpollHelperError, EpollHelperHandler,
74 changes: 72 additions & 2 deletions virtio-devices/src/net.rs
@@ -20,8 +20,9 @@ use log::{debug, error, info, trace};
#[cfg(not(fuzzing))]
use net_util::virtio_features_to_tap_offload;
use net_util::{
-CtrlQueue, MacAddr, NetCounters, NetQueuePair, OpenTapError, RxVirtio, Tap, TapError, TxVirtio,
-VirtioNetConfig, build_net_config_space, build_net_config_space_with_mq, open_tap,
+CtrlQueue, MAC_ADDR_LEN, MacAddr, NetCounters, NetQueuePair, OpenTapError, RxVirtio, Tap,
+TapError, TxVirtio, VirtioNetConfig, build_net_config_space, build_net_config_space_with_mq,
+open_tap, vnet_hdr_len,
};
use seccompiler::SeccompAction;
use serde::{Deserialize, Serialize};
@@ -40,6 +41,7 @@ use super::{
EpollHelperHandler, Error as DeviceError, RateLimiterConfig, VirtioCommon, VirtioDevice,
VirtioDeviceType, VirtioInterruptType,
};
use crate::device::PostMigrationAnnouncer;
use crate::seccomp_filters::Thread;
use crate::thread_helper::spawn_virtio_thread;
use crate::{GuestMemoryMmap, VirtioInterrupt};
@@ -655,6 +657,38 @@ impl Net {
pub fn wait_for_epoll_threads(&mut self) {
self.common.wait_for_epoll_threads();
}

fn build_rarp_announce(&self) -> [u8; 60] {
const ETH_P_RARP: u16 = 0x8035; // Ethertype RARP
const ARP_HTYPE_ETH: u16 = 0x1; // Hardware type Ethernet
const ARP_PTYPE_IP: u16 = 0x0800; // Protocol type IPv4
Member

Do we need to do the same for IPv6's equivalent of ARP?

Author

As far as I know, this works for both IPv4 and IPv6. We never put any IP addresses into the packet, only MAC addresses.

Member

@phip1611 Jan 19, 2026

But IPv6 doesn't use ARP, AFAICT. IPv6 uses the Neighbor Discovery Protocol (NDP).

I am not 100% sure about that, but we should clarify this and make sure your PR achieves the desired network discovery speedup when guest VMs use only IPv6!

Author

Also, the IPv6 message would need the link's IPv6 address, and we have no way of knowing that address. Besides, this is mainly for the switches in the network, which only need the MAC addresses.

Member

@phip1611 Jan 19, 2026

My thought is: when the VM has an IPv6 address and someone wants to connect to the VM right after migration, some network node will use NDP (not ARP!) to ask "who has $myipv6". Therefore, we also need to populate the NDP caches in all switches and not just the ARP caches, correct? (Again, I am not 100% sure, but it seems very plausible to me!)

I think: after a live migration, switches and routers may still have the VM’s old MAC-to-port mappings, causing traffic drops. For IPv4, sending a "hello" ARP updates the ARP caches in the network. For IPv6, you similarly need to send NDP messages so that the neighbor caches in all relevant nodes and switches learn the VM's new location.

Author

For Layer 2 switch learning, RARP is enough, because the switches only care about MAC addresses.

For Layer 3 neighbor learning, we would always need an IP address. We do not have one, so we can only fix the Layer 2 switches.

This comment was marked as outdated.

Member

You convinced me :) Seems about right.
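
To illustrate the Layer 2 argument, here is a rough, self-contained sketch. `rarp_announce` copies the field layout of `build_rarp_announce` from this PR into a free function, and `ToySwitch` is a hypothetical, heavily simplified learning switch (real switches also age out entries, handle VLANs, and so on). The point is only that the switch reads the source MAC from the Ethernet header and never looks at the zeroed protocol addresses.

```rust
use std::collections::HashMap;

pub const MAC_ADDR_LEN: usize = 6;
pub type Mac = [u8; MAC_ADDR_LEN];

/// Builds a 60-byte broadcast RARP frame with the same layout as
/// `build_rarp_announce` in the diff.
pub fn rarp_announce(mac: Mac) -> [u8; 60] {
    let mut buf = [0u8; 60];
    // Ethernet header: broadcast destination, our MAC, Ethertype RARP.
    buf[0..6].copy_from_slice(&[0xff; MAC_ADDR_LEN]);
    buf[6..12].copy_from_slice(&mac);
    buf[12..14].copy_from_slice(&0x8035u16.to_be_bytes());
    // ARP header: Ethernet / IPv4, address lengths 6 and 4, opcode 3.
    buf[14..16].copy_from_slice(&0x0001u16.to_be_bytes());
    buf[16..18].copy_from_slice(&0x0800u16.to_be_bytes());
    buf[18] = MAC_ADDR_LEN as u8;
    buf[19] = 4;
    buf[20..22].copy_from_slice(&0x0003u16.to_be_bytes());
    // Sender and target hardware address are both our MAC. The protocol
    // addresses at 28..32 and 38..42 stay zero.
    buf[22..28].copy_from_slice(&mac);
    buf[32..38].copy_from_slice(&mac);
    buf
}

/// Hypothetical toy switch: remembers on which port a source MAC was seen.
#[derive(Default)]
pub struct ToySwitch {
    table: HashMap<Mac, u16>,
}

impl ToySwitch {
    /// Learns (or relearns) the port for the frame's source MAC.
    pub fn ingress(&mut self, port: u16, frame: &[u8]) {
        if let Some(src) = frame.get(6..12) {
            let mut mac = [0u8; MAC_ADDR_LEN];
            mac.copy_from_slice(src);
            self.table.insert(mac, port);
        }
    }

    pub fn port_of(&self, mac: &Mac) -> Option<u16> {
        self.table.get(mac).copied()
    }
}
```

Feeding the same frame in on a different port models the migration: the table entry for the MAC moves to the new port, regardless of whether the guest speaks IPv4 or IPv6.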

const ARP_OP_REQUEST_REV: u16 = 0x0003; // RARP Request opcode

const IPV4_ADDR_LENGTH: usize = 4; // Size of an IPv4 address

let mut buf = [0u8; 60];

// Ethernet header
buf[0..6].copy_from_slice(&[0xff; MAC_ADDR_LEN]); // This is a broadcast
buf[6..12].copy_from_slice(&self.config.mac); // Src is this NIC
buf[12..14].copy_from_slice(&ETH_P_RARP.to_be_bytes()); // This is a RARP packet

// ARP Header
buf[14..16].copy_from_slice(&ARP_HTYPE_ETH.to_be_bytes());
buf[16..18].copy_from_slice(&ARP_PTYPE_IP.to_be_bytes());
buf[18] = MAC_ADDR_LEN as u8; // Hardware address length (ethernet)
buf[19] = IPV4_ADDR_LENGTH as u8; // Protocol address length (IPv4)
// This is a "fake RARP" packet, we don't want to perform a real RARP lookup.
// Thus the content of the next fields is largely irrelevant. Setting sha = tha
// is fine according to RFC 903.
buf[20..22].copy_from_slice(&ARP_OP_REQUEST_REV.to_be_bytes());
buf[22..28].copy_from_slice(&self.config.mac); // Source hardware address
buf[28..32].copy_from_slice(&[0x00; IPV4_ADDR_LENGTH]); // Source protocol address
buf[32..38].copy_from_slice(&self.config.mac); // Target hardware address
buf[38..42].copy_from_slice(&[0x00; IPV4_ADDR_LENGTH]); // Target protocol address

buf
}
}

impl Drop for Net {
@@ -870,6 +904,13 @@ impl VirtioDevice for Net {
fn set_access_platform(&mut self, access_platform: Arc<dyn AccessPlatform>) {
self.common.set_access_platform(access_platform);
}

fn post_migration_announcer(&mut self) -> std::option::Option<Box<dyn PostMigrationAnnouncer>> {
Some(Box::new(TapRarpAnnouncer::new(
self.build_rarp_announce(),
self.taps.clone(),
)))
}
}

impl Pausable for Net {
@@ -898,3 +939,32 @@ impl Snapshottable for Net {
}
impl Transportable for Net {}
impl Migratable for Net {}

pub struct TapRarpAnnouncer {
announce: [u8; 60],
taps: Vec<Tap>,
}

impl TapRarpAnnouncer {
pub fn new(announce: [u8; 60], taps: Vec<Tap>) -> Self {
Self { announce, taps }
}
}

impl PostMigrationAnnouncer for TapRarpAnnouncer {
fn announce_once(&self) {
// We have to add a virtio-net header to the announce.
let mut buf = vec![0u8; vnet_hdr_len() + self.announce.len()];
buf[vnet_hdr_len()..].copy_from_slice(&self.announce);

for tap in &self.taps {
let _ = unsafe {
libc::write(
tap.as_raw_fd(),
buf.as_ptr() as *const libc::c_void,
buf.len(),
)
};
Comment on lines +961 to +967

Given the function signature it seems like there is in principle nothing stopping this from being called in parallel from multiple threads. Is that safe?

Member

It is missing a `// SAFETY` comment (see `cargo clippy`).

It is safe, but we need a nice, compact comment explaining why :)

Good point, Oliver!
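
On the parallel-call question, a rough std-only sketch may help the discussion. `DatagramAnnouncer` is a hypothetical name, and `UnixDatagram` sockets stand in for the tap file descriptors purely so the example runs without a tap device. Like the `libc::write` call in the diff, `send` takes `&self` and hands one complete packet to the kernel per call, so the announcer holds no mutable state that two threads could race on. The sketch also assembles the header-plus-frame buffer once in the constructor instead of on every call, which is a design variation and not what the PR does.

```rust
use std::os::unix::net::UnixDatagram;

/// Hypothetical stand-in for `TapRarpAnnouncer`: datagram sockets replace taps.
pub struct DatagramAnnouncer {
    // Zeroed header followed by the frame, assembled once.
    packet: Vec<u8>,
    sinks: Vec<UnixDatagram>,
}

impl DatagramAnnouncer {
    /// `hdr_len` plays the role of `vnet_hdr_len()`; the header stays all zero.
    pub fn new(frame: &[u8], hdr_len: usize, sinks: Vec<UnixDatagram>) -> Self {
        let mut packet = vec![0u8; hdr_len + frame.len()];
        packet[hdr_len..].copy_from_slice(frame);
        Self { packet, sinks }
    }

    /// Best effort: send errors are ignored. Returns how many sinks
    /// accepted the packet.
    pub fn announce_once(&self) -> usize {
        self.sinks
            .iter()
            .filter(|sink| sink.send(&self.packet).is_ok())
            .count()
    }
}
```

This says nothing definitive about tap devices. For the real code, the `// SAFETY` comment would presumably need to state that the fd stays open for as long as `tap` is borrowed and that `buf` is valid for `buf.len()` bytes during the call.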

}
}
}
48 changes: 45 additions & 3 deletions vmm/src/device_manager.rs
@@ -16,10 +16,11 @@ use std::num::Wrapping;
use std::os::unix::fs::OpenOptionsExt;
use std::os::unix::io::{AsRawFd, FromRawFd};
use std::path::{Path, PathBuf};
-use std::result;
use std::sync::{Arc, Mutex};
+use std::time::Duration;
#[cfg(not(target_arch = "riscv64"))]
use std::time::Instant;
+use std::{result, thread};

use acpi_tables::sdt::GenericAddress;
use acpi_tables::{Aml, aml};
@@ -90,8 +91,8 @@ use vfio_ioctls::{VfioContainer, VfioDevice, VfioDeviceFd};
use virtio_devices::transport::{VirtioPciDevice, VirtioPciDeviceActivator, VirtioTransport};
use virtio_devices::vhost_user::VhostUserConfig;
use virtio_devices::{
-AccessPlatformMapping, ActivateError, Block, Endpoint, IommuMapping, VdpaDmaMapping,
-VirtioMemMappingSource,
+AccessPlatformMapping, ActivateError, Block, Endpoint, IommuMapping, PostMigrationAnnouncer,
+VdpaDmaMapping, VirtioMemMappingSource,
};
use vm_allocator::{AddressAllocator, SystemAllocator};
use vm_device::dma_mapping::ExternalDmaMapping;
@@ -5063,6 +5064,47 @@ impl DeviceManager {
self.vfio_container = None;
}
}

// Calls the PostMigrationAnnouncers of each device that has one.
pub fn post_migration_announce(&self) {
let announcers: Vec<Box<dyn PostMigrationAnnouncer>> = self
.virtio_devices
.iter()
.map(|dev| dev.virtio_device.lock().unwrap().post_migration_announcer())
.flatten()
.collect();

announcers.iter().for_each(|a| a.announce_once());
schedule_post_migration_announces(announcers, 4, 50, 100, 450);
}
}

// We could make this announcer configurable.
fn schedule_post_migration_announces(
announcers: Vec<Box<dyn PostMigrationAnnouncer>>,
rounds: u32,
initial_ms: u64,
step_ms: u64,
max_ms: u64,
) {
if announcers.is_empty() || rounds == 0 {
return;
}

let _ = thread::Builder::new()
.name("post-migration-announcers".to_string())
.spawn(move || {
for round in 0..rounds {
// The first announce is done synchronously by the caller, thus we sleep
// at the start of the loop.

let delay = (initial_ms + (round as u64) * step_ms).min(max_ms);
let delay = Duration::from_millis(delay);
thread::sleep(delay);

announcers.iter().for_each(|a| a.announce_once());
}
});
}
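
To make the resulting timing easier to review, the delay computation can be pulled out into a pure function. This is only a sketch: `announce_delays` is a hypothetical helper that is not part of the PR, and it uses saturating arithmetic where the diff uses plain `+` and `*`.

```rust
use std::time::Duration;

/// Returns the sleep before each background round:
/// min(initial_ms + round * step_ms, max_ms).
pub fn announce_delays(rounds: u32, initial_ms: u64, step_ms: u64, max_ms: u64) -> Vec<Duration> {
    (0..rounds)
        .map(|round| {
            let ms = initial_ms
                .saturating_add(u64::from(round).saturating_mul(step_ms))
                .min(max_ms);
            Duration::from_millis(ms)
        })
        .collect()
}
```

With the arguments used in `post_migration_announce` (4, 50, 100, 450), the sleeps are 50, 150, 250, and 350 ms. The 450 ms cap is therefore never reached, and the last of the five announces (one synchronous, four scheduled) goes out roughly 800 ms after the first.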

#[cfg(feature = "ivshmem")]
1 change: 1 addition & 0 deletions vmm/src/lib.rs
@@ -1847,6 +1847,7 @@ impl Vmm {
// The unwrap is safe, because the state machine makes sure we called
// vm_receive_state before, which creates the VM.
let vm = self.vm.vm_mut().unwrap();
vm.announce();
Member

I am very surprised by this full-blown internal abstraction. I think this can be built on the existing resume() mechanism that Cloud Hypervisor already has?

Before I review this further, we should settle this fundamental design question.

Author

We can put this somewhere else. I can also put it into the resume function.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't resume also used for when you just pause the VM which does not necessarily have to entail a live migration?

Member

It is, but I don't see how these packets could be harmful or why we need an extra condition for whether to send them. Also, the traffic is negligible, right?

Author

Yeah, there are only a few packets per NIC, so I don't think we cause any harm by adding it to resume.

@olivereanderson Jan 19, 2026

Yes, I am not against using resume here. I would much rather do that directly in the resume method of the implementers than introduce one more abstraction, but it is also not ideal. I think this serves as one more data point backing the popular claim that every abstraction is a leaky abstraction 😅

Author

@olivereanderson, you mean adding code to the resume method of the virtio device without touching the device manager? I also thought about that, but then we would have to start a thread per virtio device, and that seemed like overkill to me.
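
For reference, the single-thread variant discussed here can be sketched in isolation. `CountingAnnouncer` and `spawn_shared_announcer` are hypothetical names, and the delays are passed in explicitly so the example stays small. Unlike `schedule_post_migration_announces` in the diff, which discards the result of `spawn`, this sketch returns the `JoinHandle` so a caller could observe spawn errors or wait for completion.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

pub trait PostMigrationAnnouncer: Send + Sync {
    fn announce_once(&self);
}

/// Hypothetical announcer that only counts how often it was invoked.
pub struct CountingAnnouncer(pub Arc<AtomicUsize>);

impl PostMigrationAnnouncer for CountingAnnouncer {
    fn announce_once(&self) {
        self.0.fetch_add(1, Ordering::SeqCst);
    }
}

/// Runs every announcer once per delay on a single background thread,
/// no matter how many devices contributed an announcer.
pub fn spawn_shared_announcer(
    announcers: Vec<Box<dyn PostMigrationAnnouncer>>,
    delays: Vec<Duration>,
) -> std::io::Result<JoinHandle<()>> {
    thread::Builder::new()
        .name("post-migration-announcers".to_string())
        .spawn(move || {
            for delay in delays {
                thread::sleep(delay);
                announcers.iter().for_each(|a| a.announce_once());
            }
        })
}
```

Two announcers and three rounds produce six calls in total while only one extra thread exists, which is the trade-off against starting one thread per virtio device.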

vm.resume()?;
Ok(Completed)
}
12 changes: 12 additions & 0 deletions vmm/src/vm.rs
@@ -2757,6 +2757,18 @@ impl Vm {
.nmi()
.map_err(|_| Error::ErrorNmi);
}

/// Announces this VM's location to the network. This is done by sending
/// reverse ARP packets for every NIC.
///
/// Should be done during a live migration just before the VM resumes its
/// execution. That way the network learns the new location of the VM.
pub fn announce(&self) {
self.device_manager
.lock()
.unwrap()
.post_migration_announce()
}
}

impl Pausable for Vm {